Easy way to parse a URL in cross-platform C++?

Question

I need to parse a URL to get the protocol, host, path, and query in an application I am writing in C++. The application is intended to be cross-platform. I'm surprised I can't find anything that does this in the Boost or POCO libraries. Is it somewhere obvious I'm not looking? Any suggestions on appropriate open source libraries? Or is this something I just have to do myself? It's not very complicated, but it seems such a common task that I'm surprised there isn't a common solution.

Dean Michael · Accepted Answer

There is a library that has been proposed for inclusion in Boost and lets you parse HTTP URIs easily. It uses Boost.Spirit and is also released under the Boost Software License. The library is cpp-netlib; its documentation is at http://cpp-netlib.github.com/ and you can download the latest release from http://github.com/cpp-netlib/cpp-netlib/downloads .

The relevant type you'd want to use is boost::network::http::uri and is documented here.

wilhelmtell · Answer

Terribly sorry, I couldn't help it. :s

url.hh

    #ifndef URL_HH_
    #define URL_HH_
    #include <string>

    struct url {
        url(const std::string& url_s); // omitted copy, ==, accessors, ...
    private:
        void parse(const std::string& url_s);
    private:
        std::string protocol_, Host_, path_, query_;
    };
    #endif /* URL_HH_ */

url.cc

    #include "url.hh"
    #include <string>
    #include <algorithm>
    #include <cctype>
    #include <iterator>
    using namespace std;

    // ctors, copy, equality, ...

    // Replaces ptr_fun<int,int>(tolower): ptr_fun was removed in C++17.
    static char lower(char c)
    {
        return static_cast<char>(tolower(static_cast<unsigned char>(c)));
    }

    void url::parse(const string& url_s)
    {
        const string prot_end("://");
        string::const_iterator prot_i = search(url_s.begin(), url_s.end(),
                                               prot_end.begin(), prot_end.end());
        protocol_.reserve(distance(url_s.begin(), prot_i));
        transform(url_s.begin(), prot_i,
                  back_inserter(protocol_),
                  lower); // protocol is icase
        if( prot_i == url_s.end() )
            return;
        advance(prot_i, prot_end.length());
        string::const_iterator path_i = find(prot_i, url_s.end(), '/');
        Host_.reserve(distance(prot_i, path_i));
        transform(prot_i, path_i,
                  back_inserter(Host_),
                  lower); // Host is icase
        string::const_iterator query_i = find(path_i, url_s.end(), '?');
        path_.assign(path_i, query_i);
        if( query_i != url_s.end() )
            ++query_i;
        query_.assign(query_i, url_s.end());
    }

main.cc

    // ...
    url u("HTTP://stackoverflow.com/questions/2616011/parse-a.py?url=1");
    cout << u.protocol() << '\t' << u.Host() << ...
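Since the answer above leaves out the constructors and accessors, here is a minimal, self-contained sketch of the same splitting logic that can be compiled and tried as-is. It uses std::string::find instead of the iterator algorithms, and the names ParsedUrl, lower_ascii and parse_url are made up for this illustration; they are not part of the original code.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical result type for this sketch; the names are not from the answer above.
struct ParsedUrl {
    std::string protocol, host, path, query;
};

// Lower-case an ASCII string; scheme and host compare case-insensitively.
inline std::string lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

inline ParsedUrl parse_url(const std::string& url) {
    ParsedUrl out;
    const std::string sep = "://";
    const std::size_t prot_end = url.find(sep);
    if (prot_end == std::string::npos) {
        // Same behaviour as the code above: without "://" the whole input
        // ends up in the protocol field.
        out.protocol = lower_ascii(url);
        return out;
    }
    out.protocol = lower_ascii(url.substr(0, prot_end));

    // Host runs from after "://" to the first '/' (or to the end).
    const std::size_t host_begin = prot_end + sep.size();
    const std::size_t path_begin = std::min(url.find('/', host_begin), url.size());
    out.host = lower_ascii(url.substr(host_begin, path_begin - host_begin));

    // Path runs to the first '?'; the query is everything after it.
    const std::size_t query_begin = std::min(url.find('?', path_begin), url.size());
    out.path = url.substr(path_begin, query_begin - path_begin);
    if (query_begin < url.size()) {
        out.query = url.substr(query_begin + 1);  // skip the '?'
    }
    return out;
}
```

Like the original, this does not split off a port, user info or fragment, and it does no validation.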

Tom · Answer

A wstring version of the above, with other fields I needed added. It could definitely be refined, but it is good enough for my purposes.

    #include <string>
    #include <algorithm>    // find

    struct Uri
    {
    public:
        std::wstring QueryString, Path, Protocol, Host, Port;

        static Uri Parse(const std::wstring &uri)
        {
            Uri result;

            typedef std::wstring::const_iterator iterator_t;

            if (uri.length() == 0)
                return result;

            iterator_t uriEnd = uri.end();

            // get query start
            iterator_t queryStart = std::find(uri.begin(), uriEnd, L'?');

            // protocol
            iterator_t protocolStart = uri.begin();
            iterator_t protocolEnd = std::find(protocolStart, uriEnd, L':');    //"://");

            if (protocolEnd != uriEnd)
            {
                std::wstring prot = &*(protocolEnd);
                if ((prot.length() > 3) && (prot.substr(0, 3) == L"://"))
                {
                    result.Protocol = std::wstring(protocolStart, protocolEnd);
                    protocolEnd += 3;   // ://
                }
                else
                    protocolEnd = uri.begin();  // no protocol
            }
            else
                protocolEnd = uri.begin();  // no protocol

            // Host
            iterator_t hostStart = protocolEnd;
            iterator_t pathStart = std::find(hostStart, uriEnd, L'/');  // get pathStart

            iterator_t hostEnd = std::find(protocolEnd,
                (pathStart != uriEnd) ? pathStart : queryStart,
                L':');  // check for port

            result.Host = std::wstring(hostStart, hostEnd);

            // port
            if ((hostEnd != uriEnd) && ((&*(hostEnd))[0] == L':'))  // we have a port
            {
                hostEnd++;
                iterator_t portEnd = (pathStart != uriEnd) ? pathStart : queryStart;
                result.Port = std::wstring(hostEnd, portEnd);
            }

            // path
            if (pathStart != uriEnd)
                result.Path = std::wstring(pathStart, queryStart);

            // query
            if (queryStart != uriEnd)
                result.QueryString = std::wstring(queryStart, uri.end());

            return result;

        }   // Parse
    };  // uri

Tests/Usage

    Uri u0 = Uri::Parse(L"http://localhost:80/foo.html?&q=1:2:3");
    Uri u1 = Uri::Parse(L"https://localhost:80/foo.html?&q=1");
    Uri u2 = Uri::Parse(L"localhost/foo");
    Uri u3 = Uri::Parse(L"https://localhost/foo");
    Uri u4 = Uri::Parse(L"localhost:8080");
    Uri u5 = Uri::Parse(L"localhost?&foo=1");
    Uri u6 = Uri::Parse(L"localhost?&foo=1:2:3");

    u0.QueryString, u0.Path, u0.Protocol, u0.Host, u0.Port....
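The test inputs above come without expected values. As a rough illustration of the host/port step on its own, here is a small sketch on narrow strings. The helper name split_host_port is hypothetical, and the handling of bracketed IPv6 literals is an addition that the answer above does not attempt.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split "host[:port]" into {host, port}.
// A bracketed IPv6 literal such as "[::1]:8080" keeps its brackets in the
// host part, and only a ':' after the closing ']' starts the port.
inline std::pair<std::string, std::string>
split_host_port(const std::string& authority) {
    std::size_t search_from = 0;
    if (!authority.empty() && authority[0] == '[') {
        const std::size_t close = authority.find(']');
        if (close == std::string::npos) {
            return std::make_pair(authority, std::string());  // malformed, no port
        }
        search_from = close;
    }
    const std::size_t colon = authority.find(':', search_from);
    if (colon == std::string::npos) {
        return std::make_pair(authority, std::string());      // no port given
    }
    return std::make_pair(authority.substr(0, colon), authority.substr(colon + 1));
}
```

The port is returned as text and is not checked for being numeric or in range.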

Elliot Cameron · Answer

For completeness, there is one written in C that you could use (with a little wrapping, no doubt): http://uriparser.sourceforge.net/

[RFC-compliant and supports Unicode]

Here's a very basic wrapper that I've been using simply to grab the results of a parse.

    #include <string>
    #include <uriparser/Uri.h>

    namespace uriparser
    {
        class Uri //: boost::noncopyable
        {
        public:
            Uri(std::string uri)
                : uri_(uri)
            {
                UriParserStateA state_;
                state_.uri = &uriParse_;
                isValid_   = uriParseUriA(&state_, uri_.c_str()) == URI_SUCCESS;
            }

            ~Uri() { uriFreeUriMembersA(&uriParse_); }

            bool isValid() const { return isValid_; }

            std::string scheme()   const { return fromRange(uriParse_.scheme); }
            std::string Host()     const { return fromRange(uriParse_.hostText); }
            std::string port()     const { return fromRange(uriParse_.portText); }
            std::string path()     const { return fromList(uriParse_.pathHead, "/"); }
            std::string query()    const { return fromRange(uriParse_.query); }
            std::string fragment() const { return fromRange(uriParse_.fragment); }

        private:
            std::string uri_;
            UriUriA     uriParse_;
            bool        isValid_;

            std::string fromRange(const UriTextRangeA & rng) const
            {
                return std::string(rng.first, rng.afterLast);
            }

            std::string fromList(UriPathSegmentA * xs, const std::string & delim) const
            {
                UriPathSegmentStructA * head(xs);
                std::string accum;

                while (head)
                {
                    accum += delim + fromRange(head->text);
                    head = head->next;
                }

                return accum;
            }
        };
    }

Michael Mc Donnell · Answer

POCO's URI class can parse URLs for you. The following example is a shortened version of the one in the POCO URI and UUID slides:

    #include "Poco/URI.h"
    #include <iostream>

    int main(int argc, char** argv)
    {
        Poco::URI uri1("http://www.appinf.com:88/sample?example-query#frag");

        std::string scheme(uri1.getScheme());   // "http"
        std::string auth(uri1.getAuthority());  // "www.appinf.com:88"
        std::string Host(uri1.getHost());       // "www.appinf.com"
        unsigned short port = uri1.getPort();   // 88
        std::string path(uri1.getPath());       // "/sample"
        std::string query(uri1.getQuery());     // "example-query"
        std::string frag(uri1.getFragment());   // "frag"
        std::string pathEtc(uri1.getPathEtc()); // "/sample?example-query#frag"

        return 0;
    }

Tom Makin · Answer

The Poco library now has a class for dissecting URIs and returning the host, path segments, query string, and so on.

https://pocoproject.org/pro/docs/Poco.URI.html

velcrow · Answer

    //sudo apt-get install libboost-all-dev; #install boost
    //g++ urlregex.cpp -lboost_regex; #compile
    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    using namespace std;

    int main(int argc, char* argv[])
    {
        string url="https://www.google.com:443/webhp?gws_rd=ssl#q=cpp";
        // "\\x3f" reaches the regex engine as \x3f, i.e. a literal '?'
        boost::regex ex("(http|https)://([^/ :]+):?([^/ ]*)(/?[^ #?]*)\\x3f?([^ #]*)#?([^ ]*)");
        boost::cmatch what;
        if(regex_match(url.c_str(), what, ex))
        {
            cout << "protocol: " << string(what[1].first, what[1].second) << endl;
            cout << "domain:   " << string(what[2].first, what[2].second) << endl;
            cout << "port:     " << string(what[3].first, what[3].second) << endl;
            cout << "path:     " << string(what[4].first, what[4].second) << endl;
            cout << "query:    " << string(what[5].first, what[5].second) << endl;
            cout << "fragment: " << string(what[6].first, what[6].second) << endl;
        }
        return 0;
    }

Sun · Answer

Facebook's Folly library can do the job for you easily. Simply use the Uri class:

    #include <folly/Uri.h>

    int main() {
        folly::Uri folly("https://code.facebook.com/posts/177011135812493/");

        folly.scheme(); // https
        folly.host();   // code.facebook.com
        folly.path();   // posts/177011135812493/
    }

Ralf · Answer

http://code.google.com/p/uri-grammar/ which, like Dean Michael's netlib, uses Boost.Spirit to parse a URI. I came across it at "Simple expression parser example using Boost::Spirit?"

Sergey K. · Answer

This library is very small and lightweight: https://github.com/corporateshark/LUrlParser

However, it only does parsing; there is no URL normalization or validation.
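To make the distinction concrete: a parser only splits the string, while normalization rewrites the parts into a canonical form afterwards. Below is a minimal sketch of a few common normalization steps (lower-casing scheme and host, dropping a default port, replacing an empty path with "/"). The UrlParts struct and the function names are hypothetical and have nothing to do with LUrlParser's API.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical, already-parsed components; field names are illustrative only.
struct UrlParts {
    std::string scheme, host, port, path;
};

inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Apply a few normalization steps in place:
//  - scheme and host are case-insensitive, so lower-case them
//  - drop the port when it is the default for the scheme (http: 80, https: 443)
//  - treat an empty path as "/"
inline void normalize(UrlParts& u) {
    u.scheme = to_lower_ascii(u.scheme);
    u.host = to_lower_ascii(u.host);
    if ((u.scheme == "http" && u.port == "80") ||
        (u.scheme == "https" && u.port == "443")) {
        u.port.clear();
    }
    if (u.path.empty()) {
        u.path = "/";
    }
}
```

Percent-encoding normalization and removal of "." and ".." path segments are left out here; a full treatment would cover those as well.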

Vivit · Answer

You can try the open-source library called C++ REST SDK (created by Microsoft, distributed under the Apache License 2.0). It can be built for several platforms, including Windows, Linux, OSX, iOS and Android. There is a class called web::uri into which you put a string and from which you can retrieve the individual URL components. Here is a code sample (tested on Windows):

    #include <cpprest/base_uri.h>
    #include <iostream>
    #include <ostream>

    int main()
    {
        web::uri sample_uri( L"http://dummyuser@localhost:7777/dummypath?dummyquery#dummyfragment" );
        std::wcout << L"scheme: "   << sample_uri.scheme()    << std::endl;
        std::wcout << L"user: "     << sample_uri.user_info() << std::endl;
        std::wcout << L"Host: "     << sample_uri.host()      << std::endl;
        std::wcout << L"port: "     << sample_uri.port()      << std::endl;
        std::wcout << L"path: "     << sample_uri.path()      << std::endl;
        std::wcout << L"query: "    << sample_uri.query()     << std::endl;
        std::wcout << L"fragment: " << sample_uri.fragment()  << std::endl;
    }

The output will be:

    scheme: http
    user: dummyuser
    Host: localhost
    port: 7777
    path: /dummypath
    query: dummyquery
    fragment: dummyfragment

There are also other easy-to-use methods, for example to access individual attribute/value pairs from the query, to split the path into components, and so on.
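To show what splitting a query into attribute/value pairs and a path into components amounts to, here is a standard-library-only sketch. It is not the SDK's implementation; the function names query_pairs and path_segments are hypothetical, and no percent-decoding is done.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split "a=1&b=2" into {a: "1", b: "2"}. A key without '=' maps to "".
// No percent-decoding is performed; a repeated key keeps the last value.
inline std::map<std::string, std::string> query_pairs(const std::string& query) {
    std::map<std::string, std::string> pairs;
    std::istringstream in(query);
    std::string item;
    while (std::getline(in, item, '&')) {
        if (item.empty()) continue;
        const std::size_t eq = item.find('=');
        if (eq == std::string::npos) {
            pairs[item] = "";
        } else {
            pairs[item.substr(0, eq)] = item.substr(eq + 1);
        }
    }
    return pairs;
}

// Split "/a/b/c" into {"a", "b", "c"}, skipping empty segments.
inline std::vector<std::string> path_segments(const std::string& path) {
    std::vector<std::string> segments;
    std::istringstream in(path);
    std::string segment;
    while (std::getline(in, segment, '/')) {
        if (!segment.empty()) segments.push_back(segment);
    }
    return segments;
}
```

A std::map drops duplicate keys, so a std::multimap or a vector of pairs would be the better choice if repeated parameters matter.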

Mike Ellery · Answer

There is the recently released google-url library:

http://code.google.com/p/google-url/

The library provides a low-level URL parsing API as well as a higher-level abstraction called GURL. Here's an example using it:

    #include <googleurl\src\gurl.h>

    wchar_t url[] = L"http://www.facebook.com";
    GURL parsedUrl (url);
    assert(parsedUrl.DomainIs("facebook.com"));

I have two small complaints about it: (1) it wants to use ICU by default to deal with different string encodings, and (2) it makes some assumptions about logging (but I think these can be disabled). In other words, the library is not completely stand-alone as it exists, but I think it's still a good basis to start from, especially if you are already using ICU.

Mr. Jones · Answer

May I offer another self-contained solution based on std::regex:

    #include <regex>
    #include <string>

    const char* SCHEME_REGEX   = "((http[s]?)://)?";  // match http or https before the ://
    const char* USER_REGEX     = "(([^@/:\\s]+)@)?";  // match anything other than @ / : or whitespace before the ending @
    const char* Host_REGEX     = "([^@/:\\s]+)";      // mandatory. match anything other than @ / : or whitespace
    const char* PORT_REGEX     = "(:([0-9]{1,5}))?";  // after the : match 1 to 5 digits
    const char* PATH_REGEX     = "(/[^:#?\\s]*)?";    // after the / match anything other than : # ? or whitespace
    const char* QUERY_REGEX    = "(\\?(([^?;&#=]+=[^?;&#=]+)([;|&]([^?;&#=]+=[^?;&#=]+))*))?"; // after the ? match any number of x=y pairs, separated by & or ;
    const char* FRAGMENT_REGEX = "(#([^#\\s]*))?";    // after the # match anything other than # or whitespace

    // m_scheme, m_user, ... are assumed to be std::string members of the
    // enclosing class (not shown).
    bool parseUri(const std::string &i_uri)
    {
        static const std::regex regExpr(std::string("^")
            + SCHEME_REGEX + USER_REGEX
            + Host_REGEX + PORT_REGEX
            + PATH_REGEX + QUERY_REGEX
            + FRAGMENT_REGEX + "$");

        std::smatch matchResults;
        if (std::regex_match(i_uri.cbegin(), i_uri.cend(), matchResults, regExpr))
        {
            m_scheme.assign(matchResults[2].first, matchResults[2].second);
            m_user.assign(matchResults[4].first, matchResults[4].second);
            m_Host.assign(matchResults[5].first, matchResults[5].second);
            m_port.assign(matchResults[7].first, matchResults[7].second);
            m_path.assign(matchResults[8].first, matchResults[8].second);
            m_query.assign(matchResults[10].first, matchResults[10].second);
            m_fragment.assign(matchResults[15].first, matchResults[15].second);

            return true;
        }

        return false;
    }

I've added explanations for each part of the regular expression. This lets you pick exactly the parts that are relevant for the URLs you expect to parse. Just remember to adjust the regular expression group indices accordingly.
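One way to make the group indices less fragile when parts are added or removed is to use non-capturing groups, (?:...), for everything that is only there for structure. The sketch below keeps just scheme, host and port, so the captures are numbered 1 to 3 regardless of the surrounding punctuation. The HostPort struct and parse_host_port are hypothetical names, and the pattern is a trimmed-down variant, not the answer's full expression.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct HostPort {
    std::string scheme, host, port;
};

inline bool parse_host_port(const std::string& uri, HostPort& out) {
    // (?:...) groups without capturing, so only the three parts we want
    // are numbered: 1 = scheme, 2 = host, 3 = port.
    // Anything from the first '/' onwards is accepted but not captured.
    static const std::regex re(
        R"(^(?:(https?)://)?([^@/:\s]+)(?::([0-9]{1,5}))?(?:/.*)?$)");
    std::smatch m;
    if (!std::regex_match(uri, m, re)) {
        return false;
    }
    out.scheme = m[1].str();  // empty if the optional group did not take part
    out.host = m[2].str();
    out.port = m[3].str();
    return true;
}
```

Dropping or adding an optional part then only changes the index of the captures that come after it, rather than shifting by two each time.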

sdgfsdh · Answer

One small dependency you can use is uriparser, which recently moved to GitHub.

You can find a minimal example in their code: https://github.com/uriparser/uriparser/blob/63384be4fb8197264c55ff53a135110ecd557c4/tool/uriparse.c

This will be more lightweight than Boost or Poco. The only catch is that it is C.

There is also a Buckaroo package:

buckaroo add github.com/buckaroo-pm/uriparser

Matthew Flaschen · Answer

Qt has QUrl for this. GNOME has SoupURI in libsoup, which you'll probably find a little more lightweight.

Fabiano Tarlao · Answer

I've developed an "object-oriented" solution, a single C++ class, that works with one regex like the solutions from @Mr.Jones and @velcrow. My Url class performs the URL/URI parsing.

I think I improved velcrow's regex to be more robust, and it also includes the username part.

Following the first version of my idea, I have released the same code, improved, in my GPL3-licensed open source project, Cpp URL Parser.

The #ifdef/#ifndef bloat is omitted; Url.h follows:

    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    using namespace std;

    class Url {
    public:
        boost::regex ex;
        string rawUrl;

        string username;
        string protocol;
        string domain;
        string port;
        string path;
        string query;
        string fragment;

        Url();

        Url(string &rawUrl);

        Url &update(string &rawUrl);
    };

This is the code of the Url.cpp implementation file:

    #include "Url.h"

    Url::Url() {
        // Backslashes are doubled so that \/ \d and \? reach the regex engine intact.
        this -> ex = boost::regex("(ssh|sftp|ftp|smb|http|https):\\/\\/(?:([^@ ]*)@)?([^:?# ]+)(?::(\\d+))?([^?# ]*)(?:\\?([^# ]*))?(?:#([^ ]*))?");
    }

    Url::Url(string &rawUrl) : Url() {
        this->rawUrl = rawUrl;
        this->update(this->rawUrl);
    }

    Url &Url::update(string &rawUrl) {
        this->rawUrl = rawUrl;
        boost::cmatch what;
        if (regex_match(rawUrl.c_str(), what, ex)) {
            this -> protocol = string(what[1].first, what[1].second);
            this -> username = string(what[2].first, what[2].second);
            this -> domain = string(what[3].first, what[3].second);
            this -> port = string(what[4].first, what[4].second);
            this -> path = string(what[5].first, what[5].second);
            this -> query = string(what[6].first, what[6].second);
            this -> fragment = string(what[7].first, what[7].second);
        }
        return *this;
    }

Usage example:

    string urlString = "http://gino@ciao.it:67/ciao?roba=ciao#34";
    Url *url = new Url(urlString);
    std::cout << " username: " << url->username << " URL domain: " << url->domain;
    std::cout << " port: " << url->port << " protocol: " << url->protocol;

You can also update the Url object to represent (and parse) another URL:

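    // url is a pointer, and update() takes a non-const string&, so pass a named string
    string otherUrl = "http://gino@nuovociao.it:68/nuovociao?roba=ciaoooo#";
    url->update(otherUrl);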
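For readers without Boost, roughly the same pattern can be tried with std::regex and a raw string literal, which lets the regex escapes (\d, \?) be written without extra backslash escaping. This is only a sketch: UrlFields and parse_fields are hypothetical names, and the pattern deviates from the one in Url.cpp in one place, flagged in a comment: '/' is added to the characters excluded from the domain group, so that a URL without a port does not leave its path inside the domain.

```cpp
#include <regex>
#include <string>

// Hypothetical plain struct standing in for the Url class above.
struct UrlFields {
    std::string protocol, username, domain, port, path, query, fragment;
};

inline bool parse_fields(const std::string& raw, UrlFields& out) {
    // Same structure as the Boost pattern above, written as a raw string.
    // Deviation (assumption of this sketch): the domain group is
    // [^/:?# ]+ instead of [^:?# ]+, i.e. it stops at the first '/'.
    static const std::regex re(
        R"((ssh|sftp|ftp|smb|http|https)://(?:([^@ ]*)@)?([^/:?# ]+)(?::(\d+))?([^?# ]*)(?:\?([^# ]*))?(?:#([^ ]*))?)");
    std::smatch m;
    if (!std::regex_match(raw, m, re)) {
        return false;  // unlike update() above, failure is reported to the caller
    }
    out.protocol = m[1].str();
    out.username = m[2].str();
    out.domain = m[3].str();
    out.port = m[4].str();
    out.path = m[5].str();
    out.query = m[6].str();
    out.fragment = m[7].str();
    return true;
}
```

Taking the input by const reference also means a string literal can be passed directly, which the non-const string& parameters above do not allow.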

I'm only just learning C++, so I'm not sure I followed C++ best practices 100%... Any tips are appreciated.

P.S.: have a look at Cpp URL Parser; there are refinements there.

Have fun

Larytet · Answer

There is yet another library, https://snapwebsites.org/project/libtld, which handles all possible top-level domains and URI schemes.