Options for web scraping - C++ version only

Question

I'm looking for a good C++ library for web scraping.
It must be C/C++ and nothing else, so please don't point me to "Options for HTML scraping" or other SO questions/answers where C++ isn't even mentioned.

Kyle Simek · Answer

libcurl to download the HTML file
libtidy to convert it to valid XML
libxml to parse/navigate the XML

Halcyon · Answer

Use the myhtml C/C++ parser here; dead simple, very fast. No dependencies except C99. It also has built-in CSS selectors (example here)

StereoMatching · Answer

I recommend Qt 5.6.2; this powerful library gives us:

High-level, intuitive, asynchronous network API such as QNetworkAccessManager, QNetworkReply, QNetworkProxy, etc.
Powerful regex class such as QRegularExpression
Decent web engine such as QtWebEngine
Robust, mature GUI such as QWidgets
Most Qt5 APIs are well designed; signals and slots also make asynchronous code easier to write
Great Unicode support
Feature-rich file system library. Whether you need to create, remove, or rename files, or find a standard path for saving them, it's a piece of cake in Qt5
The asynchronous API of QNetworkAccessManager makes it easy to issue many download requests at once
Cross-platform across the major desktop platforms (Windows, Mac OS, and Linux): write once, compile anywhere, with a single code base.
Easy to deploy on Windows and Mac (Linux? Maybe linuxdeployqt can save us tons of trouble)
Easy to install on Windows, Mac, and Linux
Etc.
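
To make the regex item in the list above concrete without requiring a Qt install, here is a minimal sketch that uses the standard library's std::regex in place of QRegularExpression. The function name extract_image_urls and the pattern are hypothetical, chosen only for illustration. A regex is a rough heuristic for HTML: it does not understand comments, scripts, or unquoted attributes, so treat this as a starting point rather than a robust parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the quoted src value of every <img> tag.
// Assumption: attributes are quoted with " or '. Unquoted values are skipped.
std::vector<std::string> extract_image_urls(const std::string& html) {
    // <img, any attributes, then src = "value" or 'value' (case-insensitive).
    static const std::regex img_src(
        R"re(<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["'])re",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), img_src), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());  // capture group 1 holds the URL
    }
    return urls;
}
```

Calling extract_image_urls on a downloaded page returns the URLs in document order; the same pattern should carry over to QRegularExpression with little change, though that has not been checked here.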

I have already written an image scraper app with Qt5; it can scrape almost every image found through Google, Bing, and Yahoo searches.

To learn more, please visit my GitHub project. I wrote a high-level overview of how to scrape data with Qt5 on my blog (it is too long to post on Stack Overflow).

DanielB · Answer

// Download WinHttpClient.h first
// --------------------------------
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <boost/algorithm/string.hpp>  // boost::trim
#include <winhttp\WinHttpClient.h>

// Copies the text between tg1 and tg2 (searching from `next`) into `value`.
// Returns true only if both markers were found.
bool substrexvealue(const std::wstring& html, const std::string& tg1,
                    const std::string& tg2, std::string& value,
                    std::size_t& next)
{
    std::wstring wtg1(tg1.begin(), tg1.end());
    std::wstring wtg2(tg2.begin(), tg2.end());

    std::size_t p1 = html.find(wtg1, next);
    if (p1 == std::wstring::npos)
        return false;
    p1 += wtg1.size();

    // Search for the closing marker after the opening one, not from `next`.
    std::size_t p2 = html.find(wtg2, p1);
    if (p2 == std::wstring::npos)
        return false;

    std::wstring wtmp = html.substr(p1, p2 - p1);
    value = std::string(wtmp.begin(), wtmp.end());  // narrowing copy, assumes ASCII text
    boost::trim(value);
    next = p1 + 1;
    return true;
}

// Copies the text that follows the first opening tag starting with `tag`,
// up to the next '<', into `value`.
bool extractvalue(const std::wstring& html, const std::string& tag,
                  std::string& value, std::size_t& next)
{
    std::wstring wtag(tag.begin(), tag.end());

    std::size_t p1 = html.find(wtag, next);
    if (p1 == std::wstring::npos)
        return false;

    std::size_t p2 = html.find(L">", p1 + wtag.size() - 1);
    if (p2 == std::wstring::npos)
        return false;
    std::size_t p3 = html.find(L"<", p2 + 1);
    if (p3 == std::wstring::npos)
        return false;

    std::wstring wtmp = html.substr(p2 + 1, p3 - p2 - 1);
    value = std::string(wtmp.begin(), wtmp.end());  // narrowing copy, assumes ASCII text
    boost::trim(value);
    next = p1 + 1;
    return true;
}

// Downloads `url` and fills in the response header and body.
bool GetHTML(const std::string& url, std::wstring& header, std::wstring& html)
{
    std::wstring wurl(url.begin(), url.end());
    bool ret = false;
    try
    {
        WinHttpClient client(wurl.c_str());

        std::string url_protocol = url.substr(0, 5);
        std::transform(url_protocol.begin(), url_protocol.end(), url_protocol.begin(),
                       [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
        if (url_protocol == "HTTPS")
            client.SetRequireValidSslCertificates(false);  // turns off certificate validation

        client.SetUserAgent(L"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        if (client.SendHttpRequest())
        {
            header = client.GetResponseHeader();
            html = client.GetResponseContent();
            ret = true;
        }
    }
    catch (...)
    {
        header = L"Error";
        html = L"";
    }
    return ret;
}

int main()
{
    std::string url = "http://www.google.fr";
    std::wstring header, html;
    GetHTML(url, header, html);
}
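
The extraction helpers above boil down to "find an opening marker, find the closing marker after it, copy what lies between". As a minimal, portable sketch of that idea using only the standard library (no WinHTTP or Boost), assuming narrow std::string input: the name text_between and its signature are hypothetical, not part of the answer's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the text between the first `open` marker found
// at or after `pos` and the next `close` marker into `value`.
// On success it moves `pos` just past the closing marker and returns true;
// on failure it leaves `value` and `pos` untouched and returns false.
bool text_between(const std::string& html, const std::string& open,
                  const std::string& close, std::string& value,
                  std::size_t& pos) {
    const std::size_t start = html.find(open, pos);
    if (start == std::string::npos) return false;

    const std::size_t first = start + open.size();   // first character of the payload
    const std::size_t last = html.find(close, first); // closing marker after the payload
    if (last == std::string::npos) return false;

    value = html.substr(first, last - first);
    pos = last + close.size();
    return true;
}
```

Calling it in a loop with the same pos walks through every occurrence, for example each <li>...</li> item of a list. Plain string search like this only suits simple, predictable markup; for anything nested, a real HTML parser is the safer choice.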