Extract the `src` attribute from an `img` tag using BeautifulSoup

Question

    <div class="someClass">
        <a href="href">
            <img alt="some" src="some"/>
        </a>
    </div>

I'm using bs4. I can't use a.attrs['src'] to get the src, but I can get the href. What should I do?

Abu Shoeb · Answer

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but the same approach also works for a URL fetched with urllib2 (urllib.request in Python 3).

For URLs

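    from urllib.request import urlopen  # urllib2 in Python 2
    from bs4 import BeautifulSoup as BSHTML  # bs4 replaces the old BeautifulSoup 3 import

    page = urlopen('http://www.youtube.com/')
    soup = BSHTML(page, 'html.parser')
    images = soup.find_all('img')
    for image in images:
        # print the image source (None if the tag has no src)
        print(image.get('src'))
        # print the alternate text (None if the tag has no alt)
        print(image.get('alt'))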

For text containing img tags

    from bs4 import BeautifulSoup as BSHTML

    # both img tags are closed properly so the parser sees two separate tags
    htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" />"""
    soup = BSHTML(htmlText, 'html.parser')
    images = soup.find_all('img')
    for image in images:
        print(image['src'])
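
If some img tags in the input might lack a src attribute, indexing with image['src'] raises a KeyError. Here is a minimal sketch of one way to guard against that, assuming bs4 is installed. The helper name `extract_img_srcs` and the sample markup are hypothetical, not part of BeautifulSoup or of the answer above.

```python
from bs4 import BeautifulSoup


def extract_img_srcs(html_text):
    """Return the src value of every img tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    # src=True matches only img tags that carry a src attribute
    return [img["src"] for img in soup.find_all("img", src=True)]


# Made-up sample: the middle tag has no src and is skipped
sample = '<img src="https://src1.com/" /> <img alt="no source" /> <img src="https://src2.com/" />'
print(extract_img_srcs(sample))
```

Filtering in find_all keeps the loop body free of per-tag checks; calling image.get('src') and skipping None values would work just as well.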

mx0 · Answer

A link has no src attribute; you have to target the actual img tag.

    import bs4

    html = """<div class="someClass">
        <a href="href">
            <img alt="some" src="some"/>
        </a>
    </div>"""

    soup = bs4.BeautifulSoup(html, "html.parser")

    # this returns the src attribute of the img tag that is inside the 'a' tag
    print(soup.a.img['src'])
    # prints: some

    # if you have more than one 'a' tag
    for a in soup.find_all('a'):
        if a.img:
            print(a.img['src'])
    # prints: some
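
To see why a.attrs['src'] fails in the question, it can help to inspect the attribute dictionaries of both tags. Here is a minimal sketch using the question's markup, assuming bs4 is installed. The variable names are illustrative, and the CSS selector at the end is an additional option, not something the answer above uses.

```python
import bs4

html = """<div class="someClass"> <a href="href"> <img alt="some" src="some"/> </a> </div>"""
soup = bs4.BeautifulSoup(html, "html.parser")

a_attrs = soup.a.attrs        # attributes of the 'a' tag only
img_attrs = soup.a.img.attrs  # attributes of the nested img tag
print(a_attrs)    # {'href': 'href'}
print(img_attrs)  # {'alt': 'some', 'src': 'some'}

# .get() returns None instead of raising KeyError for a missing attribute
print(soup.a.get("src"))  # None

# select() takes a CSS selector: img tags with a src attribute inside an 'a' tag
srcs = [img["src"] for img in soup.select("a img[src]")]
print(srcs)  # ['some']
```

The 'a' tag's dictionary holds only href, which is why the lookup for src fails there while the lookup on the img tag succeeds.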