Ma page Web ressemble à ceci:
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">GENDER:</strong> FEMALE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
<strong class="offender">WEIGHT:</strong> 118<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
<strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>
Je veux extraire les informations pour chaque individu et obtenir YOB:1987
, RACE:WHITE
, etc...
Ce que j'ai essayé c'est:
subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')
Mais cela ne me donne que les valeurs de YOB:
, RACE:
, etc...
Existe-t-il un moyen de récupérer les données en YOB:1987
, RACE:WHITE
format?
Parcourez tous les <strong>
tags et utilisation next_sibling
pour obtenir ce que vous voulez. Comme ça:
for strong_tag in soup.find_all('strong'):
print(strong_tag.text, strong_tag.next_sibling)
Démo:
from bs4 import BeautifulSoup
html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''
soup = BeautifulSoup(html)
for strong_tag in soup.find_all('strong'):
print(strong_tag.text, strong_tag.next_sibling)
Cela vous donne:
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
Je pense que vous pouvez l'obtenir en utilisant subc1.text
.
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print soup.text
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
Ou si vous voulez explorer, vous pouvez utiliser .contents
:
>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
[u'\n',
<strong class="offender">YOB:</strong>,
u' 1987',
<br/>,
u'\n',
<strong class="offender">RACE:</strong>,
u' WHITE',
<br/>,
u'\n',
<strong class="offender">GENDER:</strong>,
u' FEMALE',
<br/>,
u'\n',
<strong class="offender">HEIGHT:</strong>,
u" 5'05''",
<br/>,
u'\n',
<strong class="offender">WEIGHT:</strong>,
u' 118',
<br/>,
u'\n',
<strong class="offender">EYE COLOR:</strong>,
u' GREEN',
<br/>,
u'\n',
<strong class="offender">HAIR COLOR:</strong>,
u' BROWN',
<br/>,
u'\n']
et filtrer les éléments nécessaires de la liste:
>>> data = dict(Zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{u'EYE COLOR:': u'GREEN',
u'GENDER:': u'FEMALE',
u'HAIR COLOR:': u'BROWN',
u'HEIGHT:': u"5'05''",
u'RACE:': u'WHITE',
u'WEIGHT:': u'118',
u'YOB:': u'1987'}