How to convert an XML file to a nice pandas DataFrame?

Question

Suppose I have an XML file like this:

    <author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
        <documents count="N">
            <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...] ]]> </document>
        </documents>
    </author>

I would like to read this XML file and convert it to a pandas DataFrame:

    key                                       type  language  feature  web                      data
    e95324a9a6c790ecb95e46cf15bE232ee517651   XXX   EN        xx       www.foo_bar_exmaple.com  A large text with lots of strings and punctuations symbols [...]
    e95324a9a6c790ecb95e46cf15bE232ee517651   XXX   EN        xx       www.foo_bar_exmaple.com  A large text with lots of strings and punctuations symbols [...]
    19e71144c50a8b9160b3cvdf2324f0955e906fce  XXX   EN        xx       www.foo_bar_exmaple.com  A large text with lots of strings and punctuations symbols [...]
    21d4af9021a174f61b8erf284606c74d9e42      XXX   EN        xx       www.foo_bar_exmaple.com  A large text with lots of strings and punctuations symbols [...]
    28a45eb2460823499763d70vdf9ca00ddbb665    XXX   EN        xx       www.foo_bar_exmaple.com  A large text with lots of strings and punctuations symbols [...]

This is what I have tried so far, but I get errors, and there is probably a more efficient way to do this:

    from lxml import objectify
    import pandas as pd

    path = 'file_path'
    xml = objectify.parse(open(path))
    root = xml.getroot()
    root.getchildren()[0].getchildren()
    df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))

    for i in range(0,len(xml)):
        obj = root.getchildren()[i].getchildren()
        row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)

Could anyone suggest a better approach to this problem?

JaminSore · Accepted Answer

You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here is what I would do (when reading from a file, replace xml_data with the name of your file or with a file object):

    import pandas as pd
    import xml.etree.ElementTree as ET
    import io

    def iter_docs(author):
        # Attributes of the <author> element are shared by all its documents
        author_attr = author.attrib
        for doc in author.iter('document'):
            doc_dict = author_attr.copy()
            doc_dict.update(doc.attrib)  # document attributes override author ones (e.g. web)
            doc_dict['data'] = doc.text
            yield doc_dict

    xml_data = io.StringIO(u'''\
    <author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
        <documents count="N">
            <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] ]]> </document>
            <document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...] ]]> </document>
        </documents>
    </author>
    ''')

    etree = ET.parse(xml_data)  # create an ElementTree object
    doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If there are multiple authors in your original document, or if the root of your XML is not an author element, I would add the following generator:

    def iter_author(etree):
        for author in etree.iter('author'):
            for row in iter_docs(author):
                yield row

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))

Take a look at the ElementTree tutorial provided in the xml library documentation.
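To make the multi-author case concrete, here is a minimal self-contained sketch of the two generators working together. The `<authors>` root element, the attribute values, and the `SAMPLE` string are hypothetical and only serve the demonstration; they are not taken from the question.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: two authors under an assumed <authors> root element.
SAMPLE = """\
<authors>
  <author type="A" language="EN">
    <documents count="2">
      <document KEY="k1" web="a.example"><![CDATA[first text]]></document>
      <document KEY="k2" web="a.example"><![CDATA[second text]]></document>
    </documents>
  </author>
  <author type="B" language="FR">
    <documents count="1">
      <document KEY="k3" web="b.example"><![CDATA[third text]]></document>
    </documents>
  </author>
</authors>
"""


def iter_docs(author):
    """Yield one dict per <document>, merged with its author's attributes."""
    for doc in author.iter('document'):
        row = dict(author.attrib)
        row.update(doc.attrib)  # document attributes win on name clashes
        row['data'] = doc.text
        yield row


def iter_author(etree):
    """Yield the rows of every <author> found anywhere in the tree."""
    for author in etree.iter('author'):
        yield from iter_docs(author)


etree = ET.parse(io.StringIO(SAMPLE))
doc_df = pd.DataFrame(list(iter_author(etree)))
print(doc_df)
```

Each row carries the attributes of the author it belongs to, so the resulting frame can be grouped or filtered by `type` or `language` afterwards.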

Jai Prakash · Answer

Here is another way to convert an XML file to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works when reading from a file.

    import pandas as pd
    import xml.etree.ElementTree as ET

    # Triple quotes are needed because the string spans several lines
    xml_str = '''<?xml version="1.0" encoding="utf-8"?>
    <response>
     <head>
     <code>
     200
     </code>
     </head>
     <body>
     <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>
     <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>
     </body>
    </response>'''

    etree = ET.fromstring(xml_str)
    dfcols = ['id', 'name']

    # Collect the rows in a list first: DataFrame.append was removed in pandas 2.0
    rows = [[node.get('id'), node.get('name')] for node in etree.iter(tag='data')]
    df = pd.DataFrame(rows, columns=dfcols)
    df.head()
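The same logic can be pointed at a file on disk. The sketch below is only an illustration: the file name `response.xml`, the temporary directory, and the trimmed XML content are assumptions made for the demo. It collects the rows in a list and builds the DataFrame once at the end.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd

# Trimmed, hypothetical copy of the response document, written to disk for the demo.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<response>
  <body>
    <data id="0" name="All Categories" type="category"/>
    <data id="13" name="RealEstate.com.au [H]" type="publication"/>
  </body>
</response>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "response.xml")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(xml_bytes)
    # ET.parse accepts a file name or an open file object
    root = ET.parse(path).getroot()

dfcols = ['id', 'name']
rows = [[node.get('id'), node.get('name')] for node in root.iter('data')]
df = pd.DataFrame(rows, columns=dfcols)
print(df)
```

Attribute values come back as strings, so a column such as `id` needs an explicit conversion if numeric values are wanted.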

Naveen Kaushik · Answer

You can also do the conversion by building a list of dictionaries from the elements and then converting it directly to a DataFrame:

    import xml.etree.ElementTree as ET
    import pandas as pd

    # Contents of test.xml
    # <?xml version="1.0" encoding="utf-8"?>
    # <tags>
    #   <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
    #   <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
    #   <row Id="3" TagName="elicitation" Count="10" />
    #   <row Id="5" TagName="open-source" Count="16" />
    # </tags>

    root = ET.parse('test.xml').getroot()

    tags = {"tags": []}
    for elem in root:
        tag = {}
        tag["Id"] = elem.attrib['Id']
        tag["TagName"] = elem.attrib['TagName']
        tag["Count"] = elem.attrib['Count']
        tags["tags"].append(tag)

    df_users = pd.DataFrame(tags["tags"])
    df_users.head()
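As a variation on the dictionary approach, each element's `attrib` mapping can be passed to pandas as-is, which also keeps the optional attributes. This is a sketch under assumptions: the XML is parsed from a string instead of `test.xml` so that it runs on its own, and the name `df_tags` is made up for the example.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Same rows as test.xml, inlined as a string so the example is self-contained.
xml_str = """<tags>
  <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="10" />
  <row Id="5" TagName="open-source" Count="16" />
</tags>"""

root = ET.fromstring(xml_str)

# One dict per <row>; attributes missing from a row end up as missing values.
df_tags = pd.DataFrame([dict(elem.attrib) for elem in root.iter('row')])

# XML attribute values are strings, so convert numeric columns explicitly.
df_tags['Count'] = df_tags['Count'].astype(int)
print(df_tags)
```

Rows without `ExcerptPostId` or `WikiPostId` get a missing value in those columns, so the column list does not have to be spelled out by hand.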