R: convert XML data to a data frame

Question

For an assignment, I'm trying to convert an XML file into a data frame in R. I have tried many different things and searched the internet for ideas, but without success. Here is my code so far:

    library(XML)

    url <- 'http://www.ggobi.org/book/data/olive.xml'
    doc <- xmlParse(url)
    root <- xmlRoot(doc)
    dataFrame <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
    data.frame(t(dataFrame), row.names = NULL)

The output I get looks like one giant vector of numbers. I'm trying to organize the data into a data frame, but I don't know how to adjust my code to get there.

hrbrmstr · Accepted Answer

It may not be as verbose as the XML package, but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws, which is a really recent addition to base R.

    library(xml2)

    pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")

    # get all the <record>s
    recs <- xml_find_all(pg, "//record")

    # extract and clean all the columns
    vals <- trimws(xml_text(recs))

    # extract and clean (if needed) the area names
    labs <- trimws(xml_attr(recs, "label"))

    # mine the column names from the two variable descriptions
    # this XPath construct lets us grab either the <categ…> or <real…> tags
    # and then grabs the 'name' attribute of them
    cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or self::realvariable]"), "name")

    # this converts each set of <record> columns to a data frame
    # after first converting each row to numeric and assigning
    # names to each column (making it easier to do the matrix to data frame conv)
    # note: the backslash must be doubled inside an R string literal
    dat <- do.call(rbind, lapply(strsplit(vals, "\\ +"), function(x) {
      data.frame(rbind(setNames(as.numeric(x), cols)))
    }))

    # then assign the area name column to the data frame
    dat$area_name <- labs

    head(dat)
    ##   region area palmitic palmitoleic stearic oleic linoleic linolenic
    ## 1      1    1     1075          75     226  7823      672        NA
    ## 2      1    1     1088          73     224  7709      781        31
    ## 3      1    1      911          54     246  8113      549        31
    ## 4      1    1      966          57     240  7952      619        50
    ## 5      1    1     1051          67     259  7771      672        50
    ## 6      1    1      911          49     268  7924      678        51
    ##   arachidic eicosenoic    area_name
    ## 1        60         29 North-Apulia
    ## 2        61         29 North-Apulia
    ## 3        63         29 North-Apulia
    ## 4        78         35 North-Apulia
    ## 5        80         46 North-Apulia
    ## 6        70         44 North-Apulia
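To try the same steps without network access, here is a minimal sketch on a tiny inline document. The XML string, its two records, and the three column names are made up for illustration and only imitate the layout described in this answer; they are not the real olive.xml. The "na" marker is turned into NA by hand before the numeric conversion, which is an assumption about how you would want to treat it.

```r
library(xml2)

# Hypothetical miniature that imitates the olive.xml layout (not the real file)
xml_txt <- '<ggobidata>
  <data>
    <variables>
      <categoricalvariable name="region"/>
      <realvariable name="palmitic"/>
      <realvariable name="linolenic"/>
    </variables>
    <records>
      <record label="North-Apulia"> 1 1075 na </record>
      <record label="South-Apulia"> 3 1088 31 </record>
    </records>
  </data>
</ggobidata>'

pg   <- read_xml(xml_txt)
recs <- xml_find_all(pg, "//record")

vals <- trimws(xml_text(recs))                               # one string per record
labs <- xml_attr(recs, "label")                              # area names
cols <- xml_attr(xml_find_all(pg, "//variables/*"), "name")  # column names

# Split each record on runs of whitespace, map "na" to NA, convert to numeric
rows <- lapply(strsplit(vals, "[[:space:]]+"), function(x) {
  x[x == "na"] <- NA
  as.numeric(x)
})

dat <- as.data.frame(do.call(rbind, rows))
names(dat) <- cols
dat$area_name <- labs

print(dat)
```

Binding the numeric vectors into a matrix first and converting once keeps the sketch short; for a large file this also avoids building one small data frame per record.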

UPDATE

I prefer to do the last bit this way now:

    library(tidyverse)

    # as_data_frame() is deprecated in current tibble releases; as_tibble() is its replacement
    strsplit(vals, "[[:space:]]+") %>%
      map_df(~as_data_frame(as.list(setNames(., cols)))) %>%
      mutate(area_name = labs)
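One thing to watch with this variant: strsplit() returns strings and nothing in the pipeline converts them, so the resulting columns should all be character, with "na" kept as text. A minimal base R sketch of a follow-up conversion is below. The small data frame is a hypothetical stand-in for the pipeline's result, not actual output.

```r
# Hypothetical character-only data frame, shaped like the pipeline's result
chr_df <- data.frame(
  region    = c("1", "1"),
  palmitic  = c("1075", "1088"),
  linolenic = c("na", "31"),
  area_name = c("North-Apulia", "North-Apulia"),
  stringsAsFactors = FALSE
)

# Convert column by column; "na" is treated as missing, and
# as.is = TRUE keeps non-numeric columns as character (no factors)
num_df <- chr_df
num_df[] <- lapply(chr_df, type.convert, na.strings = c("NA", "na"), as.is = TRUE)

str(num_df)
```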

Parfait · Answer

Great answers above! For future readers: whenever you face complex XML that needs importing into R, consider restructuring the XML document with XSLT (a special-purpose declarative programming language that transforms XML content for various end uses). Then simply use R's xmlToDataFrame() function from the XML package.
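To illustrate why the restructuring helps, here is a minimal sketch of xmlToDataFrame() on XML that is already "flat", with one child element per column. The inline document and its element names are invented for the example.

```r
library(XML)

# Hypothetical, already restructured XML: one child element per column
xml_txt <- '<records>
  <record><area_name>North-Apulia</area_name><palmitic>1075</palmitic><oleic>7823</oleic></record>
  <record><area_name>North-Apulia</area_name><palmitic>1088</palmitic><oleic>7709</oleic></record>
</records>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Each <record> becomes a row, each child element a column (values arrive as text)
flat_df <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"),
                          stringsAsFactors = FALSE)

# Convert the numeric columns explicitly
flat_df$palmitic <- as.numeric(flat_df$palmitic)
flat_df$oleic    <- as.numeric(flat_df$oleic)

print(flat_df)
free(doc)
```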

Unfortunately, R does not have a dedicated XSLT package available on CRAN for all operating systems. The listed SXLT appears to be a Linux package and cannot be used on Windows. See the unanswered SO questions here and here. I understand that @hrbrmstr (above) maintains an XSLT project on GitHub. Nonetheless, nearly all general-purpose languages have XSLT processors, including Java, C#, Python, PHP, Perl, and VB.

Below is the open-source Python route. Because the XML document is fairly nuanced, two XSLTs are used (XSLT gurus can of course combine them into one, but try as I might, I couldn't get that to work).

FIRST XSLT (using a recursive template)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output omit-xml-declaration="yes" indent="yes"/>
      <xsl:strip-space elements="*"/>

      <!-- Identity Transform -->
      <xsl:template match="node()|@*">
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
      </xsl:template>

      <!-- Recursively split each record's text on the separator into <data> elements -->
      <xsl:template match="record/text()" name="tokenize">
        <xsl:param name="text" select="."/>
        <xsl:param name="separator" select="' '"/>
        <xsl:choose>
          <xsl:when test="not(contains($text, $separator))">
            <data>
              <xsl:value-of select="normalize-space($text)"/>
            </data>
          </xsl:when>
          <xsl:otherwise>
            <data>
              <xsl:value-of select="normalize-space(substring-before($text, $separator))"/>
            </data>
            <xsl:call-template name="tokenize">
              <xsl:with-param name="text" select="substring-after($text, $separator)"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:template>

      <!-- Drop the metadata elements -->
      <xsl:template match="description|variables|categoricalvariable|realvariable">
      </xsl:template>
    </xsl:stylesheet>

SECOND XSLT

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Copy the <records> container and process its children -->
      <xsl:template match="records">
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
      </xsl:template>

      <!-- Map each positional <data> element to a named column -->
      <xsl:template match="record">
        <record>
          <area_name><xsl:value-of select="@label"/></area_name>
          <area><xsl:value-of select="data[1]"/></area>
          <region><xsl:value-of select="data[2]"/></region>
          <palmitic><xsl:value-of select="data[3]"/></palmitic>
          <palmitoleic><xsl:value-of select="data[4]"/></palmitoleic>
          <stearic><xsl:value-of select="data[5]"/></stearic>
          <oleic><xsl:value-of select="data[6]"/></oleic>
          <linoleic><xsl:value-of select="data[7]"/></linoleic>
          <linolenic><xsl:value-of select="data[8]"/></linolenic>
          <arachidic><xsl:value-of select="data[9]"/></arachidic>
          <eicosenoic><xsl:value-of select="data[10]"/></eicosenoic>
        </record>
      </xsl:template>
    </xsl:stylesheet>

Python (using the lxml module)

    import os
    import lxml.etree as ET

    # directory of this script; the .xsl files are expected alongside it
    cd = os.path.dirname(os.path.abspath(__file__))

    # FIRST TRANSFORMATION
    dom = ET.parse('http://www.ggobi.org/book/data/olive.xml')
    xslt = ET.parse(os.path.join(cd, 'Olive.xsl'))
    transform = ET.XSLT(xslt)
    newdom = transform(dom)

    tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

    with open(os.path.join(cd, 'Olive_py.xml'), 'wb') as xmlfile:
        xmlfile.write(tree_out)

    # SECOND TRANSFORMATION (reads the intermediate file and overwrites it)
    dom = ET.parse(os.path.join(cd, 'Olive_py.xml'))
    xslt = ET.parse(os.path.join(cd, 'Olive2.xsl'))
    transform = ET.XSLT(xslt)
    newdom = transform(dom)

    tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

    with open(os.path.join(cd, 'Olive_py.xml'), 'wb') as xmlfile:
        xmlfile.write(tree_out)

R

    library(XML)

    # LOADING TRANSFORMED XML INTO R DATA FRAME
    doc <- xmlParse("Olive_py.xml")
    xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
    View(xmldf)

Output

    area_name     area  region  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
    North-Apulia     1       1      1075           75      226   7823       672         na                     60
    North-Apulia     1       1      1088           73      224   7709       781         31         61          29
    North-Apulia     1       1       911           54      246   8113       549         31         63          29
    North-Apulia     1       1       966           57      240   7952       619         50         78          35
    North-Apulia     1       1      1051           67      259   7771       672         50         80          46
    ...

(A little cleanup of the very first record is needed: the XML document has an extra space after "na", so arachidic and eicosenoic were shifted forward.)
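The shift can be reproduced, and avoided, in base R. The sketch below uses a hypothetical record string with two spaces after "na", as the note above describes. The exact whitespace in the real file is an assumption.

```r
# Hypothetical record text with two spaces after "na"
rec <- "1 1 1075 75 226 7823 672 na  60 29"

# Splitting on a single space produces an empty token after "na",
# which pushes the last two values one position to the right
naive <- strsplit(rec, " ", fixed = TRUE)[[1]]
print(naive)

# Splitting on runs of whitespace keeps the ten fields aligned
fields <- strsplit(trimws(rec), "[[:space:]]+")[[1]]
print(fields)
```

Splitting on a whitespace run rather than a single separator is what the R answers on this page do, which is why they do not need the manual cleanup.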

Rich Scriven · Answer

Here's what I came up with. It matches the olive oil csv file that is also available on the same page. They show X as the first column name, but I don't see it in the xml, so I just added it manually.

It's probably best to split this into sections and then assemble the final data frame once we have all the pieces. We can also use the [.XML* shortcuts for XPath and the other [[ convenience accessor functions.

    library(XML)

    url <- "http://www.ggobi.org/book/data/olive.xml"

    ## parse the xml document and get the top-level XML node
    doc <- xmlParse(url)
    top <- xmlRoot(doc)

    ## create the data frame
    df <- cbind(
        ## get all the labels for the first column (groups)
        X = unlist(doc["//record//@label"], use.names = FALSE),
        read.table(
            ## get all the records as a character vector
            text = xmlValue(top[["data"]][["records"]]),
            ## get the column names from 'variables'
            col.names = xmlSApply(top[["data"]][["variables"]], xmlGetAttr, "name"),
            ## treat 'na' in the records as NA
            na.strings = "na"
        )
    )

    ## result
    head(df)
    #              X region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
    # 1 North-Apulia      1    1     1075          75     226  7823      672        NA        60         29
    # 2 North-Apulia      1    1     1088          73     224  7709      781        31        61         29
    # 3 North-Apulia      1    1      911          54     246  8113      549        31        63         29
    # 4 North-Apulia      1    1      966          57     240  7952      619        50        78         35
    # 5 North-Apulia      1    1     1051          67     259  7771      672        50        80         46
    # 6 North-Apulia      1    1      911          49     268  7924      678        51        70         44

    ## clean up
    free(doc); rm(doc, top); gc()
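The read.table() step does most of the work here, and it can be tried on its own in base R. In the sketch below, the multi-line string is a hypothetical stand-in for the text returned by xmlValue() on the records node, and the six column names are a shortened, made-up subset.

```r
# Hypothetical stand-in for the concatenated text of all <record> nodes
records_txt <- "
1 1 1075 75 na  60
1 1 1088 73 31 61
"

# Default whitespace separator: any run of spaces or tabs splits fields,
# blank lines are skipped, and 'na' is read as NA
small_df <- read.table(
    text = records_txt,
    col.names = c("region", "area", "palmitic", "palmitoleic", "linolenic", "arachidic"),
    na.strings = "na"
)

print(small_df)
```

Because the default separator matches any amount of whitespace, a doubled space after "na" does not misalign the columns.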