I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I display the text of a plain-text corpus in the R tm package?
I loaded 323 plain-text files into a corpus:
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)
But when I call the corpus with:
corpus[[1]]
I always get output like this instead of the text of the corpus itself:
<<PlainTextDocument>>
Metadata: 7
Content: chars: 144
Content: chars: 141
Content: chars: 224
Content: chars: 75
Content: chars: 105
How can I display the text of the corpus?
Thanks!
UPDATE: Reproducible example. I tried it with the built-in sample data:
> data("crude")
> crude
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20
> crude[1]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
> crude[[1]]
<<PlainTextDocument>>
Metadata: 15
Content: chars: 527
How can I print the text of the documents?
UPDATE 2: Session info:
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-1 NLP_0.1-7
loaded via a namespace (and not attached):
[1] parallel_3.1.3 slam_0.1-32 tools_3.1.3
You can try converting the corpus text to a data frame and then access the text you need from the data frame itself. I used the built-in sample data "crude" (from the tm package) as an example.
data("crude")
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)
dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
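A variant of the same idea that also keeps each document's id next to its text might look like the sketch below. It assumes tm 0.6 or later; the column names `doc_id` and `text` are chosen here for illustration only, and each document is collapsed to a single string so that one row holds one document.

```r
library(tm)
data("crude")

# One string per document (multi-line contents are joined with newlines)
texts <- vapply(crude,
                function(d) paste(as.character(d), collapse = "\n"),
                character(1))

# The id stored in each document's metadata
ids <- vapply(crude,
              function(d) as.character(meta(d, "id")),
              character(1))

df <- data.frame(doc_id = unname(ids),
                 text = unname(texts),
                 stringsAsFactors = FALSE)

# Text of the first document
df$text[1]
```

With the ids in their own column, a single document can be looked up by id rather than by row position.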
This works for me, with the latest version of tm, to print the text content:
corpus[[1]]$content
Note: this is more or less what Ricky suggested in the earlier comment. Sorry, I wanted to post this as a comment, but my reputation is only 25 (you need at least 50 reputation to comment).
Here is a simple, direct way to display the text of a corpus:
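For comparison, here is a minimal sketch of three ways to reach the same text for one document, assuming tm 0.6 or later, where `content()` and `as.character()` have methods for PlainTextDocument. The variable names are illustrative.

```r
library(tm)
data("crude")

doc <- crude[[1]]

# Three routes to the raw text of a single document
txt_field    <- doc$content        # direct access to the list element
txt_accessor <- content(doc)       # accessor generic
txt_char     <- as.character(doc)  # coercion to a character vector

# Both comparisons should be TRUE if the three routes agree
identical(txt_field, txt_accessor)
identical(txt_accessor, txt_char)
```

The accessor forms are probably the safer choice, since they do not depend on how the document object is laid out internally.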
strwrap(corpus[[1]])
For the crude data, this produces:
[1] "Diamond Shamrock Corp said that effective today it had cut its contract"
[2] "prices for crude oil by 1.50 dlrs a barrel. The reduction brings its posted"
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."
[4] "\"The price reduction today was made in the light of falling oil product"
[5] "prices and a weak crude oil market,\" a company spokeswoman said. Diamond is"
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"
[7] "posted, prices over the last two days citing weak oil markets. Reuter"
I can confirm that as of tm 0.6-1, inspect() no longer prints the document text. You can pair tm with the qdap package, which I maintain, to convert the corpus to a data.frame easily:
library(qdap)
as.data.frame(crude)
To make the output look more like the old inspect() behaviour, you can use:
as.data.frame(crude) %>%
with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))
The output looks like this:
Diamond Shamrock Corp said that effective today it had cut its
contract prices for crude oil by 1.50 dlrs a barrel. The reduction
brings its posted price for West Texas Intermediate to 16.00 dlrs a
barrel, the copany said. "The price reduction today was made in the
light of falling oil product prices and a weak crude oil market," a
company spokeswoman said. Diamond is the latest in a line of U.S. oil
companies that have cut its contract, or posted, prices over the last
two days citing weak oil markets. Reuter
OPEC may be forced to meet before a scheduled June session to
readdress its production cutting agreement if the organization wants
to halt the current slide in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy as OPEC
thought. They may need an emergency meeting to sort out the
problems," said Daniel Yergin, director of Cambridge Energy Research
Associates, CERA. Analysts and oil industry sources said the problem
OPEC faces is excess oil supply in world oil markets. "OPEC's problem
is not a price problem but a production issue and must be addressed
in that way," said Paul Mlotok, oil analyst with Salomon Brothers
Inc. He said the market's earlier optimism about OPE
.
.
.
From the tm vignette, this works:
writeLines(as.character(doc.corpus[[8]]))
where "8" is the index of the document you want.
We can get the content of each element of the corpus:
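The same call can be repeated over the whole corpus. The sketch below uses the built-in `crude` corpus in place of `doc.corpus` and prints each document under a header line with its id; the header format is an arbitrary choice made for this example.

```r
library(tm)
data("crude")

# Print every document, preceded by its id and followed by a blank line
for (i in seq_along(crude)) {
  doc <- crude[[i]]
  cat("== ", as.character(meta(doc, "id")), " ==\n", sep = "")
  writeLines(as.character(doc))
  cat("\n")
}
```

For a large corpus it may be more practical to restrict the loop to a few indices, for example `seq_len(3)`.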
data("crude")
out <- sapply(crude, function(x){x$content})
out
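# optionally export all texts to a single file (the directory must already exist);
# writeCorpus() expects a corpus and one file name per document, so writeLines() is used here
writeLines(out, "outputdir/corpus.txt")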
> inspect(crude[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
$`reut-00001.xml`
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter