How to create a frequency list of every word in a file?

Question

I have a file like this:

This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.

I would like to generate a two-column list. The first column shows which words appear, the second column shows how often they appear, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1

To make this simpler, prior to processing the list, I will remove all punctuation and change all text to lowercase.
Unless there is a simple solution, words and word can count as two separate words.

So far, I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
    count="$(grep -c $line file1.txt)"
    echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this only displays "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

eduffy · Accepted Answer

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

Bohdan · Answer

uniq -c already does what you want, just sort the input:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

output:

      6 a
      7 d
      7 s
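
The sort step matters because uniq only merges adjacent identical lines. A minimal sketch of the difference, using a made-up three-word input (the function names count_unsorted and count_sorted are hypothetical, chosen only for this illustration):

```shell
#!/bin/sh
# Hypothetical helpers for illustration; both read words from stdin.

# Without sort: uniq -c only collapses runs of adjacent identical lines,
# so the same word can be reported more than once.
count_unsorted() {
    tr ' ' '\n' | uniq -c
}

# With sort: identical words become adjacent, so each word is reported once.
count_sorted() {
    tr ' ' '\n' | sort | uniq -c
}

echo 'a s a' | count_unsorted   # "a" shows up on two separate lines
echo 'a s a' | count_sorted     # "a" shows up once, with a count of 2
```

The exact padding in front of the counts varies between uniq implementations, so it is safer to parse the output by field than by column position.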

Rony · Answer

Contents of the input file

$ cat inputFile.txt
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic will count and ignore case, but the result list will have This instead of this.
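
To illustrate that last point: uniq -i compares lines case-insensitively but prints the first line of each group unchanged. The sketch below contrasts this with folding case via tr before counting. The function names are hypothetical, and the use of sort -f (so that mixed-case variants end up adjacent) is an assumption added here, not part of the command above. Note also that uniq -i is widely available but not specified by POSIX.

```shell
#!/bin/sh
# Hypothetical helpers; both read one word per line from stdin.

# uniq -i groups lines case-insensitively but prints the first line of
# each group as-is, so the original capitalization survives.
count_ignore_case() {
    LC_ALL=C sort -f | uniq -ic
}

# Folding case with tr before sorting yields lowercase words instead.
count_lowercased() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -c
}

printf 'Some\nsome\nSome\n' | count_ignore_case   # one group, spelled "Some"
printf 'Some\nsome\nSome\n' | count_lowercased    # one group, spelled "some"
```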

Sheharyar · Answer

Let's use AWK!

This function lists the frequency of each word occurring in the provided file, in descending order:

function wordfrequency() {
  awk '
    BEGIN { FS="[^a-zA-Z]+" }
    {
      for (i=1; i<=NF; i++) {
        word = tolower($i)
        words[word]++
      }
    }
    END {
      for (w in words)
        printf("%3d %s\n", words[w], w)
    }
  ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

Source: Rubrique AWK
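
One thing to watch for with a regex field separator like this one: when a line begins or ends with non-letter characters (a trailing period, for instance), awk produces an empty field, and that empty string gets counted like any other word. A possible variant that skips empty fields is sketched below. The name wordfrequency_nonempty is hypothetical, and everything else mirrors the function above.

```shell
#!/bin/sh
# Hypothetical variant of the function above that ignores empty fields.
wordfrequency_nonempty() {
    awk '
        BEGIN { FS = "[^a-zA-Z]+" }
        {
            for (i = 1; i <= NF; i++) {
                word = tolower($i)
                if (word != "")      # a trailing "." leaves an empty last field
                    words[word]++
            }
        }
        END {
            for (w in words)
                printf("%3d %s\n", words[w], w)
        }
    ' | sort -rn
}

echo 'One time. One word.' | wordfrequency_nonempty
```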

potong · Answer

This might work for you:

tr '[:upper:]' '[:lower:]' <file | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'

John Red · Answer

Let's do it in Python 3!

"""Counts the frequency of each word in the given text; words are defined
as entities separated by whitespaces; punctuations and other symbols are
ignored; case-insensitive; input can be passed through stdin or through a
file specified as an argument; prints highest frequency words first"""

# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/

import sys

# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])

D = {}
for line in lines:
    for word in line.split():
        # Drop punctuation characters (the backslash is escaped so newer
        # Python versions do not warn about an invalid escape sequence)
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        if word in D:
            D[word] += 1
        else:
            D[word] = 1

# Print the most frequent words first
for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))

Let's call this script "frequency.py" and add a line to "~/.bash_aliases":

alias freq="python3 /path/to/frequency.py"

Now, to find the word frequencies in your file "content.txt", run:

freq content.txt

You can also pipe output to it:

cat content.txt | freq

And even analyze text from multiple files:

cat content.txt story.txt article.txt | freq

If you are using Python 2, just replace

''.join(list(filter(args...))) with filter(args...)
python3 with python
print(whatever) with print whatever

Jerin A Mathews · Answer

You can use tr for this, just run

tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt

Sample output for a text file of city names:

   3026 Toronto
   2006 Montréal
   1117 Edmonton
   1048 Calgary
    905 Ottawa
    724 Winnipeg
    673 Vancouver
    495 Brampton
    489 Mississauga
    482 London
    467 Hamilton

Dennis Williamson · Answer

The sorting requires GNU AWK (gawk). If you have another AWK without asort(), this can be easily adjusted and then piped to sort.

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile

Broken out on multiple lines:

awk '{
    gsub(/\./, "");
    for (i = 1; i <= NF; i++) {
        w = tolower($i);
        count[w]++;
        words[w] = w
    }
}
END {
    qty = asort(words);
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile
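
For an AWK without asort(), one way to make the adjustment mentioned above is to drop the words array and let the external sort command order the output. This is only a sketch under that assumption: the function name count_words_portable is hypothetical, and the period-stripping and lowercasing are copied from the gawk version.

```shell
#!/bin/sh
# Hypothetical portable variant: no asort(); the sort command orders the
# output lines alphabetically instead.
count_words_portable() {
    awk '{
        gsub(/\./, "")                  # strip periods, as in the gawk version
        for (i = 1; i <= NF; i++)
            count[tolower($i)]++        # tally each lowercased word
    }
    END {
        for (w in count)
            print w "@" count[w]        # same word@count output format
    }' "$@" | sort
}

echo 'Some of the words. Some words.' | count_words_portable
```

Because sort compares whole word@count lines rather than the bare words, the order may differ slightly from asort() when one word is a prefix of another, depending on the locale.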

Dani Konoplya · Answer

awk 'BEGIN { word[""] = 0 } { for (el = 1; el <= NF; ++el) { word[$el]++ } } END { for (i in word) { if (i != "") { print word[i], i } } }' file.txt | sort -nr

GL2014 · Answer

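#!/usr/bin/env bash

# Associative array mapping each word to its count
declare -A map
words="$1"

# Require an existing file as the first argument
[[ -f $1 ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }

# Split each line on whitespace and tally every word
while read -r line; do
  for word in $line; do
    ((map[$word]++))
  done
done < "$words"

# Report the counts, highest first (the count is the fifth field)
for key in "${!map[@]}"; do
  echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5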