Generating N-grams from a sentence

Question

How can I generate the n-grams of a string like:

String Input="This is my car."

I want to generate n-grams from this input:

Input Ngram size = 3

The output should be:

    This
    is
    my
    car

    This is
    is my
    my car

    This is my
    is my car

Please give me an idea of how to implement this in Java, or tell me whether a library is available for it.

I am trying to use this NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.

Shashikant Kore · Accepted Answer

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.

aioobe · Answer

I believe this would do what you want:

    import java.util.*;

    public class Test {

        // Returns all word n-grams of size n from str, in order of appearance.
        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i + n));
            return ngrams;
        }

        // Joins words[start..end) with single spaces.
        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++)
                sb.append((i > start ? " " : "") + words[i]);
            return sb.toString();
        }

        public static void main(String[] args) {
            for (int n = 1; n <= 3; n++) {
                for (String ngram : ngrams(n, "This is my car."))
                    System.out.println(ngram);
                System.out.println();
            }
        }
    }

Output:

    This
    is
    my
    car.

    This is
    is my
    my car.

    This is my
    is my car.
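Note that this output keeps the trailing period ("car."), whereas the output requested in the question has "car". A minimal sketch of one way to handle that, assuming punctuation should simply be discarded: the class name `WordNgrams` and the helper `tokenize` are hypothetical, and the regular expression treats every run of characters that are neither letters nor digits as a separator (so a word like "don't" would be split in two).

```java
import java.util.ArrayList;
import java.util.List;

class WordNgrams {

    // Hypothetical helper: splits on runs of non-letter, non-digit characters,
    // so "car." becomes "car". Empty tokens are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns all word n-grams of size n, in order of appearance.
    static List<String> ngrams(int n, String text) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> words = tokenize(text);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(ngrams(n, "This is my car."));
        }
    }
}
```

Whether stripping punctuation is appropriate depends on the application; a real tokenizer may be a better fit for text with contractions or hyphenated words.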

An "on-demand" solution implemented as an Iterator:

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    class NgramIterator implements Iterator<String> {

        String[] words;
        int pos = 0, n;

        public NgramIterator(int n, String str) {
            this.n = n;
            words = str.split(" ");
        }

        public boolean hasNext() {
            return pos < words.length - n + 1;
        }

        // Builds the n-gram starting at the current position, then advances by one word.
        public String next() {
            if (!hasNext())
                throw new NoSuchElementException();
            StringBuilder sb = new StringBuilder();
            for (int i = pos; i < pos + n; i++)
                sb.append((i > pos ? " " : "") + words[i]);
            pos++;
            return sb.toString();
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }
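On Java 8 or later, the same on-demand idea could also be expressed with a lazily evaluated stream. This is only a sketch under that assumption, not part of the original answer; the class name `LazyNgrams` is hypothetical, and tokenization is the same plain `split(" ")` used above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class LazyNgrams {

    // Returns a lazily evaluated stream; each n-gram is built only when consumed.
    static Stream<String> ngrams(int n, String str) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String[] words = str.split(" ");
        // Number of n-grams; zero when the sentence has fewer than n words.
        int count = Math.max(0, words.length - n + 1);
        return IntStream.range(0, count)
                .mapToObj(i -> String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }

    public static void main(String[] args) {
        // limit(2) takes only the first two bigrams from the stream.
        List<String> firstTwo = ngrams(2, "This is my car.")
                .limit(2)
                .collect(Collectors.toList());
        System.out.println(firstTwo); // [This is, is my]
    }
}
```

A stream composes with `filter`, `map`, and `limit`, while the explicit iterator also works on Java versions before 8.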

Landei · Answer

This code returns an array of all n-grams (as Strings) of the given length:

    public static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        // Fewer words than len: no n-grams (also avoids a negative array size).
        if (parts.length < len)
            return new String[0];
        String[] result = new String[parts.length - len + 1];
        for (int i = 0; i < parts.length - len + 1; i++) {
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < len; k++) {
                if (k > 0) sb.append(' ');
                sb.append(parts[i + k]);
            }
            result[i] = sb.toString();
        }
        return result;
    }

For example:

    System.out.println(Arrays.toString(ngrams("This is my car", 2)));
    //--> [This is, is my, my car]
    System.out.println(Arrays.toString(ngrams("This is my car", 3)));
    //--> [This is my, is my car]

Dung TQ · Answer

    public static void CreateNgram(ArrayList<String> list, int cutoff) {
        try {
            NGramModel ngramModel = new NGramModel();
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
            POSTaggerME tagger = new POSTaggerME(model);
            perfMon.start();
            for (int i = 0; i < list.size(); i++) {
                String inputString = list.get(i);
                ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
                String line;
                while ((line = lineStream.read()) != null) {
                    String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                    String[] tags = tagger.tag(whitespaceTokenizerLine);

                    POSSample sample = new POSSample(whitespaceTokenizerLine, tags);

                    perfMon.incrementCounter();

                    String words[] = sample.getSentence();
                    if (words.length > 0) {
                        // Add the 2-grams and 3-grams of this sentence to the model.
                        for (int k = 2; k < 4; k++) {
                            ngramModel.add(new StringList(words), k, k);
                        }
                    }
                }
            }
            // Drop n-grams that occur fewer than `cutoff` times.
            ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
            Iterator<StringList> it = ngramModel.iterator();
            while (it.hasNext()) {
                StringList strList = it.next();
                System.out.println(strList.toString());
            }
            perfMon.stopAndPrintFinalResult();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

Here is my code for creating n-grams. In this case, n = 2, 3. Word-sequence n-grams whose count is below the cutoff value are dropped from the result set. The input is a list of sentences, which is then parsed using an OpenNLP tool.
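To illustrate the cutoff idea without the OpenNLP dependency, here is a minimal plain-Java sketch. It is not the OpenNLP implementation: the class `NgramCutoff` and method `frequentNgrams` are hypothetical names, tokenization is a simple whitespace split, and it assumes that "cutoff" means keeping only n-grams seen at least `cutoff` times.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NgramCutoff {

    // Counts word n-grams for every n in [minN, maxN] across all sentences,
    // then removes those that occur fewer than `cutoff` times (an assumption
    // about the intended cutoff semantics).
    static Map<String, Integer> frequentNgrams(List<String> sentences, int minN, int maxN, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            if (sentence.trim().isEmpty()) {
                continue; // nothing to count in a blank sentence
            }
            String[] words = sentence.trim().split("\\s+");
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= words.length; i++) {
                    String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        // Removing from the values view removes the corresponding map entries.
        counts.values().removeIf(c -> c < cutoff);
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "this is my car", "this is my house", "that is my car");
        // 2-grams and 3-grams that appear at least twice.
        System.out.println(frequentNgrams(sentences, 2, 3, 2));
    }
}
```

With these three sample sentences, "my house" and "that is" each occur once and are dropped, while "is my" occurs three times and is kept.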

tozCSS · Answer

    // Requires: import java.util.*;

    /**
     * @param str sentence to split; should contain at least one word
     * @param maxGramSize should be at least 1
     * @return list of contiguous word n-grams up to maxGramSize from the sentence
     */
    public static List<String> generateNgramsUpto(String str, int maxGramSize) {

        // Split on runs of non-word characters (note the escaped backslash).
        List<String> sentence = Arrays.asList(str.split("\\W+"));

        List<String> ngrams = new ArrayList<String>();
        int ngramSize = 0;
        StringBuilder sb = null;

        // sentence becomes ngrams
        for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
            String word = it.next();

            // 1- add the word itself
            sb = new StringBuilder(word);
            ngrams.add(word);
            ngramSize = 1;
            it.previous();

            // 2- insert the preceding words and add those n-grams too
            while (it.hasPrevious() && ngramSize < maxGramSize) {
                sb.insert(0, ' ');
                sb.insert(0, it.previous());
                ngrams.add(sb.toString());
                ngramSize++;
            }

            // go back to the initial position
            while (ngramSize > 0) {
                ngramSize--;
                it.next();
            }
        }
        return ngrams;
    }

Call:

    long startTime = System.currentTimeMillis();
    List<String> ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
    long stopTime = System.currentTimeMillis();
    System.out.println("My time = " + (stopTime - startTime) + " ms with ngramsize = " + ngrams.size());
    System.out.println(ngrams.toString());

Output:

    My time = 1 ms with ngramsize = 9
    [This, is, This is, my, is my, This is my, car, my car, is my car]

Jagesh Maharjan · Answer

Check this out:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class NGram {

        public static void main(String[] args) {
            NGram nGram = new NGram();
            String[] tokens = "this is my car".split(" ");
            int i = tokens.length;
            List<String> ngrams = new ArrayList<>();
            // Collect n-grams of every size, from the full sentence down to single words.
            while (i >= 1) {
                ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
                i--;
            }
            System.out.println(ngrams);
        }

        // Adds the n-gram starting at tokens[0], then recurses on the remaining tokens.
        private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
            StringBuilder strbldr = new StringBuilder();
            if (tokens.length < n) {
                return ngrams;
            } else {
                for (int i = 0; i < n; i++) {
                    strbldr.append(tokens[i]).append(" ");
                }
                ngrams.add(strbldr.toString().trim());
                String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
                return getNGram(newTokens, n, ngrams);
            }
        }
    }

A simple recursive function, with better running time.

M Sach · Answer

    public static void main(String[] args) {
        String[] words = "This is my car.".split(" ");
        // stepSize is n - 1, so 0..2 yields unigrams, bigrams, and trigrams.
        for (int n = 0; n < 3; n++) {
            List<String> list = ngrams(n, words);
            for (String ngram : list) {
                System.out.println(ngram);
            }
            System.out.println();
        }
    }

    public static List<String> ngrams(int stepSize, String[] words) {
        List<String> ngrams = new ArrayList<String>();
        for (int i = 0; i < words.length - stepSize; i++) {
            String initialWord = "";
            int internalCount = i;
            int internalStepSize = i + stepSize;
            while (internalCount <= internalStepSize && internalCount < words.length) {
                initialWord = initialWord + " " + words[internalCount];
                ++internalCount;
            }
            // trim() drops the leading space added by the first concatenation
            ngrams.add(initialWord.trim());
        }
        return ngrams;
    }