I need to compare documents stored in a database and come up with a similarity score between 0 and 1.
The method I use needs to be very simple: a vanilla implementation of n-grams (where the number of grams is configurable), together with a simple implementation of tf-idf and cosine similarity.
Is there a program that can do this, or should I start writing it from scratch?
Check out the NLTK package: http://www.nltk.org, it has everything you need.
For cosine similarity:
import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
For n-grams:
from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:
    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:
    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
For tf-idf you will first have to compute the term distribution. I use Lucene for that, but you can do something very similar with NLTK; use FreqDist:
http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
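As a rough illustration of that FreqDist route (not code from this answer; the toy corpus and the log-based idf below are assumptions for illustration):

import math
from nltk import FreqDist
from nltk.tokenize import word_tokenize

docs = ["this is a foo bar", "foo bar bar black sheep", "this is a sentence"]
tokens = [word_tokenize(d.lower()) for d in docs]

# term frequencies per document
tf = [FreqDist(toks) for toks in tokens]

# document frequency and idf over the whole corpus
vocab = set(w for toks in tokens for w in toks)
df = dict((w, sum(1 for toks in tokens if w in toks)) for w in vocab)
idf = dict((w, math.log(float(len(docs)) / df[w])) for w in vocab)

# tf-idf weight of each term in each document
tfidf_docs = [dict((w, tf[i][w] * idf[w]) for w in tf[i]) for i in range(len(docs))]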
If you like PyLucene, the following will show you how to compute tf.idf:
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
# fieldname, searcher, sim, tmap, max_doc, maxtokensperdoc and
# cosine_normalization are assumed to be defined elsewhere
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()])**0.5
            for k, v in rec.items():
                rec[k] = v / denom
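Once each document has been reduced to such a cosine-normalized {term: weight} dictionary, the similarity between two documents is just the dot product over their shared terms. A minimal sketch (the rec_a and rec_b names are hypothetical, not part of the snippet above):

def sparse_cosine(rec_a, rec_b):
    # both dicts are assumed to be cosine-normalized already, so the
    # dot product over the shared terms is the cosine similarity in [0, 1]
    return sum(w * rec_b[t] for t, w in rec_a.items() if t in rec_b)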
In case you are interested, I did a tutorial series (Part I and Part II) on tf-idf using Python's Scikits.learn (sklearn); one of the parts covers cosine similarity.
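Independently of those tutorials, here is a minimal sketch of the same idea with scikit-learn (the example corpus is an assumption for illustration): TfidfVectorizer does the n-gram extraction and tf-idf weighting, and cosine_similarity gives the pairwise scores between 0 and 1.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["this is a foo bar", "foo bar bar black sheep", "this is a sentence"]

# word n-grams of size 1 and 2, tf-idf weighted
vect = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vect.fit_transform(docs)

# pairwise cosine similarity between all documents
sims = cosine_similarity(tfidf_matrix)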
Here is an answer with just python + numpy. In short:
Cosine:
def cosine_sim(u, v):
    return np.dot(u, v) / (sqrt(np.dot(u, u)) * sqrt(np.dot(v, v)))
Ngrams:
def ngrams(sentence, n):
    return zip(*[sentence.split()[i:] for i in range(n)])
TF-IDF (it's a little weird, but it works):
def tfidf(corpus, vocab):
    """
    INPUT:
    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence',
    'sheep', 'this']
    OUTPUT:
    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i, _ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0
    matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc, _ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf * idf
    return [[tfidf[doc][term] for term in vocab] for doc, _ in enumerate(tfidf)]
Here is the long answer with the tests:
import numpy as np
from math import sqrt, log
from itertools import chain, product
from collections import defaultdict

def cosine_sim(u, v):
    return np.dot(u, v) / (sqrt(np.dot(u, u)) * sqrt(np.dot(v, v)))

def ngrams(sentence, n):
    return zip(*[sentence.split()[i:] for i in range(n)])

def tfidf(corpus, vocab):
    """
    INPUT:
    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence',
    'sheep', 'this']
    OUTPUT:
    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i, _ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0
    matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc, _ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf * idf
    return [[tfidf[doc][term] for term in vocab] for doc, _ in enumerate(tfidf)]

def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        return [sentence.split().count(i) for i in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus():
    sent1 = "this is a foo bar"
    sent2 = "foo bar bar black sheep"
    sent3 = "this is a sentence"
    all_sents = [sent1, sent2, sent3]
    corpus, vocab = corpus2vectors(all_sents)
    return corpus, vocab

def test_cosine():
    corpus, vocab = create_test_corpus()
    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0], 2)
        print ngrams(sentx[0], 3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..."
test_cosine()
print
print "Testing ngrams..."
test_ngrams()
print
print "Testing tfidf..."
test_tfidf()
print
[out]:
Testing cosine...
this is a foo bar
this is a foo bar
cosine = 1.0
this is a foo bar
foo bar bar black sheep
cosine = 0.507092552837
this is a foo bar
this is a sentence
cosine = 0.67082039325
foo bar bar black sheep
this is a foo bar
cosine = 0.507092552837
foo bar bar black sheep
foo bar bar black sheep
cosine = 1.0
foo bar bar black sheep
this is a sentence
cosine = 0.0
this is a sentence
this is a foo bar
cosine = 0.67082039325
this is a sentence
foo bar bar black sheep
cosine = 0.0
this is a sentence
this is a sentence
cosine = 1.0
Testing ngrams...
this is a foo bar
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]
foo bar bar black sheep
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')]
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')]
this is a sentence
[('this', 'is'), ('is', 'a'), ('a', 'sentence')]
[('this', 'is', 'a'), ('is', 'a', 'sentence')]
Testing tfidf...
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
In case you are still interested in this problem, I've done something very similar using Lucene (Java) and Jython. Here are some snippets from my code.
Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:
class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)
To turn a list of ngrams into a Document:
doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
              Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))
To store a document in an index:
wr = IndexWriter(index_dir, NGramAnalyzer(), True,
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)
Building queries is a little harder, because Lucene's QueryParser expects a query language with special operators, quotation marks, etc., but it can be worked around (as partly explained here).
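One way around it, sketched below under the assumption that the TermQuery, BooleanQuery, BooleanClause and Term classes are imported from Lucene as in the snippets above, is to build the query programmatically from the n-grams instead of going through QueryParser:

def build_ngram_query(grams, fieldname='n-grams'):
    # OR together one TermQuery per n-gram; no query-language escaping needed
    query = BooleanQuery()
    for gram in grams:
        query.add(TermQuery(Term(fieldname, gram)), BooleanClause.Occur.SHOULD)
    return query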
For our information retrieval course, we use code written by our professor in Java. Sorry, no Python port. "It is being released for educational and research purposes only under the GNU General Public License."
You can check out the documentation: http://userweb.cs.utexas.edu/~mooney/ir-course/doc/
But more specifically, check out: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html
You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/