Co-occurrence matrix from a list of words in Python

Question

I have a list of names like:

names = ['A', 'B', 'C', 'D']

and a list of documents that mention some of these names:

document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

I would like to get the output as a co-occurrence matrix like:

  A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0

There is a solution (Creating co-occurrence matrix) for this problem in R, but I could not do it in Python. I am thinking of doing it in pandas, but I have made no progress so far!

Malik Brahimi · Accepted Answer

Obviously this can be extended to suit your needs, but it performs the general operation you have in mind:

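import math

# Assumes `document` is defined as in the question.
for a in 'ABCD':
    for b in 'ABCD':
        count = 0
        for x in document:
            if a != b:
                # Different names: count documents that mention both.
                if a in x and b in x:
                    count += 1
            else:
                # Same name: count pairs of repeated mentions, n choose 2.
                n = x.count(a)
                if n >= 2:
                    count += math.factorial(n) // math.factorial(n - 2) // 2
        print('{} x {} = {}'.format(a, b, count))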
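One way the loop above could be extended to an arbitrary list of names is sketched below. The function name `cooccurrence_matrix` is hypothetical, and the sketch assumes the diagonal should stay at 0, as in the question's expected output (the answer's loop instead counts repeated mentions on the diagonal). Words that are not in `names`, such as 'K' and 'Z', are simply never looked up.

```python
def cooccurrence_matrix(names, documents):
    """Return a len(names) x len(names) nested list of co-occurrence counts."""
    matrix = []
    for a in names:
        row = []
        for b in names:
            if a == b:
                # Assumption: the diagonal is left at zero.
                row.append(0)
            else:
                # Count the documents that mention both a and b.
                row.append(sum(1 for doc in documents if a in doc and b in doc))
        matrix.append(row)
    return matrix


names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
matrix = cooccurrence_matrix(names, document)

# Print the matrix with row and column labels.
print(' ', ' '.join(names))
for name, row in zip(names, matrix):
    print(name, ' '.join(str(v) for v in row))
```

This scans every document for every pair of names, so it is only meant for small inputs.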

Morgoth · Answer

from collections import OrderedDict

document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]
names = ['A', 'B', 'C', 'D']

occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)

# Find the co-occurrences:
for l in document:
    for i in range(len(l)):
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1

# Print the matrix:
print(' ', ' '.join(occurrences.keys()))
for name, values in occurrences.items():
    print(name, ' '.join(str(i) for i in values.values()))

Output:

  A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0

Brad Campbell · Answer

Here is another solution using itertools and the Counter class from the collections module.

import numpy
import itertools
from collections import Counter

document = [['A', 'B'], ['C', 'B'], ['A', 'B', 'C', 'D']]

# Get all of the unique entries you have
varnames = tuple(sorted(set(itertools.chain(*document))))

# Get a list of all of the combinations you have
expanded = [tuple(itertools.combinations(d, 2)) for d in document]
expanded = itertools.chain(*expanded)

# Sort the combinations so that A,B and B,A are treated the same
expanded = [tuple(sorted(d)) for d in expanded]

# Count the combinations
c = Counter(expanded)

# Create the table
table = numpy.zeros((len(varnames), len(varnames)), dtype=int)

for i, v1 in enumerate(varnames):
    for j, v2 in enumerate(varnames[i:]):
        j = j + i
        table[i, j] = c[v1, v2]
        table[j, i] = c[v1, v2]

# Display the output
for row in table:
    print(row)

The output (which could easily be turned into a DataFrame) is:

[0 2 1 1]
[2 0 2 1]
[1 2 0 1]
[1 1 1 0]
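As a minimal sketch of that DataFrame conversion, assuming pandas is installed and that the row and column labels should be the same `varnames` used to build the table (the literal `table` below just repeats the output shown above):

```python
import numpy
import pandas

# Labels and counts copied from the output above.
varnames = ('A', 'B', 'C', 'D')
table = numpy.array([[0, 2, 1, 1],
                     [2, 0, 2, 1],
                     [1, 2, 0, 1],
                     [1, 1, 1, 0]])

# Use the same labels for rows and columns so cells can be looked up by name.
df = pandas.DataFrame(table, index=list(varnames), columns=list(varnames))

print(df)
print(df.loc['A', 'B'])  # co-occurrence count for the pair (A, B)
```

With labels attached, `df.loc[names, names]` could be used to keep only the names of interest when the documents contain extra words.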

Mockingbird · Answer

Another option is to use the constructor csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]) from scipy.sparse.csr_matrix, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].

The trick is to generate row_ind and col_ind by iterating over the documents and creating a list of tuples (doc_id, word_id). data would simply be a vector of ones of the same length.

Multiplying the transpose of the docs-words matrix by the docs-words matrix itself gives you the co-occurrence matrix.

Additionally, this is efficient in terms of both run time and memory usage, so it should also handle large corpora.

import numpy as np
import itertools
from scipy.sparse import csr_matrix


def create_co_occurences_matrix(allowed_words, documents):
    print(f"allowed_words:\n{allowed_words}")
    print(f"documents:\n{documents}")
    word_to_id = dict(zip(allowed_words, range(len(allowed_words))))
    # Keep only the allowed words and map each one to its integer id.
    documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
    row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
    data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
    max_word_id = max(itertools.chain(*documents_as_ids)) + 1
    docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), max_word_id))  # efficient arithmetic operations with CSR * CSR
    words_cooc_matrix = docs_words_matrix.T * docs_words_matrix  # multiplying the transpose of docs_words_matrix by docs_words_matrix generates the co-occurrence matrix
    words_cooc_matrix.setdiag(0)
    print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
    return words_cooc_matrix, word_to_id

Example run:

allowed_words = ['A', 'B', 'C', 'D']
documents = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix, word_to_id = create_co_occurences_matrix(allowed_words, documents)

Output:

allowed_words:
['A', 'B', 'C', 'D']
documents:
[['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix:
[[0 2 1 1]
 [2 0 2 1]
 [1 2 0 1]
 [1 1 1 0]]
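To see why the transpose product described in this answer yields co-occurrence counts, here is a dense pure-Python sketch of the same arithmetic. It is for intuition only and does not use scipy; the names `X`, `cooc` and `doc_freq` are illustrative, and it assumes a binary incidence matrix (a word counts once per document even if it is repeated).

```python
allowed_words = ['A', 'B', 'C', 'D']
documents = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

# Dense docs-words incidence matrix: X[d][w] is 1 if word w occurs in document d.
X = [[1 if word in doc else 0 for word in allowed_words] for doc in documents]

n = len(allowed_words)

# (X^T X)[i][j] = sum over documents d of X[d][i] * X[d][j],
# i.e. the number of documents that contain both word i and word j.
cooc = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]

# Before zeroing, the diagonal holds each word's document frequency.
doc_freq = [cooc[i][i] for i in range(n)]
for i in range(n):
    cooc[i][i] = 0

for row in cooc:
    print(row)
```

The product X[d][i] * X[d][j] is 1 only when both words are present in document d, which is why summing over documents counts shared documents. The sparse version does the same sums while skipping the zero entries.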

titipata · Answer

You can also use matrix tricks to find the co-occurrence matrix. I hope this works well when you have a larger vocabulary.

import scipy.sparse as sp

# Assumes `names` and `document` are defined as in the question.
voc2id = dict(zip(names, range(len(names))))
rows, cols, vals = [], [], []
for r, d in enumerate(document):
    for e in d:
        # Skip words that are not in the vocabulary.
        if voc2id.get(e) is not None:
            rows.append(r)
            cols.append(voc2id[e])
            vals.append(1)
X = sp.csr_matrix((vals, (rows, cols)))

Now you can find the co-occurrence matrix by simply multiplying X.T with X:

Xc = (X.T * X)  # co-occurrence matrix
Xc.setdiag(0)
print(Xc.toarray())
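One caveat worth illustrating: when no shape is passed, csr_matrix infers the dimensions from the index arrays, so if the last entry of names never appears in any document, X.T * X comes out smaller than len(names). The sketch below passes shape explicitly and also de-duplicates words within a document. Both choices are assumptions about the desired behavior, not part of the answer, and the sample documents are made up so that 'D' never occurs.

```python
import scipy.sparse as sp

names = ['A', 'B', 'C', 'D']
# Hypothetical input: 'D' never appears and 'B' is repeated in the last document.
document = [['A', 'B'], ['C', 'B', 'K'], ['B', 'B', 'A']]

voc2id = {name: i for i, name in enumerate(names)}
rows, cols = [], []
for r, d in enumerate(document):
    for e in set(d):  # assumption: count each word once per document
        if e in voc2id:
            rows.append(r)
            cols.append(voc2id[e])
vals = [1] * len(rows)

# An explicit shape keeps one column per name, even for names that never occur.
X = sp.csr_matrix((vals, (rows, cols)), shape=(len(document), len(names)))

Xc = X.T @ X  # co-occurrence counts, with document frequencies on the diagonal
Xc.setdiag(0)
dense = Xc.toarray()
print(dense)
```

With the explicit shape, row and column i of the result always correspond to names[i], so the matrix can be labeled without checking which names actually occurred.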