Comment calculer la mesure de similarité de distance de deux chaînes?

Question

Je dois calculer la similarité entre 2 chaînes. Alors qu'est-ce que je veux dire exactement? Laissez-moi vous expliquer avec un exemple:

Le vrai mot: hospital
Mot erroné: haspita

Mon objectif est maintenant de déterminer le nombre de caractères nécessaires pour modifier le mot erroné afin d'obtenir le mot réel. Dans cet exemple, je dois modifier 2 lettres. Alors, quel serait le pourcentage? Je prends toujours la longueur de la vraie Parole. Donc, il devient 2/8 = 25% donc ces 2 chaînes DSM données sont 75%.

Comment puis-je atteindre cet objectif avec la performance comme facteur clé?

dasblinkenlight · Accepted Answer

Ce que vous recherchez s'appelle edit distance ou Levenshtein distance . L'article de Wikipédia explique comment il est calculé et dispose d'un joli morceau de pseudocode en bas pour vous aider à coder cet algorithme en C # très facilement.

Voici une implémentation du premier site lié ci-dessous:

private static int CalcLevenshteinDistance(string a, string b) { if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) { return 0; } if (String.IsNullOrEmpty(a)) { return b.Length; } if (String.IsNullOrEmpty(b)) { return a.Length; } int lengthA = a.Length; int lengthB = b.Length; var distances = new int[lengthA + 1, lengthB + 1]; for (int i = 0; i <= lengthA; distances[i, 0] = i++); for (int j = 0; j <= lengthB; distances[0, j] = j++); for (int i = 1; i <= lengthA; i++) for (int j = 1; j <= lengthB; j++) { int cost = b[j - 1] == a[i - 1] ? 0 : 1; distances[i, j] = Math.Min ( Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1), distances[i - 1, j - 1] + cost ); } return distances[lengthA, lengthB]; }

Joshua Honig · Answer

Je viens d’aborder exactement le même problème il ya quelques semaines. Depuis que quelqu'un demande maintenant, je vais partager le code. Dans mes tests exhaustifs, mon code est environ 10 fois plus rapide que l'exemple C # de Wikipedia, même lorsqu'aucune distance maximale n'est spécifiée. Lorsqu'une distance maximale est fournie, ce gain de performance augmente de 30x à 100x +. Notez quelques points clés pour la performance:

Si vous devez encore et encore comparer les mêmes mots, commencez par les convertir en tableaux de nombres entiers. L'algorithme de Damerau-Levenshtein comprend de nombreuses comparaisons>, <, == et ints beaucoup plus rapidement que chars.
Il inclut un mécanisme de court-circuit permettant de quitter si la distance dépasse un maximum prévu.
Utilisez un ensemble tournant de trois tableaux plutôt qu'une matrice massive comme dans toutes les implémentations que je vois ailleurs
Assurez-vous que vos tableaux découpent la largeur de Word la plus courte.

Code (cela fonctionne exactement de la même manière si vous remplacez int[] par String dans les déclarations de paramètre:

/// <summary> /// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of /// integers, where each integer represents the code point of a character in the source string. /// Includes an optional threshhold which can be used to indicate the maximum allowable distance. /// </summary> /// <param name="source">An array of the code points of the first string</param> /// <param name="target">An array of the code points of the second string</param> /// <param name="threshold">Maximum allowable distance</param> /// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns> public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) { int length1 = source.Length; int length2 = target.Length; // Return trivial case - difference in string lengths exceeds threshhold if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; } // Ensure arrays [i] / length1 use shorter length if (length1 > length2) { Swap(ref target, ref source); Swap(ref length1, ref length2); } int maxi = length1; int maxj = length2; int[] dCurrent = new int[maxi + 1]; int[] dMinus1 = new int[maxi + 1]; int[] dMinus2 = new int[maxi + 1]; int[] dSwap; for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; } int jm1 = 0, im1 = 0, im2 = -1; for (int j = 1; j <= maxj; j++) { // Rotate dSwap = dMinus2; dMinus2 = dMinus1; dMinus1 = dCurrent; dCurrent = dSwap; // Initialize int minDistance = int.MaxValue; dCurrent[0] = j; im1 = 0; im2 = -1; for (int i = 1; i <= maxi; i++) { int cost = source[im1] == target[jm1] ? 0 : 1; int del = dCurrent[im1] + 1; int ins = dMinus1[i] + 1; int sub = dMinus1[im1] + cost; //Fastest execution for min value of 3 integers int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del); if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2]) min = Math.Min(min, dMinus2[im2] + cost); dCurrent[i] = min; if (min < minDistance) { minDistance = min; } im1++; im2++; } jm1++; if (minDistance > threshold) { return int.MaxValue; } } int result = dCurrent[maxi]; return (result > threshold) ? int.MaxValue : result; }

Où Swap est:

static void Swap<T>(ref T arg1,ref T arg2) { T temp = arg1; arg1 = arg2; arg2 = temp; }

Anastasiosyal · Answer

Il existe un grand nombre d'algorithmes de distance de similarité de chaîne pouvant être utilisés. Certains énumérés ici (mais pas exhaustivement sont énumérés):

Une bibliothèque contenant l’implémentation de toutes ces applications est appelée SimMetrics , Qui comporte à la fois des implémentations Java et c #.

Ivan Kochurkin · Answer

Voici mon implémentation de Damerau Levenshtein Distance, qui renvoie non seulement le coefficient de similarité, mais également les emplacements d'erreur dans Word corrigé (cette fonctionnalité peut être utilisée dans les éditeurs de texte). Aussi, mon implémentation supporte différents poids d'erreur (substitution, suppression, insertion, transposition).

public static List<Mistake> OptimalStringAlignmentDistance( string Word, string correctedWord, bool transposition = true, int substitutionCost = 1, int insertionCost = 1, int deletionCost = 1, int transpositionCost = 1) { int w_length = Word.Length; int cw_length = correctedWord.Length; var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1]; var result = new List<Mistake>(Math.Max(w_length, cw_length)); if (w_length == 0) { for (int i = 0; i < cw_length; i++) result.Add(new Mistake(i, CharMistakeType.Insertion)); return result; } for (int i = 0; i <= w_length; i++) d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None); for (int j = 0; j <= cw_length; j++) d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None); for (int i = 1; i <= w_length; i++) { for (int j = 1; j <= cw_length; j++) { bool equal = correctedWord[j - 1] == Word[i - 1]; int delCost = d[i - 1, j].Key + deletionCost; int insCost = d[i, j - 1].Key + insertionCost; int subCost = d[i - 1, j - 1].Key; if (!equal) subCost += substitutionCost; int transCost = int.MaxValue; if (transposition && i > 1 && j > 1 && Word[i - 1] == correctedWord[j - 2] && Word[i - 2] == correctedWord[j - 1]) { transCost = d[i - 2, j - 2].Key; if (!equal) transCost += transpositionCost; } int min = delCost; CharMistakeType mistakeType = CharMistakeType.Deletion; if (insCost < min) { min = insCost; mistakeType = CharMistakeType.Insertion; } if (subCost < min) { min = subCost; mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution; } if (transCost < min) { min = transCost; mistakeType = CharMistakeType.Transposition; } d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType); } } int w_ind = w_length; int cw_ind = cw_length; while (w_ind >= 0 && cw_ind >= 0) { switch (d[w_ind, cw_ind].Value) { case CharMistakeType.None: w_ind--; cw_ind--; break; case CharMistakeType.Substitution: result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution)); w_ind--; cw_ind--; break; case CharMistakeType.Deletion: result.Add(new Mistake(cw_ind, CharMistakeType.Deletion)); w_ind--; break; case CharMistakeType.Insertion: result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion)); cw_ind--; break; case CharMistakeType.Transposition: result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition)); w_ind -= 2; cw_ind -= 2; break; } } if (d[w_length, cw_length].Key > result.Count) { int delMistakesCount = d[w_length, cw_length].Key - result.Count; for (int i = 0; i < delMistakesCount; i++) result.Add(new Mistake(0, CharMistakeType.Deletion)); } result.Reverse(); return result; } public struct Mistake { public int Position; public CharMistakeType Type; public Mistake(int position, CharMistakeType type) { Position = position; Type = type; } public override string ToString() { return Position + ", " + Type; } } public enum CharMistakeType { None, Substitution, Insertion, Deletion, Transposition }

Ce code fait partie de mon projet: Yandex-Linguistics.NET .

J'ai écrit quelques tests et il me semble que cette méthode fonctionne.

Mais les commentaires et remarques sont les bienvenus.

Gordon Linoff · Answer

Voici une approche alternative:

C'est trop long pour un commentaire.

Une méthode typique pour trouver une similarité est la distance de Levenshtein, et il ne fait aucun doute qu’une bibliothèque avec code est disponible.

Malheureusement, cela nécessite une comparaison avec chaque chaîne. Vous pourrez peut-être écrire une version spécialisée du code pour court-circuiter le calcul si la distance est supérieure à un seuil. Vous devrez quand même effectuer toutes les comparaisons.

Une autre idée consiste à utiliser une variante des trigrammes ou des n-grammes. Ce sont des séquences de n caractères (ou n mots ou n séquences génomiques ou n quelconques). Conservez un mappage des trigrammes en chaînes et choisissez ceux qui ont le plus grand chevauchement. Un choix typique de n est "3", d'où son nom.

Par exemple, l'anglais aurait ces trigrammes:

Eng
ngl
gli
lis
ish

Et l'Angleterre aurait:

Eng
ngl
gla
lan
et

Eh bien, 2 sur 7 (ou 4 sur 10) correspondent. Si cela fonctionne pour vous, vous pouvez indexer la table trigram/string et obtenir une recherche plus rapide.

Vous pouvez également combiner cela avec Levenshtein pour réduire le nombre de comparaisons à celles qui ont un nombre minimal de n-grammes en commun.

GHC · Answer

Voici une implémentation de VB.net:

Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer Dim cost(v1.Length, v2.Length) As Integer If v1.Length = 0 Then Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2 ElseIf v2.Length = 0 Then Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1 Else 'setup the base costs for inserting the correct characters For v1Count As Integer = 0 To v1.Length cost(v1Count, 0) = v1Count Next v1Count For v2Count As Integer = 0 To v2.Length cost(0, v2Count) = v2Count Next v2Count 'now work out the cheapest route to having the correct characters For v1Count As Integer = 1 To v1.Length For v2Count As Integer = 1 To v2.Length 'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required) 'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1), 'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and cost(v1Count, v2Count) = Math.Min( cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1), Math.Min( cost(v1Count - 1, v2Count) + 1, cost(v1Count, v2Count - 1) + 1 ) ) Next v2Count Next v1Count 'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix 'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length) Return cost(v1.Length, v2.Length) End If End Function