Un meilleur algorithme de classement de similarité pour les chaînes de longueur variable

Question

Je recherche un algorithme de similarité de chaîne qui donne de meilleurs résultats sur les chaînes de longueur variable que celles habituellement suggérées (distance de levenshtein, soundex, etc.).

Par exemple,

Compte tenu de la chaîne A: "Robert",

Puis chaîne B: "Amy Robertson"

serait un meilleur match que

String C: "Richard"

En outre, cet algorithme doit de préférence être indépendant de la langue (fonctionne également dans des langues autres que l'anglais).

marzagao · Accepted Answer

Simon White de Catalysoft a écrit un article sur un algorithme très intelligent qui compare les paires de personnages adjacents et qui fonctionne très bien pour mes besoins:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon a une version de l'algorithme Java et ci-dessous, j'en ai écrit une version PL/Ruby (tirée de la version standard Ruby réalisée dans l'entrée correspondante du forum) commentaire de Mark Wong-VanHaren) afin que je puisse l’utiliser dans mes requêtes PostgreSQL:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar) RETURNS float8 AS ' str1.downcase! pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject { |pair| pair.include? " "} str2.downcase! pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject { |pair| pair.include? " "} union = pairs1.size + pairs2.size intersection = 0 pairs1.each do |p1| 0.upto(pairs2.size-1) do |i| if p1 == pairs2[i] intersection += 1 pairs2.slice!(i) break end end end (2.0 * intersection) / union ' LANGUAGE 'plruby';

Fonctionne comme un charme!

Michael La Voie · Answer

réponse de marzagao c'est génial. Je l'ai converti en C # alors j'ai pensé le poster ici:

Lien Pastebin

/// <summary> /// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// </summary> public class SimilarityTool { /// <summary> /// Compares the two strings based on letter pair matches /// </summary> /// <param name="str1"></param> /// <param name="str2"></param> /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns> public double CompareStrings(string str1, string str2) { List<string> pairs1 = WordLetterPairs(str1.ToUpper()); List<string> pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// <summary> /// Gets all letter pairs for each /// individual Word in the string /// </summary> /// <param name="str"></param> /// <returns></returns> private List<string> WordLetterPairs(string str) { List<string> AllPairs = new List<string>(); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each Word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// <summary> /// Generates an array containing every /// two consecutive letters in the input string /// </summary> /// <param name="str"></param> /// <returns></returns> private string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; } }

John Rutledge · Answer

Voici une autre version de marzagao , celle-ci écrite en Python:

def get_bigrams(string): """ Take a string and return a list of bigrams. """ s = string.lower() return [s[i:i+2] for i in list(range(len(s) - 1))] def string_similarity(str1, str2): """ Perform bigram comparison between two strings and return a percentage match in decimal form. """ pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) union = len(pairs1) + len(pairs2) hit_count = 0 for x in pairs1: for y in pairs2: if x == y: hit_count += 1 break return (2.0 * hit_count) / union if __== "__main__": """ Run a test using the example taken from: http://www.catalysoft.com/articles/StrikeAMatch.html """ w1 = 'Healed' words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold'] for w2 in words: print('Healed --- ' + w2) print(string_similarity(w1, w2)) print()

quantum · Answer

Une version plus courte de John Rutledge answer:

def get_bigrams(string): ''' Takes a string and returns a list of bigrams ''' s = string.lower() return {s[i:i+2] for i in xrange(len(s) - 1)} def string_similarity(str1, str2): ''' Perform bigram comparison between two strings and return a percentage match in decimal form ''' pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

Igal Alkon · Answer

Voici mon PHP implémentation de l'algorithme StrikeAMatch proposé, par Simon White. Les avantages (comme il est indiqué dans le lien) sont les suivants:

n reflet fidèle de la similitude lexicale - les chaînes avec de petites différences doivent être reconnues comme étant similaires. En particulier, un chevauchement important de la sous-chaîne doit indiquer un niveau élevé de similitude entre les chaînes.
ne robustesse aux changements d’ordre des mots - deux chaînes contenant les mêmes mots, mais dans un ordre différent, doivent être reconnues comme étant similaires. Par ailleurs, si une chaîne est simplement une anagramme aléatoire des caractères contenus dans l’autre, elle doit (généralement) être reconnue comme différente.
Indépendance de la langue - l'algorithme devrait fonctionner non seulement en anglais, mais dans de nombreuses langues.

<?php /** * LetterPairSimilarity algorithm implementation in PHP * @author Igal Alkon * @link http://www.catalysoft.com/articles/StrikeAMatch.html */ class LetterPairSimilarity { /** * @param $str * @return mixed */ private function wordLetterPairs($str) { $allPairs = array(); // Tokenize the string and put the tokens/words into an array $words = explode(' ', $str); // For each Word for ($w = 0; $w < count($words); $w++) { // Find the pairs of characters $pairsInWord = $this->letterPairs($words[$w]); for ($p = 0; $p < count($pairsInWord); $p++) { $allPairs[] = $pairsInWord[$p]; } } return $allPairs; } /** * @param $str * @return array */ private function letterPairs($str) { $numPairs = mb_strlen($str)-1; $pairs = array(); for ($i = 0; $i < $numPairs; $i++) { $pairs[$i] = mb_substr($str,$i,2); } return $pairs; } /** * @param $str1 * @param $str2 * @return float */ public function compareStrings($str1, $str2) { $pairs1 = $this->wordLetterPairs(strtoupper($str1)); $pairs2 = $this->wordLetterPairs(strtoupper($str2)); $intersection = 0; $union = count($pairs1) + count($pairs2); for ($i=0; $i < count($pairs1); $i++) { $pair1 = $pairs1[$i]; $pairs2 = array_values($pairs2); for($j = 0; $j < count($pairs2); $j++) { $pair2 = $pairs2[$j]; if ($pair1 === $pair2) { $intersection++; unset($pairs2[$j]); break; } } } return (2.0*$intersection)/$union; } }

bchatham · Answer

Cette discussion a été vraiment utile, merci. J'ai converti l'algorithme en VBA pour une utilisation avec Excel et écrit quelques versions d'une fonction de feuille de calcul, une pour la comparaison simple d'une paire de chaînes, l'autre pour comparer une chaîne à une plage/un tableau de chaînes. La version de strSimLookup renvoie la dernière meilleure correspondance sous forme de chaîne, d'index de tableau ou de mesure de similarité.

Cette implémentation produit les mêmes résultats que ceux indiqués dans l'exemple Amazon sur le site Web de Simon White, à quelques exceptions mineures près: les correspondances les moins marquantes; Je ne sais pas trop où la différence s'infiltre, pourrait être la fonction Split de VBA, mais je n'ai pas enquêté car cela fonctionne très bien pour mes besoins.

'Implements functions to rate how similar two strings are on 'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar) 'Source: http://www.catalysoft.com/articles/StrikeAMatch.html 'Author: Bob Chatham, bob.chatham at gmail.com '9/12/2010 Option Explicit Public Function stringSimilarity(str1 As String, str2 As String) As Variant 'Simple version of the algorithm that computes the similiarity metric 'between two strings. 'NOTE: This verision is not efficient to use if you're comparing one string 'with a range of other values as it will needlessly calculate the pairs for the 'first string over an over again; use the array-optimized version for this case. Dim sPairs1 As Collection Dim sPairs2 As Collection Set sPairs1 = New Collection Set sPairs2 = New Collection WordLetterPairs str1, sPairs1 WordLetterPairs str2, sPairs2 stringSimilarity = SimilarityMetric(sPairs1, sPairs2) Set sPairs1 = Nothing Set sPairs2 = Nothing End Function Public Function strSimA(str1 As Variant, rRng As Range) As Variant 'Return an array of string similarity indexes for str1 vs every string in input range rRng Dim sPairs1 As Collection Dim sPairs2 As Collection Dim arrOut As Variant Dim l As Long, j As Long Set sPairs1 = New Collection WordLetterPairs CStr(str1), sPairs1 l = rRng.Count ReDim arrOut(1 To l) For j = 1 To l Set sPairs2 = New Collection WordLetterPairs CStr(rRng(j)), sPairs2 arrOut(j) = SimilarityMetric(sPairs1, sPairs2) Set sPairs2 = Nothing Next j strSimA = Application.Transpose(arrOut) End Function Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant 'Return either the best match or the index of the best match 'depending on returnTYype parameter) between str1 and strings in rRng) ' returnType = 0 or omitted: returns the best matching string ' returnType = 1 : returns the index of the best matching string ' returnType = 2 : returns the similarity metric Dim sPairs1 As Collection Dim sPairs2 As Collection Dim metric, bestMetric As Double Dim i, iBest As Long Const RETURN_STRING As Integer = 0 Const RETURN_INDEX As Integer = 1 Const RETURN_METRIC As Integer = 2 If IsMissing(returnType) Then returnType = RETURN_STRING Set sPairs1 = New Collection WordLetterPairs CStr(str1), sPairs1 bestMetric = -1 iBest = -1 For i = 1 To rRng.Count Set sPairs2 = New Collection WordLetterPairs CStr(rRng(i)), sPairs2 metric = SimilarityMetric(sPairs1, sPairs2) If metric > bestMetric Then bestMetric = metric iBest = i End If Set sPairs2 = Nothing Next i If iBest = -1 Then strSimLookup = CVErr(xlErrValue) Exit Function End If Select Case returnType Case RETURN_STRING strSimLookup = CStr(rRng(iBest)) Case RETURN_INDEX strSimLookup = iBest Case Else strSimLookup = bestMetric End Select End Function Public Function strSim(str1 As String, str2 As String) As Variant Dim ilen, iLen1, ilen2 As Integer iLen1 = Len(str1) ilen2 = Len(str2) If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1 strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen)) End Function Sub WordLetterPairs(str As String, pairColl As Collection) 'Tokenize str into words, then add all letter pairs to pairColl Dim Words() As String Dim Word, nPairs, pair As Integer Words = Split(str) If UBound(Words) < 0 Then Set pairColl = Nothing Exit Sub End If For Word = 0 To UBound(Words) nPairs = Len(Words(Word)) - 1 If nPairs > 0 Then For pair = 1 To nPairs pairColl.Add Mid(Words(Word), pair, 2) Next pair End If Next Word End Sub Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant 'Helper function to calculate similarity metric given two collections of letter pairs. 'This function is designed to allow the pair collections to be set up separately as needed. 'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection 'if this is not the desired behavior. 'Also assumes that collections will be deallocated somewhere else Dim Intersect As Double Dim Union As Double Dim i, j As Long If sPairs1.Count = 0 Or sPairs2.Count = 0 Then SimilarityMetric = CVErr(xlErrNA) Exit Function End If Union = sPairs1.Count + sPairs2.Count Intersect = 0 For i = 1 To sPairs1.Count For j = 1 To sPairs2.Count If StrComp(sPairs1(i), sPairs2(j)) = 0 Then Intersect = Intersect + 1 sPairs2.Remove j Exit For End If Next j Next i SimilarityMetric = (2 * Intersect) / Union End Function

user4256874 · Answer

Je suis désolé, la réponse n'a pas été inventée par l'auteur. C'est un algorithme bien connu qui a été présenté pour la première fois par Digital Equipment Corporation et qui est souvent appelé bardage.

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

Andy · Answer

Une version plus rapide PHP de l'algorithme:

/** * * @param $str * @return mixed */ private static function wordLetterPairs ($str) { $allPairs = array(); // Tokenize the string and put the tokens/words into an array $words = explode(' ', $str); // For each Word for ($w = 0; $w < count($words); $w ++) { // Find the pairs of characters $pairsInWord = self::letterPairs($words[$w]); for ($p = 0; $p < count($pairsInWord); $p ++) { $allPairs[$pairsInWord[$p]] = $pairsInWord[$p]; } } return array_values($allPairs); } /** * * @param $str * @return array */ private static function letterPairs ($str) { $numPairs = mb_strlen($str) - 1; $pairs = array(); for ($i = 0; $i < $numPairs; $i ++) { $pairs[$i] = mb_substr($str, $i, 2); } return $pairs; } /** * * @param $str1 * @param $str2 * @return float */ public static function compareStrings ($str1, $str2) { $pairs1 = self::wordLetterPairs(mb_strtolower($str1)); $pairs2 = self::wordLetterPairs(mb_strtolower($str2)); $union = count($pairs1) + count($pairs2); $intersection = count(array_intersect($pairs1, $pairs2)); return (2.0 * $intersection) / $union; }

Pour les données dont je disposais (environ 2300 comparaisons), ma durée d'exécution était de 0,58 seconde avec une solution de Igal Alkon versus 0,35 seconde avec la mienne.

fabiolimace · Answer

J'ai traduit l'algorithme de Simon White en PL/pgSQL. Ceci est ma contribution.

<!-- language: lang-sql --> create or replace function spt1.letterpairs(in p_str varchar) returns varchar as $$ declare v_numpairs integer := length(p_str)-1; v_pairs varchar[]; begin for i in 1 .. v_numpairs loop v_pairs[i] := substr(p_str, i, 2); end loop; return v_pairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.wordletterpairs(in p_str varchar) returns varchar as $$ declare v_allpairs varchar[]; v_words varchar[]; v_pairsinword varchar[]; begin v_words := regexp_split_to_array(p_str, '[[:space:]]'); for i in 1 .. array_length(v_words, 1) loop v_pairsinword := spt1.letterpairs(v_words[i]); if v_pairsinword is not null then for j in 1 .. array_length(v_pairsinword, 1) loop v_allpairs := v_allpairs || v_pairsinword[j]; end loop; end if; end loop; return v_allpairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY) returns anyarray as $$ select array(select unnest($1) intersect select unnest($2)) $$ language 'sql'; --=================================================================== create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar) returns float as $$ declare v_pairs1 varchar[]; v_pairs2 varchar[]; v_intersection integer; v_union integer; begin v_pairs1 := wordletterpairs(upper(p_str1)); v_pairs2 := wordletterpairs(upper(p_str2)); v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1); return (2.0 * v_intersection / v_union); end; $$ language 'plpgsql';

Ignobilis · Answer

Une version en belle Scala:

 def pairDistance(s1: String, s2: String): Double = { def strToPairs(s: String, acc: List[String]): List[String] = { if (s.size < 2) acc else strToPairs(s.drop(1), if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2))) } val lst1 = strToPairs(s1.toUpperCase, List()) val lst2 = strToPairs(s2.toUpperCase, List()) (2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size) }

Anonymous · Answer

métriques de similarité des chaînes contient un aperçu de nombreuses métriques différentes utilisées dans la comparaison de chaînes ( Wikipedia a également un aperçu). Une grande partie de ces métriques est implémentée dans une bibliothèque simmetrics .

Encore un autre exemple de métrique, non inclus dans la vue d'ensemble donnée est par exemple distance de compression (tentative d'approximation de complexité de Kolmogorov ), qui peut être utilisé pour des textes un peu plus longs que celui que vous avez présenté.

Vous pourriez également envisager de vous pencher sur un sujet beaucoup plus vaste: Traitement du langage naturel . Ces R packages peuvent vous aider à démarrer rapidement (ou du moins à donner quelques idées).

Et une dernière modification - recherchez les autres questions sur ce sujet à SO, il y en a plusieurs qui sont liées.

Rainer · Answer

Voici la version R:

get_bigrams <- function(str) { lstr = tolower(str) bigramlst = list() for(i in 1:(nchar(str)-1)) { bigramlst[[i]] = substr(str, i, i+1) } return(bigramlst) } str_similarity <- function(str1, str2) { pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) unionlen = length(pairs1) + length(pairs2) hit_count = 0 for(x in 1:length(pairs1)){ for(y in 1:length(pairs2)){ if (pairs1[[x]] == pairs2[[y]]) hit_count = hit_count + 1 } } return ((2.0 * hit_count) / unionlen) }

scripts · Answer

Affichage réponse de marzagao en C99, inspiré par ceux-ci algorithmes

double dice_match(const char *string1, const char *string2) { //check fast cases if (((string1 != NULL) && (string1[0] == '\0')) || ((string2 != NULL) && (string2[0] == '\0'))) { return 0; } if (string1 == string2) { return 1; } size_t strlen1 = strlen(string1); size_t strlen2 = strlen(string2); if (strlen1 < 2 || strlen2 < 2) { return 0; } size_t length1 = strlen1 - 1; size_t length2 = strlen2 - 1; double matches = 0; int i = 0, j = 0; //get bigrams and compare while (i < length1 && j < length2) { char a[3] = {string1[i], string1[i + 1], '\0'}; char b[3] = {string2[j], string2[j + 1], '\0'}; int cmp = strcmpi(a, b); if (cmp == 0) { matches += 2; } i++; j++; } return matches / (length1 + length2); }

Quelques tests basés sur le article original :

#include <stdio.h> void article_test1() { char *string1 = "FRANCE"; char *string2 = "FRENCH"; printf("====%s====
", __func__); printf("%2.f%% == 40%%
", dice_match(string1, string2) * 100); } void article_test2() { printf("====%s====
", __func__); char *string = "Healed"; char *ss[] = {"Heard", "Healthy", "Help", "Herded", "Sealed", "Sold"}; int correct[] = {44, 55, 25, 40, 80, 0}; for (int i = 0; i < 6; ++i) { printf("%2.f%% == %d%%
", dice_match(string, ss[i]) * 100, correct[i]); } } void multicase_test() { char *string1 = "FRaNcE"; char *string2 = "fREnCh"; printf("====%s====
", __func__); printf("%2.f%% == 40%%
", dice_match(string1, string2) * 100); } void gg_test() { char *string1 = "GG"; char *string2 = "GGGGG"; printf("====%s====
", __func__); printf("%2.f%% != 100%%
", dice_match(string1, string2) * 100); } int main() { article_test1(); article_test2(); multicase_test(); gg_test(); return 0; }

gruppler · Answer

Mon implémentation JavaScript prend une chaîne ou un tableau de chaînes, ainsi qu'un étage optionnel (l'étage par défaut est 0,5). Si vous lui transmettez une chaîne, le résultat sera vrai ou faux, que le score de similarité de la chaîne soit supérieur ou égal au sol. Si vous lui passez un tableau de chaînes, il retournera un tableau de ces chaînes dont le score de similarité est supérieur ou égal à l'étage, trié par score.

Exemples:

'Healed'.fuzzy('Sealed'); // returns true 'Healed'.fuzzy('Help'); // returns false 'Healed'.fuzzy('Help', 0.25); // returns true 'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']); // returns ["Sealed", "Healthy"] 'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0); // returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]

C'est ici:

(function(){ var default_floor = 0.5; function pairs(str){ var pairs = [] , length = str.length - 1 , pair; str = str.toLowerCase(); for(var i = 0; i < length; i++){ pair = str.substr(i, 2); if(!/\s/.test(pair)){ pairs.Push(pair); } } return pairs; } function similarity(pairs1, pairs2){ var union = pairs1.length + pairs2.length , hits = 0; for(var i = 0; i < pairs1.length; i++){ for(var j = 0; j < pairs1.length; j++){ if(pairs1[i] == pairs2[j]){ pairs2.splice(j--, 1); hits++; break; } } } return 2*hits/union || 0; } String.prototype.fuzzy = function(strings, floor){ var str1 = this , pairs1 = pairs(this); floor = typeof floor == 'number' ? floor : default_floor; if(typeof(strings) == 'string'){ return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase(); }else if(strings instanceof Array){ var scores = {}; strings.map(function(str2){ scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase()); }); return strings.filter(function(str){ return scores[str] >= floor; }).sort(function(a, b){ return scores[b] - scores[a]; }); } }; })();

Et voici une version simplifiée pour votre commodité:

(function(){function g(a){var b=[],e=a.length-1,d;a=a.toLowerCase();for(var c=0;c<e;c++)d=a.substr(c,2),/\s/.test(d)||b.Push(d);return b}function h(a,b){for(var e=a.length+b.length,d=0,c=0;c<a.length;c++)for(var f=0;f<a.length;f++)if(a[c]==b[f]){b.splice(f--,1);d++;break}return 2*d/e||0}String.prototype.fuzzy=function(a,b){var e=this,d=g(this);b="number"==typeof b?b:0.5;if("string"==typeof a)return 1<e.length&&1<a.length&&h(d,g(a))>=b||e.toLowerCase()==a.toLowerCase();if(a instanceof Array){var c={};a.map(function(a){c[a]=1<e.length?h(d,g(a)):1*(e.toLowerCase()==a.toLowerCase())});return a.filter(function(a){return c[a]>=b}).sort(function(a,b){return c[b]-c[a]})}}})();

Frank Rundatz · Answer

En me basant sur l’impressionnante version C # de Michael La Voie, selon la demande visant à en faire une méthode d’extension, voici ce que j’ai proposé. Le principal avantage de cette méthode est que vous pouvez trier une liste générique en fonction du pourcentage de correspondance. Par exemple, considérons que vous avez un champ de chaîne nommé "Ville" dans votre objet. Un utilisateur recherche "Chester" et vous souhaitez renvoyer les résultats dans l'ordre décroissant de correspondance. Par exemple, vous voulez que les correspondances littérales de Chester apparaissent avant Rochester. Pour ce faire, ajoutez deux nouvelles propriétés à votre objet:

 public string SearchText { get; set; } public double PercentMatch { get { return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper()); } }

Ensuite, sur chaque objet, définissez le SearchText sur ce que l'utilisateur a recherché. Ensuite, vous pouvez trier facilement avec quelque chose comme:

 zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);

Voici la légère modification pour en faire une méthode d'extension:

 /// <summary> /// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// </summary> public static double PercentMatchTo(this string str1, string str2) { List<string> pairs1 = WordLetterPairs(str1.ToUpper()); List<string> pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// <summary> /// Gets all letter pairs for each /// individual Word in the string /// </summary> /// <param name="str"></param> /// <returns></returns> private static List<string> WordLetterPairs(string str) { List<string> AllPairs = new List<string>(); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each Word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// <summary> /// Generates an array containing every /// two consecutive letters in the input string /// </summary> /// <param name="str"></param> /// <returns></returns> private static string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; }

matthewpalmer · Answer

Une version de Haskell - n'hésitez pas à suggérer des modifications car je n'ai pas fait beaucoup de Haskell.

import Data.Char import Data.List -- Convert a string into words, then get the pairs of words from that phrase wordLetterPairs :: String -> [String] wordLetterPairs s1 = concat $ map pairs $ words s1 -- Converts a String into a list of letter pairs. pairs :: String -> [String] pairs [] = [] pairs (x:[]) = [] pairs (x:ys) = [x, head ys]:(pairs ys) -- Calculates the match rating for two strings matchRating :: String -> String -> Double matchRating s1 s2 = (numberOfMatches * 2) / totalLength where pairsS1 = wordLetterPairs $ map toLower s1 pairsS2 = wordLetterPairs $ map toLower s2 numberOfMatches = fromIntegral $ length $ pairsS1 `intersect` pairsS2 totalLength = fromIntegral $ length pairsS1 + length pairsS2

Shaun Lebron · Answer

Clojure:

(require '[clojure.set :refer [intersection]]) (defn bigrams [s] (->> (split s #"\s+") (mapcat #(partition 2 1 %)) (set))) (defn string-similarity [a b] (let [a-pairs (bigrams a) b-pairs (bigrams b) total-count (+ (count a-pairs) (count b-pairs)) match-count (count (intersection a-pairs b-pairs)) similarity (/ (* 2 match-count) total-count)] similarity))

s01ipsist · Answer

L'algorithme du coefficient Dice (réponse de Simon White/marzagao) est implémenté dans Ruby dans la méthode pair_distance_similar dans la gem amatch

https://github.com/flori/amatch

Cette gemme contient également des implémentations d'un certain nombre d'algorithmes d'appariement et de comparaison de chaînes approximatifs: distance d'édition Levenshtein, distance d'édition Sellers, distance de Hamming, longueur de sous-séquence commune la plus longue, longueur de sous-chaîne commune la plus longue, métrique de distance de paire, métrique de Jaro-Winkler .

Engr. Hasanuzzaman Sumon · Answer

Je recherchais une pure mise en œuvre de Ruby de l'algorithme indiqué par la réponse de @ marzagao. Malheureusement, le lien indiqué par @marzagao est cassé. Dans la réponse @ s01ipsist, il a indiqué que Ruby gem amatch où la mise en œuvre n'est pas purement Ruby. J'ai donc cherché un peu et trouvé une gemme fuzzy_match qui a une implémentation pure Ruby (bien que cette gem utilise amatch) à ici . J'espère que cela aidera quelqu'un comme moi.

tehvan · Answer

Qu'en est-il de la distance de Levenshtein, divisée par la longueur de la première chaîne (ou alternativement divisée par ma longueur min/max/moyenne des deux chaînes)? Cela a fonctionné pour moi jusqu'à présent.

user779420 · Answer

Hé les gars, j'ai essayé cela en javascript, mais je suis novice, tout le monde sait comment faire plus vite?

function get_bigrams(string) { // Takes a string and returns a list of bigrams var s = string.toLowerCase(); var v = new Array(s.length-1); for (i = 0; i< v.length; i++){ v[i] =s.slice(i,i+2); } return v; } function string_similarity(str1, str2){ /* Perform bigram comparison between two strings and return a percentage match in decimal form */ var pairs1 = get_bigrams(str1); var pairs2 = get_bigrams(str2); var union = pairs1.length + pairs2.length; var hit_count = 0; for (x in pairs1){ for (y in pairs2){ if (pairs1[x] == pairs2[y]){ hit_count++; } } } return ((2.0 * hit_count) / union); } var w1 = 'Healed'; var Word =['Heard','Healthy','Help','Herded','Sealed','Sold'] for (w2 in Word){ console.log('Healed --- ' + Word[w2]) console.log(string_similarity(w1,Word[w2])); }

chema989 · Answer

Voici une autre version de Similarity basée dans l’index de Sørensen – Dice (réponse de marzagao), celle-ci écrite en C++ 11:

/* * Similarity based in Sørensen–Dice index. * * Returns the Similarity between _str1 and _str2. */ double similarity_sorensen_dice(const std::string& _str1, const std::string& _str2) { // Base case: if some string is empty. if (_str1.empty() || _str2.empty()) { return 1.0; } auto str1 = upper_string(_str1); auto str2 = upper_string(_str2); // Base case: if the strings are equals. if (str1 == str2) { return 0.0; } // Base case: if some string does not have bigrams. if (str1.size() < 2 || str2.size() < 2) { return 1.0; } // Extract bigrams from str1 auto num_pairs1 = str1.size() - 1; std::unordered_set<std::string> str1_bigrams; str1_bigrams.reserve(num_pairs1); for (unsigned i = 0; i < num_pairs1; ++i) { str1_bigrams.insert(str1.substr(i, 2)); } // Extract bigrams from str2 auto num_pairs2 = str2.size() - 1; std::unordered_set<std::string> str2_bigrams; str2_bigrams.reserve(num_pairs2); for (unsigned int i = 0; i < num_pairs2; ++i) { str2_bigrams.insert(str2.substr(i, 2)); } // Find the intersection between the two sets. int intersection = 0; if (str1_bigrams.size() < str2_bigrams.size()) { const auto it_e = str2_bigrams.end(); for (const auto& bigram : str1_bigrams) { intersection += str2_bigrams.find(bigram) != it_e; } } else { const auto it_e = str1_bigrams.end(); for (const auto& bigram : str2_bigrams) { intersection += str1_bigrams.find(bigram) != it_e; } } // Returns similarity coefficient. return (2.0 * intersection) / (num_pairs1 + num_pairs2); }