Find common substring between two strings

Question

I'd like to compare 2 strings and keep the part that matches, splitting off where the comparison fails.

So if I have 2 strings:

    string1 = apples
    string2 = appleses
    answer = apples

Another example, as the string could have more than one word.

    string1 = Apple pie available
    string2 = Apple pies
    answer = Apple pie

I'm sure there is a simple way of doing this in Python but I can't work it out. Any help and explanation appreciated.

thefourtheye · Accepted Answer

This is called the Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

    def longestSubstringFinder(string1, string2):
        answer = ""
        len1, len2 = len(string1), len(string2)
        for i in range(len1):
            match = ""
            for j in range(len2):
                # compare string2 against string1 shifted by i characters
                if (i + j < len1 and string1[i + j] == string2[j]):
                    match += string2[j]
                else:
                    if (len(match) > len(answer)): answer = match
                    match = ""
            # also record a match that runs to the very end of string2
            if (len(match) > len(answer)): answer = match
        return answer

    print(longestSubstringFinder("Apple pie available", "Apple pies"))
    print(longestSubstringFinder("apples", "appleses"))
    print(longestSubstringFinder("bapples", "cappleses"))

Output

    Apple pie
    apples
    apples
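For larger inputs, the usual alternative is dynamic programming over a table of common-suffix lengths, which takes time proportional to len(s1) * len(s2). The following is a minimal sketch and not part of the answer above; the function name `longest_common_substring_dp` is made up for illustration.

```python
def longest_common_substring_dp(s1, s2):
    """Return one longest common substring of s1 and s2 (the first found in s1)."""
    # prev[j] holds the length of the common run ending at s1[i-2] and s2[j-1],
    # i.e. the previous row of the table.
    prev = [0] * (len(s2) + 1)
    best_len = 0
    best_end = 0  # index in s1 just past the end of the best match
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # extend the run that ended one character earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return s1[best_end - best_len:best_end]


print(longest_common_substring_dp("Apple pie available", "Apple pies"))
print(longest_common_substring_dp("Apple pies", "dd Apple pie available"))
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of s2.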

RickardSjogren · Answer

For completeness, difflib in the standard library provides loads of sequence-comparison utilities. For instance find_longest_match, which finds the longest common substring when used on strings. Example use:

    from difflib import SequenceMatcher

    string1 = "Apple pie available"
    string2 = "come have some Apple pies"

    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

    print(match)  # -> Match(a=0, b=15, size=9)
    print(string1[match.a: match.a + match.size])  # -> Apple pie
    print(string2[match.b: match.b + match.size])  # -> Apple pie
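A small wrapper makes this reusable. This is only a sketch: the helper name `longest_common_substring` is hypothetical, and passing `autojunk=False` is an assumption on my part that exact matching is wanted, since SequenceMatcher's automatic junk heuristic can otherwise ignore very frequent characters once the second sequence is 200 items or longer.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest block that a and b have in common, via difflib."""
    # autojunk=False switches off the "popular element" heuristic
    # (assumption: exact matching is wanted even on long inputs).
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


print(longest_common_substring("Apple pie available", "come have some Apple pies"))
print(longest_common_substring("apples", "appleses"))
```

When nothing matches, find_longest_match reports a block of size 0, so the helper returns an empty string.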

Eric · Answer

    def common_start(sa, sb):
        """ returns the longest common substring from the beginning of sa and sb """
        def _iter():
            for a, b in zip(sa, sb):
                if a == b:
                    yield a
                else:
                    return

        return ''.join(_iter())

    >>> common_start("Apple pie available", "Apple pies")
    'Apple pie'

Or a slightly stranger way:

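    # Note: this relies on StopIteration escaping from a generator expression.
    # Since PEP 479 (the default from Python 3.7) that raises RuntimeError instead,
    # so this variant only works on older interpreters.
    def stop_iter():
        """An easy way to break out of a generator"""
        raise StopIteration

    def common_start(sa, sb):
        return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))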

Which might be more readable as

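    # Note: like any StopIteration raised inside a generator expression, this
    # raises RuntimeError from Python 3.7 onwards (PEP 479).
    def terminating(cond):
        """An easy way to break out of a generator"""
        if cond:
            return True
        raise StopIteration

    def common_start(sa, sb):
        return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))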

jonas · Answer

One might also consider os.path.commonprefix, which works on characters and can therefore be used for any strings.

    import os
    common = os.path.commonprefix(['Apple pie available', 'Apple pies'])
    assert common == 'Apple pie'
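Two properties are worth checking before relying on this. The sketch below, using example words and paths chosen purely for illustration, shows that `os.path.commonprefix` accepts more than two strings, and that because it compares character by character the result is not necessarily a valid path.

```python
import os

# Works with any number of strings.
words = ["interspecies", "interstellar", "interstate"]
print(os.path.commonprefix(words))  # -> inters

# Character-wise comparison: the prefix can stop in the middle of a path component.
paths = ["/usr/lib", "/usr/local/lib"]
print(os.path.commonprefix(paths))  # -> /usr/l

# An empty list gives an empty string.
print(repr(os.path.commonprefix([])))  # -> ''
```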

SergeyR · Answer

Same as Evo's, but with an arbitrary number of strings to compare:

    def common_start(*strings):
        """ Returns the longest common substring
            from the beginning of the `strings`
        """
        def _iter():
            for z in zip(*strings):
                if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                    yield z[0]
                else:
                    return

        return ''.join(_iter())
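The all-equal test can also be written with a set. Here is a minimal sketch of the same idea under a different, made-up name (`common_start_set`), using `itertools.takewhile` so that it stops at the first position where the strings disagree:

```python
from itertools import takewhile


def common_start_set(*strings):
    """Longest common prefix of any number of strings."""
    # zip(*strings) yields one tuple of characters per position; a column
    # whose characters are all identical collapses to a one-element set.
    columns = takewhile(lambda column: len(set(column)) == 1, zip(*strings))
    return ''.join(column[0] for column in columns)


print(common_start_set("Apple pie available", "Apple pies", "Apple pie"))
print(repr(common_start_set("abc", "xyz")))
```

Because zip stops at the shortest input, the result can never be longer than the shortest string.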

user7733798 · Answer

Bug fixes for the first answer:

    def longestSubstringFinder(string1, string2):
        answer = ""
        len1, len2 = len(string1), len(string2)
        for i in range(len1):
            for j in range(len2):
                # try every pair of start positions (i in string1, j in string2)
                lcs_temp = 0
                match = ''
                while ((i+lcs_temp < len1) and (j+lcs_temp < len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                    match += string2[j+lcs_temp]
                    lcs_temp += 1
                if (len(match) > len(answer)):
                    answer = match
        return answer

    print(longestSubstringFinder("dd Apple pie available", "Apple pies"))
    print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000"))
    print(longestSubstringFinder("bapples", "cappleses"))
    print(longestSubstringFinder("apples", "apples"))

Birei · Answer

Try:

    import itertools as it
    ''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

Rali Tsanova · Answer

This isn't the most efficient way to do it, but it's what I could come up with and it works. If anyone can improve it, please do. It builds a matrix and puts a 1 wherever the characters match. Then it scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Finally it returns the substring of the input string, using the start and end positions as slice bounds.

Note: This only finds one longest common substring. If there is more than one, you could create an array to store the results and return that. Also, it is case sensitive, so (Apple pie, apple pie) will return pple pie.

    import numpy

    def longestSubstringFinder(str1, str2):
        answer = ""

        if len(str1) == len(str2):
            if str1 == str2:
                return str1
            else:
                longer = str1
                shorter = str2
        elif (len(str1) == 0 or len(str2) == 0):
            return ""
        elif len(str1) > len(str2):
            longer = str1
            shorter = str2
        else:
            longer = str2
            shorter = str1

        # matrix[i][j] is 1 where shorter[i] matches longer[j]
        matrix = numpy.zeros((len(shorter), len(longer)))
        for i in range(len(shorter)):
            for j in range(len(longer)):
                if shorter[i] == longer[j]:
                    matrix[i][j] = 1

        longest = 0
        start = [-1, -1]
        end = [-1, -1]
        for i in range(len(shorter)-1, -1, -1):
            for j in range(len(longer)):
                count = 0
                begin = [i, j]
                # walk down the diagonal with separate indices so that
                # the loop variables i and j are left untouched
                r, c = i, j
                while matrix[r][c] == 1:
                    finish = [r, c]
                    count = count + 1
                    if c == len(longer)-1 or r == len(shorter)-1:
                        break
                    r = r + 1
                    c = c + 1
                if count > longest:
                    longest = count
                    start = begin
                    end = finish

        answer = shorter[int(start[0]): int(end[0])+1]
        return answer
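The two limitations mentioned in the note can be lifted with a small variation. The following sketch is not the code above: it uses a plain-list table of run lengths instead of a numpy matrix, collects every longest common substring, and takes a hypothetical `ignore_case` flag. The function name is also made up for illustration.

```python
def all_longest_common_substrings(str1, str2, ignore_case=False):
    """Return the set of all longest common substrings, as they appear in str1."""
    if ignore_case:
        same = lambda x, y: x.lower() == y.lower()
    else:
        same = lambda x, y: x == y

    # lengths[i][j] = length of the common run ending at str1[i-1] and str2[j-1]
    lengths = [[0] * (len(str2) + 1) for _ in range(len(str1) + 1)]
    longest = 0
    results = set()
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if same(str1[i - 1], str2[j - 1]):
                lengths[i][j] = lengths[i - 1][j - 1] + 1
                if lengths[i][j] > longest:
                    longest = lengths[i][j]
                    results = set()  # a longer run was found: drop shorter ones
                if lengths[i][j] == longest:
                    results.add(str1[i - longest:i])
    return results


print(all_longest_common_substrings("abcxyz", "xyzabc"))
print(all_longest_common_substrings("Apple pie", "apple pie", ignore_case=True))
```

Returning a set means duplicates are merged; an empty set signals that the strings share no characters at all.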

radhikesh93 · Answer

    def matchingString(x, y):
        match = ''
        for i in range(0, len(x)):
            for j in range(0, len(y)):
                k = 1
                # keep growing the substring while it still fits in both x and y
                # and the two slices are equal
                while (i+k <= len(x) and j+k <= len(y) and x[i:i+k] == y[j:j+k]):
                    if len(match) <= len(x[i:i+k]):
                        match = x[i:i+k]
                    k = k+1
        return match

    print(matchingString('Apple', 'ale'))  # le
    print(matchingString('Apple pie available', 'Apple pies'))  # Apple pie

user3838498 · Answer

    def LongestSubString(s1, s2):
        left = 0
        right = len(s2)
        while(left < right):
            if(s2[left] not in s1):
                left = left+1
            else:
                if(s2[left:right] not in s1):
                    right = right - 1
                else:
                    return(s2[left:right])

    s1 = "pineapple"
    s2 = "applc"
    print(LongestSubString(s1, s2))

Bantu Manjunath · Answer

This is the classroom problem called "longest sequence finder". Here is some simple code that worked for me. My inputs are lists holding a sequence, which can also be the characters of a string.

    def longest_substring(list1, list2):
        both = []
        if len(list1) > len(list2):
            small = list2
            big = list1
        else:
            small = list1
            big = list2
        removes = 0
        stop = 0
        for i in small:
            for j in big:
                if i != j:
                    removes += 1
                    if stop == 1:
                        break
                elif i == j:
                    both.append(i)
                    # note: this removes items from the caller's list in place
                    for q in range(removes+1):
                        big.pop(0)
                    stop = 1
                    break
            removes = 0
        return both

xXDaveXx · Answer

Returns the first longest common substring:

    def compareTwoStrings(string1, string2):
        list1 = list(string1)
        list2 = list(string2)

        match = []
        output = ""
        length = 0

        for i in range(0, len(list1)):
            if list1[i] in list2:
                match.append(list1[i])
                # len(list1) + 1 so that slices can reach the last character
                for j in range(i + 1, len(list1) + 1):
                    if ''.join(list1[i:j]) in string2:
                        match.append(''.join(list1[i:j]))
                    else:
                        continue
            else:
                continue

        # keep the first of the longest candidates
        for string in match:
            if length < len(list(string)):
                length = len(list(string))
                output = string
            else:
                continue

        return output

wwii · Answer

First, a helper function adapted from the itertools pairwise recipe to produce substrings.

    import itertools

    def n_wise(iterable, n = 2):
        '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

        n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
        a = itertools.tee(iterable, n)
        # advance the k-th copy by k items so the copies are staggered
        for x, thing in enumerate(a[1:]):
            for _ in range(x+1):
                next(thing, None)
        return zip(*a)

Then a function that iterates over substrings, longest first, and tests for membership. (Efficiency not considered.)

    def foo(s1, s2):
        '''Finds the longest matching substring
        '''
        # the longest matching substring can only be as long as the shortest string
        # which string is shortest?
        shortest, longest = sorted([s1, s2], key = len)
        # iterate over substrings, longest substrings first
        # (from the full length down to 1, so short matches are found too)
        for n in range(len(shortest), 0, -1):
            for sub in n_wise(shortest, n):
                sub = ''.join(sub)
                if sub in longest:
                    # return the first one found, it should be the longest
                    return sub

    s = "fdomainster"
    t = "exdomainid"
    print(foo(s, t))

    >>>
    domain
    >>>