pandas DataFrame str.contains() AND operation

Question

df (a pandas DataFrame) has three rows.

col_name
"Apple is delicious"
"banana is delicious"
"Apple and banana both are delicious"

df.col_name.str.contains("Apple|banana")

will match all of the rows:

"Apple is delicious", "banana is delicious", "Apple and banana both are delicious".

How can I apply an AND operator to the str.contains method, so that it only matches strings that contain both Apple and banana?

"Apple and banana both are delicious"

I would like to match strings that contain 10 to 20 different words (grape, watermelon, berry, orange, ..., etc.)

flyingmeatball · Accepted Answer

You can do it as follows:

df[(df['col_name'].str.contains('Apple')) & (df['col_name'].str.contains('banana'))]
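For reference, here is a self-contained sketch of the same approach. The sample frame reuses the rows from the question, plus one missing value that is not in the original. Passing `na=False` is an addition to the answer, made on the assumption that missing values should count as non-matches.

```python
import pandas as pd

df = pd.DataFrame({
    "col_name": [
        "Apple is delicious",
        "banana is delicious",
        "Apple and banana both are delicious",
        None,  # missing value, added here only to show the effect of na=False
    ]
})

# Each str.contains call returns a boolean Series; & combines them row by row.
has_apple = df["col_name"].str.contains("Apple", na=False)
has_banana = df["col_name"].str.contains("banana", na=False)

both = df[has_apple & has_banana]
print(both)
```

The parentheses around each condition in the answer matter when the calls are written inline, because `&` binds more tightly than comparison operators.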

Alexander · Answer

df = pd.DataFrame({'col': ["Apple is delicious",
                           "banana is delicious",
                           "Apple and banana both are delicious"]})
targets = ['Apple', 'banana']

# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool

Anzel · Answer

You can also do it with a regular expression:

df[df['col_name'].str.contains(r'^(?=.*Apple)(?=.*banana)')]

You can then build the regex string from your list of words like this:

base = r'^{}'
expr = '(?=.*{})'
words = ['Apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

which gives:

'^(?=.*Apple)(?=.*banana)(?=.*cat)'

That way you can build the pattern dynamically.
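As a minimal end-to-end sketch of the lookahead approach, the snippet below builds the pattern from a word list and applies it. The call to `re.escape` is an addition that is not in the answer; it assumes the words are meant as literal text, so that a word containing regex metacharacters such as `.` or `+` does not change the pattern's meaning.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "col_name": [
        "Apple is delicious",
        "banana is delicious",
        "Apple and banana both are delicious",
    ]
})

words = ["Apple", "banana"]

# One lookahead per word: each (?=.*word) must succeed from the start of the string.
pattern = "^" + "".join("(?=.*{})".format(re.escape(w)) for w in words)

matches = df[df["col_name"].str.contains(pattern)]
print(pattern)
print(matches)
```

Because every lookahead is anchored at the start of the string, the order of the words in the list does not affect which rows match.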

Charan Reddy · Answer

This works:

df.col.str.contains(r'(?=.*Apple)(?=.*banana)',regex=True)

Siraj S. · Answer

If you want to match at least two words in the sentence, maybe this will work (taking the tip from @Alexander):

target = ['Apple', 'banana', 'grapes', 'orange']
connector_list = ['and']
df[df.col.apply(lambda sentence: any(word in sentence for word in target) and all(connector in sentence for connector in connector_list))]

Output:

                                   col
2  Apple and banana both are delicious

If you have more than two words to match and they are separated by a comma ',', add it to connector_list and change the second condition from all to any:

df[df.col.apply(lambda sentence: any(word in sentence for word in target) and any(connector in sentence for connector in connector_list))]

Output:

                                         col
2        Apple and banana both are delicious
3  orange,banana and Apple all are delicious

Sergey Zakharov · Answer

If you only want to use native methods and avoid writing regular expressions, here is a vectorized version with no lambdas involved:

import numpy as np

targets = ['Apple', 'banana', 'strawberry']
# Build a list, not a generator: np.vstack expects a sequence of arrays.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]

pmaniyan · Answer

Try this regex:

Apple.*banana|banana.*Apple

The code is:

import pandas as pd

df = pd.DataFrame([[1, "Apple is delicious"],
                   [2, "banana is delicious"],
                   [3, "Apple and banana both are delicious"]],
                  columns=('ID', 'String_Col'))
print(df[df['String_Col'].str.contains(r'Apple.*banana|banana.*Apple')])

Output:

   ID                           String_Col
2   3  Apple and banana both are delicious

pault · Answer

Enumerating every possible ordering is cumbersome for large lists. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["Apple is delicious",
                           "banana is delicious",
                           "Apple and banana both are delicious",
                           "i love Apple, banana, and strawberry"]})
#                                     col
# 0                    Apple is delicious
# 1                   banana is delicious
# 2   Apple and banana both are delicious
# 3  i love Apple, banana, and strawberry

Suppose we want to search for all of the following:

targets = ['Apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # required on Python 3

print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])
#                                     col
# 3  i love Apple, banana, and strawberry
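For a list of 10 to 20 words, as in the question, the reduce pattern could be wrapped in a small helper. This is only a sketch: `contains_all` is a hypothetical name, `regex=False` assumes the targets are literal substrings, and the all-True starting mask is a choice made here so that an empty target list keeps every row.

```python
from functools import reduce
import operator

import pandas as pd


def contains_all(series, targets):
    """Return a boolean mask that is True where every target is a substring."""
    # regex=False treats each target as plain text (an assumption of this sketch).
    masks = (series.str.contains(t, regex=False) for t in targets)
    # Start from an all-True mask so an empty `targets` list selects every row.
    start = pd.Series(True, index=series.index)
    return reduce(operator.and_, masks, start)


df = pd.DataFrame({
    "col": [
        "Apple is delicious",
        "banana is delicious",
        "Apple and banana both are delicious",
        "i love Apple, banana, and strawberry",
    ]
})

mask = contains_all(df["col"], ["Apple", "banana", "strawberry"])
print(df[mask])
```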