How to remove special characters except spaces from a file in Python?

Question

I have a huge corpus of text (line by line) and I want to remove the special characters while preserving the spaces and the structure of the string.

hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.

should become

hello there A Z R T world welcome to python
this should be the next line followed by another million like this

Chiheb Nexus · Accepted Answer

You can also use this pattern with regex:

import re

a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

# Replace each run of non-alphanumeric characters with a single space
for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)

Output:

hello there A Z R T world welcome to python
this should the next line followed by an other million like this

Edit:

Otherwise, you can store the final lines in a list:

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)

Output:

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
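Since the question is about a large corpus stored in a file, the same substitution can be applied while streaming the file line by line instead of loading everything into memory. The following is only a sketch under assumptions: the names `clean_lines` and `clean_file`, the file paths, and the UTF-8 encoding are hypothetical and not part of the original answer. It also strips the trailing space that the plain `re.sub` call leaves behind.

```python
import re

# Precompiled pattern: any run of characters that are not ASCII letters or digits
_NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")


def clean_lines(lines):
    """Yield each line with special characters replaced by single spaces."""
    for line in lines:
        # strip() drops the leading/trailing space left by the substitution
        yield _NON_ALNUM.sub(" ", line).strip()


def clean_file(src_path, dst_path):
    """Stream src_path line by line and write the cleaned lines to dst_path.

    Both paths and the UTF-8 encoding are assumptions for illustration.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for cleaned in clean_lines(src):
            dst.write(cleaned + "\n")
```

Because `clean_lines` is a generator, only one line is held in memory at a time, which may matter for a very large corpus.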

Eliethesaiyan · Answer

I think nfn neil's answer is great... but I would just add a simple regex that removes all non-word characters. Note, however, that it treats the underscore as part of a word:

import re
# `string` holds the text to clean
print(re.sub(r'\W+', ' ', string))
# Output: hello there A Z R_T world welcome to python
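If the underscore should be removed as well, one possible variation is to add `_` to the character class alongside `\W`. This is a sketch rather than part of the answer above; the function name `remove_non_word` is hypothetical. In Python 3, `\W` is Unicode-aware for `str` patterns, so accented letters are kept.

```python
import re


def remove_non_word(text):
    """Replace runs of non-word characters and underscores with one space."""
    # [\W_] matches anything that is not a letter or digit, plus the underscore
    return re.sub(r"[\W_]+", " ", text).strip()


print(remove_non_word("hello? there A-Z-R_T(,**), world"))
```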

wwii · Answer

Create a dictionary mapping the special characters to None:

d = {c:None for c in special_characters}

Create a translation table using the dictionary. Read the entire text into a variable and use str.translate on the whole text.
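The steps above could look roughly like the sketch below. The answer does not define `special_characters`, so using `string.punctuation` for it is an assumption, as is the helper name `strip_special`. Note that mapping a character to None deletes it outright, so `A-Z` becomes `AZ` rather than `A Z`; mapping to `' '` instead would keep a separator.

```python
import string

# Assumption: treat ASCII punctuation as the set of special characters
special_characters = string.punctuation

# Dictionary mapping each special character to None (meaning "delete it")
d = {c: None for c in special_characters}

# str.maketrans converts the single-character keys into the ordinals
# that str.translate expects
table = str.maketrans(d)


def strip_special(text):
    """Delete every special character from text, leaving spaces untouched."""
    return text.translate(table)


print(strip_special("hello? there, world."))
```

For a file, the whole text can be read with `open(path).read()` and passed to `strip_special` in a single call, where `path` stands for the corpus file.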