How do I convert a file to UTF-8 in Python?

Question

I need to convert a bunch of files to UTF-8 in Python, and I'm having trouble with the "converting the file" part.

I'd like to do the equivalent of:

    iconv -t utf-8 $file > converted/$file # this is Shell code

Thanks!

DzinX · Accepted Answer

You can use the codecs module, like this:

    import codecs

    BLOCKSIZE = 1048576 # or some other, desired size in bytes

    with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
        with codecs.open(targetFileName, "w", "utf-8") as targetFile:
            while True:
                contents = sourceFile.read(BLOCKSIZE)
                if not contents:
                    break
                targetFile.write(contents)

EDIT: added the BLOCKSIZE parameter to control the size of each chunk read from the file.
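For Python 3, the same chunked loop can be written with the built-in open(), which accepts an encoding argument directly. This is only a minimal sketch: the function name convert_to_utf8 and its parameters are made up for illustration, and the caller is assumed to know the source encoding.

```python
BLOCKSIZE = 1048576  # characters per read; tune as needed


def convert_to_utf8(source_path, target_path, source_encoding, blocksize=BLOCKSIZE):
    """Re-encode source_path (in source_encoding) into target_path as UTF-8."""
    # newline="" turns off newline translation, so line endings pass through unchanged.
    with open(source_path, "r", encoding=source_encoding, newline="") as source_file, \
         open(target_path, "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(blocksize)
            if not contents:
                break
            target_file.write(contents)
```

A call such as convert_to_utf8("in.txt", "converted/in.txt", "iso-8859-1") mirrors the iconv command from the question, provided the converted directory already exists.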

Staale · Answer

This worked for me in a small test:

    # Python 2 only: unicode() does not exist in Python 3
    sourceEncoding = "iso-8859-1"
    targetEncoding = "utf-8"
    source = open("source")
    target = open("target", "w")

    target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
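On Python 3 the same read-everything approach can be expressed with bytes.decode and str.encode. A rough sketch, assuming the file fits comfortably in memory; the function name reencode_small_file is hypothetical:

```python
from pathlib import Path


def reencode_small_file(source, target, source_encoding="iso-8859-1",
                        target_encoding="utf-8"):
    """Read the whole file, decode it, and write it back in the target encoding."""
    raw = Path(source).read_bytes()
    # Working on bytes avoids any newline translation by the text layer.
    Path(target).write_bytes(raw.decode(source_encoding).encode(target_encoding))
```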

Sébastien RoccaSerra · Answer

Thanks for the answers, it works!

And since the source files are in mixed formats, I added a list of source formats to try in sequence (sourceFormats); on a UnicodeDecodeError, I try the next format:

    from __future__ import with_statement

    import os
    import sys
    import codecs
    from chardet.universaldetector import UniversalDetector

    targetFormat = 'utf-8'
    outputDir = 'converted'
    detector = UniversalDetector()

    def get_encoding_type(current_file):
        detector.reset()
        # feed raw bytes to the detector until it has seen enough
        with open(current_file, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        return detector.result['encoding']

    def convertFileBestGuess(fileName):
        sourceFormats = ['ascii', 'iso-8859-1']
        for format in sourceFormats:
            try:
                # 'r' rather than 'rU': codecs.open always opens the file in binary mode
                with codecs.open(fileName, 'r', format) as sourceFile:
                    writeConversion(sourceFile, fileName)
                    print('Done.')
                    return
            except UnicodeDecodeError:
                pass

    def convertFileWithDetection(fileName):
        print("Converting '" + fileName + "'...")
        format = get_encoding_type(fileName)
        try:
            with codecs.open(fileName, 'r', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

        print("Error: failed to convert '" + fileName + "'.")

    def writeConversion(file, fileName):
        # fileName is passed in explicitly so the output path is always defined
        with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
            for line in file:
                targetFile.write(line)

    # Off topic: get the file list and call convertFile on each file
    # ...

(EDIT by Rudro Badhon: this incorporates the original approach of trying several formats until one decodes without an exception, as well as an alternative approach that uses chardet.universaldetector)
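The try-each-format idea can be isolated into a few lines that work on raw bytes. This is a sketch rather than a drop-in replacement: decode_with_fallback is a hypothetical name, and the "utf-8" candidate is an addition that is not in the list above. Note that iso-8859-1 maps every possible byte to a character, so it never raises UnicodeDecodeError and only makes sense as the last candidate.

```python
def decode_with_fallback(raw, encodings=("ascii", "utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")
```

A successful decode only means the bytes are valid in that encoding, not that the guess is right, so the order of the candidates matters.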

Mojtaba Khodadadi · Answer

This is a Python 3 function for converting any text file into a UTF-8 encoded file (without using any unnecessary packages):

    def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
        with open(filename, 'r', encoding=encoding_from) as fr:
            with open(newFilename, 'w', encoding=encoding_to) as fw:
                for line in fr:
                    # rstrip avoids chopping a character off a final line that has no newline
                    fw.write(line.rstrip('\n') + '\n')

You can easily use it in a loop to convert a list of files.
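Such a loop might look like the sketch below, which writes the converted copies into a separate directory, as the iconv command in the question does. The name convert_directory, the "*.txt" pattern, and the assumption that every file shares one source encoding are illustrative only.

```python
from pathlib import Path


def convert_directory(input_dir, output_dir, encoding_from, pattern="*.txt"):
    """Convert every file matching pattern in input_dir to UTF-8 in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(Path(input_dir).glob(pattern)):
        text = path.read_text(encoding=encoding_from)
        (out / path.name).write_text(text, encoding="utf-8")
        converted.append(path.name)
    return converted  # names of the files that were written
```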

Ricardo · Answer

To guess the source encoding, you can use the *nix file command.

Example:

    $ file --mime jumper.xml
    jumper.xml: application/xml; charset=utf-8
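If you want that guess from inside Python, one option is to shell out to the same command. This is a sketch that assumes a file implementation supporting the --brief and --mime-encoding options; the function name guess_encoding_with_file is made up, and it returns None when the command is not installed.

```python
import shutil
import subprocess


def guess_encoding_with_file(path):
    """Return the charset reported by file(1), or None if the command is unavailable."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-encoding", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

The reported value is not guaranteed to be a name Python recognizes, so it is worth checking it with codecs.lookup before passing it to open().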

DEX Data Explorers · Answer

This is my brute-force method. It also handles mixed \n and \r\n line endings in the input.

    # Excerpt from a larger method: cfg, self, filelocation, outputfilelocation,
    # delimitervalue and quotevalue are defined elsewhere.
    try:
        # open the CSV file
        inputfile = open(filelocation, 'rb')
        outputfile = open(outputfilelocation, 'w', encoding='utf-8')
        for line in inputfile:
            if line[-2:] == b'\r\n' or line[-2:] == b'\n\r':
                output = line[:-2].decode('utf-8', 'replace') + '\n'
            elif line[-1:] == b'\r' or line[-1:] == b'\n':
                output = line[:-1].decode('utf-8', 'replace') + '\n'
            else:
                output = line.decode('utf-8', 'replace') + '\n'
            outputfile.write(output)
        outputfile.close()
    except BaseException as error:
        cfg.log(self.outf, "Error(18): opening CSV-file " + filelocation + " failed: " + str(error))
        self.loadedwitherrors = 1
        return ([])
    try:
        # open the CSV-file of this source table
        csvreader = csv.reader(open(outputfilelocation, "r", encoding="utf-8", newline=""), delimiter=delimitervalue, quoting=quotevalue, dialect=csv.excel_tab)
    except BaseException as error:
        cfg.log(self.outf, "Error(19): reading CSV-file " + filelocation + " failed: " + str(error))
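The per-line step of this approach can be pulled out into a small standalone function. A sketch only, with normalize_line as a hypothetical name: it strips whichever line ending is present, decodes the rest as UTF-8 while replacing undecodable bytes with U+FFFD, and appends a single line feed.

```python
def normalize_line(line):
    """Decode one raw line as UTF-8 (replacing bad bytes) and end it with one LF."""
    # Two-byte endings are checked first so that both bytes are removed together.
    for ending in (b"\r\n", b"\n\r", b"\r", b"\n"):
        if line.endswith(ending):
            line = line[: -len(ending)]
            break
    return line.decode("utf-8", "replace") + "\n"
```

Keep in mind that the 'replace' error handler silently substitutes bytes it cannot decode, so this repairs a file for parsing rather than converting it faithfully.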

Sole Sensei · Answer

Answer for an unknown source encoding

Based on @Sébastien RoccaSerra's answer

Python 3.6

    import os
    from chardet import detect

    # get file encoding type
    def get_encoding_type(file):
        with open(file, 'rb') as f:
            rawdata = f.read()
        return detect(rawdata)['encoding']

    from_codec = get_encoding_type(srcfile)

    # add try: except block for reliability
    try:
        with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
            text = f.read() # for small files, for big use chunks
            e.write(text)

        os.remove(srcfile) # remove old encoding file
        os.rename(trgfile, srcfile) # rename new encoding
    except UnicodeDecodeError:
        print('Decode Error')
    except UnicodeEncodeError:
        print('Encode Error')
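For big files, the "use chunks" comment could be realized along these lines. It is a sketch under a few assumptions: the source encoding is already known (for example from chardet), convert_in_place and chunk_size are illustrative names, and the temporary file is created next to the original so that os.replace stays on one filesystem.

```python
import os
import tempfile


def convert_in_place(path, source_encoding, chunk_size=1 << 20):
    """Rewrite path as UTF-8, swapping it in only after the whole file converted."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        # newline="" keeps the original line endings untouched.
        with open(fd, "w", encoding="utf-8", newline="") as dst, \
             open(path, "r", encoding=source_encoding, newline="") as src:
            for chunk in iter(lambda: src.read(chunk_size), ""):
                dst.write(chunk)
        os.replace(tmp_path, path)  # replaces the original in one step
    except BaseException:
        os.unlink(tmp_path)  # leave the original untouched on failure
        raise
```

Because the original is replaced only at the very end, a decode error leaves it as it was. One side effect to be aware of: the new file carries the permissions of the temporary file, not those of the original.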