Splitting a CSV into multiple files in Python

Question

I have a CSV file of about 5000 rows and I want to split it into five files using Python.

I wrote some code for this, but it does not work:

    import codecs
    import csv

    NO_OF_LINES_PER_FILE = 1000

    def again(count_file_header,count):
        f3 = open('write_'+count_file_header+'.csv', 'at')
        with open('import_1458922827.csv', 'rb') as csvfile:
            candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
            co = 0
            for row in candidate_info_reader:
                co = co + 1
                count = count + 1
                if count <= count:
                    pass
                elif count >= NO_OF_LINES_PER_FILE:
                    count_file_header = count + NO_OF_LINES_PER_FILE
                    again(count_file_header,count)
                else:
                    writer = csv.writer(f3,delimiter = ',', lineterminator='\n',quoting=csv.QUOTE_ALL)
                    writer.writerow(row)

    def read_write():
        f3 = open('write_'+NO_OF_LINES_PER_FILE+'.csv', 'at')
        with open('import_1458922827.csv', 'rb') as csvfile:
            candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
            count = 0
            for row in candidate_info_reader:
                count = count + 1
                if count >= NO_OF_LINES_PER_FILE:
                    count_file_header = count + NO_OF_LINES_PER_FILE
                    again(count_file_header,count)
                else:
                    writer = csv.writer(f3,delimiter = ',', lineterminator='\n',quoting=csv.QUOTE_ALL)
                    writer.writerow(row)

    read_write()

The code above creates many files, all of them empty.

How can I split one file into five CSV files?

Rudziankoŭ · Accepted Answer

I suggest not reinventing the wheel: a ready-made solution already exists. Source here

    import csv
    import os


    def split(filehandler, delimiter=',', row_limit=1000,
              output_name_template='output_%s.csv', output_path='.',
              keep_headers=True):
        """Split the CSV read from filehandler into pieces of row_limit rows."""
        reader = csv.reader(filehandler, delimiter=delimiter)
        current_piece = 1
        current_out_path = os.path.join(
            output_path,
            output_name_template % current_piece
        )
        # newline='' lets the csv module control line endings (Python 3).
        current_out_file = open(current_out_path, 'w', newline='')
        current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
        current_limit = row_limit
        if keep_headers:
            # next(reader) replaces the Python 2-only reader.next().
            headers = next(reader)
            current_out_writer.writerow(headers)
        for i, row in enumerate(reader):
            if i + 1 > current_limit:
                # Row limit reached: close this piece and start the next one.
                current_out_file.close()
                current_piece += 1
                current_limit = row_limit * current_piece
                current_out_path = os.path.join(
                    output_path,
                    output_name_template % current_piece
                )
                current_out_file = open(current_out_path, 'w', newline='')
                current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
                if keep_headers:
                    current_out_writer.writerow(headers)
            current_out_writer.writerow(row)
        current_out_file.close()

Use it like this:

    split(open('/your/path/input.csv', 'r'))

Aziz Alto · Answer

In Python

Use readlines() and writelines() to do it. Here is an example:

    >>> csvfile = open('import_1458922827.csv', 'r').readlines()
    >>> filename = 1
    >>> for i in range(len(csvfile)):
    ...     if i % 1000 == 0:
    ...         open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+1000])
    ...         filename += 1

The output files will be named 1.csv, 2.csv, and so on.
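The readlines() approach loads the whole file into memory and does not repeat the header line in each part. Here is a minimal sketch of a lazy variant built on itertools.islice. The function name split_lines and its parameters are hypothetical, and it assumes that no quoted field contains a line break, since it splits on physical lines rather than CSV records.

```python
import os
from itertools import islice


def split_lines(source_path, dest_dir='.', lines_per_file=1000, keep_header=True):
    """Split a text file into 1.csv, 2.csv, ... inside dest_dir.

    Hypothetical helper: splits on physical lines, so it assumes that
    no quoted CSV field spans several lines.
    """
    created = []
    # newline='' keeps the original line endings untouched.
    with open(source_path, 'r', newline='') as source:
        # Remember the first line so it can be repeated in every part.
        header = next(source, None) if keep_header else None
        while True:
            # Pull at most lines_per_file lines without reading the rest.
            chunk = list(islice(source, lines_per_file))
            if not chunk:
                break
            path = os.path.join(dest_dir, f'{len(created) + 1}.csv')
            with open(path, 'w', newline='') as target:
                if header is not None:
                    target.write(header)
                target.writelines(chunk)
            created.append(path)
    return created
```

Calling split_lines('import_1458922827.csv') would write the parts to the current directory and return their paths.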

From the terminal

For reference, you can also do this from the command line using split:

$ split -l 1000 import_1458922827.csv

Ryan Tuck · Answer

A Python 3-friendly solution:

    import csv
    import os


    def split_csv(source_filepath, dest_folder, split_file_prefix,
                  records_per_file):
        """
        Split a source csv into multiple csvs of equal numbers of records,
        except the last file.

        Includes the initial header row in each split file.

        Split files follow a zero-index sequential naming convention like so:

            `{split_file_prefix}_0.csv`
        """
        if records_per_file <= 0:
            raise ValueError('records_per_file must be > 0')

        with open(source_filepath, 'r') as source:
            reader = csv.reader(source)
            headers = next(reader)

            file_idx = 0
            records_exist = True

            while records_exist:

                i = 0
                target_filename = f'{split_file_prefix}_{file_idx}.csv'
                target_filepath = os.path.join(dest_folder, target_filename)

                # Note: on Windows, passing newline='' to open() avoids
                # blank rows between records.
                with open(target_filepath, 'w') as target:
                    writer = csv.writer(target)

                    while i < records_per_file:
                        if i == 0:
                            writer.writerow(headers)

                        try:
                            writer.writerow(next(reader))
                            i += 1
                        except StopIteration:
                            # The source file is exhausted.
                            records_exist = False
                            break

                if i == 0:
                    # we only wrote the header, so delete that file
                    os.remove(target_filepath)

                file_idx += 1

Whitefret · Answer

    if count <= count:
        pass

This condition is always true, so you hit pass on every iteration and never write a row.

Otherwise, you can look at this post: Splitting a CSV file into equal parts?
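As a rough illustration of how the loop could be restructured without the always-true test or the recursion, here is a minimal sketch that reads the source once and starts a new output file every rows_per_file rows. The function name split_rows and the write_N.csv naming are assumptions, and every row is treated as data, so no header is repeated.

```python
import csv
import os


def split_rows(source_path, dest_dir, rows_per_file=1000):
    """Write consecutive groups of rows to write_1.csv, write_2.csv, ...

    Hypothetical helper: the name and the output naming are illustrative.
    Every row is treated as data; no header is repeated.
    """
    written = []    # paths of the files created so far
    target = None   # output file currently open, if any
    writer = None
    try:
        with open(source_path, newline='') as source:
            for index, row in enumerate(csv.reader(source)):
                if index % rows_per_file == 0:
                    # Row limit reached (or first row): start a new file.
                    if target is not None:
                        target.close()
                    path = os.path.join(dest_dir, f'write_{len(written) + 1}.csv')
                    target = open(path, 'w', newline='')
                    writer = csv.writer(target, quoting=csv.QUOTE_ALL)
                    written.append(path)
                writer.writerow(row)
    finally:
        # Close the last output file even if reading failed midway.
        if target is not None:
            target.close()
    return written
```

A file is only opened when there is a row to put in it, so no empty files are created.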

Aetos · Answer

I suggest taking advantage of what pandas offers. Here are functions you could use for this:

    import logging
    import math

    import pandas as pd


    def csv_count_rows(file):
        """
        Counts the number of rows in a file.
        :param file: path to the file.
        :return: number of lines in the designated file.
        """
        with open(file) as f:
            nb_lines = sum(1 for line in f)
        return nb_lines


    def split_csv(file, sep=",", output_path=".", nrows=None, chunksize=None, low_memory=True, usecols=None):
        """
        Split a csv into several files.
        :param file: path to the original csv.
        :param sep: View pandas.read_csv doc.
        :param output_path: path in which to output the resulting parts of the splitting.
        :param nrows: Number of rows to split the original csv by, also view pandas.read_csv doc.
        :param chunksize: View pandas.read_csv doc. Leave as None; otherwise read_csv
            returns an iterator rather than a DataFrame.
        :param low_memory: View pandas.read_csv doc.
        :param usecols: View pandas.read_csv doc.
        """
        # Count data rows only: the first line of the file is the header.
        nb_of_rows = csv_count_rows(file) - 1

        # Parsing file elements : Path, name, extension, etc...
        # file_path = "/".join(file.split("/")[0:-1])
        file_name = file.split("/")[-1]
        # file_ext = file_name.split(".")[-1]
        file_name_trunk = file_name.split(".")[0]
        split_files_name_trunk = file_name_trunk + "_part_"

        if nrows:
            # Number of chunks to partition the original file into
            nb_of_chunks = math.ceil(nb_of_rows / nrows)

            log_debug_process_start = f"The file '{file_name}' contains {nb_of_rows} ROWS. " \
                f"\nIt will be split into {nb_of_chunks} chunks of a max number of rows : {nrows}." \
                f"\nThe resulting files will be output in '{output_path}' as '{split_files_name_trunk}0 to {nb_of_chunks - 1}'"
            logging.debug(log_debug_process_start)

            for i in range(nb_of_chunks):
                # Skip the data rows already written by earlier chunks:
                # (the number of the chunk being processed) multiplied by (the nrows parameter).
                # Line 0, the header, is never skipped.
                rows_to_skip = range(1, i * nrows + 1) if i else None
                output_file = f"{output_path}/{split_files_name_trunk}{i}.csv"

                log_debug_chunk_processing = f"Processing chunk {i} of the file '{file_name}'"
                logging.debug(log_debug_chunk_processing)

                # Fetching the original csv file and handling it with skiprows and nrows to process its data
                df_chunk = pd.read_csv(filepath_or_buffer=file, sep=sep, nrows=nrows, skiprows=rows_to_skip,
                                       chunksize=chunksize, low_memory=low_memory, usecols=usecols)
                # index=False avoids writing an extra unnamed index column.
                df_chunk.to_csv(path_or_buf=output_file, sep=sep, index=False)

                log_info_file_output = f"Chunk {i} of file '{file_name}' created in '{output_file}'"
                logging.info(log_info_file_output)

Then, in your main script or Jupyter notebook, put:

    # This is how you initiate logging in the most basic way.
    logging.basicConfig(level=logging.DEBUG)
    file = {#Path to your file}
    split_csv(file, sep=";", output_path={#Path where you'd like to output it}, nrows=4000000, low_memory=False)

P.S.1: I set nrows = 4000000 out of personal preference. You can change that number if you wish.

P.S.2: I used the logging library to display messages. When a function like this is applied to large files on a remote server, you really want to avoid plain prints and use logging instead. You can replace logging.info or logging.debug with print.

P.S.3: Of course, you need to replace the {#Blablabla} parts of the code with your own parameters.
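The chunk arithmetic behind this approach can be checked without pandas. The sketch below is only an illustration: chunk_line_ranges is a hypothetical helper, and it assumes that line 0 of the file is the header and that each data row occupies exactly one line.

```python
import math


def chunk_line_ranges(total_lines, nrows, has_header=True):
    """Return (first_line, last_line) pairs for each chunk.

    Line numbers are 0-based positions in the file and both ends are
    inclusive. Hypothetical helper, for illustration only.
    """
    if nrows <= 0:
        raise ValueError('nrows must be > 0')
    data_rows = total_lines - 1 if has_header else total_lines
    first_data_line = 1 if has_header else 0
    nb_of_chunks = math.ceil(data_rows / nrows)
    ranges = []
    for i in range(nb_of_chunks):
        start = first_data_line + i * nrows
        # The last chunk may be shorter than nrows.
        end = min(start + nrows, first_data_line + data_rows) - 1
        ranges.append((start, end))
    return ranges


# A 5001-line file (1 header + 5000 data rows) split by 1000 rows:
print(chunk_line_ranges(5001, 1000))
# → [(1, 1000), (1001, 2000), (2001, 3000), (3001, 4000), (4001, 5000)]
```

For chunk i, the lines to skip are therefore 1 through i * nrows, with line 0 kept as the header.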

Ramesh K · Answer

@Ryan, the Python 3 code worked for me. I used newline='' as below to avoid blank-line problems:

    with open(target_filepath, 'w', newline='') as target:
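For context on why this helps: csv.writer ends each row with '\r\n' by default, and a text-mode file on Windows translates the '\n' into '\r\n' again, which produces '\r\r\n' and shows up as blank rows. A small sketch, using made-up sample rows and a temporary file, shows the bytes that reach the disk when newline='' is passed:

```python
import csv
import os
import tempfile

# Made-up sample data, for illustration only.
rows = [['id', 'name'], ['1', 'Ada'], ['2', 'Linus']]

with tempfile.TemporaryDirectory() as tmp_dir:
    target_filepath = os.path.join(tmp_dir, 'demo.csv')
    # newline='' passes the writer's own '\r\n' terminator through untouched.
    with open(target_filepath, 'w', newline='') as target:
        csv.writer(target).writerows(rows)
    # Read the file back in binary mode to see the exact bytes.
    with open(target_filepath, 'rb') as raw:
        raw_bytes = raw.read()

print(raw_bytes)
# → b'id,name\r\n1,Ada\r\n2,Linus\r\n'
```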