Spark DataFrame: How to add an index column (aka distributed data index)

Question

I am reading data from a CSV file, but it has no index.

I want to add a column numbered from 1 to the number of rows.

How should I do this? Thanks. (Scala)

Omar14 · Accepted Answer

With Scala, you can use:

    import org.apache.spark.sql.functions._

    // monotonically_increasing_id() replaces the deprecated monotonicallyIncreasingId
    df.withColumn("id", monotonically_increasing_id())

You can refer to this example and the Scala docs.

With PySpark, you can use:

    from pyspark.sql.functions import monotonically_increasing_id

    df_index = df.select("*").withColumn("id", monotonically_increasing_id())

anshu kumar · Answer

monotonically_increasing_id: the generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
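To see why the IDs are unique and increasing but not consecutive, it helps to look at how they are built. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The following is a plain-Python sketch of that layout, not Spark code; the helper name monotonic_id and the partition sizes are made up for illustration.

```python
# Plain-Python model of the documented ID layout (no Spark needed).
# Assumption: upper 31 bits = partition index, lower 33 bits = row offset.

def monotonic_id(partition_index: int, row_offset: int) -> int:
    """Combine a partition index and a row offset into one 64-bit ID."""
    return (partition_index << 33) | row_offset

# Hypothetical layout: three partitions holding 2, 1 and 3 rows.
partition_sizes = [2, 1, 3]

ids = [
    monotonic_id(p, offset)
    for p, size in enumerate(partition_sizes)
    for offset in range(size)
]

for i in ids:
    print(i)
```

Within a partition the IDs increase by 1, but each new partition starts at the next multiple of 2^33, which is where the gaps come from.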

"I want to add a column numbered from 1 to the number of rows."

Say we have the following DataFrame:

    +------+-----------+-----+
    |userId|productCode|count|
    +------+-----------+-----+
    |    25|       6001|    2|
    |    11|       5001|    8|
    |    23|        123|    5|
    +------+-----------+-----+

To generate IDs starting from 1:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val w = Window.orderBy("count")
    val result = df.withColumn("index", row_number().over(w))

This adds an index column ordered by increasing value of count.

    +------+-----------+-----+-----+
    |userId|productCode|count|index|
    +------+-----------+-----+-----+
    |    25|       6001|    2|    1|
    |    23|        123|    5|    2|
    |    11|       5001|    8|    3|
    +------+-----------+-----+-----+
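As a mental model, row_number over an ordering amounts to sorting the rows and numbering them from 1. The plain-Python sketch below mimics that on the sample data above; it does not use Spark, and the names rows, ordered and indexed are illustrative. Note that in Spark a window with orderBy and no partitionBy has to bring all rows into a single partition to number them, so this approach may be slow on large data.

```python
# Plain-Python analogue of row_number().over(Window.orderBy("count")).
rows = [
    {"userId": 25, "productCode": 6001, "count": 2},
    {"userId": 11, "productCode": 5001, "count": 8},
    {"userId": 23, "productCode": 123, "count": 5},
]

# Sort by count, then number the sorted rows starting at 1.
ordered = sorted(rows, key=lambda r: r["count"])
indexed = [{**row, "index": n} for n, row in enumerate(ordered, start=1)]

for row in indexed:
    print(row["userId"], row["productCode"], row["count"], row["index"])
```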

Ram Ghadiyaram · Answer

NOTE: the monotonically_increasing_id approach above gives an increasing ID, not a sequence number.

A simple way to get a sequence number, and to make sure the indexes follow the row order, is zipWithIndex, as shown below.

Sample data:

    +-------------------+
    |               Name|
    +-------------------+
    |     Ram Ghadiyaram|
    |        Ravichandra|
    |              ilker|
    |               nick|
    |             Naveed|
    |      Gobinathan SP|
    |Sreenivas Venigalla|
    |     Jackela Kowski|
    |   Arindam Sengupta|
    |            Liangpi|
    |             Omar14|
    |        anshu kumar|
    +-------------------+

    package com.example

    import org.apache.spark.internal.Logging
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{LongType, StructField, StructType}
    import org.apache.spark.sql.{DataFrame, Row}

    /**
      * DistributedDataIndex : Program to index an RDD with
      */
    object DistributedDataIndex extends App with Logging {

      val spark = SparkSession.builder
        .master("local[*]")
        .appName(this.getClass.getName)
        .getOrCreate()

      import spark.implicits._

      val df = spark.sparkContext.parallelize(
        Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick",
          "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski",
          "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
        )).toDF("Name")

      df.show()
      logInfo("addColumnIndex here")

      // Add index now...
      val df1WithIndex = addColumnIndex(df)
        .withColumn("monotonically_increasing_id", monotonically_increasing_id())
      df1WithIndex.show(false)

      /**
        * Add Column Index to dataframe
        */
      def addColumnIndex(df: DataFrame): DataFrame = {
        spark.sqlContext.createDataFrame(
          // Pair each row with its position, then append that position to the row
          df.rdd.zipWithIndex.map {
            case (row, index) => Row.fromSeq(row.toSeq :+ index)
          },
          // Create schema for index column
          StructType(df.schema.fields :+ StructField("index", LongType, false)))
      }
    }

Result:

    +-------------------+-----+---------------------------+
    |Name               |index|monotonically_increasing_id|
    +-------------------+-----+---------------------------+
    |Ram Ghadiyaram     |0    |0                          |
    |Ravichandra        |1    |8589934592                 |
    |ilker              |2    |8589934593                 |
    |nick               |3    |17179869184                |
    |Naveed             |4    |25769803776                |
    |Gobinathan SP      |5    |25769803777                |
    |Sreenivas Venigalla|6    |34359738368                |
    |Jackela Kowski     |7    |42949672960                |
    |Arindam Sengupta   |8    |42949672961                |
    |Liangpi            |9    |51539607552                |
    |Omar14             |10   |60129542144                |
    |anshu kumar        |11   |60129542145                |
    +-------------------+-----+---------------------------+
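The large values in the monotonically_increasing_id column can be decoded to see how the rows were spread over partitions. The sketch below is plain Python and assumes the layout described in the Spark documentation (partition ID in the upper 31 bits, record number in the lower 33 bits); decode is a hypothetical helper, not a Spark function. It then rebuilds a consecutive index from the partition sizes, which is roughly what zipWithIndex does: each partition's rows are offset by the number of rows in the partitions before it.

```python
from collections import Counter

# IDs copied from the monotonically_increasing_id column of the result table.
observed_ids = [
    0, 8589934592, 8589934593, 17179869184, 25769803776, 25769803777,
    34359738368, 42949672960, 42949672961, 51539607552, 60129542144,
    60129542145,
]

def decode(mono_id: int) -> tuple:
    """Split an ID into (partition index, row offset within the partition)."""
    return mono_id >> 33, mono_id & ((1 << 33) - 1)

decoded = [decode(i) for i in observed_ids]

# Number of rows seen in each partition, ordered by partition index.
counts = Counter(p for p, _ in decoded)
sizes = [counts[p] for p in sorted(counts)]

# Starting index of each partition = total rows in all earlier partitions.
starts = {}
total = 0
for p in sorted(counts):
    starts[p] = total
    total += counts[p]

# Consecutive index, analogous to the zipWithIndex column.
index = [starts[p] + offset for p, offset in decoded]

print(sizes)
print(index)
```

For this run the IDs decode to eight partitions of one or two rows each, and the rebuilt index matches the 0 to 11 values in the index column.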

Shantanu Sharma · Answer

As Ram said, zipWithIndex is preferable to monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):

    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, LongType

    # Original schema plus a non-nullable long column for the index
    new_schema = StructType(
        original_dataframe.schema.fields[:] + [StructField("index", LongType(), False)]
    )

    # Each element becomes a (row, index) pair
    zipped_rdd = original_dataframe.rdd.zipWithIndex()

    # row_with_index must be defined before this runs
    indexed = (
        zipped_rdd
        .map(lambda ri: row_with_index(*(list(ri[0]) + [ri[1]])))
        .toDF(new_schema)
    )

where original_dataframe is the DataFrame you want to add an index to, and row_with_index is the new row layout including the index column, which you can write as:

    row_with_index = Row(
        "calendar_date",
        "year_week_number",
        "year_period_number",
        "realization",
        "index",
    )

Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original DataFrame. Replace them with your own column names. index is the name of the new column added for the row numbers.