Spark: Merge 2 dataframes by adding a row index/number to both dataframes

Question

Q: Is there a way to merge two dataframes, or copy a column from one dataframe to another, in PySpark?

For example, I have two dataframes:

DF1

    C1          C2
    23397414    20875.7353
    5213970     20497.5582
    41323308    20935.7956
    123276113   18884.0477
    76456078    18389.9269

and the second dataframe:

DF2

    C3           C4
    2008-02-04   262.00
    2008-02-05   257.25
    2008-02-06   262.75
    2008-02-07   237.00
    2008-02-08   231.00

Then I want to add C3 from DF2 to DF1, like this:

New DF

    C1          C2           C3
    23397414    20875.7353   2008-02-04
    5213970     20497.5582   2008-02-05
    41323308    20935.7956   2008-02-06
    123276113   18884.0477   2008-02-07
    76456078    18389.9269   2008-02-08

I hope this example is clear.

Ram Ghadiyaram · Accepted Answer

Either a row number with a window function (solution 1) or zipWithIndex.map (solution 2) should help in this case.

Solution 1: You can use window functions for this.

I would suggest adding the row number as an additional column to each dataframe, say df1:

DF1

    C1          C2           columnindex
    23397414    20875.7353   1
    5213970     20497.5582   2
    41323308    20935.7956   3
    123276113   18884.0477   4
    76456078    18389.9269   5

and the second dataframe:

DF2

    C3           C4       columnindex
    2008-02-04   262.00   1
    2008-02-05   257.25   2
    2008-02-06   262.75   3
    2008-02-07   237.00   4
    2008-02-08   231.00   5

Now do an inner join of DF1 and DF2 on columnindex, and that's all. You will get the output below.

Something like this:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, monotonically_increasing_id

    # row_number() needs an ordered window; ordering by
    # monotonically_increasing_id() follows the order in which Spark holds the rows.
    # Note: a window without partitionBy moves all rows into a single partition.
    w = Window.orderBy(monotonically_increasing_id())

    df1 = ....  # as shown above (DF1)
    df2 = ....  # as shown above (DF2)

    df11 = df1.withColumn("columnindex", row_number().over(w))
    df22 = df2.withColumn("columnindex", row_number().over(w))

    newDF = df11.join(df22, df11.columnindex == df22.columnindex, 'inner').drop(df22.columnindex)
    newDF.show()

New DF

    C1          C2           C3
    23397414    20875.7353   2008-02-04
    5213970     20497.5582   2008-02-05
    41323308    20935.7956   2008-02-06
    123276113   18884.0477   2008-02-07
    76456078    18389.9269   2008-02-08
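To make the idea behind solution 1 concrete, here is a minimal plain-Python sketch of "number the rows on both sides, then inner-join on that number". It does not use Spark at all: the helper names `add_column_index` and `inner_join_on_index` are hypothetical and only model the logic, using the sample values from the question.

```python
# Plain-Python model of "add a row index to each table, then inner-join on it".
# No Spark involved; helper names are hypothetical and for illustration only.

df1_rows = [
    (23397414, 20875.7353),
    (5213970, 20497.5582),
    (41323308, 20935.7956),
    (123276113, 18884.0477),
    (76456078, 18389.9269),
]
df2_rows = [
    ("2008-02-04", 262.00),
    ("2008-02-05", 257.25),
    ("2008-02-06", 262.75),
    ("2008-02-07", 237.00),
    ("2008-02-08", 231.00),
]


def add_column_index(rows):
    """Map a 1-based row number to each row, like a row-number column."""
    return {i: row for i, row in enumerate(rows, start=1)}


def inner_join_on_index(left, right):
    """Keep only the indices present on both sides and concatenate the rows."""
    return [left[i] + right[i] for i in sorted(left.keys() & right.keys())]


joined = inner_join_on_index(add_column_index(df1_rows), add_column_index(df2_rows))

# Keep C1, C2 and C3 only, as in the desired result from the question.
new_df = [(c1, c2, c3) for (c1, c2, c3, _c4) in joined]

for row in new_df:
    print(row)
```

The sketch also shows why the approach depends on row order: the pairing is purely positional, so both sides must be numbered in the order you intend to match.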

Solution 2: Another good way (probably the best one :)), in Scala, which you can translate to PySpark:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    /**
     * Add Column Index to dataframe
     */
    def addColumnIndex(df: DataFrame) = sqlContext.createDataFrame(
      // Add Column index
      df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
      // Create schema
      StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
    )

    // Add index now...
    val df1WithIndex = addColumnIndex(df1)
    val df2WithIndex = addColumnIndex(df2)

    // Now time to join ...
    val newone = df1WithIndex
      .join(df2WithIndex, Seq("columnindex"))
      .drop("columnindex")
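As a rough illustration of what zipWithIndex does, the sketch below models an RDD as a list of partitions in plain Python. It assumes the documented behaviour that indices start at 0 and follow the partition order first, then the position within each partition. The function `zip_with_index` is a hypothetical stand-in, not the Spark API.

```python
# Plain-Python model of RDD.zipWithIndex, with an RDD represented as a
# list of partitions. `zip_with_index` is a hypothetical stand-in.

def zip_with_index(partitions):
    """Pair each element with a 0-based index that runs across partitions.

    Elements are numbered by partition order first, then by their
    position inside each partition.
    """
    indexed = []
    next_index = 0
    for partition in partitions:
        indexed_partition = []
        for element in partition:
            indexed_partition.append((element, next_index))
            next_index += 1
        indexed.append(indexed_partition)
    return indexed


partitions = [["a", "b"], ["c"], ["d", "e"]]
indexed = zip_with_index(partitions)
print(indexed)
```

Because the index reflects how the rows are laid out across partitions, two dataframes only line up if their rows are already in the order you want to pair.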

Jed · Answer

I thought I would share the Python (PySpark) translation of solution 2 above from @Ram Ghadiyaram:

    # Python 2 syntax: tuple-unpacking lambda, built-in reduce, xrange

    def addColumnIndex(df):
        # Create new column names
        oldColumns = df.schema.names
        newColumns = oldColumns + ["columnindex"]

        # Add Column index
        df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
                                               row + (columnindex,)).toDF()

        # Rename all the columns
        new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
                        newColumns[idx]), xrange(len(oldColumns)), df_indexed)
        return new_df

    # Add index now...
    df1WithIndex = addColumnIndex(df1)
    df2WithIndex = addColumnIndex(df2)

    # Now time to join on the shared column name ...
    newone = df1WithIndex.join(df2WithIndex, "columnindex", 'inner').drop("columnindex")

Zilong · Answer

I based this on @Jed's answer:

    # Python 2 syntax: tuple-unpacking lambda, built-in reduce, xrange

    def addColumnIndex(df):
        # Get old column names and add a column "columnindex"
        oldColumns = df.columns
        newColumns = oldColumns + ["columnindex"]

        # Add Column index
        df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
                                               row + (columnindex,)).toDF()

        # Rename all the columns, starting from the names of the indexed dataframe
        oldColumns = df_indexed.columns
        new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
                        newColumns[idx]), xrange(len(oldColumns)), df_indexed)
        return new_df

    # Add index now...
    df1WithIndex = addColumnIndex(df1)
    df2WithIndex = addColumnIndex(df2)

    # Now time to join on the shared column name ...
    newone = df1WithIndex.join(df2WithIndex, "columnindex", 'inner').drop("columnindex")

Shankar Koirala · Answer

Here is a simple example that may help, even if you have already solved the problem.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType
    import spark.implicits._  // needed for toDF outside spark-shell

    // Create first DataFrame
    val df1 = spark.sparkContext.parallelize(Seq(1, 2, 1)).toDF("lavel1")

    // Create second DataFrame
    val df2 = spark.sparkContext.parallelize(Seq((1.0, 12.1), (12.1, 1.3), (1.1, 0.3))).toDF("f1", "f2")

    // Combine both DataFrames row by row
    val combinedRow = df1.rdd.zip(df2.rdd).map {
      // Convert both rows to Seq, concatenate them and return the result as a Row
      case (df1Data, df2Data) => Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
    }

    // Create a new schema from both DataFrames' schemas
    val combinedschema = StructType(df1.schema.fields ++ df2.schema.fields)

    // Create a new DataFrame from the new rows and the new schema
    val finalDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)

    finalDF.show
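One thing to keep in mind with the zip approach: RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition. The plain-Python sketch below models that constraint with lists of partitions. `zip_partitions` is a hypothetical helper, not a Spark call, and the partition layout of the sample data is an assumption made for illustration.

```python
# Plain-Python model of the constraint behind RDD.zip: both sides must have
# the same number of partitions and the same number of elements per partition.
# `zip_partitions` is a hypothetical helper for illustration.

def zip_partitions(left, right):
    """Concatenate rows pairwise, partition by partition."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for left_part, right_part in zip(left, right):
        if len(left_part) != len(right_part):
            raise ValueError("partitions differ in number of elements")
        zipped.append([l_row + r_row for l_row, r_row in zip(left_part, right_part)])
    return zipped


# Same values as the Scala example, split into two partitions (assumed layout).
df1_parts = [[(1,), (2,)], [(1,)]]
df2_parts = [[(1.0, 12.1), (12.1, 1.3)], [(1.1, 0.3)]]

combined = zip_partitions(df1_parts, df2_parts)
print(combined)
```

If the two dataframes come from different sources or have been filtered or repartitioned differently, that layout is not guaranteed to match, which is where the index-and-join variants in the other answers are more forgiving.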

MNav · Answer

Expanding on Jed's answer, in response to Ajinkya's comment:

To get the same old column names back, you need to replace "old_cols" with a list of the columns of the newly indexed dataframe. See my modified version of the function below.

    def add_column_index(df):
        new_cols = df.schema.names + ['ix']
        ix_df = df.rdd.zipWithIndex().map(lambda (row, ix): row + (ix,)).toDF()
        tmp_cols = ix_df.schema.names
        return reduce(lambda data, idx: data.withColumnRenamed(tmp_cols[idx], new_cols[idx]), xrange(len(tmp_cols)), ix_df)

Dyno Fu · Answer

For a Python 3 version:

    from pyspark.sql.types import StructType, StructField, LongType

    def with_column_index(sdf):
        new_schema = StructType(sdf.schema.fields + [StructField("ColumnIndex", LongType(), False),])
        return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)

    df1_ci = with_column_index(df1)
    df2_ci = with_column_index(df2)
    join_on_index = df1_ci.join(df2_ci, df1_ci.ColumnIndex == df2_ci.ColumnIndex, 'inner').drop("ColumnIndex")