Adding a new column to a DataFrame derived from other columns (Spark)

Question

I'm using Spark 1.3.0 with Python. I have a DataFrame and I want to add an extra column that is derived from other columns, like this:

    >>> old_df.columns
    [col_1, col_2, ..., col_m]
    >>> new_df.columns
    [col_1, col_2, ..., col_m, col_n]

where

col_n = col_3 - col_4

How do I do this in PySpark?

zero323 · Accepted Answer

One way to do this is to use the withColumn method:

    old_df = sqlContext.createDataFrame(
        sc.parallelize([(0, 1), (1, 3), (2, 5)]),
        ('col_1', 'col_2'))
    new_df = old_df.withColumn('col_n', old_df.col_1 - old_df.col_2)
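For the columns named in the question (col_3 and col_4), the same call can be wrapped in a small helper. This is only a sketch: add_difference_column is a hypothetical name, not part of PySpark, and it assumes df is a Spark DataFrame whose columns can be looked up with df['name'].

```python
def add_difference_column(df, left, right, new_name):
    """Return a new DataFrame with `new_name` = df[left] - df[right].

    `df` is assumed to be a Spark DataFrame. withColumn does not modify
    `df` in place; it returns a new DataFrame with the extra column.
    """
    # Subtracting two Column objects builds a column expression that
    # Spark evaluates row by row.
    return df.withColumn(new_name, df[left] - df[right])
```

Usage would look like new_df = add_difference_column(old_df, 'col_3', 'col_4', 'col_n').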

You can also use SQL on a registered table:

    old_df.registerTempTable('old_df')
    new_df = sqlContext.sql('SELECT *, col_1 - col_2 AS col_n FROM old_df')
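If you would rather not register a table at all, the same SQL expression can be passed to DataFrame.selectExpr. Below is a minimal sketch; add_expr_column is a hypothetical helper name, and df is assumed to be a Spark DataFrame. Note also that on Spark 2.0 and later registerTempTable is deprecated in favor of createOrReplaceTempView.

```python
def add_expr_column(df, sql_expr, new_name):
    """Return df with all existing columns plus `sql_expr` aliased as `new_name`.

    `df` is assumed to be a Spark DataFrame and `sql_expr` a Spark SQL
    expression string such as 'col_1 - col_2'.
    """
    # '*' keeps every current column, mirroring SELECT * in the SQL version.
    return df.selectExpr("*", "{} AS {}".format(sql_expr, new_name))
```

Usage would look like new_df = add_expr_column(old_df, 'col_1 - col_2', 'col_n'). Since the expression string is spliced into SQL as-is, this should only be fed trusted input.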

arker296 · Answer

In addition, we can use a udf:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType

    sc = SparkContext()
    sqlContext = SQLContext(sc)
    old_df = sqlContext.createDataFrame(
        sc.parallelize([(0, 1), (1, 3), (2, 5)]),
        ('col_1', 'col_2'))

    function = udf(lambda col1, col2: col1 - col2, IntegerType())
    new_df = old_df.withColumn('col_n', function(col('col_1'), col('col_2')))
    new_df.show()
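One thing to watch with the udf approach: a Python function receives SQL NULLs as None, so a bare col1 - col2 raises a TypeError on rows with missing values. A sketch of a None-safe variant follows. The names safe_difference and make_difference_udf are hypothetical, and the pyspark import is done lazily inside the factory (my choice, not something the answer requires) so the plain function can be tested without Spark installed.

```python
def safe_difference(a, b):
    """Return a - b, or None when either input is missing (SQL NULL)."""
    if a is None or b is None:
        return None
    return a - b


def make_difference_udf():
    """Wrap safe_difference as a Spark UDF that returns an integer column."""
    # Imported here so that safe_difference stays usable without pyspark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    return udf(safe_difference, IntegerType())
```

It would then be applied as old_df.withColumn('col_n', make_difference_udf()(col('col_1'), col('col_2'))). For simple arithmetic like this, the built-in column expression from the accepted answer is generally the faster option, since a Python udf has to ship every row to a Python worker.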

That tech guy · Answer

This worked for me on Databricks using spark.sql:

    df_converted = spark.sql(
        'select total_bill, tip, sex, '
        'case when sex == "Female" then "0" else "1" end as sex_encoded '
        'from tips')