Error converting a Pandas DataFrame to a Spark DataFrame

Question

I'm trying to convert a Pandas DataFrame into a Spark one. DF head:

10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691

Code:

dataset = pd.read_csv("data/AS/test_v2.csv")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)

And I get an error:

TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

madman2890 · Accepted Answer

You need to make sure your pandas DataFrame columns are appropriate for the type Spark is inferring. If your pandas DataFrame lists something like:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5062 entries, 0 to 5061
Data columns (total 51 columns):
SomeCol    5062 non-null object
Col2       5062 non-null object

and you're getting that error, try:

df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)

Now, make sure .astype(str) is actually the type you want those columns to be. Basically, when the underlying Java code tries to infer the type of an object coming from Python, it looks at some observations and makes a guess. If that guess does not hold for all the data in the column(s) it is trying to convert from pandas to Spark, the conversion fails.
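To see which columns are likely to trip up the inference, it can help to list the Python types actually present in each column. The following is a minimal sketch, not part of the original answer: python_types_by_column is a hypothetical helper, and the sample values are made up to mimic an object column with one missing entry. A column that mixes str and float is a plausible source of the StringType/DoubleType conflict, since Spark infers StringType from str and DoubleType from float.

```python
def python_types_by_column(columns):
    """Map each column name to the sorted names of the Python types it holds."""
    return {
        name: sorted({type(value).__name__ for value in values})
        for name, values in columns.items()
    }


# Hypothetical sample: "col12" mimics an object column where one entry is
# missing; pandas represents a missing value as float NaN.
sample = {
    "col11": ["f", "f", "f"],
    "col12": ["x", float("nan"), "y"],
}

report = python_types_by_column(sample)

# Columns holding more than one Python type are candidates for cleaning.
mixed = [name for name, kinds in report.items() if len(kinds) > 1]

print(report)  # {'col11': ['str'], 'col12': ['float', 'str']}
print(mixed)   # ['col12']
```

The same function should accept a pandas DataFrame directly, because DataFrame.items() yields (column name, Series) pairs. Columns reported with more than one type are the ones to cast or clean before calling createDataFrame.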

Grant Shannon · Answer

Type-related errors can be avoided by imposing a schema, as follows.

Note: a text file (test.csv) was created with the original data (as above) and hypothetical column names inserted ("col1", "col2", ..., "col25").

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
pdDF = pd.read_csv("test.csv")

Contents of the pandas DataFrame:

pdDF

       col1  col2  col3  col4   col5  col6   col7  col8  col9  col10  ...  col16  col17  col18  col19  col20  col21  col22  col23  col24  col25
0  10000001     1     0     1  12:35    OK  10002     1     0      9  ...      3      9      0      0      1      1      0      0      4    543
1  10000001     2     0     1  12:36    OK  10002     1     0      9  ...      3      9      2      1      1      3      1      3      2    611
2  10000002     1     0     4  12:19    PA  10003     1     1      7  ...      2     15      2      0      2      3      1      2      2    691

Next, create the schema:

from pyspark.sql.types import (
    IntegerType, LongType, StringType, StructField, StructType,
)

mySchema = StructType([
    StructField("Col1", LongType(), True),
    StructField("Col2", IntegerType(), True),
    StructField("Col3", IntegerType(), True),
    StructField("Col4", IntegerType(), True),
    StructField("Col5", StringType(), True),
    StructField("Col6", StringType(), True),
    StructField("Col7", IntegerType(), True),
    StructField("Col8", IntegerType(), True),
    StructField("Col9", IntegerType(), True),
    StructField("Col10", IntegerType(), True),
    StructField("Col11", StringType(), True),
    StructField("Col12", StringType(), True),
    StructField("Col13", IntegerType(), True),
    StructField("Col14", IntegerType(), True),
    StructField("Col15", IntegerType(), True),
    StructField("Col16", IntegerType(), True),
    StructField("Col17", IntegerType(), True),
    StructField("Col18", IntegerType(), True),
    StructField("Col19", IntegerType(), True),
    StructField("Col20", IntegerType(), True),
    StructField("Col21", IntegerType(), True),
    StructField("Col22", IntegerType(), True),
    StructField("Col23", IntegerType(), True),
    StructField("Col24", IntegerType(), True),
    StructField("Col25", IntegerType(), True),
])

Note: True means the field is nullable.

Create the PySpark DataFrame:

df = spark.createDataFrame(pdDF,schema=mySchema)

Confirm that the pandas DataFrame is now a PySpark DataFrame:

type(df)

Output:

pyspark.sql.dataframe.DataFrame
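Writing out 25 StructField entries by hand is repetitive, so here is a more compact way to express the same schema. This is only a sketch, not part of the original answer: it assumes a PySpark version whose createDataFrame accepts a DDL-style schema string such as "Col1 bigint, Col2 int". The names build_ddl_schema and STRING_COLUMNS are hypothetical, and the column positions are read off the schema shown above, so they would need adjusting for any other file.

```python
# 1-based positions of the text columns in the schema above (assumption:
# only valid for this particular sample file).
STRING_COLUMNS = {5, 6, 11, 12}


def build_ddl_schema(n_columns=25, string_columns=STRING_COLUMNS):
    """Build a DDL-style schema string such as 'Col1 bigint, Col2 int, ...'."""
    fields = []
    for i in range(1, n_columns + 1):
        if i == 1:
            spark_type = "bigint"   # corresponds to LongType
        elif i in string_columns:
            spark_type = "string"   # corresponds to StringType
        else:
            spark_type = "int"      # corresponds to IntegerType
        fields.append(f"Col{i} {spark_type}")
    return ", ".join(fields)


ddl_schema = build_ddl_schema()
print(ddl_schema)

# Assumed usage, with the spark session and pdDF defined earlier:
# df = spark.createDataFrame(pdDF, schema=ddl_schema)
```

The explicit StructType version remains the clearer choice when individual fields need different nullability, since this string form does not set it per field.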

Aside:

To address Kate's comment below: to impose a general (String) schema, you can do the following:

df = spark.createDataFrame(pdDF.astype(str))

RoyaumeIX · Answer

I tried this with your data and it works:

%pyspark
import pandas as pd
from pyspark.sql import SQLContext

# sc is assumed to be predefined by the notebook environment
print(sc)
df = pd.read_csv("test.csv")
print(type(df))
print(df)
sqlCtx = SQLContext(sc)
sqlCtx.createDataFrame(df).show()

Gonzalo Garcia · Answer

I wrote this algorithm; it worked for my 10 pandas DataFrames.

from pyspark.sql.types import (
    DoubleType, IntegerType, LongType, StringType,
    StructField, StructType, TimestampType,
)

# Auxiliary functions
def equivalent_type(f):
    # Map a pandas/NumPy dtype to a Spark type; anything unknown becomes a string.
    if f == 'datetime64[ns]':
        return TimestampType()  # keeps the time of day, which DateType would drop
    elif f == 'int64':
        return LongType()
    elif f == 'int32':
        return IntegerType()
    elif f == 'float64':
        return DoubleType()  # float64 is double precision; FloatType is 32-bit
    else:
        return StringType()

def define_structure(string, format_type):
    try:
        typo = equivalent_type(format_type)
    except Exception:
        typo = StringType()
    return StructField(string, typo)

# Given a pandas DataFrame, return a Spark DataFrame.
# Assumes a SQLContext named sqlContext already exists (as in the PySpark shell).
def pandas_to_spark(pandas_df):
    columns = list(pandas_df.columns)
    types = list(pandas_df.dtypes)
    struct_list = []
    for column, typo in zip(columns, types):
        struct_list.append(define_structure(column, typo))
    p_schema = StructType(struct_list)
    return sqlContext.createDataFrame(pandas_df, p_schema)

You can also see it in this Gist.

With this, you just have to call spark_df = pandas_to_spark(pandas_df)

heathensoul · Answer

I once got a similar error message. In my case, it was because my pandas DataFrame contained NULL values. I'd recommend handling these in pandas before converting to Spark (that solved the problem in my case).