How do I detect if a Spark DataFrame has a column

Question

When I create a DataFrame from a JSON file in Spark SQL, how can I tell whether a given column exists before calling .select?

Example JSON schema:

{ "a": { "b": 1, "c": 2 } }

This is what I want to do:

    val potential_columns = Seq("b", "c", "d")
    val df = sqlContext.read.json(filename)
    potential_columns.map(column => if (df.hasColumn(column)) df.select(s"a.$column"))

but I can't find a good function for hasColumn. The closest I've come is testing whether the column is in this somewhat awkward array:

    scala> df.select("a.*").columns
    res17: Array[String] = Array(b, c)

zero323 · Accepted Answer

Just assume it exists and let it fail with Try. Plain and simple, and it supports arbitrary nesting:

    import scala.util.Try
    import org.apache.spark.sql.DataFrame

    // A column exists if resolving its path on the DataFrame does not throw
    def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess

    val df = sqlContext.read.json(sc.parallelize(
      """{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))

    hasColumn(df, "foobar")          // Boolean = false
    hasColumn(df, "foo")             // Boolean = true
    hasColumn(df, "foo.bar")         // Boolean = true
    hasColumn(df, "foo.bar.foobar")  // Boolean = true
    hasColumn(df, "foo.bar.foobaz")  // Boolean = false

Or even simpler:

    val columns = Seq(
      "foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")

    // Keep only the paths that resolve to a Column
    columns.flatMap(c => Try(df(c)).toOption)
    // Seq[org.apache.spark.sql.Column] = List(
    //   foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)

Python equivalent:

    from pyspark.sql.utils import AnalysisException
    from pyspark.sql import Row


    def has_column(df, col):
        try:
            df[col]
            return True
        except AnalysisException:
            return False


    df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()

    has_column(df, "foobar")          ## False
    has_column(df, "foo")             ## True
    has_column(df, "foo.bar")         ## True
    has_column(df, "foo.bar.foobar")  ## True
    has_column(df, "foo.bar.foobaz")  ## False

Jai Prakash · Answer

Another option that I normally use is

df.columns.contains("column-name-to-check")

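This returns a Boolean. Note that df.columns lists only top-level column names, so this check will not find nested fields such as a.b.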

Daniel B. · Answer

Actually, you don't even need to call select in order to use columns; you can just call it on the DataFrame itself.

    // define test data
    case class Test(a: Int, b: Int)
    val testList = List(Test(1, 2), Test(3, 4))
    val testDF = sqlContext.createDataFrame(testList)

    // define the hasColumn function
    def hasColumn(df: org.apache.spark.sql.DataFrame, colName: String) =
      df.columns.contains(colName)

    // then you can just use it on the DF with a given column name
    hasColumn(testDF, "a")  // <-- true
    hasColumn(testDF, "c")  // <-- false

You can also define an implicit class using the pimp my library pattern, so that the hasColumn method is available directly on your DataFrames.

    implicit class DataFrameImprovements(df: org.apache.spark.sql.DataFrame) {
      def hasColumn(colName: String) = df.columns.contains(colName)
    }

Then you can use it as:

    testDF.hasColumn("a")  // <-- true
    testDF.hasColumn("c")  // <-- false

Michael Lloyd Lee mlk · Answer

Your other option would be to do some array manipulation (in this case an intersect) on df.columns and your potential_columns.

    // Loading some data (so you can just copy & paste right into spark-shell)
    case class Document(a: String, b: String, c: String)
    val df = sc.parallelize(Seq(Document("a", "b", "c")), 2).toDF

    // The columns we want to extract
    val potential_columns = Seq("b", "c", "d")

    // Get the intersect of the potential columns and the actual columns,
    // we turn the array of strings into column objects
    // Finally turn the result into a vararg (: _*)
    df.select(potential_columns.intersect(df.columns).map(df(_)): _*).show

Alas, this will not work for your inner-object scenario above. You will need to look at the schema for that.

I'm going to change your potential_columns to fully qualified column names:

    val potential_columns = Seq("a.b", "a.c", "a.d")

    // Our object model
    case class Document(a: String, b: String, c: String)
    case class Document2(a: Document, b: String, c: String)

    // And some data...
    val df = sc.parallelize(Seq(Document2(Document("a", "b", "c"), "b2", "c2")), 2).toDF

    // We go through each of the fields in the schema.
    // For StructTypes we return an array of parentName.fieldName
    // For everything else we return an array containing just the field name
    // We then flatten the complete list of field names
    // Then we intersect that with our potential_columns leaving us just a list of columns we want
    // we turn the array of strings into column objects
    // Finally turn the result into a vararg (: _*)
    df.select(df.schema.map(a => a.dataType match {
      case s: org.apache.spark.sql.types.StructType => s.fieldNames.map(x => a.name + "." + x)
      case _ => Array(a.name)
    }).flatMap(x => x).intersect(potential_columns).map(df(_)): _*).show

This only goes one level deep, so making it generic would take more work.
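As a rough illustration of what the generic version could look like, here is a minimal Python sketch that walks a schema recursively and collects every dotted path. It works on a plain dict in the shape of Spark's JSON schema representation (in PySpark this would presumably come from df.schema.jsonValue(); that source, the helper name flatten_fields and the sample schema are assumptions made for the example, not part of the answer above).

```python
def flatten_fields(schema, prefix=""):
    """Return the dotted path of every field in a Spark-style schema dict."""
    paths = []
    for field in schema["fields"]:
        path = prefix + field["name"]
        paths.append(path)
        field_type = field["type"]
        # Arrays wrap their element type; unwrap so array<struct> is visited too
        while isinstance(field_type, dict) and field_type.get("type") == "array":
            field_type = field_type["elementType"]
        # Recurse into nested structs, extending the dotted prefix
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            paths.extend(flatten_fields(field_type, path + "."))
    return paths


def leaf(name, type_name="long"):
    """Hypothetical helper: one primitive field entry."""
    return {"name": name, "type": type_name, "nullable": True, "metadata": {}}


# Hand-written stand-in for the schema of { "a": { "b": 1, "c": 2 } }
schema = {
    "type": "struct",
    "fields": [
        {
            "name": "a",
            "type": {"type": "struct", "fields": [leaf("b"), leaf("c")]},
            "nullable": True,
            "metadata": {},
        }
    ],
}

all_paths = flatten_fields(schema)
potential_columns = ["a.b", "a.c", "a.d"]
existing = [c for c in potential_columns if c in all_paths]

print(all_paths)  # ['a', 'a.b', 'a.c']
print(existing)   # ['a.b', 'a.c']
```

The resulting list can then be passed to select, in the same way as the intersect result above.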

Nitin Mathur · Answer

Try is not optimal because it evaluates the expression inside Try before the decision is made.

For large datasets, use the following in Scala:

df.schema.fieldNames.contains("column_name")

mfryar · Answer

For those who stumble across this looking for a Python solution, I use:

    if 'column_name_to_check' in df.columns:
        pass  # do something

When I tried @Jai Prakash's answer of df.columns.contains('column-name-to-check') using Python, I got AttributeError: 'list' object has no attribute 'contains'.
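The difference comes down to plain Python list semantics, which can be seen without Spark at all. In this small sketch, the list columns is a hand-written stand-in for df.columns, and the names present and missing are only illustrative.

```python
# Stand-in for df.columns, which in PySpark is a plain Python list of names
columns = ["b", "c"]
potential_columns = ["b", "c", "d"]

# Python lists have no .contains method; membership is tested with `in`
has_contains = hasattr(columns, "contains")

# Split the candidates into those that exist and those that do not
present = [name for name in potential_columns if name in columns]
missing = [name for name in potential_columns if name not in columns]

print(has_contains)  # False
print(present)       # ['b', 'c']
print(missing)       # ['d']
```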

Shaun Ryan · Answer

If you shred your JSON using a schema definition when you load it, you don't need to check for the column. If it is not in the JSON source, it will appear as a null column.

    import org.apache.spark.sql.types.{DataType, StructType}

    // readExceptionPath and dataPath are assumed to be defined elsewhere
    val schemaJson = """
    {
      "type": "struct",
      "fields": [
        { "name": "field1", "type": "string", "nullable": true, "metadata": {} },
        { "name": "field2", "type": "string", "nullable": true, "metadata": {} }
      ]
    }
    """

    val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

    val djson = sqlContext.read
      .schema(schema)
      .option("badRecordsPath", readExceptionPath)
      .json(dataPath)
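Hand-written schema JSON is easy to get wrong, since a single missing quote or comma makes it unparseable. One way to avoid that is to build the structure in code and serialize it. The sketch below does this with Python's standard json module; the helper string_field is hypothetical, and field1 and field2 are the placeholder names from the snippet above. In PySpark the resulting dict could presumably be handed to StructType.fromJson, but that step is left out so the example runs without Spark.

```python
import json


def string_field(name):
    """Hypothetical helper: a nullable string field in Spark's schema JSON layout."""
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


schema_dict = {
    "type": "struct",
    "fields": [string_field("field1"), string_field("field2")],
}

# Serializing with json.dumps guarantees correct quoting and commas
schema_json = json.dumps(schema_dict, indent=2)

# Round-trip to confirm the text parses back to the same structure
parsed = json.loads(schema_json)
field_names = [field["name"] for field in parsed["fields"]]

print(field_names)  # ['field1', 'field2']
```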