How to resolve this error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException

Question

I have two processes. Each process 1) connects to an Oracle DB and reads a specific table, 2) builds a DataFrame and processes it, and 3) saves the DataFrame to Cassandra.

If I run both processes in parallel, both try to read from Oracle, and I get the error below while the second process is reading the data:

ERROR ValsProcessor2: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#290L])
   +- *(1) Scan JDBCRelation((SELECT * FROM BM_VALS WHERE ROWNUM <= 10) T) [numPartitions=2] [] PushedFilters: [], ReadSchema: struct<>
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
    at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770)
    at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
    at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
    at com.snp.processors.BenchmarkModelValsProcessor2.process(BenchmarkModelValsProcessor2.scala:43)
    at com.snp.utils.Utils$$anonfun$getAllDefinedProcessors$2.apply(Utils.scala:28)
    at com.snp.utils.Utils$$anonfun$getAllDefinedProcessors$2.apply(Utils.scala:28)
    at com.sp.MigrationDriver$$anonfun$main$2$$anonfun$apply$1.apply(MigrationDriver.scala:78)
    at com.sp.MigrationDriver$$anonfun$main$2$$anonfun$apply$1.apply(MigrationDriver.scala:78)
    at scala.Option.map(Option.scala:146)
    at com.sp.MigrationDriver$$anonfun$main$2.apply(MigrationDriver.scala:75)
    at com.sp.MigrationDriver$$anonfun$main$2.apply(MigrationDriver.scala:74)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.MapLike$DefaultKeySet.foreach(MapLike.scala:174)
    at com.sp.MigrationDriver$.main(MigrationDriver.scala:74)
    at com.sp.MigrationDriver.main(MigrationDriver.scala)
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.needToCopyObjectsBeforeShuffle(ShuffleExchangeExec.scala:163)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:300)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:91)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ... 37 more

What am I doing wrong here? How can I fix this?

BdEngineer · Accepted Answer

I was closing the SparkSession in the finally block of the first processor/class that gets called. I moved that call out of the processor and into the calling class, which resolved the problem.
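A likely explanation, inferred from the stack trace rather than confirmed in this thread: sessions obtained through `SparkSession.builder().getOrCreate()` share one underlying SparkContext, so stopping it in one processor's finally block leaves the other processor running against a stopped context. Below is a minimal sketch of the fix described above, where only the caller creates and stops the session. The names `Processor`, `CountProcessor`, and `DriverSketch` are hypothetical, and `spark.range` stands in for the Oracle read.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical processor interface: it receives the session and never stops it.
trait Processor {
  def process(spark: SparkSession): Long
}

// Stand-in for a real processor; spark.range replaces the Oracle JDBC read.
class CountProcessor(rows: Long) extends Processor {
  def process(spark: SparkSession): Long = spark.range(rows).count()
}

object DriverSketch {
  // Runs every processor against the same live session.
  def runAll(spark: SparkSession, processors: Seq[Processor]): Seq[Long] =
    processors.map(_.process(spark))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("migration-sketch")
      .master("local[2]") // assumption: local mode, for illustration only
      .getOrCreate()
    try {
      println(runAll(spark, Seq(new CountProcessor(10), new CountProcessor(5))))
    } finally {
      spark.stop() // the only stop() call, made after all processors have finished
    }
  }
}
```

The processors here run one after another; if they run in parallel, the same rule applies, with `stop()` called only once every one of them has completed.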

Alex Ott · Answer

I'm not sure of the real cause. The only thing that caught my attention is the following SQL expression: (SELECT * FROM BM_VALS WHERE ROWNUM <= 10) T - what does the T mean here?

Regarding the overall design, I would recommend a completely different approach. In your case you have 2 processors working on the same data fetched from Oracle, and each processor retrieves that data separately. I would recommend moving the Oracle read into a separate procedure that returns the DataFrame (you need to cache it); your processors would then work on that DataFrame and save the data to Cassandra.
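A minimal sketch of that read-once design, under stated assumptions: `SharedReadSketch` and `runWithSharedSource` are hypothetical names, `load` stands in for the Oracle JDBC read, and each entry in `processors` stands in for a transformation followed by its Cassandra write.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SharedReadSketch {
  // Loads the source once, caches it, and hands the same DataFrame to every processor.
  def runWithSharedSource(
      spark: SparkSession,
      load: SparkSession => DataFrame,
      processors: Seq[DataFrame => Unit]): Unit = {
    val source = load(spark).cache()
    try {
      source.count() // forces the read so later actions can reuse the cached data
      processors.foreach(p => p(source))
    } finally {
      source.unpersist() // release the cached blocks once every processor is done
    }
  }
}
```

In the real job, `load` would wrap something like `spark.read.jdbc(url, table, props)`, with the URL, table, and connection properties supplied by your configuration. If cached blocks are evicted under memory pressure, Spark recomputes them from the source, so caching reduces the repeated Oracle reads without strictly guaranteeing a single one.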

Or, as was recommended earlier, you can split the work into 2 parts: one job that pulls all the data from Oracle and stores the DataFrame on disk (not with persist, but with write), for example as a Parquet file, followed by separate jobs that read the data from disk and perform the necessary transformations.
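The two-stage variant could be sketched as follows. `ParquetStagingSketch`, `stage`, and `loadStaged` are hypothetical names, the staging path is whatever shared location your cluster can read, and the `source` argument stands in for the DataFrame read from Oracle.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ParquetStagingSketch {
  // Job 1: extract once and stage the result on disk as Parquet.
  def stage(source: DataFrame, path: String): Unit =
    source.write.mode(SaveMode.Overwrite).parquet(path)

  // Jobs 2..n: read the staged data instead of querying Oracle again.
  def loadStaged(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)
}
```

Each downstream job then calls `loadStaged`, applies its own transformations, and writes to Cassandra, so the extraction and the processing can fail and be rerun independently.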

In both scenarios, you