I get an error when I engineer features over more than 30 columns to create around 200+ columns. The job does not fail, but the ERROR is displayed. I want to know how to avoid it.
Spark - 2.3.1 Python - 3.6
Cluster configuration - 1 master: 32 GB RAM, 16 cores; 4 slaves: 16 GB RAM, 8 cores
Input data - parquet files in 8 partitions with snappy compression.
My spark-submit ->
spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt
Stack trace below -
ERROR CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:97)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2924)
at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.janino.InternalCompilerException: Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
The problem is that when the Java programs generated by Catalyst from DataFrame and Dataset code are compiled into Java bytecode, the bytecode of a single method must not reach 64 KB or more. This runs into a limitation of the Java class-file format, which is the exception that occurs here.
Hiding the error:
spark.sql.codegen.wholeStage=false
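For instance, with the spark-submit command shown in the question, this can be passed as one more --conf flag (it only masks the error by disabling whole-stage code generation; it does not shrink the generated code):
--conf spark.sql.codegen.wholeStage=false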
Solution:
To avoid the exception caused by the restriction above, Spark's approach is to split the methods whose generated Java bytecode is likely to exceed 64 KB into several smaller methods when Catalyst generates the Java program; this splitting has been implemented in Spark.
Use persist or any other logical separation in the pipeline.
We resolved this error by adding extra "checkpoints" in the code.
Checkpoints = you write the dataframe (the data) back to disk (S3 in our case) and then read it back into a new dataframe, which flushes the Spark JVM containers and restarts with freshly generated code.
Details on checkpointing:
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
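A minimal sketch of that pattern, assuming a SparkSession named spark and a hypothetical S3 path (any durable storage such as HDFS or local disk works the same way):
# Hypothetical checkpoint location; replace with your own bucket/path.
checkpoint_path = "s3a://my-bucket/tmp/pipeline_checkpoint"
# Materialize the intermediate result, then read it back:
# the new dataframe has no lineage, so Catalyst generates fresh, smaller code downstream.
df.write.mode("overwrite").parquet(checkpoint_path)
df = spark.read.parquet(checkpoint_path)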
As written by vaquar, introducing a logical separation in the pipeline should help.
One way to cut the lineage and introduce a break in the plan seems to be the DF -> RDD -> DF round-trip conversion:
df = spark_session.createDataFrame(df.rdd, schema=df.schema)
In the book High Performance Spark they further mention that it is better (faster) to do this with the underlying Java RDD, i.e. using
j_rdd = df._jdf.toJavaRDD()
and its schema j_schema = df._jdf.schema()
to construct a new Java DataFrame, and finally to convert it back to a PySpark DataFrame:
from pyspark.sql import DataFrame

sql_ctx = df.sql_ctx
java_sql_context = sql_ctx._jsqlContext
new_java_df = java_sql_context.createDataFrame(j_rdd, j_schema)
new_df = DataFrame(new_java_df, sql_ctx)
If you are using pyspark 2.3+, try

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('local')
    .appName('tow-way')
    # add this line to disable whole-stage code generation
    .config('spark.sql.codegen.wholeStage', 'false')
    .getOrCreate()
)
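The same setting can also be applied to an already-created session at runtime, for example:
spark.conf.set('spark.sql.codegen.wholeStage', 'false')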