PySpark in an iPython notebook raises Py4JJavaError when using count() and first()

Question

I am using PySpark (v2.1.0) in an iPython notebook (Python v3.6) in a virtualenv on my Mac (Sierra 10.12.3 beta).

1. I launched the iPython notebook by running this command in Terminal -

 PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark

2. Loaded my file into the Spark context and made sure it was loaded -

 >>> lines = sc.textFile("/Users/PanchusMac/Dropbox/Learn_py/Virtual_Env/pyspark/README.md")
 >>> for i in lines.collect(): print(i)

This worked fine and printed the result to my console as shown:

 # Apache Spark

 Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

 <http://spark.apache.org/>

 ## Online Documentation

 You can find the latest Spark documentation, including a programming guide, on the [project web page](http://spark.apache.org/documentation.html). This README file only contains basic setup instructions.

I also checked sc -

 >>> print(sc)
 <pyspark.context.SparkContext object at 0x101ce4cc0>

Now, when I try to run lines.count() or lines.first() on my RDD, I get the following error:

 Py4JJavaError                             Traceback (most recent call last)
 <ipython-input-33-44aeefde846d> in <module>()
 ----> 1 lines.count()

 /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in count(self)
    1039         3
    1040         """
 -> 1041         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    1042
    1043     def stats(self):

 /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in sum(self)
    1030         6.0
    1031         """
 -> 1032         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    1033
    1034     def count(self):

 /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in fold(self, zeroValue, op)
     904         # zeroValue provided to each partition is unique from the one provided
     905         # to the final reduce call
 --> 906         vals = self.mapPartitions(func).collect()
     907         return reduce(op, vals, zeroValue)
     908

 /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in collect(self)
     807         """
     808         with SCCallSiteSync(self.context) as css:
 --> 809             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
     810         return list(_load_from_socket(port, self._jrdd_deserializer))
     811

 /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
    1131         answer = self.gateway_client.send_command(command)
    1132         return_value = get_return_value(
 -> 1133             answer, self.gateway_client, self.target_id, self.name)
    1134
    1135         for temp_arg in temp_args:

 /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
      61     def deco(*a, **kw):
      62         try:
 ---> 63             return f(*a, **kw)
      64         except py4j.protocol.Py4JJavaError as e:
      65             s = e.java_exception.toString()

 /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
     317                 raise Py4JJavaError(
     318                     "An error occurred while calling {0}{1}{2}.\n".
 --> 319                     format(target_id, ".", name), value)
     320             else:
     321                 raise Py4JError(

 Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 22, localhost, executor driver): org.apache.spark.SparkException:
 Error from python worker:
   Traceback (most recent call last):
     File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
     File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
       __import__(pkg_name)
     File "<frozen importlib._bootstrap>", line 961, in _find_and_load
     File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
     File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
     File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in <module>
     File "<frozen importlib._bootstrap>", line 961, in _find_and_load
     File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
     File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
     File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 36, in <module>
     File "<frozen importlib._bootstrap>", line 961, in _find_and_load
     File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
     File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
     File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/java_gateway.py", line 25, in <module>
     File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module>
       "system node release version machine processor")
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in namedtuple
   TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
 PYTHONPATH was:
   /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/:
 java.io.EOFException
     at java.io.DataInputStream.readInt(DataInputStream.java:392)
     at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
     at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
     at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
     at org.apache.spark.scheduler.Task.run(Task.scala:99)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
     at scala.Option.foreach(Option.scala:257)
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
     at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
     at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
     at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
     at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:497)
     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
     at py4j.Gateway.invoke(Gateway.java:280)
     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
     at py4j.commands.CallCommand.execute(CallCommand.java:79)
     at py4j.GatewayConnection.run(GatewayConnection.java:214)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.spark.SparkException:
 Error from python worker:
   Traceback (most recent call last):
     File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
     File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
       __import__(pkg_name)
     File "<frozen importlib._bootstrap>", line 961, in _find_and_load
     File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
     File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
     File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 44, in <module>
     File "<frozen importlib._bootstrap>", line 961, in _find_and_load
     File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
     File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
     File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 36, in <module>
     File "<frozen importlib._bootstrap>", line 961, in _find_and_load
     File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
     File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
     File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/java_gateway.py", line 25, in <module>
     File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module>
       "system node release version machine processor")
     File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in namedtuple
   TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
 PYTHONPATH was:
   /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/:
 java.io.EOFException
     at java.io.DataInputStream.readInt(DataInputStream.java:392)
     at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
     at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
     at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
     at org.apache.spark.scheduler.Task.run(Task.scala:99)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     ... 1 more

Could someone explain what went wrong? Note: when I ran the same operations in my Mac terminal, they worked as expected.

Mariusz · Accepted Answer

PySpark 2.1.0 is not compatible with Python 3.6; see https://issues.apache.org/jira/browse/SPARK-19019.
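For context, the `TypeError` at the bottom of the worker traceback can be reproduced without Spark. As far as I understand the linked ticket, PySpark 2.1.0 re-creates `collections.namedtuple` by copying it with `types.FunctionType`, which does not carry over keyword-only defaults, and Python 3.6 is the release in which `namedtuple` gained keyword-only arguments (`verbose`, `rename`, `module`). The sketch below is a minimal illustration under that assumption; `make_record` and both copy helpers are hypothetical stand-ins, not PySpark code.

```python
import types


def make_record(name, fields, *, verbose=False, rename=False, module=None):
    """Hypothetical stand-in for a function with keyword-only defaults."""
    return (name, fields, verbose, rename, module)


def copy_func_without_kwdefaults(f):
    # Copies code, globals, name, positional defaults and closure,
    # but NOT the keyword-only defaults stored in f.__kwdefaults__.
    return types.FunctionType(
        f.__code__, f.__globals__, f.__name__, f.__defaults__, f.__closure__
    )


def copy_func_with_kwdefaults(f):
    # Same copy, with the keyword-only defaults restored afterwards.
    fn = copy_func_without_kwdefaults(f)
    fn.__kwdefaults__ = f.__kwdefaults__
    return fn


broken = copy_func_without_kwdefaults(make_record)
try:
    broken("Point", "x y")
    error_message = None
except TypeError as exc:
    # The copy no longer knows the defaults, so they look "required".
    error_message = str(exc)

fixed = copy_func_with_kwdefaults(make_record)
result = fixed("Point", "x y")

print(error_message)
print(result)
```

The message printed for the broken copy has the same shape as the one in the question: three keyword-only arguments reported as missing even though the original function has defaults for all of them.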

You need to use an earlier version of Python, or you can try building master or the 2.1 branch from GitHub, which should work.
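One way to fail fast instead of hitting the worker error is to check the interpreter version before running any Spark job. This is a minimal sketch that assumes only what this answer states (Spark 2.1.0 does not work with Python 3.6); the helper name `python_ok_for_spark_2_1_0` is hypothetical, and whether a later Spark release lifts the restriction should be checked against the ticket.

```python
import sys


def python_ok_for_spark_2_1_0(version_info=None):
    """Return True if the given (major, minor, ...) version predates Python 3.6."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) < (3, 6)


if not python_ok_for_spark_2_1_0():
    # PYSPARK_PYTHON is the variable used in the question to pick the interpreter.
    print(
        "Python %d.%d is too new for Spark 2.1.0; "
        "point PYSPARK_PYTHON at an older interpreter." % sys.version_info[:2]
    )
```

Note that the traceback in the question comes from the Python worker process, so the check has to hold for the interpreter named by `PYSPARK_PYTHON`, not only for the one running the notebook.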

Raja Rajan · Answer

Yes, I had the same problem a long time ago with PySpark in Anaconda. I tried several ways to fix it and eventually found on my own that installing Java for Anaconda separately resolved it.

https://anaconda.org/cyclus/java-jdk

Tung Nguyen · Answer

If you are using Anaconda, try installing java-jdk for Anaconda:

conda install -c cyclus java-jdk
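To confirm that the install made a JVM visible to the environment the notebook runs in, a quick check from Python can help. This is a minimal sketch using only the standard library; `find_java` is a hypothetical helper, and finding `java` on `PATH` says nothing about the Python 3.6 incompatibility described in the accepted answer.

```python
import shutil


def find_java(which=shutil.which):
    """Return the path of the `java` executable found on PATH, or None."""
    return which("java")


java_path = find_java()
print(java_path if java_path else "no java executable found on PATH")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use, call `find_java()` with no arguments.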