I'm new to Spark and I'm trying to work through a Spark tutorial: link to the tutorial
After installing it on my local machine (Win10 64-bit, Python 3, Spark 2.4.0) and setting all the environment variables (HADOOP_HOME, SPARK_HOME, etc.), I'm trying to run a simple Spark job via the file WordCount.py:
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Run locally with 2 worker threads
    conf = SparkConf().setAppName("Word count").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # Read the input file and count how often each word occurs
    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/Word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()

    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))
After running it from the terminal:
spark-submit WordCount.py
I get the error below. I verified (by commenting out the code line by line) that it crashes at
wordCounts = words.countByValue()
Any idea what I should check to make it work?
Traceback (most recent call last):
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.Zip\pyspark\worker.py", line 25, in <module>
ModuleNotFoundError: No module named 'resource'
18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
wordCounts = words.countByValue()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.Zip\pyspark\rdd.py", line 1261, in countByValue
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.Zip\pyspark\rdd.py", line 844, in reduce
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.Zip\pyspark\rdd.py", line 816, in collect
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.Zip\py4j\Java_gateway.py", line 1257, in __call__
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.Zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.Apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
As suggested by theplatypus - I checked whether the 'resource' module can be imported directly from the terminal - apparently it cannot:
>>> import resource
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'resource'
In terms of installation resources - I followed the instructions from this tutorial:
Are there any additional resources I should install?
I had the same error. I solved it by installing the previous version of Spark (2.3 instead of 2.4). Now it works perfectly; perhaps it is an issue with the latest version of pyspark.
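For reference, if pyspark was installed through pip, the downgrade described above might look like the following (assuming a 2.3.x release such as 2.3.2; with a manually unpacked Spark distribution you would instead download the spark-2.3.x binary package and point SPARK_HOME at it):
pip uninstall pyspark
pip install pyspark==2.3.2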
Looking at the source of the error (worker.py#L25), it seems that the Python interpreter used to instantiate a pyspark worker doesn't have access to the resource module, a built-in module referenced in the Python documentation as part of "Unix Specific Services".
Are you sure you can run pyspark on Windows (without some additional software such as GOW or MingW at least), and that you haven't skipped any Windows-specific installation steps?
Could you open a Python console (the one used by pyspark) and see whether you can >>> import resource without getting the same ModuleNotFoundError? If not, could you share the resources you used to install it on W10?
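As a quick sanity check (a sketch, assuming the console is started with the same python.exe that spark-submit picks up, e.g. the one pointed to by PYSPARK_PYTHON), you could compare the interpreter path with the failing import:
>>> import sys
>>> sys.executable   # which Python interpreter pyspark is actually running on
>>> import resource  # built into CPython on Unix only; on Windows this raises ModuleNotFoundError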
Downgrading Spark back to 2.3.2 from 2.4.0 was not enough for me. I don't know why, but in my case I had to create the SparkContext from the SparkSession, like
sc = spark.sparkContext
Then the same error disappeared.
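For completeness, a minimal sketch of WordCount.py rewritten that way (reusing the app name and input path from the question; not necessarily the only fix) might be:
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Build a SparkSession and take the SparkContext from it
    spark = SparkSession.builder \
        .appName("Word count") \
        .master("local[2]") \
        .getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/Word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()
    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))

    spark.stop()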