Python worker failed to connect back









I'm new to Spark and am trying to complete a Spark tutorial:
link to tutorial



After installing Spark on my local machine (Windows 10 64-bit, Python 3, Spark 2.4.0) and setting the environment variables (HADOOP_HOME, SPARK_HOME, etc.), I'm trying to run a simple Spark job from a WordCount.py file:



from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Run locally with two worker threads
    conf = SparkConf().setAppName("word count").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    # countByValue() is an action: it runs the job and returns the counts to the driver
    wordCounts = words.countByValue()

    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))


When I run it from the terminal:



spark-submit WordCount.py


I get the error below.
By commenting out the code line by line, I found that it crashes at:



wordCounts = words.countByValue()


Any idea what I should check to make it work?
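
A note on why the failure shows up at that line: `textFile` and `flatMap` are lazy transformations, while `countByValue` is an action, so it is the first call that actually runs a job and starts a Python worker. For comparison, the result a working setup should produce can be sketched without Spark. This is only an illustration: `count_words` and the sample lines are made up here and are not part of the tutorial.

```python
from collections import Counter


def count_words(lines):
    """Spark-free stand-in for lines.flatMap(split).countByValue()."""
    counts = Counter()
    for line in lines:
        # Same tokenisation as the job: split each line on single spaces
        counts.update(line.split(" "))
    return dict(counts)


# Made-up sample text standing in for word_count.text
sample = ["to be or not", "to be"]
for word, count in count_words(sample).items():
    print("{} : {}".format(word, count))
```

If the Spark job eventually prints the same kind of `word : count` pairs for the real input file, the worker problem is resolved.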



Traceback (most recent call last):
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module>
ModuleNotFoundError: No module named 'resource'
18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
wordCounts = words.countByValue()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1261, in countByValue
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 844, in reduce
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 816, in collect
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more


As suggested by theplatypus, I checked whether the 'resource' module can be imported directly from the terminal. Apparently not:



>>> import resource
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'resource'
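
For context, the Python documentation lists the standard-library `resource` module as available on Unix only, so this import is expected to fail on Windows with any Python distribution; it does not by itself point to a broken Anaconda install. The first traceback shows `pyspark\worker.py` importing the module at line 25, which is why the worker process dies before it can connect back. A minimal sketch for checking this without triggering the exception is below. `has_module` is a hypothetical helper name. Whether a different Spark release avoids the import is something to verify in the Spark release notes; nothing in this post establishes it.

```python
import importlib.util
import platform


def has_module(name):
    """Return True if the import system can locate `name`, without importing it."""
    return importlib.util.find_spec(name) is not None


print("platform:", platform.system())
print("resource available:", has_module("resource"))
```

On a Windows machine the second line should report `False`, matching the interactive session above.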


As for the installation, I followed the instructions from this tutorial:



  1. downloaded spark-2.4.0-bin-hadoop2.7.tgz from the Apache Spark website

  2. unzipped it to my C: drive

  3. already had Python 3 (Anaconda distribution) and Java installed

  4. created a local 'C:\hadoop\bin' folder to store winutils.exe

  5. created a 'C:\tmp\hive' folder and gave Spark access to it

  6. added the environment variables (SPARK_HOME, HADOOP_HOME, etc.)

Is there any extra resource I should install?
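
The setup steps above can be sanity-checked with a short script. This is a sketch under assumptions, not an official Spark check: `check_spark_env` is a hypothetical helper, and it assumes that winutils.exe is expected under the `bin` folder of HADOOP_HOME, as step 4 suggests.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found with the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    for var in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(var)
        if not value:
            problems.append("{} is not set".format(var))
        elif not os.path.isdir(value):
            problems.append("{} points to a missing folder: {}".format(var, value))
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        # Assumption: winutils.exe lives in %HADOOP_HOME%\bin (step 4 above)
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append("winutils.exe not found at {}".format(winutils))
    return problems


for problem in check_spark_env() or ["no problems found"]:
    print(problem)
```

A clean result only confirms that the variables and files are in place; it says nothing about the missing `resource` module, which is a separate issue.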










share|improve this question



























    up vote
    0
    down vote

    favorite












    I'm a newby with Spark and trying to complete a Spark tutorial:
    link to tutorial



    After installing it on local machine (Win10 64, Python 3, Spark 2.4.0) and setting all env variables (HADOOP_HOME, SPARK_HOME etc) I'm trying to run a simple Spark job via WordCount.py file:



    from pyspark import SparkContext, SparkConf

    if __name__ == "__main__":
    conf = SparkConf().setAppName("word count").setMaster("local[2]")
    sc = SparkContext(conf = conf)

    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()

    for word, count in wordCounts.items():
    print(" : ".format(word, count))


    After running it from terminal:



    spark-submit WordCount.py


    I get below error.
    I checked (by commenting out line by line) that it crashes at



    wordCounts = words.countByValue()


    Any idea what should I check to make it work?



    Traceback (most recent call last):
    File "C:UsersmjdbrAnaconda3librunpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
    File "C:UsersmjdbrAnaconda3librunpy.py", line 85, in _run_code
    exec(code, run_globals)
    File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkworker.py", line 25, in <module>
    ModuleNotFoundError: No module named 'resource'
    18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
    ... 14 more
    18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
    Traceback (most recent call last):
    File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
    wordCounts = words.countByValue()
    File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkrdd.py", line 1261, in countByValue
    File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkrdd.py", line 844, in reduce
    File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkrdd.py", line 816, in collect
    File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpy4j-0.10.7-src.zippy4jjava_gateway.py", line 1257, in __call__
    File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpy4j-0.10.7-src.zippy4jprotocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
    Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
    ... 14 more

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)
    Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    ... 1 more
    Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
    ... 14 more


    As suggested by theplatypus - checked if the 'resource' module can be imported directly from terminal - apparently not:



    >>> import resource
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'resource'


    In terms of installation resources - I followed instructions from this tutorial:



    1. downloaded spark-2.4.0-bin-hadoop2.7.tgz from Apache Spark website

    2. un-zipped it to my C-drive

    3. already had Python_3 installed (Anaconda distribution) as well as Java

    4. created local 'C:hadoopbin' folder to store winutils.exe

    5. created 'C:tmphive' folder and gave Spark access to it

    6. added environment variables (SPARK_HOME, HADOOP_HOME etc)

    Is there any extra resource I should install?










    share|improve this question

























      up vote
      0
      down vote

      favorite









      up vote
      0
      down vote

      favorite











      I'm a newby with Spark and trying to complete a Spark tutorial:
      link to tutorial



      After installing it on local machine (Win10 64, Python 3, Spark 2.4.0) and setting all env variables (HADOOP_HOME, SPARK_HOME etc) I'm trying to run a simple Spark job via WordCount.py file:



      from pyspark import SparkContext, SparkConf

      if __name__ == "__main__":
      conf = SparkConf().setAppName("word count").setMaster("local[2]")
      sc = SparkContext(conf = conf)

      lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
      words = lines.flatMap(lambda line: line.split(" "))
      wordCounts = words.countByValue()

      for word, count in wordCounts.items():
      print(" : ".format(word, count))


      After running it from terminal:



      spark-submit WordCount.py


      I get below error.
      I checked (by commenting out line by line) that it crashes at



      wordCounts = words.countByValue()


      Any idea what should I check to make it work?



      Traceback (most recent call last):
      File "C:UsersmjdbrAnaconda3librunpy.py", line 193, in _run_module_as_main
      "__main__", mod_spec)
      File "C:UsersmjdbrAnaconda3librunpy.py", line 85, in _run_code
      exec(code, run_globals)
      File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkworker.py", line 25, in <module>
      ModuleNotFoundError: No module named 'resource'
      18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
      org.apache.spark.SparkException: Python worker failed to connect back.
      at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
      at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
      at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
      at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
      at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      at org.apache.spark.scheduler.Task.run(Task.scala:121)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.net.SocketTimeoutException: Accept timed out
      at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
      at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
      at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
      at java.net.PlainSocketImpl.accept(Unknown Source)
      at java.net.ServerSocket.implAccept(Unknown Source)
      at java.net.ServerSocket.accept(Unknown Source)
      at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
      ... 14 more
      18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
      Traceback (most recent call last):
      File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
      wordCounts = words.countByValue()
      File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkrdd.py", line 1261, in countByValue
      File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkrdd.py", line 844, in reduce
      File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpyspark.zippysparkrdd.py", line 816, in collect
      File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpy4j-0.10.7-src.zippy4jjava_gateway.py", line 1257, in __call__
      File "C:Sparkspark-2.4.0-bin-hadoop2.7pythonlibpy4j-0.10.7-src.zippy4jprotocol.py", line 328, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
      Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
      at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
      at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
      at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
      at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
      at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      at org.apache.spark.scheduler.Task.run(Task.scala:121)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.net.SocketTimeoutException: Accept timed out
      at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
      at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
      at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
      at java.net.PlainSocketImpl.accept(Unknown Source)
      at java.net.ServerSocket.implAccept(Unknown Source)
      at java.net.ServerSocket.accept(Unknown Source)
      at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
      ... 14 more

      Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
      at scala.Option.foreach(Option.scala:257)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
      at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
      at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
      at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
      at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:282)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:238)
      at java.lang.Thread.run(Unknown Source)
      Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
      at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
      at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
      at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
      at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
      at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      at org.apache.spark.scheduler.Task.run(Task.scala:121)
      at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      ... 1 more
      Caused by: java.net.SocketTimeoutException: Accept timed out
      at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
      at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
      at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
      at java.net.PlainSocketImpl.accept(Unknown Source)
      at java.net.ServerSocket.implAccept(Unknown Source)
      at java.net.ServerSocket.accept(Unknown Source)
      at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
      ... 14 more


      As suggested by theplatypus, I checked whether the 'resource' module can be imported directly from the terminal. Apparently not:



      >>> import resource
      Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      ModuleNotFoundError: No module named 'resource'
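The same check can be scripted so that it also reports which interpreter is being used. This is only a minimal sketch using the standard library; `resource_module_available` is a hypothetical helper name, not part of PySpark.

```python
import importlib.util
import sys


def resource_module_available():
    """Return True if the Unix-only 'resource' module can be located."""
    return importlib.util.find_spec("resource") is not None


if __name__ == "__main__":
    # sys.executable shows which interpreter is actually running this check.
    print("interpreter:", sys.executable)
    print("platform:", sys.platform)
    print("resource importable:", resource_module_available())
```

Running it with the same Python that spark-submit picks up should agree with the interactive `import resource` result above.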


      As for the installation, I followed the instructions from this tutorial:



      1. downloaded spark-2.4.0-bin-hadoop2.7.tgz from the Apache Spark website

      2. unzipped it to my C: drive

      3. already had Python 3 installed (Anaconda distribution) as well as Java

      4. created a local 'C:\hadoop\bin' folder to store winutils.exe

      5. created a 'C:\tmp\hive' folder and gave Spark access to it

      6. added environment variables (SPARK_HOME, HADOOP_HOME etc)

      Is there any extra resource I should install?
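Steps 4 and 6 above can be sanity-checked with a few lines of Python. The sketch below assumes only the two variables named in the question and the usual `bin\winutils.exe` location under HADOOP_HOME; `check_spark_env` is a hypothetical helper. It only checks the steps listed above and says nothing about the 'resource' import.

```python
import os


def check_spark_env(env, path_exists=os.path.exists):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not path_exists(value):
            problems.append(f"{name} points to a missing folder: {value}")
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        # Assumption: winutils.exe lives in the bin folder under HADOOP_HOME.
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not path_exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env(os.environ) or ["no problems found"]:
        print(problem)
```

The `path_exists` parameter is there so the function can be exercised without touching the real file system.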










      python windows apache-spark pyspark local






      asked Nov 11 at 19:06 by Mike D. (reputation 183); edited Nov 12 at 9:11






















          2 Answers
          Accepted answer (score: 1)






          I got the same error and solved it by installing the previous version of Spark (2.3 instead of 2.4). It now works perfectly, so this may be an issue with the latest version of pyspark.






          answered Nov 12 at 19:32 by Raf (reputation 261)











          • Yes, that worked!
            – Mike D.
            Nov 12 at 22:48
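Before downgrading, it may help to confirm which version is actually installed. The sketch below flags only the combination reported in this thread (Spark 2.4.0 on Windows). Both `parse_version` and `matches_reported_failure` are hypothetical helpers, and the fallback version string is taken from the question.

```python
import sys


def parse_version(version):
    """Turn a version string such as '2.4.0' into a tuple of ints, e.g. (2, 4, 0)."""
    return tuple(int(part) for part in version.split(".")[:3] if part.isdigit())


def matches_reported_failure(spark_version, platform=sys.platform):
    """True for the one combination reported in this thread: Spark 2.4.0 on Windows."""
    return platform.startswith("win") and parse_version(spark_version) == (2, 4, 0)


if __name__ == "__main__":
    try:
        import pyspark  # only importable where PySpark is installed
        version = pyspark.__version__
    except ImportError:
        version = "2.4.0"  # fall back to the version from the question
    print(version, "->", matches_reported_failure(version))
```

A True result only means the setup matches what was reported here. It does not prove the same root cause.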
















          Answer (score: 0)













          Looking at the source of the error (worker.py#L25), it seems that the Python interpreter used to instantiate a pyspark worker doesn't have access to the resource module, a built-in module that Python's documentation lists under "Unix Specific Services".

          Are you sure you can run pyspark on Windows without some additional software (GOW or MinGW at least), and that you didn't skip some Windows-specific installation steps?

          Could you open a Python console (the one used by pyspark) and check whether you can run >>> import resource without getting the same ModuleNotFoundError? If the import fails, could you share the resources you used to install it on Windows 10?
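For illustration, code that wants to use resource where it exists but still start on Windows can guard the import. This is a generic sketch of that pattern, not PySpark's worker.py; `address_space_limit` is a hypothetical function.

```python
try:
    import resource  # Unix-only; not available on Windows
except ImportError:
    resource = None


def address_space_limit():
    """Return the soft address-space limit in bytes, or None when it cannot be queried."""
    if resource is None or not hasattr(resource, "RLIMIT_AS"):
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    print("resource available:", resource is not None)
    print("address-space limit:", address_space_limit())
```

With a guard like this the module still loads on Windows, and the caller decides what to do when no limit can be read.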






          answered Nov 12 at 0:19 by theplatypus (reputation 715); edited Nov 12 at 10:37











          • Hi, I just edited the original question, adding the info you asked for.
            – Mike D.
            Nov 12 at 9:12

          • It seems the author of the tutorial had installed git beforehand, which on Windows might mean he also installed a Unix compatibility package (MinGW). Maybe you could try installing git as well?
            – theplatypus
            Nov 12 at 9:58

          • Otherwise, judging by this tutorial, it seems you can resolve this using GNU on Windows (GOW).
            – theplatypus
            Nov 12 at 10:00

          • So I deleted Spark and reinstalled it using the link you suggested. The same issue occurs (both with 'spark-submit' and the 'import resource' line), even with GOW installed.
            – Mike D.
            Nov 12 at 18:46

          • Have you installed git (with bash tools) as well?
            – theplatypus
            Nov 12 at 20:04















