Error on collect()

Hello, we are getting the following error when calling collect() on an RDD:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 25.0 failed 4 times, most recent failure: Lost task 6.3 in stage 25.0 (TID 807, iccluster044.iccluster.epfl.ch, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last): etc...

The thing is that other actions, like take(), work. Does anyone know what this could be? The error message is not informative, unfortunately. Thank you!

Top comment

This is probably because you are trying to collect() a large amount of data into the driver node's memory. You should never load the whole raw ratings data into memory (it is too big, which is why we are using Spark to perform distributed computations). Always use Spark RDD transformations and aggregate per user or per item before calling collect().
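For example, here is a minimal PySpark sketch of that pattern; the file name and the "user_id,item_id,rating" line layout are only illustrative assumptions, not the actual lab data:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical ratings file with lines like "user_id,item_id,rating"
ratings = (sc.textFile("ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda fields: (fields[0], float(fields[2]))))  # (user, rating)

# Risky: pulls every raw rating into the driver's memory
# all_ratings = ratings.collect()

# Better: reduce to one (sum, count) pair per user on the executors,
# then compute the per-user average, so only the small aggregated
# result is sent to the driver.
per_user_avg = (ratings
                .mapValues(lambda r: (r, 1))
                .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                .mapValues(lambda s: s[0] / s[1]))

print(per_user_avg.collect()[:5])  # small: one entry per user

This way the data returned by collect() is proportional to the number of users rather than the number of raw ratings.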

Hey, I would also like to know what causes this problem. Sometimes restarting the kernel gets rid of it, but it comes back too often, and it is always due to the collect() method.

I came across the same problem when trying to collect a small amount of data too. For instance, when I transform the array obtained from my-ratings.txt into an RDD and call collect(), I get the same error. Should this still happen?

Sorry, missed this. Indeed, collect() should not cause this error for a small amount of data; it could be due to the cluster being overloaded at that particular time. But I trust you were able to work at other times and make progress on lab 3 despite this?