@cleaton Hi, I only use logInfo in foreachPartition and it's logInfo("test") caused the exception
@sayuan & @cleaton 二位都很歡迎... XD
@jaminglam I assume loginfo is a member function on some class/object.
if it's a member function on a class the whole class instance will be serialized and sent to each node
if it's an object (scala object) the object will be instantiated already on each node (as a singleton) and the each node can use the local object member function without serializing the whole object.
This is one of the traps with the simplicity of the spark programming model. it's simple until it's not.
You have to consider in which scope the function will run (on driver or on executor) and consider what objects have been initialized where.
The recommendation is to try and structure your program using objects as much as possible (objects and lambda functions)
Not OO with classes
The OO approach can easily pull in a lot of dependencies into each serialized task.
I can't really tell if that is your issue in this case though. The example is too small to tell from.
謝謝講師昨天的分享還有 mesos 的 plus 分享
再麻煩取得 slide 讓我 commit
@cleaton Thanks for your advice, I agree with your point that avoid using too many classes and using object instead. logInfo() is a member function of trait org.apache.spark.Logging. It is weird that I read source code of apache spark 1.6, I found some codes such as JdbcUtils (org.apache.spark.sql.execution.datasources.jdbc) also use member functions of Logging in foreachPartition and based on my previous experience, in other function(map, reduceByKey), logInfo works well, exception only occurs in foreachPartition. Besides, how to log for debug while developing spark application, do you have any advice?
I mean is it a class or an object that you extend with the spark.Logging trait?
JdbcUtils is an object.
If you only change foreachPartition to foreach it works?
yes, does you mean that If I have a class which include codes doing foreachPartition. Spark will serialize the whole class code?
the function will be a member of the class instance, and thus it will only exist on the driver at first
when you try to call that function inside foreach/foreachpartition the driver will need to send the whole class instance to the executor machine
if it's an object it is initalized separately
for each jvm
and thus spark only needs to give the method reference
(and any parameter you give on the driver side)
You should be able to see more information in the log from the spark driver
this error happens before it reaches any of the executor nodes.