here is the response from @sumwale -- Shuffle RDDs in the plan are searched in the DAG and explicitly cleared in the background after execution. If there is another collect of the same cached plan, then any pending clears are first waited on.
Note that the same problem exists in Spark: it will reuse the previously cached shuffle result if the same plan is collected multiple times. For example, if one calls collect() on the same DataFrame multiple times, obsolete results can be returned. So in Spark one should re-execute the query every time.
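A minimal sketch of the caveat above (the table name `t` is hypothetical): rebuilding the DataFrame makes each collect() execute a fresh plan instead of reusing shuffle output computed earlier, which matters if the underlying data can change between collects.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("recollect-sketch")
  .master("local[*]")
  .getOrCreate()

// Risky pattern: reusing one DataFrame instance across collects may return
// results based on the earlier shuffle output if the data changed in between.
// val df = spark.sql("SELECT col, count(*) FROM t GROUP BY col")
// df.collect(); df.collect()

// Safer pattern per the note above: construct the query anew each time so a
// fresh plan (and fresh shuffle) is produced.
def freshResult() =
  spark.sql("SELECT col, count(*) AS cnt FROM t GROUP BY col").collect()

val first  = freshResult()
val second = freshResult()  // re-executes rather than reusing the cached plan
```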
Hi folks,
Trying to load a table from multiple paths on S3 via SQL: https://jira.snappydata.io/browse/SNAP-3331
Can somebody suggest a simple workaround for now? I can only think of using Scala instead of SQL; see the sketch below.
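A possible workaround sketch along those lines: load the multiple S3 paths through the Scala DataFrameReader (whose `parquet`/`csv` methods accept varargs paths) and expose the result to SQL via a temporary view. The bucket, paths, format, and view name here are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-path-load")
  .getOrCreate()

// DataFrameReader.parquet takes multiple paths, unlike the single-path SQL form.
val df = spark.read
  .parquet("s3a://my-bucket/path1", "s3a://my-bucket/path2")

// Register a temp view so the rest of the workload can stay in SQL.
df.createOrReplaceTempView("my_table")
spark.sql("SELECT count(*) FROM my_table").show()
```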