I noticed thunder-project/thunder#125 because it was getting my whole cluster into a strange state, and I was curious whether anyone has seen something similar. If only about 5% of tasks fail, the stage is salvaged by spark.speculation and the job completes. If more than about 5% of tasks fail, the stage itself fails and Spark keeps retrying it indefinitely with different (arbitrary?) partitioning. Any ideas on what Spark is trying to do there?
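For reference, these are the settings I understand to govern this behavior (a sketch of my spark-defaults.conf, not a recommendation; the values shown are the documented defaults except for speculation, which I have enabled):

```properties
# Re-launch slow-running tasks on other executors; this is what
# papers over the occasional task failure for me.
spark.speculation              true
# Fraction of tasks that must finish before speculation kicks in (default 0.75).
spark.speculation.quantile     0.75
# A task is speculatable if it runs this many times slower than the median (default 1.5).
spark.speculation.multiplier   1.5

# Number of individual task failures before the whole stage is failed (default 4).
# My guess is that once enough tasks exhaust this, the stage aborts and the
# scheduler resubmits it, which would explain the retry loop I'm seeing.
spark.task.maxFailures         4
```

To be clear, speculation only duplicates slow tasks; failed tasks are retried up to `spark.task.maxFailures` regardless, so my "5%" threshold may really be about how many tasks hit that limit rather than speculation itself.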