Peter Farah
@pfarah65
Whenever I submit my job after creating a context, it always fails right away. This is my error:
{
  "duration": "1.202 secs",
  "classPath": "job.WordCountSparkJob",
  "startTime": "2019-10-11T13:38:52.777-04:00",
  "context": "py-context",
  "result": {
    "message": "Context (py-context) for this job was terminated",
    "errorClass": "",
    "stack": ""
  },
  "status": "ERROR",
  "jobId": "1e34ca22-b50c-4f74-b889-4e1979d1bfeb",
  "contextId": ""
}
Peter Farah
@pfarah65
nevermind, I had to downgrade to Spark 2.3.2
pgouda89
@pgouda89

Hi All,

What is the best way to get the failure reason for a Spark task if the jobs are submitted asynchronously? I am getting the following response for the ERROR case.
[{
  "duration": "5.948 secs",
  "classPath": "<ApplicationClass>",
  "startTime": "2019-10-15T11:19:30.803+05:30",
  "context": "INFA_DATA_PREVIEW_APP_DIS_SJS",
  "status": "ERROR",
  "jobId": "0e79d7b3-7e31-4232-b354-b1fb01b0928a",
  "contextId": "08ee61f8-7425-45f2-a4ce-c4b456fd24b8"
}]

Do we need to perform anything extra in the ApplicationClass in order to get the error stacktrace in the job response body?

pgouda89
@pgouda89

Hi @valan4ik ,

Is it possible to change the Spark context name using some parameter/env? Currently, it is set to "spark.jobserver.JobManager"

This message was deleted
image.png
Valentina
@valan4ik
Hi @pgouda89, is it the YARN UI? Are you using the latest Jobserver version?
There was a PR a while ago: spark-jobserver/spark-jobserver#1156
I suppose it was addressing your issue.
Valentina
@valan4ik
Also, regarding getting the job error in the GET /jobs request: it’s not so easy, you would need to change the jobserver code :) Usually people make additional GET /jobs/$jobId requests for the jobs in the ERROR state.
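The pattern described above (list all jobs, then fetch details for the ones in the ERROR state) can be sketched in Python. The helper names and base_url are hypothetical, and the exact response fields may differ between jobserver versions; the filtering works on JSON shaped like the responses posted earlier in this thread.

```python
import json
import urllib.request

def error_job_ids(jobs):
    """Return the jobIds of all jobs that ended in the ERROR state."""
    return [job["jobId"] for job in jobs if job.get("status") == "ERROR"]

def fetch_error_details(base_url):
    """Fetch GET /jobs, then GET /jobs/<jobId> for each failed job.

    base_url is an assumption (e.g. "http://localhost:8090"); adjust it
    to wherever your jobserver is listening.
    """
    with urllib.request.urlopen(f"{base_url}/jobs") as resp:
        jobs = json.load(resp)
    details = {}
    for job_id in error_job_ids(jobs):
        with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
            details[job_id] = json.load(resp)
    return details
```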
pgouda89
@pgouda89
@valan4ik Thanks a lot :) I was missing spark-jobserver/spark-jobserver#1156 as I was using spark jobserver 0.9.0
Peter Farah
@pfarah65
Anyone having trouble building a Docker image? When build.sbt executes ./dev/make-distribution I keep getting the failure "Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed"
itsmesrds
@itsmesrds
Hi @valan4ik, I'm running jobserver-0.8.0 on EMR YARN in production. Because of YARN node blacklisting issues the Spark context is getting killed and all the data stored in the jobserver's cached objects gets cleared.
Is there any way to check the heartbeat of the SparkContext? If such things happen in production, then at least based on the heartbeat I'll re-run the job which caches the data.
sj123050037
@sj123050037
Hi @itsmesrds,
You can do this by running the REST request "contexts/<context-name>".
This will give you a response which contains the context state. Based on the state value you can identify whether it is running or in some other final state.
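A minimal sketch of that heartbeat check, assuming the GET /contexts/<name> response contains a "state" field (the exact field names and state values may vary by jobserver version):

```python
import json
import urllib.request

# States in which the context can still serve jobs; anything else
# (ERROR, KILLED, FINISHED, ...) means any cached data is gone.
ALIVE_STATES = {"RUNNING"}

def context_is_alive(context_info):
    """Decide from a GET /contexts/<name> response dict whether the
    context (and therefore any data cached in it) is still alive."""
    return context_info.get("state") in ALIVE_STATES

def check_context(base_url, name):
    """Poll the jobserver for a context's state.

    base_url is an assumption; verify the response shape of your
    jobserver version before relying on the fields.
    """
    with urllib.request.urlopen(f"{base_url}/contexts/{name}") as resp:
        return context_is_alive(json.load(resp))
```

If the check returns False, the caller could re-create the context and re-run the caching job.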
itsmesrds
@itsmesrds
Thanks @sj123050037 .
Hi team,
Is there any way to set the number of executors and the executor memory for every job in a SparkContext?
As far as I know we set those parameters when creating the context. But will the same work for every job request?
itsmesrds
@itsmesrds
Basically, Spark parameters for every session in a context.
tosen1990
@tosen1990
@valan4ik
Hey, I've tried different ways to override the driver memory, but the job always throws this error, and the Spark context shuts down at the same time.
Do you have any idea?
{
  "duration": "69.903 secs",
  "classPath": "org.air.ebds.organize.spark.jobserver.GBSegmentJob",
  "startTime": "2019-11-18T02:41:07.524Z",
  "context": "test-context",
  "result": {
    "message": "Job aborted due to stage failure: Task 2 in stage 10.0 failed 1 times, most recent failure: Lost task 2.0 in stage 10.0 (TID 530, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space\n\tat scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:117)\n\tat scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:115)\n\tat scala.Array$.ofDim(Array.scala:218)\n\tat geotrellis.raster.IntArrayTile$.ofDim(IntArrayTile.scala:271)\n\tat geotrellis.raster.ArrayTile$.alloc(ArrayTile.scala:422)\n\tat geotrellis.raster.CroppedTile.mutable(CroppedTile.scala:141)\n\tat geotrellis.raster.CroppedTile.mutable(CroppedTile.scala:133)\n\tat geotrellis.raster.CroppedTile.toArrayTile(CroppedTile.scala:125)\n\tat geotrellis.raster.crop.SinglebandTileCropMethods$class.crop(SinglebandTileCropMethods.scala:45)\n\tat geotrellis.raster.package$withTileMethods.crop(package.scala:55)\n\tat geotrellis.raster.crop.MultibandTileCropMethods$$anonfun$cropBands$1.apply$mcVI$sp(MultibandTileCropMethods.scala:45)\n\tat geotrellis.raster.crop.MultibandTileCropMethods$$anonfun$cropBands$1.apply(MultibandTileCropMethods.scala:44)\n\tat geotrellis.raster.crop.MultibandTileCropMethods$$anonfun$cropBands$1.apply(MultibandTileCropMethods.scala:44)\n\tat scala.collection.immutable.Range.foreach(Range.scala:160)\n\tat geotrellis.raster.crop.MultibandTileCropMethods$class.cropBands(MultibandTileCropMethods.scala:44)\n\tat geotrellis.raster.package$withMultibandTileMethods.cropBands(package.scala:83)\n\tat geotrellis.raster.crop.MultibandTileCropMethods$class.crop(MultibandTileCropMethods.scala:72)\n\tat geotrellis.raster.package$withMultibandTileMethods.crop(package.scala:83)\n\tat geotrellis.raster.package$withMultibandTileMethods.crop(package.scala:83)\n\tat geotrellis.raster.crop.CropMethods$class.crop(CropMethods.scala:66)\n\tat geotrellis.raster.package$withMultibandTileMethods.crop(package.scala:83)\n\tat 
geotrellis.spark.buffer.BufferTiles$$anonfun$17.geotrellis$spark$buffer$BufferTiles$$anonfun$$genSection$1(BufferTiles.scala:544)\n\tat geotrellis.spark.buffer.BufferTiles$$anonfun$17$$anonfun$apply$17.apply(BufferTiles.scala:549)\n\tat geotrellis.spark.buffer.BufferTiles$$anonfun$17$$anonfun$apply$17.apply(BufferTiles.scala:549)\n\tat scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)\n\tat scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)\n\tat scala.collection.immutable.List.foreach(List.scala:381)\n\tat scala.collection.TraversableLike$class.map(TraversableLike.scala:234)\n\tat scala.collection.immutable.List.map(List.scala:285)\n\tat geotrellis.spark.buffer.BufferTiles$$anonfun$17.apply(BufferTiles.scala:549)\n\tat geotrellis.spark.buffer.BufferTiles$$anonfun$17.apply(BufferTiles.scala:521)\n\tat scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)\n\nDriver stacktrace:"
pgouda89
@pgouda89
Hi @tosen1990 ,
Did you try changing server_start.sh? If not, try using a larger value for --driver-memory in the spark-submit command.
tosen1990
@tosen1990
@pgouda89 I tried and failed.
Behroz Sikander
@bsikander
@tosen1990 which version of jobserver are you using?

You can use the following to increase the driver memory:

curl -X POST "localhost:8090/contexts/test-context?launcher.spark.driver.memory=512m"

The valid values are 512m, 1g, 2g, etc.
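A small hypothetical helper for building that context-creation URL, so the driver-memory value and any other launcher.* settings are not hard-coded into the curl string:

```python
from urllib.parse import urlencode

def create_context_url(base_url, name, driver_memory="512m", **extra):
    """Build the POST URL for creating a jobserver context with a given
    driver memory. Valid memory values follow Spark's format: 512m, 1g,
    2g, etc. Extra launcher.* settings can be passed as keyword args.
    """
    params = {"launcher.spark.driver.memory": driver_memory}
    params.update(extra)
    return f"{base_url}/contexts/{name}?{urlencode(params)}"
```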

tosen1990
@tosen1990
@bsikander hey, I've resolved this problem with launcher.spark.driver.memory=512m.
I found this method by searching the history in this channel. Thanks.
tosen1990
@tosen1990
You are right. I read it from that link.
Behroz Sikander
@bsikander
Cool. Btw, I think that information should also be in the jobserver README. If you want, feel free to open a PR.
tosen1990
@tosen1990
Actually, I've been stuck on this issue for several months. Tbh, I've tried whatever I can, but still can't fix it.
Behroz Sikander
@bsikander
Umm, could you post the driver logs in the issue? That will give the full picture.
tosen1990
@tosen1990
And when I use SJS without Docker, it works fine.
Behroz Sikander
@bsikander
Ahan. The exception in the driver logs should give you clues about what is happening.
tosen1990
@tosen1990
Actually, I can't find any driver logs, tbh. I just found the logs from YARN.
Behroz Sikander
@bsikander
well, from the UI, you should be able to download the driver logs. If you can't find the logs then it could point to the fact that jobserver fails before the driver is started.
tosen1990
@tosen1990
image.png
image.png
I'll dig into it more deeply later. Thanks for your advice.
Behroz Sikander
@bsikander

It seems that the driver container is failing to launch.

Please do 1 thing: Directly use spark-submit to submit a WordCount/Pi job to your YARN cluster and see if it runs through.

If it does, then the problem could be on the jobserver side (which I doubt); otherwise the problem is with your YARN setup, and you should fix it before using jobserver again.

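The sanity check above can be scripted. The command layout below follows the standard SparkPi example shipped with Spark, but the examples-jar path is an assumption that varies by Spark version and distribution:

```python
import subprocess

def spark_pi_submit_cmd(spark_home, master="yarn", deploy_mode="cluster",
                        examples_jar=None):
    """Build the spark-submit command for the bundled SparkPi example.

    The default jar path is a placeholder; locate the real
    spark-examples jar under your Spark installation.
    """
    jar = examples_jar or f"{spark_home}/examples/jars/spark-examples.jar"
    return [
        f"{spark_home}/bin/spark-submit",
        "--class", "org.apache.spark.examples.SparkPi",
        "--master", master,
        "--deploy-mode", deploy_mode,
        jar,
        "100",  # number of partitions for the Pi estimation
    ]

# To actually run it (requires a working Spark + YARN setup):
# subprocess.run(spark_pi_submit_cmd("/opt/spark"), check=True)
```

If this job runs through on YARN directly, the cluster side is healthy and attention can shift back to the jobserver configuration.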
tosen1990
@tosen1990
@bsikander I submitted a Pi job to my cluster and it works well. I did exactly what the docs say and can't figure out what the error is.
Behroz Sikander
@bsikander
Interesting. Then look a bit deeper to see if you can find something interesting. Please post all the updates in the GitHub issue.
tosen1990
@tosen1990
I'll keep looking into it. Thanks a lot.
itsmesrds
@itsmesrds

Hi @bsikander,

I'm currently using SJS 0.8.0 and Spark 2.2.1. Is the issue below fixed? Things work fine when sending one request, but whenever I send 2 requests I get:

[2019-11-28 04:36:46,915] ERROR .jobserver.JobManagerActor [] [] - About to restart actor due to exception:
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:167)
        at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
        at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:165)
        at scala.concurrent.Await$.result(package.scala:190)
        at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:282)
        at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:192)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:25)
        at spark.jobserver.common.akka.Slf4jLogging$class.spark$jobserver$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:34)
        at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1.applyOrElse(Slf4jLogging.scala:24)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at spark.jobserver.common.akka.ActorMetrics$$anonfun$receive$1.applyOrElse(ActorMetrics.scala:23)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:484)
        at spark.jobserver.common.akka.InstrumentedActor.aroundReceive(InstrumentedActor.scala:8)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2019-11-28 04:36:46,916] ERROR ka.actor.OneForOneStrategy [] [akka://JobServer/user/jobManager-6a-9111-2434bb36090d] - Futures timed out after [3 seconds]
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:167)
        at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
        at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:165)
        at scala.concurrent.Await$.result(package.scala:190)
        at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:282)
        at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:192)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
Is there any resolution to the above issue?
itsmesrds
@itsmesrds
@valan4ik please let us know what the resolution for the above issue would be.
Behroz Sikander
@bsikander
@itsmesrds you are sending 2 requests (POST /jobs) in parallel?
The problem seems to be that fetching the binary information takes too long and the Future times out. This exception causes the actor to restart (this is the default behavior of Akka actors).
Umm, btw, which DAO are you using? SQL/file/C*?
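One client-side way to cope with such transient timeouts is retrying the POST with a backoff. This is a hypothetical workaround sketch, not jobserver functionality, and it does not fix the underlying DAO timeout; the opener parameter is injectable only so the retry logic can be exercised without a live server:

```python
import time
import urllib.error
import urllib.request

def post_job_with_retry(url, attempts=3, backoff_s=2.0,
                        opener=urllib.request.urlopen):
    """POST a job submission, retrying on transient server errors.

    If the jobserver's internal future times out and the actor
    restarts, a later attempt may succeed once the binary is cached.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, data=b"", method="POST")
            with opener(req) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_err = err
            time.sleep(backoff_s * (attempt + 1))  # linear backoff
    raise last_err
```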
itsmesrds
@itsmesrds

@bsikander, I have tried using the file and H2 databases; both give the same results. I fixed this by increasing the timeout:

val daoAskTimeout = Timeout(60 seconds)

in job-server/src/main/scala/spark/jobserver/JobManagerActor.scala

Now that problem is fixed (meaning it no longer throws the exception), but the context is still getting killed.

Behroz Sikander
@bsikander
ok, check the driver logs to find the exception.
Valentina
@valan4ik
@itsmesrds sorry, but just to clear things up: do you send 2 requests to create contexts?
The first context is successfully created (and is visible in the UI), and the second request succeeds only with the increased timeout (but the context dies after a short period of time)?
itsmesrds
@itsmesrds
No @valan4ik, the context is already created with a cached RDD in it. I'm sending two REST calls to get results out of the cached data. While sending the second request, when the first request is still executing, it throws that exception and the context gets killed.