    Wooyoung Moon
    @wmoon5
    2.0.5
    Savin
    @savingoyal
    Hmm this is rather interesting. We swallow exceptions for GetLogEvents and have retries built in. How long was your job running for?
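    (For illustration only, not Metaflow's actual code: the general pattern described above is to call GetLogEvents with a bounded number of retries and swallow throttling errors rather than fail the task. A rough boto3 sketch, with hypothetical log group and stream names:)
    import time
    import boto3
    from botocore.exceptions import ClientError

    logs = boto3.client("logs")

    def fetch_log_events(log_group, log_stream, max_retries=4):
        # Retry GetLogEvents a bounded number of times, backing off on throttling.
        for attempt in range(max_retries + 1):
            try:
                return logs.get_log_events(
                    logGroupName=log_group, logStreamName=log_stream
                )["events"]
            except ClientError as e:
                if e.response["Error"]["Code"] != "ThrottlingException":
                    raise
                time.sleep(2 ** attempt)  # back off before the next attempt
        return []  # swallow the failure: better to miss some logs than to fail the task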
    Wooyoung Moon
    @wmoon5
    It ran overnight on an EC2 instance in tmux. For most of the night (~14 hours) most of the tasks were waiting in a queue, so this morning I manually terminated some other jobs that were in the way and that was only a couple hours ago
    jobs unrelated to this particular flow
    Savin
    @savingoyal
    Do you have the exception log that you can share?
    Wooyoung Moon
    @wmoon5
    Actually, strangely, when I check the Batch console, it says the tasks are still running. But in my terminal I see this:
    2020-05-27 15:53:06.663 [43/compute_embeddings/268 (pid 8539)] Batch job error:
    2020-05-27 15:53:06.663 [43/compute_embeddings/268 (pid 8539)] ClientError('An error occurred (ThrottlingException) when calling the GetLogEvents operation (reached max retries: 4): Rate exceeded')
    2020-05-27 15:53:06.892 [43/compute_embeddings/268 (pid 8539)]
    2020-05-27 15:53:07.266 [43/compute_embeddings/268 (pid 8539)] Task failed.
    2020-05-27 15:53:07.266 Workflow failed.
    2020-05-27 15:53:07.266 Terminating 15 active tasks...
    2020-05-27 15:53:12.000 Killing 15 remaining tasks after having waited for 5 seconds -- some tasks may not exit cleanly
    2020-05-27 15:53:12.018 Flushing logs...
    Savin
    @savingoyal
    We do a best-effort kill. Given that all of these jobs can only execute for a finite amount of time, you are guaranteed that the stragglers will eventually time out if Metaflow was unable to kill them.
    Wooyoung Moon
    @wmoon5
    Got it. So there's no way to recover the results of the straggler jobs, correct? Is there anything I can do to avoid the failure in the future? I could try re-running the flow and let you know if it happens again
    Savin
    @savingoyal
    You can also add @retry to guard against platform issues. Yes, please do let us know if it happens again.
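    (A minimal sketch of what adding @retry to an AWS Batch step looks like; the flow name and step bodies here are made up.)
    from metaflow import FlowSpec, step, retry, batch

    class EmbeddingsFlow(FlowSpec):  # hypothetical flow

        @retry(times=3)  # re-run the step up to 3 more times if it fails, e.g. on platform errors
        @batch           # execute the step on AWS Batch
        @step
        def start(self):
            # ... compute embeddings here ...
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        EmbeddingsFlow()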
    Wooyoung Moon
    @wmoon5
    Great, thanks! I'll try adding @retry as well!
    Wooyoung Moon
    @wmoon5
    @savingoyal just a quick update: I'm re-running the same flow as before with retries added and I'm seeing the same failure (GetLogEvents ThrottlingException) causing tasks to have to retry (3 out of the 16 parallel tasks already).
    Wooyoung Moon
    @wmoon5
    @savingoyal it doesn't look like this update (https://github.com/Netflix/metaflow/pull/193/files/f65df3ae2dd21ce89bc6debdcbbb67d7a611c286) is in the current master
    Savin
    @savingoyal
    ~Ugh, I did a revert a little while back~
    Wooyoung Moon
    @wmoon5
    Well, actually, I'm looking at the code we're running (2.0.5), and at least the version installed on our machines doesn't seem to have the update...
    Savin
    @savingoyal
    Sorry, my head was elsewhere - we made the decision to eventually roll back that patch - Netflix/metaflow@3947447
    We are also working on another log streaming solution which will get rid of this bottleneck. Essentially, every task will upload its logs in chunks to S3 and we will stream them back to the CLI.
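    (Not the actual implementation, just a rough sketch of the idea being described: each task periodically writes everything it has logged so far to an S3 object, and the CLI polls that object and prints only the new bytes, which sidesteps the CloudWatch GetLogEvents rate limits. Bucket and key names are made up.)
    import time
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-metaflow-logs"                    # hypothetical bucket
    KEY = "logs/43/compute_embeddings/268/stdout"  # hypothetical per-task key

    def upload_chunk(buffered_text):
        # Task side: overwrite the object with everything logged so far.
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=buffered_text.encode())

    def tail_logs(poll_seconds=5):
        # CLI side: poll the object and print only the bytes we haven't seen yet.
        seen = 0
        while True:
            try:
                body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
            except s3.exceptions.NoSuchKey:
                body = b""
            if len(body) > seen:
                print(body[seen:].decode(), end="")
                seen = len(body)
            time.sleep(poll_seconds)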
    Wooyoung Moon
    @wmoon5
    Ah I see, that makes sense, thanks!! Is there a pull request somewhere so I can keep track of that future change? In the meantime, I'm guessing our options are either to: 1) hack it and mute that particular exception again, or 2) limit the number of jobs we run at the same time?
    Savin
    @savingoyal
    The problem with muting that exception is that you won't get any logs back until the throttling is resolved, and the limits on GetLogEvents are global in nature, so even if you limit the number of jobs you are launching, somebody else can cause trouble for you by launching jobs at the same time.
    I will tag you once the PR is open for review.
    Another option is to deploy your workflow on top of AWS Step Functions Netflix/metaflow#204
    You can very easily control the concurrency by specifying --max-workers=
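    (Hedged example of what that looks like from the command line, assuming a flow file named myflow.py and the step-functions commands introduced in Netflix/metaflow#204.)
    # cap local orchestration at 4 concurrent AWS Batch tasks
    python myflow.py run --with batch --max-workers 4

    # or, once #204 is available, deploy the flow to AWS Step Functions
    python myflow.py step-functions create
    python myflow.py step-functions trigger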
    Wooyoung Moon
    @wmoon5
    I see, that makes sense, that's really helpful. Please do tag me once the PR is open for review. Thank you for the help, Savin! Super appreciated!
    Savin
    @savingoyal
    For sure! Do let us know your feedback so that we can make Metaflow better.
    Christopher Wong
    @christopher-wong
    I recently tried applying the CloudFormation template and ran into the error ECSFargateService CREATE_FAILED Service arn:aws:ecs:us-west-2:<account_id>:service/metadata-service did not stabilize.
    Has anyone seen this error before? Happened 3 times in a row
    ferras
    @ferras
    @christopher-wong can you verify if the RDS instance is up?
    Christopher Wong
    @christopher-wong
    let me try it again and check
    Christopher Wong
    @christopher-wong
    @ferras RDS is up and running :) ECSFargateService is still stuck sadly
    if I look at the task, I see the following error: Status reason CannotPullContainerError: Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    ferras
    @ferras
    hmm, strange
    let me check the docker image link used by the template
    Christopher Wong
    @christopher-wong
    and I verified that the Fargate security group has an allow 0.0.0.0/0 rule, so it should be fine on the networking side
    Christopher Wong
    @christopher-wong
    I am :)
    My template is exactly the same as master, except I pulled out the VPC / subnets and replaced them with my existing values, hard-coded.
    ferras
    @ferras
    hmm, seems to be an issue with the docker config
    https://docs.docker.com/config/daemon/systemd/#httphttps-proxy
    see similar issue here:
    docker/for-win#611
    Christopher Wong
    @christopher-wong
    huh this is weird
    that should all be set by AWS for ECS/Fargate right? Do we have control over that?
    ferras
    @ferras
    hmm yeah
    one thing you can do is try to push the image to a registry in your own AWS account (e.g. ECR)
    and then pull from there
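    (A rough sketch of that workaround, assuming AWS CLI v2, us-west-2, and that the metadata-service image the template pulls is netflixoss/metaflow_metadata_service on Docker Hub; substitute whatever image the template actually references.)
    # mirror the image into a registry in your own account (ECR), then pull from there
    aws ecr create-repository --repository-name metaflow-metadata-service --region us-west-2
    aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <account_id>.dkr.ecr.us-west-2.amazonaws.com
    docker pull netflixoss/metaflow_metadata_service
    docker tag netflixoss/metaflow_metadata_service <account_id>.dkr.ecr.us-west-2.amazonaws.com/metaflow-metadata-service:latest
    docker push <account_id>.dkr.ecr.us-west-2.amazonaws.com/metaflow-metadata-service:latest
    # then point the ECSFargateService task definition at the ECR image URI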
    Christopher Wong
    @christopher-wong
    ah, yeah, that’s a good suggestion
    I’ll give that a try and report back
    ferras
    @ferras
    sorry for the inconvenience, ideally you wouldn't actually need to do that
    but it's hard for me to diagnose a connection issue within your AWS account
    Christopher Wong
    @christopher-wong
    at least if that resolves the issue, it narrows down potential root causes
    I appreciate the suggestion
    thanks for the help