    jaiprasad Reddy
    @jaiprasadreddy
    Hi team, I started working with Metaflow recently. I am trying to use S3 and AWS Batch only, and I have set up S3, AWS Batch, and the ECS IAM role access accordingly. I can see that it is able to create the cluster in ECS and upload the flow to S3, but it's not able to submit a job to AWS Batch. When I run "python helloaws.py run" in the terminal I get a segmentation fault. How do I debug this?
    2 replies
    Wooyoung Moon
    @wmoon5
    Anyone ever run into an error message like this?
    Metaflow service error:
    Metadata request (/flows/PoIExperimentFlow/run) failed (code 500): "{\"err_msg\": \"__init__() got an unexpected keyword argument 'run_id'\"}"
    16 replies
    Malay Shah
    @malay95
    Hello everyone, happy Thanksgiving. I am working on a Metaflow flow that uses a lot of RAM and processors, and I want to monitor the usage on Batch as well as locally. Is there something that can report RAM and processor usage at the end of each step, or a decorator that does it? I want to use it to estimate a good amount of RAM to request in such cases.
    2 replies
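    For anyone wondering how such a helper might look: Metaflow has no built-in usage-reporting decorator as far as I know, but a small helper based on the stdlib resource module, called as the last line of each step, prints peak memory and CPU time. A minimal sketch (illustrative only, not official Metaflow API):

    import resource

    def report_usage(step_name):
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
        print(f"[{step_name}] peak RSS: {usage.ru_maxrss} KB, "
              f"user CPU: {usage.ru_utime:.1f}s, system CPU: {usage.ru_stime:.1f}s")

    Calling report_usage("train") at the end of a step works both locally and inside the Batch container, and the peak RSS gives a starting point for sizing @resources/@batch memory requests.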
    Denis Maciel
    @denismaciel
    hi there, is it possible to pass the default Docker image to run on AWS Batch from the command line? Something like python flow.py --with batch --dockerimage <image-url> or is it only possible using the batch decorator?
    2 replies
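    For what it's worth, decorator attributes can be passed through --with as key=value pairs, so the CLI equivalent should be (with <image-url> left as a placeholder):

    python flow.py run --with batch:image=<image-url>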
    bishax
    @bishax

    Hi, according to Netflix/metaflow#193, ThrottlingExceptions don't cause task failure; however, I am reliably getting a ThrottlingException followed by a task failure a second later, e.g.

    2020-11-30 15:16:58.243 [659/link_finder/4638 (pid 1526921)]     AWS Batch job error:
    2020-11-30 15:16:58.243 [659/link_finder/4638 (pid 1526921)]     ClientError('An error occurred (ThrottlingException) when calling the GetLogEvents operation (reached max retries: 4): Rate exceeded')
    2020-11-30 15:16:58.539 [659/link_finder/4638 (pid 1526921)] 
    2020-11-30 15:16:59.030 [659/link_finder/4638 (pid 1526921)] Task failed.
    2020-11-30 15:16:59.092 [659/link_finder/4638 (pid 1535437)] Task is starting (retry).

    Perhaps a flurry of error events is being masked by the throttling exception, hiding the source of the failure?

    12 replies
    Ji Xu
    @xujiboy

    Hi, I am trying to install Metaflow for R following the doc, but ran into the following error when testing with metaflow::test():

    Metaflow 2.2.0 executing HelloWorldFlow for user:ji.xu
    Validating your flow...
        The graph looks good!
    2020-11-30 10:50:33.216 Workflow starting (run-id 1606762233207871):
    2020-11-30 10:50:33.222 [1606762233207871/start/1 (pid 49822)] Task is starting.
    2020-11-30 10:50:34.783 [1606762233207871/start/1 (pid 49822)] Fatal Python error: initsite: Failed to import the site module
    2020-11-30 10:50:34.785 [1606762233207871/start/1 (pid 49822)] Traceback (most recent call last):
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 550, in <module>
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]     main()
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 531, in main
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]     known_paths = addusersitepackages(known_paths)
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 282, in addusersitepackages
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]     user_site = getusersitepackages()
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 258, in getusersitepackages
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     user_base = getuserbase() # this will also set USER_BASE
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 248, in getuserbase
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     USER_BASE = get_config_var('userbase')
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/sysconfig.py", line 609, in get_config_var
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     return get_config_vars().get(name)
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/sysconfig.py", line 588, in get_config_vars
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     import _osx_support
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/_osx_support.py", line 4, in <module>
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     import re
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/re.py", line 123, in <module>
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)]     import sre_compile
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/sre_compile.py", line 17, in <module>
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)]     assert _sre.MAGIC == MAGIC, "SRE module mismatch"
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)] AssertionError: SRE module mismatch
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)] Error: Error 1 occurred creating conda environment r-reticulate
    2020-11-30 10:50:34.817 [1606762233207871/start/1 (pid 49822)] Execution halted
    2020-11-30 10:50:34.820 [1606762233207871/start/1 (pid 49822)] Task failed.
    2020-11-30 10:50:34.820 Workflow failed.
    2020-11-30 10:50:34.820 Terminating 0 active tasks...
    2020-11-30 10:50:34.820 Flushing logs...
        Step failure:
        Step start (task-id 1) failed.

    It seems the problem is on the Python side. Has anyone seen the same issue and found a solution?

    45 replies
    David Patschke
    @dpatschke

    I've been having loads of problems getting my AWS Batch job working through Metaflow. Since I'm not overly experienced with AWS CloudOps, it's hard for me to tell whether the issue is an AWS issue or a Metaflow limitation.

    Here is one thing I experienced which may help others:

    1. As per @russellbrooks's suggestion in this thread, I created a single job queue which contains both CPU and GPU compute environments. When I use the @batch decorator, I noticed that GPU instances were getting launched even when I explicitly set gpu=0 as a parameter in the decorator.

    This appears to be happening for a couple of reasons:

    • I maxed out my vCPU limit on my CPU ComputeEnvironment, which forces jobs to launch on the GPU ComputeEnvironment. After talking with AWS support: if any of you really want to crank up the number of Batch workers, make sure the MaxVCPUBatch parameter in the CloudFormation template is also adjusted upwards accordingly. In my case, I'm running Dask parallelization within each Batch task, so I use up MaxVCPUBatch pretty quickly, and I was only seeing one c5.18xlarge instance launch at any one time with a MaxVCPUBatch value of 96 in my CloudFormation template. So even though the Metaflow documentation lists a --max-workers parameter in the CLI, the maximum number of workers will also be throttled by MaxVCPUBatch in the CloudFormation template.

    • Explicitly setting gpu=0 does nothing within the Metaflow @batch decorator (the BatchJob class). I know there are a lot of ways to work around this (separate job queues, the solution mentioned above, etc.) but I was curious what the Metaflow devs on this forum think of changing line 150 in batch_client.py to only request GPU resources if int(gpu) > 0, to protect GPU instances from being launched "unnecessarily" (sketched below).

    11 replies
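    To make the proposal concrete, here is a hypothetical sketch of that guard (not the actual Metaflow source; the returned structure mirrors the AWS Batch SubmitJob API):

    # only attach a GPU resource requirement when at least one GPU was requested
    def gpu_requirements(gpu):
        if gpu is not None and int(gpu) > 0:
            return [{"type": "GPU", "value": str(gpu)}]
        return []

    assert gpu_requirements(0) == []  # gpu=0 no longer steers the job to a GPU instance
    assert gpu_requirements(2) == [{"type": "GPU", "value": "2"}]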
    jpcloudconsulting
    @jpcloudguru_twitter
    Hi, trying to run a hello Metaflow example on AWS Batch and getting the following error. Any ideas?
    4 replies
    Metaflow 2.2.5 executing HelloAWSFlow for user:jpujari
    Validating your flow...
        The graph looks good!
    Running pylint...
        Pylint is happy!
    2020-11-30 17:12:42.323 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] Setting up task environment.
    2020-11-30 17:12:42.325 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] /bin/sh: 1: [: -le: unexpected operator
    2020-11-30 17:12:44.974 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] /bin/sh: 1: [: -gt: unexpected operator
    2020-11-30 17:12:44.975 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] tar: job.tar: Cannot open: No such file or directory
    2020-11-30 17:12:44.977 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] tar: Error is not recoverable: exiting now
    2020-11-30 17:12:44.977 [58/hello/251 (pid 14705)]     AWS Batch error:
    2020-11-30 17:12:45.225 [58/hello/251 (pid 14705)]     Essential container in task exited This could be a transient error. Use @retry to retry.
    2020-11-30 17:12:45.233 [58/hello/251 (pid 14705)]
    2020-11-30 17:12:47.600 [58/hello/251 (pid 14705)] Task failed.
    2020-11-30 17:12:47.878 [58/hello/251 (pid 28819)] Task is starting (retry).
    2020-11-30 17:12:48.588 [58/hello/251 (pid 28819)] Sleeping 2 minutes before the next AWS Batch retry
    beks
    @teki-b
    hello, we have been using the pip decorator to install specific versions of libraries, as follows: @pip(libraries={"<library-name>":"<version>"}). Does anyone know if it's possible, and how, to get this to use the latest version dynamically, without specifying the version value?
    1 reply
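    Since @pip is a custom/community decorator rather than core Metaflow, the answer depends on its implementation; but if it passes libraries straight to pip, an empty version string could be treated as "latest". A hypothetical helper:

    # build pip install targets; an empty version string means "latest"
    def pip_targets(libraries):
        return [f"{name}=={version}" if version else name
                for name, version in libraries.items()]

    print(pip_targets({"pandas": "1.1.5", "requests": ""}))
    # ['pandas==1.1.5', 'requests']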
    TDo13
    @TDo13
    Hello, are there any good examples of how to use the step command to run an individual step or a subset of steps in a Metaflow workflow?
    11 replies
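    For reference, step is mostly an internal command that the runtime invokes for each task, so documentation is sparse; its options can at least be listed with:

    python flow.py step --help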
    Ian Wesley-Smith
    @iwsmith
    Hello, I am running a Metaflow job on AWS Batch and am getting some weird S3 errors. One of my parallel steps fails, with one job returning:
    Task is starting.
    <flow UserProfileFlow step make_user_profile[14] (input: [UserList(user_id=18...)> failed:
        Internal error
    Traceback (most recent call last):
      File "/metaflow/metaflow/datatools/s3.py", line 588, in _read_many_files
        stdout, stderr = self._s3op_with_retries(op,
      File "/metaflow/metaflow/datatools/s3.py", line 658, in _s3op_with_retries
        time.sleep(2**i + random.randint(0, 10))
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 492, in __exit__
        self.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 499, in close
        self._closer.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 436, in close
        unlink(self.name)
    OSError: [Errno 30] Read-only file system: '/metaflow/metaflow.s3.7chybqzr/metaflow.s3op.stderrn16glb60'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/metaflow/metaflow/cli.py", line 883, in main
        start(auto_envvar_prefix='METAFLOW', obj=state)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
        return f(get_current_context().obj, *args, **kwargs)
      File "/metaflow/metaflow/cli.py", line 437, in step
        task.run_step(step_name,
      File "/metaflow/metaflow/task.py", line 394, in run_step
        self._exec_step_function(step_func)
      File "/metaflow/metaflow/task.py", line 47, in _exec_step_function
        step_function()
      File "train.py", line 121, in make_user_profile
        files = s3.get_many(user_keys, return_missing=True)
      File "/metaflow/metaflow/datatools/s3.py", line 417, in get_many
        return list(starmap(S3Object, _get()))
      File "/metaflow/metaflow/datatools/s3.py", line 411, in _get
        for s3prefix, s3url, fname in res:
      File "/metaflow/metaflow/datatools/s3.py", line 597, in _read_many_files
        yield tuple(map(url_unquote, line.strip(b'\n').split(b' ')))
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 492, in __exit__
        self.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 499, in close
        self._closer.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 436, in close
        unlink(self.name)
    OSError: [Errno 30] Read-only file system: '/metaflow/metaflow.s3.7chybqzr/metaflow.s3.inputs._ztikcnn'
    service@http://Metaflo-XXXXX.elb.us-east-1.amazonaws.com
    4 replies
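    For context, the failing call in the traceback corresponds to a pattern like the following (bucket and keys are hypothetical; S3 and get_many are real Metaflow data tools):

    from metaflow import S3

    user_keys = ["s3://my-bucket/profiles/18", "s3://my-bucket/profiles/19"]
    with S3() as s3:
        # with return_missing=True, absent keys come back as objects with exists=False
        objs = s3.get_many(user_keys, return_missing=True)
        present = [obj for obj in objs if obj.exists]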
    russellbrooks
    @russellbrooks
    PSA potentially related to :point_up:, looks like the Batch team got around to updating the default compute environment ECS-optimized AMIs to use Amazon Linux 2 :tada:
    https://aws.amazon.com/about-aws/whats-new/2020/11/aws-batch-now-has-integrated-amazon-linux-2-support/
    3 replies
    ayorgo
    @ayorgo
    Hey Metaflow,
    How can I approach cleaning up old metadata and artifacts from my metadata service?
    7 replies
    ayorgo
    @ayorgo

    Hey Metaflow,
    How can I disable the timestamp in the printout? I tried to monkeypatch the logger as follows, but it messed up all the parameter handling and didn't work:

    from functools import partial
    from metaflow import cli

    # wrap the CLI logger so every call passes timestamp=False
    cli.logger = partial(cli.logger, timestamp=False)

    Is there any other way?

    4 replies
    Christopher Wong
    @christopher-wong
    I'm testing out the new Metaflow scheduler for the first time (very excited!) but running into some issues with private conda packages. When I try to schedule, Metaflow obviously won't be able to find my private conda package and fails to schedule. Previously, I manually installed the conda package on the EC2 instance I was using to run the job. Is there a workaround with Step Functions?
    4 replies
    mkjacks5
    @mkjacks5

    from https://docs.metaflow.org/metaflow/tagging
    " if you have separate training and prediction flows in production, the prediction flow can access the previously built model as long as one exists in the same namespace"

    I have two such flows, but I can't figure out how to get them into the same namespace. I've tried --authorize, but it seems to create a unique production token (i.e. namespace("production:flow1-0-zjgv")) for every unique flow name. I'm able to get around this by changing namespaces inside the script, but from the documentation it sounds like there should be a way to have them in the same namespace, so that I can more easily access trained models from the training flow when I run the prediction flow. Am I misunderstanding something here?

    1 reply
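    The in-script workaround described above might look like this sketch (the token string and artifact name are hypothetical):

    from metaflow import Flow, namespace

    # switch to the training flow's production namespace before reading artifacts
    namespace("production:flow1-0-zjgv")
    model = Flow("TrainingFlow").latest_successful_run.data.model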
    Roman Kindruk
    @sappier

    Hi here, we're developing a plugin to run Metaflow flows on k8s using Argo Workflows. It's similar to the AWS Step Functions plugin but generates an Argo WorkflowTemplate instead of an SFN StateMachine. It also adds an extra @argo decorator for specifying k8s resources:

    @argo(image='tensorflow/tensorflow:2.2.1-gpu-py3', nodeSelector={'gpu': 'nvidia-tesla-v100'})
    @resources(gpu=1, cpu=2, memory=6000)
    @step
    def training(self):
        ...

    Would you be interested in making such a plugin part of the Metaflow project?

    7 replies
    Antoine Tremblay
    @hexa00
    Is this error familiar to anyone? botocore.exceptions.HTTPClientError: An HTTP Client raised an unhandled exception: 'SSLSocket' object has no attribute 'connection'. I get it in File "/metaflow/metaflow/datastore/s3.py", line 178, in save_metadata, with requests-2.23.0. From get_pinned_conda_libs it seems it needs 2.24? Will try.
    6 replies
    Christopher Wong
    @christopher-wong

    I’ve been running into this issue quite a bit recently.

    AWS Batch Error: CannotCreateContainerError: Error response from daemon: devmapper: Thin Pool has 4115 free data blocks which is less than minimum required 4449 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior This could be a transient error. Use @retry to retry.

    Removing any EC2 instances controlled by Metaflow and letting the ASG create new ones seems to temporarily solve the problem, but it keeps reappearing. Any advice on how to mitigate this?

    5 replies
    Greg Hilston
    @GregHilston
    Hey guys, any thoughts as to what could cause a conda UnsatisfiableError when bootstrapping a conda environment, occurring when running locally on OSX but not when running remotely on Batch?
    10 replies
    Ville Tuulos
    @tuulos
    many of you have asked how Netflix uses Metaflow internally. Here's finally a blog article that shares some details that we couldn't share earlier https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f
    1 reply
    Peter Wilton
    @ammodramus
    Hi folks, what is the suggested way of running thousands of Batch jobs simultaneously? Attempting to run them using the standard python flow.py run causes the machine I'm running this on to go OOM after launching a Python process for each job. This is after setting --max-workers and --max-num-splits high enough to allow this many jobs.
    11 replies
    Antoine Tremblay
    @hexa00
    Is there a way to see the logs of step functions? It looks like logs are not enabled by default... (For context, my use case is running a job that takes 5 days, which unfortunately can't be run from a laptop since the connection to the original caller may break. So basically I'm trying to reproduce the normal Batch workflow with step functions...) Let me know if there's a better way...
    5 replies
    Antoine Tremblay
    @hexa00

    I got a very weird issue running Metaflow locally with a step that imports pandas. Running the code outside of Metaflow works fine, but the same code in the same env under Metaflow gives me:

     ImportError: cannot import name 'Collection' from 'typing' (/home/hexa/.cache/pypoetry/virtualenvs/nima-images-H6fn72k1-py3.7/lib/python3.7/site-packages/typing.py)

    Ideas? (Running it in AWS Batch works fine too...)

    5 replies
    Ville Väänänen
    @fortum-vaanavil
    Hi! I'd like to create a new custom Environment. Are there any examples on how to use the plugin mechanism?
    1 reply
    Antoine Tremblay
    @hexa00
    Hi there, just wanted to say thanks to all of you working on Metaflow. I was finally able to migrate key components of our architecture to it, and it now works great!
    Key lessons to date:
    • Using Docker images is a lot less trouble than conda, or than trying to use pip via a custom decorator
    • Using patch_env is great for getting debugging to work with session-based AWS keys
    • Dependency clashes due to Metaflow actually being part of your app can be really painful
    • Setting a minimum CPU count on the cluster can leave you with a GPU instance running instead of a pure-CPU one...
    • Using step functions for long-running processes is still a bit painful; can't wait for improvements there :)
    5 replies
    Greg Hilston
    @GregHilston

    Hey Metaflow, one unexpected discovery I've stumbled upon is that orchestration of the DAG is performed locally, even when running on AWS Batch.

    Additionally, I created an access list for the AWS API Gateway that our Metaflow API uses, as one layer of security.

    This means that we have to leave our data scientists' machines running and on the VPN during long-running training flows.

    Is there any configuration of Metaflow to allow the orchestration to be performed remotely and allow our development machines to be disconnected from the VPN or even shut down during execution?

    1 reply
    Christopher Wong
    @christopher-wong

    Has anyone run into intermittent errors from the Metaflow service?

    failed (code 500): {"message": "Internal server error"}

    This seems to happen on flows with multiple steps, where the pipeline starts fine and runs a few steps but then fails with the above error.

    17 replies
    Malay Shah
    @malay95
    Hello all, we want to deploy our flows to production using Metaflow and want to use Fargate for the compute environment. Can we set up the job definition and the other parameters required for submitting a Fargate job to AWS Batch? Is there any documentation on how to set up Fargate clusters for compute instead of an EC2 on-demand compute environment?
    7 replies
    Malay Shah
    @malay95
    Hello all, I was wondering why we set a TTL when we set up the DynamoDB table for step functions, and what a good value for it would be. I am not aware of how DynamoDB is used for step functions.
    7 replies
    Malay Shah
    @malay95
    I am creating a Fargate cluster for the metadata service. After following all the manual steps in the document, I see these errors in the CloudWatch events:
    /migration_service/migration_server.py:17: DeprecationWarning: loop argument is deprecated
    app = web.Application(loop=loop)
    /migration_service/migration_server.py:28: DeprecationWarning: Application.make_handler(...) is deprecated, use AppRunner API instead
    AttributeError: 'NoneType' object has no attribute 'cursor'
    /bin/sh: 1: metadata_service: not found
    4 replies
    Matej Války
    @enderstorm
    Hello all, I have just started working on data pipelines using the excellent Metaflow, but I am not sure how to chain flows, so that Flow B runs after Flow A, in an AWS environment (Batch + scheduled SFN). I tried to instantiate flows in one main flow, but that failed. Do you know any workarounds? Or do I have to create a single bigger flow, since all the other flows depend on the first one?
    waz-mataz
    @waz-mataz
    Hi. I would like to build a custom Docker image with all the requirements for Metaflow and some other custom dependencies on top, and use
    @batch(image='custom-image'). What would be the best way to build on top of the default image to ensure the requirements for Metaflow are preserved? How does Metaflow install the necessary dependencies on the default image?
    4 replies
    russellbrooks
    @russellbrooks
    kinda random but sharing in case it's useful for anyone else – when using SFN-based Metaflow executions with fanout steps, the step before the fanout will fail if curl is not installed in the image. It seems to be used to look up the DynamoDB host. FWIW, not an unreasonable dependency, but I was surprised to find that curl wasn't already baked into the continuumio/miniconda3:latest image. Once I installed it in the Docker image, the DynamoDB host resolution worked as expected.
    5 replies
    TDo13
    @TDo13

    Hello all, I'm trying to take a look at the artifacts associated with one of our SFN executions, but when calling:

    Step(...).task.data

    We run into the following error:

    ServiceException: Metadata request (/flows/{Flow}/runs/{Run}/steps/start/tasks/{Task}/artifacts) failed (code 500): 500 Internal Server Error

    Looking at the logs from our metadata service, I see the following error:

    Traceback (most recent call last):
        File "/opt/latest/lib/python3.7/site-packages/aiohttp/web_protocol.py", line 418, in start
            resp = await task
        File "/opt/latest/lib/python3.7/site-packages/aiohttp/web_app.py", line 458, in _handle
            resp = await handler(request)
        File "/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py", line 140, in get_artifacts_by_task
            artifacts.body)
        File "/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py", line 355, in _filter_artifacts_by_attempt_id
            attempt_id = ArtificatsApi._get_latest_attempt_id(artifacts)
        File "/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py", line 349, in _get_latest_attempt_id
            if artifact['attempt_id'] > attempt_id:
    TypeError: string indices must be integers
    4 replies
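    For context, the client-side call that triggers this looks roughly like the following (the pathspec values are placeholders):

    from metaflow import Step, namespace

    namespace(None)  # inspect runs regardless of the production namespace
    step = Step("MyFlow/sfn-12345/start")
    print(step.task.data)  # issues the artifacts metadata request that returns the 500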
    Itamar Turner-Trauring
    @itamarst
    @tuulos hi, stopping by from Hacker News
    32 replies
    waz-mataz
    @waz-mataz

    Hello, I'm trying to run Metaflow on a Docker Alpine image with Python and Node. I've installed the Metaflow-required dependencies in the image along with the other Node requirements for my use case, and ran my script using --with batch:image=my-custom-image. It resulted in this error:

    2020-12-23 00:14:06.109 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] Setting up task environment.
    2020-12-23 00:14:06.109 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] /usr/bin/python: No module named pip
    2020-12-23 00:14:06.109 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] sh: 5: unknown operand
    2020-12-23 00:14:08.204 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] sh: 5: unknown operand
    2020-12-23 00:14:08.204 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] tar: can't open 'job.tar': No such file or directory
    2020-12-23 00:14:08.205 [4747/start/30948 (pid 15530)]     AWS Batch error:
    2020-12-23 00:14:08.439 [4747/start/30948 (pid 15530)]     Essential container in task exited This could be a transient error. Use @retry to retry.
    2020-12-23 00:14:08.440 [4747/start/30948 (pid 15530)]
    2020-12-23 00:14:08.791 [4747/start/30948 (pid 15530)] Task failed.

    My question is: pip and the required Python dependencies are installed in the container, so what is causing the No module named pip error? Thanks

    14 replies
    Antoine Tremblay
    @hexa00

    Has anyone had problems with multi-GPU on AWS Batch? I get errors like:

     CannotStartContainerError: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr This could be a transient error. Use @retry to retry.

    1 GPU works fine

    22 replies
    Sonu Patidar
    @skamdar
    @savingoyal Are you guys planning to support EMR on Metaflow as well?
    16 replies
    David Patschke
    @dpatschke
    Is there a way to pass a custom environment variable to a Metaflow AWS Batch job? I've seen several recommendations on this board and tried them all, but none of them work for me ... well, none that don't expose the environment variable via the command line. I think @russellbrooks showed an example with CONDA_CHANNELS, but that doesn't work for me.
    @tuulos You mentioned prepending METAFLOW_RUN_ to the desired variable, but this doesn't seem to bring the variable into the AWS Batch environment for me.
    @savingoyal You mentioned using the environment decorator, but I'm getting a linting error when attempting to use that. Then, when I use --no-pylint to override, none of my flow steps work.
    I just want to be able to os.environ.get a custom environment variable from within one of my Metaflow steps, one that was set in my local environment and passed through to the AWS Batch environment. I feel like I'm missing something rather obvious.
    Thanks in advance for the help!
    11 replies
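    For reference, the environment-decorator approach being discussed would look roughly like this sketch (MY_VAR is a placeholder); the value is read on the machine launching the flow and injected into the task's environment:

    import os
    from metaflow import FlowSpec, step, environment

    class EnvFlow(FlowSpec):
        # captured locally at launch time, shipped to the task (local or Batch)
        @environment(vars={"MY_VAR": os.environ.get("MY_VAR", "")})
        @step
        def start(self):
            print("MY_VAR inside the task:", os.environ.get("MY_VAR"))
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        EnvFlow()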
    russellbrooks
    @russellbrooks

    hey guys – curious if anyone else would find value in exposing the Batch job parameter for sharedMemorySize? It looks like the AWS Batch team added the parameter towards the end of last year, and it's a passthrough to docker run --shm-size, which can really speed up the performance of PyTorch parallel dataloaders (especially to saturate multiple GPUs) and some boosting libraries.

    ECS defaults the instance shm to 50% of the memory allocation, but Docker will only expose 64MB of that by default to running containers.

    6 replies
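    For anyone unfamiliar with the knob, the Batch parameter is a passthrough to Docker's --shm-size, i.e. roughly the equivalent of (image name hypothetical):

    docker run --shm-size=8g my-training-image python train.py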
    seanv507
    @seanv507
    Hi, is there an update on retroactive editing of tags? My use case is that we would want to label a run as "official" after human inspection, and to link flows together (e.g. a data-preprocessing flow followed by a model-run flow). I would like to tag the data_preprocessing flow used for a given model_run.
    2 replies
    Vinicius Agostini
    @viagostini
    Hey guys, I was wondering if there is a way to make a Flow trigger another Flow, in order to reuse them as components of a bigger system, or whether it's on the roadmap. I couldn't find anything about it.
    12 replies
    NeeleshG
    @neeleshg
    Hi Guys,
    I want to try Metaflow IDS on AWS infra.
    However, when I checked the AMI in the Marketplace, it was last updated in 2018.
    Do we have an updated AMI?
    2 replies
    Ville Tuulos
    @tuulos

    📣 Metaflow was just included in Netflix's security bug bounty program! Find vulnerabilities in the code and get paid for it 💰 (Or just enjoy Metaflow getting more secure over time.)

    https://bugcrowd.com/netflix/updates/59a4e5dc-5e79-4965-9289-ae5a0d9de044

    Greg Hilston
    @GregHilston

    Hey Metaflow! I have a pretty specific question:

    I'm having trouble running a flow on AWS Batch that uses a container with pre-installed Python libraries. I happen to be using conda to install a few extra libraries in this step, but by doing so, it seems I now have a fragmented environment.

    Any advice on how one can use a Docker container as a base environment and then add a few more packages in a specific step using conda?

    The success criterion here would be to successfully import a package installed by the Docker image as well as a different package installed by the conda decorator.

    9 replies
    russellbrooks
    @russellbrooks
    Sharing a difference in the behavior of --max-workers between the local runtime and SFN deployments, specifically with nested foreach fanouts. Locally, the runtime enforces the parallelization at the task level, so it will never go beyond that limit; on SFN, however, the concurrency limit is enforced per split, so a nested fanout will result in an effective parallelism of max-workers^2. Similarly, normal fanouts in an SFN deployment are not rate limited. Not sure it's worth explicitly stating this in the docs, but thought I'd mention it just in case.
    2 replies
    Christopher Wong
    @christopher-wong

    I just noticed Batch has started hitting the Docker free tier rate limit. What’s the best way to mitigate this?

    CannotPullContainerError: Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

    Any chance we can get a copy of the Metaflow docker image hosted on the new Public ECR repos?

    4 replies