    Ahmad Houri
    ur env there then batch just downloads & run the ima
    Hey all! I was recently introduced to Metaflow and I have a few questions, if anyone can help me through them. Does Metaflow provide data labelling, explainability features, and team collaboration, and is it open source?
    3 replies


    I am passing my image via python yo.py step-functions create --with batch:image=bla. Is there any way to pass runtime variables to that image? Thanks in advance!

    6 replies
    Greg Hilston

    I'm experiencing some problems when trying to install pytorch with CUDA enabled.

    I'm running my flow on AWS Batch, powered by a p3.2xlarge machine and using the image


    to get the NVIDIA driver installed.

    The relevant flow code looks like:

    from metaflow import FlowSpec, batch, conda, resources, step

    class FooFlow(FlowSpec):
        @batch(image=URL ABOVE)
        # this line below is of most interest
        @conda(libraries={"pytorch": "1.6.0", "cudatoolkit": "11.0.221"})
        @resources(memory=4*1024, cpu=2, gpu=1)
        @step
        def test_gpu(self):
            import os
            print(os.popen("nvcc --version").read())
            import torch

    I'm not convinced this is precisely a Metaflow issue, but the common solutions one finds when Googling involve installing PyTorch using the conda CLI, which the @conda decorator obviously abstracts away from us.

    I've been running many flows with different versions of pytorch and cudatoolkit.

    Torch not compiled with CUDA enabled

    I'm familiar with the Github Issue: Netflix/metaflow#250

    Any advice at all?

    19 replies
    Taleb Zeghmi

    We’re working on creating a @notify() decorator that could send a notification upon success or failure, per Flow or per Step. It could send email or Slack messages.

    It would be up to the scheduler (local, AWS Step Functions, KFP) to honor the @notify decorator.

    @notify(email_address="oncall@foo.com", on="failure")
    @notify(email_address="ai@foo.com", on="success")
    class MyFlow(Flow):
       @notify(slack_channel="#foo", on="success")
       def my_step(self):

    To implement this I’d like to introduce a new Metaflow concept, a @finally step.

    class MyFlow(Flow):
       def finally_step(self, status):
          status  # we need a way to message Success or Failure
    7 replies
    Hi, I have recently been working with Metaflow and am not able to access the previous flows of other members of my group using namespaces. I just want to make sure I am not missing anything regarding namespaces. Any help is appreciated. Thanks
    1 reply


    How can I pass @Parameters values different from the defaults to step-functions create?
    I know step-functions trigger can take any @Parameters defined in a pipeline Python file, but that is valid only for that one run.
    What I want to do is pass @Parameters to the cron schedule in AWS EventBridge dynamically.

    9 replies
    Just wanted to say - fantastic work on this. Can't wait for the addition of some of the new features, particularly the graph composition and inclusion of external modules (symlink).
    15 replies
    Ayotomiwa Salau
    Hello guys, I am pretty new to the Metaflow community. How do I start contributing?
    8 replies
    Daniel Perez

    Hey guys, I've been using Metaflow for a bit over a year now, and I've recently started to integrate our deployment with AWS Batch for the scale-out pattern. I'm now able to execute flows with some steps that run in Batch; however, I don't see the ECS cluster ever scaling back down.

    To elaborate, my compute environment has the following settings: min vCPUs = 0, desired vCPUs = 0, max vCPUs = 32.

    When I run a flow, a job definition gets added to the job queue, an instance gets started in the cluster, and the task runs and finishes fine. But the job definition stays "Active", and the instance seems to stay up indefinitely inside the cluster until I manually "deregister" the job definition.

    Is this the way it's designed? or am I missing something in the way I configured my Compute environment?

    Is metaflow supposed to update the job definition after a flow finishes?

    5 replies

    Hey guys, would anyone find it useful to expose the Batch parameter for ulimits? It's a list of dicts that maps to the --ulimit option of docker run. In particular, I've noticed that the ECS/Batch default ulimit for the number of open files per container is 1024/4096. With this option, it could potentially be increased up to the daemon limit using:

    ulimits=[{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}]


    2 replies
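To see which open-file limit a container actually received, a step could print the limits it sees. A minimal sketch using only Python's standard `resource` module (Unix only); the helper name `nofile_limits` is made up for illustration and is not part of Metaflow or AWS Batch:

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={soft} hard={hard}")
```

Running this inside and outside the container makes it easy to compare the effective limits before and after changing the ulimit settings.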

    FWIW this can be set via a launch template for the batch compute environment/ECS cluster, so it's not a necessity and also is a bit ugly for a decorator which is why I ask :sweat_smile:. As an example of what this looks like in a launch template:

    Content-Type: multipart/mixed; boundary="==BOUNDARY=="
    MIME-Version: 1.0

    --==BOUNDARY==
    Content-Type: text/cloud-boothook; charset="us-ascii"

    cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --default-ulimit nofile=65535:1048576"' >> /etc/sysconfig/docker

    --==BOUNDARY==--


    4 replies
    Ayotomiwa Salau
    Hello guys, can I get links to resources on working with model and data versioning in Metaflow? I can't seem to find them in the documentation.
    2 replies
    Corrie Bartelheimer
    Hey folks, a question regarding retries. We frequently see some jobs fail first and then succeed on a second try, so we definitely need the retry functionality. However, we are a bit wary that this also means any code error in this step would lead to retries. Since the step runs in parallel, this could mean a few thousand jobs retried, which would incur unnecessary costs. What would be the best way to handle such a situation?
    2 replies
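One way to frame the trade-off, sketched in plain Python rather than with Metaflow's decorators: retry only exceptions known to be transient and let everything else fail on the first attempt. The names `TRANSIENT_ERRORS` and `run_with_retries` are hypothetical, and which exception types count as transient is an assumption that depends on the job.

```python
import time

# Assumption: these exception types represent transient infrastructure hiccups.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def run_with_retries(task, max_attempts=3, delay_seconds=0.0):
    """Call task(); retry only on TRANSIENT_ERRORS, up to max_attempts calls."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: surface the transient error
            time.sleep(delay_seconds)
        # Any other exception (e.g. a bug raising ValueError) propagates
        # immediately, so it is never retried.
```

With this shape, a code error costs one failed attempt per job instead of a full set of retries.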
    Matt McClean
    Hi all. I created an issue around supporting AWS Batch multi-node here: Netflix/metaflow#444 . Would be useful for anyone wanting to deploy distributed training jobs on more than one instance
    2 replies

    I'm having issues with metaflow not finding previous runs. I am trying to do this via a jenkins pipeline. I have a training flow that I'm trying to reference in my inference flow. I am reading in the same config file. The strange part is that the same flows/runs are available before I read in the config file as are available after I've read in the config file. So before I read in the config file I get the following (setting namespace to none to check all available flows)

    get_namespace(): None
    list(Metaflow()): [Flow('training_flow_1')]
    metaflow_config.DATASTORE_SYSROOT_S3: None
    get_metadata(): local@/home/ec2-user/workspace/models-training_staging

    After I read in the config file

    get_namespace(): None
    list(Metaflow()): [Flow('training_flow_1')]
    metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-j8dasuvadiq/metaflow
    get_metadata(): local@/home/ec2-user/workspace/models-training_staging

    So the metadata is still showing local which I think may be related to the issue, but the DATASTORE_SYSROOT_S3 is updated after the config is read in so it definitely is reading in the file. But trying to find something run in a production namespace (i.e. that I ran via stepfunctions) returns an empty list.

    When I try to run my inference flow I get the following (again after reading in the config and setting namespace to none):

    get_namespace(): None
    metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-v49t2hau629c/metaflow
    get_metadata(): local@/home/ec2-user/workspace/models-inference_staging
    list(Metaflow()): []

    So it seems the issue here is that even though the config file is read in and DATASTORE_SYSROOT_S3 is set correctly, the metadata points to a local folder, which is different between training and inference. So they are isolated. I tried setting the metadata manually by using the ServiceUrl from my cloudformation stack:


    but I get the error

    Metaflow service [https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api] unreachable.

    Any idea what's going on here? Again, from the metadata and the fact that the same flows are listed before and after I read in the config file it seems like it is somehow ignoring the config settings when reading/writing flows, so I am unable to find my training run when I'm running my inference flow. Thanks

    10 replies
    Snehal Shirgure
    Hello folks, has anyone looked into inspecting data and metrics from runs using a visualization tool such as ipywidgets in jupyter notebooks? Any suggestions/ideas on the same lines are welcome :)
    3 replies
    Malay Shah
    Hey guys, I am looking into the implementation of metaflow and how metaflow interacts with batch and other services of aws. I wanted to look at the code for that but could not find the code related to the same. Can anyone point me to the script or class that handles all the interaction with aws? Thank you very much.
    2 replies
    Are there any plans to allow batch steps to reuse the same container? At the moment it feels like we get a massive slowdown when moving from local runs to batch runs, because each step suddenly requires the whole batch scheduling, draw-down of containers, etc. I am keen to have the flexibility of demanding extra resources for a particular step, but oftentimes consecutive steps don't need additional resources. [Obviously one can collapse these steps together, but then one loses the whole retry functionality.]
    19 replies
    Ayotomiwa Salau
    Hello, I was trying out metaflow in a notebook, I got this error "Flow('PlayListFlow') does not exist". I can't find a way to instantiate/create a flow in a notebook.
    2 replies
    Kyle Smith
    Hello, I'm doing some initial work for a manual deployment. It's important that we only use a private subnet. Is this feasible? Why does the default installation include a public subnet?
    3 replies
    Elham Zamansani
    Hey guys, I have a problem importing the environment decorator. I suspect there is a bug, because when I import it as follows: from metaflow import FlowSpec, step, environment, it gives an error that environment is not callable. That makes sense, because with this import Metaflow wants to read from the environment.py script. I did a small test: if I change the name environment in line 27 of environment_decorator.py to anything else and then import that, it works. Could you please check it, or correct me if I am missing something regarding the import?
    9 replies
    Analysing job run times. Hi, we would be interested in monitoring AWS Batch run times, ideally within CloudWatch. https://docs.aws.amazon.com/batch/latest/userguide/batch_cwet.html provides a very useful stream of information. Metaflow provides e.g. run_id (https://github.com/Netflix/metaflow/blob/04881c58c22e4e7e66a4faa7f676fcfca454c027/metaflow/plugins/aws/batch/batch.py#L127), which appears in this stream. My question is how we can cross-reference this to e.g. run parameters, so that we can get aggregate statistics for e.g. a given parameter.
    10 replies
    Robert Sandmann
    Dear metaflow team, first of all I want to thank you for this great piece of technology! It's quite amazing how insanely easy you make it for our data scientists to define their workflows, especially when you look at the internals on how you realize that, great work!
    A use case that we are currently trying to implement is using Metaflow to set up a workflow for federated learning. That means that we do the client training in a foreach for every client, which works great. The problem now is that we need to repeat these federated training rounds, which introduces cycles into the DAG.
    Our current approach is to monkeypatch metaflow internals to allow cycles in the DAG and dynamically add new steps for every round using a custom FlowDecorator.
    This approach seems rather hacky (and is not yet quite working).
    Branch specific concurrency (Netflix/metaflow#172) or graph composition (https://github.com/Netflix/metaflow/issues/144) might make our lives easier.
    But I was wondering if you had any ideas on how to make this possible in the current state of metaflow. I'm grateful for any hints!
    2 replies
    Patrick John Chia

    Hello! @christineyu-coveo and I have been using Metaflow recently and really enjoy it. We also face another issue related to using @batch and @environment.

    Consider the following

    def step_A(self):
    def step_B(self):

    Metaflow initializes decorators for all steps before running any step. For @environment this includes running step_init, where it updates the environment variables based on the vars passed in the decorator. Following the above flow, when we are running step_A, the environment decorator for step_B will also be initialized, and an exception will occur because var_2 is None in the batch environment for step_A, since it was not included in the @environment decorator for step_A. Our current fix involves disabling step_init entirely for @environment. While this works for our use case (i.e. >1 @batch steps, with use of @environment in either or both @batch steps), I suspect this might disable some of the other use cases of @environment. Do you have any alternative solutions to this problem? Perhaps the batch decorator could be modified to also allow for inclusion of environment variables that we want to ship with the job.

    7 replies

    Metaflow could not install or find CUDA in the GPU environment, and PyTorch could not use the GPU at all. The issue was marked as resolved in Netflix/metaflow#250, but I could not replicate the fix.

    Sample code (test_gpu.py) I used:

    from metaflow import FlowSpec, step, batch, IncludeFile, Parameter, conda, conda_base

    class TestGPUFlow(FlowSpec):
        @batch(cpu=2, gpu=1, memory=2400)
        @conda(libraries={'pytorch': '1.5.1', 'cudatoolkit': '10.1.243'})
        @step
        def start(self):
            import os
            import sys
            import torch
            from subprocess import call
            print(os.popen("nvcc --version").read())
            print('__Python VERSION:', sys.version)
            print('__pyTorch VERSION:', torch.__version__)
            print('__CUDA VERSION')
            print('__CUDNN VERSION:', torch.backends.cudnn.version())
            print('__Number CUDA Devices:', torch.cuda.device_count())
            call(["nvidia-smi", "--format=csv",
            print('Active CUDA Device: GPU', torch.cuda.current_device())
            print('Available devices ', torch.cuda.device_count())
            print('Current cuda device ', torch.cuda.current_device())
            print(f"GPU count: {torch.cuda.device_count()}")
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        TestGPUFlow()

    cmd line I used

    USERNAME=your_name CONDA_CHANNELS=default,conda-forge,pytorch METAFLOW_PROFILE=your_profile AWS_PROFILE=your_profile python test_gpu.py --datastore=s3 --environment=conda run --with batch:image=your_base_image_with_cuda_support

    metaflow output

    2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: N/A      |
    2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |-------------------------------+----------------------+----------------------+
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |                               |                      |               MIG M. |
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |===============================+======================+======================|
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | N/A   43C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |
    2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |                               |                      |                  N/A |
    2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-------------------------------+----------------------+----------------------+
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38]
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-----------------------------------------------------------------------------+
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Processes:                                                                  |
    2021-03-10 18:38:13.788 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |  GPU   GI   CI        PID   Type   Process name                  GPU M

    Any idea what's wrong?

    19 replies
    Jacopo Tagliabue
    Hi MF community, small request for feedback! We just posted a brief article with code on re-imagining model cards in a DAG-first world. Looking for honest feedback: if you like "DAG cards", we may invest some time in building a configurable package and release it (ping me anytime).
    4 replies
    Taleb Zeghmi
    Ayotomiwa Salau
    Hello guys,
    I'm battling with this connection error. I restarted ports 443 and 8080, to no avail.
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='42xg9kw0rk.execute-api.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: /api/flows/MovieStatsFlow (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fec6ace2358>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    4 replies
    Anirudh Kaushik
    Metaflow seems to ignore @retry decorators on a @catch'd step if it's followed by a @catch'd step. If I've got start -> @retry @catch A -> @catch B -> end, and step A raises an exception, A won't retry at all. The flow goes to step B. Is this normal?
    5 replies
    Vishal Siramshetty

    Hi all,

    I'm having an issue with IncludeFile. When I'm trying to pass my training data file as an input, it throws an error:

    AttributeError: 'bytes' object has no attribute 'path'

    I am trying to read the file in one of the steps using Pandas. I'd really appreciate any suggestions to deal with this issue.

    Thank you,

    8 replies
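The error message suggests the value being read is already the file's contents (a `bytes` object) rather than a path. Assuming that is the case here, the contents can be wrapped in an in-memory buffer before parsing. This sketch uses the standard `csv` module to stay self-contained; the function name `rows_from_bytes` and the sample data are made up for illustration.

```python
import csv
import io


def rows_from_bytes(raw, encoding="utf-8"):
    """Parse CSV held in memory as bytes into a list of dict rows."""
    text_buffer = io.StringIO(raw.decode(encoding))
    return list(csv.DictReader(text_buffer))


if __name__ == "__main__":
    sample = b"feature,label\n1.5,0\n2.5,1\n"  # illustrative data only
    print(rows_from_bytes(sample))
```

With pandas, the analogous approach would be to pass a file-like object, for example `pandas.read_csv(io.BytesIO(raw))`, instead of a path.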
    Richard Puckett
    Is there a best-practice way to deploy Metaflow into an environment that has no inbound access from the Internet? Everything would be run from within the VPC. Thanks!
    3 replies
    Ryan Chui

    In the step function role in the metaflow cloudformation template:

            - PolicyName: AllowCloudwatch
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Sid: CloudwatchLogDelivery
                    Effect: Allow
                    Action:
                      - "logs:CreateLogDelivery"
                      - "logs:GetLogDelivery"
                      - "logs:UpdateLogDelivery"
                      - "logs:DeleteLogDelivery"
                      - "logs:ListLogDeliveries"
                      - "logs:PutResourcePolicy"
                      - "logs:DescribeResourcePolicies"
                      - "logs:DescribeLogGroups"
                    Resource: '*'

    What is the action logs:PutResourcePolicy used for?

    8 replies
    Corrie Bartelheimer

    Hey, I want to dynamically change the required resources for a step and found this example Netflix/metaflow#431 for a workaround:

    @resources(cpu=8, memory=os.environ['MEMORY'])

    and then starting the flow with MEMORY=16000 python myflow.py run. This works fine locally but fails when running with batch. Am I missing something?
    Or is there any other way to change the resources using parameters or similar without creating different sized steps?

    9 replies
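Decorator arguments are evaluated whenever the module is imported, so `os.environ['MEMORY']` raises `KeyError` in any process where the variable is unset. Whether that is what happens in the Batch container is an assumption, not a confirmed diagnosis. A minimal sketch of a fallback, using a stand-in `resources` decorator (hypothetical, not Metaflow's) so that it runs on its own:

```python
import os


def resources(**requested):
    """Stand-in decorator (NOT Metaflow's): records the requested resources."""
    def wrap(func):
        func.requested = requested
        return func
    return wrap


# os.environ.get falls back to a default instead of raising KeyError.
DEFAULT_MEMORY_MB = 4000  # assumed default, chosen for illustration
MEMORY_MB = int(os.environ.get("MEMORY", DEFAULT_MEMORY_MB))


@resources(cpu=8, memory=MEMORY_MB)
def train():
    return "trained"
```

The fallback only keeps the import from failing; the value used where the variable is unset is the default, not the one passed on the command line.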
    Greg Hilston

    Hey @savingoyal and other Metaflow developers, myself and some colleagues are getting to the point where we'll have a PR ready for the metaflow-tools repo. This PR will add a deploy-able Terraform stack.

    We've read through the CONTRIBUTING.md file and found this older issue that documents asking for a Terraform stack:


    Our goal is to have this PR submitted by the end of this week and just wanted to start the dialogue with you guys. Super excited to see what happens :)

    7 replies

    Hi Guys,

    Could you provide some sort of diagram of the AWS resources required to run Metaflow in the cloud? The CloudFormation template is not much help, since the YAML file is huge.

    4 replies
    Antoine Tremblay
    Hi, I just realized that Metaflow doesn't show output that comes from the standard logging module. For example, calls to logging.info("something") are not printed out. Is there a way to make those print?
    2 replies
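Independent of Metaflow, Python's root logger defaults to the WARNING level, so `logging.info(...)` produces no output until logging is configured. Whether this fully explains the behavior described above is an assumption. A minimal sketch that sends INFO-level records to stdout; the function name `configure_logging` is made up:

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Route root-logger records at `level` and above to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return handler


if __name__ == "__main__":
    configure_logging()
    logging.info("something")
```

Calling `configure_logging()` at the start of the code that logs would be the natural place to try this.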
    David Patschke
    Is there a way to launch a Flow run via command-line with --namespace that sets the run to the global namespace?
    I tried the suggestion in the CLI help recommending the empty string (--namespace=) but when I run get_namespace() within the Flow (or current.namespace), I'm still getting the user namespace. I also tried setting it to --namespace=None but that uses the string 'None' vs. NoneType. As per the Metaflow docs, I'm hesitant to hardcode namespace(None) into my code as a workaround.
    5 replies
    Ayotomiwa Salau
    I'm battling with this server connection error. I restarted ports 443 and 8080, to no avail.
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='42xg9kw0rk.execute-api.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: /api/flows/MovieStatsFlow (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fec6ace2358>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    I pip-installed Metaflow locally. Please assist.
    17 replies
    David Patschke
    Would it be possible to upgrade the version of pylint that Metaflow requires for the next release?
    It is currently at < 2.5.0.
    I've started to get a nasty pylinting error that is associated with pandas. This issue appears to be resolved in pylint 2.7. Here is a link to the issue I'm experiencing: PyCQA/pylint#3836
    4 replies
    Richard Puckett
    Sorry if I missed any previous questions on this, but I'm curious if I'm doing something wrong here. Seems execution stops after instantiating TestFlow (see gist). Is this expected behavior? Thanks. https://gist.github.com/rapuckett/b5355828695d1f7711400ddd837c5ede
    13 replies
    Sam Petulla
    Does anyone know the expected date for supporting Sagemaker Models?
    5 replies
    Anyone have any good dashboard notebooks for inspecting runs in progress?
    I've just been using the one from the metaflow tutorial, but it's pretty barebones and I'm wondering if anyone came up with something nicer
    6 replies
    Hi guys.
    I created a step function for my flow and specified CPU and RAM for steps using the resources decorator, but after the run I noticed that it ignores values lower than the Metaflow defaults (cpu: 1, memory: 4000); to be more precise, the step runs on the default container. The situation is different when using the batch decorator, which does allow smaller containers. Is anyone else facing this issue, or does anyone have an idea why we can't specify smaller resources when running a flow as a step function?
    7 replies

    Hey All, Quick question,

    Does anyone know if there are any PyCharm settings or PyCharm plugins for having your IDE check whether all the right dependencies have been loaded at the @step level?

    1 reply