    Ritesh Agrawal
    why code has all the metaflow examples in it
    4 replies
    Kyle Smith
    If a step both creates a bunch of tasks with a foreach and branches to another step, will all the tasks created by this step execute in parallel?
    9 replies

    Hi, I am working on getting the Metaflow artifacts from S3. The code is deployed on AWS Lambda. I set the environment variable “METAFLOW_DATASTORE_SYSROOT_S3” to the S3 location. Our use case requires us to change this datastore environment variable in every iteration so that different flows' and runs' artifacts can be accessed, as follows:

    def _queryMetaflow(self, appName, starflowResp):
        metaflow_run_id = starflowResp["details"]["frdm"]["metaflowRunNumber"]
        metaflow_name = starflowResp["details"]["frdm"]["metaflowId"]

        os.environ['METAFLOW_DATASTORE_SYSROOT_S3'] = "{}/artifacts/{}/higher".format(getMetadataLocation(), appName)
        from metaflow import Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
        metadata1 = metadata(getMetadataURL())
        mf = Metaflow()
        # call Metaflow, fetch the run's artifacts, and return them or raise on error
        try:
            metaflowResp = Run(metaflow_name + '/' + metaflow_run_id).data
            del Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
            return metaflowResp
        except Exception as e:
            print("Exception occurred in query metaflow: {}".format(e))
            raise CapAppFailure("Exception occurred in metaflow response; S3 datastore operation _get_s3_object likely failed")

    When this method is called, it doesn’t fail in the first iteration but fails in the second. I inspected the environment variable and the location is correct in every iteration, but this error is encountered in the second iteration:
    S3 datastore operation _get_s3_object failed (An error occurred (404) when calling the HeadObject operation: Not Found). Retrying 7 more times..

    I am unable to fix this issue. Can you please help?

    4 replies
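    One plausible cause worth checking (an assumption, not confirmed in this thread): Python caches modules after the first import, so if the datastore root is read once at import time, the re-import inside the method on later iterations is a no-op and the first value sticks, no matter what os.environ says. A minimal stand-alone illustration of that capture-once pattern (not Metaflow code):

    import os

    # Stand-in for a config object that captures an environment variable once,
    # the way import-time module config does. Later os.environ updates in
    # subsequent Lambda invocations are never seen by the cached object.
    class CachedConfig:
        def __init__(self):
            self.sysroot = os.environ.get("METAFLOW_DATASTORE_SYSROOT_S3")

    os.environ["METAFLOW_DATASTORE_SYSROOT_S3"] = "s3://bucket/app-1/higher"
    cfg = CachedConfig()                       # captures app-1

    os.environ["METAFLOW_DATASTORE_SYSROOT_S3"] = "s3://bucket/app-2/higher"
    print(cfg.sysroot)                         # still s3://bucket/app-1/higher

    If that is what is happening, the second iteration would look up artifacts under the first app's root and get the 404 seen above.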
    Kyle Smith

    Hello Netflix employees, can someone please share about Metaflow's adoption at Netflix? In late 2018 it was used in 134 projects, how has it grown since then? What percentage of Netflix data scientists use metaflow?

    We're considering Metaflow at my organization, so I'd just like to get a sense of the adoption rate we can hope for at my employer.

    7 replies
    Matt McClean
    Hi there. I'm new to Metaflow and trying to run the tutorial Episode 8 Autopilot, but I'm getting the following error message in the AWS Batch job when the step function is triggered: ModuleNotFoundError: No module named 'pandas'. I tried running with the commands python 02-statistics/stats.py --environment=conda step-functions create --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}" as well as python 02-statistics/stats.py step-functions create --max-workers 4 and both give the same error message.
    However if I run the command python 02-statistics/stats.py --environment conda run --with batch --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}" it works fine.
    3 replies
    Matt McClean
    How can I switch my local machine to run Metaflow on AWS? I have already run the CloudFormation template to set up the stack and can run Metaflow commands from the SageMaker notebook instance. However, when I run metaflow configure aws --profile dev on my local machine and then metaflow configure show, it still says Configuration is set to run locally.
    4 replies
    Kelly Davis

    I am attempting to run step-functions create and getting the following error: AWS Step Functions error: ClientError("An error occurred (AccessDeniedException) when calling the CreateStateMachine operation: 'arn:aws:iam::REDACTED:role/metaflow-step_functions_role' is not authorized to create managed-rule.")

    I am specifying METAFLOW_SFN_IAM_ROLE=arn:aws:iam::REDACTED:role/metaflow-step_functions_role in my metaflow config.

    The role is being created via terraform, but is based on https://github.com/Netflix/metaflow-tools/blob/master/aws/cloudformation/metaflow-cfn-template.yml#L839. That role does not have a grant for states:CreateStateMachine but even if I add that, I still get the same error.

    Any tips for troubleshooting this?

    2 replies
    Corrie Bartelheimer
    I created a step function flow using python flow.py --with retry step-functions create --max-workers 1000, but when triggering the flow it only runs a maximum of 40 tasks in parallel. When running the flow without Step Functions on Batch it worked fine. Any ideas what could be the reason for this throttling?
    5 replies
    Taleb Zeghmi
    Has anybody thought about how the Metaflow datastore interacts with CCPA data compliance? For example, the ability to remove customer data at the customer’s behest, unless the data expires or no longer exists after 28 days?
    15 replies
    What is the correct way to set a specific namespace before doing a local run? We have several people running Metaflow locally on SageMaker instances, which defaults to the username 'ec2-user'. Using namespace('user:[correct username]') does not change the namespace used for the actual local run; it seems to just affect the namespace used for inspecting results. Thanks
    2 replies
    Ahmad Houri
    Hey All! I was recently introduced to Metaflow and I have a few questions, if anyone can help me through them. Does Metaflow provide data labelling? An explainability feature? Team collaboration? And is it open source?
    3 replies


    I am passing my image to python yo.py step-functions create --with batch:image=bla. Are there any ways to pass runtime variables to that image? thanks in advance!

    6 replies
    Greg Hilston

    I'm experiencing some problems when trying to install pytorch with CUDA enabled.

    I'm running my flow on AWS Batch, powered by a p3.2xlarge machine and using the image


    to get the NVIDIA driver installed.

    The relevant flow code looks like:

    class FooFlow(FlowSpec):
        @batch(image=URL ABOVE)
        # this line below is of most interest
        @conda(libraries={"pytorch": "1.6.0", "cudatoolkit": "11.0.221"})
        @resources(memory=4*1024, cpu=2, gpu=1)
        @step
        def test_gpu(self):
            import os
            print(os.popen("nvcc --version").read())
            import torch

    I'm not convinced this is precisely a Metaflow issue, but the common solutions one finds when Googling involve installing PyTorch using the conda CLI, which the @conda decorator obviously abstracts away from us.

    I've been running many flows with different versions of pytorch and cudatoolkit, and I keep hitting the same error:

    Torch not compiled with CUDA enabled

    I'm familiar with the Github Issue: Netflix/metaflow#250

    Any advice at all?

    19 replies
    Taleb Zeghmi

    We’re working on creating a @notify() decorator that could send a notification upon success or failure, per Flow or per Step. It could send email or slack messages.

    It would be up to the scheduler (local, AWS Step Functions, KFP) to honor the @notify decorator.

    @notify(email_address="oncall@foo.com", on="failure")
    @notify(email_address="ai@foo.com", on="success")
    class MyFlow(Flow):
       @notify(slack_channel="#foo", on="success")
       def my_step(self):

    To implement this I’d like to introduce a new Metaflow concept, a @finally step.

    class MyFlow(Flow):
       def finally_step(self, status):
          status  # we need a way to message Success or Failure
    7 replies
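    A rough sketch of how a scheduler might honor such decorators (all names here are hypothetical, mirroring the proposal above): the scheduler collects the declared notification specs, and in the finally-style step dispatches only those whose "on" condition matches the run's final status.

    def matching_targets(specs, status):
        """Return notification targets whose 'on' condition matches the final run status."""
        return [s["target"] for s in specs if s["on"] == status]

    # Specs as they might be gathered from the @notify decorators above
    specs = [
        {"target": "oncall@foo.com", "on": "failure"},
        {"target": "ai@foo.com", "on": "success"},
        {"target": "#foo", "on": "success"},
    ]

    print(matching_targets(specs, "success"))  # ['ai@foo.com', '#foo']
    print(matching_targets(specs, "failure"))  # ['oncall@foo.com']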
    Hi, I have recently been working with Metaflow, and am not able to access the previous flows of other members of my group using namespace. Just want to make sure I am not missing anything regarding namespaces; any help is appreciated. Thanks
    1 reply


    How can I pass @Parameters different from the defaults to step-functions create?
    I know step-functions trigger can take any @Parameters in a pipeline Python file, but this is valid only for that run.
    What I want to do is pass @Parameters to the cron schedule in AWS EventBridge dynamically.

    9 replies
    Just wanted to say - fantastic work on this. Can't wait for the addition of some of the new features, particularly the graph composition and inclusion of external modules (symlink).
    15 replies
    Ayotomiwa Salau
    Hello guys, I am pretty new to the Metaflow community. How do I start contributing?
    8 replies
    Daniel Perez

    Hey guys, been using Metaflow for a bit over a year now, and I've recently started to integrate our deployment with AWS Batch for the scale-out pattern. I'm now able to execute flows with some steps that run in Batch; however, I don't see the ECS cluster ever scaling back down.

    To elaborate, my compute environment has the following settings: min vcpus = 0, desired vcpus = 0, max vcpus = 32.

    When I run a flow, a job definition gets added to the job queue, an instance gets started in the cluster, and the task runs and finishes fine, but the job definition stays "Active" and the instance seems to stay up indefinitely inside the cluster until I go and manually "deregister" the job definition.

    Is this the way it's designed, or am I missing something in the way I configured my compute environment?

    Is metaflow supposed to update the job definition after a flow finishes?

    5 replies

    Hey guys, would anyone find it useful to expose the Batch param for ulimits? It's a list of dicts that maps to the --ulimit option of docker run. In particular, I've noticed that the ECS/Batch default ulimit for the number of open files per container is 1024/4096. With this option, it could potentially be increased up to the daemon limit using:

    ulimits=[{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}]


    2 replies
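    For reference, each entry in that list maps one-to-one onto a docker run --ulimit flag; a small sketch of the translation (plain Python, not Metaflow code):

    # Translate a Batch-style ulimits list into docker run --ulimit flags.
    def to_docker_ulimit_flags(ulimits):
        return ["--ulimit {name}={softLimit}:{hardLimit}".format(**u) for u in ulimits]

    flags = to_docker_ulimit_flags(
        [{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}]
    )
    print(flags)  # ['--ulimit nofile=65535:1048576']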

    FWIW this can be set via a launch template for the batch compute environment/ECS cluster, so it's not a necessity and also is a bit ugly for a decorator which is why I ask :sweat_smile:. As an example of what this looks like in a launch template:

    Content-Type: multipart/mixed; boundary="==BOUNDARY=="
    MIME-Version: 1.0
    Content-Type: text/cloud-boothook; charset="us-ascii"
    cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --default-ulimit nofile=65535:1048576"' >> /etc/sysconfig/docker


    4 replies
    Ayotomiwa Salau
    Hello guys, can I get links to resources on working with model & data versioning in Metaflow? Can't seem to find it in the documentation.
    2 replies
    Corrie Bartelheimer
    Hey folks, a question regarding retries. We frequently see some jobs fail first and then succeed on a second try, so we definitely need the retry functionality. However, we are a bit wary that this also means any code error in this step would lead to retries. Since the step runs in parallel, this could mean a few thousand jobs retried, which would incur unnecessary costs. What would be the best way to handle such a situation?
    2 replies
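    One generic pattern that addresses this concern (plain Python, not a Metaflow feature): retry only exception types known to be transient, and fail fast on everything else, so a genuine code bug doesn't multiply across thousands of parallel tasks.

    def run_with_retries(fn, retries=3, transient=(ConnectionError, TimeoutError)):
        """Retry fn only on transient errors; re-raise anything else immediately."""
        for attempt in range(retries + 1):
            try:
                return fn()
            except transient:
                if attempt == retries:
                    raise

    attempts = []

    def flaky():
        # Fails twice with a transient error, then succeeds.
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("transient blip")
        return "ok"

    print(run_with_retries(flaky))  # ok, after two transient failures

    A code error such as a TypeError would escape on the first attempt instead of being retried.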
    Matt McClean
    Hi all. I created an issue around supporting AWS Batch multi-node here: Netflix/metaflow#444 . Would be useful for anyone wanting to deploy distributed training jobs on more than one instance
    2 replies

    I'm having issues with metaflow not finding previous runs. I am trying to do this via a jenkins pipeline. I have a training flow that I'm trying to reference in my inference flow. I am reading in the same config file. The strange part is that the same flows/runs are available before I read in the config file as are available after I've read in the config file. So before I read in the config file I get the following (setting namespace to none to check all available flows)

    get_namespace(): None
    list(Metaflow()): [Flow('training_flow_1')]
    metaflow_config.DATASTORE_SYSROOT_S3: None
    get_metadata(): local@/home/ec2-user/workspace/models-training_staging

    After I read in the config file

    get_namespace(): None
    list(Metaflow()): [Flow('training_flow_1')]
    metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-j8dasuvadiq/metaflow
    get_metadata(): local@/home/ec2-user/workspace/models-training_staging

    So the metadata is still showing local which I think may be related to the issue, but the DATASTORE_SYSROOT_S3 is updated after the config is read in so it definitely is reading in the file. But trying to find something run in a production namespace (i.e. that I ran via stepfunctions) returns an empty list.

    When I try to run my inference flow I get the following (again after reading in the config and setting namespace to none):

    get_namespace(): None
    metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-v49t2hau629c/metaflow
    get_metadata() local@/home/ec2-user/workspace/models-inference_staging
    list(Metaflow()) []

    So it seems the issue here is that even though the config file is read in and DATASTORE_SYSROOT_S3 is set correctly, the metadata points to a local folder, which is different between training and inference. So they are isolated. I tried setting the metadata manually by using the ServiceUrl from my cloudformation stack:


    but I get the error

    Metaflow service [https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api] unreachable.

    Any idea what's going on here? Again, from the metadata and the fact that the same flows are listed before and after I read in the config file it seems like it is somehow ignoring the config settings when reading/writing flows, so I am unable to find my training run when I'm running my inference flow. Thanks

    10 replies
    Snehal Shirgure
    Hello folks, has anyone looked into inspecting data and metrics from runs using a visualization tool such as ipywidgets in jupyter notebooks? Any suggestions/ideas on the same lines are welcome :)
    3 replies
    Malay Shah
    Hey guys, I am looking into the implementation of Metaflow and how it interacts with Batch and other AWS services. I wanted to look at the code for that but could not find it. Can anyone point me to the script or class that handles all the interaction with AWS? Thank you very much.
    2 replies
    Are there any plans to allow batch steps to reuse the same container? At the moment we see a massive slowdown when moving from local runs to batch runs, because each step suddenly requires the whole Batch scheduling, download of containers, etc. I am keen to keep the flexibility of demanding extra resources for a particular step, but oftentimes consecutive steps don't need additional resources. [Obviously one can collapse these steps together, but then one loses the whole retry functionality.]
    19 replies
    Ayotomiwa Salau
    Hello, I was trying out metaflow in a notebook, I got this error "Flow('PlayListFlow') does not exist". I can't find a way to instantiate/create a flow in a notebook.
    2 replies
    Kyle Smith
    Hello, I'm doing some initial work for a manual deployment. It's important that we only use a private subnet. Is this feasible? Why does the default installation include a public subnet?
    3 replies
    Elham Zamansani
    Hey guys, I have a problem importing the environment decorator; I guess there is a bug there. When I import it as follows: from metaflow import FlowSpec, step, environment, it gives an error that environment is not callable, which makes sense, because when imported like this, Metaflow wants to read from the environment.py script. I did a small test: if I change the name of environment on line 27 of environment_decorator.py to anything else and then import that, it works. Could you please check it, or correct me if I am missing something regarding the import?
    9 replies
    Analysing job run times. Hi, we would be interested in monitoring AWS Batch run times, ideally within CloudWatch. https://docs.aws.amazon.com/batch/latest/userguide/batch_cwet.html provides a very useful stream of information. Metaflow provides e.g. run_id (https://github.com/Netflix/metaflow/blob/04881c58c22e4e7e66a4faa7f676fcfca454c027/metaflow/plugins/aws/batch/batch.py#L127), which appears in this stream. My question is how we can cross-reference this to e.g. run parameters, so that we can get aggregate statistics for a given parameter.
    10 replies
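    Since the Batch event stream carries the run_id, one approach is a client-side join keyed on run id, then aggregating over the parameter of interest. A minimal sketch (the field names and values here are illustrative assumptions, not real event data):

    # Join CloudWatch/Batch job events to per-run parameters by run id,
    # then compute average duration per parameter value.
    events = [
        {"run_id": "81", "duration_s": 120},
        {"run_id": "82", "duration_s": 95},
        {"run_id": "83", "duration_s": 130},
    ]
    run_params = {"81": {"alpha": 0.1}, "82": {"alpha": 0.5}, "83": {"alpha": 0.1}}

    by_alpha = {}
    for e in events:
        alpha = run_params[e["run_id"]]["alpha"]
        by_alpha.setdefault(alpha, []).append(e["duration_s"])

    avg = {k: sum(v) / len(v) for k, v in by_alpha.items()}
    print(avg)  # {0.1: 125.0, 0.5: 95.0}

    The run_params side could be filled from the Metaflow Client API by reading each run's parameter artifacts.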
    Robert Sandmann
    Dear metaflow team, first of all I want to thank you for this great piece of technology! It's quite amazing how insanely easy you make it for our data scientists to define their workflows, especially when you look at the internals on how you realize that, great work!
    A use case that we currently try to implement is using Metaflow to set up a workflow for federated learning. That means we do the client training in a foreach for every client, which works great. The problem now is that we need to repeat these federated training rounds, which introduces cycles into the DAG.
    Our current approach is to monkeypatch metaflow internals to allow cycles in the DAG and dynamically add new steps for every round using a custom FlowDecorator.
    This approach seems rather hacky (and is not yet quite working).
    Branch specific concurrency (Netflix/metaflow#172) or graph composition (https://github.com/Netflix/metaflow/issues/144) might make our lives easier.
    But I was wondering if you had any ideas on how to make this possible in the current state of metaflow. I'm grateful for any hints!
    2 replies
    Patrick John Chia

    Hello! @christineyu-coveo and I have been using Metaflow recently and really enjoy it. We also face another issue related to using @batch and @environment.

    Consider the following two steps, where only step_B ships var_2 via @environment:

    @batch
    @environment(vars={"var_1": os.environ["var_1"]})
    @step
    def step_A(self):
        ...

    @batch
    @environment(vars={"var_2": os.environ["var_2"]})
    @step
    def step_B(self):
        ...

    Metaflow initializes decorators for all steps before running any step. For @environment this includes running step_init, where it updates the environment variables
    based on the vars passed in the decorator. Following the above flow, when we are running step_A, the environment decorator for step_B will also be initialized, and an exception will occur because var_2 is None in the Batch environment for step_A, since it was not included in the @environment decorator for step_A. Our current fix involves disabling step_init entirely for @environment. While this works for our use case (i.e. >1 @batch steps, with use of @environment in either or both @batch steps), I suspect this might disable some of the other use cases of @environment. Do you have any alternate solutions to this problem? Perhaps the batch decorator could be modified to also allow inclusion of environment variables that we want to ship with the job.

    7 replies

    Metaflow could not install or find CUDA in the GPU environment and pytorch could not use the GPU at all. The issue was marked as resolved in Netflix/metaflow#250 but I could not replicate the fix.

    sample code test_gpu.py I used

    from metaflow import FlowSpec, step, batch, IncludeFile, Parameter, conda, conda_base

    class TestGPUFlow(FlowSpec):
        @batch(cpu=2, gpu=1, memory=2400)
        @conda(libraries={'pytorch': '1.5.1', 'cudatoolkit': '10.1.243'})
        @step
        def start(self):
            import os
            import sys
            import torch
            from subprocess import call
            print(os.popen("nvcc --version").read())
            print('__Python VERSION:', sys.version)
            print('__pyTorch VERSION:', torch.__version__)
            print('__CUDA VERSION:', torch.version.cuda)
            print('__CUDNN VERSION:', torch.backends.cudnn.version())
            print('__Number CUDA Devices:', torch.cuda.device_count())
            call(["nvidia-smi"])  # --format=csv query arguments truncated in the original
            print('Active CUDA Device: GPU', torch.cuda.current_device())
            print('Available devices ', torch.cuda.device_count())
            print('Current cuda device ', torch.cuda.current_device())
            print(f"GPU count: {torch.cuda.device_count()}")
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        TestGPUFlow()

    cmd line I used

    USERNAME=your_name CONDA_CHANNELS=default,conda-forge,pytorch METAFLOW_PROFILE=your_profile AWS_PROFILE=your_profile python test_gpu.py --datastore=s3 --environment=conda run --with batch:image=your_base_image_with_cuda_support

    metaflow output

    2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: N/A      |
    2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |-------------------------------+----------------------+----------------------+
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |                               |                      |               MIG M. |
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |===============================+======================+======================|
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | N/A   43C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |
    2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |                               |                      |                  N/A |
    2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-------------------------------+----------------------+----------------------+
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38]
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-----------------------------------------------------------------------------+
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Processes:                                                                  |
    2021-03-10 18:38:13.788 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |  GPU   GI   CI        PID   Type   Process name                  GPU M

    Any idea what's wrong?

    19 replies
    Jacopo Tagliabue
    Hi MF community, small request for feedback! We just posted a brief article with code on re-imagining model cards in a DAG-first world. Looking for honest feedback: if you like "DAG cards", we may invest some time in building a configurable package and release it (ping me anytime).
    4 replies
    Ayotomiwa Salau
    Hello guys,
    Battling with this connection error. I restarted ports 443 and 8080, to no avail.
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='42xg9kw0rk.execute-api.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: /api/flows/MovieStatsFlow (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fec6ace2358>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    4 replies
    Anirudh Kaushik
    Metaflow seems to ignore @retry decorators on a @catch'd step if it's followed by a @catch'd step. If I've got start -> @retry @catch A -> @catch B -> end, and step A raises an exception, A won't retry at all. The flow goes to step B. Is this normal?
    5 replies
    Vishal Siramshetty

    Hi all,

    I'm having an issue with IncludeFile. When I'm trying to pass my training data file as an input, it throws an error:

    AttributeError: 'bytes' object has no attribute 'path'

    I am trying to read the file in one of the steps using Pandas. I'd really appreciate any suggestions to deal with this issue.

    Thank you,

    8 replies
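    For what it's worth, IncludeFile hands the step the file contents rather than a filesystem path, which would explain the missing .path attribute; wrapping the bytes in an in-memory buffer is the usual workaround. A sketch with the stdlib csv module (with pandas you would pass the same kind of buffer to pd.read_csv):

    import csv
    import io

    # Stand-in for what an IncludeFile artifact contains: raw bytes, not a path.
    data = b"feature,label\n0.1,0\n0.9,1\n"

    # Wrap the bytes in a buffer so file-reading APIs can consume them.
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    print(rows)  # [['feature', 'label'], ['0.1', '0'], ['0.9', '1']]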
    Richard Puckett
    Is there a best-practice way to deploy Metaflow into an environment that has no inbound access from the Internet? Everything would be run from within the VPC. Thanks!
    3 replies
    Ryan Chui

    In the step function role in the metaflow cloudformation template:

            - PolicyName: AllowCloudwatch
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Sid: CloudwatchLogDelivery
                    Effect: Allow
                    Action:
                      - "logs:CreateLogDelivery"
                      - "logs:GetLogDelivery"
                      - "logs:UpdateLogDelivery"
                      - "logs:DeleteLogDelivery"
                      - "logs:ListLogDeliveries"
                      - "logs:PutResourcePolicy"
                      - "logs:DescribeResourcePolicies"
                      - "logs:DescribeLogGroups"
                    Resource: '*'

    What is the action logs:PutResourcePolicy used for?

    8 replies
    Corrie Bartelheimer

    Hey, I want to dynamically change the required resources for a step and found this example Netflix/metaflow#431 for a workaround:

    @resources(cpu=8, memory=os.environ['MEMORY'])

    and then starting the flow with MEMORY=16000 python myflow.py run. This works fine locally but fails when running with batch. Am I missing something?
    Or is there any other way to change the resources using parameters or similar without creating different sized steps?

    9 replies
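    One plausible explanation (an assumption, not verified here): decorator arguments are evaluated when the flow file is imported, and on Batch the file is imported again inside the container, where MEMORY is typically unset. A stand-alone illustration of decoration-time capture, using a hypothetical stand-in for @resources:

    import os

    def resources(**kwargs):            # stand-in for Metaflow's @resources
        def wrap(fn):
            fn.resources = kwargs
            return fn
        return wrap

    os.environ["MEMORY"] = "16000"      # set on the launching machine

    @resources(cpu=8, memory=os.environ["MEMORY"])
    def train():
        pass

    del os.environ["MEMORY"]            # simulate the Batch container environment
    print(train.resources["memory"])    # '16000': captured at decoration time;
                                        # re-importing without MEMORY set would raise KeyError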
    Greg Hilston

    Hey @savingoyal and other Metaflow developers, some colleagues and I are getting to the point where we'll have a PR ready for the metaflow-tools repo. This PR will add a deployable Terraform stack.

    We've read through the CONTRIBUTING.md file and found this older issue that documents asking for a Terraform stack:


    Our goal is to have this PR submitted by the end of this week and just wanted to start the dialogue with you guys. Super excited to see what happens :)

    7 replies

    Hi Guys,

    Could you provide some sort of diagram of the AWS resources required to run Metaflow in the cloud? The CloudFormation template is not much help; the YAML file is huge.

    4 replies
    Antoine Tremblay
    Hi, I just realized that Metaflow doesn't show output that comes from the standard logging module... calls to logging.info("something") are not printed out. Is there a way to make those print?
    2 replies
    David Patschke
    Is there a way to launch a Flow run via command-line with --namespace that sets the run to the global namespace?
    I tried the suggestion in the CLI help recommending the empty string (--namespace=) but when I run get_namespace() within the Flow (or current.namespace), I'm still getting the user namespace. I also tried setting it to --namespace=None but that uses the string 'None' vs. NoneType. As per the Metaflow docs, I'm hesitant to hardcode namespace(None) into my code as a workaround.
    5 replies