    seanv507
    @seanv507
    Hi, is there any way to specify memory dynamically for batch jobs? E.g. if the size of the data is M in step X, allocate 5M of memory in step X+1?
    4 replies
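
    For reference, memory is normally pinned statically, per step, with @resources; whether it can be derived from an upstream artifact at run time is exactly the open question here. A minimal sketch of the static form (the numbers are illustrative):

    from metaflow import FlowSpec, resources, step

    class SizedFlow(FlowSpec):

        @step
        def start(self):
            self.data = list(range(1_000_000))  # placeholder workload
            self.next(self.transform)

        # memory is declared up front, not computed from len(self.data)
        @resources(memory=16000, cpu=2)
        @step
        def transform(self):
            self.total = sum(self.data)
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        SizedFlow()
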
    Ahmad Houri
    @ahmad_hori_twitter
    Hi, is there a way to define the Step Functions name on AWS to be different from the flow name when creating it?
    I want to do this because I am thinking of creating 2 different step functions from the same flow, MY_FLOW_STG and MY_FLOW_PRD, and then updating these step functions through a pipeline when a user pushes to a specific branch.
    5 replies
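
    One possibility worth sketching, assuming a Metaflow version that ships the @project decorator (project and branch names below are illustrative): deployments created from different branches get distinct state machine names without renaming the flow class itself.

    from metaflow import FlowSpec, project, step

    @project(name="my_flow")
    class MyFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        MyFlow()

    Under that assumption, python flow.py --branch stg step-functions create and python flow.py --branch prd step-functions create (or --production) would produce separately named state machines that a CI pipeline could update per branch.
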
    jrs2
    @jrs2
    Is there a way to specify a Docker image for a flow when running locally? I can see how to do it for Batch and have used that, but only see @conda for local dependency support.
    1 reply
    Mehmet Catalbas
    @baratrion
    What is the best way to profile memory usage step by step within a flow (better yet, line by line within a step)? The memory_profiler @profile decorator does not work well with foreach steps, so is the best approach to copy steps out into isolated functions in separate scripts and run memory profiling on them?
    12 replies
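
    A lighter-weight alternative worth sketching: log peak RSS at the end of each foreach task with the standard-library resource module and compare across tasks (on Linux, ru_maxrss is reported in kilobytes).

    import resource

    from metaflow import FlowSpec, step

    class ProfiledFlow(FlowSpec):

        @step
        def start(self):
            self.items = ["a", "b", "c"]
            self.next(self.work, foreach="items")

        @step
        def work(self):
            blob = [0] * 10_000_000  # placeholder allocation
            # peak resident set size of this task's process so far (kB on Linux)
            peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f"[{self.input}] peak RSS: {peak_kb / 1024:.1f} MB")
            self.next(self.join)

        @step
        def join(self, inputs):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        ProfiledFlow()
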
    seanv507
    @seanv507
    @baratrion you might want to look at https://pythonspeed.com/products/filmemoryprofiler/, the author @itamarst has reached out here ... it's focusing on peak memory
    14 replies
    Ahmad Houri
    @ahmad_hori_twitter
    Hi, I have a question regarding the performance of running batch jobs on AWS. I ran a simple flow (HelloWorld) with 3 steps; execution on my machine took around 23 seconds, while on AWS (running the same flow --with batch) it takes around 8 minutes. Most of the time is spent bootstrapping the conda environment for each step before running it.
    Is there something I missed here, or is there any caching technique I should use to improve the flow's performance on AWS?
    6 replies
    Greg Hilston
    @GregHilston

    I was in our AWS Batch console and I noticed two jobs that seemed to be stuck in RUNNING. The individual who kicked off those jobs says all his terminal sessions have ended; he even went as far as restarting his PC and severing his internet connection.

    I figure this is more of an AWS situation I'm debugging, but has anyone witnessed flows being stuck in RUNNING?

    I know the jobs will die when the timeout is reached; I just want to understand what may have caused this.

    4 replies
    joe153
    @joe153
    Hi, I am trying to include a number of json/sql files in the conda package. I have a MANIFEST.in file specifying the files, and setup.py has include_package_data=True, but I am not seeing them. What am I missing? How do I include them in the conda package?
    6 replies
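
    If the package is ultimately built from this setup.py, a minimal sketch of the setuptools side (name and paths are illustrative): include_package_data=True only honours MANIFEST.in entries for files that live inside a package directory, so listing them in package_data as well is the usual belt-and-braces fix when the data files do not show up.

    from setuptools import find_packages, setup

    setup(
        name="myproject",            # illustrative
        version="0.1.0",
        packages=find_packages(),
        include_package_data=True,   # picks up MANIFEST.in entries inside packages
        package_data={
            # explicit per-package globs as a fallback
            "myproject": ["queries/*.sql", "config/*.json"],
        },
    )
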
    Nimar Arora
    @nimar
    Hi, I am just trying to follow along with the tutorial, and in 08-autopilot my AWS Batch job fails with "ModuleNotFoundError: No module named 'pandas'" in stats.py, line 41. Looking at the code, I'm not sure how this tutorial is expected to work, since there is no @conda decorator to install the pandas library in 02-statistics/stats.py.
    3 replies
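
    For reference, the decorator the failing step would need looks roughly like this (class name and version pins are illustrative); the flow is then run with --environment=conda so the library gets packaged for the Batch job.

    from metaflow import FlowSpec, conda, conda_base, step

    @conda_base(python="3.7")
    class MovieStatsFlow(FlowSpec):

        @conda(libraries={"pandas": "0.24.2"})
        @step
        def start(self):
            # imported inside the step so the conda-built environment provides it
            import pandas as pd
            print(pd.__version__)
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        MovieStatsFlow()
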
    Nimar Arora
    @nimar

    Tutorial 4 seems to be failing while attempting to create a conda environment. The funny thing is that if I run that command directly, it succeeds. Not sure how to get the conda errors:

    python 04-playlist-plus/playlist.py --environment=conda run
    Metaflow 2.2.6 executing PlayListFlow for user:...
    Validating your flow...
        The graph looks good!
    Running pylint...
        Pylint is happy!
    Bootstrapping conda environment...(this could take a few minutes)
        Conda ran into an error while setting up environment.:
        Step: start, Error: command '['/opt/miniconda/condabin/conda', 'create', '--yes', '--no-default-packages', '--name', 'metaflow_PlayListFlow_linux-64_c08336d0946efed6e92f165475dfc0d181f64361', '--quiet', b'python==3.8.5', b'click==7.1.2', b'requests==2.24.0', b'boto3==1.17.0', b'coverage==5.4', b'pandas==0.24.2']' returned error (-9): b''

    Note that the following command succeeds:

    /opt/miniconda/condabin/conda create --yes --no-default-packages --name metaflow_PlayListFlow_linux-64_c08336d0946efed6e92f165475dfc0d181f64361 --quiet python==3.8.5 click==7.1.2 requests==2.24.0 boto3==1.17.0 coverage==5.4 pandas==0.24.2

    Note that I had to make a few minor changes to the demo to refer to Python 3.8.5 and to add dependencies on more recent versions of boto3 and coverage than what metaflow was requesting; otherwise the generated conda create command would fail even on the command line.

    1 reply
    joe153
    @joe153
    Hi, I am having a problem using Metaflow 2.2.6 and Fargate as the compute environment when foreach is used. What works fine with EC2 doesn't work with Fargate. Here is the error message: ... File "/metaflow/metaflow/plugins/aws/step_functions/step_functions_decorator.py", line 54, in task_finished self._save_foreach_cardinality(os.environ['AWS_BATCH_JOB_ID'], ... requests.exceptions.ConnectionError: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/placement/availability-zone/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7c0f12e3d0>: Failed to establish a new connection: [Errno 22] Invalid argument'))
    7 replies
    Ritesh Agrawal
    @ragrawal
    hi, I am trying to leverage metaflow to train and deploy models on SageMaker. I am able to train the model but not able to find relevant documentation on how to deploy models. Ideally I would like to create a Docker container with the proper environment and all the supporting files, and then deploy the container either on SageMaker or on our Kubernetes cluster. What I am missing is: once the pipeline has executed successfully, how can I get the environment, supporting files, and model files?
    13 replies
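
    On the "get the trained model back out" part, a minimal sketch with the Metaflow client API, assuming the training step saved the fitted model as an artifact (the flow name and the model attribute are hypothetical):

    import pickle

    from metaflow import Flow

    # latest successful run of the (hypothetical) training flow
    run = Flow("TrainFlow").latest_successful_run
    model = run.data.model  # any artifact stored with self.model = ... in the flow

    # the object can then be serialized and baked into a serving image,
    # e.g. a Dockerfile that copies model.pkl next to the inference code
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
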
    Anirudh Kaushik
    @anirudh-k
    Hi! What's the best way to handle a potentially empty list for a foreach step?
    7 replies
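
    One workaround pattern, sketched under the assumption that an empty foreach list is what breaks: substitute a single placeholder element so the fan-out always has work, and make the mapped step a no-op for it (do_work is a hypothetical helper).

    from metaflow import FlowSpec, step

    class MaybeEmptyFlow(FlowSpec):

        @step
        def start(self):
            self.items = []                        # possibly empty in real use
            self.fanout = self.items or [None]     # placeholder keeps foreach non-empty
            self.next(self.process, foreach="fanout")

        @step
        def process(self):
            # no-op for the placeholder element; do_work is a hypothetical helper
            self.result = None if self.input is None else do_work(self.input)
            self.next(self.join)

        @step
        def join(self, inputs):
            self.results = [i.result for i in inputs if i.result is not None]
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        MaybeEmptyFlow()
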
    Ritesh Agrawal
    @ragrawal
    Where should I put import statements? Assuming I have a train step that requires sklearn and I am using @conda to install the package, should I put the import sklearn statement inside the step, or can it be outside the class definition?
    6 replies
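
    For what it's worth, the pattern the docs lean towards is importing inside the step body, since the environment built by @conda only exists when that step runs; a sketch (library version illustrative):

    from metaflow import FlowSpec, conda, step

    class SklearnFlow(FlowSpec):

        @conda(libraries={"scikit-learn": "0.24.1"})
        @step
        def start(self):
            # imported inside the step: sklearn is only guaranteed to exist
            # in the environment created for this step
            from sklearn.linear_model import LogisticRegression
            self.model = LogisticRegression()
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        SklearnFlow()
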
    Ritesh Agrawal
    @ragrawal
    I am getting access denied to the following S3 folder: "s3://.../metaflow/conda" as it doesn't exist. Is there anything I need to do in order to create this S3 key?
    4 replies
    Ritesh Agrawal
    @ragrawal
    Why does the code have all the metaflow examples in it?
    4 replies
    Kyle Smith
    @smith-kyle
    If a step both creates a bunch of tasks with a foreach and branches to another step, will all the tasks created by this step execute in parallel?
    9 replies
    dewiris
    @kavyashankar93

    Hi, I am working on getting Metaflow artifacts from S3. The code is deployed on AWS Lambda, and I set the environment variable "METAFLOW_DATASTORE_SYSROOT_S3" to the S3 location. Our use case requires us to change the datastore environment variable in every iteration so that different flows' and runs' artifacts can be accessed, as follows:

    def _queryMetaflow(self, appName, starflowResp):
        metaflow_run_id = starflowResp["details"]["frdm"]["metaflowRunNumber"]
        metaflow_name = starflowResp["details"]["frdm"]["metaflowId"]

        os.environ['METAFLOW_DATASTORE_SYSROOT_S3'] = "{}/artifacts/{}/higher".format(getMetadataLocation(), appName)

        from metaflow import Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow

        metadata1 = metadata(getMetadataURL())
        namespace(None)
        mf = Metaflow()

        # call metaflow, get results and send success or error
        try:
            metaflowResp = Run(metaflow_name + '/' + metaflow_run_id).data
            print(metaflowResp)
            del Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
            return metaflowResp
        except Exception as e:
            print("Exception occurred in query metaflow: {}".format(e))
            raise CapAppFailure("Exception occurred in metaflow response; S3 datastore operation _get_s3_object likely failed")

    When this method is called, it doesn't fail in the first iteration but fails in the second. I inspected the environment variable and the location is correct in every iteration, but this error is encountered in the second iteration:
    S3 datastore operation _get_s3_object failed (An error occurred (404) when calling the HeadObject operation: Not Found). Retrying 7 more times..

    I am unable to fix this issue. Can you please help?

    4 replies
    Kyle Smith
    @smith-kyle

    Hello Netflix employees, can someone please share something about Metaflow's adoption at Netflix? In late 2018 it was used in 134 projects; how has it grown since then? What percentage of Netflix data scientists use Metaflow?

    We're considering Metaflow at my organization, so I'd just like to get a sense of the adoption rate we can hope for at my employer.

    7 replies
    Matt McClean
    @mattmcclean
    Hi there. I'm new to Metaflow and trying to run the tutorial Episode 8: Autopilot, but I get the following error in the AWS Batch job when the step function is triggered: ModuleNotFoundError: No module named 'pandas'. I tried running the commands python 02-statistics/stats.py --environment=conda step-functions create --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}" as well as python 02-statistics/stats.py step-functions create --max-workers 4, and both give the same error message.
    However, if I run the command python 02-statistics/stats.py --environment conda run --with batch --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}" it works fine.
    3 replies
    Matt McClean
    @mattmcclean
    How can I switch my local machine to run Metaflow on AWS? I have already run the CloudFormation template to set up the stack and can run metaflow commands from the SageMaker notebook instance. However, when I run metaflow configure aws --profile dev on my local machine and then metaflow configure show, it still says Configuration is set to run locally.
    4 replies
    Kelly Davis
    @kldavis4

    I am attempting to run step-functions create and getting the following error: AWS Step Functions error: ClientError("An error occurred (AccessDeniedException) when calling the CreateStateMachine operation: 'arn:aws:iam::REDACTED:role/metaflow-step_functions_role' is not authorized to create managed-rule.")

    I am specifying METAFLOW_SFN_IAM_ROLE=arn:aws:iam::REDACTED:role/metaflow-step_functions_role in my metaflow config.

    The role is being created via terraform, but is based on https://github.com/Netflix/metaflow-tools/blob/master/aws/cloudformation/metaflow-cfn-template.yml#L839. That role does not have a grant for states:CreateStateMachine but even if I add that, I still get the same error.

    Any tips for troubleshooting this?

    2 replies
    Corrie Bartelheimer
    @corriebar
    Hey,
    I created a step function flow using python flow.py --with retry step-functions create --max-workers 1000, but when triggering the flow it only runs a maximum of 40 tasks in parallel. When running the flow on Batch without Step Functions, it worked fine. Any ideas what could be the reason for this throttling?
    5 replies
    Taleb Zeghmi
    @talebzeghmi
    Has anybody thought about how the Metaflow datastore interacts with CCPA data compliance? For example, the ability to remove customer data at the customer's behest, unless the data expires or no longer exists after 28 days?
    13 replies
    mkjacks5
    @mkjacks5
    What is the correct way to set a specific namespace before doing a local run? We have several people running metaflow locally on SageMaker instances, which defaults to the username 'ec2-user'. Using namespace('user:[correct username]') does not change the namespace used for the actual local run; it seems to just affect the namespace used for inspecting results. Thanks
    2 replies
    Ahmad Houri
    @ahmad_hori_twitter
    ur env there then batch just downloads & run the ima
    behjatHume
    @behjatHume
    Hey all! I was recently introduced to Metaflow and I have a few questions, if anyone can help me through them: does Metaflow provide data labelling? An explainability feature? Team collaboration? And is it open source?
    3 replies
    derek
    @yukiegosapporo_twitter

    Hey,

    I am passing my image to python yo.py step-functions create --with batch:image=bla. Are there any ways to pass runtime variables to that image? thanks in advance!

    6 replies
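
    One hedged option, assuming what's needed is environment variables inside the running Batch container rather than Docker build arguments: the @environment decorator ships values from the launching machine into the job (the variable name and image below are illustrative).

    import os

    from metaflow import FlowSpec, batch, environment, step

    class YoFlow(FlowSpec):

        @environment(vars={"MY_SETTING": os.getenv("MY_SETTING", "default")})
        @batch(image="bla")  # same image as passed on the command line
        @step
        def start(self):
            print("MY_SETTING inside the container:", os.environ.get("MY_SETTING"))
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        YoFlow()

    Note that, as far as I can tell, the values are captured when the flow is deployed or launched, so for a step-functions deployment they are effectively fixed at create/trigger time rather than per scheduled execution.
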
    Greg Hilston
    @GregHilston

    I'm experiencing some problems when trying to install pytorch with CUDA enabled.

    I'm running my flow on AWS Batch, powered by a p3.2xlarge machine and using the image

    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04

    to get the NVIDIA driver installed.

    The relevant flow code looks like:

    @conda_base(python="3.8")
    class FooFlow(FlowSpec):
        ...
        @batch(image=URL ABOVE)
        # this line below is of most interest
        @conda(libraries={"pytorch": "1.6.0", "cudatoolkit": "11.0.221"})
        @resources(memory=4*1024, cpu=2, gpu=1)
        @step
        def test_gpu(self):
            import os

            print(os.popen("nvidia-smi").read())
            print(os.popen("nvcc --version").read())

            import torch

    I'm not convinced this is precisely a Metaflow issue, but the common solutions one finds when Googling involve installing PyTorch using the conda CLI, which the @conda decorator obviously abstracts away from us.

    I've been running many flows with different versions of pytorch and cudatoolkit, and keep getting:

    Torch not compiled with CUDA enabled

    I'm familiar with the Github Issue: Netflix/metaflow#250

    Any advice at all?

    19 replies
    Taleb Zeghmi
    @talebzeghmi

    We’re working on creating a @notify() decorator that could send a notification upon success or failure, per Flow or per Step. It could send email or slack messages.

    It would be up to the scheduler (local, AWS Step Functions, KFP) to honor the @notify decorator.

    @notify(email_address="oncall@foo.com", on="failure")
    @notify(email_address="ai@foo.com", on="success")
    class MyFlow(Flow):
        @notify(slack_channel="#foo", on="success")
        @step
        def my_step(self):

    To implement this I’d like to introduce a new Metaflow concept, a @finally step.

    class MyFlow(Flow):
        @finally
        def finally_step(self, status):
            status  # we need a way to message Success or Failure
    7 replies
    Hamid
    @Hamid75224834_twitter
    Hi, I have recently been working with Metaflow, and am not able to access the previous flows of other members of my group using namespaces. Just want to make sure I am not missing anything regarding namespaces; any help is appreciated. Thanks
    1 reply
    derek
    @yukiegosapporo_twitter

    Hi!

    How can I pass @Parameters different from the defaults to step-functions create?
    I know step-functions trigger can take any @Parameters in a pipeline Python file, but that is only valid for that run.
    What I want to do is pass @Parameters to the cron schedule in AWS EventBridge dynamically.

    9 replies
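
    For reference, Parameters are declared on the flow and their defaults are what step-functions create bakes into the state machine; step-functions trigger can override them per execution, but a cron schedule will see the defaults. A sketch of the declaration side (names are illustrative):

    from metaflow import FlowSpec, Parameter, step

    class ScheduledFlow(FlowSpec):

        # the default is what a scheduled (EventBridge cron) execution will use
        alpha = Parameter("alpha", default=0.01, help="learning rate")

        @step
        def start(self):
            print("alpha =", self.alpha)
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        ScheduledFlow()
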
    grizzledmysticism
    @grizzledmysticism
    Just wanted to say - fantastic work on this. Can't wait for the addition of some of the new features, particularly the graph composition and inclusion of external modules (symlink).
    15 replies
    Ayotomiwa Salau
    @AyonzOnTop
    Hello guys, I am pretty new to the Metaflow community. How do I start contributing?
    8 replies
    Daniel Perez
    @sandman21dan

    Hey guys, been using metaflow for a bit over a year now, and I've recently started to integrate our deployment with AWS Batch for the scale-out pattern. I'm now able to execute flows with some steps that run in Batch; however, I don't see the ECS cluster ever scaling back down.

    To elaborate, my compute environment has the following settings: min vcpus = 0, desired vcpus = 0, max vcpus = 32

    When I run a flow, a job definition gets added to the job queue, an instance gets started in the cluster, and the task runs and finishes fine, but the job definition stays "Active" and the instance seems to stay up indefinitely inside the cluster until I go and manually deregister the job definition.

    Is this the way it's designed? or am I missing something in the way I configured my Compute environment?

    Is metaflow supposed to update the job definition after a flow finishes?

    5 replies
    russellbrooks
    @russellbrooks

    hey guys, would anyone find it useful to expose the batch param for ulimits? It's a list of dicts that maps to the docker --ulimit option of docker run. In particular, I've noticed that the ECS/Batch default ulimit for the number of open files per container is 1024/4096. With this option, it could potentially be increased up to the daemon limit using:

    ulimits=[{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}]

    https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html
    https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container---ulimit

    2 replies
    russellbrooks
    @russellbrooks

    FWIW this can be set via a launch template for the Batch compute environment/ECS cluster, so it's not a necessity, and it's also a bit ugly for a decorator, which is why I ask :sweat_smile:. As an example of what this looks like in a launch template:

    Content-Type: multipart/mixed; boundary="==BOUNDARY=="
    MIME-Version: 1.0
    
    --==BOUNDARY==
    Content-Type: text/cloud-boothook; charset="us-ascii"
    #cloud-boothook
    #!/bin/bash
    cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --default-ulimit nofile=65535:1048576"' >> /etc/sysconfig/docker

    https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html

    4 replies
    Ayotomiwa Salau
    @AyonzOnTop
    Hello guys, can I get links to resources on working with model & data versioning in Metaflow? Can't seem to find it in the documentation.
    2 replies
    Corrie Bartelheimer
    @corriebar
    Hey folks, a question regarding retries. We frequently see some jobs fail first and then succeed on a second try, so we definitely need the retry functionality. However, we are a bit wary that this also means any code error in this step would lead to retries. Since the step runs in parallel, this could mean a few thousand jobs retried, which would incur unnecessary costs. What would be the best way to handle such a situation?
    2 replies
    Matt McClean
    @mattmcclean
    Hi all. I created an issue around supporting AWS Batch multi-node here: Netflix/metaflow#444 . Would be useful for anyone wanting to deploy distributed training jobs on more than one instance
    2 replies
    mkjacks5
    @mkjacks5

    I'm having issues with metaflow not finding previous runs. I am trying to do this via a Jenkins pipeline. I have a training flow that I'm trying to reference in my inference flow, and I am reading in the same config file. The strange part is that the same flows/runs are available before I read in the config file as after I've read it in. So before I read in the config file I get the following (setting the namespace to None to check all available flows):

    get_namespace(): None
    list(Metaflow()): [Flow('training_flow_1')]
    metaflow_config.DATASTORE_SYSROOT_S3: None
    get_metadata(): local@/home/ec2-user/workspace/models-training_staging

    After I read in the config file

    get_namespace(): None
    list(Metaflow()): [Flow('training_flow_1')]
    metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-j8dasuvadiq/metaflow
    get_metadata(): local@/home/ec2-user/workspace/models-training_staging

    So the metadata is still showing local, which I think may be related to the issue, but the DATASTORE_SYSROOT_S3 is updated after the config is read in, so it definitely is reading the file. But trying to find something run in a production namespace (i.e. that I ran via Step Functions) returns an empty list.

    When I try to run my inference flow I get the following (again after reading in the config and setting namespace to none):

    get_namespace(): None
    metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-v49t2hau629c/metaflow
    get_metadata() local@/home/ec2-user/workspace/models-inference_staging
    list(Metaflow()) []

    So it seems the issue here is that even though the config file is read in and DATASTORE_SYSROOT_S3 is set correctly, the metadata points to a local folder, which is different between training and inference. So they are isolated. I tried setting the metadata manually by using the ServiceUrl from my cloudformation stack:

    metadata('https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api')

    but I get the error

    Metaflow service [https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api] unreachable.

    Any idea what's going on here? Again, from the metadata and the fact that the same flows are listed before and after I read in the config file, it seems like it is somehow ignoring the config settings when reading/writing flows, so I am unable to find my training run when running my inference flow. Thanks

    10 replies
    Snehal Shirgure
    @snehalshirgure
    Hello folks, has anyone looked into inspecting data and metrics from runs using a visualization tool such as ipywidgets in Jupyter notebooks? Any suggestions/ideas along the same lines are welcome :)
    3 replies
    Malay Shah
    @malay95
    Hey guys, I am looking into the implementation of Metaflow and how it interacts with Batch and other AWS services. I wanted to look at the code for that but could not find it. Can anyone point me to the script or class that handles all the interaction with AWS? Thank you very much.
    2 replies
    seanv507
    @seanv507
    Are there any plans to allow batch steps to reuse the same container? At the moment it feels like we get a massive slowdown when moving from local runs to batch runs, because each step suddenly requires the whole batch scheduling, pulling down of containers, etc. I am keen to keep the flexibility of demanding extra resources for a particular step, but oftentimes consecutive steps don't need additional resources. [Obviously one can collapse these steps together, but then one loses the whole retry functionality.]
    19 replies
    Ayotomiwa Salau
    @AyonzOnTop
    Hello, I was trying out Metaflow in a notebook and got this error: "Flow('PlayListFlow') does not exist". I can't find a way to instantiate/create a flow from a notebook.
    2 replies
    Kyle Smith
    @smith-kyle
    Hello, I'm doing some initial work for a manual deployment. It's important that we only use a private subnet. Is this feasible? Why does the default installation include a public subnet?
    3 replies
    Elham Zamansani
    @elham-zs
    Hey guys, I have a problem importing the environment decorator. I guess there is a bug there, because when I import it as follows: from metaflow import FlowSpec, step, environment, it gives an error that environment is not callable. That makes sense, because when imported like this, metaflow wants to read from the environment.py script. I did a small test: if I change the name of environment on line 27 of environment_decorator.py to anything else and then import that, it works. Could you please check it, or correct me if I am missing something regarding the import?
    9 replies
    seanv507
    @seanv507
    Analysing job run times. Hi, we would be interested in monitoring AWS Batch run times, ideally within CloudWatch. https://docs.aws.amazon.com/batch/latest/userguide/batch_cwet.html provides a very useful stream of information. Metaflow provides e.g. run_id (https://github.com/Netflix/metaflow/blob/04881c58c22e4e7e66a4faa7f676fcfca454c027/metaflow/plugins/aws/batch/batch.py#L127), which appears in this stream. My question is how we can cross-reference this to e.g. run parameters, so that we can get aggregate statistics for a given parameter.
    10 replies
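
    One way to do the cross-referencing after the fact, sketched with the client API: parameter values are stored as run artifacts, so a run id pulled from the CloudWatch event stream can be joined back to its parameters (flow and parameter names are illustrative).

    from metaflow import Run

    def params_for(flow_name, run_id):
        """Look up the parameter values recorded for one run."""
        data = Run(f"{flow_name}/{run_id}").data   # parameters live among the artifacts
        return {"alpha": data.alpha}               # 'alpha' is an illustrative parameter

    # e.g. join against Batch run times scraped from the CloudWatch events
    print(params_for("MyFlow", "1234"))
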
    Robert Sandmann
    @rsandmann
    Dear metaflow team, first of all I want to thank you for this great piece of technology! It's quite amazing how insanely easy you make it for our data scientists to define their workflows, especially when you look at the internals of how you achieve that. Great work!
    A use case that we are currently trying to implement is using metaflow to set up a workflow for federated learning. That means we do the client training in a foreach for every client, which works great. The problem is that we need to repeat these federated training rounds, which introduces cycles into the DAG.
    Our current approach is to monkeypatch metaflow internals to allow cycles in the DAG and dynamically add new steps for every round using a custom FlowDecorator.
    This approach seems rather hacky (and is not yet quite working).
    Branch specific concurrency (Netflix/metaflow#172) or graph composition (https://github.com/Netflix/metaflow/issues/144) might make our lives easier.
    But I was wondering if you had any ideas on how to make this possible in the current state of metaflow. I'm grateful for any hints!
    2 replies
    Patrick John Chia
    @patrickjohncyh

    Hello! @christineyu-coveo and I have been using Metaflow recently and really enjoy it. We also face another issue related to using @batch and @environment.

    Consider the following

    @batch
    @environment(vars={'var_1': os.getenv('var_1')})
    @step
    def step_A(self):
        ...
        self.next(self.step_B)

    @batch
    @environment(vars={'var_2': os.getenv('var_2')})
    @step
    def step_B(self):
        ...

    Metaflow initializes decorators for all steps before running any step. For @environment this includes running step_init, which updates the environment variables based on the vars passed to the decorator. Following the flow above: while running step_A, the environment decorator for step_B is also initialized, and an exception occurs because var_2 is None in the Batch environment for step_A, since it was not included in step_A's @environment decorator. Our current fix involves disabling step_init entirely for @environment. While this works for our use case (i.e. >1 @batch steps, with @environment used in either or both of them), I suspect it might disable some of the other use cases of @environment. Do you have any alternative solutions to this problem? Perhaps the batch decorator could be modified to also allow inclusion of environment variables that we want to ship with the job.

    7 replies