    tianyi-qi-zefr
    @tianyi-qi-zefr

    Metaflow could not install or find CUDA in the GPU environment, and PyTorch could not use the GPU at all. The issue was marked as resolved in Netflix/metaflow#250, but I could not replicate the fix.

    Sample code (test_gpu.py) I used:

    from metaflow import FlowSpec, step, batch, IncludeFile, Parameter, conda, conda_base
    
    class TestGPUFlow(FlowSpec):
    
        @batch(cpu=2, gpu=1, memory=2400)
        @conda(libraries={'pytorch': '1.5.1', 'cudatoolkit': '10.1.243'})
        @step
        def start(self):
            import os
            import sys
            import torch
            from subprocess import call
            print(os.popen("nvidia-smi").read())
            print(os.popen("nvcc --version").read())
    
            print('__Python VERSION:', sys.version)
            print('__pyTorch VERSION:', torch.__version__)
            print('__CUDA VERSION:', torch.version.cuda)
            print('__CUDA available:', torch.cuda.is_available())
            print('__CUDNN VERSION:', torch.backends.cudnn.version())
            print('__Number CUDA Devices:', torch.cuda.device_count())
            print('__Devices')
            call(["nvidia-smi", "--format=csv",
                  "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
            print('Active CUDA Device: GPU', torch.cuda.current_device())
            print('Available devices ', torch.cuda.device_count())
            print('Current cuda device ', torch.cuda.current_device())
            print(f"GPU count: {torch.cuda.device_count()}")
    
            self.next(self.end)
    
        @step
        def end(self):
            pass
    
    if __name__ == "__main__":
        TestGPUFlow()

    cmd line I used

    USERNAME=your_name CONDA_CHANNELS=default,conda-forge,pytorch METAFLOW_PROFILE=your_profile AWS_PROFILE=your_profile python test_gpu.py --datastore=s3 --environment=conda run --with batch:image=your_base_image_with_cuda_support

    metaflow output

    2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: N/A      |
    2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |-------------------------------+----------------------+----------------------+
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |                               |                      |               MIG M. |
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |===============================+======================+======================|
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
    2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | N/A   43C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |
    2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |                               |                      |                  N/A |
    2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-------------------------------+----------------------+----------------------+
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38]
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-----------------------------------------------------------------------------+
    2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Processes:                                                                  |
    2021-03-10 18:38:13.788 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |  GPU   GI   CI        PID   Type   Process name                  GPU M

    Any idea what's wrong?

    19 replies
    Jacopo Tagliabue
    @jacopotagliabue_twitter
    Hi MF community, small request for feedback! We just posted a brief article with code on re-imagining model cards in a DAG-first world. Looking for honest feedback: if you like "DAG cards", we may invest some time in building a configurable package and release it (ping me anytime).
    4 replies
    Taleb Zeghmi
    @talebzeghmi
    Ayotomiwa Salau
    @AyonzOnTop
    Hello guys,
    Battling with this connection error. I restarted ports 443 and 8080, to no avail.
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='42xg9kw0rk.execute-api.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: /api/flows/MovieStatsFlow (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fec6ace2358>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    4 replies
    Anirudh Kaushik
    @anirudh-k
    Metaflow seems to ignore @retry decorators on a @catch'd step if it's followed by a @catch'd step. If I've got start -> @retry @catch A -> @catch B -> end, and step A raises an exception, A won't retry at all. The flow goes to step B. Is this normal?
    5 replies
    Vishal Siramshetty
    @iwwwish

    Hi all,

    I'm having an issue with IncludeFile. When I'm trying to pass my training data file as an input, it throws an error:

    AttributeError: 'bytes' object has no attribute 'path'

    I am trying to read the file in one of the steps using Pandas. I'd really appreciate any suggestions to deal with this issue.

    Thank you,
    Vishal
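A minimal sketch of reading an IncludeFile value with a parser instead of treating it as a path. Going by the error message, it assumes the artifact holds the file's raw contents (bytes here, as with `is_text=False`; a str otherwise) rather than an object with a `.path`. With pandas, the equivalent would be `pd.read_csv(io.BytesIO(self.training_data))`, where `training_data` is a hypothetical artifact name. The stdlib version below keeps the example dependency-free; `rows_from_included_file` is a hypothetical helper.

```python
import csv
import io


def rows_from_included_file(contents, encoding="utf-8"):
    """Parse CSV data held in memory (str or bytes), e.g. the value of an IncludeFile artifact.

    The value is the file's contents, so it is wrapped in a file-like object
    instead of being opened through a path.
    """
    if isinstance(contents, bytes):
        contents = contents.decode(encoding)
    return list(csv.DictReader(io.StringIO(contents)))


# Hypothetical usage inside a step:
#     rows = rows_from_included_file(self.training_data)
print(rows_from_included_file(b"feature,label\n0.5,1\n"))
```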

    8 replies
    Richard Puckett
    @rapuckett
    Is there a best-practice way to deploy Metaflow into an environment that has no inbound access from the Internet? Everything would be run from within the VPC. Thanks!
    3 replies
    Ryan Chui
    @rchui

    In the step function role in the metaflow cloudformation template:

            - PolicyName: AllowCloudwatch
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Sid: CloudwatchLogDelivery
                    Effect: Allow
                    Action:
                      - "logs:CreateLogDelivery"
                      - "logs:GetLogDelivery"
                      - "logs:UpdateLogDelivery"
                      - "logs:DeleteLogDelivery"
                      - "logs:ListLogDeliveries"
                      - "logs:PutResourcePolicy"
                      - "logs:DescribeResourcePolicies"
                      - "logs:DescribeLogGroups"
                    Resource: '*'

    What is the action logs:PutResourcePolicy used for?

    8 replies
    Corrie Bartelheimer
    @corriebar

    Hey, I want to dynamically change the required resources for a step, and found this workaround in Netflix/metaflow#431:

    @resources(cpu=8, memory=os.environ['MEMORY'])

    and then starting the flow with MEMORY=16000 python myflow.py run. This works fine locally but fails when running with batch. Am I missing something?
    Or is there any other way to change the resources using parameters or similar without creating different sized steps?
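A possible explanation, which is an assumption and not confirmed in this thread: when a step runs on Batch, the flow file is imported again inside the container, where `MEMORY` is not set, so `os.environ['MEMORY']` raises a `KeyError` before the step starts. Reading the variable with a fallback avoids that crash. `env_int` is a hypothetical helper, and whether the value read inside the container has any effect on the submitted job is worth verifying.

```python
import os


def env_int(name, default):
    """Return the integer value of environment variable `name`, or `default` if it is unset or empty."""
    value = os.environ.get(name)
    if value is None or value == "":
        return default
    return int(value)


# Hypothetical usage in the flow file:
#     @resources(cpu=8, memory=env_int("MEMORY", 4000))
#     @step
#     def train(self): ...
```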

    9 replies
    Greg Hilston
    @GregHilston

    Hey @savingoyal and other Metaflow developers, some colleagues and I are getting to the point where we'll have a PR ready for the metaflow-tools repo. This PR will add a deployable Terraform stack.

    We've read through the CONTRIBUTING.md file and found this older issue that documents asking for a Terraform stack:

    Netflix/metaflow#38

    Our goal is to have this PR submitted by the end of this week and just wanted to start the dialogue with you guys. Super excited to see what happens :)

    7 replies
    Samvel
    @spacejaM1892_twitter

    Hi Guys,

    Could you provide some sort of diagram of the AWS resources required to run Metaflow in the cloud? The CloudFormation template is not very helpful; the YAML file is huge.

    4 replies
    Antoine Tremblay
    @hexa00
    Hi, I just realized that Metaflow doesn't show output from the standard logging module: calls like logging.info("something") are not printed. Is there a way to make those print?
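This is likely standard-library behavior rather than anything Metaflow-specific: the root logger defaults to the WARNING level, and with no handler configured only WARNING and above reach stderr, so `logging.info(...)` is silently dropped. A minimal sketch that routes INFO records to stdout; calling it at the start of each step is an assumption based on each task running in its own process. `configure_step_logging` is a hypothetical helper name.

```python
import logging
import sys


def configure_step_logging(level=logging.INFO):
    """Route log records at `level` and above to stdout so they show up in the step's output."""
    logging.basicConfig(
        level=level,
        stream=sys.stdout,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (requires Python 3.8+)
    )


# Hypothetical usage at the top of a step:
#     configure_step_logging()
#     logging.info("something")  # now printed
```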
    2 replies
    David Patschke
    @dpatschke
    Is there a way to launch a Flow run via command-line with --namespace that sets the run to the global namespace?
    I tried the suggestion in the CLI help recommending the empty string (--namespace=) but when I run get_namespace() within the Flow (or current.namespace), I'm still getting the user namespace. I also tried setting it to --namespace=None but that uses the string 'None' vs. NoneType. As per the Metaflow docs, I'm hesitant to hardcode namespace(None) into my code as a workaround.
    5 replies
    Ayotomiwa Salau
    @AyonzOnTop
    Battling with this server connection error. I restarted ports 443 and 8080, to no avail.
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='42xg9kw0rk.execute-api.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: /api/flows/MovieStatsFlow (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fec6ace2358>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    I pip-installed Metaflow locally. Any assistance is appreciated.
    17 replies
    David Patschke
    @dpatschke
    Would it be possible to upgrade the version of pylint that Metaflow requires for the next release?
    It is currently at < 2.5.0.
    I've started to get a nasty pylinting error that is associated with pandas. This issue appears to be resolved in pylint 2.7. Here is a link to the issue I'm experiencing: PyCQA/pylint#3836
    4 replies
    Richard Puckett
    @rapuckett
    Sorry if I missed any previous questions on this, but I'm curious if I'm doing something wrong here. Seems execution stops after instantiating TestFlow (see gist). Is this expected behavior? Thanks. https://gist.github.com/rapuckett/b5355828695d1f7711400ddd837c5ede
    13 replies
    Sam Petulla
    @petulla1_gitlab
    Does anyone know the expected date for supporting Sagemaker Models?
    5 replies
    pranaygp
    @pranaygp:beeperhq.com
    Hey!
    Anyone have any good dashboard notebooks for inspecting runs in progress?
    I've just been using the one from the metaflow tutorial, but it's pretty barebones and I'm wondering if anyone came up with something nicer
    6 replies
    serj90
    @serj90
    Hi guys.
    I created a step function for my flow and specified CPU and RAM for the steps using the @resources decorator, but after the run I noticed that it ignores values lower than the Metaflow defaults (cpu: 1, memory: 4000); more precisely, those steps run in a default-sized container. The situation is different with the @batch decorator, which does allow smaller containers. Is anyone else facing this issue, or does anyone have an idea why we can't specify smaller resources when running a flow as a step function?
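One possible explanation, an assumption worth checking against the Batch decorator source in your Metaflow version: a Step Functions deployment runs every step with @batch, and when a step also has @resources, the two sets of values are merged by taking the larger value for each key. Since @batch carries its own defaults, anything smaller in @resources gets overridden, while values passed directly to @batch replace those defaults. The sketch below models that rule with plain dicts; `merge_resources` is hypothetical, and the defaults are the ones quoted in the message.

```python
def merge_resources(batch_attrs, resources_attrs):
    """Model of the assumed merge rule: keep the larger of the @batch and @resources value per key."""
    merged = dict(batch_attrs)
    for key, value in resources_attrs.items():
        current = merged.get(key)
        if current is None and value is None:
            continue
        merged[key] = str(max(int(current or 0), int(value or 0)))
    return merged


# Defaults as quoted above (cpu: 1, memory: 4000).
batch_defaults = {"cpu": "1", "memory": "4000"}
# Under this rule, @resources(memory=2000) cannot go below the default...
print(merge_resources(batch_defaults, {"memory": "2000"}))  # memory stays '4000'
# ...while @batch(memory=2000) replaces the default outright.
print(merge_resources({"cpu": "1", "memory": "2000"}, {}))  # memory is '2000'
```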
    7 replies
    jamesbsilva
    @jamesbsilva

    Hey All, Quick question,

    Does anyone know of any PyCharm settings or plugins that let the IDE check whether all the right dependencies have been loaded at the @step level?

    1 reply
    mkjacks5
    @mkjacks5

    I am trying to run metaflow inside a docker container, but I have to run it as non-root user. When I try to import with

    "metaflow configure import metaflow_config/config.txt"

    I get "PermissionError: [Errno 13] Permission denied: '/.metaflowconfig'"

    I have tried changing the permissions with chown and chmod, currently set to

    drwxrwxrwx 2 1000 1000 6 Apr 5 19:01 .metaflowconfig

    But no luck. Can I run metaflow inside a docker container without being a root user?
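The path in the error, `/.metaflowconfig`, suggests that `~` expanded to `/`, which is what happens when `HOME` is `/` for the container user (an assumption about this image). If so, the directory that was chmod'ed may not be the one Metaflow is writing to. Setting `HOME` to a writable directory for the non-root user (for example `ENV HOME=/home/appuser` in the Dockerfile, a hypothetical path) should move the config location; some Metaflow versions also read `METAFLOW_HOME` for this, which is worth confirming for yours. The sketch below only demonstrates the expansion; `expand_with_home` is a hypothetical helper.

```python
import os
import posixpath


def expand_with_home(path, home):
    """Expand a leading '~' in `path` the way POSIX does when HOME is set to `home`."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home
    try:
        return posixpath.expanduser(path)
    finally:
        # Restore the caller's environment.
        if saved is None:
            os.environ.pop("HOME", None)
        else:
            os.environ["HOME"] = saved


print(expand_with_home("~/.metaflowconfig", "/"))              # /.metaflowconfig
print(expand_with_home("~/.metaflowconfig", "/home/appuser"))  # /home/appuser/.metaflowconfig
```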

    6 replies
    pranaygp
    @pranaygp:beeperhq.com
    Can you inherit and extend a flow?
    I'd like to reuse the entire flow and just override one of the steps.
    4 replies
    when trying to do this, metaflow complains about steps being missing whenever I try to do "self.next" to a step that exists in the parent flow class
    grizzledmysticism
    @grizzledmysticism
    Has anyone been able to debug in PyCharm using the conda decorators without getting a "No conda installation found. Install conda first" message? I followed the instructions in the documentation (setting the PATH env var to include the conda executable), but it still doesn't recognize it.
    20 replies
    Corrie Bartelheimer
    @corriebar
    Hey everyone, one question regarding metaflow on step functions. Is it possible to somewhere see when a flow was last updated, i.e. redeployed?
    2 replies
    Sam Petulla
    @petulla1_gitlab
    Is it possible to remove any services from the cloudformation template? The services list is pretty extensive, wondering if any can be removed for cost-savings
    5 replies
    Sam Petulla
    @petulla1_gitlab

    Separately, anyone had this issue?

    I can run aws s3 cp (my file) (remote bucket) and it picks up the $AWS_PROFILE variable correctly (with an alias set on the aws command to do so). However, running METAFLOW_PROFILE=personal python 05-helloaws/helloaws.py --datastore=s3 run gives me a token-expired error, which I think is because it is using the wrong profile. Any tips on how to debug this without just switching profile names? I will need to use a named profile.

    7 replies
    Oleg Avdeev
    @oavdeev

    I've created Netflix/metaflow#473 to stop supporting Python 2.x from the next Metaflow release, mostly because it will allow Python 3 type annotations and make the codebase more contributor-friendly.

    It seems like a pretty conservative move given that Python 2.7 has been EOL'ed more than a year ago. But I'm curious if anyone here is still using Metaflow with Python 2.7 and would be affected by this change?

    mkjacks5
    @mkjacks5

    I am getting internal server errors when I add 'METAFLOW_DEFAULT_METADATA': 'service' to my config file. My config file contains METAFLOW_SERVICE_URL, METAFLOW_SERVICE_INTERNAL_URL and METAFLOW_SERVICE_AUTH_KEY and I have verified they match what is in the cloudformation stack output.

    when I try to run a script locally with

    python inference-flow.py --environment=conda --datastore=s3 run

    I get the following

    Bootstrapping conda environment...(this could take a few minutes)
    
        Metaflow service error:
    
        Metadata request (/flows/inference-flow) failed (code 500): {"message": "Internal server error"}

    If I try to run step functions create I get the following:

    Running pylint...
    
        Pylint is happy!
    
    Deploying inference_flow to AWS Step Functions...
    
        Internal error
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/metaflow/cli.py", line 930, in main
        start(auto_envvar_prefix='METAFLOW', obj=state)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/click/decorators.py", line 33, in new_func
        return f(get_current_context().obj, *args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/aws/step_functions/step_functions_cli.py", line 88, in create
        check_metadata_service_version(obj)
      File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/aws/step_functions/step_functions_cli.py", line 120, in check_metadata_service_version
        version = metadata.version()
      File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 41, in version
        return self._version(self._monitor)
      File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 288, in _version
        (path, resp.status_code, resp.text),
    NameError: name 'path' is not defined

    list(Metaflow()) gives the following

    Traceback (most recent call last):
      File "cleanup.py", line 14, in <module>
        print('list(Metaflow())',list(Metaflow()))
      File "/usr/local/lib/python3.7/site-packages/metaflow/client/core.py", line 245, in __iter__
        all_flows = self.metadata.get_object('root', 'flow')
      File "/usr/local/lib/python3.7/site-packages/metaflow/metadata/metadata.py", line 357, in get_object
        return cls._get_object_internal(obj_type, type_order, sub_type, sub_order, filters, *args)
      File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 116, in _get_object_internal
        return MetadataProvider._apply_filter(cls._request(None, url), filters)
      File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 247, in _request
        resp.text)
    metaflow.plugins.metadata.service.ServiceException: Metadata request (/flows) failed (code 500): {"message": "Internal server error"}
    
    script returned exit code 1

    Any ideas what might be happening?

    16 replies
    Mike Bentley Mills
    @mikejmills
    Hi, I'm wondering if there is a way to see the code packages that were saved. Specifically, can you look at them using the Metaflow client's Run and Task objects?
    8 replies
    Edvin Mońćibob
    @emocibob
    Hi, is there a way to inspect a past run and see the parameters (metaflow.Parameter) it was started with?
    2 replies
    pranaygp
    @pranaygp:beeperhq.com
    This has probably been asked before but, can you do conditional self.next ?
    3 replies
    if x:
      self.next(...)
    else:
      self.next(...)
    Sorry, that's a rhetorical question. What I meant is: what's the recommended way to do something like that? cc @tuulos
    6 replies
    pranaygp
    @pranaygp:beeperhq.com
    Ah, yeah #3 is my best case. Would be good to have native support for it
    1 reply
    The problem with 1 and 2 is that they take a lot of time when the steps run on Batch
    since we need to provision a GPU instance
    just for a no-op.
    Kamil Bobrowski
    @kbobrowski

    Hi, question about isolation of steps, I noticed that these two steps will be executed in the same conda environment:

    from metaflow import FlowSpec, step, conda
    
    class IsolationTest(FlowSpec):
    
        @conda(python="3.8.5")
        @step
        def start(self):
            import sys
            print(f"start executable: {sys.executable}")
            self.next(self.end)
    
        @conda(python="3.8.5")
        @step
        def end(self):
            import sys
            print(f"end executable: {sys.executable}")
    
    
    if __name__ == "__main__":
        IsolationTest()

    They run in separate environments only if the Python version is different. Is there a way to ensure that separate environments are created? (Context: I need to install packages from pip, which results in heavy installing and rolling back of packages every time the flow is executed.)

    5 replies
    Kelly Davis
    @kldavis4
    Question about Step Functions IAM permissions. We have a step that calls next() with a foreach param, and when it runs as a step function, we get an error that the AWS Batch execution role doesn't have permission to call PutItem on the Step Functions DynamoDB table. This happens in step_functions_decorator.py in task_finished(), so it makes sense to me that the Batch execution role would need the DynamoDB permissions, but when I look at the CloudFormation template in metaflow-tools, it doesn't seem to grant them. Can someone confirm that the Batch execution role does need permissions to the Step Functions DynamoDB table?
    2 replies
    daavidstein
    @daavidstein

    Re (2) of issue #149:

    We are considering porting our data processing/ML pipelines to metaflow, but one thing that is holding us back is the lack of support for integration testing. For instance, we currently use kedro for our data pipelines. Kedro provides the ability to call and run a kedro pipeline from another python script (directly, without using subprocess.run) and additionally provides the ability to override the default datasets used in that pipeline at runtime. This is important, because some of our datasets are very large, so we naturally want to use subsetted versions of these datasets. In fact, some of the datasets we inject into the pipeline for the integration test are generated with Hypothesis which ensures that our pipelines are robust to unanticipated variations in the data.

    Furthermore, although it's not ideal for an integration test, there are some expensive functions, or functions that rely on a network connection, that we want to patch using unittest.mock.patch. This doesn't seem to be possible when running a Metaflow pipeline with subprocess.run.

    One solution could be to have a test flow inherit from the flow to be tested and override the artifacts that way, i.e.:

    class TestPlaylistFlow(PlayListFlow):
        movie_data = test_data

    But as far as I understand it, the child flow will not inherit the step methods, which would force us to import them and manually specify them in the test flow.

    The only other solution I can think of at present is to define a boolean parameter test in the original flow and based on the value of that parameter assign different artifacts to the instance variables as necessary. Is there another option that can be implemented with the current version of metaflow==2.2.9?
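The point about patching can be illustrated in plain Python: `unittest.mock.patch` swaps an attribute only inside the current interpreter, so code started in a child process with `subprocess.run` never sees the patch, and running the pipeline logic in-process is what makes it work. `DataSource` and `build_features` are hypothetical stand-ins, not Metaflow APIs.

```python
from unittest import mock


class DataSource:
    """Hypothetical stand-in for an expensive or network-bound dependency."""

    def fetch(self):
        raise RuntimeError("network access is not available in tests")


def build_features(source):
    """Hypothetical pipeline logic that depends on DataSource.fetch."""
    return [value * 2 for value in source.fetch()]


def run_patched():
    # The patch is active only inside this block and only in this process.
    with mock.patch.object(DataSource, "fetch", return_value=[1, 2, 3]):
        return build_features(DataSource())


print(run_patched())  # [2, 4, 6]
```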

    11 replies
    russellbrooks
    @russellbrooks

    hey guys, had a teammate run into what I believe is a bug in how Parameters are used in Step Functions deployments. If there is a Parameter that defaults to None like

    Parameter(name="test_param", type=int, default=None)

    it'll result in the following error when deploying with step-functions create.

    Flow failed:
        The value of parameter test_param is ambiguous. It does not have a default and it is not required.

    The same flow/parameter will run successfully locally or when submitted to batch without SFNs.

    1 reply
    pranaygp
    @pranaygp:beeperhq.com
    [m]
    is there a good way to "tag" runs?
    the auto increment number is fine, but I'd like to use a name ideally when kicking off jobs from the command line so it's easier to keep track of results from multiple runs and compare them easily
    2 replies
    vinod-rachala
    @vinod-rachala
    I am using Metaflow on ECR, and while executing the code through Step Functions I get this error. Please let me know if there are any solutions. Error: Metaflow 2.2.8 executing preprocessflow unknown user:Metaflow could not determine your username based on environment variables ($USERNAME etc.)
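A sketch of the kind of lookup the error message refers to. The variable names and their order (`METAFLOW_USER`, `SUDO_USER`, `USERNAME`, `USER`), and the skipping of `root`, are assumptions from memory of Metaflow's username detection; check them against your installed version. If that is right, the error means none of these variables held a usable value in the environment where the command ran, and setting one of them (for example `USERNAME=your_name`) is the usual fix. `resolve_username` is hypothetical.

```python
import os

# Assumed lookup order, mirroring (not importing) Metaflow's username detection.
USERNAME_VARS = ("METAFLOW_USER", "SUDO_USER", "USERNAME", "USER")


def resolve_username(env=None):
    """Return the first non-empty, non-root username found in `env`, or None."""
    env = os.environ if env is None else env
    for var in USERNAME_VARS:
        value = env.get(var)
        if value and value != "root":  # assumption: 'root' is not accepted
            return value
    return None


print(resolve_username({"USER": "root"}))              # None
print(resolve_username({"USERNAME": "pipeline-bot"}))  # pipeline-bot
```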
    1 reply