    Greg Hilston
    Successfully ran locally, using AWS resources
    The metadata service provider differs between local (ran successfully) and the SageMaker notebook (ran unsuccessfully):
    from metaflow import Flow, get_metadata
    print("Current metadata provider: %s" % get_metadata())
    Am I missing some steps/documentation? I feel like I'm missing a big step post-installation with CloudFormation
    Yeah, I think the SageMaker notebook instance doesn't allow one to launch AWS Batch tasks via Metaflow. This is something that we should enable in the CloudFormation template. But you should be able to launch your workflows from your laptop onto the configured AWS resources and track the workflow using the SageMaker notebook.
    @queueburt ^ I think we should make this change in the CFN template and make it more permissive.
    Greg Hilston

    Okay, noted on that front.

    So it looks like I was able to deploy a Metaflow task locally using the new knowledge of running aws configure with the CloudFormation output.

    Are there any other steps/documentation I may be missing on "linking" what I just kicked off locally to what the SageMaker notebook can "see"?

    Like I've noticed the metadata does not match, for starters. And I understand I can set the SageMaker notebook's namespace to None to look at everything. Anything else?

    SageMaker is not reporting any successful runs for that flow, but I know one occurred.
    The metadata might not match exactly, since we set the metadata in the notebook to the internal NLB address which is accessible only within the VPC.
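    For reference, the client decides which metadata service to talk to via the keys in ~/.metaflowconfig/config.json. A sketch of the two relevant keys, with a placeholder URL; a laptop outside the VPC would point at the public API Gateway endpoint rather than the internal NLB address:

```json
{
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_SERVICE_URL": "https://example123.execute-api.us-east-1.amazonaws.com/api"
}
```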
    Greg Hilston
    Okay, I see what you're talking about with the environment variables exposed in the YAML file, and I understand that the address is only reachable from inside the VPC.

    Any thoughts as to why I'm seeing that the flow worked correctly:

    05-helloaws % ./venv/bin/python3 run
    Metaflow 2.0.5 executing HelloAWSFlow for user:greg
    Validating your flow...
        The graph looks good!
    Running pylint...
        Pylint is happy!
    2020-06-08 15:39:49.381 Workflow starting (run-id 12):
    2020-06-08 15:39:50.239 [12/start/33 (pid 25604)] Task is starting.
    2020-06-08 15:39:55.576 [12/start/33 (pid 25604)] HelloAWS is starting.
    2020-06-08 15:39:55.667 [12/start/33 (pid 25604)]
    2020-06-08 15:39:55.667 [12/start/33 (pid 25604)] Using metadata provider: service@someurl
    2020-06-08 15:39:55.668 [12/start/33 (pid 25604)]
    2020-06-08 15:39:55.668 [12/start/33 (pid 25604)] The start step is running locally. Next, the
    2020-06-08 15:39:55.668 [12/start/33 (pid 25604)] 'hello' step will run remotely on AWS batch.
    2020-06-08 15:39:55.668 [12/start/33 (pid 25604)] If you are running in the Netflix sandbox,
    2020-06-08 15:39:55.668 [12/start/33 (pid 25604)] it may take some time to acquire a compute resource.
    2020-06-08 15:39:58.257 [12/start/33 (pid 25604)] Task finished successfully.
    2020-06-08 15:40:00.804 [12/hello/34 (pid 25666)] Task is starting.
    2020-06-08 15:40:02.483 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Task is starting (status SUBMITTED)...
    2020-06-08 15:40:07.081 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Task is starting (status RUNNABLE)...
    2020-06-08 15:40:08.228 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Task is starting (status STARTING)...
    2020-06-08 15:40:28.829 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Task is starting (status RUNNING)...
    2020-06-08 15:40:33.002 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Setting up task environment.
    2020-06-08 15:40:52.479 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Downloading code package.
    2020-06-08 15:40:52.479 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Code package downloaded.
    2020-06-08 15:40:56.269 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Task is starting.
    2020-06-08 15:40:56.269 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Metaflow says: Hi from AWS!
    2020-06-08 15:40:56.269 [12/hello/34 (pid 25666)] [ba8fc20f-33e7-4e57-bc07-d0adfb85a0d9] Task finished with exit code 0.
    2020-06-08 15:40:59.325 [12/hello/34 (pid 25666)] Task finished successfully.
    2020-06-08 15:41:00.678 [12/end/35 (pid 25871)] Task is starting.
    2020-06-08 15:41:06.365 [12/end/35 (pid 25871)] HelloAWS is finished.
    2020-06-08 15:41:08.839 [12/end/35 (pid 25871)] Task finished successfully.
    2020-06-08 15:41:09.336 Done!

    but running this code from helloaws.ipynb

    run = Flow('HelloAWSFlow').latest_successful_run
    print("Using run: %s" % str(run))

    Errors out, as no successful run is found?

    That's the situation I'm in right now and trying to debug
    Can you set the metadata provider within the notebook and try again?
    Or just try listing all runs
    Greg Hilston
    for run in Flow('HelloAWSFlow').runs():
        print(f"{run.id} finished at {run.finished_at} with a status of {run.successful}")
    10 finished at None with a status of False
    9 finished at None with a status of False
    8 finished at None with a status of False
    7 finished at None with a status of False
    6 finished at None with a status of False
    5 finished at None with a status of False
    4 finished at None with a status of False
    3 finished at None with a status of False
    2 finished at None with a status of False
    1 finished at None with a status of False
    And your namespace is set to None?
    Greg Hilston
    Thank you so much, it was not set to None. I never executed that cell
    It worked :)
    Apologies for the very simple situation. I'm just happy to have completed my first AWS execution :)
    No worries. Glad we got it working! I think we should document this better in our Notebook.
    @GregHilston - Netflix/metaflow#215
    Wooyoung Moon

    @wmoon5 You can try executing from the parent folder or rely on IncludeFile.

    Hi, just circling back on this. Are there any best practices for structuring larger projects with potentially many different flows that share the same set of library modules we've written ourselves? Is there a good way to have a "flows" subdirectory that can still import modules from elsewhere in the project via relative imports?

    Hi @wmoon5 - I typically have all my flows at the top-level directory, and then have subdirectories for each of my custom Python modules. Just make sure each module subdirectory has an __init__.py. If the modules are used by multiple projects (e.g. different git repos with different flows), you can keep your custom libraries in a separate git repo and use a git submodule within your flows repo.
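    A minimal layout along those lines (names are illustrative) might look like:

```
my_project/
├── training_flow.py      # flows live at the top level
├── scoring_flow.py
└── mylib/                # shared custom modules
    ├── __init__.py       # makes mylib importable as a package
    └── preprocessing.py
```

    With this layout a flow can simply do from mylib import preprocessing, and the subdirectory should travel along in Metaflow's code package when steps run remotely.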
    Wooyoung Moon
    Thanks @bergdavidj appreciate the tips!
    Marty Kemka
    Hi everyone, I am sorry if this is explained in the docs, but is it OK to have a 'foreach' loop as well as a split in the same flow? For some reason I am receiving a join error even when I join the two separate steps.
    1 reply
    Alexander Efimov
    @savingoyal I saw you did an integration with AWS Step Functions. Imagine I have a workflow with an async step invoking an external service that might take from 1 to 48 hours to execute. Is there any way to trigger the continuation steps when that external service calls back? I could split this workflow into two, but how do I correlate the runs then?
    9 replies
    Joseph Bentivegna

    Hi all, I ran into this error today out of the blue when running my flow. I tried running the HelloFlow metaflow-tutorial flow as well and got the same error. Anyone know a solution?

    Metaflow 2.0.2 executing HelloFlow for user
    Validating your flow...
    The graph looks good!
    Running pylint...
    Pylint not found, so extra checks are disabled.
    Metaflow service error:
    Metadata request (/flows/HelloFlow) failed (code 500): "{\"err_msg\": \"asynchronous connection attempt underway\"}"

    30 replies
    Joseph Bentivegna
    Sreehari Sreejith

    Hi, we are interested in coming up with a POC of running Metaflow on our existing Kubeflow Pipelines (KFP) setup, and I'm hoping to get some feedback on a few initial thoughts I had in mind (following up on @savingoyal's comment).

    The idea for the POC is to try to represent Metaflow code in a format that the KFP SDK understands (specifically GraphComponentSpec - link to schema, outline) and have it execute using KFP (option 1 below). This idea is based on the suggestions on this KF thread, from which we see two ways to achieve this.

    1. Metaflow -> (compiles to) -> GraphComponentSpec -> (which is then executable by) -> KFP (Preferable option for the short term)
    2. Metaflow -> (compiles to) -> TFX IR -> (which is then executable by) -> KFP (Support is currently limited for this option)

    At this point, I’ve gone over the tutorials and the technical overview of Metaflow, and I’m not entirely sure whether we would be able to compile to the specified format correctly. Does option 1 sound like a feasible approach to you? We’d also welcome alternate suggestions for using/running Metaflow on Kubeflow Pipelines/Kubeflow.

    It would be great if you could point me to any docs/links you may have (even works in progress) that could serve as a starting point. Thank you!

    13 replies
    Joseph Bentivegna

    @ferras and @savingoyal here's an update on the status of this error I have been encountering. This morning we destroyed and rebuilt our Metaflow infrastructure using a Terraform script that my colleague created. After this rebuild, when we ran our flow we got a new error: "requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))".

    This error pointed us to our Metaflow Service container running on ECS. When we investigated this we found some weird behavior:

    1. We noticed the task container would oscillate between ‘RUNNING’, ‘STOPPED’, and ‘PENDING’ status even when we wouldn’t be running any flows.
    2. When we checked the logs we saw a similar error as before: “psycopg2.OperationalError: asynchronous connection attempt underway”

    I’m once again stumped on where to go from here. My team here at Cigna would greatly appreciate any guidance.

    6 replies
    Sreehari Sreejith
    The documentation points only to a CloudFormation template and I was wondering if there was a terraform equivalent. I found this PR Netflix/metaflow-tools#14 and the relevant issue Netflix/metaflow#38 - wondering if there are any updates on this?
    4 replies
    @jbentivegna15 any luck with your previous issue?
    Joseph Bentivegna
    @ferras nope :/ I suspected it could have been a security group issue between RDS and ECS but even after manually updating them it didn't fix the issue
    sorry for not responding to your previous message
    How would I get the stack trace?
    @jbentivegna15 have you tried using our provided cloudformation template?
    Joseph Bentivegna
    Yeah, I've tried it, but due to Cigna's internal permissions it doesn't work
    4 replies
    Sergio Calderón Pérez-Lozao
    Hi guys :) can you check this issue? Netflix/metaflow#166
    @jbentivegna15 can you follow up on this issue here: Netflix/metaflow-service#15
    Joseph Bentivegna
    @ferras can do, lemme write up some stuff
    @sergiocalde94 Yes, we are in the process of revamping logging for OSS. Expect some progress on that issue relatively soon.
    (EJ) Vivek Pandey

    So I am getting this error while trying to inspect the data of my last successful run this morning: S3 datastore operation _get_s3_object failed (An error occurred (404) when calling the HeadObject operation: Not Found)

    I tried looking up the path with flow.latest_successful_run.end_task.artifacts.raw_data_df._object and got the S3 location of the raw_data_df data artifact. I used a separate script to verify I can access and download it, and that seems to work fine. Any ideas what I might be missing here?

    1 reply
    Tutorial Episode 6 does not work, as described in Netflix/metaflow#241. Please advise how to fix it.
    1 reply
    A Ivan
    Hi! I am trying to run the 05-helloaws example. I have manually set up a Batch Job Queue and Compute Environment, as well as an S3 bucket. To debug, I have given the AWSBatchServiceRole and the ecsInstanceRole the S3FullAccess policy. I have also created an IAM role for ecs-tasks with the S3FullAccess policy; this is the policy I have specified for METAFLOW_ECS_S3_ACCESS_IAM_ROLE in ~/.metaflowconfig/config.json. When I run the script with python --datastore=s3 run everything executes fine; however, if I use the conda environment flag python --environment=conda --datastore=s3 run, the job hangs on the "Bootstrapping environment" step. After 45 minutes to an hour the job crashes with an OOM error OutOfMemoryError: Container killed due to memory usage, with an underlying error in the logs: metaflow.datatools.s3.MetaflowS3Exception: Getting S3 files failed. First prefix requested: s3://mf-test-bucket-28393/mf/conda/ -- Has anybody seen this before or can anyone provide any insights?
    2 replies
    Is there a way to
    • run a step after a foreach step that doesn't join?
    • run another foreach after a foreach step?
    3 replies
    something like
             foreach (a) -- continue (a)
            /                           \
    start --                             -- end
            \                           /
             foreach (b) -- continue (b)
    Ji Xu
    May I know if this issue has been looked into? #179
    I really would like to take advantage of @conda but don't have root access.
    2 replies
    Is there a way to dynamically specify memory requirements for the @batch decorator at runtime? My workflow has a fan-out with many parallel jobs where each batch job uses between 10GB and 1TB of memory, and the amount of memory per job can be determined at runtime based on its input, created dynamically in a previous step. Obviously, I would like to avoid having to allocate an EC2 instance with 1TB of memory for each (small) job.
    2 replies
    Wooyoung Moon

    Hi, is there any way to run generic flows that read from a flow parameters JSON/YAML file? So that I can do something like this:

    python run --parameters my_flow_parameters.yml

    5 replies