On SFN, parameter values are passed to tasks as environment variables prefixed with METAFLOW_INIT_. A parameter with a name like "my-param", which is otherwise perfectly valid for Metaflow when using the local runtime, will result in an error when running via SFN, because many shells won't allow env vars with dashes in the name.
Hi, what is the way to run a nodejs process in the background in metaflow? I am running on batch using a custom docker image that has nodejs and python dependencies. The node app, once started, waits for a json post which is done by a task later in the metaflow python process.
The way to start the node app is "npm run dev"; however, when I use os.system('npm run dev'), the metaflow process gets paused at "App listening on http://localhost:8888" (as below), since it starts the node app right away, which then waits for the json on port 8888. That json is only computed in a later metaflow step and posted via requests.post("http://localhost:8888/savings-report", json=self.json_structure).
2021-01-26 00:29:49.835 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] Starting report generator ...
2021-01-26 00:29:49.835 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] > report-generator@1.0.0 dev /usr/src/app
2021-01-26 00:29:49.836 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] > ts-node src/server.ts
2021-01-26 00:29:49.836 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] App listening on http://localhost:8888
I would like to start the nodejs app using npm run dev via metaflow, leave it running in the background, and continue to the next steps in metaflow.
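A minimal sketch of one way to do this with subprocess.Popen instead of os.system, assuming the /usr/src/app working directory shown in the log above (note that on AWS Batch each step runs in its own container, so the step that posts the JSON would need to run in the same container that started the server):

import subprocess
import time

# Start the node app without blocking the Metaflow process.
proc = subprocess.Popen(
    ["npm", "run", "dev"],
    cwd="/usr/src/app",          # assumption: package.json lives here, per the log
    stdout=subprocess.DEVNULL,   # or redirect to a log file if the output matters
    stderr=subprocess.STDOUT,
)
time.sleep(5)                    # crude wait for the server to bind port 8888
# ... later, once self.json_structure is ready (same container):
# requests.post("http://localhost:8888/savings-report", json=self.json_structure)
# proc.terminate()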
Wondering if there's a more efficient way to implement the following design pattern directly in metaflow such that it would utilize multiprocessing to load and combine multiple dataframes after a foreach fanout:
df = [input.partition_df for input in inputs]
df = pd.concat(df, ignore_index=True, sort=False, copy=False)
A hacky way that's coming to mind is to just use joblib.Parallel or metaflow's parallel_map to access the artifacts in parallel, but it feels a bit odd. This pattern may also be related to the roadmap effort to open source your all's in-house goodies for dataframes. I use partitioned parquet files in a couple places to split out data, pass references around, and load in parallel – but there's a couple use cases where I'd prefer to stay within the metaflow ecosystem if possible :smiley:
Curious what your all's thoughts are, and just want to make sure I'm not missing something like a clever usage of s3.get_many.
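Not an official answer, just a rough sketch of the parallel_map variant mentioned above, assuming this runs in the join step of the foreach and that each branch stored a partition_df artifact; whether it actually helps depends on where the time goes (S3 download vs. unpickling):

import pandas as pd
from metaflow import parallel_map

def load_and_concat(inputs):
    # Each worker process touches inp.partition_df, which pulls that branch's
    # artifact from the datastore; the parent then concatenates the results.
    dfs = parallel_map(lambda inp: inp.partition_df, inputs)
    return pd.concat(dfs, ignore_index=True, sort=False, copy=False)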
When running --with batch it takes around 8 minutes; most of the time is consumed bootstrapping the conda environment for each step before actually running it!
I was in our AWS Batch console and I noticed two jobs that were seemingly stuck in RUNNING. The individual who kicked off those jobs says all his terminal sessions have been ended; he even went as far as restarting his PC and severing his internet connection.
I figure this is more of an AWS situation I'm debugging, but has anyone witnessed flows being stuck in RUNNING?
I know the jobs will die when the timeout is reached, just want to understand what may have caused this
02-statistics/stats.py.
Tutorial 4 seems to be failing when attempting to create a conda environment. The funny thing is that if I run the conda command directly, it succeeds. Not sure how to get the conda errors:
python 04-playlist-plus/playlist.py --environment=conda run
Metaflow 2.2.6 executing PlayListFlow for user:...
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
Bootstrapping conda environment...(this could take a few minutes)
Conda ran into an error while setting up environment.:
Step: start, Error: command '['/opt/miniconda/condabin/conda', 'create', '--yes', '--no-default-packages', '--name', 'metaflow_PlayListFlow_linux-64_c08336d0946efed6e92f165475dfc0d181f64361', '--quiet', b'python==3.8.5', b'click==7.1.2', b'requests==2.24.0', b'boto3==1.17.0', b'coverage==5.4', b'pandas==0.24.2']' returned error (-9): b''
Note that the following command succeeds:
/opt/miniconda/condabin/conda create --yes --no-default-packages --name metaflow_PlayListFlow_linux-64_c08336d0946efed6e92f165475dfc0d181f64361 --quiet python==3.8.5 click==7.1.2 requests==2.24.0 boto3==1.17.0 coverage==5.4 pandas==0.24.2
Note that I had to make a few minor changes to the demo to refer to Python 3.8.5 and to depend on more recent versions of boto3 and coverage than what metaflow was requesting; otherwise the generated conda create command would fail even on the command line.
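For reference, a rough sketch of where such pins could live, assuming a flow-level @conda_base instead of the tutorial's step-level @conda decorators (the versions are taken from the command above, not a recommendation):

from metaflow import FlowSpec, conda_base, step

@conda_base(python="3.8.5",
            libraries={"boto3": "1.17.0", "coverage": "5.4", "pandas": "0.24.2"})
class PlayListFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    PlayListFlow()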
...
File "/metaflow/metaflow/plugins/aws/step_functions/step_functions_decorator.py", line 54, in task_finished self._save_foreach_cardinality(os.environ['AWS_BATCH_JOB_ID'],
...
requests.exceptions.ConnectionError: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/placement/availability-zone/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7c0f12e3d0>: Failed to establish a new connection: [Errno 22] Invalid argument'))
Hi, I am working on getting the metaflow artifacts from S3. The code is deployed on AWS Lambda, and I set the environment variable "METAFLOW_DATASTORE_SYSROOT_S3" to the s3 location. Our use case requires us to change the datastore environment variable in every iteration so that different flows' and runs' artifacts can be accessed, as follows:
def _queryMetaflow(self, appName, starflowResp):
    metaflow_run_id = starflowResp["details"]["frdm"]["metaflowRunNumber"]
    metaflow_name = starflowResp["details"]["frdm"]["metaflowId"]
    os.environ['METAFLOW_DATASTORE_SYSROOT_S3'] = "{}/artifacts/{}/higher".format(getMetadataLocation(), appName)
    from metaflow import Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
    metadata1 = metadata(getMetadataURL())
    namespace(None)
    mf = Metaflow()
    # call metaflow and get results and send success or error
    try:
        metaflowResp = Run(metaflow_name + '/' + metaflow_run_id).data
        print(metaflowResp)
        del Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
        return metaflowResp
    except Exception as e:
        print("Exception occured in query metaflow: {}".format(e))
        raise CapAppFailure("Exception occured in metaflow response, S3 datastore operation _get_s3_object failed likely")
When this method is called, it doesn't fail in the first iteration but fails in the second. I inspected the environment variable and the location is correct in every iteration, yet this error is encountered in the second iteration:
S3 datastore operation _get_s3_object failed (An error occurred (404) when calling the HeadObject operation: Not Found). Retrying 7 more times..
I am unable to fix this issue. Can you please help?
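One hedged theory: metaflow reads METAFLOW_DATASTORE_SYSROOT_S3 into metaflow.metaflow_config only once, at first import, and the repeated "from metaflow import ..." inside the function is a no-op afterwards because the module stays cached in the warm Lambda, so the second iteration may still be using the first iteration's sysroot. A small diagnostic sketch to check that theory:

import os
import metaflow.metaflow_config as mf_config

# Compare the value Metaflow cached at import time with the env var that was
# just set. If they differ on the second invocation, the 404 is a stale
# sysroot rather than a genuinely missing artifact.
print("env var :", os.environ.get("METAFLOW_DATASTORE_SYSROOT_S3"))
print("cached  :", mf_config.DATASTORE_SYSROOT_S3)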
Hello Netflix employees, can someone please share about Metaflow's adoption at Netflix? In late 2018 it was used in 134 projects, how has it grown since then? What percentage of Netflix data scientists use metaflow?
We're considering Metaflow at my organization, so I'd just like to get a sense of the adoption rate we can hope for at my employer.
I get ModuleNotFoundError: No module named 'pandas' when the step function is triggered. I tried running with the commands python 02-statistics/stats.py --environment=conda step-functions create --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}"
as well as python 02-statistics/stats.py step-functions create --max-workers 4
and both give the same error message. However, python 02-statistics/stats.py --environment conda run --with batch --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}" works fine.
I ran metaflow configure aws --profile dev and then metaflow configure show, but it still says Configuration is set to run locally.
I am attempting to run step-functions create
and getting the following error: AWS Step Functions error:
ClientError("An error occurred (AccessDeniedException) when calling the CreateStateMachine operation: 'arn:aws:iam::REDACTED:role/metaflow-step_functions_role' is not authorized to create managed-rule.")
I am specifying METAFLOW_SFN_IAM_ROLE=arn:aws:iam::REDACTED:role/metaflow-step_functions_role
in my metaflow config.
The role is being created via terraform, but is based on https://github.com/Netflix/metaflow-tools/blob/master/aws/cloudformation/metaflow-cfn-template.yml#L839. That role does not have a grant for states:CreateStateMachine
but even if I add that, I still get the same error.
Any tips for troubleshooting this?
I deployed with python flow.py --with retry step-functions create --max-workers 1000, but when triggering the flow it only runs a maximum of 40 tasks in parallel. When running the flow on Batch without Step Functions, it worked fine. Any ideas what could be the reason for this throttling?
I'm experiencing some problems when trying to install pytorch
with CUDA enabled.
I'm running my flow on AWS Batch, powered by a p3.2xlarge
machine and using the image
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04
to get the NVIDIA driver installed.
The relevant flow code looks like:
@conda_base(python="3.8")
class FooFlow(FlowSpec):
    ...

    @batch(image=URL ABOVE)
    # this line below is of most interest
    @conda(libraries={"pytorch": "1.6.0", "cudatoolkit": "11.0.221"})
    @resources(memory=4*1024, cpu=2, gpu=1)
    @step
    def test_gpu(self):
        import os
        print(os.popen("nvidia-smi").read())
        print(os.popen("nvcc --version").read())
        import torch
I'm not convinced this is precisely a Metaflow issue, but the common solutions one finds when Googling involve installing PyTorch using the conda CLI, which the @conda decorator obviously abstracts away from us.
I've been running many flows with different versions of pytorch and cudatoolkit, but I keep getting:
Torch not compiled with CUDA enabled
I'm familiar with the Github Issue: Netflix/metaflow#250
Any advice at all?
We're working on creating a @notify() decorator that could send a notification upon success or failure, per Flow or per Step. It could send email or slack messages.
It would be up to the scheduler (local, AWS Step Functions, KFP) to honor the @notify decorator.
@notify(email_address="oncall@foo.com", on="failure")
@notify(email_address="ai@foo.com", on="success")
class MyFlow(FlowSpec):

    @notify(slack_channel="#foo", on="success")
    @step
    def my_step(self):
        ...
To implement this I'd like to introduce a new Metaflow concept, a @finally step.
class MyFlow(FlowSpec):

    @finally
    def finally_step(self, status):
        status  # we need a way to message Success or Failure
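Until something like @notify/@finally exists, here is a rough approximation using only pieces that exist today: a @catch on the work step plus a plain end step that posts to a hypothetical Slack incoming-webhook URL.

import requests
from metaflow import FlowSpec, step, catch

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # placeholder URL

class NotifyDemoFlow(FlowSpec):

    @catch(var="start_failed")
    @step
    def start(self):
        # ... real work here ...
        self.next(self.end)

    @step
    def end(self):
        # start_failed holds the caught exception if start blew up; absent otherwise.
        status = "failure" if getattr(self, "start_failed", None) else "success"
        requests.post(SLACK_WEBHOOK, json={"text": "NotifyDemoFlow finished: " + status})

if __name__ == "__main__":
    NotifyDemoFlow()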
Hi!
How can I pass @Parameters different from the defaults to step-functions create?
I know step-functions trigger can take any @Parameters defined in the pipeline python file, but that applies only to that single run.
What I want to do is pass @Parameters to the cron schedule in AWS EventBridge dynamically.
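Not a Metaflow feature, just a hedged boto3 sketch of one possible workaround: the cron schedule that step-functions create sets up is an EventBridge rule targeting the state machine, so its target Input can be rewritten. The rule name and the exact Input shape the state machine expects are assumptions here; inspect the existing rule's target in the EventBridge console first and mirror what you see.

import json
import boto3

events = boto3.client("events")
rule_name = "metaflow-MyFlow"   # assumption: whatever name step-functions create gave the rule

# Reuse the existing target (keeps Arn, Id, RoleArn) and only swap its Input.
target = events.list_targets_by_rule(Rule=rule_name)["Targets"][0]
target["Input"] = json.dumps(
    {"Parameters": json.dumps({"my_param": "new-value"})}  # assumption: copy the shape from the current target
)
events.put_targets(Rule=rule_name, Targets=[target])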
Hey guys, been using metaflow for a bit over a year now, and I've recently started to integrate our deployment with AWS Batch for the scale-out pattern. I'm now able to execute flows with some steps that run in Batch, however I don't see the ECS cluster ever scaling back down.
To elaborate, my compute environment has the following settings: min vcpus = 0, desired vcpus = 0, max vcpus = 32.
When I run a flow, a job definition gets added to the job queue, an instance gets started in the cluster, and the task runs and finishes fine, but the job definition stays "Active" and the instance seems to stay up indefinitely inside the cluster until I go and manually "deregister" the job definition.
Is this the way it's designed? or am I missing something in the way I configured my Compute environment?
Is metaflow supposed to update the job definition after a flow finishes?
hey guys, would anyone find it useful to expose the batch param for ulimits? It's a list of dicts that maps to the --ulimit option of docker run. In particular, I've noticed that the ECS/batch default ulimit for the number of open files per container is 1024/4096 (soft/hard). With this option, it could potentially be increased up to the daemon limit using:
ulimits=[{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}]
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html
https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container---ulimit
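For reference, a hedged boto3 sketch of where such a value lands in a Batch job definition (this is the raw Batch API, not Metaflow code; the image, vcpus, and memory values are placeholders):

import boto3

batch = boto3.client("batch")
batch.register_job_definition(
    jobDefinitionName="ulimit-demo",
    type="container",
    containerProperties={
        "image": "python:3.8",
        "vcpus": 2,
        "memory": 4096,
        # The proposed @batch param would map to this field:
        "ulimits": [{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}],
    },
)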
FWIW this can be set via a launch template for the batch compute environment/ECS cluster, so it's not a necessity and also is a bit ugly for a decorator which is why I ask :sweat_smile:. As an example of what this looks like in a launch template:
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0
--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"
#cloud-boothook
#!/bin/bash
cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --default-ulimit nofile=65535:1048576"' >> /etc/sysconfig/docker
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html
I'm having issues with metaflow not finding previous runs. I am trying to do this via a Jenkins pipeline. I have a training flow that I'm trying to reference in my inference flow, and I am reading in the same config file. The strange part is that the same flows/runs are available both before and after I read in the config file. So before I read in the config file I get the following (setting the namespace to None to check all available flows):
get_namespace(): None
list(Metaflow()): [Flow('training_flow_1')]
metaflow_config.DATASTORE_SYSROOT_S3: None
get_metadata(): local@/home/ec2-user/workspace/models-training_staging
After I read in the config file
get_namespace(): None
list(Metaflow()): [Flow('training_flow_1')]
metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-j8dasuvadiq/metaflow
get_metadata(): local@/home/ec2-user/workspace/models-training_staging
So the metadata is still showing local, which I think may be related to the issue, but DATASTORE_SYSROOT_S3 is updated after the config is read in, so it definitely is reading the file. Still, trying to find anything run in a production namespace (i.e. that I ran via stepfunctions) returns an empty list.
When I try to run my inference flow I get the following (again after reading in the config and setting namespace to none):
get_namespace(): None
metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-v49t2hau629c/metaflow
get_metadata() local@/home/ec2-user/workspace/models-inference_staging
list(Metaflow()) []
So it seems the issue here is that even though the config file is read in and DATASTORE_SYSROOT_S3 is set correctly, the metadata points to a local folder, which is different between training and inference. So they are isolated. I tried setting the metadata manually by using the ServiceUrl from my cloudformation stack:
metadata('https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api')
but I get the error
Metaflow service [https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api] unreachable.
Any idea what's going on here? Again, given the metadata and the fact that the same flows are listed before and after I read in the config file, it seems like the config settings are somehow being ignored when reading/writing flows, so I am unable to find my training run when running my inference flow. Thanks
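A minimal sketch of one thing to try, assuming the Jenkins box can actually reach the service endpoint (the "unreachable" error above suggests a network or auth problem that needs solving first; if the API Gateway requires a key, METAFLOW_SERVICE_AUTH_KEY must also be set): point the client at the metadata service explicitly before the first Metaflow import, instead of relying on the config file.

import os

# Must happen before the first `import metaflow` in the process.
os.environ["METAFLOW_DEFAULT_METADATA"] = "service"
os.environ["METAFLOW_SERVICE_URL"] = "https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api"
# os.environ["METAFLOW_SERVICE_AUTH_KEY"] = "..."  # only if the gateway needs an API key

from metaflow import Metaflow, namespace

namespace(None)
print(list(Metaflow()))  # should now list the flows registered in the service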