I'm experiencing some problems when trying to install pytorch
with CUDA enabled.
I'm running my flow on AWS Batch, powered by a p3.2xlarge
machine and using the image
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04
to get the NVIDIA driver installed.
The relevant flow code looks like:
@conda_base(python="3.8")
class FooFlow(FlowSpec):
    ...

    @batch(image="<URL above>")
    # this line below is of most interest
    @conda(libraries={"pytorch": "1.6.0", "cudatoolkit": "11.0.221"})
    @resources(memory=4*1024, cpu=2, gpu=1)
    @step
    def test_gpu(self):
        import os
        print(os.popen("nvidia-smi").read())
        print(os.popen("nvcc --version").read())
        import torch
I'm not convinced this is precisely a Metaflow issue, but the common solutions one finds when Googling involve installing PyTorch via the conda CLI, which the @conda decorator abstracts away from us.
I've been running many flows with different versions of pytorch and cudatoolkit, and I keep getting:
Torch not compiled with CUDA enabled
I'm familiar with the GitHub issue Netflix/metaflow#250.
Any advice at all?
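(For reference, a minimal in-step check that separates a CPU-only torch build from a driver problem might look like the sketch below; torch.version.cuda is None for CPU-only builds.)

import torch
# A CPU-only conda build reports False here and raises
# "Torch not compiled with CUDA enabled" on any .cuda() call.
print("CUDA available:", torch.cuda.is_available())
# None for CPU-only builds, e.g. "11.0" for a CUDA-enabled build.
print("Torch build CUDA version:", torch.version.cuda)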
We’re working on creating a @notify()
decorator that could send a notification upon success or failure, per flow or per step. It could send email or Slack messages.
It would be up to the scheduler (local, AWS Step Functions, KFP) to honor the @notify
decorator.
@notify(email_address="oncall@foo.com", on="failure")
@notify(email_address="ai@foo.com", on="success")
class MyFlow(FlowSpec):

    @notify(slack_channel="#foo", on="success")
    @step
    def my_step(self):
        ...
To implement this I’d like to introduce a new Metaflow concept, a @finally
step.
class MyFlow(FlowSpec):

    @finally
    def finally_step(self, status):
        status  # we need a way to signal Success or Failure
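(As a point of comparison, the success case can already be approximated today inside an end step with a plain Slack incoming webhook; SLACK_WEBHOOK_URL below is an assumed environment variable, and the failure case is exactly what the proposed @finally step would cover.)

import json
import os
import urllib.request
from metaflow import FlowSpec, step

class NotifyOnSuccessFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        # Only reached on success; post a message to an assumed Slack webhook.
        payload = json.dumps({"text": "NotifyOnSuccessFlow finished successfully"}).encode()
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

if __name__ == "__main__":
    NotifyOnSuccessFlow()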
Hi!
How can I pass @Parameters values other than the defaults to step-functions create?
I know step-functions trigger can take any @Parameters defined in the pipeline Python file, but that only applies to that single run.
What I want to do is pass @Parameters to the cron schedule in AWS EventBridge dynamically.
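(For context, the dynamic part would presumably mean editing the EventBridge rule target's Input. The sketch below uses boto3's put_targets call, which is a real API, but the names are placeholders and the exact JSON shape Metaflow's state machine expects in Input is an assumption that should be checked against what step-functions trigger actually sends.)

import json
import boto3

events = boto3.client("events")
events.put_targets(
    Rule="my-flow-schedule-rule",  # placeholder: the rule that step-functions create set up
    Targets=[
        {
            "Id": "my-flow-target",  # placeholder target id
            "Arn": "arn:aws:states:us-west-2:123456789012:stateMachine:MyFlow",  # placeholder ARN
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-sfn-role",    # placeholder role
            # Assumed shape: verify against the execution input Metaflow sends.
            "Input": json.dumps({"Parameters": json.dumps({"alpha": "0.5"})}),
        }
    ],
)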
Hey guys, I've been using Metaflow for a bit over a year now, and I've recently started to integrate our deployment with AWS Batch for the scale-out pattern. I'm now able to execute flows with some steps that run in Batch, however I don't see the ECS cluster ever scaling back down.
To elaborate, my compute environment has the following settings: min vcpus = 0, desired vcpus = 0, max vcpus = 32.
When I run a flow, a job definition gets added to the job queue, an instance gets started in the cluster, and the task runs and finishes fine, but the job definition stays "Active" and the instance seems to stay up indefinitely inside the cluster until I go and manually deregister the job definition.
Is this the way it's designed, or am I missing something in the way I configured my compute environment?
Is Metaflow supposed to update the job definition after a flow finishes?
hey guys, would anyone find it useful to expose the batch param for ulimits?
It's a list of dicts that maps to the --ulimit option of docker run. In particular, I've noticed that the ECS/Batch default ulimit for the number of open files per container is 1024 (soft) / 4096 (hard). With this option, it could potentially be increased up to the daemon limit using:
ulimits=[{"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}]
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch.html
https://docs.docker.com/engine/reference/commandline/run/#set-ulimits-in-container---ulimit
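(For context, at the boto3 level this maps onto the job definition's containerProperties, which already accepts a ulimits list; the sketch below uses placeholder names and values.)

import boto3

batch = boto3.client("batch")
batch.register_job_definition(
    jobDefinitionName="metaflow-job-with-ulimits",  # placeholder name
    type="container",
    containerProperties={
        "image": "python:3.8",  # placeholder image
        "vcpus": 2,
        "memory": 4096,
        "command": ["python", "--version"],
        "ulimits": [
            {"name": "nofile", "softLimit": 65535, "hardLimit": 1048576}
        ],
    },
)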
FWIW this can be set via a launch template for the Batch compute environment/ECS cluster, so it's not a necessity, and it's also a bit ugly for a decorator, which is why I ask :sweat_smile:. As an example, here is what this looks like in a launch template:
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0
--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"
#cloud-boothook
#!/bin/bash
cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --default-ulimit nofile=65535:1048576"' >> /etc/sysconfig/docker
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/bootstrap_container_instance.html
I'm having issues with Metaflow not finding previous runs. I am trying to do this via a Jenkins pipeline. I have a training flow that I'm trying to reference in my inference flow, and I am reading in the same config file. The strange part is that the same flows/runs are available before I read in the config file as after I've read it in. Before I read in the config file I get the following (setting the namespace to None to check all available flows):
get_namespace(): None
list(Metaflow()): [Flow('training_flow_1')]
metaflow_config.DATASTORE_SYSROOT_S3: None
get_metadata(): local@/home/ec2-user/workspace/models-training_staging
After I read in the config file
get_namespace(): None
list(Metaflow()): [Flow('training_flow_1')]
metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-j8dasuvadiq/metaflow
get_metadata(): local@/home/ec2-user/workspace/models-training_staging
So the metadata is still showing local, which I think may be related to the issue, but DATASTORE_SYSROOT_S3 is updated after the config is read in, so it is definitely reading the file. However, trying to find something run in a production namespace (i.e. that I ran via Step Functions) returns an empty list.
When I try to run my inference flow I get the following (again after reading in the config and setting namespace to none):
get_namespace(): None
metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-v49t2hau629c/metaflow
get_metadata() local@/home/ec2-user/workspace/models-inference_staging
list(Metaflow()) []
So it seems the issue here is that even though the config file is read in and DATASTORE_SYSROOT_S3 is set correctly, the metadata points to a local folder, which is different between training and inference, so they are isolated. I tried setting the metadata manually using the ServiceUrl from my CloudFormation stack:
metadata('https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api')
but I get the error
Metaflow service [https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api] unreachable.
Any idea what's going on here? Again, given the metadata and the fact that the same flows are listed before and after I read in the config file, it seems like it is somehow ignoring the config settings when reading/writing flows, so I am unable to find my training run when running my inference flow. Thanks!
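(One thing worth checking, as a sketch: make sure the metadata provider is set to the service before metaflow is imported. The variable names below are the standard Metaflow config names, but double-check them against your config JSON, and the service URL is a placeholder.)

import os

# Assumed standard Metaflow config variable names; verify against
# ~/.metaflowconfig/config_<profile>.json.
os.environ["METAFLOW_DEFAULT_METADATA"] = "service"
# placeholder: your service URL from the CloudFormation stack output
os.environ["METAFLOW_SERVICE_URL"] = "https://<api-id>.execute-api.us-west-2.amazonaws.com/api"
os.environ["METAFLOW_DEFAULT_DATASTORE"] = "s3"

# Import only after the environment is set, so the client does not fall
# back to the local metadata provider.
from metaflow import Metaflow, namespace

namespace(None)
print(list(Metaflow()))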
When I do
from metaflow import FlowSpec, step, environment
it gives an error that environment is not callable, and it makes sense, because when imported like this, Metaflow wants to read from the environment.py script. I did a small test: if I change the name of environment on line 27 of environment_decorator.py to anything else and then import that, it works. Could you please check it, or correct me if I'm missing something regarding the import?
Hello! @christineyu-coveo and I have been using Metaflow recently and really enjoy it. We also face another issue, related to using @batch and @environment.
Consider the following:

@batch
@environment(vars={'var_1': os.getenv('var_1')})
@step
def step_A(self):
    ...
    self.next(self.step_B)

@batch
@environment(vars={'var_2': os.getenv('var_2')})
@step
def step_B(self):
    ...
Metaflow initializes the decorators for all steps before running any step. For @environment this includes running step_init, which updates the environment variables based on the vars passed to the decorator. In the flow above, when we run step_A, the environment decorator for step_B is also initialized, and an exception occurs because var_2 is None in the Batch environment for step_A, since it was not included in the @environment decorator for step_A. Our current fix is to disable step_init entirely for @environment. While this works for our use case (i.e. more than one @batch step, with @environment used in either or both of the @batch steps), I suspect this might break some of the other use cases of @environment. Do you have any alternative solutions to this problem? Perhaps the batch decorator could be modified to also allow including environment variables that we want to ship with the job.
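(One possible mitigation, sketched below rather than anything official: give every var a default so that step_init never sees None for a step that is not currently running.)

import os
from metaflow import FlowSpec, step, batch, environment

class EnvFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.step_A)

    @batch
    @environment(vars={"var_1": os.getenv("var_1", "")})  # default avoids None in other steps' containers
    @step
    def step_A(self):
        self.next(self.step_B)

    @batch
    @environment(vars={"var_2": os.getenv("var_2", "")})
    @step
    def step_B(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    EnvFlow()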
Metaflow could not install or find CUDA in the GPU environment, and PyTorch could not use the GPU at all. The issue was marked as resolved in Netflix/metaflow#250, but I could not replicate the resolution.
Sample code (test_gpu.py) I used:
from metaflow import FlowSpec, step, batch, IncludeFile, Parameter, conda, conda_base

class TestGPUFlow(FlowSpec):

    @batch(cpu=2, gpu=1, memory=2400)
    @conda(libraries={'pytorch': '1.5.1', 'cudatoolkit': '10.1.243'})
    @step
    def start(self):
        import os
        import sys
        import torch
        from subprocess import call
        print(os.popen("nvidia-smi").read())
        print(os.popen("nvcc --version").read())
        print('__Python VERSION:', sys.version)
        print('__pyTorch VERSION:', torch.__version__)
        print('__CUDA VERSION:', torch.version.cuda)
        print('__CUDNN VERSION:', torch.backends.cudnn.version())
        print('__Number CUDA Devices:', torch.cuda.device_count())
        print('__Devices')
        call(["nvidia-smi", "--format=csv",
              "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
        print('Active CUDA Device: GPU', torch.cuda.current_device())
        print('Available devices ', torch.cuda.device_count())
        print('Current cuda device ', torch.cuda.current_device())
        print(f"GPU count: {torch.cuda.device_count()}")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TestGPUFlow()
Command line I used:
USERNAME=your_name CONDA_CHANNELS=default,conda-forge,pytorch METAFLOW_PROFILE=your_profile AWS_PROFILE=your_profile python test_gpu.py --datastore=s3 --environment=conda run --with batch:image=your_base_image_with_cuda_support
Metaflow output:
2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: N/A |
2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |-------------------------------+----------------------+----------------------+
2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | | | MIG M. |
2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |===============================+======================+======================|
2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | N/A 43C P0 41W / 300W | 0MiB / 16160MiB | 0% Default |
2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | | | N/A |
2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-------------------------------+----------------------+----------------------+
2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38]
2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-----------------------------------------------------------------------------+
2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Processes: |
2021-03-10 18:38:13.788 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU GI CI PID Type Process name GPU M
Any idea what's wrong?
Hi all,
I'm having an issue with IncludeFile. When I try to pass my training data file as an input, it throws an error:
AttributeError: 'bytes' object has no attribute 'path'
I am trying to read the file in one of the steps using pandas. I'd really appreciate any suggestions for dealing with this issue.
Thank you,
Vishal
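(For reference: IncludeFile hands the step the file's contents rather than a path, so pandas needs a buffer around them. The sketch below assumes CSV data; the parameter and file names are placeholders.)

import io
import pandas as pd
from metaflow import FlowSpec, IncludeFile, step

class TrainFlow(FlowSpec):

    # self.train_data holds the file *contents*, not a filesystem path
    train_data = IncludeFile("train_data", default="train.csv")

    @step
    def start(self):
        # Wrap the included text in a buffer before handing it to pandas
        # (use io.BytesIO instead if the file was included as binary).
        self.df = pd.read_csv(io.StringIO(self.train_data))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()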
In the Step Functions role in the Metaflow CloudFormation template:
- PolicyName: AllowCloudwatch
PolicyDocument:
Version: '2012-10-17'
Statement:
- Sid: CloudwatchLogDelivery
Effect: Allow
Action:
- "logs:CreateLogDelivery"
- "logs:GetLogDelivery"
- "logs:UpdateLogDelivery"
- "logs:DeleteLogDelivery"
- "logs:ListLogDeliveries"
- "logs:PutResourcePolicy"
- "logs:DescribeResourcePolicies"
- "logs:DescribeLogGroups"
Resource: '*'
What is the action logs:PutResourcePolicy
used for?
Hey, I want to dynamically change the required resources for a step and found this example, Netflix/metaflow#431, as a workaround:
@resources(cpu=8, memory=os.environ['MEMORY'])
and then starting the flow with MEMORY=16000 python myflow.py run. This works fine locally but fails when running with Batch. Am I missing something?
Or is there any other way to change the resources using parameters or something similar, without creating differently sized steps?
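(A sketch of one way to guard against the variable being unset, which is plausibly what happens when the flow file is re-evaluated inside the Batch container; the default value below is arbitrary.)

import os
from metaflow import FlowSpec, step, resources

# Fall back to a default when MEMORY is not set in the environment that
# re-evaluates the flow file (e.g. inside the Batch container).
MEMORY = int(os.environ.get("MEMORY", "16000"))

class MyFlow(FlowSpec):

    @resources(cpu=8, memory=MEMORY)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()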
Hey @savingoyal and other Metaflow developers, some colleagues and I are getting to the point where we'll have a PR ready for the metaflow-tools repo. This PR will add a deployable Terraform stack.
We've read through the CONTRIBUTING.md file and found this older issue asking for a Terraform stack:
Our goal is to have this PR submitted by the end of this week, and we just wanted to start the dialogue with you guys. Super excited to see what happens :)
Is there a way to pass --namespace so that the run uses the global namespace? I tried passing an empty value (--namespace=), but when I call get_namespace() within the flow (or check current.namespace), I'm still getting the user namespace. I also tried setting it to --namespace=None, but that uses the string 'None' rather than NoneType. Per the Metaflow docs, I'm hesitant to hardcode namespace(None) into my code as a workaround.
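(One middle ground, as a sketch: pick the namespace from an environment variable of your own so nothing is hardcoded; MF_CLIENT_NAMESPACE below is a made-up name, and an empty or missing value falls through to the global namespace.)

import os
from metaflow import namespace

# MF_CLIENT_NAMESPACE is a made-up variable name for this sketch.
ns = os.environ.get("MF_CLIENT_NAMESPACE")
namespace(ns if ns else None)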