I'm having issues with Metaflow not finding previous runs when running from a Jenkins pipeline. I have a training flow that I'm trying to reference from my inference flow, and both read the same config file. The strange part is that the same flows/runs are visible before and after I read in the config file. Before reading the config file I get the following (setting the namespace to None to see all available flows):
get_namespace(): None
list(Metaflow()): [Flow('training_flow_1')]
metaflow_config.DATASTORE_SYSROOT_S3: None
get_metadata(): local@/home/ec2-user/workspace/models-training_staging
After I read in the config file
get_namespace(): None
list(Metaflow()): [Flow('training_flow_1')]
metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-j8dasuvadiq/metaflow
get_metadata(): local@/home/ec2-user/workspace/models-training_staging
So the metadata provider is still showing local, which I think may be related to the issue, but DATASTORE_SYSROOT_S3 is updated after the config is read in, so the file is definitely being picked up. However, trying to find anything run in a production namespace (i.e. runs I launched via Step Functions) returns an empty list.
When I try to run my inference flow I get the following (again after reading in the config and setting the namespace to None):
get_namespace(): None
metaflow_config.DATASTORE_SYSROOT_S3: s3://metaflow-staging-uat-metaflows3bucket-v49t2hau629c/metaflow
get_metadata(): local@/home/ec2-user/workspace/models-inference_staging
list(Metaflow()): []
So it seems the issue is that even though the config file is read in and DATASTORE_SYSROOT_S3 is set correctly, the metadata provider points to a local folder that differs between the training and inference workspaces, so the two are isolated from each other. I tried setting the metadata provider manually using the ServiceUrl output from my CloudFormation stack:
metadata('https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api')
but I get the error
Metaflow service [https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api] unreachable.
Any idea what's going on here? Again, given the metadata provider and the fact that the same flows are listed before and after I read in the config file, it seems like the config settings are somehow being ignored when reading/writing flows, so I can't find my training run from my inference flow. Thanks
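For reference, this is roughly the lookup I'm trying to get working from the inference flow (a minimal sketch; the service URL is the ServiceUrl from my stack, the flow name is the training flow above, and the rest is just the standard client calls I believe apply here):

from metaflow import Flow, Metaflow, get_metadata, metadata, namespace

# Point the client at the shared metadata service instead of the local
# .metaflow folder (URL taken from the CloudFormation ServiceUrl output).
metadata('https://tid44ehxm0.execute-api.us-west-2.amazonaws.com/api')

# Look across all namespaces, including runs launched from Step Functions.
namespace(None)

print(get_metadata())    # hoping for service@https://..., not local@...
print(list(Metaflow()))  # hoping to see Flow('training_flow_1')

# Grab the latest successful training run to feed the inference flow.
training_run = Flow('training_flow_1').latest_successful_run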
When I import like this:
from metaflow import FlowSpec, step, environment
it gives an error that environment is not callable, which makes sense, because with this import Metaflow wants to read from the environment.py script. I did a small test: if I change the name of environment on line 27 of environment_decorator.py to anything else and then import that, it works. Could you please check it, or correct me if I'm missing something regarding the import?
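For reference, this is roughly the kind of usage I mean (a minimal sketch; the flow name, variable name, and step bodies are just placeholders):

import os
from metaflow import FlowSpec, step, environment  # this import is what fails for me

class EnvTestFlow(FlowSpec):

    # Applying the imported name as a decorator is where
    # "environment is not callable" shows up.
    @environment(vars={'MY_VAR': os.getenv('MY_VAR', '')})
    @step
    def start(self):
        print(os.environ.get('MY_VAR'))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    EnvTestFlow()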
Hello! @christineyu-coveo and I have been using Metaflow recently and really enjoy it. We are also facing another issue, related to using @batch and @environment together.
Consider the following:

@batch
@environment(vars={'var_1': os.getenv('var_1')})
@step
def step_A(self):
    ...
    self.next(self.step_B)

@batch
@environment(vars={'var_2': os.getenv('var_2')})
@step
def step_B(self):
    ...
Metaflow initializes the decorators for all steps before running any step. For @environment this includes running step_init, which updates the environment variables based on the vars passed to the decorator. In the flow above, when step_A runs, the environment decorator for step_B is also initialized, and an exception occurs because var_2 is None in the batch environment for step_A (it was not included in the @environment decorator for step_A). Our current fix is to disable step_init entirely for @environment. While this works for our use case (i.e. more than one @batch step, with @environment used on either or both of them), I suspect it disables some other use cases of @environment. Do you have any alternate solutions to this problem? Perhaps the batch decorator could be modified to also allow specifying environment variables that we want to ship with the job.
Metaflow could not install or find CUDA in the GPU environment, and PyTorch could not use the GPU at all. The issue was marked as resolved in Netflix/metaflow#250, but I could not replicate that result.
Sample code (test_gpu.py) I used:
from metaflow import FlowSpec, step, batch, IncludeFile, Parameter, conda, conda_base

class TestGPUFlow(FlowSpec):

    @batch(cpu=2, gpu=1, memory=2400)
    @conda(libraries={'pytorch': '1.5.1', 'cudatoolkit': '10.1.243'})
    @step
    def start(self):
        import os
        import sys
        import torch
        from subprocess import call
        print(os.popen("nvidia-smi").read())
        print(os.popen("nvcc --version").read())
        print('__Python VERSION:', sys.version)
        print('__pyTorch VERSION:', torch.__version__)
        print('__CUDA VERSION')
        print('__CUDNN VERSION:', torch.backends.cudnn.version())
        print('__Number CUDA Devices:', torch.cuda.device_count())
        print('__Devices')
        call(["nvidia-smi", "--format=csv",
              "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
        print('Active CUDA Device: GPU', torch.cuda.current_device())
        print('Available devices ', torch.cuda.device_count())
        print('Current cuda device ', torch.cuda.current_device())
        print(f"GPU count: {torch.cuda.device_count()}")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TestGPUFlow()
cmd line I used
USERNAME=your_name CONDA_CHANNELS=default,conda-forge,pytorch METAFLOW_PROFILE=your_profile AWS_PROFILE=your_profile python test_gpu.py --datastore=s3 --environment=conda run --with batch:image=your_base_image_with_cuda_support
metaflow output
2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: N/A |
2021-03-10 18:38:13.783 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |-------------------------------+----------------------+----------------------+
2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
2021-03-10 18:38:13.784 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | | | MIG M. |
2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] |===============================+======================+======================|
2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
2021-03-10 18:38:13.785 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | N/A 43C P0 41W / 300W | 0MiB / 16160MiB | 0% Default |
2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | | | N/A |
2021-03-10 18:38:13.786 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-------------------------------+----------------------+----------------------+
2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38]
2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] +-----------------------------------------------------------------------------+
2021-03-10 18:38:13.787 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | Processes: |
2021-03-10 18:38:13.788 [82/start/876 (pid 8796)] [2bb1b538-fe24-4174-9066-94fe53629e38] | GPU GI CI PID Type Process name GPU M
Any idea what's wrong?
Hi all,
I'm having an issue with IncludeFile. When I try to pass my training data file as an input, it throws an error:
AttributeError: 'bytes' object has no attribute 'path'
I am trying to read the file in one of the steps using Pandas. I'd really appreciate any suggestions to deal with this issue.
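For reference, this is roughly the pattern I'm trying to get working (a minimal sketch; the flow and parameter names are placeholders, and it assumes the included file is a CSV whose raw contents get wrapped in an in-memory buffer rather than treated as a path):

import io
from metaflow import FlowSpec, step, IncludeFile

class TrainFlow(FlowSpec):  # hypothetical flow name

    # IncludeFile hands the step the file contents, not a path on disk.
    training_data = IncludeFile('training_data', help='Training data CSV')

    @step
    def start(self):
        import pandas as pd
        # Wrap the raw contents in a buffer; if the contents come back as
        # bytes rather than str, io.BytesIO would be the equivalent.
        df = pd.read_csv(io.StringIO(self.training_data))
        print(df.head())
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    TrainFlow()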
Thank you,
Vishal
In the Step Functions role in the Metaflow CloudFormation template:
- PolicyName: AllowCloudwatch
  PolicyDocument:
    Version: '2012-10-17'
    Statement:
      - Sid: CloudwatchLogDelivery
        Effect: Allow
        Action:
          - "logs:CreateLogDelivery"
          - "logs:GetLogDelivery"
          - "logs:UpdateLogDelivery"
          - "logs:DeleteLogDelivery"
          - "logs:ListLogDeliveries"
          - "logs:PutResourcePolicy"
          - "logs:DescribeResourcePolicies"
          - "logs:DescribeLogGroups"
        Resource: '*'
What is the logs:PutResourcePolicy action used for?
Hey, I want to dynamically change the required resources for a step and found this example Netflix/metaflow#431 for a workaround:
@resources(cpu=8, memory=os.environ['MEMORY'])
and then starting the flow with MEMORY=16000 python myflow.py run. This works fine locally but fails when running with batch. Am I missing something?
Or is there any other way to change the resources using parameters or similar without creating different sized steps?
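For reference, this is the shape of what I'm running (a minimal sketch; the flow name is a placeholder, and the int() cast plus default value are my own additions, on the assumption that Batch needs a numeric value rather than the raw environment string):

import os
from metaflow import FlowSpec, step, resources

class ResizableFlow(FlowSpec):  # hypothetical flow name

    # os.environ values are strings; casting to int and providing a default
    # keeps the decorator argument well-defined both locally and on Batch.
    @resources(cpu=8, memory=int(os.environ.get('MEMORY', '16000')))
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    ResizableFlow()

Launched the same way as before, e.g. MEMORY=16000 python resizable_flow.py run --with batch.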
Hey @savingoyal and other Metaflow developers, some colleagues and I are getting to the point where we'll have a PR ready for the metaflow-tools repo. The PR will add a deployable Terraform stack.
We've read through the CONTRIBUTING.md file and found this older issue that documents the request for a Terraform stack:
Our goal is to have this PR submitted by the end of this week and just wanted to start the dialogue with you guys. Super excited to see what happens :)
Is there a value for --namespace that sets the run to the global namespace? I tried leaving it empty (--namespace=), but when I run get_namespace() within the flow (or check current.namespace), I'm still getting the user namespace. I also tried --namespace=None, but that passes the string 'None' rather than NoneType. Per the Metaflow docs, I'm hesitant to hardcode namespace(None) into my code as a workaround.
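For clarity, the hardcoded workaround I'm trying to avoid would look roughly like this (flow name is a placeholder):

from metaflow import FlowSpec, step, namespace

# Hardcoding the global namespace inside the flow module -- the thing the
# docs suggest being careful about in production code.
namespace(None)

class MyFlow(FlowSpec):  # hypothetical flow name

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MyFlow()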
I am trying to run Metaflow inside a Docker container, but I have to run it as a non-root user. When I try to import the config with
"metaflow configure import metaflow_config/config.txt"
I get "PermissionError: [Errno 13] Permission denied: '/.metaflowconfig'"
I have tried changing the permissions with chown and chmod, currently set to
drwxrwxrwx 2 1000 1000 6 Apr 5 19:01 .metaflowconfig
But no luck. Can I run metaflow inside a docker container without being a root user?
Separately, anyone had this issue?
I can run aws s3 cp (my file) (remote bucket) and it picks up the $AWS_PROFILE variable correctly (with an alias set on the aws command to do so). However, running METAFLOW_PROFILE=personal python 05-helloaws/helloaws.py --datastore=s3 run I'm getting a token expired error, which I think is because it is using the wrong profile. Any tips on how to debug this without just switching profile names? I will need to use a named profile.
I've created Netflix/metaflow#473 to propose dropping Python 2.x support from the next Metaflow release, mostly because it will allow Python 3 type annotations and make the codebase more contributor-friendly.
It seems like a pretty conservative move given that Python 2.7 was EOL'ed more than a year ago, but I'm curious if anyone here is still using Metaflow with Python 2.7 and would be affected by this change?
I am getting internal server errors when I add 'METAFLOW_DEFAULT_METADATA': 'service' to my config file. My config file contains METAFLOW_SERVICE_URL, METAFLOW_SERVICE_INTERNAL_URL and METAFLOW_SERVICE_AUTH_KEY and I have verified they match what is in the cloudformation stack output.
When I try to run a script locally with
python inference-flow.py --environment=conda --datastore=s3 run
I get the following
Bootstrapping conda environment...(this could take a few minutes)
Metaflow service error:
Metadata request (/flows/inference-flow) failed (code 500): {"message": "Internal server error"}
If I try to run step functions create I get the following:
Running pylint...
Pylint is happy!
Deploying inference_flow to AWS Step Functions...
Internal error
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/metaflow/cli.py", line 930, in main
    start(auto_envvar_prefix='METAFLOW', obj=state)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/aws/step_functions/step_functions_cli.py", line 88, in create
    check_metadata_service_version(obj)
  File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/aws/step_functions/step_functions_cli.py", line 120, in check_metadata_service_version
    version = metadata.version()
  File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 41, in version
    return self._version(self._monitor)
  File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 288, in _version
    (path, resp.status_code, resp.text),
NameError: name 'path' is not defined
list(Metaflow()) gives the following
Traceback (most recent call last):
  File "cleanup.py", line 14, in <module>
    print('list(Metaflow())',list(Metaflow()))
  File "/usr/local/lib/python3.7/site-packages/metaflow/client/core.py", line 245, in __iter__
    all_flows = self.metadata.get_object('root', 'flow')
  File "/usr/local/lib/python3.7/site-packages/metaflow/metadata/metadata.py", line 357, in get_object
    return cls._get_object_internal(obj_type, type_order, sub_type, sub_order, filters, *args)
  File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 116, in _get_object_internal
    return MetadataProvider._apply_filter(cls._request(None, url), filters)
  File "/usr/local/lib/python3.7/site-packages/metaflow/plugins/metadata/service.py", line 247, in _request
    resp.text)
metaflow.plugins.metadata.service.ServiceException: Metadata request (/flows) failed (code 500): {"message": "Internal server error"}
script returned exit code 1
Any ideas what might be happening?
Is it possible to call self.next conditionally? e.g. something like:

if x:
    self.next(...)
else:
    self.next(...)