    Marco Occhialini
    from sklearn.metrics import log_loss, accuracy_score
    import numpy as np
    from metaflow import FlowSpec, step, Parameter
    from utils.model_utils import *

    class TreeModel(FlowSpec):
        """This flow evaluates a number of machine learning algorithms."""

        evals = Parameter('evals',
                          help='Hyperopt Evaluations')

        @step
        def start(self):
            print("Loading models and data...")
            self.X_train, self.X_test, self.y_train, self.y_test = load_train_test()
            self.models = load_pipe_parameters()
            self.next(self.train, foreach='models')

        @step
        def train(self):
            self.tup = self.input
            name, model, space, path = self.tup
            self.model = model
            # best = optimize(objective=self.to_be_optimized,
            #                 space=space)
            self.model.fit(self.X_train, self.y_train)
            print('finished process for %s' % name)
            self.next(self.extract_metrics)

        @step
        def extract_metrics(self):
            name, model, *_ = self.tup
            pred = model.predict(self.X_test)
            acc = accuracy_score(self.y_test, pred)
            log = log_loss(self.y_test, pred)
            print('{} log loss: {}'.format(name, log))
            print('{} accuracy: {}'.format(name, acc))
            self.next(self.join)

        @step
        def join(self):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        TreeModel()
    This is a flow that evaluates the performance of some machine learning algorithms. The module utils.model_utils contains two functions: load_train_test() returns a tuple containing X_train, X_test, y_train, y_test.
    Marco Occhialini
    The load_pipe_parameters() function returns a list of tuples, each containing an sklearn model. A tuple consists of name, model, space (for hyperparameter optimization; you can pretend this variable doesn't exist for now, it's unused) and a path for serializing the model to a .pkl file.
    Below is an example of the return of load_pipe_parameters:
    rf = RandomForestClassifier(n_estimators=2000)  # other arguments elided
    load_pipe_return_example = [('sklearn_rf', rf, space_rf, '../models/sklearn/rf/sklearn_rf_v1.pkl')]
    # ... and it goes on: (name, model, hyperopt space, path)
    My intention is that, for each tuple in the load_pipe_parameters() return value, the flow goes through train and extract_metrics, and then join to unify everything.
    Marco Occhialini
    But I always face this issue:
    3 replies
    Metaflow 2.0.5 executing TreeModel for user:occhima
    Validating your flow...
        Validity checker found an issue:
        Step end reached before a split started at step(s) join were joined. Add a join step before end.
    hey @Occhima the issue could be that the join step isn't resolving/merging the prior step artifacts from the foreach branches. Check out the foreach docs and also merge_artifacts. Also as a heads up, it looks like you'll want to update predict calls to self.model in extract_metrics :)
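    For reference, the fan-in being suggested might look like the sketch below. It is only a sketch: the try/except import fallback exists solely so the snippet stands alone without Metaflow installed, the class name is made up, and the other steps are elided. `merge_artifacts` and the `inputs` argument on a join step are real Metaflow APIs.

```python
try:
    from metaflow import FlowSpec, step
except ImportError:  # stand-ins so this sketch can be read without Metaflow installed
    FlowSpec = object

    def step(f):
        return f


class TreeModelSketch(FlowSpec):
    # ... start / train / extract_metrics as in the flow above, with
    # extract_metrics ending in self.next(self.join) ...

    @step
    def join(self, inputs):
        # A foreach split must be closed by a join step that accepts `inputs`.
        # merge_artifacts pulls branch-invariant artifacts back onto self;
        # per-branch artifacts (tup, model) are collected explicitly instead.
        self.merge_artifacts(inputs, exclude=['tup', 'model'])
        self.results = {inp.tup[0]: inp.model for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass
```

    The key differences from the posted flow: join takes `inputs`, and the branch artifacts are merged or collected there before transitioning to end.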
    Hey guys just a heads up for something that was a bit of a pain to track down because I kept looking in all the wrong places like a dummy :smile: Netflix/metaflow#292
    1 reply
    Marco Occhialini
    Thank you for the advice; any problems you can spot and tell me about, I'll be more than grateful.
    Greg Hilston

    Is there a preferred way to launch Metaflow tasks on AWS Batch with predictable IP addresses? Let's say one wants to whitelist an IP address, or range, on an application or database.

    I obviously can attempt to handle this the pure AWS way, perhaps having each Batch job run in a VPC, but was curious if there was a Metaflow answer for this situation

    13 replies
    Hi all, I've been giving metaflow a try and have liked a lot of what I see. One problem I've been having is with the logger, where tasks that fail often provide no log information at all, no logging output up to the failure, and crucially no traceback (and no exit code, if the process was killed or segfaulted). I've seen mention of an updated logging system on the way -- is this imminent?
    The only workaround I've found is to resume the job, capture the full command line invocation, kill the flow and run the failing step directly so the logs aren't buffered. Is there a better intermediate solution? Ideally the logs would be somewhere so I wouldn't have to re-run the failed step at all.
    (I'm still trying things out, only the data store is on aws, everything else is local.)
    5 replies
    Hi all, I want to get a better understanding of the build/packaging mechanics of the metaflow library. I was surprised that there are only a few dependencies listed. Is that the one and only place where you define these dependencies? Why not in a requirements.txt, or something like Pipenv? What I also don't understand is the execution of Python tests when building the package via tox. Can't that be done within …? Is this the Python state of the art? Again: I am not very familiar with the Python mechanics, but it somehow seems to be "spread" all over the place.
    3 replies
    Naman Joshi
    Hi, I was playing around with Metaflow and the use cases tutorial and got my system screwed up in "Episode 04"; now I am not able to run the test code from the tutorial guide. When I try to run "Episode 04" I get the error below. Can somebody please help?
    [developer@ui-lin metaflow-tutorials]$ conda info --envs
    # conda environments:
    base                  *  /home/developer/miniconda3
    [developer@ui-lin metaflow-tutorials]$ python3 04-playlist-plus/ --environment=conda show
    Metaflow 2.2.1 executing PlayListFlow for user:developer
        Internal error
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/metaflow/", line 883, in main
        start(auto_envvar_prefix='METAFLOW', obj=state)
      File "/usr/local/lib/python3.6/site-packages/click/", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/click/", line 782, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.6/site-packages/click/", line 1256, in invoke
        Command.invoke(self, ctx)
      File "/usr/local/lib/python3.6/site-packages/click/", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.6/site-packages/click/", line 610, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/click/", line 21, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/metaflow/", line 777, in start
        ctx.obj.monitor = Monitor(monitor, ctx.obj.environment,
      File "/usr/local/lib/python3.6/site-packages/metaflow/", line 20, in __init__
        self.env_info = env.get_environment_info()
      File "/usr/local/lib/python3.6/site-packages/metaflow/plugins/conda/", line 107, in get_environment_info
        from metaflow.metaflow_config import DEFAULT_ENVIRONMENT
    ImportError: cannot import name 'DEFAULT_ENVIRONMENT'
    19 replies
    Greg Hilston
    Any reason why my EC2 instances are not being terminated after the flow that caused them to be spun up has ended?
    1 reply
    (EJ) Vivek Pandey

    Experiencing this issue at a certain step:

    2020-08-17 21:43:10.709 [321/load_and_prep_provider_claimlines/2261 (pid 13257)] Batch error:
    2020-08-17 21:43:10.709 [321/load_and_prep_provider_claimlines/2261 (pid 13257)] Task crashed due to OutOfMemoryError: Container killed due to memory usage. This could be a transient error. Use @retry to retry.
    2020-08-17 21:43:10.812 [321/load_and_prep_provider_claimlines/2261 (pid 13257)]
    2020-08-17 21:43:11.706 [321/load_and_prep_provider_claimlines/2261 (pid 13257)] Task failed.

    I checked the compute resource provisioning for the Batch Compute Environment, and the instance type is set to optimal, with max vcpus of 64, desired vcpus of 4 and min vcpus of 0.

    2 replies
    Owen Ball
    Hi all. Is it possible to access the @resources values from within a step? I would like to use it to specify the max_memory for h2o based on how much memory has been allocated to the step.
    4 replies
    (EJ) Vivek Pandey
    So I am getting S3 HeadObject forbidden errors after the message "workflow starting". Since it is happening before the job is submitted, it probably isn't the ECS IAM role limitation. The EC2 instance where I am kicking off the flow has S3 GetObject and PutObject access.
    workflow starting (run-id 221)
    S3 datastore operation _head_s3_object failed (An error occurred (403) when calling the HeadObject operation: Forbidden)
    workflow failed
    2 replies
    Wooyoung Moon
    Is there an easy way to use a custom launch template for Metaflow batch jobs? More specifically, I'd like to be able to mount an EFS/FSx filesystem to all my batch jobs.
    5 replies
    Submitted a PR for a problematic example in the doc (Netflix/metaflow-docs#4) would be grateful if someone could reproduce the bug I found in Netflix/metaflow-docs#3 and check the change's correctness.
    2 replies
    Hello, I have a parameter in my FlowSpec that I would also like to read in main(). Is there a way to do this (eg with the click package)? For example:
    if __name__ == '__main__':
        my_file = get_file_param()
        my_env_var = read(my_file)
        os.environ['MY_ENV_VAR'] = my_env_var
        MyFlowSpec() # has Parameter('file-param')
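    One stdlib way to do that pre-read, as a sketch: `get_file_param` above is the asker's hypothetical helper, implemented here as a naive scan of sys.argv before Metaflow's own CLI parsing runs (click could do this too, but it would need to tolerate Metaflow's other flags, e.g. with ignore_unknown_options).

```python
import sys


def get_file_param(argv=None):
    # Naively scan the command line for '--file-param VALUE' (or
    # '--file-param=VALUE') before handing control to Metaflow's CLI,
    # which parses the same flag again for the Parameter.
    argv = sys.argv[1:] if argv is None else argv
    for i, arg in enumerate(argv):
        if arg == '--file-param' and i + 1 < len(argv):
            return argv[i + 1]
        if arg.startswith('--file-param='):
            return arg.split('=', 1)[1]
    return None
```

    This is deliberately dumb: it doesn't validate the flag, it just peeks at it early so main() can act on it before instantiating the flow.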
    8 replies
    I am using conda create --name metaflow python=3.7 to run a simple example flow but it stops at "Running pylint...". The same example works fine with python=3.8. Is this a known issue? I am running metaflow 2.2.2.
    1 reply
    Question on conda dependency. I am creating a new conda environment: conda create --name metaflowenv metaflow awswrangler python=3.8. I have a simple step with @conda(libraries={'awswrangler': '1.8.1'}) and run the flow: python --environment=conda run, I am getting conda "Error: UnsatisfiableError" with 7 conflicts. Why is the bootstrapping conda environment giving all these conflicts? How do I go about resolving them? Keep up the good work, this product is awesome.
    3 replies
    Malay Shah
    Hello guys, I have been using metaflow for 4 or 5 months now. We had set up a Postgres metadata service on an AWS instance and used S3 to store the files. Now somehow the instance got restarted and I had to set up the service again; in doing that, I saw that all my previous flows are not in the metadata DB. I cannot access any flows using the Metaflow CLI, but the files for the flows are still in S3. How can I get the metadata DB back so that I can access the files or instance objects from the Metaflow CLI? Thanks for the help.
    17 replies
    Hi guys, when using @conda can we install our own packages from a local repository?
    4 replies
    Bahattin Çiniç

    Hi Guys,

    I'm trying to add integration/unit tests for Metaflow flows. (I read these docs.)

    When I tried to use a flow with use_cli=False, I got a could not get source code error. Example code and error are below:

    import unittest

    from flows import OptimizationFlow

    class TestFlow(unittest.TestCase):
        def test_flow(self):
            flow = OptimizationFlow(use_cli=False)


    Traceback (most recent call last):
      File "/Users/bahattincinic/Projects/xx/xx/xx/tests/cases/", line 16, in test_flow
        flow = OptimizationFlow(use_cli=False)
      File "/usr/local/anaconda3/envs/xx/lib/python3.7/site-packages/metaflow/", line 70, in __init__
        self._graph = FlowGraph(self.__class__)
      File "/usr/local/anaconda3/envs/xx/lib/python3.7/site-packages/metaflow/", line 132, in __init__
        self.nodes = self._create_nodes(flow)
      File "/usr/local/anaconda3/envs/xx/lib/python3.7/site-packages/metaflow/", line 139, in _create_nodes
        tree = ast.parse(inspect.getsource(module)).body
      File "/usr/local/anaconda3/envs/xx/lib/python3.7/", line 973, in getsource
        lines, lnum = getsourcelines(object)
      File "/usr/local/anaconda3/envs/xx/lib/python3.7/", line 955, in getsourcelines
        lines, lnum = findsource(object)
      File "/usr/local/anaconda3/envs/xx/lib/python3.7/", line 786, in findsource
        raise OSError('could not get source code')
    OSError: could not get source code

    I had a chance to read some of the internal code after the error. I realized that a flow cannot run outside of its own file, because Metaflow tries to detect the flow graph via the AST.

    I saw that Metaflow uses (…) and dynamic code generation (…).

    Is this best practice?

    Should I use something like "python run ...params..." in a subprocess?
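    If you do go the subprocess route, here is a small sketch of that idea. The helper name `run_flow` is made up, and the flow file path and parameters are whatever your project uses; it just launches the flow the same way the CLI does.

```python
import subprocess
import sys


def run_flow(flow_file, *params):
    # Launch the flow exactly as the CLI would: `python flow_file run ...`.
    # Returns the CompletedProcess so a test can assert on returncode/stdout.
    return subprocess.run(
        [sys.executable, flow_file, 'run', *params],
        capture_output=True, text=True,
    )
```

    For example, a test could call run_flow('optimization_flow.py', '--evals', '10') and assert result.returncode == 0.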


    7 replies
    Question on publishing scheduled step functions. When creating a step function with "step-functions create", how can I provide my defined parameters? I understand you can manually provide the parameters in AWS console like {"Parameters" : "{\"alpha\": 0.5}"} but not clear how to do it for the CloudWatch scheduled rule.
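    For the scheduled rule, the rule's static Input appears to need the same doubly-encoded shape as the console example above: an outer JSON document whose "Parameters" value is itself a JSON-encoded string of the flow parameters. A small sketch for building that string (the helper name is made up, and the shape is inferred from the console example, not from Metaflow docs):

```python
import json


def schedule_input(params):
    # Outer document: {"Parameters": "<json-encoded flow parameters>"}
    # i.e. the flow parameters are JSON-encoded twice.
    return json.dumps({"Parameters": json.dumps(params)})
```

    The resulting string could then be set as the Input of the CloudWatch Events rule target (e.g. via aws events put-targets).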
    14 replies
    Question on the default AWS region. When a batch job runs from the step function, how does it know which AWS region to download the code package from? How can I provide my preferred region? Below is the error I am getting. I can fix this by updating the deployed step function and providing the AWS_DEFAULT_REGION environment variable manually but that's not good.
    Setting up task environment.
    Downloading code package.
    fatal error: An error occurred (400) when calling the HeadObject operation: Bad Request
    12 replies
    Is there any way to access the tags from within a step? I see the user and namespace but can't find additional tags.
    4 replies
    Greg Hilston

    Hey guys, I have two questions regarding Metaflow's best practices:

    1. Do you guys recommend using Metaflow for large batch transform jobs? I'm wondering if there's some threshold of data size where the recommended approach may be to use AWS SageMaker for batch transform instead of Metaflow.
    2. Is there a recommended approach for logging from steps that are run remotely on AWS Batch?

    Thanks :)

    10 replies
    Christopher Wong
    I’m running into an issue where after a Batch task starts, I’m getting the error
    Batch job error:
    TypeError(“‘NoneType’ object is not iterable”)
    Task failed.
    23 replies
    Christopher Wong
    3 replies
    Sonu Patidar
    @savingoyal can you please tell me something about this error: File "/home/ec2-user/.local/lib/python3.6/site-packages/metaflow/datastore/", line 429, in <lambda>
    2020-09-11 05:22:34.014 [8/start/16 (pid 2995)] transformable_obj.transform(lambda x: pickle.dumps(x, protocol=2))
    2020-09-11 05:22:34.014 [8/start/16 (pid 2995)] TypeError: can't pickle _thread.RLock objects
    9 replies
    Sonu Patidar
    I am using a Docker container on AWS Batch and it is not able to find the job.tar file. My Dockerfile ends with WORKDIR /srv/. Can you tell me where Metaflow puts the job.tar file?
    29 replies
    Hello team,
    I am running a simple flow with the conda decorator @conda(libraries={"beautifulsoup4": "4.9.1", "s3fs": "0.5.1", "pandas": "1.1.2"}).
    I am running into the error Error: UnsatisfiableError: The following specifications were found to be incompatible with each other:
    I checked a previous post in this channel with the same error; there was later a PR to solve this issue and the fix was included in the latest release.
    Even so, I am still getting this error. Could someone help me? Thanks.
    15 replies
    What's the recommended approach to use sensitive information such as passwords? For example, if I want to connect to a database to get some data out but I don't want the password to be in the code. I understand the custom environment variable is not supported.
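    One common pattern, sketched here with the stdlib only (the helper name and the `NAME_FILE` convention are made up for illustration): resolve the secret at runtime from the execution environment rather than from code, e.g. an environment variable injected by the Batch job definition, or a file mounted by the platform.

```python
import os


def get_secret(name, default=None):
    # 1) An environment variable injected by the runtime, never committed to code.
    if name in os.environ:
        return os.environ[name]
    # 2) A file-based convention: NAME_FILE points at a mounted secret file.
    path = os.environ.get(name + '_FILE')
    if path and os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    return default
```

    On AWS the variable or file would typically be populated from Secrets Manager or SSM Parameter Store by the job definition, so the value never appears in the flow code itself.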
    5 replies
    Philippe Ombredanne
    Howdy... as a FYI we just made a first release of this new tool that makes extensive use of metaflow for code analysis pipelines for origin and licenses such as these
    I want to thank the metaflow team for making this possible :bow:
    @savingoyal and @romain-intel in particular ... :bow:
    Note that it's a tad off Broadway from your standard use case, but it still works nicely for us.
    6 replies
    Valay Dave


    I was recently trying to store data in the flow's MetaflowData object, which gets stored at the end of a flow. I was using setattr(self, some_id, dict) to store a large number of objects as part of the MetaflowData object at the end of the flow. When loading the values back via getattr, the first 1000 elements finish within a second, but after that it took 4 minutes to load the next 3000. I am simply iterating through the values via getattr. I am assuming that it's fetching the data via the metadata service and unpickling the objects. But what could be the reason for such latency gaps?
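    One mitigation worth trying, if the per-item latency comes from each attribute being a separate datastore object: bundle the items into a single artifact, so the read side does one load instead of thousands. A sketch of the two patterns (Holder is just a stand-in for `self` inside a step):

```python
class Holder:
    """Stand-in for `self` inside a Metaflow step."""


def store_individually(obj, items):
    # One artifact per id -> one datastore fetch per getattr on the read side.
    for some_id, value in items:
        setattr(obj, some_id, value)


def store_packed(obj, items):
    # A single dict artifact -> one datastore fetch for everything.
    obj.results = dict(items)
```

    Whether this helps depends on where the time actually goes, but it turns N metadata/datastore round trips into one.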

    4 replies
    Hey guys, I am a newbie using metaflow; I just finished the first three tutorials. I have the following confusions. (1) The documentation says to use IncludeFile to read a local .csv file. I noticed that IncludeFile generates a compressed file in the .metaflow folder, but using pd.read_csv on this compressed file is slower than using pd.read_csv on the original csv file directly, especially when the .csv file is large. So why does metaflow use this IncludeFile function to pre-read the file? (2) If I want to read a ~7GB csv file, directly using pd.read_csv is fine. However, if I use pd.read_csv inside metaflow, after a couple of minutes it gives me an out-of-memory error. What is the reason for this error, how can I avoid it, and is there any way to read these kinds of files faster? Thanks!

    Plus, I used the resources decorator (memory=16000, cpu=8) to ask for more resources, but it still didn't work.

    7 replies
    Philippe Ombredanne
    My opinion is that your large CSV file may end up being pickled.
    Try to post snippets of your pipeline and the error messages/stack trace in a Gist
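    For the 7GB case specifically, streaming the file in chunks avoids holding it all in memory. A stdlib sketch of the idea (pandas' read_csv(chunksize=...) does the same thing with DataFrames; the helper name here is made up):

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=100_000):
    # Yield (header, rows) blocks of at most chunk_size rows, so peak
    # memory is one chunk rather than the whole file.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            rows = list(islice(reader, chunk_size))
            if not rows:
                return
            yield header, rows
```

    Each chunk can then be processed and discarded (or aggregated) before the next one is read.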
    Greg Hilston

    Hey everyone, what's Metaflow's recommended approach for installing a dependency in a remote Conda environment when the package does not exist in Conda or the specific version does not exist in Conda, but does exist in Pip?

    From the documentation I've found that I have two options:

    1. Perform an os.system('pip install my_package') in my step's code, which looks like it should work but does not seem like a great solution.
    2. Download the code into my source code directory and import the package from the file system. Also seems like it would work, but not a great solution.

    Are there any options I'm not considering? Perhaps a cleaner approach?
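    A sketch of option 1 done slightly more safely (the helper names are made up): pin the version and invoke pip through the current interpreter rather than whatever `pip` is on PATH, so the package lands in the environment the step is actually running in.

```python
import subprocess
import sys


def pip_install_cmd(package, version=None):
    # Build 'python -m pip install pkg[==ver]' for the interpreter that is
    # running the step (i.e. the @conda-managed environment on Batch).
    spec = f"{package}=={version}" if version else package
    return [sys.executable, '-m', 'pip', 'install', spec]


def pip_install(package, version=None):
    subprocess.check_call(pip_install_cmd(package, version))
```

    Called at the top of the step, before the import, this is still the os.system idea, just less fragile; it does not make the install reproducible the way @conda's dependency resolution does.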

    Philippe Ombredanne
    @GregHilston see Netflix/metaflow#24
    2 replies
    Gustav Svensk

    Hi, I'm not able to view the logs from the metadata_service when running it in a Docker container. A minimal example can be achieved by creating a virtualenv, entering it and running pip install; I also did

    docker pull netflixoss/metaflow_metadata_service
    ds up -d

    ds logs
    gives me
    metadata_service | 2020/09/21 15:33:31 OK 1_create_tables.sql
    metadata_service | 2020/09/21 15:33:31 OK 20200603104139_add_str_id_cols.sql
    metadata_service | 2020/09/21 15:33:31 goose: no migrations to run. current version: 20200603104139
    my_postgres | The files belonging to this database system will be owned by user "postgres".
    my_postgres | This user must also own the server process. ...
    But not serving on ('', 8080), which I would expect.

    On MacOs catalina 10.15.6

    2 replies
    Hey guys, I noticed metaflow did a great job on communicating between the local machine and AWS. If I want to support another cloud system, like Azure Blob Storage, how can I implement the corresponding decorators (like batch)? I know the s3 module is used in the batch decorator; is there a way to make a change to that? Thanks!
    2 replies
    Greg Hilston

    Hey guys, is there a recommended way to pass secrets to a remote Batch job? I found this being asked in the Gitter back in 2019-12-07 here:

    Basically the same question but with a very specific focus on the values of the environment being treated as sensitive. Therefore the environment_decorator or any other solution that stores the secret in code does not work.

    I read that Savin said the AWS folks are able to seamlessly switch between running locally and remote so there must be a solution to this. Just wanted to ask you guys before trying to come up with any custom way.


    2 replies