/migration_service/migration_server.py:17: DeprecationWarning: loop argument is deprecated
app = web.Application(loop=loop)
/migration_service/migration_server.py:28: DeprecationWarning: Application.make_handler(...) is deprecated, use AppRunner API instead
AttributeError: 'NoneType' object has no attribute 'cursor'
/bin/sh: 1: metadata_service: not found
@batch(image='custom-image'). What would be the best way to build on top of the default image to ensure requirements for Metaflow are preserved? How does Metaflow install necessary dependencies on the default image?
curl is not installed in the image. It seems to be used to look up the DynamoDB host. FWIW, not an unreasonable dependency, but I was surprised to find that curl wasn't already baked into the continuumio/miniconda3:latest image. Once I installed it in the Docker image, the DynamoDB host resolution worked as expected.
Hello all, I'm trying to take a look at the artifacts associated with one of our SFN executions but when trying to call:
Step(...).task.data
We run into the following error:
ServiceException: Metadata request (/flows/{Flow}/runs/{Run}/steps/start/tasks/{Task}/artifacts) failed (code 500): 500 Internal Server Error
Looking at the logs from our metadata service, I see the following error:
Traceback (most recent call last):
File "/opt/latest/lib/python3.7/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/opt/latest/lib/python3.7/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py", line 140, in get_artifacts_by_task
artifacts.body)
File "/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py", line 355, in _filter_artifacts_by_attempt_id
attempt_id = ArtificatsApi._get_latest_attempt_id(artifacts)
File "/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py", line 349, in _get_latest_attempt_id
if artifact['attempt_id'] > attempt_id:
TypeError: string indices must be integers
Hello, I'm trying to run Metaflow on a Docker Alpine image with Python and Node. I've installed the Metaflow required dependencies in the image along with other Node requirements for my use case, and ran my script using --with batch:image=my-custom-image. It resulted in this error:
2020-12-23 00:14:06.109 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] Setting up task environment.
2020-12-23 00:14:06.109 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] /usr/bin/python: No module named pip
2020-12-23 00:14:06.109 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] sh: 5: unknown operand
2020-12-23 00:14:08.204 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] sh: 5: unknown operand
2020-12-23 00:14:08.204 [4747/start/30948 (pid 15530)] [a3c2939b-9256-41e5-8c7e-f076261e2739] tar: can't open 'job.tar': No such file or directory
2020-12-23 00:14:08.205 [4747/start/30948 (pid 15530)] AWS Batch error:
2020-12-23 00:14:08.439 [4747/start/30948 (pid 15530)] Essential container in task exited This could be a transient error. Use @retry to retry.
2020-12-23 00:14:08.440 [4747/start/30948 (pid 15530)]
2020-12-23 00:14:08.791 [4747/start/30948 (pid 15530)] Task failed.
My question is: pip and the required Python dependencies are installed in the container, so what is causing the No module named pip error? Thanks
Has anyone had problems with multi-GPU on AWS Batch? I get errors like:
CannotStartContainerError: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr This could be a transient error. Use @retry to retry.
1 GPU works fine
CONDA_CHANNELS, but that doesn't work for me. I've tried prefixing METAFLOW_RUN_ to the desired variable, but this doesn't seem to bring the variable into the AWS Batch environment for me. I've also tried the environment decorator, but I'm getting a linting error when attempting to use that; then, when I use --no-pylint to override, none of my Flow steps work. Ultimately, I just want to os.environ.get a custom environment variable from within one of my Metaflow steps that was created in my local environment and passed to the AWS Batch environment. I feel like I'm missing something rather obvious.
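For reference, roughly what I've been attempting with the environment decorator – just a sketch, and MY_CUSTOM_VAR is a placeholder name:
import os
from metaflow import FlowSpec, batch, environment, step

class EnvVarFlow(FlowSpec):
    # Copy a value from the local environment into the Batch task's
    # environment; MY_CUSTOM_VAR is a placeholder variable name.
    @environment(vars={'MY_CUSTOM_VAR': os.environ.get('MY_CUSTOM_VAR', '')})
    @batch(cpu=1, memory=4000)
    @step
    def start(self):
        print(os.environ.get('MY_CUSTOM_VAR'))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    EnvVarFlow()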
hey guys – curious if anyone else would find value in exposing the batch job parameter for sharedMemorySize? It looks like the AWS Batch team added the parameter towards the end of last year, and it's a passthrough to docker run --shm-size, which can really speed up the performance of PyTorch parallel dataloaders (especially to saturate multiple GPUs) and some boosting libraries.
ECS defaults the instance shm to 50% of memory allocation, but Docker will only expose 64 MB of that by default to running containers.
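Purely for illustration, something like this is what I'd imagine – note that shared_memory below is hypothetical, not an existing Metaflow parameter:
from metaflow import FlowSpec, batch, step

class ShmExampleFlow(FlowSpec):
    # Hypothetical 'shared_memory' knob that would pass through to the Batch
    # job definition's linuxParameters.sharedMemorySize (docker run --shm-size),
    # so e.g. PyTorch DataLoader workers get more than Docker's 64 MB default.
    @batch(cpu=8, gpu=2, memory=32000, shared_memory=8192)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    ShmExampleFlow()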
📣 Metaflow was just included in Netflix's security bug bounty program! Find vulnerabilities in the code and get paid for it 💰 (Or just enjoy Metaflow getting more secure over time)
https://bugcrowd.com/netflix/updates/59a4e5dc-5e79-4965-9289-ae5a0d9de044
Hey Metaflow! I have a pretty specific question:
I find myself having trouble running a flow on AWS Batch that uses a container with pre-installed Python libraries. I happen to be using conda to install a few extra libraries in this step, but by doing so it seems I now have a fragmented environment.
Any advice on how one can use a Docker container as a base environment and then seamlessly add a few more packages in a specific step using conda?
The success criteria here would be to successfully import a package installed by the Docker image as well as a different package installed by the conda decorator.
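To make the success criteria concrete, roughly this kind of step is what I'm after – a sketch only, where the image name and package names are placeholders and preinstalled_pkg stands in for something baked into the Docker image:
from metaflow import FlowSpec, batch, conda, step

class MixedEnvFlow(FlowSpec):
    # Goal: import a package preinstalled in the Docker image *and* a package
    # added by the @conda decorator, in the same step.
    @batch(image='my-custom-image')           # placeholder image name
    @conda(libraries={'lightgbm': '3.1.1'})   # placeholder extra package
    @step
    def start(self):
        import lightgbm            # installed by @conda for this step
        import preinstalled_pkg    # hypothetical package from the image
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MixedEnvFlow()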
Heads up that --max-workers behaves differently between the local runtime and when deployed via SFN, specifically when having nested foreach fanouts. Locally, the runtime will enforce the parallelization at the task level so it will never go beyond that; however, the SFN concurrency limit is enforced per split, so the nested fanout will result in an effective parallelism of max-workers^2. Similarly, normal fanouts in a SFN deployment are not rate limited. Not sure it's worth explicitly stating this in the docs, but thought I'd mention it just in case.
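A minimal flow shape that illustrates what I mean (the fanout widths are arbitrary):
from metaflow import FlowSpec, step

class NestedFanoutFlow(FlowSpec):
    # Locally, --max-workers N caps total concurrent tasks at N. On SFN the
    # limit applies per split, so the inner foreach can run N tasks for each
    # of the N outer splits, i.e. roughly N^2 concurrent tasks.
    @step
    def start(self):
        self.outer = list(range(4))
        self.next(self.fan_out, foreach='outer')

    @step
    def fan_out(self):
        self.inner = list(range(4))
        self.next(self.work, foreach='inner')

    @step
    def work(self):
        self.next(self.join_inner)

    @step
    def join_inner(self, inputs):
        self.next(self.join_outer)

    @step
    def join_outer(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    NestedFanoutFlow()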
I just noticed Batch has started hitting the Docker free tier rate limit. What’s the best way to mitigate this?
CannotPullContainerError: Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
Any chance we can get a copy of the Metaflow docker image hosted on the new Public ECR repos?
One more SFN gotcha: flow parameters are passed to tasks as environment variables prefixed with METAFLOW_INIT_. A parameter with a name like "my-param", which is otherwise perfectly valid for Metaflow when using the local runtime, will result in an error when running via SFN, because many shells won't allow env vars with dashes in the name.
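A tiny sketch of the failure mode, with placeholder names:
from metaflow import FlowSpec, Parameter, step

class ParamNameFlow(FlowSpec):
    # Accepted by the local runtime (run with: --my-param 2), but an SFN
    # deployment hands it to the task as an env var carrying the
    # METAFLOW_INIT_ prefix, and most shells reject names containing dashes.
    my_param = Parameter('my-param', default=1)

    @step
    def start(self):
        print(self.my_param)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    ParamNameFlow()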
Hi, what is the way to run a nodejs process in the background in metaflow? I am running on batch using a custom docker image that has nodejs and python dependencies. The node app, once started, waits for a json post which is done by a task later in the metaflow python process.
The way to start the node app is "npm run dev"; however, when I use os.system('npm run dev'), the Metaflow process gets paused at "App listening on http://localhost:8888" (as below), since it starts the node app right away, which then waits for the JSON on port 8888. However, this will be calculated in a later Metaflow step and posted via requests.post("http://localhost:8888/savings-report", json=self.json_structure).
2021-01-26 00:29:49.835 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] Starting report generator ...
2021-01-26 00:29:49.835 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] > report-generator@1.0.0 dev /usr/src/app
2021-01-26 00:29:49.836 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] > ts-node src/server.ts
2021-01-26 00:29:49.836 [4816/start/31201 (pid 78223)] [94fd75e2-9b6a-4c13-87a9-57f6e6d4b811] App listening on http://localhost:8888
I would like to start the nodejs app using npm run dev via Metaflow, leave it running in the background, and continue to the next steps in Metaflow.
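For reference, a non-blocking variant of the same call using subprocess.Popen – just a sketch, with the working directory guessed from the log above, and note a process started this way only lives as long as that step's Batch container:
import subprocess

# Unlike os.system, Popen returns immediately instead of waiting for
# 'npm run dev' to exit, so the rest of the step keeps running.
node_proc = subprocess.Popen(
    ['npm', 'run', 'dev'],
    cwd='/usr/src/app',  # assumption: taken from the log lines above
)
# Later in the same step, once port 8888 is up:
# requests.post('http://localhost:8888/savings-report', json=self.json_structure)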
Wondering if there's a more efficient way to implement the following design pattern directly in metaflow such that it would utilize multiprocessing to load and combine multiple dataframes after a foreach fanout:
df = [input.partition_df for input in inputs]
df = pd.concat(df, ignore_index=True, sort=False, copy=False)
A hacky way that's coming to mind is to just use joblib.Parallel or Metaflow's parallel_map to access the artifacts in parallel, but it feels a bit odd. This pattern may also be related to the roadmap effort to open source your all's in-house goodies for dataframes. I use partitioned parquet files in a couple places to split out data, pass references around, and load in parallel, but there are a couple use cases where I'd prefer to stay within the Metaflow ecosystem if possible :smiley:
Curious what your all's thoughts are, and just want to make sure I'm not missing something like a clever usage of s3.get_many.
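Concretely, the hacky version would be something like this inside the join step – a sketch, assuming each branch stored its chunk as partition_df as above:
import pandas as pd
from metaflow import parallel_map

# Fetch each branch's artifact in a separate process, then concatenate.
# This parallelizes the artifact loads, but every partition still ends up
# in the joining task's memory.
dfs = parallel_map(lambda inp: inp.partition_df, inputs)
df = pd.concat(dfs, ignore_index=True, sort=False, copy=False)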
--with batch (it takes around 8 minutes); most of the time is consumed bootstrapping the conda environment for each step before running it!
I was in our AWS Batch console and I noticed two jobs that seemed stuck in RUNNING. The individual who kicked off those jobs says all his terminal sessions have been ended; he even went as far as restarting his PC and severing his internet connection.
I figure this is more of an AWS situation I'm debugging, but has anyone witnessed flows being stuck in RUNNING?
I know the jobs will die when the timeout is reached, just want to understand what may have caused this
02-statistics/stats.py.
Tutorial 4 seems to be failing when attempting to create a conda environment. The funny thing is that if I run that command directly, it seems to succeed. Not sure how to get the conda errors:
python 04-playlist-plus/playlist.py --environment=conda run
Metaflow 2.2.6 executing PlayListFlow for user:...
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
Bootstrapping conda environment...(this could take a few minutes)
Conda ran into an error while setting up environment.:
Step: start, Error: command '['/opt/miniconda/condabin/conda', 'create', '--yes', '--no-default-packages', '--name', 'metaflow_PlayListFlow_linux-64_c08336d0946efed6e92f165475dfc0d181f64361', '--quiet', b'python==3.8.5', b'click==7.1.2', b'requests==2.24.0', b'boto3==1.17.0', b'coverage==5.4', b'pandas==0.24.2']' returned error (-9): b''
Note that the following command succeeds:
/opt/miniconda/condabin/conda create --yes --no-default-packages --name metaflow_PlayListFlow_linux-64_c08336d0946efed6e92f165475dfc0d181f64361 --quiet python==3.8.5 click==7.1.2 requests==2.24.0 boto3==1.17.0 coverage==5.4 pandas==0.24.2
Note that I had to make a few minor changes to the demo to refer to Python 3.8.5 and to add dependencies on more recent versions of boto3 and coverage than what Metaflow was requesting; otherwise the generated conda create command would fail even on the command line.
...
File "/metaflow/metaflow/plugins/aws/step_functions/step_functions_decorator.py", line 54, in task_finished self._save_foreach_cardinality(os.environ['AWS_BATCH_JOB_ID'],
...
requests.exceptions.ConnectionError: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /latest/meta-data/placement/availability-zone/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7c0f12e3d0>: Failed to establish a new connection: [Errno 22] Invalid argument'))
Hi, I am working on getting Metaflow artifacts from S3. The code is deployed on AWS Lambda. I set the environment variable "METAFLOW_DATASTORE_SYSROOT_S3" to the S3 location. Our use case requires us to change the datastore environment variable in every iteration so that different flows' and runs' artifacts can be accessed, as follows:
def _queryMetaflow(self, appName, starflowResp):
    metaflow_run_id = starflowResp["details"]["frdm"]["metaflowRunNumber"]
    metaflow_name = starflowResp["details"]["frdm"]["metaflowId"]
    os.environ['METAFLOW_DATASTORE_SYSROOT_S3'] = "{}/artifacts/{}/higher".format(getMetadataLocation(), appName)
    from metaflow import Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
    metadata1 = metadata(getMetadataURL())
    namespace(None)
    mf = Metaflow()
    # call metaflow and get results and send success or error
    try:
        metaflowResp = Run(metaflow_name + '/' + metaflow_run_id).data
        print(metaflowResp)
        del Metaflow, get_metadata, metadata, namespace, Run, get_namespace, Flow
        return metaflowResp
    except Exception as e:
        print("Exception occurred in query metaflow: {}".format(e))
        raise CapAppFailure("Exception occurred in metaflow response, S3 datastore operation _get_s3_object failed likely")
When this method is called, it doesn't fail in the first iteration but fails in the second iteration. I inspected the environment variable and the location is correct in every iteration, but this error is encountered in the second iteration:
S3 datastore operation _get_s3_object failed (An error occurred (404) when calling the HeadObject operation: Not Found). Retrying 7 more times..
I am unable to fix this issue. Can you please help?
Hello Netflix employees, can someone please share about Metaflow's adoption at Netflix? In late 2018 it was used in 134 projects, how has it grown since then? What percentage of Netflix data scientists use metaflow?
We're considering Metaflow at my organization, so I'd just like to get a sense of the adoption rate we can hope for at my employer.
I'm getting ModuleNotFoundError: No module named 'pandas' when the step function is triggered. I tried running with the commands python 02-statistics/stats.py --environment=conda step-functions create --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}" as well as python 02-statistics/stats.py step-functions create --max-workers 4, and both give the same error message. When I run python 02-statistics/stats.py --environment conda run --with batch --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}", it works fine.