    Matt Corley
    @corleyma
    Hi there. I see in the codebase that there seems to be support for monitoring and event-logging plugins, both of which have existing debug implementations, but I can't find any mention of them in the documentation. What would be the best entry point in Metaflow today to handle common cross-cutting concerns for flows, like monitoring and performance profiling? My goal is to create a reusable abstraction that can be enabled to profile, on a per-step basis, things like execution timings (not just of the step overall, but e.g. time spent serializing/deserializing state), peak memory utilization, etc.
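    (A plugin-level answer probably needs those undocumented monitoring/event-logger hooks, but as a hedged in-step workaround, a plain context manager can capture per-section timings and peak memory. Everything below is generic Python, not Metaflow API:)

    import time
    import resource
    from contextlib import contextmanager

    @contextmanager
    def profiled(label):
        """Print wall-clock time and peak RSS for the wrapped block."""
        start = time.time()
        try:
            yield
        finally:
            # ru_maxrss is reported in KB on Linux and in bytes on macOS
            peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print("[%s] %.2fs elapsed, peak RSS %s" % (label, time.time() - start, peak))

    # inside a step:
    #     with profiled("deserialize"):
    #         df = self.big_dataframe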
    23 replies
    acsecond
    @acsecond
    Hi guys, I hope this is the correct place to post. I am new to Metaflow; I configured it with S3 and Batch, but now I want to start using Step Functions. In the documentation I read python parameter_flow.py --with retry step-functions create, but I get "no such command: step-functions". Can someone maybe refer me to good documentation?
    1 reply
    Revaapriyan
    @Revaapriyan

    Hey people. I would like to know how to restrict the amount of parallelization on my local instance at any point in time, meaning the number of CPU cores used by the program. Say I have a task that has to run in parallel as 50 threads, each requiring 2 cores, and my machine has 32 cores: Metaflow runs ~15-16 threads at a time, utilizing all the processing cores in the machine. I would like to restrict this to, say, 12 threads at any given point in time.

    In python's multiprocessing library, there is an option of setting the number of pool workers as a required number. Is there a way to achieve the same with Metaflow?
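    (A possible answer, assuming the run command's --max-workers option is what you're after: it caps how many tasks Metaflow runs concurrently on the local scheduler, e.g. at most 12 foreach branches at a time. CPU usage inside each individual task still has to be limited by the task itself.)

    python my_flow.py run --max-workers 12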

    5 replies
    Carter Kwon
    @CarterKwon

    Hello, I see that Metaflow snapshots the code used in a run

    From the docs: "Code package is an immutable snapshot of the relevant code in the working directory, stored in the datastore, at the time when the run was started. A convenient side-effect of the snapshot is that it also works as a code distribution mechanism for runs that happen in the cloud."

    How would I access the code from previous runs?

    Thanks!
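    (One approach via the client API, sketched under the assumption that the run was executed with a datastore so a code package exists; the attribute names below come from the MetaflowCode object:)

    from metaflow import Flow

    run = Flow('MyFlow').latest_run
    code = run.code                        # None if no code package was stored for this run
    print(code.flowspec)                   # source of the FlowSpec file
    code.tarball.extractall('snapshot/')   # full code package as a tarfile.TarFile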

    8 replies
    Richard Decal
    @crypdick

    Hey all, I assessed Metaflow as an alternative to our Kedro + Airflow infra. Thought I'd share my assessment. One blocker for adopting Metaflow is the inability to separate parameters from pipeline definitions.

    For context, we currently use Kedro to generate many "flavors" of the same pipeline for different scenarios. For instance, we use the same template inference pipeline for model validation, active learning, detecting label noise, etc. We do this by defining our parameters separately from our DAGs. It would be nice if Metaflow had integrations with (say) Facebook's Hydra so that we could easily compose config files and separate parameter definitions from DAG definitions.
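    (Not a Hydra integration, but one hedged way to keep scenario values out of the DAG today is a JSON-typed Parameter fed from a file at run time; the flow and file names below are made up:)

    from metaflow import FlowSpec, Parameter, JSONType, step

    class TemplateInferenceFlow(FlowSpec):
        # all scenario-specific values live in a JSON document passed on the command line
        config = Parameter('config', type=JSONType, default='{}')

        @step
        def start(self):
            print('scenario:', self.config.get('scenario', 'default'))
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        TemplateInferenceFlow()

    Run with, e.g.: python template_inference_flow.py run --config "$(cat model_validation.json)"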


    7 replies
    Bahattin Çiniç
    @bahattincinic

    Hey all, I have a question about logging. In our project we are using Python's standard logging (https://docs.python.org/3/howto/logging.html). When we emit warning, debug, etc. logs with it, Metaflow overrides these and reports them as info.

    Here is a code example;

    import logging.config
    
    from metaflow import FlowSpec, step
    
    LOGGING_CONFIG = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'standard': {
                'format': '[%(levelname)s] %(name)s: %(message)s'
            },
        },
        'handlers': {
            'default': {
                'level': 'INFO',
                'formatter': 'standard',
                'class': 'logging.StreamHandler',
                'stream': 'ext://sys.stdout',
            },
        },
        'loggers': {
            '': {  # root logger
                'handlers': ['default'],
                'level': 'INFO',
                'propagate': False
            },
        }
    }
    
    class DebugFlow(FlowSpec):
    
        @step
        def start(self):
            self.next(self.a, self.b)
    
        @step
        def a(self):
            logger.debug("Hello Debug log")
            self.x = 1
            self.next(self.join)
    
        @step
        def b(self):
            self.x = int('2')
            logger.warning("Hello warning log")
            self.next(self.join)
    
        @step
        def join(self, inputs):
            logger.info('a is %s', inputs.a.x)
            logger.info('b is %s', inputs.b.x)
            logger.info('total is %d', sum(input.x for input in inputs))
            logger.error("Hello error log")
            self.next(self.end)
    
        @step
        def end(self):
            pass
    
    if __name__ == '__main__':
        logging.config.dictConfig(LOGGING_CONFIG)
        logger = logging.getLogger('DebugFlow')
        DebugFlow()

    When I took a look at how Metaflow handles logging, I realized that Metaflow uses a different logging system. I also tested the logging configuration with --event-logger; it looks like it doesn't work.

    import logging.config
    
    from metaflow.plugins import LOGGING_SIDECAR, SIDECAR
    
    from metaflow import FlowSpec, step
    
    
    LOGGING_CONFIG = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'standard': {
                'format': '[%(levelname)s] %(name)s: %(message)s'
            },
        },
        'handlers': {
            'default': {
                'level': 'INFO',
                'formatter': 'standard',
                'class': 'logging.StreamHandler',
                'stream': 'ext://sys.stdout',
            },
        },
        'loggers': {
            '': {  # root logger
                'handlers': ['default'],
                'level': 'INFO',
                'propagate': False
            },
        }
    }
    
    
    class DebugFlow(FlowSpec):
    
        @step
        def start(self):
            self.next(self.a, self.b)
    
        @step
        def a(self):
            logger.debug("Hello Debug log")
            self.x = 1
            self.next(self.join)
    
        @step
        def b(self):
            self.x = int('2')
            logger.warning("Hello warning log")
            self.next(self.join)
    
        @step
        def join(self, inputs):
            logger.info('a is %s', inputs.a.x)
            logger.info('b is %s', inputs.b.x)
            logger.info('total is %d', sum(input.x for input in inputs))
            logger.error("Hello error log")
            self.next(self.end)
    
        @step
        def end(self):
            pass
    
    
    class CustomEventLogger(object):
        TYPE = 'customEventLogger'
    
        def __init__(self):
            self.logger = logging.getLogger('DebugFlow')
    
        def log(self, msg):
            self.logger.info('event_logger: %s', str(msg))
    
        def process_message(self, msg):
            # type: (Message) -> None
            self.log(msg.payload)
    
        def shutdown(self):
            pass
    
    
    def setup_logger():
        logger_config = {
            'customEventLogger': CustomEventLogger
        }
    
        LOGGING_SIDECAR.update(logger_config)
        SIDECAR.update(logger_config)
        logging.config.dictConfig(LOGGING_CONFIG)
    
    
    if __name__ == '__main__':
        setup_logger()
        logger = logging.getLogger('DebugFlow')
        DebugFlow()
    python debug_flow.py --event-logger=customEventLogger run

    How can I configure the Metaflow logger? If it is not possible, how can I send debug and warning logs with the Metaflow logger? Thanks.

    10 replies
    Apoorv Sharma
    @sharma_apoorv_twitter

    Hello everyone! I am exploring options for my next project implementation. Based on the initial documentation, Metaflow seems to hit all the points my team is looking for in a framework. The only question I have is:

    Our team uses Azure, not AWS. Are there going to be issues in deploying and scaling Metaflow-based solutions on Azure?

    6 replies
    Calum Macdonald
    @calmacx_gitlab

    hi all,
    I'd like to know the best way of passing a variable that is defined in a step which then gets split, and using it after joining.

    I could do something like self.merge_artifacts(inputs, include=[<vars>])? I'm sure inputs[0].<var> also works. These are fine, but I'm not sure how efficient they are, or how they will cope with many more splits.

    Fuller simple example to see what I mean:

    from metaflow import FlowSpec, step
    
    class Foo(FlowSpec):
        @step
        def start(self):
            self.msg = 'hi %s'
            self.steps = list(range(0,10))
            self.next(self.bar, foreach='steps')
        @step
        def bar(self):
            print (self.input)
            print (self.msg%(' from bar'))
            self.next(self.join)
        @step
        def join(self,inputs):
        #to be able to use self.msg in the next step, use merge_artifacts
            self.merge_artifacts(inputs,include=['msg'])
            self.next(self.end)
        @step
        def end(self):
            print (self.msg%(' from end'))
            print ('end')
    
    
    if __name__ == "__main__":
        Foo()

    I want to make sure I'm doing this in the best way

    Cheers. Loving Metaflow btw, top work on all the docs!

    2 replies
    jonathan-atom
    @jonathan-atom

    Hello Metaflow community! After setting up Airflow for a proof of concept and evaluating the other obvious/recent options, I am trying to decide between Prefect (self-hosted) and Metaflow for next steps.

    There seems to be a gap when it comes to monitoring Metaflow jobs (no ui/dashboard). How do you handle this? Am I missing something or do you fall back on AWS monitoring features?

    1 reply
    Richard Decal
    @crypdick
    ^ looks like my message got chopped. If I factor out a step as a separately imported module, do I just have to make sure to return all the artifacts I want to persist and do something like self.x, self.y, self.z, ... = imported_node()?
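    (One sketch of avoiding the long tuple assignment: have the imported node return a dict and setattr each entry onto self, since attributes set on self are persisted as artifacts; imported_node below is a stand-in for your helper:)

    from metaflow import FlowSpec, step

    def imported_node():
        # stand-in for a helper that normally lives in its own module
        return {'x': 1, 'y': 2, 'z': 3}

    class FactoredFlow(FlowSpec):
        @step
        def start(self):
            for name, value in imported_node().items():
                setattr(self, name, value)   # persisted just like self.x = ... would be
            self.next(self.end)

        @step
        def end(self):
            print(self.x, self.y, self.z)

    if __name__ == '__main__':
        FactoredFlow()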
    2 replies
    joe153
    @joe153
    I am starting to see this Docker error: "You have reached your pull rate limit." I believe this is due to the recent (November 2, 2020) change: https://www.docker.com/increase-rate-limits. What is the recommended approach to resolve this? Do you have step-by-step instructions for how we can set up a private account?
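    (One hedged option is to host the default image in a private registry such as ECR and point Metaflow at it, assuming the METAFLOW_BATCH_CONTAINER_REGISTRY / METAFLOW_BATCH_CONTAINER_IMAGE config keys; the account ID and image name below are placeholders:)

    export METAFLOW_BATCH_CONTAINER_REGISTRY=123456789012.dkr.ecr.us-east-1.amazonaws.com
    export METAFLOW_BATCH_CONTAINER_IMAGE=my-metaflow-base:latest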
    6 replies
    Malay Shah
    @malay95
    Hello everyone,
    I wanted your advice on setting up a DevOps infrastructure for our team at our company. We want to run the tests on AWS Batch and get the artifacts back as files to the caller (either the command line or a script). I know that Metaflow shows the stdout from the Batch instance on the command line, and we want to do something similar. Can you shed some light on this? What are your thoughts?
    5 replies
    acsecond
    @acsecond

    I have the following folder structure:
    -metaflow project/

    - flow_a.py
    - flow_b.py
    - helpers.py

    Flow A and flow B are separate, independent flows, but there are some functions that occur in both A and B.
    To avoid duplicate code I put helper functions in helpers.py, which I import in both flow A and B.
    My problem is that when I deploy to AWS Step Functions with python flow_a.py step-functions create,
    the flow is uploaded but helpers.py is not, so when I try to import from helpers.py in my step functions, the code fails.

    What is the correct approach to address this problem?
    Thx

    3 replies
    Apoorv Sharma
    @sharma_apoorv_twitter
    Can I start a new flow as part of a task? Or is that considered bad design?
    2 replies
    Wooyoung Moon
    @wmoon5
    Hi, I was wondering if I could see a basic example of parameterizing the @batch decorator of a step from a JSON file that gets read in as a Parameter. Savin mentioned something about being able to do this by defining a separate function, but now that I'm actually trying to do it, it's not obvious how.
    3 replies
    Matt Corley
    @corleyma
    Is there currently a mechanism to change the pickle protocol level for metaflow? Would be great to be able to use protocol 5 where supported.
    15 replies
    Antoine Tremblay
    @hexa00
    Question: is the best way to run a long-running task to use Step Functions? Or is there a way to run things in "daemon" mode? Say a user works on their laptop and starts something that takes 3 days, then the laptop is closed, etc. What's the proper workflow there?
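    (For the laptop-closed scenario, the documented route is to push the flow to AWS Step Functions so it runs entirely in the cloud; a minimal sketch:)

    python long_flow.py step-functions create    # deploy the flow as a state machine
    python long_flow.py step-functions trigger   # start a run that keeps going after the laptop is closed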
    2 replies
    Note: we just installed Metaflow on AWS using Terraform. Works great so far! :) Faster than I expected to spin up jobs, etc.
    2 replies
    acsecond
    @acsecond
    Hi guys, short question. In the example of sending Parameters to Step Functions there is the following: {"Parameters": "{\"key1\": \"value1\", \"key2\": \"value2\"}"}, always with those escaped quotes. But how do I send a dictionary as a parameter? I tried several methods and nothing works, e.g. {"Parameters": "{\"key1\": \"value1\", \"key2\": \"value2\, \"key3\": \"json.dumps(my dict\"}"} is not working. What is the correct way?
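    (A hedged sketch of the double encoding involved: the "Parameters" value is itself a JSON-encoded string, so a dict-valued parameter has to be serialized with json.dumps before the whole payload is serialized again:)

    import json

    my_dict = {"a": 1, "b": [2, 3]}
    payload = {"Parameters": json.dumps({
        "key1": "value1",
        "key2": "value2",
        "key3": json.dumps(my_dict),   # the dict arrives in the flow as a JSON string parameter
    })}
    print(json.dumps(payload))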
    1 reply
    karimmohraz
    @karimmohraz
    Hi, we are working on a plugin for creating Argo workflows out of Metaflow scripts.
    When using "foreach", the method decompress_list(input_paths) in cli.py is called. Unfortunately the Argo output parameters do not match the expected "input mode".
    I was wondering if there is a hook similar to "task_pre_step" where our ArgoInternalStepDecorator could convert the input_paths into the desired format. Or maybe you have another hint.
    (We want to avoid persisting the children's output the way step_functions does.)
    1 reply
    Greg Hilston
    @GregHilston

    Is there any reason why one would be unable to store a function in a flow's step and call said function in a subsequent step?

    from metaflow import FlowSpec, step, conda_base
    
    
    @conda_base(python="3.8.3")
    class FunctionStateFlow(FlowSpec):
        """Explores how one can pass functions through Metaflow's state from one
        step to another.
        """
    
        def simple_function(self):
            """Defines a simple function that we can use to pass throughout
            Metaflow
            """
            return 42
    
        @step
        def start(self):
            """Initial step in DAG."""
            self.fun = self.simple_function
    
            print(f"is the variable 'fun' available in 'self'? {hasattr(self, 'fun')}")  # prints true
    
            self.next(self.should_print_forty_two)
    
        @step
        def should_print_forty_two(self):
            """Prints forty two as it leverages the pickled function from the start step"""
            print(f"is the variable 'fun' available in 'self'? {hasattr(self, 'fun')}")  # prints false
    
            print(self.fun())  # AttributeError: Flow FunctionStateFlow has no attribute 'fun'
    
            self.next(self.end)
    
        @step
        def end(self):
            """Does nothing, exists as a formality"""
            pass
    
    
    if __name__ == "__main__":
        FunctionStateFlow()

    I know Metaflow does not support storing generators, but I cannot see why storing this function would not work.

    4 replies
    Alireza Keshavarzi
    @isohrab

    Using SAP HANA in one step

    Hi, I need to connect to SAP HANA in one of my steps. I followed the official documentation and tested it in a sample flow, and it works well with the @batch decorator.

    The problem arises when I use the @conda (or @conda_base) decorator. The error is ModuleNotFoundError: No module named 'hana_ml'.

    I think I need something like os.system(path/to/conda install hana-ml)
    I posted my code inside this thread.
    I appreciate your help.
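    (One hedged workaround while @conda can't resolve hana_ml from its conda channels: pip-install it inside the task before importing it, roughly like this:)

    import subprocess
    import sys

    from metaflow import FlowSpec, conda_base, step

    @conda_base(python='3.8.3')
    class HanaFlow(FlowSpec):
        @step
        def start(self):
            # hana-ml is on PyPI but not on the conda channels, so install it at runtime
            subprocess.run([sys.executable, '-m', 'pip', 'install', 'hana-ml'], check=True)
            import hana_ml  # noqa: F401
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        HanaFlow()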

    5 replies
    joe153
    @joe153
    Hello, any idea why I am getting this error below when conda environment is enabled? METAFLOW_PROFILE=prod python helloworld.py --environment=conda run
    I am specifying export METAFLOW_HOME=. and the config_prod.json file exists. It works just fine without conda env.
    Bootstrapping conda environment...(this could take a few minutes)
        S3 access failed:
        Uploading S3 files failed.
        First key: s3://folder_name/conda/conda.anaconda.org/anaconda/linux-64/bzip2-1.0.8-h7b6447c_0.tar.bz2/f52e60deb7f4c82821be9a868e889348/bzip2-1.0.8-h7b6447c_0.tar.bz2
        Error: Traceback (most recent call last):
          File "/project_name/lib/python3.8/site-packages/metaflow/datatools/s3op.py", line 32, in <module>
            from metaflow.util import url_quote, url_unquote
          File "/project_name/lib/python3.8/site-packages/metaflow/__init__.py", line 45, in <module>
            from .event_logger import EventLogger
          File "/project_name/lib/python3.8/site-packages/metaflow/event_logger.py", line 1, in <module>
            from .sidecar import SidecarSubProcess
          File "/project_name/lib/python3.8/site-packages/metaflow/sidecar.py", line 14, in <module>
            from .debug import debug
          File "/project_name/lib/python3.8/site-packages/metaflow/debug.py", line 45, in <module>
            debug = Debug()
          File "/project_name/lib/python3.8/site-packages/metaflow/debug.py", line 22, in __init__
            import metaflow.metaflow_config as config
          File "/project_name/lib/python3.8/site-packages/metaflow/metaflow_config.py", line 34, in <module>
            METAFLOW_CONFIG = init_config()
          File "/project_name/lib/python3.8/site-packages/metaflow/metaflow_config.py", line 28, in init_config
            raise MetaflowException('Unable to locate METAFLOW_PROFILE \'%s\' in \'%s\')' %
        metaflow.exception.MetaflowException: Unable to locate METAFLOW_PROFILE 'prod' in '.')
    3 replies
    Alexander Myltsev
    @alexander-myltsev

    hello,

    Do you know of any community implementations of Metaflow without AWS, using only backends (Kubernetes?) that can be deployed on Azure or one's own cluster?

    Or maybe a comprehensive description of what would need to be done to the Metaflow source code to make it possible?

    3 replies
    Greg Hilston
    @GregHilston

    Hey guys, I'm using the newest version of Metaflow (2.2.5) on macOS Catalina 10.15.5 and running my jobs on remote AWS Batch infrastructure.

    I witnessed an error I have never experienced before and would appreciate any thoughts on what may have happened:

    I kicked off a step that leveraged Metaflow's foreach command to process ~100 .json files. The long running foreach successfully finished processing many files and on one of the final files it threw the following error:

    2020-11-17 13:31:13.397 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c] Downloading code package.
    2020-11-17 13:31:13.398 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c] Code package downloaded.
    2020-11-17 13:31:14.696 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c] Bootstrapping environment.
    2020-11-17 13:31:14.696 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c] Traceback (most recent call last):
    2020-11-17 13:31:14.696 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]   File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    2020-11-17 13:31:14.696 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]     "__main__", mod_spec)
    2020-11-17 13:31:14.696 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]   File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    2020-11-17 13:31:14.696 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]     exec(code, run_globals)
    2020-11-17 13:31:14.697 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]   File "/metaflow/metaflow/plugins/conda/batch_bootstrap.py", line 52, in <module>
    2020-11-17 13:31:14.697 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]     bootstrap_environment(sys.argv[1], sys.argv[2])
    2020-11-17 13:31:15.892 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73] Bootstrapping environment.
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73] Traceback (most recent call last):
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]   File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]     "__main__", mod_spec)
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]   File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]     exec(code, run_globals)
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]   File "/metaflow/metaflow/plugins/conda/batch_bootstrap.py", line 52, in <module>
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]     bootstrap_environment(sys.argv[1], sys.argv[2])
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]   File "/metaflow/metaflow/plugins/conda/batch_bootstrap.py", line 17, in bootstrap_environment
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]     packages = download_conda_packages(flow_name, env_id)
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]   File "/metaflow/metaflow/plugins/conda/batch_bootstrap.py", line 33, in download_conda_packages
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73]     env = json.load(f)[env_id]
    2020-11-17 13:31:15.893 [1020/step_name/3453 (pid 74544)] [290048aa-1288-4e15-8560-6563a1ab9f73] KeyError: 'metaflow_FlowName_linux-64_73cd09227bd95230e7515d91e8720e033d626911'
    2020-11-17 13:31:18.462 [1020/step_name/3454 (pid 74625)] [551def93-7276-42b7-a4a3-249f9cb465c4] Task is starting (status RUNNABLE)...
    2020-11-17 13:31:23.218 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4959-a188-e58ed0deb62c]   File "/metaflow/metaflow/plugins/conda/batch_bootstrap.py", line 17, in bootstrap_environment
    2020-11-17 13:31:23.218 [1020/step_name/3452 (pid 74533)] [c77d2c49-84f3-4

    I'm specifically reaching out for advice here because this error bubbles up from Metaflow's plugins/conda/batch_bootstrap.py script. The only thing my google-fu has been able to uncover is that in older versions of Metaflow, one would not want to use a dash in parameter names.

    Appreciate any thoughts or advice!

    4 replies
    karimmohraz
    @karimmohraz
    Good day, I have nested 2 foreaches: outer -> inner -> innerjoin -> outerjoin.
    Inputs in innerjoin can access input variables from both the inner and outer foreach. But when looping over the inputs in outerjoin I only get back this type: <class '__main__.Nestedflow'>. I expected to be able to loop over the outerjoin inputs?
    How are variables handled in nested foreach joins?
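    (A hedged sketch of the pattern with made-up names: each element of the outer join's inputs is the innerjoin task of one outer branch, so the variables you can read there are exactly the ones the inner join stored on self:)

    from metaflow import FlowSpec, step

    class NestedForeachFlow(FlowSpec):
        @step
        def start(self):
            self.outer_items = ['a', 'b']
            self.next(self.outer, foreach='outer_items')

        @step
        def outer(self):
            self.outer_item = self.input
            self.inner_items = [1, 2, 3]
            self.next(self.inner, foreach='inner_items')

        @step
        def inner(self):
            self.result = (self.outer_item, self.input)
            self.next(self.innerjoin)

        @step
        def innerjoin(self, inputs):
            # collect the inner results explicitly; outer_item is identical
            # across these inputs, so it can be merged safely
            self.results = [inp.result for inp in inputs]
            self.merge_artifacts(inputs, include=['outer_item'])
            self.next(self.outerjoin)

        @step
        def outerjoin(self, inputs):
            # each element of `inputs` is the innerjoin task of one outer branch;
            # read whatever innerjoin stored on self
            self.all_results = {inp.outer_item: inp.results for inp in inputs}
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == '__main__':
        NestedForeachFlow()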
    6 replies
    David Patschke
    @dpatschke

    It looks like AWS Batch is trying to outsmart me by spinning up a larger instance and then spawning multiple 'foreach' steps within that larger instance. This is exactly what I DO NOT want to happen, since I want to be able to launch a Dask multiprocessing compute function with the scheduler=processes parameter within each of my foreach steps. By default, Dask will look to use all compute cores, and if two Dask calls are made independently on the same system without knowledge of one another, then double the compute and memory will be requested and resource problems will occur.

    I suppose I could limit the num_workers parameter within my Dask call to make things work, but then I'm not really solving the problem I want to solve.

    Does anyone have insight into how to tell Metaflow/AWS Batch to explicitly launch a single instance for every foreach step, rather than combining multiple foreach steps under the hood into a single compute instance?

    5 replies
    Antoine Tremblay
    @hexa00
    Hello, I use Poetry for most of my projects and would like to run something like os.system('poetry install') at the start of my steps... however, it seems that only .py files are added to the image's working dir. Is there a way to include other files? I see IncludeFile, but it seems this will become an object in S3 and thus won't be accessible via the shell? I would need to add files like poetry.lock and pyproject.toml.
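    (One hedged option: the top-level --package-suffixes option tells the code packager to include extra file extensions in the code package, e.g.:)

    python flow.py --package-suffixes=.toml,.lock run --with batch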
    4 replies
    Antoine Tremblay
    @hexa00
    Ideas on how I would use my own decorator for a step? Using a basic decorator after @step doesn't seem to work...?
    2 replies
    Wooyoung Moon
    @wmoon5
    Hi, what is currently the best way to get past the GetLogEvents throttling issue when running a lot of batch jobs in parallel? Tempted to modify my Metaflow code (maybe in batch_client.py) to swallow that particular exception, but was wondering if there's a better way?
    2 replies
    Malay Shah
    @malay95
    Hello, I wanted to use Metaflow's parallel_map functionality in a step. I could not find detailed documentation of the function, just its usage with a lambda. I want to be able to run the same function with different parameters in parallel and get the results in a list (like pool.map_async from the multiprocessing library).
    Thanks in advance
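    (A hedged example, assuming parallel_map is importable from the top-level metaflow package as in recent versions; functools.partial covers the "same function, different fixed parameters" case:)

    from functools import partial
    from metaflow import parallel_map

    def score(row, threshold, scale):
        return scale * (row - threshold)

    rows = [1, 2, 3, 4]
    # fans the calls out over local processes and returns the results as a list,
    # similar to multiprocessing.Pool.map
    results = parallel_map(partial(score, threshold=2, scale=10), rows)
    print(results)  # [-10, 0, 10, 20]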
    4 replies
    Malay Shah
    @malay95
    Hello, I am working on a flow and I am getting this error: "Step end_train joins steps from unrelated splits. Ensure that there is a matching join for every split." I checked the code and the self.next calls look correct. Is there a way to check the DAG that is created from the code and see exactly where the issue is?
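    (The show command prints the DAG Metaflow builds from the code, which usually makes the unmatched split visible:)

    python my_flow.py show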
    4 replies
    jaiprasad Reddy
    @jaiprasadreddy
    Hi team, I started working with Metaflow recently. I am trying to use S3 and AWS Batch only. I have set up S3, AWS Batch, and the ECS IAM role access accordingly. I can see that it is able to create a cluster in ECS and upload the flow to S3, but it's not able to submit a job to AWS Batch. When I run "python helloaws.py run" in the terminal, I get a segmentation fault. How do I debug this?
    2 replies
    Wooyoung Moon
    @wmoon5
    Anyone ever run into an error message like this?:
    Metaflow service error:
    Metadata request (/flows/PoIExperimentFlow/run) failed (code 500): "{\"err_msg\": \"__init__() got an unexpected keyword argument 'run_id'\"}"
    16 replies
    Malay Shah
    @malay95
    Hello everyone, happy Thanksgiving. I am working on a Metaflow flow which uses a lot of RAM and CPU. I want to monitor the usage on Batch as well as locally. Can I use something that reports RAM and CPU usage at the end of each step, or a decorator that does it? I want to estimate a good amount of RAM to request in such cases.
    2 replies
    Denis Maciel
    @denismaciel
    Hi there, is it possible to pass the default Docker image to run on AWS Batch from the command line? Something like python flow.py --with batch --dockerimage <image-url>, or is it only possible using the @batch decorator?
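    (Decorator attributes can be passed on the command line with --with decorator:attribute=value, so something like the following should work:)

    python flow.py run --with batch:image=<image-url>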
    2 replies
    bishax
    @bishax

    Hi, according to Netflix/metaflow#193, ThrottlingExceptions don't cause task failure; however, I am reliably getting a ThrottlingException followed by a task failure a second later, e.g.

    2020-11-30 15:16:58.243 [659/link_finder/4638 (pid 1526921)]     AWS Batch job error:
    2020-11-30 15:16:58.243 [659/link_finder/4638 (pid 1526921)]     ClientError('An error occurred (ThrottlingException) when calling the GetLogEvents operation (reached max retries: 4): Rate exceeded')
    2020-11-30 15:16:58.539 [659/link_finder/4638 (pid 1526921)] 
    2020-11-30 15:16:59.030 [659/link_finder/4638 (pid 1526921)] Task failed.
    2020-11-30 15:16:59.092 [659/link_finder/4638 (pid 1535437)] Task is starting (retry).

    Perhaps a flurry of logging error events is being masked by the throttling exception, hiding the source of the failure?

    6 replies
    Ji Xu
    @xujiboy

    Hi, I am trying to install Metaflow for R following the docs, but ran into the following error when testing with metaflow::test():

    Metaflow 2.2.0 executing HelloWorldFlow for user:ji.xu
    Validating your flow...
        The graph looks good!
    2020-11-30 10:50:33.216 Workflow starting (run-id 1606762233207871):
    2020-11-30 10:50:33.222 [1606762233207871/start/1 (pid 49822)] Task is starting.
    2020-11-30 10:50:34.783 [1606762233207871/start/1 (pid 49822)] Fatal Python error: initsite: Failed to import the site module
    2020-11-30 10:50:34.785 [1606762233207871/start/1 (pid 49822)] Traceback (most recent call last):
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 550, in <module>
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]     main()
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 531, in main
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]     known_paths = addusersitepackages(known_paths)
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 282, in addusersitepackages
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]     user_site = getusersitepackages()
    2020-11-30 10:50:34.786 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 258, in getusersitepackages
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     user_base = getuserbase() # this will also set USER_BASE
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/site.py", line 248, in getuserbase
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     USER_BASE = get_config_var('userbase')
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/sysconfig.py", line 609, in get_config_var
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     return get_config_vars().get(name)
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/sysconfig.py", line 588, in get_config_vars
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     import _osx_support
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/_osx_support.py", line 4, in <module>
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]     import re
    2020-11-30 10:50:34.787 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/re.py", line 123, in <module>
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)]     import sre_compile
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)]   File "/Users/ji.xu/anaconda3/envs/r-metaflow/lib/python3.6/sre_compile.py", line 17, in <module>
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)]     assert _sre.MAGIC == MAGIC, "SRE module mismatch"
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)] AssertionError: SRE module mismatch
    2020-11-30 10:50:34.788 [1606762233207871/start/1 (pid 49822)] Error: Error 1 occurred creating conda environment r-reticulate
    2020-11-30 10:50:34.817 [1606762233207871/start/1 (pid 49822)] Execution halted
    2020-11-30 10:50:34.820 [1606762233207871/start/1 (pid 49822)] Task failed.
    2020-11-30 10:50:34.820 Workflow failed.
    2020-11-30 10:50:34.820 Terminating 0 active tasks...
    2020-11-30 10:50:34.820 Flushing logs...
        Step failure:
        Step start (task-id 1) failed.

    It seems the problem is on the Python side. Has anyone seen the same issue and found a solution?

    45 replies
    David Patschke
    @dpatschke

    I've been having loads of problems getting my AWS Batch job working through Metaflow. Since I'm not overly experienced with AWS CloudOps, it's hard for me to tell whether the issue is an AWS issue or a Metaflow limitation.

    Here is one thing I experienced which may help others:

    1. As per @russellbrooks' suggestion in this thread, I created a single job queue which contains both CPU and GPU compute environments. When I use the @batch decorator, I noticed that GPU instances were getting launched even when I explicitly set gpu=0 as a parameter in the decorator.

    This appears to be happening for a couple of reasons:

    • I maxed out the vCPU limit on my CPU ComputeEnvironment, which forces jobs to launch on the GPU ComputeEnvironment. After talking with AWS support: if any of you really want to crank up the number of Batch workers, make sure the MaxVCPUBatch parameter in the CloudFormation template is also adjusted upwards accordingly. For me, I'm running Dask parallelization within each Batch task, so I use up MaxVCPUBatch pretty quickly and was only seeing one c5.18xlarge instance launch at any one time when I had a MaxVCPUBatch value of 96 in my CloudFormation template. So even though the Metaflow documentation lists a --max-workers parameter in the CLI, the number of maximum workers will also be throttled by MaxVCPUBatch in the CloudFormation template.

    • Explicitly setting gpu=0 does nothing within the Metaflow @batch decorator (BatchJob class). I know there are a lot of ways to correct for this (separate job queues, the solution mentioned above, etc.), but I was curious what the Metaflow devs on this forum think of possibly changing line 150 in batch_client.py to read if int(gpu) >= 0 to protect against GPU instances being launched "unnecessarily".

    11 replies
    jpcloudconsulting
    @jpcloudguru_twitter
    Hi, trying to run a hello Metaflow example on AWS Batch. Getting the following error. Any ideas?
    Metaflow 2.2.5 executing HelloAWSFlow for user:jpujari
    Validating your flow...
        The graph looks good!
    Running pylint...
        Pylint is happy!
    2020-11-30 17:12:42.323 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] Setting up task environment.
    2020-11-30 17:12:42.325 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] /bin/sh: 1: [: -le: unexpected operator
    2020-11-30 17:12:44.974 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] /bin/sh: 1: [: -gt: unexpected operator
    2020-11-30 17:12:44.975 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] tar: job.tar: Cannot open: No such file or directory
    2020-11-30 17:12:44.977 [58/hello/251 (pid 14705)] [c0cd1149-2c44-4178-bbfe-40179180c331] tar: Error is not recoverable: exiting now
    2020-11-30 17:12:44.977 [58/hello/251 (pid 14705)]     AWS Batch error:
    2020-11-30 17:12:45.225 [58/hello/251 (pid 14705)]     Essential container in task exited This could be a transient error. Use @retry to retry.
    2020-11-30 17:12:45.233 [58/hello/251 (pid 14705)]
    2020-11-30 17:12:47.600 [58/hello/251 (pid 14705)] Task failed.
    2020-11-30 17:12:47.878 [58/hello/251 (pid 28819)] Task is starting (retry).
    2020-11-30 17:12:48.588 [58/hello/251 (pid 28819)] Sleeping 2 minutes before the next AWS Batch retry
    4 replies
    beks
    @teki-b
    Hello, we have been using the pip decorator to install versions of libraries as follows: @pip(libraries={"<library-name>": "<version>"}). Does anyone know if it's possible, and how, to get this to use the latest version dynamically without specifying the version value?
    1 reply
    Timothy Do
    @TDo13
    Hello, are there potentially good examples of how to use the step command to run an individual step or a subset of steps in a metaflow workflow?
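    (The step command is mostly what the runtime invokes internally for each task; for re-running an individual step or a subset of steps, resume is the documented entry point. A sketch:)

    python my_flow.py resume                         # restart from the failed step, reusing earlier results
    python my_flow.py resume train_model             # force re-execution starting from a specific step
    python my_flow.py resume train_model --origin-run-id 1020   # base the resumed run on a specific earlier run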
    11 replies
    Ian Wesley-Smith
    @iwsmith
    Hello, I am running a Metaflow job on AWS Batch and am getting some weird S3 errors. One of my parallel steps fails, with one job returning:
    Task is starting.
    <flow UserProfileFlow step make_user_profile[14] (input: [UserList(user_id=18...)> failed:
        Internal error
    Traceback (most recent call last):
      File "/metaflow/metaflow/datatools/s3.py", line 588, in _read_many_files
        stdout, stderr = self._s3op_with_retries(op,
      File "/metaflow/metaflow/datatools/s3.py", line 658, in _s3op_with_retries
    time.sleep(2**i + random.randint(0, 10))
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 492, in __exit__
        self.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 499, in close
        self._closer.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 436, in close
        unlink(self.name)
    OSError: [Errno 30] Read-only file system: '/metaflow/metaflow.s3.7chybqzr/metaflow.s3op.stderrn16glb60'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/metaflow/metaflow/cli.py", line 883, in main
        start(auto_envvar_prefix='METAFLOW', obj=state)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context().obj, *args, **kwargs)
      File "/metaflow/metaflow/cli.py", line 437, in step
        task.run_step(step_name,
      File "/metaflow/metaflow/task.py", line 394, in run_step
        self._exec_step_function(step_func)
      File "/metaflow/metaflow/task.py", line 47, in _exec_step_function
        step_function()
      File "train.py", line 121, in make_user_profile
        files = s3.get_many(user_keys, return_missing=True)
      File "/metaflow/metaflow/datatools/s3.py", line 417, in get_many
        return list(starmap(S3Object, _get()))
      File "/metaflow/metaflow/datatools/s3.py", line 411, in _get
        for s3prefix, s3url, fname in res:
      File "/metaflow/metaflow/datatools/s3.py", line 597, in _read_many_files
        yield tuple(map(url_unquote, line.strip(b'\n').split(b' ')))
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 492, in __exit__
        self.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 499, in close
        self._closer.close()
      File "/metaflow/metaflow_UserProfileFlow_linux-64_b2cb8ad829dfa351f545cf2d7738ca9eab794992/lib/python3.8/tempfile.py", line 436, in close
        unlink(self.name)
    OSError: [Errno 30] Read-only file system: '/metaflow/metaflow.s3.7chybqzr/metaflow.s3.inputs._ztikcnn'
    service@http://Metaflo-XXXXX.elb.us-east-1.amazonaws.com
    3 replies
    russellbrooks
    @russellbrooks
    PSA potentially related to :point_up:, looks like the Batch team got around to updating the default compute environment ECS-optimized AMIs to use Amazon Linux 2 :tada:
    https://aws.amazon.com/about-aws/whats-new/2020/11/aws-batch-now-has-integrated-amazon-linux-2-support/
    3 replies
    ayorgo
    @ayorgo
    Hey Metaflow,
    How should I approach cleaning up old metadata and artifacts from my metadata service?
    2 replies