    Kurt Schelfthout
    @kurtschelfthout
    Hey @joekraemer , great to hear. Re: question 1 - right now the only configuration option we have is the machine idle timeout. Meadowrun keeps idle machines around for 5 minutes by default - idle meaning they are not running any meadowrun job. If you set this to a high value using meadowrun-manage-ec2 edit-management-lambda-config --terminate-instances-if-idle-for-secs <some large number> (you can also give this option on install) then the machines stay around for longer. Do note that this applies to all machines indiscriminately, so if at some point a number of machines are created (because e.g. multiple people run a job at the same time), then those will all stay around for at least the idle timeout.
    I am currently working on better configuration options for exactly this case, so soon (tm) you'll be able to tell meadowrun "keep 5 chunks of 2 CPUs/4 GB of memory around" regardless of the idle timeout. Between being out for a week, a bout of illness, and this actually being a bit harder than I initially anticipated, that feature is taking me longer than expected.
    Kurt Schelfthout
    @kurtschelfthout
    Re: Q2. That seems like a bug. I'd like to see the contents of the DynamoDB table to see what information is in there; I usually explore via the AWS Console: DynamoDB tab, Tables -> Explore items -> meadowrun_ec2_alloc_table. You should be able to download the contents as CSV. I'm expecting as many lines as you have instances in there. Also, for the rows that have an empty public address column, is the corresponding EC2 instance running?
    EDIT: hm, maybe disregard the following, as you said you tried reinstalling (I assume you mean meadowrun-manage-ec2 uninstall/install). Curious that it reappears - maybe something to do with how your AWS account is configured? END EDIT To get out of the situation, I think there's a good chance running "meadowrun-manage-ec2 clean" will clean up the table (as well as shut down all idle instances). If that doesn't work, "meadowrun-manage-ec2 uninstall" should remove all meadowrun resources from your account, including the offending DynamoDB table, and then "meadowrun-manage-ec2 install" puts everything back in a clean state. If you don't mind some tinkering, you could also just delete the row(s) without public address from the DynamoDB table manually, and then "meadowrun-manage-ec2 clean" will shut down any meadowrun managed instances that don't appear in the table.
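    If you'd rather poke at the table from Python than the console, a rough boto3 sketch like the one below would list the offending rows first. This is just tinkering code, not meadowrun tooling, and the "public_address" attribute name is a guess - check the actual column name in the console before relying on it.

        # Rough tinkering sketch (not part of meadowrun): list items in the
        # meadowrun_ec2_alloc_table whose public address is empty or missing,
        # so you can decide what to delete. The attribute name "public_address"
        # is an assumption - verify the real column name in the AWS Console.
        import boto3

        table = boto3.resource("dynamodb").Table("meadowrun_ec2_alloc_table")
        for item in table.scan()["Items"]:
            if not item.get("public_address"):
                print("Row without public address:", item)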
    Joe Kraemer
    @joekraemer

    Kurt, you are a hero. Thank you so much for responding insanely fast! I don't think I have ever gotten my questions answered so fast. This project is really really cool and I'm excited to see it grow. I wish I had known about this project last year. I was doing a reinforcement learning project and each trial would take over 10 hours on my MacBook. So I was launching instances, SSHing into the machine, manually cloning my repo, running a trial, then later SSHing in again, and finally transferring the files to a bucket. I was running 16 different instances. It was a nightmare.

    1. Sounds good, I will try this out! No worries with the delay. This tool has already helped me exponentially. Take care of yourself!

    2. Also, I reran my program and it seems to be working now... I didn't run any of the commands you suggested so I'm not too sure what happened here.

    EDIT: JK it just took longer to hit the error. I'm going through your troubleshooting steps now. I'll let you know what happens.

    Joe Kraemer
    @joekraemer

    Results:

    • Running meadowrun-manage-ec2 clean gave me another KeyError: 'running_jobs' in "/aws_integration/management_lambdas/adjust_ec2_instances.py"
    • Running meadowrun-manage-ec2 uninstall resulted in a clean uninstall with no errors
    • Running meadowrun-manage-ec2 install completed with no errors

    • Running my program after the reinstallation seems to be working fine. Thanks for the help Kurt. When in doubt, turn it off and then back on again.

    On another note, one of the packages I'm using requires gcc for installation. Is there a way to add this to the instance at runtime without having to create a separate container?
    Kurt Schelfthout
    @kurtschelfthout
    ok, still a bit mysterious how the row without public address got there in the first place. I'll have a look at clean to make it a bit more robust to such problems.
    Kurt Schelfthout
    @kurtschelfthout
    For gcc: we do support specifying apt packages for installation. Not sure what you're using as the deployment option for run_map or run_function, but I'll assume the default, which is Deployment.mirror_local(). If you're not specifying a deployment then that is what you're using. Deployment.mirror_local has a keyword argument interpreter - and interpreters have an additional_software option which specifies extra apt packages to install. As an example, to mirror a local conda environment and also install build-essential: deployment=Deployment.mirror_local(interpreter=LocalCondaInterpreter(environment_name_or_path="my-conda-env", additional_software=["build-essential"]))
    Here's a bit more info about deployment options: https://docs.meadowrun.io/en/stable/reference/deployment/ and here are the docs for LocalCondaInterpreter and LocalPipInterpreter: https://docs.meadowrun.io/en/stable/reference/apis/#meadowrun.LocalInterpreter. If you're using Deployment.git_repo that has similar options. I'll make a note to write a howto about this, we are lacking a bit of documentation about this feature it seems.
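    For reference, putting that snippet into a complete call might look roughly like the sketch below. The deployment= argument is the part spelled out above; the surrounding run_function/AllocEC2Instance/Resources shape and the resource numbers are assumptions based on the linked docs, so double-check them against the API reference.

        # Rough sketch: run a function that needs gcc, installing the apt package
        # build-essential alongside the mirrored conda environment. The
        # run_function/AllocEC2Instance/Resources call shape and the resource
        # numbers are assumptions taken from the docs linked above.
        import asyncio
        import meadowrun


        def build_something():
            # anything that needs gcc available on the instance
            return "built"


        result = asyncio.run(
            meadowrun.run_function(
                build_something,
                meadowrun.AllocEC2Instance(),
                meadowrun.Resources(logical_cpu=2, memory_gb=4, max_eviction_rate=80),
                deployment=meadowrun.Deployment.mirror_local(
                    interpreter=meadowrun.LocalCondaInterpreter(
                        environment_name_or_path="my-conda-env",
                        additional_software=["build-essential"],
                    )
                ),
            )
        )
        print(result)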
    Joe Kraemer
    @joekraemer
    Awesome Kurt! Adding the additional software worked! Thank you!
    Justin Riddiough
    @neural-loop
    Hi, I came across meadowrun and it seems pretty nice. I'm trying to understand what kind of wizardry is behind it and how I might use it. I'm using it to automate generating Stable Diffusion images, and it's nice that it powers machines down when idle and keeps usage in check.
    Kurt Schelfthout
    @kurtschelfthout
    hi @neural-loop. Sure - what would you like to know? The 10,000 foot overview is that meadowrun uses the boto3/AWS API to take an existing Python code base, create EC2 virtual machine(s), and run your Python code on those VMs with that code base. There's some more detail here: https://docs.meadowrun.io/en/stable/explanation/how_it_works/ and https://docs.meadowrun.io/en/stable/explanation/deployment/ There is also an overview of all the AWS resources that meadowrun uses/creates in your account here: https://docs.meadowrun.io/en/stable/reference/aws_resources/
    Kurt Schelfthout
    @kurtschelfthout
    As for when you might want to use it: generally, if you have a local Python codebase or git repository and you don't have the resources (or the inclination) to run it locally - for example you need more memory or powerful GPUs - meadowrun's goal is to make it very easy to run that codebase on one or more cloud machines. We're mostly targeting "batch" type workloads, e.g. data analysis, data science, AI training, parallelizing test runs, ... If you're looking for something that sets up an internet-facing service or a website, you're probably better off looking elsewhere. That said, it is possible to run e.g. a Jupyter notebook kernel on a cloud machine via meadowrun and then connect to it, which is a little bit like running a service.
    Justin Riddiough
    @neural-loop
    I have an internet-facing service, but the use case is making a registry of models. When a new model is added, I'd like to generate some examples of its output. So I think this could be a good fit - an easy way to run those generations in the background. The fact that instances shut down after being idle gives some peace of mind that the costs won't blow up as well. The actual web service is hosted elsewhere.
    Kurt Schelfthout
    @kurtschelfthout
    @neural-loop yeah that makes sense. You could set up git repos or containers for each of the models and meadowrun should happily run those for you, and manage the temporary EC2 instances.
    Justin Riddiough
    @neural-loop
    I am getting some weird output from https://medium.com/@meadowrun/how-to-run-stable-diffusion-2-on-ec2-5f7c4a0d65db Can someone confirm whether it's still correct? The output looks very grainy.
    I'm not sure if they have changed some of the config on the stabilityai side. Stable Diffusion 1.5 worked great.
    Kurt Schelfthout
    @kurtschelfthout
    @neural-loop just tried it, works for me. Are you using the snippet verbatim? What instance type did you get?
    Justin Riddiough
    @neural-loop
    @kurtschelfthout Let me try again. I'm getting g4dn.2xlarge
    Possibly I was using the 768 model; now I notice there is a different config with 'v' in it, maybe that was the issue
    Kurt Schelfthout
    @kurtschelfthout
    Same instance type as the one I just ran it on (just checking for NVIDIA issues, it can be picky sometimes). Yes, according to the repo the 2.0 or 2.1 model needs a different config; I haven't tested that one though.
    sorry the 768 model I meant
    they're obviously all 2.x :)
    Justin Riddiough
    @neural-loop
    If it's the 'v' config that's needed, maybe it would help to update the docs. They mention that the models can be used the same way, so I didn't think to change that. I'm running it again now to see. I'd also recommend that where the docs say 'x was explained in another document', it might be better to include the steps in each document independently - each one has multiple steps for a newbie and it can be hard to figure out exactly which steps are being referenced.
    I think it was probably that I was missing the inference-v.yaml, will let you know if that fixed it
    If I can get it to work with xformers down the road I could send over the notes
    Kurt Schelfthout
    @kurtschelfthout
    Good points. I've recently expanded the prerequisites section along those lines (it no longer refers to the DALL-E post). I'll add a note on how to change it to use the 768 model as well when I have a little time to test it.
    Justin Riddiough
    @neural-loop
    Thanks, yeah - the first time I went through it I didn't have AWS S3 set up, so there was a bit to pick up in one sitting, and then opening the DALL-E post - I did figure out what was referenced, but I wasn't completely sure where the referenced steps started and ended. It worked out though.
    Justin Riddiough
    @neural-loop
    Is eviction rate something like: your instance can get bumped if a higher bidder comes along?
    Kurt Schelfthout
    @kurtschelfthout
    yes. AWS gives you an estimate of how likely it is that a spot instance will be interrupted.
    max_eviction_rate sets the max that you're willing to tolerate. max_eviction_rate=0 means only use on-demand instances, which won't be interrupted.
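    For a concrete picture, the sketch below shows roughly how that knob might be set when requesting resources. The Resources/AllocEC2Instance names and the CPU/memory numbers are assumptions based on the docs; max_eviction_rate is the parameter discussed here.

        # Rough sketch of choosing an eviction tolerance. Resources and
        # AllocEC2Instance are assumed from the docs; CPU/memory values are
        # placeholders.
        import asyncio
        import meadowrun

        # max_eviction_rate=0: on-demand only, never interrupted, usually pricier.
        on_demand_only = meadowrun.Resources(logical_cpu=2, memory_gb=4, max_eviction_rate=0)

        # max_eviction_rate=80: allow spot instances with up to an ~80% estimated
        # interruption rate - cheaper, but the job may be evicted.
        spot_tolerant = meadowrun.Resources(logical_cpu=2, memory_gb=4, max_eviction_rate=80)

        result = asyncio.run(
            meadowrun.run_function(
                lambda: "ran on a spot instance (probably)",
                meadowrun.AllocEC2Instance(),
                spot_tolerant,
            )
        )
        print(result)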
    Justin Riddiough
    @neural-loop
    Is there any indication, or would it appear as an error? If an instance was interrupted, is there a message somewhere like "instance interrupted"?
    Kurt Schelfthout
    @kurtschelfthout
    It should show up as a lost connection or something - the machine goes away abruptly and the SSH connection dies.
    Justin Riddiough
    @neural-loop
    So if someone is running a large batch with many instances, is it normal practice that they scan the outputs to re-queue anything that might have been cancelled?
    Or would they set up something to retry until it does complete (though that would require some knowledge of why it failed - whether it was evicted vs. some other issue)?
    Kurt Schelfthout
    @kurtschelfthout
    meadowrun.run_function or run_command will throw an exception in that case (MeadowrunException, I think); you can react to that by retrying, moving on to the next task, etc. It's true that if you wanted to distinguish between an interruption and some other failure, you'd have to scan the error message text or something.
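    A minimal retry sketch along those lines: the exception name is only as recalled above, so the code below falls back to catching a broad Exception, and the call shape is assumed from the docs.

        # Rough sketch: retry a job that may have been interrupted (e.g. a spot
        # eviction). The exact exception type is assumed (MeadowrunException per
        # the discussion above), so a broad except is used as a fallback.
        import asyncio
        import meadowrun


        async def run_with_retries(task, attempts=3):
            last_error = None
            for attempt in range(1, attempts + 1):
                try:
                    return await meadowrun.run_function(
                        task,
                        meadowrun.AllocEC2Instance(),
                        meadowrun.Resources(logical_cpu=2, memory_gb=4, max_eviction_rate=80),
                    )
                except Exception as e:  # ideally the specific meadowrun exception
                    last_error = e
                    print(f"Attempt {attempt} failed: {e}")
                    # Distinguishing eviction from other failures would mean
                    # inspecting the error message text, as noted above.
            raise RuntimeError(f"Still failing after {attempts} attempts") from last_error


        print(asyncio.run(run_with_retries(lambda: "hello from the cloud")))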
    Justin Riddiough
    @neural-loop
    btw: the issue was the inference-v config - the images were generated correctly with the 768 model
    Kurt Schelfthout
    @kurtschelfthout
    awesome
    Torphix
    @torphix
    Hi everyone, firstly a big thank you for the library - super useful.
    With regards to latency reduction, is it possible to cache a container for longer periods of time, i.e. a particular setting that can be toggled? It seems that rebuilding the container on fresh instance launches is the greatest source of latency.
    Kurt Schelfthout
    @kurtschelfthout
    @torphix I expect that the container is only rebuilt when the conda/pip environment spec changes - meadowrun hashes the environment spec (i.e. the output of conda export or pip freeze).
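    Conceptually (this is an illustration of the idea, not meadowrun's actual code), the cache key is something like:

        # Illustration only: hash the environment listing so that an unchanged
        # environment maps to the same key and the cached image gets reused.
        import hashlib
        import subprocess

        env_spec = subprocess.run(
            ["pip", "freeze"], check=True, capture_output=True
        ).stdout
        print(hashlib.sha256(env_spec).hexdigest())  # same env -> same key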
    Torphix
    @torphix
    Ahh, very clever.. are there any suggestions for improving latency? I read the meadowrun Medium article but wondered if there were any additional findings you came across since then.. especially with regards to cold starts?
    Kurt Schelfthout
    @kurtschelfthout
    Not really - those machines start at the pace AWS sets (and their hands are probably tied too; it depends on the OS as well). You can keep machines around for longer though, so you don't get cold starts as often (of course AWS will charge you for the time) - see the edit-management-lambda-config command mentioned earlier, which lets you configure the TERMINATE_INSTANCES_IF_IDLE_FOR_SECS parameter (the --terminate-instances-if-idle-for-secs flag above).
    Keeping machines around for longer is your best bet really, as it also saves the time for docker pull, which can be quite slow too. Docker images are cached locally on the machine as well.
    Torphix
    @torphix
    Makes sense.. thank you!
    Torphix
    @torphix
    Hi again, I would like to add a feature wherein it's possible to rerun jobs that have previously been run on an instance, i.e. if a job has already been executed on an instance, prioritise executing on that instance if resources are available. This should prevent the function having to rebuild on multiple instances, if I am not mistaken? Do you have any suggestions / possible caveats w.r.t. this? Right now my naive approach is just to pass a function_uid in the run_function command and store it in the instance table when first building an instance, then, in the section of code where you get the sort key to choose the most viable instance based on resources available and resources required, prioritise instances that have previously run that particular function. This can of course be kept separate from the main branch or integrated as you see fit.. it's really just for a use case I have wherein I may have multiple instances running at a time.. thanks!
    Kurt Schelfthout
    @kurtschelfthout
    @torphix yes, that could work. That would save on pulling the image on multiple hosts. Just to check: it's also possible you're being bitten by the fact that meadowrun doesn't currently avoid building the same image multiple times if several jobs with the same environment are started at the same time. In other words, if 2 or more jobs that run in the same environment start at the same time, they'll each separately check if an image is cached. If it's not, they'll each build the image. The easiest way to avoid that is to run a dummy job first, which builds and caches the image in ECR.
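    A sketch of that warm-up idea: the call shape and the AllocEC2Instance/Resources names are assumed from the docs, and by default the local environment is mirrored, so real jobs that share it should then find the image already cached.

        # Rough sketch: run one trivial job first so the environment's image is
        # built and cached (in ECR) once, before launching the real jobs in
        # parallel. AllocEC2Instance/Resources are assumed from the docs.
        import asyncio
        import meadowrun

        asyncio.run(
            meadowrun.run_function(
                lambda: "warm-up",
                meadowrun.AllocEC2Instance(),
                meadowrun.Resources(logical_cpu=1, memory_gb=2, max_eviction_rate=80),
            )
        )
        # Subsequent jobs with the same environment should reuse the cached image
        # instead of each building it separately.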
    Torphix
    @torphix
    Thanks! I noticed in some of the comments where table.scan is performed to select the available instance, it's suggested that performing a query with available resources as a secondary key, in order to filter out the instances that can perform the job, may be preferable due to the reduced expense of the operation. I'm thinking of taking this approach and then adding the function_uid as an optional additional filter in order to select instances that have already been built for the job.. would you say this is preferable, though slightly more complex, compared to doing client-side instance selection?
    Thanks for the last tip, I'll make sure to avoid running multiple identical jobs at the same time. Warming up the instance in the background is certainly helpful; I appreciate the design decision of being able to toggle waiting for the result.
    Kurt Schelfthout
    @kurtschelfthout
    @torphix yeah, I think that's better in the long run; right now things might get painful if there are a lot of instances. It has to be REALLY a lot though, so not an immediate issue.
    Torphix
    @torphix
    Understood, Thanks!