    Shiva Manne
    @manneshiva
    [two screenshots attached, dated 2020-11-17]
    Shiva Manne
    @manneshiva
    One more question: is there a way to monitor the cluster so that I can set alarms? Two use cases:
    i. I want to see the current resource utilization of the cluster - like CPU usage per instance, GPU usage, etc.
    ii. I want to set alarms based on custom-defined metrics - say if the number of instances in the cluster hits a certain value or the average GPU usage of the cluster is more than 90% for 10 minutes.
    How can I define these custom metrics to monitor and alert when the API is in production?
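
    Something like this is what I'd want to publish and alarm on (just a sketch using boto3's CloudWatch API; the namespace, metric name, and values here are made up):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # publish a custom metric value (e.g. computed from the cluster)
    cloudwatch.put_metric_data(
        Namespace="MyCluster/Inference",  # hypothetical namespace
        MetricData=[{"MetricName": "AvgGPUUtilization", "Value": 93.0, "Unit": "Percent"}],
    )

    # alarm if the average stays above 90% for 10 consecutive 1-minute periods
    cloudwatch.put_metric_alarm(
        AlarmName="high-gpu-utilization",
        Namespace="MyCluster/Inference",
        MetricName="AvgGPUUtilization",
        Statistic="Average",
        Period=60,
        EvaluationPeriods=10,
        Threshold=90.0,
        ComparisonOperator="GreaterThanThreshold",
    )
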
    Madison Bahmer
    @madisonb

    Hey folks - I've got a couple of use cases that have emerged within my team as we've been using (and loving) Cortex in production for a couple of months now.

    1. With how easy it is to create and deploy an API, our team has been using it for more than just model deployment. In fact, we have a number of non-ML APIs running right now doing various enrichments on our data. We use custom-built containers that still inherit from the official Cortex ones but contain our own packages and libraries. To me this ends up competing largely with other, more generic FaaS offerings; have you considered going down that route? We would also love a slim container that contains no ML packages, just the interfaces needed, so we could build on it for our own packages and API needs.

    2. We have a use case where various customer sets would like to run different ML enrichment jobs, and these jobs need to be billed back to them for the resources they used. Basically, I need to know how I can portion out the Cortex cluster bill from AWS to each customer, based on a series of metrics we log or generate. Given that an API consumes both compute AND execution time, we were trying to figure out a series of logs or metrics we could use to calculate each customer's portion of our Cortex bill. You can imagine customer A has 3 endpoints of various sizes and call volumes, and customer B has only 1 endpoint of a certain size and call volume. What metrics do we need to log or generate so I can accurately say "customer A gets 72% of the bill, and customer B gets the remainder"? Obviously I can annotate the logs with other pieces of information (like which customer called the endpoint), but using raw counts or raw execution time alone seems to not give the complete picture (see the rough sketch after this list).

    3. This one is minor, but if there were a way to turn off the JSON wrapper around log messages in CloudWatch, that would be helpful. Our systems already log in JSON format, and being able to toggle that off, so we don't have to configure our logging ingest process to handle the nested JSON, would make it easier to integrate directly into things like Elasticsearch (which is what we use to aggregate all metrics from our endpoints).
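
    Here's the rough attribution sketch mentioned in point 2 (the record fields, weights, and numbers are all made up, just to illustrate weighting by compute times execution time rather than raw counts):

    # each logged request record carries the customer, the compute requested by
    # the replica that served it, and the execution time in seconds (hypothetical fields)
    records = [
        {"customer": "A", "gpu": 1, "cpu": 2.0, "seconds": 0.8},
        {"customer": "A", "gpu": 0, "cpu": 1.0, "seconds": 0.3},
        {"customer": "B", "gpu": 1, "cpu": 2.0, "seconds": 0.5},
    ]

    # relative hourly cost of a GPU vs. a CPU core (assumed; take from instance pricing)
    GPU_WEIGHT, CPU_WEIGHT = 0.90, 0.10

    def request_cost(r):
        return (r["gpu"] * GPU_WEIGHT + r["cpu"] * CPU_WEIGHT) * r["seconds"]

    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + request_cost(r)

    grand_total = sum(totals.values())
    shares = {customer: total / grand_total for customer, total in totals.items()}
    print(shares)  # fraction of the bill attributed to each customer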

    Feel free to ping me if you have any questions

    4 replies
    Jin Pan
    @jinpan

    Hi Cortex team,

    In https://docs.cortex.dev/deployments/realtime-api#how-does-it-work, it says that "When a request is made to the HTTP endpoint, it gets routed to one of your API's replicas (at random)." Is there a way to configure the load balancer to use a round-robin policy instead of random?

    4 replies
    synapsta
    @synapsta
    Hello - I've been working with Cortex for a couple weeks now and I'm getting to understand it better. But one thing that I keep running into is that if my API fails for one reason or another and I delete it, make a fix, and then try to deploy it again, it almost always fails to deploy after several minutes with a 'compute unavailable' error message. Then I have to bring the cluster down and back up to get it working again. I have it set to have min 0 and max 2 spot GPU instances, but I also have on-demand backup set to yes. Does this behavior make sense? I'm going to stop using spot instances for now and see if the problem goes away, but any ideas you have would be greatly appreciated. Thanks!
    6 replies
    Irineu Licks Filho
    @irineul
    Hi Cortex team, I started to use it yesterday and it seems promising for us. I would like to have a better overview of all the AWS components created by Cortex and their responsibilities, but I couldn't find them in the documentation, only a simple graph. Do you have this kind of documentation? Thanks!
    2 replies
    Greg Tarr
    @Greg-Tarr
    Thanks @miguelvr !
    I have another question :D Is there any way for a worker/predictor to know the API address? I'd like one worker to be able to share its work if what it receives in its payload is too much to process.
    6 replies
    Greg Tarr
    @Greg-Tarr

    I'm currently having an issue with my GPU setup in AWS, when importing onnxruntime-gpu I get:

    OSError: libcudnn.so.8: cannot open shared object file: No such file or directory

    I think it's worth noting that I'm using a Python predictor and there are ~8 models (most ONNX, some PyTorch) in the pipeline, each heavily dependent on the others, so I cannot split the API.

    onnxruntime-gpu==1.5.2 with a fairly standard PythonPredictor setup. I think the issue is the version of CUDA & cuDNN that Cortex uses by default; will changing predictor.image to cortexlabs/python-predictor-gpu-slim:0.21.0-cuda11.0 work?
    Robert Lucian Chiriac
    @RobertLucian

    @Greg-Tarr to confirm this, are you trying to import the ONNX runtime with import onnxruntime as rt? Currently, in our ONNX predictor images (it should be the same for your Python predictor as well), we install version 1.4.0 of the ONNX runtime (onnxruntime-gpu==1.4.0), and it does work. You will also have to set the compute.gpu field to a non-zero value in your API spec; otherwise, it won't work.

    And for reference, the default version of CUDA that we use is 10.1, and the cuDNN version is 7.

    Greg Tarr
    @Greg-Tarr
    I just import onnxruntime, and I have set compute.gpu to 1, and I'm using g4dn.xlarge instances
    Rowlando13
    @Rowlando13
    Quick recommendation for the installing-with-Windows guide: I don't think Ubuntu 20.04, which is probably the most common option for WSL, ships with pip for Python 3. I recommend adding sudo apt-get install python3-pip and changing pip install cortex to pip3 install cortex. By the way, it is very cool that you now support Windows.
    1 reply
    Greg Tarr
    @Greg-Tarr
    @Rowlando13 cortex supports windows?!!?
    Robert Lucian Chiriac
    @RobertLucian
    @Greg-Tarr yes, you can run the Cortex CLI on a Windows machine as well. We have a guide that explains how Cortex can be configured for Windows. This is the guide.
    Greg Tarr
    @Greg-Tarr
    @RobertLucian Oh yea, using WSL - that's what I do currently :P I thought you meant native windows support :D
    Robert Lucian Chiriac
    @RobertLucian
    @Greg-Tarr I see. Would you have preferred native support on Windows, or do you think the WSL support is already good? Our thinking is that by focusing on a single OS (especially on Linux) we can ensure a better user experience (not riddled with bugs).
    Greg Tarr
    @Greg-Tarr
    @RobertLucian Windows support definitely isn't necessary, even though I'm a windows user I do all of my ML on Linux (as everyone should). I entirely agree with focusing on a single OS :D
    Robert Lucian Chiriac
    @RobertLucian
    @Greg-Tarr got it. Thanks for the feedback Greg!
    Greg Tarr
    @Greg-Tarr
    I have a problem that I can't pinpoint. The GPU cluster comes up fine, but when I check cortex logs it just hangs on "fetching logs..." forever. I SSHed into the worker machine and ran top in the Docker container for the predictor, and it seems as though a conda instance is starting up and crashing every minute or so.
    Robert Lucian Chiriac
    @RobertLucian
    @Greg-Tarr that's odd. I would recommend setting up kubectl so you can see the logs of your API pod directly. Here's the guide on how to set up kubectl. Once that's done, run kubectl get pods, take the ID of the pod that starts with api-<api-name>-..., and then run kubectl logs -f --all-containers <api-<api-name>-...> to see the logs of your pod. Most probably, there's something wrong with the implementation of the API. Let me know if you find anything!
    David Eliahu
    @deliahu

    @Greg-Tarr following up on your ONNX runtime / CUDA question: it seems that onnxruntime v1.5.3 is built for CUDA 10.2 and cuDNN 8. However, currently we only build CUDA 10.2 with cuDNN 7, and CUDA 11.0 with cuDNN 8 (neither of which works with onnxruntime v1.5.3).

    I've made cortexlabs/cortex#1575 to start building additional combinations going forward (i.e. in cortex v0.23+).

    In the meantime, I've built the image that should be the one you need for cortex v0.22, and uploaded it to our dev docker hub. Can you confirm that using this image in your API works with onnxruntime v1.5.3: cortexlabsdev/python-predictor-gpu-slim:0.22.0-cuda10.2-cudnn8?

    If so, you can use that image until cortex v0.23 is released, after which you'd switch to cortexlabs/python-predictor-gpu-slim:0.23.0-cuda10.2-cudnn8 (after upgrading your cluster).

    Jinhyung Park
    @jinhyung
    g3s.xlarge      spot        0          0m / 3190m                            0 / 30161868Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     spot        0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g3s.xlarge      spot        0          0m / 3190m                            0 / 30161868Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g3s.xlarge      spot        0          0m / 3190m                            0 / 30161868Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     spot        0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     spot        0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g3s.xlarge      spot        0          0m / 3190m                            0 / 30161868Ki                           0 / 0
    g4dn.xlarge     spot        0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g3s.xlarge      spot        0          0m / 3190m                            0 / 30161868Ki                           0 / 0
    g4dn.xlarge     spot        0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    g4dn.xlarge     on-demand   0          0m / 3190m                            0 / 14893392Ki                           0 / 0
    All GPUs seem to be unavailable, and all our batch tasks are stuck. I tried killing all the instances manually, and Cortex launched them all again (probably due to the queued batch tasks), but Cortex still says none of the instances' GPUs are available. Is there any workaround to fix this?
    13 replies
    James Madison
    @MadJames_gitlab
    I was wondering, is it desirable / possible to have multiple Cortex clusters running in the same VPC on AWS? We made a few development Cortex clusters in addition to our production one, and it seems like every Cortex cluster creates a new VPC, and we have reached the limit of 5 VPCs (which it looks like AWS does not let you raise). I didn't see anything in the Cortex docs about specifying an existing VPC to run the cluster in.
    3 replies
    Vaclav Kosar
    @vackosar
    @deliahu Hi! I am on 17.1. I am seeing a lot of spot instances running in the AWS Instances screen, but for some reason my job is not executing on them. It keeps waiting in the updating state for, say, 20 minutes. Also, do we pay for those spot instances if we cannot use them? Thanks a lot.
    9 replies
    Vaclav Kosar
    @vackosar
    @deliahu Second question: for instance_distribution, do I need to list my primary type there, or is it included automatically? E.g. if my primary is instance_type: g4dn.xlarge, do I have to add it to the list (instance_distribution: [g4dn.xlarge, g4dn.2xlarge]), or is that the same as instance_distribution: [g4dn.2xlarge]?
    2 replies
    Irineu Licks Filho
    @irineul

    hey guys, I deployed the pytorch/text-generator example to AWS and tested 50 simultaneous users with Locust. The load balancer worked and the replicas started (now I have 3 EC2 instances), but then I started to receive HTTP 5XX for all requests, so I would like to know:

    • Why the 5XX calls weren't logged to the CloudWatch dashboard
    • How I can discover why 3 EC2 instances weren't able to handle only 50 simultaneous users

    I'm concerned about scalability with Cortex, which is why I'm testing this scenario: seeing potential errors and, more importantly, what I can do when they happen.
    Thanks!

    11 replies
    Greg Tarr
    @Greg-Tarr

    @RobertLucian and @deliahu THANK YOU SO MUCH! My deadline is tomorrow and you've both saved my life! 158ms for 8 models... not bad :D If there's any way I can repay you I'd be happy to do so!

    I have one more question: can multiple replicas use the same GPU, and is there a way to monitor GPU usage (RAM/utilization) without SSH?

    Greg Tarr
    @Greg-Tarr
    @RobertLucian I have a suggestion for Cortex. I have two deployment YAML files, one for GPU in the cloud and one for CPU locally. I have a problem because ONNX Runtime's package has a different name for its GPU and CPU versions, meaning the requirements.txt needs to be different for the two. I would like to be able to have a requirements-cpu.txt and a requirements-gpu.txt or something of the sort; perhaps a requirements/dependencies file path could be added to the predictor config?
    Robert Lucian Chiriac
    @RobertLucian

    @Greg-Tarr right, so in that case, you could have these 2 requirements files, requirements-cpu.txt and requirements-gpu.txt, in your project's directory. You would then add a dependencies.sh script with the following contents:

    if [ "$CORTEX_PROVIDER" = "aws" ]; then
      pip install -r requirements-gpu.txt
    else
      # when CORTEX_PROVIDER is set to local
      pip install -r requirements-cpu.txt
    fi

    When it's deployed locally, it will only install the CPU version; when it's on AWS, it will install the GPU version. This is a bit hacky though. Would this be acceptable for you?

    Greg Tarr
    @Greg-Tarr
    @RobertLucian Nice! I'll definitely do that, kinda annoyed at myself for not thinking about that :/
    You've been a really great help and you've saved me looooads of time; does Cortex accept donations?
    Greg Tarr
    @Greg-Tarr
    In the documentation here: https://docs.cortex.dev/deployments/realtime-api/predictors#chaining-apis
    It specifies "you could make a request to it from a different API", but how can I make a request to the same type of predictor?
    David Eliahu
    @deliahu

    @Greg-Tarr We're glad to hear that you've enjoyed using Cortex! The best way to "repay" us would be to spread the word :)

    It is not possible for two separate APIs to share the same GPU; GPUs must be requested in integer multiples. That said, it is easy to run two models in the same API, which would let you share the GPU across both models. You could then pass a query parameter or a field in the request body to indicate which model to use. In terms of specifying the actual models to serve, you can do that in the API configuration or in the predictor implementation; this guide shows both. Let us know if you have any questions! Also, currently the best way to check GPU usage is to SSH in.
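
    A minimal sketch of what that could look like in a Python predictor (the config keys, model names, query parameter, and payload shape below are made up, and the exact predict() signature may vary across Cortex versions):

    # predictor.py -- illustrative sketch, not the guide linked above
    import numpy as np
    import onnxruntime as ort

    class PythonPredictor:
        def __init__(self, config):
            # both models are loaded into the same API, sharing its single GPU
            self.sessions = {
                "detector": ort.InferenceSession(config["detector_path"]),
                "classifier": ort.InferenceSession(config["classifier_path"]),
            }

        def predict(self, payload, query_params):
            # choose the model per request, e.g. POST /...?model=classifier
            session = self.sessions[query_params.get("model", "detector")]
            input_name = session.get_inputs()[0].name
            data = np.array(payload["input"], dtype=np.float32)
            return session.run(None, {input_name: data})[0].tolist()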

    Thanks for your suggestion about the configurable requirements.txt. We have this ticket to track that (cortexlabs/cortex#1191), and in the meantime, @RobertLucian's suggestion should work.

    What do you mean exactly by "how can I make a request to the same type of predictor"? What is your use case / what are you trying to achieve?

    Greg Tarr
    @Greg-Tarr
    My payload is a list of N URLs (let's say 128), but one GPU only has enough VRAM to process 32, so I want the predictor that receives the payload to split it into batches of 32 and send the other batches through the API while it processes the first batch (so it would send 3 batches to the API and process 1 on its own, then combine the results and return the response). So essentially I want to be able to do requests.post("http://api-predictor:8888/predict", json=batch) for 3 of the batches concurrently, but I'm getting an error that says that address doesn't exist. The predictor's name is 'predictor'; is my URL correct?
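
    Here's roughly the fan-out I mean (the endpoint URL is a placeholder, since that address is exactly what I'm unsure about, and process_locally stands in for this replica's own inference):

    import concurrent.futures
    import requests

    API_URL = "http://<chained-api-endpoint>/predict"  # placeholder

    def chunked(urls, size=32):
        for i in range(0, len(urls), size):
            yield urls[i:i + size]

    def fan_out(urls):
        batches = list(chunked(urls))
        first, rest = batches[0], batches[1:]
        with concurrent.futures.ThreadPoolExecutor() as pool:
            # send the remaining batches to the chained API while this
            # replica works on the first batch itself
            futures = [pool.submit(requests.post, API_URL, json=batch) for batch in rest]
            local_result = process_locally(first)  # hypothetical local inference
            remote_results = [f.result().json() for f in futures]
        return [local_result] + remote_results
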
    21 replies
    David Eliahu
    @deliahu
    @/all If you have set subnet_visibility: private in your cluster configuration file, it is likely that you will encounter issues when scaling your cluster beyond ~10-20 nodes, due to Docker Hub's newly enforced IP-based rate limiting policy. We are working on moving our images for our next release. In the meantime, we have created this guide with instructions for how to avoid this issue (thanks @RobertLucian!). Feel free to reach out here or at dev@cortex.dev if you have any questions.
    10 replies
    Greg Tarr
    @Greg-Tarr
    Hi again! I'm chaining two APIs together and one is passing an array of images as a pickled object to the other one. However I'm getting a 413 (request too large) error.
    3 replies
    Vaclav Kosar
    @vackosar

    @deliahu Hi, on Friday evening, without any change except perhaps traffic, no instances are getting scaled up. kubectl says the following:

    Events:
      Type     Reason             Age                   From                Message
      Normal   NotTriggerScaleUp  2m30s (x51 over 22m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient nvidia.com/gpu, 1 max node group size reached
      Warning  FailedScheduling   2m18s (x43 over 22m)  default-scheduler   0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Insufficient nvidia.com/gpu, 1 node(s) didn't match node selector.

    54 replies
    Greg Tarr
    @Greg-Tarr
    Hello again! Loving cortex!
    I've got to make some architectural decisions and I'd like your input. I've got a bunch of streams (m3u8 URLs in SQS) that must be monitored in real time until they end. I have a predictor that grabs URLs (a batch of 32) from SQS and starts watching and monitoring them. If one of the streams ends, it grabs another one off the queue. There is no request/reply pattern here, as it's a continuous operation that starts when the predictor is initialized. This is clearly not the right way to do it, since autoscaling looks at request times, etc. My questions are these: can Cortex autoscale/support non-request/reply (continuous) workers; would a batch job work better; and how should I do this?
    Regards,
    Greg.
    Robert Lucian Chiriac
    @RobertLucian

    @Greg-Tarr it depends on your use case: if you have large amounts of data that you need to process but don't care about latency at all, then you can go with a batch API and submit jobs. If you do care about latency, then you'll want to go with a real-time API. There's also the overhead that has to be accounted for with a batch API: if there are too few inferences to run (say just 10 batches of size 32 that only take 2 minutes to process), then running them with a real-time API may be more cost-effective, because for each submitted job, resources have to be allocated and instances have to be spun up, and that takes time.

    With all that being said, you would still have a component external to Cortex that monitors the SQS streams, and as URLs come in (in batches of 32), that component would make the appropriate requests to a Cortex API. Basically, this external component is totally separate from the Cortex stack.

    Theoretically, we could also have a Queue-Polling API that would listen on the specified queue(s) and run predictions whenever something is pushed to the queue. We have created a ticket for this: cortexlabs/cortex#1586.

    Also, one thing you could look into is setting a hook on an SQS queue to run a Lambda whenever something is pushed to the queue. We know this is possible with S3 buckets, so it may also be possible with SQS queues. That Lambda would then be responsible for making requests to the Cortex API.
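
    A rough sketch of what that Lambda could look like (this assumes the standard SQS-to-Lambda event source mapping; the endpoint URL and payload field are placeholders):

    # lambda_handler.py -- illustrative only
    import json
    import urllib.request

    CORTEX_ENDPOINT = "https://<your-cortex-api-endpoint>"  # placeholder

    def handler(event, context):
        # each SQS message delivered to the Lambda appears in event["Records"]
        for record in event["Records"]:
            payload = json.dumps({"url": record["body"]}).encode()
            request = urllib.request.Request(
                CORTEX_ENDPOINT,
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(request)  # forward the work to the Cortex API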

    Greg Tarr
    @Greg-Tarr
    @RobertLucian thanks for the response! A Queue-Polling API would be exactly what I'm looking for :D Glad it's in the works. Since the component that pulls from the SQS queue needs to do intensive preprocessing before calling a chained predictor, it must be able to scale (and I hate complicating stacks), so I'm going to find a workaround that uses Cortex for that task. SQS hooks into Lambda may be what I'm looking for; how much compute power does Lambda provide? Enough to handle 32 livestreams? My current (very crude) workaround involves a while True in the predictor's init to pull from SQS, process streams, post responses to an external API, and repeat :P
    Robert Lucian Chiriac
    @RobertLucian
    @Greg-Tarr I think you'll want to have hooks on each SQS queue so that you have many Lambdas running at once. As far as I understand, AWS Lambdas have access to a single core if their memory request is up to 1.5 GB, or 2 cores if it's higher than that. Here are the limits I found for AWS Lambda. The advantage of AWS Lambda is that you can have lots of them.
    Greg Tarr
    @Greg-Tarr
    @RobertLucian That's great, I'll be sure to look into Lambda. One more question: if I wanted to implement custom autoscaling logic (based on CPU utilization, for example) in Cortex, how would I go about doing it?
    Robert Lucian Chiriac
    @RobertLucian
    @Greg-Tarr at the moment, you can configure autoscaling using the autoscaling field as described here. Is this enough, or do you need more customization? And if so, what kind of customization would you require?
    Greg Tarr
    @Greg-Tarr
    @RobertLucian Hi, I'd like to be able to base autoscaling on resources other than in-flight requests, particularly CPU utilization, GPU utilization, and/or the sum of a variable contained within a predictor (customizable).
    Greg Tarr
    @Greg-Tarr
    This could be accomplished by abstracting the pkg/operator/resources/realtimeapi/autoscaler.go file, which might as well be done when implementing the QueueAPI predictor.
    Vaclav Kosar
    @vackosar
    @deliahu @RobertLucian do you have any estimate of when version 0.23 will be available? I am looking forward to not having to think about the Docker rate limiting.