Hey folks - I've got a couple of use cases that have emerged within my team as we've been using (and loving) Cortex in production for a couple of months now.
With how easy it is to create and deploy an API, our team has been using it for more than just model deployment. In fact, we have a number of non-ML APIs running right now doing various enrichments on our data. We use custom-built containers that still inherit from the official Cortex ones, but contain our own packages and libraries. To me this ends up competing largely with other, more generic FaaS offerings; have you considered going down that route? We would also love a slim container with no ML packages at all, just the interfaces needed, that we could build on for our own packages and API needs.
We have a use case where various customer sets would like to run different ML enrichment jobs, and these jobs need to be billed back to them for the resources they used. Basically, I need to know how I can apportion the AWS bill for the Cortex cluster to each customer, based on a series of metrics we log or generate. Given that an API consumes both compute AND execution time, we were trying to figure out a set of logs or metrics we could use to calculate each customer's portion of our Cortex bill. You can imagine customer A has 3 endpoints of various sizes and call volumes, and customer B has only 1 endpoint of a certain size and call volume. What metrics do we need to log or generate so I can accurately say "customer A gets 72% of the bill, and customer B gets the remainder"? Obviously I can annotate the logs with other pieces of information (like which customer called the endpoint), but raw request counts or raw execution time alone don't seem to give the complete picture.
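To make the question concrete, here is a rough sketch of the kind of calculation we have in mind, assuming we log per-request compute requests and durations and weight each request by its "resource-seconds"; the field names and weights below are placeholders of our own, not anything Cortex emits today:

```python
# Hypothetical bill apportionment: weight each logged request by the resources
# its replica requested multiplied by how long it ran ("resource-seconds").
from collections import defaultdict

# Example log records we might emit per request (all fields are assumptions):
# customer, cpu (vCPUs requested), gpu (GPUs requested), mem_gb, duration_s
request_logs = [
    {"customer": "A", "cpu": 1.0, "gpu": 1, "mem_gb": 4, "duration_s": 0.15},
    {"customer": "B", "cpu": 0.5, "gpu": 0, "mem_gb": 2, "duration_s": 0.05},
]

# Rough relative cost weights per resource-second (would be tuned to actual
# instance pricing, e.g. GPU time costs far more than CPU time).
WEIGHTS = {"cpu": 1.0, "gpu": 20.0, "mem_gb": 0.1}

def cost_shares(logs):
    usage = defaultdict(float)
    for r in logs:
        resource_seconds = (
            WEIGHTS["cpu"] * r["cpu"]
            + WEIGHTS["gpu"] * r["gpu"]
            + WEIGHTS["mem_gb"] * r["mem_gb"]
        ) * r["duration_s"]
        usage[r["customer"]] += resource_seconds
    total = sum(usage.values())
    return {c: u / total for c, u in usage.items()}

print(cost_shares(request_logs))  # roughly {'A': 0.99, 'B': 0.01}
```

Part of what we're unsure about is how to attribute idle replica time and per-instance overhead on top of something like this, which is exactly why raw counts or execution time alone feel incomplete.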
This one is minor, but if there were a way to turn off the JSON wrapper around log messages in CloudWatch, that would be helpful. Our systems already log in JSON format, and being able to toggle that off, so we don't have to configure our logging ingest process to handle the nested JSON, would make it easier to integrate directly with things like Elasticsearch (which is what we use to aggregate all metrics from our endpoints).
Feel free to ping me if you have any questions
Hi Cortex team,
In https://docs.cortex.dev/deployments/realtime-api#how-does-it-work, it says that "When a request is made to the HTTP endpoint, it gets routed to one of your API's replicas (at random)." Is there a way to configure the load balancer to use a round-robin policy instead of random?
I'm currently having an issue with my GPU setup in AWS; when importing onnxruntime-gpu I get:
OSError: libcudnn.so.8: cannot open shared object file: No such file or directory
I think it's worth noting that I'm using a Python predictor and there are ~8 models (most ONNX, some PyTorch) in the pipeline, each heavily dependent on the others, so I cannot split the API.
@Greg-Tarr to confirm this, are you trying to import the ONNX runtime with `import onnxruntime as rt`? Currently, on our ONNX predictor images (it should be the same for your Python predictor as well), we install version 1.4.0 of the ONNX runtime (`onnxruntime-gpu==1.4.0`), and it does work with that. You will also have to set the `compute.gpu` field to a non-zero value in your API spec, otherwise it won't work.
And for reference, the default version of CUDA that we use is 10.1, and the cuDNN version is 7.
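For reference, setting that field looks something like this in the API spec (a minimal sketch; the API name and predictor path are placeholders):

```yaml
# cortex.yaml (illustrative; name and path are placeholders)
- name: my-api
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
  compute:
    gpu: 1  # must be a non-zero integer for the GPU to be exposed to the container
```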
`sudo apt-get install python3-pip` and changing `pip install cortex` to `pip3 install cortex`. By the way, it is very cool that you now support Windows.
`kubectl` so you can directly see the logs of your API pod. Here's the guide on how to set up kubectl. Once that's done, you would run `kubectl get pods`, take the ID of the pod that starts with `api-<api-name>-...`, and then run `kubectl logs -f --all-containers <api-<api-name>-...>` to see the logs of your pod. Most probably, there's something wrong with the implementation of the API. Let me know if you find anything!
@Greg-Tarr following up on your ONNX runtime / CUDA question: it seems that onnxruntime v1.5.3 is built for CUDA 10.2 and cuDNN 8. However, currently we only build CUDA 10.2 with cuDNN 7, and CUDA 11.0 with cuDNN 8 (neither of which works with onnxruntime v1.5.3).
I've made cortexlabs/cortex#1575 to start building additional combinations going forward (i.e. in cortex v0.23+).
In the meantime, I've built the image that should be the one you need for cortex v0.22 and uploaded it to our dev Docker Hub. Can you confirm that using this image in your API works with onnxruntime v1.5.3?
If so, you can use that image until cortex v0.23 is released, after which you'd switch to `cortexlabs/python-predictor-gpu-slim:0.23.0-cuda10.2-cudnn8` (after upgrading your cluster).
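Once you have the image (whether the dev one for v0.22 or the v0.23 one above), you'd point the predictor's `image` field at it in your API configuration; a minimal sketch with a placeholder name and path (double-check the field name against the docs for your Cortex version):

```yaml
# cortex.yaml (illustrative; name and path are placeholders)
- name: my-api
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
    image: cortexlabs/python-predictor-gpu-slim:0.23.0-cuda10.2-cudnn8
  compute:
    gpu: 1
```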
All GPUs seem to be unavailable, and all our batch tasks are stuck. I tried killing all the instances manually, and Cortex launched them all again (probably due to the queued batch tasks), but it still says that none of the instances' GPUs are available. Is there any workaround to fix this?
```
g3s.xlarge    spot        0   0m / 3190m   0 / 30161868Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   spot        0   0m / 3190m   0 / 14893392Ki   0 / 0
g3s.xlarge    spot        0   0m / 3190m   0 / 30161868Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g3s.xlarge    spot        0   0m / 3190m   0 / 30161868Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   spot        0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   spot        0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g3s.xlarge    spot        0   0m / 3190m   0 / 30161868Ki   0 / 0
g4dn.xlarge   spot        0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g3s.xlarge    spot        0   0m / 3190m   0 / 30161868Ki   0 / 0
g4dn.xlarge   spot        0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
g4dn.xlarge   on-demand   0   0m / 3190m   0 / 14893392Ki   0 / 0
```
`instance_type: g4dn.xlarge`, so do I have to add it to the list as `instance_distribution: [g4dn.xlarge, g4dn.2xlarge]`, or is it the same as `instance_distribution: [g4dn.2xlarge]`?
hey guys, I deployed the `pytorch/text-generator` example to AWS and tested 50 simultaneous users with Locust. The load balancer worked and the replicas started (I now have 3 EC2 instances), but then I started to receive HTTP 5XX responses for all requests, so I would like to know:
I'm concerned about scalability when using Cortex, so that's why I'm testing this scenario: seeing the potential errors and, more importantly, figuring out what I can do when they happen.
@RobertLucian and @deliahu THANK YOU SO MUCH! My deadline is tomorrow and you've both saved my life! 158ms for 8 models... not bad :D If there's any way I can repay you I'd be happy to do so!
I have one more question: can multiple replicas use the same GPU, and is there a way to monitor GPU usage (RAM/utilization) without SSHing in?
@Greg-Tarr right, so in that case, you could have these 2 requirements files, `requirements-cpu.txt` and `requirements-gpu.txt`, in your project's directory. You would then add a `dependencies.sh` script with the following contents:
if [ "$CORTEX_PROVIDER" = "aws" ]; then pip install -r requirements-gpu.txt else # when CORTEX_PROVIDER is set to local pip install -r requirements-cpu.txt fi
When it's deployed locally, it will only install the CPU version; when it's on AWS, it will install the GPU version. This is a bit hacky though. Would this be acceptable for you?
@Greg-Tarr We're glad to hear that you've enjoyed using Cortex! The best way to "repay" us would be to spread the word :)
It is not possible for two separate APIs to share the same GPU; GPUs must be requested in whole-number increments. That said, it is easy to run two models in the same API, which lets you share the GPU across both models. You could then pass in a query parameter or a field in the request body to indicate which model to use. In terms of specifying the actual models to serve, you can do that in the API configuration or in the predictor implementation; this guide shows both. Let us know if you have any questions! Also, currently the best way to check GPU usage is to SSH into the instance.
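To illustrate the multi-model pattern in a single API, here is a minimal Python predictor sketch; the model names, file paths, and the `model_name`/`inputs` request fields are placeholders, and the `__init__(self, config)` / `predict(self, payload)` interface shown is the standard Python predictor interface (check the docs for your Cortex version):

```python
# predictor.py -- illustrative sketch; model names, file paths, and the
# "model_name"/"inputs" request fields are placeholders.
import numpy as np
import onnxruntime as rt

class PythonPredictor:
    def __init__(self, config):
        # Load both models once at startup; they share the single GPU
        # requested via compute.gpu in the API spec.
        self.sessions = {
            "sentiment": rt.InferenceSession("sentiment.onnx"),
            "ner": rt.InferenceSession("ner.onnx"),
        }

    def predict(self, payload):
        # The request body selects the model, e.g.
        # {"model_name": "sentiment", "inputs": {"features": [[...]]}}
        session = self.sessions[payload["model_name"]]
        feed = {k: np.array(v, dtype=np.float32) for k, v in payload["inputs"].items()}
        outputs = session.run(None, feed)
        return [o.tolist() for o in outputs]
```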
Thanks for your suggestion about the configurable requirements.txt. We have this ticket to track that (cortexlabs/cortex#1191), and in the meantime, @RobertLucian's suggestion should work.
What do you mean exactly by "how can I make a request to the same type of predictor"? What is your use case / what are you trying to achieve?
`subnet_visibility: private` in your cluster configuration file, due to Docker Hub's newly enforced IP-based rate limiting policy, it is likely that you will encounter issues when scaling your cluster beyond ~10-20 nodes. We are working on moving our images off of Docker Hub for our next release. In the meantime, we have created this guide with instructions for how to avoid this issue (thanks @RobertLucian!). Feel free to reach out here or at email@example.com if you have any questions.
@deliahu Hi, on Friday evening, without any change except perhaps traffic, no instances are getting scaled up. kubectl says the following:

```
Events:
  Type     Reason             Age                   From                Message
  Normal   NotTriggerScaleUp  2m30s (x51 over 22m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient nvidia.com/gpu, 1 max node group size reached
  Warning  FailedScheduling   2m18s (x43 over 22m)  default-scheduler   0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 Insufficient nvidia.com/gpu, 1 node(s) didn't match node selector.
```
@Greg-Tarr it depends on your use case: if you have large amounts of data that you need to process but don't care about latency at all, then you can go with a Batch API and submit jobs. If you do care about latency, then you'll want to go with a Realtime API. There's also the overhead that has to be accounted for with a Batch API: if there are too few inferences to run (say just 10 batches of size 32 that only take 2 minutes to process), then running them through a Realtime API may be more cost-effective, because for each submitted job, resources have to be allocated and instances have to be spun up, and that takes time.
With all that being said, you would still need a component external to Cortex that monitors the SQS streams, and as URLs come in (in batches of 32), that component would make the appropriate requests to a Cortex API. Basically, this external component is totally separate from the Cortex stack.
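For illustration, a minimal sketch of such an external poller, assuming a queue of URL batches and a deployed Realtime API; the queue URL, endpoint, and message format are placeholders (it also assumes `boto3` and `requests` are installed):

```python
# Illustrative SQS-polling sketch (queue URL, endpoint, and message format are placeholders).
import json
import boto3
import requests

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-urls"  # placeholder
CORTEX_ENDPOINT = "https://<load-balancer>/my-api"                         # placeholder

sqs = boto3.client("sqs")

while True:
    # Long-poll for up to 10 messages at a time.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        urls = json.loads(msg["Body"])  # e.g. a batch of 32 URLs
        r = requests.post(CORTEX_ENDPOINT, json={"urls": urls})
        r.raise_for_status()
        # Only delete the message once the prediction request succeeded.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```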
Theoretically, we could also have a Queue-Polling API that would listen on the specified queue(s) and run predictions whenever something is pushed to the queue. We have created a ticket for this: cortexlabs/cortex#1586.
Also, one thing you could look into is setting a hook on an SQS queue to run a lambda whenever something is pushed into the queue. We know this is possible with S3 buckets, so it may also be possible with SQS queues. That lambda would then be responsible for making requests to the Cortex API.
`autoscaling` field as described here. Is this enough, or do you need more customization? And if so, what kind of customization would you require?
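For reference, a minimal sketch of what that block can look like in an API spec; the values are illustrative, and the exact field names (`min_replicas`, `max_replicas`, `target_replica_concurrency`) should be double-checked against the docs for your Cortex version:

```yaml
# cortex.yaml (illustrative; name and path are placeholders)
- name: text-generator
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
  autoscaling:
    min_replicas: 1
    max_replicas: 10
    target_replica_concurrency: 4  # average in-flight requests per replica before scaling up
```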