No, you don't need kubectl; the cortex cluster info --debug command should have run all of the necessary Kubernetes commands.
cortex-debug/k8s/pod contains the description of each pod that was running on your cluster at the time you ran cortex cluster info --debug.
I would start by finding the pod(s) for your API. You can filter for your pods in the list by searching for apiName=iris-classifier in the labels section of the pod. Once you find your API's pod, can you share the events section for that pod?
Additionally, you can find the resource utilization for the pod in cortex-debug/k8s/pods.metrics. You can find the metrics for your API pod by searching the file for your API name.
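If it's easier, you can also grep those files programmatically once the debug archive is extracted. A minimal sketch (assuming cortex-debug/k8s/pod is a directory of per-pod descriptions and your API is named iris-classifier):

```python
from pathlib import Path

api_name = "iris-classifier"  # your API name

# Pod descriptions: find the file(s) whose labels include apiName=<your API>
for path in Path("cortex-debug/k8s/pod").rglob("*"):
    if path.is_file() and f"apiName={api_name}" in path.read_text(errors="ignore"):
        print("pod description:", path)

# Pod metrics: print any lines in pods.metrics that mention the API
for line in Path("cortex-debug/k8s/pods.metrics").read_text(errors="ignore").splitlines():
    if api_name in line:
        print(line)
```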
@noisyneuron so judging from what I see with those 2 projects, this is what I can tell:
https://github.com/minimaxir/gpt-2-simple is for generating/fine-tuning a GPT-2 model. This is meant to be done separately, on a dev machine. The resulting model should then be loaded in a Cortex API using any of the available predictors. Since this produces a SavedModel, you'd probably just go with the TensorFlow predictor.
https://github.com/minimaxir/gpt-2-cloud-run appears to be a project intended for GCP's Cloud Run, which I think is roughly GCP's equivalent of AWS Lambda. Either way, it would be incompatible with Cortex and redundant anyway.
In conclusion, the only thing you need is the fine-tuned model in step 1. We've already got an example using GPT-2. Check it out here.
@noisyneuron Cortex's Python Predictor interface is pretty flexible, check out the docs here: https://docs.cortex.dev/deployments/predictors#python-predictor
In summary, you can download / initialize your model in
__init__() (storing any state, like the loaded model itself, in
self), and then generate and return your prediction in
predict(). Do you have a script that you use to generate predictions locally? If so, you should be able to fit it into the pattern described above.
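To make the shape of that concrete, here's a minimal sketch of a Python Predictor (the model loading, config key, and payload field are placeholders, not taken from your project):

```python
import pickle

class PythonPredictor:
    def __init__(self, config):
        # Runs once when the replica starts: download/initialize the model
        # and keep any state (like the model itself) on self.
        with open(config["model_path"], "rb") as f:  # placeholder: load however you normally do
            self.model = pickle.load(f)

    def predict(self, payload):
        # Runs on every request: generate and return the prediction.
        return {"generated_text": self.model.generate(payload["text"])}  # placeholder generate() call
```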
@sm-hossein Cortex currently uses FastAPI/Uvicorn to respond to HTTP requests. Some of the Cortex API configuration, metrics, monitoring, and autoscaling functionality assumes that FastAPI/Uvicorn is being used.
I don't believe adding gRPC support is on the immediate roadmap. However, it may be possible to get gRPC working for your use case.
It looks like you have two options:
1. Deploy a gRPC container that accepts gRPC requests and makes HTTP requests to Cortex APIs (see the sketch below). This approach adds an extra hop in traffic and some organizational complexity. We would love to take a look at this code if gRPC becomes a part of the roadmap.
2. Build a new Docker container from scratch that uses gRPC. You can set predictor.image in your API configuration to point to your Docker image, and you can look at this code to see how the API configuration is used to deploy Docker containers. If you take this approach, you will have to do additional work to get some Cortex features, such as metrics (request count and average latency), working.
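For the first option, the heart of the proxy is just forwarding the deserialized gRPC request to the Cortex API's HTTP endpoint. A rough sketch of that forwarding piece (the gRPC service/stub code is omitted, and the endpoint URL is a placeholder):

```python
import requests

# Placeholder: your Cortex API's HTTP endpoint
CORTEX_API_URL = "http://<load-balancer-endpoint>/<api-name>"

def forward_to_cortex(payload: dict) -> dict:
    # Call this from your gRPC servicer method after deserializing the request;
    # it relays the payload to the Cortex API over HTTP and returns the JSON response.
    response = requests.post(CORTEX_API_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()
```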
You can run cortex cluster down to take down the cluster on AWS. You will be prompted for your cluster name and region. If you forgot your cluster name, you can find it by looking at the value of the tag alpha.eksctl.io/cluster-name associated with any of the Cortex cluster's EC2 instances.
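If you'd rather look it up from a script than the console, something like this (boto3, read-only, purely illustrative) prints the tag value for the instances in a region:

```python
import boto3

# List EC2 instances carrying the eksctl cluster-name tag and print its value.
ec2 = boto3.client("ec2", region_name="us-west-2")  # your cluster's region
resp = ec2.describe_instances(
    Filters=[{"Name": "tag-key", "Values": ["alpha.eksctl.io/cluster-name"]}]
)
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        for tag in instance.get("Tags", []):
            if tag["Key"] == "alpha.eksctl.io/cluster-name":
                print(instance["InstanceId"], tag["Value"])
```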
oh, I double-checked this morning and the EC2 instances are still running :'( not sure how to stop this at all
You will have to go to CloudFormation and delete the stack(s) for the associated region. Once triggered, that will take some time (10-15 minutes), so you'll have to keep an eye on it. Did you already try
cortex cluster down as @vishalbollu has suggested?
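If checking the console is a pain, a quick read-only listing like this (boto3; purely illustrative) shows the CloudFormation stacks in the region; the Cortex/eksctl ones should include your cluster name:

```python
import boto3

# List CloudFormation stacks so you can spot the ones belonging to the cluster.
cf = boto3.client("cloudformation", region_name="us-west-2")  # your cluster's region
for page in cf.get_paginator("describe_stacks").paginate():
    for stack in page["Stacks"]:
        print(stack["StackName"], stack["StackStatus"])
```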
predict method upon receiving the payload. Then, once the preprocessing is done, the result is handed off to client.predict, which does the actual inference. Why not have the preprocessing done in the predict method? Could you tell us more about what you're trying to achieve?
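For reference, this is the pattern I mean, with the preprocessing living inside predict() itself (a sketch assuming the TensorFlow Predictor; preprocess() is a made-up placeholder):

```python
def preprocess(payload):
    # whatever transformation you currently run before inference (placeholder)
    return {"input": payload["values"]}

class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        self.client = tensorflow_client

    def predict(self, payload):
        model_input = preprocess(payload)        # preprocessing happens here, per request
        return self.client.predict(model_input)  # then hand off to the actual inference
```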
@alexdiment I hear you. It looks like the local install can only run on macOS, whereas the container can run on anything. I wonder if giving access to the Docker socket would be a good idea for you: that way, you could start containers from within the serving container using this Python SDK: https://docker-py.readthedocs.io/en/stable/.
You'd start the container in the constructor and then interact with it like you'd usually do in the predict method.
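Roughly what I have in mind (a sketch using docker-py; the image name, port, and endpoint are placeholders):

```python
import docker
import requests

class PythonPredictor:
    def __init__(self, config):
        # Start the sidecar container once per replica (requires access to the Docker socket).
        self.docker_client = docker.from_env()
        self.container = self.docker_client.containers.run(
            "lowerquality/gentle",     # placeholder image
            detach=True,
            ports={"8765/tcp": 8765},  # placeholder port mapping
        )

    def predict(self, payload):
        # Talk to the sidecar over HTTP, just like you would locally (placeholder endpoint).
        resp = requests.post("http://localhost:8765/transcriptions", json=payload)
        return resp.json()
```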
What do you think?
@alexdiment While we think of ways to support using your own container with Cortex, it looks like it may be possible to create a Cortex-compatible image based on https://hub.docker.com/r/lowerquality/gentle/dockerfile.
Looking at their Dockerfile, https://hub.docker.com/r/lowerquality/gentle/dockerfile, we see that they're compiling a bunch of stuff and then they use the python3 executable to run the server.
On the other side, we've got the Cortex Dockerfile, https://github.com/cortexlabs/cortex/blob/master/images/python-predictor-cpu/Dockerfile, which is built from the same base image (ubuntu:18.04) but uses a conda-installed Python runtime with conda-provisioned packages.
In order for this to work, we'd change the base image of the Cortex Dockerfile to lowerquality/gentle, and then in dependencies.sh we'd run the server in the background using the system-wide Python3 runtime rather than the conda-provisioned one, which is used by default. You'll probably have to look at the state of the environment variables (pertaining to the Python runtime) before and after creating the customized serving image; that will likely be needed to set the right variables when running the Python server. Here's what I'm thinking of:
```bash
# dependencies.sh
cd /gentle && PYTHONPATH=/my/dir /usr/bin/python3 server.py &
```
@ishanShahzad Sorry for the delay, I was in a meeting which just ended.
The situation you described will be handled automatically by Cortex, so it should not be necessary to use SQS on top of Cortex for your use case. There is a queue within Cortex, so if the instances are at the max_instances limit, requests will be queued while waiting for previous requests to complete. If the queue grows too large, requests will be responded to with HTTP error code 503 (the queue size is configurable via max_replica_concurrency, which is described here).