    disregard, sorry, seems that my code just crashes due to missing imports, will update if I still have issues with cortex, thanks!
    Nick Lindberg
    So, if I have two APIs and I want to run them on two separate clusters, can I create them and get two environment variables to target the cluster I want (using the same Cortex install)?
    14 replies
    The IAM role I am using has access to multiple accounts; can I specify which account to spin the cluster up in via an ARN in the config?
    2 replies

    Hello, I am currently running Cortex version 0.19. Until this morning, the "image-classifier-resnet50" API deployed successfully, but since this afternoon:

    $ cortex-dev get

    env   realtime api                status     up-to-date   requested   last update   avg request   2XX
    aws   image-classifier-resnet50   updating   0            1           3m            -             -

    The status does not change from "updating" to "live". Looking at the log:

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
    File "/src/cortex/serve/serve.py", line 314, in start_fn
    predictor_impl = api.predictor.initialize_impl(project_dir, client, raw_api_spec, None)
    File "/src/cortex/lib/type/predictor.py", line 89, in initialize_impl
    class_impl = self.class_impl(project_dir)
    File "/src/cortex/lib/type/predictor.py", line 137, in class_impl
    impl = self._load_module("cortex_predictor", os.path.join(project_dir, self.path))
    File "/src/cortex/lib/type/predictor.py", line 180, in _load_module
    raise UserException(str(e)) from e
    cortex.lib.exceptions.UserException: error: error in predictor.py: No module named 'imageio'

    It gets stuck like this; has something changed?
    I bring the cluster up with $ make cluster-up every time.

    ubuntu@ip-172-31-19-133:~/cortex$ cat dev/config/cluster.yaml
    instance_type: p3.2xlarge
    min_instances: 1
    max_instances: 1
    bucket: cortex-cluster-019001virginiend1
    region: us-east-1
    log_group: cortex019001virginiaend1
    cluster_name: cortex019001virginiaend1
    availability_zones: [us-east-1a, us-east-1c]

    spot: true
    on_demand_base_capacity: 0
    on_demand_percentage_above_base_capacity: 0

    image_operator: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/operator:latest
    image_manager: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/manager:latest
    image_downloader: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/downloader:latest
    image_request_monitor: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/request-monitor:latest
    image_cluster_autoscaler: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/cluster-autoscaler:latest
    image_metrics_server: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/metrics-server:latest
    image_inferentia: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/inferentia:latest
    image_neuron_rtd: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/neuron-rtd:latest
    image_nvidia: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/nvidia:latest
    image_fluentd: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/fluentd:latest
    image_statsd: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/statsd:latest
    image_istio_proxy: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/istio-proxy:latest
    image_istio_pilot: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/istio-pilot:latest
    image_istio_citadel: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/istio-citadel:latest
    image_istio_galley: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/istio-galley:latest

    I would like to know why it ran without problems for the past month but stopped working today.
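    The `No module named 'imageio'` error above suggests the dependency is no longer present in the serving image. Assuming the project follows Cortex's `requirements.txt` convention for Python dependencies, pinning the package there should make deploys reproducible (2.9.0 below is an illustrative version, not a recommendation):

```shell
# Pin the missing package in the project's requirements.txt so the API
# image installs it at deploy time (2.9.0 is an illustrative version).
echo "imageio==2.9.0" >> requirements.txt
cat requirements.txt
```

    Pinning every dependency this way also guards against an upstream image or package change silently breaking a deployment that worked yesterday, which matches the "worked for a month, broke today" symptom.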

    3 replies
    Abdoulaye Faye
    Hi everyone, is it possible to get information about the client making a request, such as the IP address, request method, request path, user agent, etc.?
    2 replies
    I would need to log this kind of information.
    Hello Cortex, I'm noticing an issue (screenshot above) where the cluster tries to add more replicas to an instance than can possibly fit: for example, the instance has only 1 GPU and each replica requires 1 GPU (in the cortex.yaml config). Is this behavior normal? Is there something I can configure to avoid it? Thanks.
    3 replies
    Vaclav Kosar
    I am on Cortex version 0.25 and I experienced the following rare issues.

    Cortex AWS.SimpleQueueService.NonExistentQueue

    QueueDoesNotExist: An error occurred (AWS.SimpleQueueService.NonExistentQueue) when calling the GetQueueAttributes operation: The specified queue does not exist for this wsdl version.
      File "batch.py", line 330, in <module>
      File "batch.py", line 326, in start
      File "batch.py", line 168, in sqs_loop
        visible_messages, invisible_messages = get_total_messages_in_queue()
      File "batch.py", line 140, in get_total_messages_in_queue
        attributes = sqs_client.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["All"])[
      File "botocore/client.py", line 337, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "botocore/client.py", line 656, in _make_api_call
        raise error_class(parsed_response, operation_name)
    2 replies
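    For what it's worth, `AWS.SimpleQueueService.NonExistentQueue` can be transient: SQS is eventually consistent right after a queue is created, and there can be a race if the queue is deleted while a job is still polling it. A minimal retry-with-backoff sketch (the helper and exception names are hypothetical, not part of Cortex or boto3; a real caller would catch botocore's `ClientError` instead):

```python
import time

class QueueDoesNotExist(Exception):
    """Stand-in for botocore's AWS.SimpleQueueService.NonExistentQueue error."""

def get_queue_attributes_with_retry(fetch, retries=5, delay=0.01):
    """Retry `fetch` a few times, since a freshly created SQS queue can take
    a moment to become visible (hypothetical helper, not Cortex's code)."""
    for attempt in range(retries):
        try:
            return fetch()
        except QueueDoesNotExist:
            if attempt == retries - 1:
                raise
            time.sleep(delay * 2 ** attempt)  # exponential backoff

# Simulate a queue that only becomes visible on the third call.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise QueueDoesNotExist()
    return {"ApproximateNumberOfMessages": "16"}

attrs = get_queue_attributes_with_retry(fake_fetch)
print(attrs["ApproximateNumberOfMessages"])  # → 16
```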

    Cortex exit_code=134

    Operator log in one case:

    {    "log": "started enqueuing batches"}
    {    "log": "partitioning 16 items found in job submission into 16 batches of size 1"}
    {    "log": "completed enqueuing a total of 16 batches"}
    {    "log": "spinning up workers..."}
    {    "log": "at least one worker had status Failed and terminated for reason error (exit_code=134)"}

    and in another case:

    { "log": "started enqueuing batches" }
    { "log": "partitioning 16 items found in job submission into 16 batches of size 1" }
    { "log": "completed enqueuing a total of 16 batches" }
    { "log": "spinning up workers..." }
    { "log": "at least one worker had status Succeeded and terminated for reason completed (exit_code=0)" }
    { "log": "at least one worker had status Failed and terminated for reason error (exit_code=134)" }

    Another case had the same operator log, but also had API logs, which were not present in the other two. Two of the APIs had the following error:

    {"log":"2021-01-14 02:40:27.621795: F tensorflow/stream_executor/gpu/redzone_allocator.cc:289] Check failed: !lhs_check.ok() || !rhs_check.ok() Mismatched results with host and device comparison"}

    and in another two APIs' logs:

    {"log":"s6-svscanctl: fatal: unable to control /var/run/s6/services: supervisor not listening"}

    One of the APIs had no error.

    6 replies
    Vaclav Kosar

    Error: Failed to Get Queue Metrics

    This occurred during a call of the "get" command on Cortex 0.25.

    cli get --env=aws -o=json exceeded: Non zero return code for command .../cli get --env=aws -o=json! Output: {"stdout": "", "stderr": "error: failed to get queue metrics: unable to get queue attributes: https://sqs.us-east-1.amazonaws.com/xxxxxx/xxxxx.fifo: AWS.Simp...}

    Operator log sample (I will send rest of the log via mail):

    2021-01-04T15:32:54.248874609Z 2021/01/04 15:32:54 Running on port 8888
    2021-01-04T16:02:07.083410038Z error: xxxxxxxx is not deployed
    2021-01-04T16:15:34.191525303Z error: runtime error: invalid memory address or nil pointer dereference
    2021-01-04T16:15:34.191526390Z runtime error: invalid memory address or nil pointer dereference
    2021-01-04T16:15:34.191632391Z github.com/cortexlabs/cortex/pkg/lib/errors.Wrap
    2021-01-04T16:15:34.191636260Z  /go/src/github.com/cortexlabs/cortex/pkg/lib/errors/error.go:78
    2021-01-04T16:15:34.191639049Z github.com/cortexlabs/cortex/pkg/lib/errors.CastRecoverError
    2021-01-04T16:15:34.191641655Z  /go/src/github.com/cortexlabs/cortex/pkg/lib/errors/error.go:193
    2021-01-04T16:15:34.191644420Z github.com/cortexlabs/cortex/pkg/operator/endpoints.recoverAndRespond
    2021-01-04T16:15:34.191646811Z  /go/src/github.com/cortexlabs/cortex/pkg/operator/endpoints/respond.go:73
    2021-01-04T16:15:34.191649128Z runtime.gopanic
    2021-01-04T16:15:34.191651340Z  /usr/local/go/src/runtime/panic.go:969
    2021-01-04T16:15:34.191653540Z runtime.panicmem
    2 replies
    I'm occasionally getting a "workers were killed for unknown reason" error, is there a common cause for this on AWS?
    3 replies
    I got the error error: your Cortex operator version (0.26.0) doesn't match your predictor image version (0.22.1); please update your predictor image by modifying the `image` field in your API configuration file (e.g. cortex.yaml) and re-running `cortex deploy`, or update your cluster by following the instructions at https://docs.cortex.dev/cluster-management/update#upgrading-to-a-newer-version-of-cortex
    6 replies
    but there doesn't appear to be a later predictor image version
    what should I do
    Hi, isn't it possible to get Cortex to deploy an API locally, as in 0.21?
    I'm trying to deploy to a cluster

    ```
    status     up-to-date   requested   last update   avg request   2XX
    updating   0            1           1m23s         -             -

    metrics dashboard: https://eu-west-1.console.aws.amazon.com/cloudwatch/home#dashboards:name=cortex-prod
    ```

    do I need to downgrade the cluster?
    1 reply
    Robert Lucian Chiriac

    @umariyoob the local support has been dropped in 0.26 - so if you really need the local provider, you can go with 0.25.

    Part of the reason why we dropped the local provider was that if the user wanted to test an API predictor (as it was facilitated until 0.26), they could just run it as a Python application or in a notebook. Another reason was that the local provider didn't really have its place with Cortex ("Cortex is an open source platform for large-scale inference workloads") - removing it allows us to spend time on features that really matter to large-scale inference workloads-oriented users.

    2 replies
    Nick Lindberg
    Hey, anybody have issues in GCP saying there is insufficient compute? I started a cluster with 4 max instances and have an API running right now with 1 replica on it, so there should be three instances left, but when I try to deploy another API it tells me it failed with the error "compute unavailable".
    3 replies

    Thank you so much for that explanation! I'm looking forward to working with cortex!

    Does cortex support joblib models?
    1 reply
    Would it be possible to add gzip compression to responses, e.g. https://fastapi.tiangolo.com/advanced/middleware/#gzipmiddleware?
    I can put in a PR if it's something likely to be accepted.
    1 reply
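    On the gzip question above: FastAPI's documented `GZipMiddleware` compresses responses larger than a configurable `minimum_size`. The size win for repetitive JSON predictions is easy to sketch with the standard library (the payload shape here is made up for illustration):

```python
import gzip
import json

# A typical JSON prediction response; repetitive numeric payloads compress well.
payload = json.dumps({"predictions": [0.123] * 1000}).encode()

compressed = gzip.compress(payload)
print(f"{len(payload)} bytes -> {len(compressed)} bytes")

# The transformation is lossless: the client decompresses back to the original.
assert gzip.decompress(compressed) == payload
```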
    When I run $ cortex cluster up without deploying anything, am I being charged?
    1 reply
    Also, how do I stop or remove a deployment entirely?
    1 reply
    How do I resolve this warning when running cluster up?
    [ℹ] CloudWatch logging will not be enabled for cluster "cortex-test" in "eu-central-1"
    [ℹ] you can enable it with 'eksctl utils update-cluster-logging --region=eu-central-1 --cluster=cortex-test'
    1 reply
    Why can't I deploy a RealtimeAPI on r5.large? Which instances should I use?
    5 replies
    I see the model monitoring CLI commands aren't present anymore. Do we have any other way to monitor models?
    1 reply
    Hi, does anybody use Cortex to deploy ~1000 models in parallel, with individual scaling based on each model's usage? Can someone share a best practice? :)
    15 replies
    Santiago Andres Rodriguez Gonzalez
    Hi, I had an API running for some days receiving requests; after a while it went into error status, with the following log:
    runtime: mlock of signal stack failed: 12
    runtime: increase the mlock limit (ulimit -l) or
    runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
    fatal error: mlock failed
    runtime stack:
    runtime.throw(0x7f6ae1a05eb6, 0xc)
        /root/.go/src/runtime/panic.go:1112 +0x74
        /root/.go/src/runtime/os_linux_x86.go:72 +0x109
        /root/.go/src/runtime/os_linux.go:341 +0x7a
        /root/.go/src/runtime/proc.go:630 +0x10c
    runtime.allocm(0x0, 0x0, 0x7f6a46ffb020)
        /root/.go/src/runtime/proc.go:1390 +0x152
        /root/.go/src/runtime/proc.go:1529 +0x2f
        /root/.go/src/runtime/proc.go:1517 +0x83
        /root/.go/src/runtime/asm_amd64.s:370 +0x63
    goroutine 52 [running, locked to thread]:
        /root/.go/src/runtime/asm_amd64.s:330 fp=0xc0003e0ec0 sp=0xc0003e0eb8 pc=0x7f6ae19ca940
        /root/.go/src/runtime/cgocall.go:226 +0x292 fp=0xc0003e0f58 sp=0xc0003e0ec0 pc=0x7f6ae1973ef2
        /root/.go/src/runtime/cgocall.go:207 +0xc7 fp=0xc0003e0fc0 sp=0xc0003e0f58 pc=0x7f6ae1973bc7
    runtime.cgocallback_gofunc(0x0, 0x0, 0x0, 0x0)
        /root/.go/src/runtime/asm_amd64.s:793 +0x9a fp=0xc0003e0fe0 sp=0xc0003e0fc0 pc=0x7f6ae19cc2ea
        /root/.go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc0003e0fe8 sp=0xc0003e0fe0 pc=0x7f6ae19cca81
    19 replies
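    On the mlock crash above: this is the Go runtime's known workaround for a kernel signal-handling bug, and the log itself names the two fixes. A quick check of which one applies (assuming a Linux host; the `--ulimit` flag is standard Docker, not Cortex-specific):

```shell
# The Go runtime needs either a patched kernel (5.3.15+, 5.4.2+, or 5.5+)
# or a higher locked-memory limit.
uname -r     # kernel version
ulimit -l    # locked-memory limit in KB ("unlimited" avoids the crash)
# For containerized workers, Docker can raise the limit at run time, e.g.:
#   docker run --ulimit memlock=-1:-1 ...
```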
    Madison Bahmer

    Hey folks - our systems have had some large volume spikes due to US events going on today. Our Cortex cluster seemed to scale fine - we've only got 1 model deployed at a replica baseline of three, which normally services all requests. The spikes today have seen that surge to 5-7 replicas; however, during the actual replica creation (or traffic routing) there is a very definite period of failures or errors within our API calls, as if traffic is being routed to a new replica that is not ready yet. For context, we've typically not messed with the default predictor settings, but we would like to ensure 0 failures during replica allocation (both up and down).

    My second question again involves traffic load, and for a while today our cluster was unreachable via the cortex cli due to the volume of traffic we were sending to the api endpoint. I was wondering if there were additional traffic routing node configurations/sizing that we need to adjust in order to handle larger and larger network volumes into the eks cluster.

    For example, here is a cli command that timed out, under a lighter network load it returns fine

    ❯ cortex get -e <clustername>
    error: Get "https://xxxxxxxx-xxxxxxxxx.elb.us-east-1.amazonaws.com/get": dial tcp x.x.x.x:443: connect: operation timed out
    unable to connect to your cluster in the <clustername> environment (operator endpoint: https://xxxxxxxxxx-xxxxxxx.elb.us-east-1.amazonaws.com)
    if you don't have a cluster running:
        → if you'd like to create a cluster, run `cortex cluster up --configure-env <clustername>`
        → otherwise you can ignore this message, and prevent it in the future with `cortex env delete <clustername>`
    if you have a cluster running:
        → run `cortex cluster info --configure-env <clustername>` to update your environment (include `--config <cluster.yaml>` if you have a cluster configuration file)
        → if you set `operator_load_balancer_scheme: internal` in your cluster configuration file, your CLI must run from within a VPC that has access to your cluster's VPC (see https://docs.cortex.dev/v/0.23/aws/vpc-peering)
    4 replies
    Madison Bahmer
    Screen Shot 2021-01-20 at 3.46.33 PM.png
    @deliahu Not sure if Gitter supports uploading images to threads -> this is a screenshot of our CloudWatch logs. Note the large spike in the p99 graph, which correlates with a "0" count of responses per minute, which is also darn near close to when the cluster decided to scale from 5 to 7 replicas for a very short period of time.
    the "avg in-flight requests per replica" graph also drops oddly to near zero during that time, which may have caused the autoscaler to drop the replicas all the way back down to the baseline of 3
    David Eliahu
    @madisonb do you mind running cortex cluster info --debug and sending us the debug file to dev@cortex.dev?
    3 replies
    Madison Bahmer
    sure thing, give me a few mins and you'll see it come through
    David Eliahu

    @/all we just released v0.27.0! Here is the full changelog, and here's a summary:

    New features

    • Add new API type TaskAPI for running arbitrary Python jobs (docs) (requested by @jinhyung)
    • Write Cortex's logs as structured logs, and allow use of Cortex's structured logger in predictors (supports adding extra fields) (aws docs, gcp docs) (requested by @madisonb)
    • Support preemptible instances on GCP (docs) (requested by @fandy, @aPaleBlueDot)
    • Support private load balancers on GCP (docs) (requested by @HodorTheCoder)
    • Support GCP instances with multiple GPUs (docs) (requested by @dkashkin)

    Breaking changes

    • cortex logs now streams logs from a single replica at random when there are multiple replicas for an API. The recommended way to analyze production logs is via a dedicated logging tool (by default, logs are sent to CloudWatch on AWS and StackDriver on GCP)

    Bug fixes

    • Misc Python client fixes (reported by @imagine3D-ai)


    • Document the shared /mnt directory for TensorFlow predictors (suggested by @lminer)


    • Improve out-of-memory status reporting (suggested by @sudoPete)
    • Improve batch job cleanup process (suggested by @vackosar)
    • Remove grpc msg send/receive limit (suggested by @lminer)
    Hamza Tahir

    Hey everyone! I tried using the create_api(api_spec=api_config, predictor=PythonPredictor) function from a Python script. The only difference from the normal examples is that rather than importing PythonPredictor with a normal import PythonPredictor, I'm using importlib for unrelated reasons (which is really the same thing).

    However, I am getting the following error after deployment even with a simple predictor:

    Traceback (most recent call last):
      File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/serve/serve.py", line 295, in start_fn
        predictor_impl = api.predictor.initialize_impl(project_dir, client)
      File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/lib/api/predictor.py", line 184, in initialize_impl
        class_impl = self.class_impl(project_dir)
      File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/lib/api/predictor.py", line 254, in class_impl
        "cortex_predictor", os.path.join(project_dir, self.path), target_class_name
      File "/opt/conda/envs/env/lib/python3.6/site-packages/cortex_internal/lib/api/predictor.py", line 275, in _get_class_impl
        raise UserException("unable to load pickle", str(e)) from e
    cortex_internal.lib.exceptions.UserException: error: error in predictor.pickle: unable to load pickle: No module named 'examples'

    The predictor:

    class PythonPredictor:
        def __init__(self, config):
            # Initialize your model here
            print("Got config to be %s" % str(config))

        def predict(self, payload):
            print("Got payload to be %s" % str(payload))
            prediction = ...
            return prediction

    Can anyone help? :-)

    25 replies
    Hamza Tahir
    @deliahu looks like the 'examples' is coming from the fact that that's the root module where the predictor exists
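    That diagnosis matches how pickle works: pickling a class records its `cls.__module__` (here `examples`), and unpickling inside the serving container then tries to import that module, which doesn't exist there. A self-contained sketch of loading a predictor from a file path while controlling the module name it is recorded under (this is plain importlib/pickle behavior, not Cortex's internals):

```python
import importlib.util
import os
import sys
import tempfile

# Write a minimal predictor module to disk (stands in for the user's file).
source = (
    "class PythonPredictor:\n"
    "    def __init__(self, config):\n"
    "        self.config = config\n"
    "    def predict(self, payload):\n"
    "        return payload\n"
)
path = os.path.join(tempfile.mkdtemp(), "predictor.py")
with open(path, "w") as f:
    f.write(source)

# Load the file under an explicit module name (spec -> module -> exec).
spec = importlib.util.spec_from_file_location("predictor", path)
module = importlib.util.module_from_spec(spec)
# Registering the module matters if the class is later pickled: pickle
# records cls.__module__, and unpickling must be able to import that name.
sys.modules[spec.name] = module
spec.loader.exec_module(module)

PythonPredictor = module.PythonPredictor
print(PythonPredictor.__module__)  # → predictor
```

    The key point is that `__module__` comes from how the class's file was loaded, independent of the caller's package, so a class defined under an `examples` package keeps that name when pickled.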
    Just updated to 0.27 from 0.24. I'm using the Batch API, with almost the exact code from the Batch API to submit a job; any thoughts?
    28 replies
    Abdoulaye Faye
    Hi everyone, I would like to know how to use the prediction monitoring features. I already set this up in my cortex.yaml and I can get some metrics when I run "cortex get my_api", but I would like to know how I could retrieve them and build a dashboard on top of them! Thanks! I'm using version 0.20!
    3 replies
    Luke Miner
    What happens if a connection is unexpectedly closed during a request to the realtime api? It seems like the job just dies. Is there some way to add some extra logging in the event of a lost connection?
    3 replies
    hi all - looking to speed up my worker spin-up time using the BatchAPI. I'm already using a custom Docker image, but the workers take approx 5 minutes to spin up. Has anyone got any tips?
    3 replies
    Hamza Tahir
    Is there a full end-to-end example somewhere of using TensorflowPredictor, starting from a Tensorflow SavedModel to testing it using TFExamples?
    7 replies
    Luke Miner
    Is there any recommended documentation for getting cortex to work with cloudwatch? I'm using from cortex_internal.lib.log import logger as cortex_logger and while I can see a cloudwatch dashboard, there isn't a log group in the web console, so I can't access any individual logs.
    7 replies
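    Assuming the structured logger mentioned in the 0.27 changelog follows the standard library's `extra=` convention for its "extra fields", the mechanics can be sketched with plain `logging`; the `JsonFormatter` below is a hypothetical stand-in for illustration, not Cortex's implementation:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line, including fields
    passed via `extra=` (hypothetical stand-in for a structured logger)."""
    def format(self, record):
        out = {"message": record.getMessage(), "level": record.levelname}
        # Fields passed via `extra=` land as attributes on the record.
        for key in ("model", "request_id"):
            if hasattr(record, key):
                out[key] = getattr(record, key)
        return json.dumps(out)

logger = logging.getLogger("predictor")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction served", extra={"model": "resnet50", "request_id": "abc"})
```

    One-JSON-object-per-line output like this is also what log aggregators such as CloudWatch Logs Insights parse most easily, which is why structured logs beat free-form print statements for dashboards.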