    Max Halford
    @MaxHalford
    Hey there! I'm one of the creators of River (https://riverml.xyz/latest/), which is a Python library for online machine learning. Basically this involves models that can learn from one sample at a time. We're at a point in time where we want to offer a canonical way for users to deploy online models. I really like what Cortex has done for inference, and I was wondering if there was any interest in extending it to online updating of models? I know this is a bit of a stretch and it's most certainly out of scope, but I'd rather ask before reinventing the wheel.
    5 replies
    Jake Cowton
    @JakeCowton
    Hey all, I've had some success with Cortex over the past couple of weeks, especially with the update allowing it to be deployed into an existing VPC.
    Is there a way to assign additional security groups to the cluster?
    5 replies
    jgingerexscientia
    @jgingerexscientia
    I'm trying to deploy Cortex in CI and I'm getting the following error after running yes | cortex env configure prod -p aws -o https://xxxxxx.elb.eu-west-1.amazonaws.com -k $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY:
    aws secret access key:
    error: file descriptor 0 is not a terminal
    9 replies
    David Eliahu
    @deliahu

    @/all we just released v0.28.0! Here is the full changelog, and here's a summary:

    New features

    • Support installing Cortex on an existing Kubernetes cluster (on AWS or GCP) (docs)

    Breaking changes

    • The CloudWatch dashboard has been removed as a result of our switch to Prometheus for metrics aggregation. The dashboard will be replaced with an alternative in an upcoming release.

    Bug fixes

    • Fix bug which can cause requests to APIs from a Python client to timeout during cluster autoscaling (reported by @madisonb)
    • Fix bug which can cause downscale_stabilization_period to be disregarded during downscaling

    Misc

    • AWS credentials are no longer required to connect the CLI to the cluster operator. If you need to restrict access to your cluster operator, configure the operator's load balancer to be private by setting operator_load_balancer_scheme: internal in your cluster configuration file, and set up VPC peering. We plan to support a new auth strategy in an upcoming release.
    2 replies
    Madison Bahmer
    @madisonb
    Nice! Thanks @deliahu for all the work you guys continue to put into this
    Completely unrelated question: I'm trying to use the "Chaining APIs" setup to have endpoints that call other endpoints. I think this only works when actually deployed to a real cluster, not in a local environment, correct? Is there an internal shortcut I can use to do local testing as well since http://api-<api_name>:8888/predict is not the correct endpoint for my local replicas?
    2 replies
    Madison Bahmer
    @madisonb
    Most of the time it's just http://localhost:<port> for the various APIs called. But if there is a way I don't have to use an if/else or another cortex get <api> command inside of my replica to do local vs cluster testing, that's my thought process here
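
    As a sketch of the kind of shortcut discussed here: resolve the chained-API endpoint in one helper so the if/else lives in a single place. The environment variable name (CHAIN_LOCAL) and the local-port default are illustrative assumptions, not Cortex settings; only the in-cluster http://api-<api_name>:8888/predict URL is from the thread.

    ```python
    import os

    def chained_api_url(api_name: str, local_port: int = 8888) -> str:
        """Return the endpoint for a chained API in local or cluster mode."""
        if os.environ.get("CHAIN_LOCAL") == "1":
            # Local replicas are reached on localhost at their published port.
            return f"http://localhost:{local_port}/predict"
        # In-cluster DNS name used by the "Chaining APIs" setup.
        return f"http://api-{api_name}:8888/predict"
    ```

    Each replica then calls chained_api_url("my-api"), and only the environment variable differs between local and cluster runs.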
    Alex Malone
    @WebPerfTest_twitter
    Hey all, is it possible to run cortex on an EKS optimized AMI that doesn't support GPUs? Currently we have no requirement for a GPU and would prefer not to use it in order to save money.
    Robert Lucian Chiriac
    @RobertLucian

    @WebPerfTest_twitter if you don't spin up a cluster with GPU instances, it won't use GPUs and AWS won't charge you for them. You are only charged for the resources you use. Check out the instance_type field on the Cortex Cloud on AWS page.

    You could even subscribe to the EKS-optimized AMI with GPU support; as long as you don't specify GPU instances, you won't be billed for GPU capacity.

    All you have to do is specify a CPU-only instance type (e.g. c5.xlarge) in your cluster config and you should be good to go.
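
    For reference, the relevant part of the cluster configuration might look like this minimal sketch; only the instance_type field is taken from the discussion, and the comment reflects the advice above:

    ```yaml
    # cluster.yaml (sketch)
    # A CPU-only instance type keeps GPU capacity (and its cost) out of the cluster.
    instance_type: c5.xlarge
    ```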

    Alex Malone
    @WebPerfTest_twitter
    @RobertLucian Excellent, thank you!
    Vaclav Kosar
    @vackosar
    Cortex 0.25
    ValueError: sleep length must be non-negative
      File "threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "batch.py", line 86, in renew_message_visibility
        time.sleep((cur_time + interval) - time.time())
    1 reply
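
    The traceback above suggests the remaining visibility interval went negative before time.sleep was called. A hedged sketch of the usual fix, clamping the sleep at zero (the function name is illustrative, not Cortex's actual code):

    ```python
    import time

    def sleep_until(deadline: float) -> None:
        # time.sleep raises ValueError for negative arguments, so clamp the
        # remaining time at zero in case the deadline has already passed.
        time.sleep(max(0.0, deadline - time.time()))
    ```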
    Cortex 0.25
    ClientError: An error occurred (InvalidParameterValue) when calling the DeleteMessage operation: Value <XXXXXXXXX> for parameter ReceiptHandle is invalid. Reason: The receipt handle has...
      File "batch.py", line 330, in <module>
        start()
      File "batch.py", line 326, in start
        sqs_loop()
      File "batch.py", line 190, in sqs_loop
        handle_batch_message(message)
      File "batch.py", line 230, in handle_batch_message
        sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
      File "botocore/client.py", line 337, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "botocore/client.py", line 656, in _make_api_call
        raise error_class(parsed_response, operation_name)
    1 reply
    Vaclav Kosar
    @vackosar
    Cortex 0.25
    job_json {"kind": "errors.unexpected", "message": "unable to find log stream named '<XXXX>_operator' in log group <XXXX>/<XXXX>"}
    1 reply
    Madison Bahmer
    @madisonb

    Hey folks - another question from me again :)

    Our Cortex endpoints would like to use custom caching or config fetches from other machines. Think of something like Redis deployed on an EC2 instance with VPC peering set up. I have followed this guide, but I can't seem to reach my EC2 instance from Cortex (EC2 -> Cortex works fine). I was wondering if there is an additional step needed to allow two-way communication.

    cortex v0.23

    4 replies
    Madison Bahmer
    @madisonb
    Thanks, I’ll move over to slack for questions in the future. Happy to know that the typical vpc peering setup works for a normal setup, I’ll check the inbound connections again and see if I missed something!
    David Eliahu
    @deliahu

    @/all we just released v0.29.0! Here is the full changelog, and here's a summary:

    New features

    • Add Grafana dashboard for APIs (docs)
    • Support API autoscaling in GCP clusters (docs)
    • Support traffic splitting in GCP clusters (docs)

    Breaking changes

    • The default Docker images for APIs have been slimmed down to not include packages other than what Cortex requires to function. Therefore, when deploying APIs, it is now necessary to include the dependencies that your predictor needs in requirements.txt (docs) and/or dependencies.sh (docs).

    Bug fixes

    • Support empty directory objects for models saved in S3/GCS (reported by @htahir1)
    • Disable dynamic batcher for TensorFlow predictor type
    • Fix bug which prevented Task APIs on GCP from being cleaned up after completion

    Docs

    • Add documentation for using a version of Python other than the default via dependencies.sh (docs) or custom images (docs)

    Misc

    • Support deploying predictor Python classes from more environments (e.g. from separate Python files, AWS Lambda) (suggested by @htahir1 and @imagine3D-ai)
    @/all also, please note that we are transitioning our community from Gitter to Slack. Here is a link to join: community.cortex.dev
    ekCSU
    @ekCSU
    Hey, in v0.29 some Python packages, including OpenCV, are no longer preinstalled as before. They have to be added to the requirements file, but when I add "opencv-python" to "requirements.txt" I get the following error on "import cv2". Could you please help? Thanks.
    ImportError: libGL.so.1: cannot open shared object file
    5 replies
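
    This libGL error usually means the OpenCV wheel needs system libraries that slimmed-down images no longer ship. One hedged workaround, assuming a Debian-based API image, is to install them via the dependencies.sh mechanism mentioned in the v0.29 notes (the exact package names below are an assumption to verify against your base image):

    ```shell
    # dependencies.sh (sketch; Debian-based image assumed)
    apt-get update && apt-get install -y --no-install-recommends libgl1 libglib2.0-0
    ```

    Alternatively, listing opencv-python-headless instead of opencv-python in requirements.txt avoids the libGL dependency entirely.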
    thomfoster
    @thomfoster
    Hi guys! I'm using the Batch API and love the interface, but it's taking ages for jobs to start running. Checking the logs, it seems that most of this time is spent downloading my custom Python serving image, which is hosted in the same region as the cluster. It takes over 5 minutes to download the image, which seems ridiculous to me. Is this normal? Is there any way to speed this up? Can I ensure that instances keep the image cached?
    6 replies
    David J. Garcia
    @anduill
    .
    Vishal Bollu
    @vishalbollu
    @anduill is there a specific message you are interested in getting a response to? We have completed our transition from Gitter to Slack. Here is a link to join: community.cortex.dev. Feel free to reach out.
    David J. Garcia
    @anduill
    @vishalbollu oh, great! I have had some problems with VPC peering. I'll post in Slack
    @vishalbollu I think I have to be invited to join
    is that true
    there is probably something dumb I'm missing here
    Vishal Bollu
    @vishalbollu
    I don't think you need to be invited, but I believe you have to create a Slack login for the community. What kind of error are you encountering?
    David J. Garcia
    @anduill
    @vishalbollu does VPC peering only work with v0.28+?
    Vishal Bollu
    @vishalbollu
    I believe it should be supported for older versions of Cortex as well. Try going to https://docs.cortex.dev/ and selecting the version of Cortex you are interested in, then search for VPC peering to find the relevant docs for that version. If there are no VPC peering docs for your version, let us know.
    Ryan Frenz
    @rfrenz-avio

    Hi all - I'm using Cortex 0.29 to do batch prediction using a PythonPredictor and Tensorflow. It's working fine on the CPU with basic requirements.txt, but now I'm trying to use the GPU and getting errors in the worker setup logs:

    2021-03-02 19:18:05.940507: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
    Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
    2021-03-02 19:18:05.940796: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
    2021-03-02 19:18:05.945435: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
    2021-03-02 19:18:05.945682: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    This is with 'tensorflow-gpu' in requirements and the 'EKS-optimized AMI with GPU Support'
    instance type p3.2xlarge, us-east-2
    I've played with various versions of tensorflow-gpu with slightly different dynamic lib errors, but same overall result
    Any ideas on what to check or try is appreciated, thanks!
    jgingerexscientia
    @jgingerexscientia
    I noticed you removed Cortex Core; what is the reasoning behind this?
    Vishal Bollu
    @vishalbollu

    For this explanation, I'm going to refer to what you've called Cortex Core (the ability to install Cortex on your own cluster) as Cortex BYOCluster, to be more explicit. I'll refer to the main version of Cortex, which provisions its own cluster, as Cortex Managed.

    BYOCluster is more flexible because it can be installed on your own EKS/GKE cluster. That flexibility comes at the cost of losing some cluster-aware functionality, such as automatic GPU/ASIC setup, spot instances, and upcoming features like support for multiple instance types. We removed Cortex BYOCluster because it makes it harder for us to support and build some of the more complicated cluster-aware features when the cluster is no longer under Cortex's control.

    Rather than providing a product that is more flexible but supports only a subset of the current and upcoming features, we leaned towards improving Cortex Managed so that it integrates more easily into existing devops workflows. We could maintain two separate products, but focusing on a single one (a managed cluster optimized for model inference that makes at-scale model deployment and management in production easy and fits into your devops stack) appealed to us more.

    Given that you’ve used Cortex Managed, it would be great to hear your thoughts on Cortex BYOCluster and Cortex Managed. Feel free to reach out to me at vishal@cortexlabs.com if you would like to have a chat.

    Irineu Licks Filho
    @irineul
    Hello, do we have an updated diagram of the cloud architecture for Cortex? I think the one I had was deleted from the docs.
    Rocco
    @rcammisola
    @vishalbollu do you think CortexBYO will ever come back?
    1 reply
    Hi, I'm wondering if anyone has solid advice for how to set up first-line support for a Cortex cluster. It seems like we could set some CloudWatch alarms based on certain metrics, but is there any best practice/guidance on what to alarm on, or on what can be done to recover a non-responding cluster? (Given the solidity of Cortex clusters it feels like bringing one down would be quite an impressive feat, but we would like to have a plan in place just in case.)
    1 reply
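
    One lightweight pattern for the alerting half of this question, sketched below: define CloudWatch alarm parameters for a load-balancer metric and create the alarm with boto3. Every value here (alarm name, namespace, metric, threshold, period) is an illustrative assumption to adapt, not a Cortex-recommended setting.

    ```python
    # Sketch: alarm on elevated 5xx responses at the API load balancer.
    alarm_params = {
        "AlarmName": "cortex-api-5xx-errors",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Statistic": "Sum",
        "Period": 300,                # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 10,              # alert on more than 10 server errors per window
        "ComparisonOperator": "GreaterThanThreshold",
    }

    # With boto3 installed and AWS credentials configured, create the alarm via:
    #   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
    ```

    The same shape works for other first-line signals (request latency, healthy host count); recovery of a non-responding cluster is a separate question from alerting.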
    Miguel Varela Ramos
    @miguelvr
    @/all Cortex Gitter support is now deprecated. Please use our Slack channel instead
    Rocco
    @rcammisola
    I can't seem to join the Slack channel
    2 replies
    Christopher Shelley
    @basiclaser
    Hello, I've noticed that my Cortex answer-generator example (v0.18) has been "updating" for 30 minutes; is this typical? Running on a 2017 MacBook Pro
    (it's never run before on this machine, first try)
    David Eliahu
    @deliahu
    As a reminder, please note that we have now transitioned our community from Gitter to Slack. Here's a link to join: https://community.cortex.dev