    Felix Dittrich
    @felixdittrich92
    Hello :)
    How can I configure the Docker image? I want to use Miniconda as the Python distribution and install my dependencies with an env.yml, so I don't have a requirements.txt. Thanks for your help :)
    Vishal Bollu
    @vishalbollu
    @felixdittrich92 You have a few options:
    • You can list the conda packages in a file called conda-packages.txt; more information here
    • If your conda env.yaml setup is more complicated than just a list of packages, you can specify a bash script, dependencies.sh, to apply your env.yaml, since Cortex images already use conda. Make sure that your env.yaml file is in the same directory as your cortex.yaml file. Create a bash script called dependencies.sh (and make it executable with chmod +x dependencies.sh). In dependencies.sh, add the following line: conda env create -f env.yaml. The dependencies.sh script is executed before your API initializes. Be careful about changing the Python version: Cortex serving code currently uses 3.6.9 and is not tested on other versions of Python.
    • If you want to create your own image, the instructions can be found here
    Let us know how it goes!
    Felix Dittrich
    @felixdittrich92
    Wow, thanks for the very fast answer, I will try it :)
    Vishal Bollu
    @vishalbollu
    @felixdittrich92 a correction: for the second option, the command conda env create -f env.yaml creates a new environment but doesn't activate it, so the Cortex web server won't be using the environment you created. The more accurate command is conda env update -f env.yaml --name env, which updates the existing conda environment used by Cortex (it's called env) rather than creating a new one.
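    For reference, a minimal dependencies.sh along those lines (an untested sketch, assuming env.yaml sits next to cortex.yaml as described above) would be:

    #!/bin/bash
    # update the conda environment used by Cortex (named "env")
    # with the packages declared in env.yaml
    conda env update -f env.yaml --name env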
    Felix Dittrich
    @felixdittrich92
    Nice thanks :)
    wja30
    @wja30

    I have one question from testing Cortex 0.19. After a throughput test, when I run "cortex-dev get", the average latency is 2.9 sec (= 2900 msec). I think the average latency should be lower than 500 msec, but the reported latency is much higher. Do you know why? The experiment configuration is as follows: 1. Throughput test command: python3 ../../utils/throughput_test.py -i 60 -p 4 -t 60  2. cortex-dev get configuration:
    name: image-classifier-resnet50
    kind: RealtimeAPI
    predictor:
      type: tensorflow
      path: predictor.py
      model_path: s3://cortex-examples/tensorflow/resnet50
      processes_per_replica: 4
      threads_per_process: 192
      config:
        classes: https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json
        input_key: input
        input_shape:
        - 224
        - 224
        output_key: output
      image: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/tensorflow-predictor:latest
      tensorflow_serving_image: 392434356039.dkr.ecr.us-east-1.amazonaws.com/cortexlabs/tensorflow-serving-inf:latest
    networking:
      endpoint: /image-classifier-resnet50
      api_gateway: public
    compute:
      cpu: 3
      inf: 1
      mem: 4G
    autoscaling:
      min_replicas: 4
      max_replicas: 4
      init_replicas: 4
      target_replica_concurrency: 768.0
      max_replica_concurrency: 1024
      window: 1m0s
      downscale_stabilization_period: 5m0s
      upscale_stabilization_period: 1m0s
      max_downscale_factor: 0.75
      max_upscale_factor: 1.5
      downscale_tolerance: 0.05
      upscale_tolerance: 0.05
    update_strategy:
      max_surge: 25%
      max_unavailable: 25%

    3. Cluster configuration:

    instance_type: inf1.2xlarge
    min_instances: 4
    max_instances: 4
    ...
    spot: true
    spot_config:
      on_demand_base_capacity: 0
      on_demand_percentage_above_base_capacity: 0

    Thanks in advance.
    wja30
    @wja30
    tps: 79.7 // total # of processed requests = 4782 per 60 seconds
    avg latency: 2900 msec
    Robert Lucian Chiriac
    @RobertLucian

    @wja30 a couple of questions first:

    1. Were there 4 API replicas of this API running when you ran the test?
    2. Your dev box, is it on your laptop or on an instance running in the cloud? The reason I'm asking is that a cloud instance will have much higher network bandwidth. I recommend a dev box in the cloud to perform this test.

    And here are my thoughts:

    1. When the latency is too high, it is generally a result of making too many requests at the same time (overloading the instances), which leads to significantly higher latencies. Even slight overloads will lead to high latencies (anecdotal evidence). Keep in mind that the latency shown in cortex get also counts time spent in the queue.
    2. Try reducing the -t option's value down to 48 or 24. See if this improves the latency and throughput.
    3. 79.7 predictions/s is kinda low. It should be at least ~120 predictions/s per API replica.
    wja30
    @wja30

    @RobertLucian

    1. Were there 4 API replicas of this API running when you ran the test? A) Yes, I set min_replicas: 4 and max_replicas: 4 to prevent autoscaling. Furthermore, the cluster is configured with instance_type: inf1.2xlarge, min_instances: 4, max_instances: 4.
    2. Your dev box, is it on your laptop or on an instance running in the cloud? The reason I'm asking is that a cloud instance will have much higher network bandwidth. I recommend a dev box in the cloud to perform this test. A) All tested instances are located in the AWS cloud in Virginia.

    And here are my thoughts:

    1. When the latency is too high, it is generally a result of making too many requests at the same time (overloading the instances), which leads to significantly higher latencies. Even slight overloads will lead to high latencies (anecdotal evidence). Keep in mind that the latency shown in cortex get also counts time spent in the queue.
    2. Try reducing the -t option's value down to 48 or 24. See if this improves the latency and throughput.
    3. 79.7 predictions/s is kinda low. It should be at least ~120 predictions/s per API replica.

    I will try the configurations you mentioned above and report the results back here. Thanks in advance.

    wja30
    @wja30
    @RobertLucian
    When I tune the process/thread values, the avg latency and tps improve. Thanks
    processes / threads → tps / avg. latency (ms)
    1. 4 / 60 → 80 / 2949
    2. 4 / 48 → 79.3 / 2337
    3. 4 / 24 → 77.9 / 1119
    4. 4 / 12 → 73.7 / 524
    5. 4 / 6 → 68.9 / 223
    :)
    aced125
    @aced125
    Hi guys - I was wondering if there is a way to check how many instances were running for a particular API in the last 24 hours?
    Robert Lucian Chiriac
    @RobertLucian
    @aced125 you can check that in the API's CloudWatch dashboard; the active replicas metric is what you're looking for. There's a guide on this here. By knowing how many replicas fit on an instance, you can then determine how many instances were running in the last 24 hours.
    Robert Lucian Chiriac
    @RobertLucian

    @wja30 so for 4 processes per replica with 6 threads on each process, you were getting a throughput of 68.9 predictions/s with an average latency of 223 ms (as reported by cortex get, which also includes the time spent in the queue). If that's what you were getting on all 4 API replicas, then that's kinda low - should be higher than that. Otherwise, that's an acceptable figure I think.

    I think the latest updates to the Inferentia drivers (neuron-cc and tensorflow-neuron) may have had an impact on Inferentia's performance. Could you re-export the Inferentia resnet50 model using the latest drivers that are on our Cortex images? Once the following dependencies are installed, you can export the model using this notebook:

    pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
    neuron-cc==1.0.18001.0+0.5312e6a21 \
    tensorflow-neuron==1.15.3.1.0.1965.0

    Upload the model to your S3 bucket and then provide its S3 path in this cortex_inf.yaml config file. Retry and see if the throughput/latency improves even further.
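    For example, the relevant part of cortex_inf.yaml would then look roughly like this (s3://your-bucket/resnet50_neuron/ is just a placeholder for wherever you upload the re-exported model):

    predictor:
      type: tensorflow
      path: predictor.py
      model_path: s3://your-bucket/resnet50_neuron/  # placeholder for your own S3 path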

    Also, another thing that might be worth checking out is the CPU usage on your dev box - maybe it's maxing out while the benchmark runs. Could you check that?

    One alternative to our benchmark script is to use the ab (ApacheBench) utility to benchmark your API. The disadvantage of this one is that it doesn't necessarily simulate real-time workloads, since you can't set a cooldown time. You could run it this way (-c is the concurrency level):

    ab -n 10000 -c 24 -p sample.json -T 'application/json' -rks 120 <API-ENDPOINT>
    aced125
    @aced125
    Hi @RobertLucian - The dashboard doesn't seem to show the number of active replicas
    just responses_per_minute, total_in_flight_requests, median_response_time, p99_response_time
    Robert Lucian Chiriac
    @RobertLucian

    @aced125 I think you may be running this on a <0.19 version of Cortex. Is it possible for you to update the Cortex CLI (and the cluster) to the latest version (0.20)?

    If not, then you can add the widget yourself - keep in mind that every time you go back to your console, you'll have to add it back in manually. Go to your dashboard, click on add widget, and in the resource IDs add the following:

    • "api-name"
    • "in-flight"
    • Your API name
    • Your cluster's name (as passed in the cluster.yaml config file)

    With those in, also set the Statistic to Sample Count and the Period to 10 seconds. Type in Active Replicas as the widget's title and then save the widget. That should be it. Let us know if this worked for you.

    wja30
    @wja30

    @RobertLucian

    Q1)so for 4 processes per replica with 6 threads on each process, you were getting a throughput of 68.9 predictions/s with an average latency of 223 ms (as reported by cortex get, which also includes the time spent in the queue). If that's what you were getting on all 4 API replicas, then that's kinda low - should be higher than that. Otherwise, that's an acceptable figure I think.

    A1) The 4 processes and 6 threads refer to the benchmark command "python3 ../../utils/throughput_test.py -i 60 -p 4 -t 6" (i.e. "-p 4 -t 6"), not to the Cortex configuration. The replica configuration stays as follows: (processes_per_replica: 4, threads_per_process: 192 in the cortex file)

    Q2)Also, another thing that might be worth checking out is the CPU usage on your dev box - maybe it's maxing out while the benchmark runs. Could you check that?

    A2) When I tune the process/thread values in (python3 ../../utils/throughput_test.py -i 60 -p 4 -t 60), the avg latency and tps improve. The CPU and memory utilization are as follows:
    processes / threads → tps / avg. latency (ms) → node CPU avg. usage / node memory avg. usage

    1. 4 / 60 → 80 / 2949 → 100 / 41.5
    2. 4 / 48 → 79.3 / 2337 → 101.5 / 39
    3. 4 / 24 → 77.9 / 1119 → 99.5 / 38.75
    4. 4 / 12 → 73.7 / 524 → 97.5 / 40
    5. 4 / 6 → 68.9 / 223 → 85.5 / 41.75

    The other recommendations (updating the Inferentia drivers and the ab benchmark) will also be helpful to me.
    I will take care of my busy work right away, then check both recommendations and report back.

    wja30
    @wja30
    ref: the unit of node CPU avg. usage and node memory avg. usage is %
    aced125
    @aced125
    thanks, will upgrade!
    Shiva Manne
    @manneshiva
    Hi guys! I recently stumbled upon Cortex and I am pretty excited to use it. I have a quick question before getting started: does Cortex batch incoming requests for inference (on the same worker/machine)? If yes, what strategy does it use?
    Shiva Manne
    @manneshiva
    For example, one batching strategy could be to wait for 200 ms, collect all requests that hit the worker/machine within this "wait" period (or until a MAX_BATCH_SIZE number of requests is reached), run prediction on the whole batch, and return the results. I have seen this to be much more efficient than predicting sequentially (with a large number of incoming requests per second). How does Cortex handle this?
    Shiva Manne
    @manneshiva
    This seems to be supported for the "TensorFlow Predictor" (https://docs.cortex.dev/deployments/realtime-api/parallelism#server-side-batching) but not for the "Python Predictor". Is there any workaround possible for the "Python Predictor"?
    Vaclav Kosar
    @vackosar
    @RobertLucian does on_demand_base_capacity in https://docs.cortex.dev/cluster-management/spot-instances#example-spot-configuration configure the "base capacity" per API, or is it cluster-wide? That is, if I configure "on_demand_base_capacity: 1", will I get 1 on-demand instance per API or per whole cluster?
    Robert Lucian Chiriac
    @RobertLucian
    @vackosar on_demand_base_capacity is a cluster-wide setting. On the other hand, you can give your API a minimum number of replicas via autoscaling.min_replicas (and autoscaling.init_replicas) to keep at all times, regardless of the traffic your API is getting.
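    To illustrate where each setting lives (the values here are made up): on_demand_base_capacity goes in the cluster configuration and applies to the whole cluster, while min_replicas/init_replicas go in the API's cortex.yaml:

    # cluster configuration (cluster-wide): keep at least 1 on-demand instance
    spot: true
    spot_config:
      on_demand_base_capacity: 1
      on_demand_percentage_above_base_capacity: 0

    # cortex.yaml (per API): keep at least 2 replicas of this API at all times
    autoscaling:
      min_replicas: 2
      init_replicas: 2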
    Robert Lucian Chiriac
    @RobertLucian

    @manneshiva server-side batching is indeed only supported for the TensorFlow Predictor out of the box.

    For the Python Predictor, there is a workaround; the recommendation is to set the processes_per_replica field to 1. You could still set it to a value higher than 1, but the higher the processes_per_replica value, the lower the likelihood that enough requests land on the same process of an API replica to fill a batch, because requests are randomly distributed across all processes on an API replica.

    Here's a template of how you could implement server-side batching on the Python Predictor. This hasn't been tested, so let me know if you encounter any issues down the road:

    import threading as td
    import time

    class PythonPredictor:
        def __init__(self, config):
            self.model = None  # initialize the model here

            # gates predict() while a batch is being assembled/processed
            self.waiter = td.Event()
            self.waiter.set()

            self.batch_max_size = config["batch_max_size"]
            self.batch_interval = config["batch_interval"]  # measured in seconds
            # one party per request in a full batch, plus the batching thread itself
            self.barrier = td.Barrier(self.batch_max_size + 1)

            self.samples = {}
            self.predictions = {}
            td.Thread(target=self.batch_inference, daemon=True).start()

        def batch_inference(self):
            while True:
                try:
                    # wait until batch_max_size requests have arrived,
                    # or until batch_interval seconds have passed
                    self.barrier.wait(self.batch_interval)
                except td.BrokenBarrierError:
                    pass  # timed out: process whatever samples have accumulated

                self.waiter.clear()  # hold back new requests while processing
                batch = self.samples
                self.samples = {}

                # batch process `batch` with self.model here and store the results
                # in self.predictions, keyed by the same thread IDs (tID)

                self.barrier.reset()  # repair the barrier if the wait timed out
                self.waiter.set()

        def predict(self, payload):
            tID = td.get_ident()

            self.waiter.wait()
            self.samples[tID] = payload
            try:
                self.barrier.wait()  # signal that this request's sample is ready
            except td.BrokenBarrierError:
                pass  # the batch timed out and is already being processed

            # wait for the batching thread to publish this request's prediction
            while tID not in self.predictions:
                time.sleep(0.001)

            return self.predictions.pop(tID)
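    For reference, the batch_max_size and batch_interval values read in __init__ above would come from the predictor's config section in cortex.yaml. A rough sketch (the values are made up, and threads_per_process presumably needs to be at least batch_max_size so a full batch can actually accumulate):

    predictor:
      type: python
      path: predictor.py
      processes_per_replica: 1  # as recommended above
      threads_per_process: 32   # should be >= batch_max_size so the batch can fill up
      config:
        batch_max_size: 8
        batch_interval: 0.1     # seconds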
    Shiva Manne
    @manneshiva
    @RobertLucian thanks for the prompt reply, will definitely test this out and let you know how it goes.
    Robert Lucian Chiriac
    @RobertLucian
    @manneshiva thank you! We've also added a ticket to track this cortexlabs/cortex#1470.
    Shiva Manne
    @manneshiva
    awesome! will try and create a PR for the same if I manage to get this running. Thanks again!
    Robert Lucian Chiriac
    @RobertLucian
    @wja30 I see, so that was the benchmark for a single API replica. In that case, it isn't that bad, though not great either. Looking at your answer to the second question (Q2), I see that, for your dev box, the (node CPU avg. usage / node memory avg. usage) stats look pretty high. I think your dev box is maxed out, which explains the throughput you're getting. I recommend bumping up the specs of your dev box and/or trying this out with the ab (ApacheBench) utility (since it is easier on the resources).
    @manneshiva that sounds awesome! Please keep us posted :)
    Robert Lucian Chiriac
    @RobertLucian

    @wja30 I tested the ResNet50 image classifier example (the one you are also trying), and with the benchmarking tool we have, I got about 400 predictions/second/API replica while the API replica's node had its CPU usage at about 70-80% - enough headroom to go even further. The limitation was probably on my local machine.

    As for testing it with the ab tool, I added a PR cortexlabs/cortex#1472 to make that clearer in the README. You can try testing with the sample payload sample.bin. This is now available on the master branch of Cortex. With the ab tool, I also got about 400 predictions/second/API replica.

    lc-ted
    @lc-ted
    Hi there, is there any way to set the Python version Cortex uses? I'm trying to use Hugging Face's transformers library, but it isn't being found despite my installing it for both python3.6 and python3.7. python3.6 was my default, but since installing Cortex I have changed it to python3.7. Any help would be great.
    David Eliahu
    @deliahu
    @lc-ted It might be possible by setting the python version in conda-packages.txt (here are the docs for using conda-packages.txt, and I believe one of our users had success with defaults::python=3.7.7=hcf32534_0_cpython). However we have not tested extensively on 3.7, so let us know if you run into any issues.
    Also, since API initialization will take longer due to installing Python 3.7, you may want to consider pre-building your Docker base image (docs)
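    For reference, the conda-packages.txt pin mentioned above is just a one-line spec in conda's channel::package=version=build format:

    defaults::python=3.7.7=hcf32534_0_cpython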
    David Eliahu
    @deliahu
    Does the transformers library / the rest of your Predictor implementation rely on 3.7?
    wja30
    @wja30
    @RobertLucian Thanks for the advice. Based on it, I will conduct additional experiments. Thanks!
    lc-ted
    @lc-ted
    @deliahu Neither the transformers library nor my implementation necessarily relies on 3.7, but it would be nice. That being said, I have the transformers library installed in 3.6 as well, which is why it's confusing to me. I'll fiddle around with it and let you know if anything comes up.
    wja30
    @wja30
    @RobertLucian To improve the tps and avg. latency on the inf instance, I changed the model_path from "model_path: s3://cortex-examples/tensorflow/resnet50" to "s3://cortex-examples/tensorflow/resnet50_neuron", and now the tps = 200 and the avg. latency is close to 50 ms. Thanks very much.

    I have an additional question. On GPU, to use server-side batching, I added the following item to the cortex file:

    server_side_batching:
      max_batch_size: 32
      batch_interval: 0.1s

    On inf, simply adding the same item did not make server-side batching work, and a 500 error was returned. So I looked into it.

    As in the "https://github.com/cortexlabs/cortex/blob/master/examples/tensorflow/image-classifier-resnet50/cortex_inf_server_side_batching.yaml" file, I saw that the model path should be changed to model_path: s3://cortex-examples/tensorflow/resnet50_neuron_batch_size_5. To change the batch size for inf to 2, 4, 8, or 16, do I have to make a separate model for each? If so, is there a guide on how to make them?

    lc-ted
    @lc-ted
    @deliahu I was able to work around the issue by adding a requirements.txt file next to the yaml, and I have a path forward on resolving it properly; I'll address it via conda and Docker directly. As an FYI, adding a conda-lists.txt and setting the default to 3.7.x did not help me; it would download 3.7 every time I tried to deploy and would still fail to find the right library. Thanks for your help!
    Robert Lucian Chiriac
    @RobertLucian

    @wja30

    To improve the tps and avg. latency on the inf instance, I changed the model_path from "model_path: s3://cortex-examples/tensorflow/resnet50" to "s3://cortex-examples/tensorflow/resnet50_neuron", and now the tps = 200 and the avg. latency is close to 50 ms. Thanks very much.

    Yes, that's the model it has to use when Inferentia instances are used: the s3://cortex-examples/tensorflow/resnet50_neuron model. Just to make this clear, when Inferentia instances are used, cortex_inf.yaml is the API config you have to deploy. 200 predictions/s is still not very high - you can push that to 400-500 predictions/s per API replica using the provided example.

    I have an additional question. On GPU, to use server-side batching, I added the following item to the cortex file: server_side_batching: max_batch_size: 32, batch_interval: 0.1s. On inf, simply adding the same item did not make server-side batching work, and a 500 error was returned.

    If you use a GPU instance and you deploy using cortex_gpu_server_side_batching.yaml, it will work.
    Likewise, if you use Inferentia instances, you need to deploy using this cortex_inf_server_side_batching.yaml API config.
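    Putting those together, the Inferentia variant pairs the server_side_batching block with a model compiled for that batch size; the linked cortex_inf_server_side_batching.yaml uses the batch-size-5 model, so the predictor section looks roughly like this (a sketch, not the exact file contents; max_batch_size presumably has to match the batch size the model was compiled for):

    predictor:
      type: tensorflow
      path: predictor.py
      model_path: s3://cortex-examples/tensorflow/resnet50_neuron_batch_size_5
      server_side_batching:
        max_batch_size: 5
        batch_interval: 0.1s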

    To change the batch size for inf to 2, 4, 8, or 16, do I have to make a separate model for each? If so, is there a guide on how to make them?

    Check out the Exporting SavedModels section. If you want to run a model compiled for different batch sizes, you need to run the run_all script from this directory.
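    Presumably, that script compiles the SavedModel once per batch size. A rough, untested sketch of what a single compilation step might look like (assuming the tensorflow.neuron compile API from the driver versions mentioned earlier; the output directory name is just illustrative):

    import tensorflow.neuron as tfn

    batch_size = 8  # repeat for each batch size you want to serve (2, 4, 8, 16, ...)
    tfn.saved_model.compile(
        "resnet50",                                          # regular SavedModel directory
        "resnet50_neuron_batch_size_{}".format(batch_size),  # compiled output directory
        batch_size=batch_size,
        dynamic_batch_size=True,
    )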

    Robert Lucian Chiriac
    @RobertLucian

    @lc-ted

    As an FYI, adding a conda-lists.txt and setting the default to 3.7.x did not help me; it would download 3.7 every time I tried to deploy and would still fail to find the right library. Thanks for your help!

    When you say that it would still fail to find the right library, is that regarding the python binary or about the transformers package when loading the Python Predictor (aka getting an import error)?

    As for the Python package, could you also try with conda-forge::python=3.7? When a switch in the Python version occurs, you should notice the following logs from the API deployment:

    warning: you have changed the Python version from 3.6 to 3.7; this may break Cortex's web server
    reinstalling core packages ...
    wja30
    @wja30
    @RobertLucian thank you ^^ I'll try it!