    Ke Zhang
    @linkerzhang
    About how to define a function, see https://github.com/onnx/onnx/blob/master/onnx/defs/function.h#L61 (I added an example as a comment). One of my teammates is working on another PR to move MeanVarianceNormalization from the experimental ops to a function.
    Will let you know.
    It's OK to have the customized-layer discussion in this room too.
    Danilo Pau
    @danilopau
    @linkerzhang thanks so much, very glad to hear that. MeanVarianceNormalization as an example of a custom layer / function made of basic operators is great, but it requires just one input tensor and one output. It would also be great to have more complex examples with more than one input tensor and more than one output, where the number of inputs is not equal to the number of outputs.
    Eran Geva
    @MrGeva
    I think we should add attributes to tensors in order to store quantized models. This will allow us to store the quantization params, such as scale and offset.
    Ke Zhang
    @linkerzhang
    @MrGeva the current design is to have quantization schema as a sub-graph (Function) and have quantization params (scale/offset) as inputs/outputs of the sub-graph (Function).
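    A minimal illustrative sketch of that idea (not the actual design under discussion): scale and zero_point exposed as plain inputs of a small quantize/dequantize sub-graph built with onnx.helper. All names, shapes and the opset below are hypothetical.

    import onnx
    from onnx import TensorProto, helper

    # scale and zero_point are ordinary graph inputs, so the quantization params
    # travel with the (sub-)graph instead of being new tensor attributes.
    qdq_graph = helper.make_graph(
        nodes=[
            helper.make_node("QuantizeLinear", ["x", "scale", "zero_point"], ["x_q"]),
            helper.make_node("DequantizeLinear", ["x_q", "scale", "zero_point"], ["x_dq"]),
        ],
        name="quantize_subgraph",
        inputs=[
            helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 3, 224, 224]),
            helper.make_tensor_value_info("scale", TensorProto.FLOAT, []),
            helper.make_tensor_value_info("zero_point", TensorProto.UINT8, []),
        ],
        outputs=[helper.make_tensor_value_info("x_dq", TensorProto.FLOAT, [1, 3, 224, 224])],
    )
    model = helper.make_model(qdq_graph, opset_imports=[helper.make_opsetid("", 13)])
    onnx.checker.check_model(model)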
    Danilo Pau
    @danilopau

    @linkerzhang @MrGeva
    Hello, I would like you to consider attaching to each tensor to be quantized the following info, which is easy to collect by running inference on customer data:
    Activations:
        output layer name: dense_1
        min value, max value, sigma, average

    Weights:
        weights layer name: dense_1
        min value, max value, sigma, average

    Bias:
        bias layer name: dense_1
        min value, max value, sigma, average
    With this info, quantization can happen after the ONNX import phase.
    What do you think?
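    A minimal sketch of how such per-tensor statistics could be collected with onnxruntime and numpy, assuming the tensors of interest have been exposed as extra graph outputs; the model path and the calibration_batches iterable are hypothetical.

    import numpy as np
    import onnxruntime as ort

    # 'augmented_model.onnx' is assumed to expose the tensors of interest
    # (e.g. the dense_1 output, weights, bias) as extra graph outputs;
    # calibration_batches is a hypothetical iterable of numpy input batches.
    session = ort.InferenceSession("augmented_model.onnx")
    input_name = session.get_inputs()[0].name
    stats = {}

    for batch in calibration_batches:
        for meta, value in zip(session.get_outputs(), session.run(None, {input_name: batch})):
            s = stats.setdefault(meta.name, {"min": np.inf, "max": -np.inf, "sum": 0.0, "sumsq": 0.0, "n": 0})
            s["min"] = min(s["min"], float(value.min()))
            s["max"] = max(s["max"], float(value.max()))
            s["sum"] += float(value.sum(dtype=np.float64))
            s["sumsq"] += float((value.astype(np.float64) ** 2).sum())
            s["n"] += value.size

    # Per-tensor min, max, average and sigma, ready to be attached to the model.
    for name, s in stats.items():
        average = s["sum"] / s["n"]
        sigma = float(np.sqrt(s["sumsq"] / s["n"] - average ** 2))
        print(name, s["min"], s["max"], average, sigma)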

    Ran Cohen
    @RanACohen

    @danilopau I like where you are going with this and would like to take it further.

    I would like to attach to each intermediate tensor (ValueInfoProto) a list of attributes (same as the op attributes).
    Then we can 'standardize' attributes for several phases of quantization:

    1. Before quantization (like your proposal, or even adding histograms)
    2. After quantization (where we will add scale and offset to translate back to the original values)
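    ValueInfoProto carries no attribute list today, so purely as an illustration the proposed per-tensor annotations could be stashed in the model-level metadata_props, keyed by tensor name; the path, keys and values below are hypothetical.

    import json
    import onnx

    model = onnx.load("model.onnx")  # hypothetical path

    # Hypothetical annotations for one tensor, covering both proposed phases:
    # pre-quantization statistics (could also hold a histogram) and the
    # post-quantization scale/offset needed to recover the original values.
    annotations = {
        "dense_1_output": {
            "pre_quant": {"min": -3.2, "max": 4.1, "sigma": 0.9, "average": 0.1},
            "post_quant": {"scale": 0.0287, "zero_point": 112},
        },
    }

    entry = model.metadata_props.add()  # StringStringEntryProto on ModelProto
    entry.key = "quantization_annotations"
    entry.value = json.dumps(annotations)
    onnx.save(model, "model_annotated.onnx")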
    Ran Cohen
    @RanACohen
    @MrGeva Yes!
    @linkerzhang This is separate from the fact that we need to enable tracking of where those quantized values came from. That will allow us to debug and restore the original values (up to the added quantization noise).
    Danilo Pau
    @danilopau
    @RanACohen that would be great indeed!
    Naman Maheshwari
    @namanmaheshwari
    Hi All, I'm trying to do some quantization analysis for some of the DNN models in ONNX and need to access the network parameters for the same. I see that the inputs, weights and biases are stored in the raw_data variable in model.graph.initializer. However, when I try to print it, it shows up in some incomprehensible format. Is there a method which I can use to parse it to float? Please let me know if you need any more information. Any help would be greatly appreciated.
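    A minimal sketch of one way to decode those initializers into numpy arrays with onnx.numpy_helper; the model path is hypothetical.

    import onnx
    from onnx import numpy_helper

    model = onnx.load("model.onnx")  # hypothetical path

    # to_array decodes both raw_data and the typed *_data fields into numpy arrays.
    for initializer in model.graph.initializer:
        array = numpy_helper.to_array(initializer)
        print(initializer.name, array.dtype, array.shape)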
    Jiong Gong
    @jgong5
    @RanACohen @danilopau I like your idea to annotate the graph with statistical info of tensors. This allows more freedom for backend optimizers to choose the right quantization approaches. But I think these annotations are better bound to ops instead of tensors. Tensors may have aliases, but their statistical info cannot be shared. Adding this info to the op attributes (NodeProto.attribute) looks more straightforward. What do you think?
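    A purely illustrative sketch of attaching such statistics as extra node attributes with onnx.helper; note that the ONNX checker validates attributes against each op's schema, so unknown attributes on standard ops may be rejected. The attribute names and values are hypothetical.

    from onnx import helper

    node = helper.make_node("Conv", inputs=["x", "w"], outputs=["y"], name="conv_0")

    # Hypothetical statistics for the node's output, attached as extra attributes.
    node.attribute.extend([
        helper.make_attribute("output_min", -2.5),
        helper.make_attribute("output_max", 3.1),
        helper.make_attribute("output_sigma", 0.7),
        helper.make_attribute("output_average", 0.05),
    ])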
    Jiong Gong
    @jgong5
    @RanACohen @danilopau On the statistical info to be annotated, we should also allow finer-grained quantization schemes, e.g. supporting a different scale for each output channel of a convolution filter. In general, statistical info along a particular axis of a tensor should be supported.
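    A minimal numpy sketch of per-output-channel statistics for a convolution weight laid out as (out_channels, in_channels, kH, kW); the weight array is hypothetical.

    import numpy as np

    weight = np.random.randn(64, 32, 3, 3).astype(np.float32)  # hypothetical conv filter

    # Reduce over every axis except the output-channel axis (axis 0),
    # giving one min/max (and hence one scale) per output channel.
    per_channel_min = weight.min(axis=(1, 2, 3))
    per_channel_max = weight.max(axis=(1, 2, 3))
    per_channel_scale = (per_channel_max - per_channel_min) / 255.0  # e.g. for uint8
    print(per_channel_scale.shape)  # (64,)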
    Ran Cohen
    @RanACohen
    @jgong5 The actual statistical data is on the tensors. When an op has one output (most of them) this is equivalent, but during graph transformations ops get fused and the only 'consistent' entity is the connecting tensor. Also, what happens with ops with multiple outputs (e.g. Split)?
    @jgong5 Statistical info can also be extended to a histogram per tensor / per feature map to allow for non-linear quantization or smart threshold quantization ranges...
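    A minimal numpy sketch of collecting a histogram per tensor, which a backend could later use to pick a saturation threshold instead of the raw min/max; the activation tensor is hypothetical.

    import numpy as np

    activation = np.random.randn(1, 64, 56, 56).astype(np.float32)  # hypothetical tensor

    # A 2048-bin histogram over the observed range; a calibrator can later choose a
    # saturation threshold from it (e.g. by minimizing KL divergence) instead of
    # clipping at the raw min/max.
    counts, bin_edges = np.histogram(activation, bins=2048)
    print(counts.shape, bin_edges[0], bin_edges[-1])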
    Danilo Pau
    @danilopau

    @jgong5

    @RanACohen @danilopau I like your idea to annotate the graph with statistical info of tensors. This allows more freedom for backend optimizers to choose the right quantization approaches. But I think these annotations are better bound to ops instead of tensors. Tensors may have aliases, but their statistical info cannot be shared. Adding this info to the op attributes (NodeProto.attribute) looks more straightforward. What do you think? [danilopau: yes, I agree, given that NodeProto captures the essence of such a quantizable layer]

    @RanACohen @danilopau On the statistical info to be annotated, we should also allow finer-grained quantization schemes, e.g. supporting a different scale for each output channel of a convolution filter. In general, statistical info along a particular axis of a tensor should be supported. [danilopau: hmm, it depends on the complexity this would push onto a HW accelerator vs. its inner flexibility; I mean, the scale can be different for each channel as long as, for example, all the channels are then 8 bits and convolutions are computed as such using 8-bit weights; if different scales imply mixed precisions (8/16/32 or other combinations), that would complicate the HW logic]
    Jiong Gong
    @jgong5
    @RanACohen I just realized that ONNX requires SSA, so attaching the statistical info to either tensors or ops is fine. On the question of fusion, I think it is more related to backend optimization and not necessarily included in ONNX semantics.
    Jiong Gong
    @jgong5
    @danilopau Yes, HW accelerators can decide whether such channel-wise statistical info is needed, but there is no harm in providing such detailed info in ONNX. Such finer-grained scales would benefit accuracy anyway. Per our experiments, channel-wise scales improve accuracy for all CNN models we tried (>10 models, covering image classification, object detection and image segmentation) and are particularly useful for models like MobileNet-v2.
    Plinio Silveira
    @pliniosilveira

    @danilopau Yes, HW accelerators can decide whether such channel-wise statistical info is needed, but there is no harm in providing such detailed info in ONNX.

    I agree with @jgong5. It can be the HW vendor's / backend's choice whether to use it or not. IMO ONNX should allow (but not require) information that is as detailed as possible.

    Ran Cohen
    @RanACohen
    Therefore that information shall be attached to TensorInfoProto...
    Eran Geva
    @MrGeva
    Looking at the ONNX spec, it contains composite ops like LSTM and GRU that HW accelerators often want to split into lower-level ops. For instance, LSTM can be split into FC layers and activations. Therefore, to support such a split, the statistics or scale & zero_point of the inner tensors must be provided in the model. Having functions, as proposed earlier in this thread, will allow adding inner-tensor statistics nicely. It seems to me we need to combine both approaches: Functions to break composite ops into lower-level ops, and TensorInfoProto to describe the statistics of all tensors, including the inner tensors of the functions.
    Danilo Pau
    @danilopau
    Dear all, I am not sure where we stand on fixed point in ONNX. Maybe I am not up to date, but I would like to find some precise specs or ops or tests or nets to start playing with, assuming my implementation has 8-bit fixed-point support in some form.
    Eran Geva
    @MrGeva
    @linkerzhang and all, what is your take on my comment above? My point is that in composite ops like LSTM we must provide the inner tensors' quantization info (zero_point, scale or min/max) to allow HW accelerators to execute the lower-level ops separately.
    Danilo Pau
    @danilopau
    @MrGeva @linkerzhang Lowering LSTM into basic unary/binary-input ONNX ops would be great to have. Even greater if quantization info (e.g. min/max; the more minimal, the better) were attached to them. If you would also like to provide an example / mini tutorial on how to generate the lowering and the min/max, that would be awesome.
    Sergei Gofman
    @sergeigofman
    Second @MrGeva. We also need calibration information on the internal tensors of compound constructs like LSTM and GRU.
    Hossein Askari
    @hossein1387
    So can someone give me an update regarding the status of quantization in ONNX?
    Hossein Askari
    @hossein1387
    My other question is: if I have a network that uses only int8 parameters, how can I export such a network to ONNX?
    Shinichiro Hamaji
    @shinh
    I'd like to know the status, too. What I know is: 1. there is a PR, onnx/onnx#1872; 2. onnxruntime and nGraph seem to have some of these ops.
    Anchal Bhalla
    @anchalbhalla
    hello
    Ke Zhang
    @linkerzhang
    @shinh The status is: 1) the first 6 quantized ops were merged and will be in the ONNX 1.5 release; 2) the principles for adding more quantization support were agreed during the design, as documented here: https://github.com/onnx/onnx/wiki/Quantization-Support-In-ONNX
    Darren Crews
    @darrenscrews
    To follow up on the first ops added as mentioned above, we are planning additional ops (such as LSTM) as well as additional data types (like FP16 for quantize). Material based on what I presented at the quantization breakout session at the last ONNX workshop is posted at this link: https://drive.google.com/drive/folders/1hiFwI1-z86YJ50DIQYi1Z2nijdK00A4o. I'll try to organize a discussion about next steps on this.
    Omar A. Elgendy
    @oelgendy
    FP16 inference is 10x slower than FP32!
    Hi,
    I am doing inference with ONNX Runtime in C++. I converted the ONNX file to FP16 in Python using onnxmltools convert_float_to_float16. I obtain the fp16 tensor from a libtorch tensor and wrap it in an ONNX fp16 tensor using
    g_ort->CreateTensorWithDataAsOrtValue(memory_info, libtorchTensor.data_ptr(), input_tensor_size * 2, input_node_dims.data(), input_node_dims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16, &onnxTensor)
    What am I missing?
    Thanks,
    -Omar
    Yufeng Li
    @yufenglee

    @oelgendy, are you using CPU? ONNX Runtime only has basic support for fp16 on CPU, i.e., it is only capable of running such models. Most operators don't have an fp16 implementation. For general operators, ORT casts the fp16 input to fp32 and casts the fp32 output back to fp16, so the perf is expected to be slower than float32.

    王振华 (Zhenhua WANG)
    @jackwish
    Hi, do we provide any prebuilt quantized ONNX model? I'd like to get to know how a quantized ONNX graph typically looks, and am looking for something like the quantized MobileNetV1 TensorFlow Lite model.
    Prasanth Pulavarthi
    @prasanthpul
    @jackwish We don't have any currently in the model zoo. You can file a request at https://github.com/onnx/models.
    Prasanth Pulavarthi
    @prasanthpul
    posting in all channels:
    ONNX will be switching to the Apache 2.0 license by the end of this month. Instead of signing a CLA, contributors need to provide a DCO (Developer Certificate of Origin). This is done by including a Signed-off-by line in commit messages. Using the “-s” flag for “git commit” will automatically append this line. For example, running “git commit -s -m ‘commit info.’” will produce a commit that has the message “commit info. Signed-off-by: First Last email@company.com”. The DCO bot will ensure commits are signed with an email address that matches the commit author before they are eligible to be merged.
    Diego Calvete
    @diecalsa

    Hi everybody!

    I have a yolov4-tiny model converted to ONNX format and need to do full integer quantization in order to use it on an edge device. Therefore I am trying to do static quantization using 'quantize_static' and a test dataset for calibration. I am able to quantize the model, but when I check it, it raises an error:

    Exception has occurred: ValidationError
    No opset import for domain 'com.microsoft'

    ==> Context: Bad node spec: input: "StatefulPartitionedCall/functional_1/batch_normalization/FusedBatchNormV3:0_quantized" input: "StatefulPartitionedCall/functional_1/batch_normalization/FusedBatchNormV3:0_scale" input: "StatefulPartitionedCall/functional_1/batch_normalization/FusedBatchNormV3:0_zero_point" input: "StatefulPartitionedCall/functional_1/tf_op_layer_LeakyRelu/LeakyRelu:0_scale" input: "StatefulPartitionedCall/functional_1/tf_op_layer_LeakyRelu/LeakyRelu:0_zero_point" output: "StatefulPartitionedCall/functional_1/tf_op_layer_LeakyRelu/LeakyRelu:0_quantized" name: "StatefulPartitionedCall/functional_1/tf_op_layer_LeakyRelu/LeakyRelu_quant" op_type: "QLinearLeakyRelu" attribute { name: "alpha" f: 0.1 type: FLOAT } domain: "com.microsoft"
    File "/Users/diegocalvete/Desktop/COMPARE_MODELS/guarreo.py", line 7, in <module>
    onnx.checker.check_model(model)

    Here is the code used for quantizing:
    import os
    import sys
    import numpy as np
    import re
    import abc
    import subprocess
    import json
    from PIL import Image
    
    import onnx
    import onnxruntime
    from onnx import helper, TensorProto, numpy_helper
    from onnxruntime.quantization import quantize_static, calibrate, CalibrationDataReader, quantize_dynamic, QuantType
    
    
    class DataReader(CalibrationDataReader):
        def __init__(self, calibration_image_folder, augmented_model_path='augmented_model.onnx'):
            self.image_folder = calibration_image_folder
            self.augmented_model_path = augmented_model_path
            self.preprocess_flag = True
            self.enum_data_dicts = []
            self.datasize = 0
    
        def get_next(self):
            if self.preprocess_flag:
                self.preprocess_flag = False
                session = onnxruntime.InferenceSession(self.augmented_model_path, None)
                (_, height, width, _) = session.get_inputs()[0].shape
                nhwc_data_list = preprocess_func(self.image_folder, height, width, size_limit=0)
                input_name = session.get_inputs()[0].name
                self.datasize = len(nhwc_data_list)
                self.enum_data_dicts = iter([{input_name: nhwc_data} for nhwc_data in nhwc_data_list])
            return next(self.enum_data_dicts, None)
    
    
    def preprocess_func(images_folder, height, width, size_limit=0):
        '''
        Loads a batch of images and preprocess them
        parameter images_folder: path to folder storing images
        parameter height: image height in pixels
        parameter width: image width in pixels
        parameter size_limit: number of images to load. Default is 0 which means all images are picked.
        return: list of matrices characterizing multiple images
        '''
        image_names = os.listdir(images_folder)
        if size_limit > 0 and len(image_names) >= size_limit:
            batch_filenames = [image_names[i] for i in range(size_limit)]
        else:
            batch_filenames = image_names
        unconcatenated_batch_data = []
    
        for image_name in batch_filenames:
            image_filepath = images_folder + '/' + image_name
            pillow_img = Image.new("RGB", (width, height))
            pillow_img.paste(Image.open(image_filepath).resize((width, height)))
            input_data = np.float32(pillow_img) - \
            np.array([123.68, 116.78, 103.94], dtype=np.float32)
            # input_data = np.array(pillow_img).astype('int64')
            nhwc_data = np.expand_dims(input_data, axis=0)
            unconcatenated_batch_data.append(nhwc_data)
        batch_data = np.concatenate(np.expand_dims(unconcatenated_batch_data, axis=0), axis=0)
        return batch_data
    
    
    def main():
    
        input_model_path = './models/onnx_models/yolov4-416-opset12.onnx'
        output_model_path = './models/onnx_models/yolov4-416-opset12-quantized.onnx'
        calibration_dataset_path = './test_images'
    
        # static quantization
        dr = DataReader(calibration_dataset_path)
        quantize_static(input_model_path, output_model_path, dr, ['Conv', 'QLinearLeakyRelu', 'LeakyRelu'])
    
        print('Calibrated and quantized model saved.')
    
    
    if __name__ == '__main__':
        main()
    I guess there is some problem quantizing LeakyRelu.
    Any idea how to fix it?
    Yufeng Li
    @yufenglee
    @diecalsa, the quantization tool under onnxruntime supports more quantized operators than the ONNX standard, like the QLinearLeakyRelu in your error. The ONNX checker doesn't recognize those operators; that's why you see this check issue. You can still run the model fine with ONNX Runtime.
    You can get an ONNX model with only standard quantized operators by specifying 'Conv' only when calling quantize_static.
    Also, the ORT quantization tool is adopting the QDQ (QuantizeLinear/DequantizeLinear) pattern. With that, you can get a quantized model that expresses quantization using only QuantizeLinear and DequantizeLinear.
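    A minimal sketch of both options, assuming a recent onnxruntime where quantize_static accepts the op_types_to_quantize and quant_format keyword arguments; the paths and the DataReader come from the snippet above, and the output file names are hypothetical.

    from onnxruntime.quantization import QuantFormat, quantize_static

    input_model = './models/onnx_models/yolov4-416-opset12.onnx'

    # Option 1: quantize only Conv, so the output uses standard quantized ops.
    quantize_static(input_model,
                    './models/onnx_models/yolov4-416-opset12-conv-only.onnx',
                    DataReader('./test_images'),
                    op_types_to_quantize=['Conv'])

    # Option 2: QDQ format, so quantization is expressed purely with
    # QuantizeLinear/DequantizeLinear and the model passes onnx.checker.check_model.
    quantize_static(input_model,
                    './models/onnx_models/yolov4-416-opset12-qdq.onnx',
                    DataReader('./test_images'),
                    quant_format=QuantFormat.QDQ)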
    Diego Calvete
    @diecalsa
    @yufenglee thanks for your fast response. Sorry, I am not an expert in quantization. I understand what you mentioned by quantizing only 'Conv' operators when calling quantize_static. I need to do full integer quantization in order to use the model on an edge TPU device such as Coral / Rockchip. Do you think it will work that way? I will try it as soon as I get the edge device.
    Yufeng Li
    @yufenglee
    @diecalsa , so the device doesn't support floating-point computation?
    Diego Calvete
    @diecalsa
    @yufenglee that's it. It does only support int8 operations.
    Mayank Sharma
    @mayank-nference
    Hi everyone, I recently came to know about this group and just joined it. Can anyone help me with static quantization? Can someone please share a sample of static quantization of a model for onnxruntime? I want to apply static quantization to my fine-tuned NER BERT model.
    David Kroell
    @vixadd
    Hey guys, I have an already trained model that I'm looking to quantize in order to maximize data throughput on CPU. Has anyone seen any sort of performance improvements when trying to conduct predictions/inference on CPU-only architecture?