模型量化——NVIDIA——方案选择(PTQ、 partialPTQ、 QAT)

news/2024/7/19 12:23:45 标签: 计算机视觉, transformer, AIGC

 PTQ、 partialPTQ、 QAT 选择流程

    PTQ、 partialPTQ、 QAT 咨询NVIDIA 官方后,他们的校正过程一致,支持的量化算子本质是一样的,那么如果你的算子不是如下几类,那么需要自己编写算子。参考TensorRT/tools/pytorch-quantization/pytorch_quantization/nn/modules/quant_conv.py at release/8.6 · NVIDIA/TensorRT · GitHub

QuantConv1d, QuantConv2d, QuantConv3d,

QuantConvTranspose1d, QuantConvTranspose2d, QuantConvTranspose3d

QuantLinear

QuantAvgPool1d, QuantAvgPool2d, QuantAvgPool3d,

QuantMaxPool1d, QuantMaxPool2d, QuantMaxPool3d

QuantAdaptiveAvgPool1d, QuantAdaptiveAvgPool2d, QuantAdaptiveAvgPool3d

Clip

QuantLSTM, QuantLSTMCell

 一些知识点

1、As well as quantizing activations, TensorRT must also quantize weights. It uses symmetric quantization with a quantization scale calculated using the maximum absolute values found in the weight tensor. For convolution, deconvolution, and fully connected weights, scales are per-channel. With explicit quantization, weights can be quantized using per-tensor quantization or they can be quantized using per-channel quantization. In either case, the scale precision is FP32. Activation can only be quantized using per-tensor quantization. When using per-channel quantization, the axis of quantization must be the output-channel axis. For example, when the weights of 2D convolution are described using KCRS notation, K is the output-channel axis。TensorRT supports only per-tensor quantization for activation tensors, but supports per-channel weight quantization for convolution, deconvolution, fully connected layers, and MatMul where the second input is constant and both input matrices are 2D.

总结:TRT8 建议权重为per-channel量化(CNN 权重是四维的!!!,per-channel 对应的是一个激活图对应的三维权重!),激活为per-tensor 量化。

2、Calibration can be slow; therefore the output of step 2 (the calibration table) can be cached and reused. The calibration cache data is portable across different devices as long as the calibration happens before layer fusion. Specifically, the calibration cache is portable when using the IInt8EntropyCalibrator2 or IInt8MinMaxCalibrator calibrators, or when QuantizationFlag::kCALIBRATE_BEFORE_FUSION is set. The calibration cache is in general not portable across TensorRT releases.

总结:校正文件对于同一版本的TRT的不同平台的没有先合并算子的是一致的(包括不同的平台),好处:可以校准一次,生成engine 的时候直接用校准缓存。

3、PyTorch 1.8 introduced support for exporting PyTorch models to ONNX using opset 13.PyTorch 1.8.0 and forward support ONNX QuantizeLinear/DequantizeLinear support per channel scales. You can use pytorch-quantization to do INT8 calibration, run quantization aware fine-tuning, generate ONNX and finally use TensorRT to run inference on this ONNX model. More detail can be found in NVIDIA PyTorch-Quantization Toolkit User Guide.SM 7.5 and earlier devices may not have INT8 implementations for all layers. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes which quantize the failing layers.

总结:量化需要TRT8+, pytorch 1.8+, onnx opset 13, sm 7.5 (之前的可能一些算子没有int8 实现)

4、TensorRT does not support prequantized ONNX models that use INT8 tensors or quantized operators.By default, do not quantize the outputs of weighted-operations. It is sometimes useful to preserve the higher-precision dequantized output. For example, if the linear operation is followed by an activation function (SiLU, in the following diagram) that requires higher precision input to produce acceptable accuracy. Do not simulate batch-normalization and ReLU fusions in the training framework because TensorRT optimizations guarantee to preserve the arithmetic semantics of these operations. Quantize the residual input in skip-connections. TensorRT can fuse element-wise addition following weighted layers, which are useful for models with skip connections like ResNet and EfficientNet. The precision of the first input to the element-wise addition layer determines the precision of the output of the fusion.

总结:TRT 不支持已经量化后的int8算子(他根据Q、DQ算子自己去优化图,至于所谓的Q DQ算子怎么定义,目前未知)。不要提前合并BN Relu等,TRT 会自动合并这些。具体可以参考最新的介绍:Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

5、INT8 calibration can be used along with the dynamic range APIs. Setting the dynamic range manually overrides the dynamic range generated from INT8 calibration.

6、For networks with implicit quantization, TensorRT attempts to reduce quantization noise in the output by forcing some layers near the network outputs to run in FP32, even if INT8 implementations are available.

排除你指定int8 或者第三方层,TRT会将网络最后几层保留FP32 精度。

7、You can further optimize engine latency by enabling FP16. TensorRT attempts to use FP16 instead of FP32 whenever possible (this is not currently supported for all layer types).

INT8 可以和FP32 混合存在,也可以和FP16混合存在。

8、bs can be set difference between set_calibration_profile and add_optimization_profile?

the int8 engine build has 2 pass, first pass is the calibration, the batch size in calibrator should match the set_calibration_profile, the second pass is the int8 engine build pass, and the batch size is configured from add_optimization_profile.

    经验:小模型(YOLO s m)含有repvgg 部件的使用PTQ掉点严重。partialPTQ一般最敏感的就是几乎整个head 的卷积网络。QAT也是先按照PTQ获得校准文件然后finetune 10% epoch.

参考链接

部分量化:YOLOv6/tools/partial_quantization at main · meituan/YOLOv6 · GitHub

QAT : YOLOv6/tools/qat at main · meituan/YOLOv6 · GitHub

NVIDIA 论文:https://arxiv.org/pdf/2004.09602.pdf 


http://www.niftyadmin.cn/n/5479875.html

相关文章

Python爬虫之增量式爬虫

相关介绍 增量式爬虫是一种用于爬取网页信息的技术,它与全量式爬虫相比具有更高效和节省资源的特点。 增量式爬虫的基本原理是通过比较已爬取的数据和新爬取的数据,只爬取和更新最新的数据。它会记录上一次爬取的状态,将新爬取的数据和已有的…

设计模式——享元模式11

享元模式又称缓存模式:把一个实体的数据分成两个部分,一部分为静态数据,一部分为动态数据,分开存储。 目的减少重复信息的数据加载。如:秒杀环境下,商品1 数量为 10000000, 我们只需把。一个商品…

求三角形面积(C语言)

一、运行结果&#xff1b; 二、源代码&#xff1b; # define _CRT_SECURE_NO_WARNINGS # include <stdio.h> # include <math.h>int main() {//初始化变量值&#xff1b;double a, b, c, s, area;//赋值&#xff1b;a 3.67;b 5.43;c 6.21;//运算求s&#xff1b…

halcon缺陷检测-印刷品检测(差异化模型),键盘字符缺陷检测

前言 在实际项目中&#xff0c;印刷品缺陷检测是缺陷检测中的难点项目。通常印刷品检测往往具备缺陷小&#xff0c;缺陷所在位置不固定&#xff0c;出现少印或者多印的情况。并且由于产线原因&#xff0c;大量的印刷品在视野中的大小并不是完全一致的&#xff0c;可能出现细微…

Vue2 响应式原理

Vue 的响应式原理 Vue 的响应式原理基于"数据劫持"和"依赖收集"的概念。当我们将一个普通的 JavaScript 对象传递给 Vue 实例的 data 选项时&#xff0c;Vue 将遍历此对象的所有属性&#xff0c;并使用 Object.defineProperty()来对每个属性进行 getter 和…

计算机网络----第六天

TCP和UDP的基本原理 传输层的作用&#xff1a; 作用&#xff1a;建立端到端的连接&#xff1b; TCP可靠传输 UDP不可靠传输 多路复用 对应用层进行分段 流量控制 TCP相关知识&#xff1a; 1、面向连接 2、可靠性传输 3、传输速度慢&#xff0c;延迟高 4、适用于完整性要…

在 vscode 中使用 Cascadia Code 字体 ( => === 连体字 )

需要先到 Github 上下载字体的压缩包 Releases microsoft/cascadia-code GitHub 下载解压好之后 打开 tts 文件&#xff0c;安装 Cascadia Code 字体 接着在 vs code 中做两步操作&#xff1a; 1&#xff09;配置字体 打开设置搜索 控制字体系列&#xff0c;修改为以下…

C++ new关键字的一些特殊用法

placement new i.普通的new关键字申请内存空间时&#xff0c;是从系统的堆中分配的空间&#xff0c;申请所得到的空间的位置是根据当时的内存的实际使用情况来决定的。 ii.placement new就是特殊情况下&#xff0c;需要在指定的特定内存去创建对象&#xff0c;就是定位放置new…