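YOLOv5 Quantization Workflow and Deployment
YOLOv5 is an efficient object detection algorithm that performs particularly well in real-time detection tasks. It uses three detection heads at different scales to handle large, medium, and small objects. Each head performs three tasks: bounding-box regression, class prediction, and confidence (objectness) prediction. Each head applies three anchors at every pixel of its feature map to help the algorithm predict object boundaries more accurately.
YOLOv5 comes in several model sizes (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x) to suit different tasks and hardware platforms. This article uses a YOLOv5n model trained on a color-sorter dataset as an example and shows how to quantize and compile it with PTQ, then deploy the full pipeline on the board in C++.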
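The YOLOv5n model used in this example differs from the public version in two respects on the input and output side:
1. The input resolution is set to 384x2048, so the output resolutions become 48x256, 24x128, and 12x64.
2. The number of classes is set to 17, so the output tensor has (17+4+1)x3=66 channels.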
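The arithmetic behind these two changes can be checked with a short sketch. The function name `head_shapes` is hypothetical, and the strides 8, 16, and 32 are taken from the C++ post-processing code later in this article.

```python
def head_shapes(input_h, input_w, num_classes, strides=(8, 16, 32), anchors_per_cell=3):
    """Return (height, width, channels) for each detection head."""
    # Each anchor predicts num_classes scores, 4 box values, and 1 objectness value.
    channels = (num_classes + 4 + 1) * anchors_per_cell
    return [(input_h // s, input_w // s, channels) for s in strides]


shapes = head_shapes(384, 2048, 17)
print(shapes)  # [(48, 256, 66), (24, 128, 66), (12, 64, 66)]
```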
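The input and output information of the ONNX model exported from PyTorch is shown in the figure below:
In addition, to reduce overall latency, the sigmoid computation at the tail of the model is moved into post-processing.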
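One consequence of moving sigmoid into post-processing is that the model outputs raw logits. Because sigmoid is monotonic, a probability threshold can be converted once into a logit threshold, so most candidates can be rejected without calling `exp`. The sketch below illustrates this idea only; the helper names are hypothetical and it is not taken from the demo code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid: the raw value whose sigmoid equals p."""
    return math.log(p / (1.0 - p))


def passes_threshold(raw_objness, prob_threshold):
    # Comparing in logit space is equivalent to sigmoid(raw) >= prob_threshold.
    return raw_objness >= logit(prob_threshold)


# A probability threshold of 0.2 corresponds to a raw value of about -1.386.
print(round(logit(0.2), 3))
print(passes_threshold(-1.0, 0.2), passes_threshold(-2.0, 0.2))
```

Note that the two spaces are not interchangeable: a raw value of 0.2 corresponds to a sigmoid score of roughly 0.55, not 0.2.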
horizon-nn 1.1.0
horizon_tc_ui 1.24.3
hbdk 3.49.15
First, prepare 100 images from the color-sorter dataset, like the one shown above, and place them in the seed100 folder. The 02_preprocess.sh script from horizon_model_convert_sample can then be used to generate the calibration data.
python3 ../../../data_preprocess.py \
--src_dir ./seed100 \
--dst_dir ./calibration_data_rgb_f32 \
--pic_ext .rgb \
--read_mode opencv \
--saved_data_type float32
def calibration_transformers():
    # Resize with padding to the model input size (height, width).
    transformers = [
        PadResizeTransformer(target_size=(384, 2048)),
    ]
    return transformers
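The calibration data only needs to be resized to the model's input size and converted to CHW layout and RGB. In other words, every operation except normalization must match the preprocessing used when the floating-point model was trained. Normalization can be moved into the model's preprocessing node, where it is accelerated.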
model_parameters:
  onnx_model: 'yolov5n.onnx'
  march: 'bayes-e'
  working_dir: 'model_output'
  output_model_file_prefix: 'yolov5n'
input_parameters:
  input_type_rt: 'nv12'
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451
calibration_parameters:
  cal_data_dir: './calibration_data_rgb_f32'
  cal_data_type: 'float32'
  calibration_type: 'default'
compiler_parameters:
  optimize_level: 'O3'
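input_type_rt is the data type the model receives at deployment time. Since the video pipeline usually delivers nv12, we set it to nv12.
input_type_train is the data type used when the floating-point model was trained; here it is rgb.
input_layout_train is the data layout used when the floating-point model was trained; here it is NCHW.
norm_type and scale_value are set according to the normalization parameters used when the floating-point model was trained; here scale is configured as 1/255.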
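As a quick check of the scale value, the sketch below computes 1/255 and applies it to the two extreme pixel values. The function `normalize` is a hypothetical illustration of scale-only normalization, not toolchain code.

```python
scale_value = 1.0 / 255.0
# Rounded to 15 decimal places, this is the value written in the config file.
print(f"{scale_value:.15f}")  # 0.003921568627451


def normalize(pixel, scale=scale_value):
    """Scale-only normalization: maps a pixel in [0, 255] to [0, 1]."""
    return pixel * scale


print(normalize(0), normalize(255))
```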
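With this configuration, the on-board model carries its own preprocessing node, which converts nv12 data to rgb and applies normalization. This node can be converted into an equivalent convolution, so it runs on the BPU and significantly reduces preprocessing time.
We strongly recommend this configuration when compiling models for image tasks. The on-board model can then take nv12 input directly, and we also provide C++ code for reference that reads a bgr image on the board and converts it to nv12.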
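For readers unfamiliar with nv12, the sketch below shows how it differs from planar I420: the Y plane is unchanged, while the separate U and V planes are interleaved into a single UV plane. This is a pure-Python illustration on byte strings with a hypothetical function name; the actual conversion code in this article uses OpenCV and NumPy.

```python
def i420_to_nv12(i420: bytes, height: int, width: int) -> bytes:
    """Repack planar I420 (Y, then U, then V) as NV12 (Y, then interleaved UV)."""
    y_size = height * width
    c_size = y_size // 4  # U and V are each subsampled 2x2
    if len(i420) != y_size + 2 * c_size:
        raise ValueError("buffer size does not match height * width * 3 / 2")
    y = i420[:y_size]
    u = i420[y_size:y_size + c_size]
    v = i420[y_size + c_size:]
    uv = bytearray(2 * c_size)
    uv[0::2] = u  # even positions hold U samples
    uv[1::2] = v  # odd positions hold V samples
    return bytes(y) + bytes(uv)


# A 2x4 image: 8 Y samples, 2 U samples, 2 V samples.
i420 = bytes(range(8)) + bytes([100, 101]) + bytes([200, 201])
nv12 = i420_to_nv12(i420, height=2, width=4)
print(list(nv12[8:]))  # [100, 200, 101, 201]
```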
hb_mapper makertbin --config ./yolov5n_config.yaml --model-type onnx
Running the command above compiles the bin model used for on-board deployment.
Output    Cosine Similarity    L1 Distance    L2 Distance    Chebyshev Distance
output    0.996914             0.234755       0.000420       5.957216
613       0.997750             0.232995       0.000744       8.833645
615       0.995946             0.281512       0.001877       4.717240
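The compilation log shows that for all three output heads of the yolov5n model, the cosine similarity between the outputs before and after quantization is above 0.99, which meets the accuracy requirement.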
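For reference, cosine similarity between a floating-point output and its quantized counterpart can be computed as in the sketch below. It assumes the usual definition (dot product divided by the product of the norms); the toolchain's own implementation is not shown in this article, and the sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


float_out = [0.18, 0.49, 0.34]   # hypothetical floating-point outputs
quant_out = [0.17, 0.50, 0.35]   # hypothetical quantized outputs
print(cosine_similarity(float_out, quant_out) > 0.99)  # True
```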
The PTQ flow produces yolov5n_quantized_model.onnx and yolov5n.bin. The former is the quantized ONNX model and the latter is the on-board model. The two models normally have identical accuracy, which can be verified as follows.
import cv2
import numpy as np
from PIL import Image
from horizon_tc_ui import HB_ONNXRuntime


def bgr2nv12(image):
    # Convert a BGR image (H x W x 3) into a flat NV12 buffer.
    image = image.astype(np.uint8)
    height, width = image.shape[0], image.shape[1]
    yuv420p = cv2.cvtColor(image, cv2.COLOR_BGR2YUV_I420).reshape((height * width * 3 // 2, ))
    y = yuv420p[:height * width]
    uv_planar = yuv420p[height * width:].reshape((2, height * width // 4))
    # Interleave the planar U and V samples into the packed UV plane of NV12.
    uv_packed = uv_planar.transpose((1, 0)).reshape((height * width // 2, ))
    nv12 = np.zeros_like(yuv420p)
    nv12[:height * width] = y
    nv12[height * width:] = uv_packed
    return nv12


def nv12Toyuv444(nv12, target_size):
    # Upsample the UV plane (nearest neighbour) to get an H x W x 3 YUV444 image.
    height = target_size[0]
    width = target_size[1]
    nv12_data = nv12.flatten()
    yuv444 = np.empty([height, width, 3], dtype=np.uint8)
    yuv444[:, :, 0] = nv12_data[:width * height].reshape(height, width)
    u = nv12_data[width * height::2].reshape(height // 2, width // 2)
    yuv444[:, :, 1] = Image.fromarray(u).resize((width, height), resample=0)
    v = nv12_data[width * height + 1::2].reshape(height // 2, width // 2)
    yuv444[:, :, 2] = Image.fromarray(v).resize((width, height), resample=0)
    return yuv444


def preprocess(input_name):
    bgr_input = cv2.imread("seed.jpg")
    nv12_input = bgr2nv12(bgr_input)
    # Save the NV12 data so the same input can be fed to the bin model on the board.
    nv12_input.tofile("seed_nv12.bin")
    yuv444 = nv12Toyuv444(nv12_input, (384, 2048))
    yuv444 = yuv444[np.newaxis, :, :, :]
    # The quantized model expects yuv444_128: uint8 values shifted by 128 into int8.
    yuv444_128 = (yuv444 - 128).astype(np.int8)
    return yuv444_128


def main():
    sess = HB_ONNXRuntime(model_file="./yolov5n_quantized_model.onnx")
    input_names = [input.name for input in sess.get_inputs()]
    output_names = [output.name for output in sess.get_outputs()]
    feed_dict = dict()
    for input_name in input_names:
        feed_dict[input_name] = preprocess(input_name)
    output = sess.run(output_names, feed_dict)
    print(output)


if __name__ == '__main__':
    main()
After the original image is read, it is converted to nv12 and saved, then converted to the yuv444_128 format and fed to the model for inference.
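The yuv444_128 step subtracts 128 from each uint8 sample and stores the result as int8. The pure-Python sketch below, with hypothetical helper names, models why subtracting in unsigned 8-bit arithmetic and then reinterpreting as int8 gives the same result as a plain `value - 128`. It assumes the subtraction wraps modulo 256, as uint8 array arithmetic does in NumPy.

```python
def offset_to_int8(value: int) -> int:
    """Shift a uint8 sample in [0, 255] into the int8 range [-128, 127]."""
    if not 0 <= value <= 255:
        raise ValueError("expected a uint8 value")
    return value - 128


def uint8_wrap_then_int8(value: int) -> int:
    """Model of subtracting in uint8 (wraps modulo 256), then casting to int8."""
    wrapped = (value - 128) % 256                        # result of the uint8 subtraction
    return wrapped - 256 if wrapped >= 128 else wrapped  # reinterpret the bits as int8


print(offset_to_int8(0), offset_to_int8(128), offset_to_int8(255))  # -128 0 127
```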
The output printed by print(output) is as follows:
[ 0.18080421 0.4917729 0.34173843 0.26877916 -10.983349
-3.8538744 -1.8031031 -2.2803051 -1.5579813 -1.8910917
-3.7208636 -2.4970834 -2.8638227 -3.5894732 -3.338331
hrt_model_exec infer --model-file yolov5n.bin --input-file seed_nv12.bin --enable_dump true --dump_format txt
Here the nv12 data saved in the previous step is used as the input to the bin model, and the output data is saved. The data of the first output branch is as follows:
As can be seen, yolov5n_quantized_model.onnx and yolov5n.bin produce the same outputs.
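The algorithm toolchain delivery package includes the ai benchmark sample, which contains C++ source code for the complete pipeline: image reading, preprocessing, inference, and post-processing. However, the ai benchmark code is tightly coupled and has a fairly steep learning curve, which makes it inconvenient to embed in a user's own project. We therefore provide a simplified C++ version modified from the horizon_runtime_sample example. It consists of just one header file and one C++ source file, and users only need to replace the original 00_quick_start example to compile and run it.
The C++ demo covers image reading for a single frame (bgr->nv12), model inference (including preprocessing), post-processing, and printing the results.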
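The heart of the post-processing step is decoding each raw prediction into a box. The sketch below restates in Python the decoding formula that appears in process_tensor_core in the C++ source; the function name `decode_box` and its argument names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, grid_x, grid_y, stride, anchor_w, anchor_h):
    """Decode raw (tx, ty, tw, th) logits into (xmin, ymin, xmax, ymax) in input pixels."""
    tx, ty, tw, th = raw
    # The center is offset from the grid cell, then scaled by the head's stride.
    center_x = (sigmoid(tx) * 2 - 0.5 + grid_x) * stride
    center_y = (sigmoid(ty) * 2 - 0.5 + grid_y) * stride
    # Width and height scale the anchor by a factor in (0, 4).
    box_w = (sigmoid(tw) * 2) ** 2 * anchor_w
    box_h = (sigmoid(th) * 2) ** 2 * anchor_h
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


# All-zero logits put the center in the middle of the cell and keep the anchor size.
print(decode_box((0.0, 0.0, 0.0, 0.0), grid_x=3, grid_y=2, stride=8, anchor_w=10, anchor_h=13))
# (23.0, 13.5, 33.0, 26.5)
```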
The header file is taken mainly from the ai benchmark header code/include/base/perception_common.h. It defines the argmax and timing helpers, as well as the structs used for object detection.
#include <algorithm>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <vector>

typedef std::chrono::steady_clock::time_point Time;
// The template arguments are missing from the source listing; double-precision microseconds are assumed here.
typedef std::chrono::duration<double, std::micro> Micro;

// Return the index of the largest element in [first, last).
template <class ForwardIterator>
inline size_t argmax(ForwardIterator first, ForwardIterator last) {
  return std::distance(first, std::max_element(first, last));
}

typedef struct Bbox {
  float xmin{0.0};
  float ymin{0.0};
  float xmax{0.0};
  float ymax{0.0};
  Bbox() {}
  Bbox(float xmin, float ymin, float xmax, float ymax)
      : xmin(xmin), ymin(ymin), xmax(xmax), ymax(ymax) {}
  friend std::ostream &operator<<(std::ostream &os, const Bbox &bbox) {
    const auto precision = os.precision();
    const auto flags = os.flags();
    os << "[" << std::fixed << std::setprecision(6) << bbox.xmin << ","
       << bbox.ymin << "," << bbox.xmax << "," << bbox.ymax << "]";
    // Restore the stream state saved above.
    os.flags(flags);
    os.precision(precision);
    return os;
  }
  ~Bbox() {}
} Bbox;

typedef struct Detection {
  int id{0};
  float score{0.0};
  Bbox bbox;
  const char *class_name{nullptr};
  Detection() {}
  Detection(int id, float score, Bbox bbox)
      : id(id), score(score), bbox(bbox) {}
  Detection(int id, float score, Bbox bbox, const char *class_name)
      : id(id), score(score), bbox(bbox), class_name(class_name) {}
  // Ordering by score, used to sort detections in descending order.
  friend bool operator>(const Detection &lhs, const Detection &rhs) {
    return (lhs.score > rhs.score);
  }
  friend std::ostream &operator<<(std::ostream &os, const Detection &det) {
    const auto precision = os.precision();
    const auto flags = os.flags();
    os << "{"
       << R"("bbox")"
       << ":" << det.bbox << ","
       << R"("prob")"
       << ":" << std::fixed << std::setprecision(6) << det.score << ","
       << R"("label")"
       << ":" << det.id << ","
       << R"("class_name")"
       // class_name may be null when the 3-argument constructor is used.
       << ":\"" << (det.class_name ? det.class_name : "") << "\"}";
    // Restore the stream state saved above.
    os.flags(flags);
    os.precision(precision);
    return os;
  }
  ~Detection() {}
} Detection;

struct Perception {
  std::vector<Detection> det;
  enum {
    DET = (1 << 0),
  } type;
  friend std::ostream &operator<<(std::ostream &os, Perception &perception) {
    os << "[";
    if (perception.type == Perception::DET) {
      auto &detection = perception.det;
      for (size_t i = 0; i < detection.size(); i++) {
        if (i != 0) {
          os << ",";
        }
        os << detection[i];
      }
    }
    os << "]";
    return os;
  }
};
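#include <algorithm>
#include <cmath>
#include <cstring>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

#include "dnn/hb_dnn.h"
#include "opencv2/core/mat.hpp"
#include "opencv2/imgcodecs.hpp"
#include "opencv2/imgproc.hpp"
#include "head.h"
// Path to the on-board model
auto modelFileName = "yolov5n.bin";
// Path to the single test image
std::string imagePath = "seed.jpg";
// Width of the test image
int image_width = 2048;
// Height of the test image
int image_height = 384;
// Confidence threshold
float score_threshold = 0.2;
// Number of classes
int num_classes = 17;
// Number of predicted values per anchor (classes + 4 box values + 1 objectness)
int num_pred = num_classes + 4 + 1;
// Top-k limit for NMS
int nms_top_k = 5000;
// IoU threshold for NMS
float nms_iou_threshold = 0.5;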
// Allocate input and output memory for model inference
void prepare_tensor(int input_count,
                    int output_count,
                    hbDNNTensor *input_tensor,
                    hbDNNTensor *output_tensor,
                    hbDNNHandle_t dnn_handle) {
  hbDNNTensor *input = input_tensor;
  for (int i = 0; i < input_count; i++) {
    hbDNNGetInputTensorProperties(&input[i].properties, dnn_handle, i);
    int input_memSize = input[i].properties.alignedByteSize;
    hbSysAllocCachedMem(&input[i].sysMem[0], input_memSize);
    input[i].properties.alignedShape = input[i].properties.validShape;
  }
  hbDNNTensor *output = output_tensor;
  for (int i = 0; i < output_count; i++) {
    hbDNNGetOutputTensorProperties(&output[i].properties, dnn_handle, i);
    int output_memSize = output[i].properties.alignedByteSize;
    hbSysAllocCachedMem(&output[i].sysMem[0], output_memSize);
  }
}
// Read a bgr image, convert it to nv12, and store it in the input memory.
// The image is assumed to already match the model input size; no resize is done here.
void read_image_2_tensor_as_nv12(std::string imagePath,
                                 hbDNNTensor *input_tensor) {
  hbDNNTensor *input = input_tensor;
  hbDNNTensorProperties Properties = input->properties;
  int input_h = Properties.validShape.dimensionSize[2];
  int input_w = Properties.validShape.dimensionSize[3];
  cv::Mat bgr_mat = cv::imread(imagePath, cv::IMREAD_COLOR);
  cv::Mat yuv_mat;
  cv::cvtColor(bgr_mat, yuv_mat, cv::COLOR_BGR2YUV_I420);
  uint8_t *nv12_data = yuv_mat.ptr<uint8_t>();
  auto input_data = input->sysMem[0].virAddr;
  // Copy the Y plane unchanged
  int32_t y_size = input_h * input_w;
  memcpy(reinterpret_cast<uint8_t *>(input_data), nv12_data, y_size);
  // Interleave the planar U and V samples into the packed UV plane
  int32_t uv_height = input_h / 2;
  int32_t uv_width = input_w / 2;
  uint8_t *nv12 = reinterpret_cast<uint8_t *>(input_data) + y_size;
  uint8_t *u_data = nv12_data + y_size;
  uint8_t *v_data = u_data + uv_height * uv_width;
  for (int32_t i = 0; i < uv_width * uv_height; i++) {
    if (u_data && v_data) {
      *nv12++ = *u_data++;
      *nv12++ = *v_data++;
    }
  }
}
// Core post-processing (NMS excluded): first-pass filtering of candidate boxes
void process_tensor_core(hbDNNTensor *tensor,
                         int layer,
                         std::vector<Detection> &dets) {
  // Invalidate the cache so the CPU reads the data written by the BPU
  hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_INVALIDATE);
  int height = 0, width = 0, stride = 0;
  std::vector<std::pair<double, double>> anchors;
  if (layer == 0) {
    height = 48; width = 256; stride = 8; anchors = {{10, 13}, {16, 30}, {33, 23}};
  } else if (layer == 1) {
    height = 24; width = 128; stride = 16; anchors = {{30, 61}, {62, 45}, {59, 119}};
  } else if (layer == 2) {
    height = 12; width = 64; stride = 32; anchors = {{116, 90}, {156, 198}, {373, 326}};
  }
  int anchor_num = anchors.size();
  // The output is read as float values, num_pred per anchor at each pixel
  auto *data = reinterpret_cast<float *>(tensor->sysMem[0].virAddr);
  for (int h = 0; h < height; h++) {
    for (int w = 0; w < width; w++) {
      for (int k = 0; k < anchor_num; k++) {
        double anchor_x = anchors[k].first;
        double anchor_y = anchors[k].second;
        float *cur_data = data + k * num_pred;
        float objness = cur_data[4];
        // objness is still a raw logit here, so 0.2 corresponds to a sigmoid score of about 0.55
        if (objness < score_threshold)
          continue;
        int id = argmax(cur_data + 5, cur_data + 5 + num_classes);
        // The detection head has no sigmoid operator; sigmoid is computed here in post-processing
        double x1 = 1 / (1 + std::exp(-objness)) * 1;
        double x2 = 1 / (1 + std::exp(-cur_data[id + 5]));
        double confidence = x1 * x2;
        if (confidence < score_threshold)
          continue;
        float center_x = cur_data[0];
        float center_y = cur_data[1];
        float scale_x = cur_data[2];
        float scale_y = cur_data[3];
        double box_center_x =
            ((1.0 / (1.0 + std::exp(-center_x))) * 2 - 0.5 + w) * stride;
        double box_center_y =
            ((1.0 / (1.0 + std::exp(-center_y))) * 2 - 0.5 + h) * stride;
        double box_scale_x =
            std::pow((1.0 / (1.0 + std::exp(-scale_x))) * 2, 2) * anchor_x;
        double box_scale_y =
            std::pow((1.0 / (1.0 + std::exp(-scale_y))) * 2, 2) * anchor_y;
        double xmin = (box_center_x - box_scale_x / 2.0);
        double ymin = (box_center_y - box_scale_y / 2.0);
        double xmax = (box_center_x + box_scale_x / 2.0);
        double ymax = (box_center_y + box_scale_y / 2.0);
        double xmin_org = xmin;
        double xmax_org = xmax;
        double ymin_org = ymin;
        double ymax_org = ymax;
        // Drop boxes that lie outside the image or are degenerate
        if (xmax_org <= 0 || ymax_org <= 0)
          continue;
        if (xmin_org > xmax_org || ymin_org > ymax_org)
          continue;
        // Clamp the box to the image
        xmin_org = std::max(xmin_org, 0.0);
        xmax_org = std::min(xmax_org, image_width - 1.0);
        ymin_org = std::max(ymin_org, 0.0);
        ymax_org = std::min(ymax_org, image_height - 1.0);
        Bbox bbox(xmin_org, ymin_org, xmax_org, ymax_org);
        dets.emplace_back((int)id, confidence, bbox);
      }
      // Advance to the next pixel
      data = data + num_pred * anchors.size();
    }
  }
}
// NMS: select the final detection boxes from the candidates
void yolo5_nms(std::vector<Detection> &input,
               std::vector<Detection> &result,
               bool suppress) {
  // Sort by score in descending order
  std::stable_sort(input.begin(), input.end(), std::greater<Detection>());
  std::vector<bool> skip(input.size(), false);
  // Pre-compute the area of every box
  std::vector<float> areas;
  for (size_t i = 0; i < input.size(); i++) {
    float width = input[i].bbox.xmax - input[i].bbox.xmin;
    float height = input[i].bbox.ymax - input[i].bbox.ymin;
    areas.push_back(width * height);
  }
  int count = 0;
  for (size_t i = 0; count < nms_top_k && i < skip.size(); i++) {
    if (skip[i]) {
      continue;
    }
    skip[i] = true;
    ++count;
    for (size_t j = i + 1; j < skip.size(); ++j) {
      if (skip[j]) {
        continue;
      }
      // When suppress is false, only boxes of the same class suppress each other
      if (suppress == false) {
        if (input[i].id != input[j].id) {
          continue;
        }
      }
      float xx1 = std::max(input[i].bbox.xmin, input[j].bbox.xmin);
      float yy1 = std::max(input[i].bbox.ymin, input[j].bbox.ymin);
      float xx2 = std::min(input[i].bbox.xmax, input[j].bbox.xmax);
      float yy2 = std::min(input[i].bbox.ymax, input[j].bbox.ymax);
      if (xx2 > xx1 && yy2 > yy1) {
        float area_intersection = (xx2 - xx1) * (yy2 - yy1);
        float iou_ratio =
            area_intersection / (areas[j] + areas[i] - area_intersection);
        if (iou_ratio > nms_iou_threshold) {
          skip[j] = true;
        }
      }
    }
    result.push_back(input[i]);
    // Print the score and position of each box that survives NMS
    std::cout << "score " << input[i].score;
    std::cout << " xmin " << input[i].bbox.xmin;
    std::cout << " ymin " << input[i].bbox.ymin;
    std::cout << " xmax " << input[i].bbox.xmax;
    std::cout << " ymax " << input[i].bbox.ymax << std::endl;
  }
}
// Speed up post-processing with one thread per output head
std::mutex dets_mutex;
void process_tensor_thread(hbDNNTensor *tensor, int layer, std::vector<Detection> &dets) {
  std::vector<Detection> local_dets;
  process_tensor_core(tensor, layer, local_dets);
  // Merge the per-thread results under a lock
  std::lock_guard<std::mutex> lock(dets_mutex);
  dets.insert(dets.end(), local_dets.begin(), local_dets.end());
}

void post_process(std::vector<hbDNNTensor> &tensors,
                  Perception *perception) {
  perception->type = Perception::DET;
  std::vector<Detection> dets;
  std::vector<std::thread> threads;
  for (int i = 0; i < static_cast<int>(tensors.size()); ++i) {
    threads.emplace_back([&tensors, i, &dets]() {
      process_tensor_thread(&tensors[i], i, dets);
    });
  }
  for (auto &thread : threads) {
    thread.join();
  }
  yolo5_nms(dets, perception->det, false);
}
int main(int argc, char **argv) {
  hbPackedDNNHandle_t packed_dnn_handle;
  hbDNNHandle_t dnn_handle;
  const char **model_name_list;
  int model_count = 0;
  // Load the model and get a handle to the first model in the package
  hbDNNInitializeFromFiles(&packed_dnn_handle, &modelFileName, 1);
  hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle);
  hbDNNGetModelHandle(&dnn_handle, packed_dnn_handle, model_name_list[0]);
  std::cout << "yolov5 demo begin!" << std::endl;
  std::cout << "load model success" << std::endl;
  // ... (the listing is truncated here)
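Place the header and source files in horizon_runtime_sample/code/00_quick_start/src and run build_x5.sh to build the project. Then copy the horizon_runtime_sample/x5 folder to the /userdata directory on the development board, put the on-board model, test image, and other files in /userdata/x5/script/00_quick_start/, and write a run script for the board: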
export BMEM_CACHEABLE=true
yolov5 demo begin!
load model success
prepare intput and output tensor success
read image to tensor as nv12 success
model infer time: 7.763 ms
model infer success
score 0.365574 xmin 1448.69 ymin 148.4 xmax 1518.55 ymax 278.487
postprocess time: 1.376 ms
postprocess success
release resources success
yolov5 demo end!
As shown, the inference program successfully detected one melon seed and reported the correct coordinates.
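The single box printed in the log is what remains after NMS. For readers who want to experiment off the board, the sketch below is a pure-Python version of greedy, class-wise NMS in the spirit of yolo5_nms with suppress set to false. The tuple layout `(class_id, score, box)` and the function names are assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    if ix2 <= ix1 or iy2 <= iy1:
        return 0.0
    inter = (ix2 - ix1) * (iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(dets, iou_threshold=0.5, top_k=5000):
    """Greedy NMS; only boxes with the same class_id suppress each other."""
    ordered = sorted(dets, key=lambda d: d[1], reverse=True)
    suppressed = [False] * len(ordered)
    kept = []
    for i, det_i in enumerate(ordered):
        if suppressed[i]:
            continue
        if len(kept) >= top_k:
            break
        kept.append(det_i)
        for j in range(i + 1, len(ordered)):
            det_j = ordered[j]
            if suppressed[j] or det_j[0] != det_i[0]:
                continue
            if iou(det_i[2], det_j[2]) > iou_threshold:
                suppressed[j] = True
    return kept


candidates = [
    (0, 0.9, (0, 0, 10, 10)),
    (0, 0.8, (1, 1, 11, 11)),    # overlaps the first box, same class: suppressed
    (1, 0.7, (0, 0, 10, 10)),    # same position but another class: kept
    (0, 0.6, (50, 50, 60, 60)),  # far away: kept
]
print([d[1] for d in nms(candidates)])  # [0.9, 0.7, 0.6]
```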
For accurate model inference time, rely on measurements from the hrt_model_exec tool. Reference commands:
hrt_model_exec perf --model-file ./yolov5n.bin --thread-num 1 (measures single-thread, single-frame latency; look at latency)
hrt_model_exec perf --model-file ./yolov5n.bin --thread-num 8 (measures maximum multi-thread throughput; look at FPS)