diff --git a/AscendIE/TorchAIE/built-in/cv/detection/ssd/README.md b/AscendIE/TorchAIE/built-in/cv/detection/ssd/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..bdb8fefd76e2f47d462e30747371bf0f7f230080
--- /dev/null
+++ b/AscendIE/TorchAIE/built-in/cv/detection/ssd/README.md
@@ -0,0 +1,223 @@
+# SSD Model - Inference Guide
+
+- [Overview](#ZH-CN_TOPIC_0000001172161501)
+
+  - [Input and Output Data](#ZH-CN_TOPIC_0000001126281702)
+
+- [Inference Environment](#ZH-CN_TOPIC_0000001126281702)
+
+- [Quick Start](#ZH-CN_TOPIC_0000001126281700)
+
+  - [Getting the Source Code](#section4622531142816)
+  - [Preparing the Dataset](#section183221994411)
+  - [Model Inference](#section741711594517)
+
+- [Inference Performance & Accuracy](#ZH-CN_TOPIC_0000001172201573)
+
+  ******
+
+# Overview
+
+SSD casts detection as a regression problem, performing object localization and classification in a single forward pass. Building on the anchors of Faster R-CNN, it introduces the similar concept of prior boxes. The algorithm modifies the traditional VGG16 backbone: the FC6 and FC7 layers are converted into convolutional layers, and all Dropout layers as well as FC8 are removed. It also adds a feature-pyramid-style detection scheme that predicts objects on feature maps with different receptive fields.
+
+- Reference implementation:
+
+  ```shell
+  url=https://github.com/open-mmlab/mmdetection.git
+  branch=master
+  commit_id=a21eb25535f31634cef332b09fc27d28956fb24b
+  model_name=ssd
+  ```
+
+## Input and Output Data
+
+- Input data
+
+  | Input  | Data type | Shape                     | Layout |
+  | ------ | --------- | ------------------------- | ------ |
+  | input  | RGB_FP32  | batchsize x 3 x 300 x 300 | NCHW   |
+
+- Output data
+
+  | Output | Data type | Shape                 | Layout |
+  | ------ | --------- | --------------------- | ------ |
+  | boxes  | FLOAT32   | batchsize x 8732 x 4  | ND     |
+  | labels | FLOAT32   | batchsize x 8732 x 80 | ND     |
+
+# Inference Environment
+
+- This model is run by switching between two environments. The environment used for inference (including the plugin and the driver) is listed below.
+
+  **Table 1** Version compatibility
+
+  | Component         | Version          | Setup guide |
+  |-------------------|------------------|-------------|
+  | Firmware & driver | 23.0.rc1         | [PyTorch inference environment setup](https://www.hiascend.com/document/detail/zh/ModelZoo/pytorchframework/pies) |
+  | CANN              | 7.0.RC1.alpha003 | - |
+  | Python            | 3.9.11           | - |
+  | PyTorch           | 2.0.1            | - |
+  | Torch_AIE         | 6.3.rc2          | - |
+
+- A second environment is used for pre-processing, post-processing and model export. It is recommended to build a conda virtual environment for it and install the required packages there:
+
+```
+conda create --name ssd python=3.7.16
+```
+
+# Quick Start
+
+## Getting the Source Code
+
+1. Install the dependencies.
+   ```shell
+   pip3 install -r ssd-requirements.txt
+   ```
+
+2. Get the SSD source code and patch mmdetection.
+   ```shell
+   git clone https://github.com/open-mmlab/mmdetection.git
+   cd mmdetection
+   git reset --hard a21eb25535f31634cef332b09fc27d28956fb24b
+   patch -p1 < ../update_ssd_mmdet.diff
+   pip install -v -e .
+   cd ..
+   ```
+   Place the `onnx_helper.py` file under the `mmdetection/mmdet/core/export` directory.
+
+3. Then run the following (on a first install this step may take a long time):
+   ```
+   pip3 install --no-cache-dir mmcv-full==1.2.7
+   ```
+
+## Preparing the Dataset
+
+1. Get the original dataset. (For extraction, refer to `tar -xvf *.tar` and `unzip *.zip`.)
+
+   The inference dataset is [coco_val_2017](http://images.cocodataset.org). After downloading, store it at `dataset=/root/datasets/coco`.
+
+   Directory structure:
+
+   ```
+   ├── coco
+   │   ├── val2017
+   │   ├── annotations
+   │       ├── instances_val2017.json
+   ```
+
+2. Preprocess the data (use the torch 1.8 environment).
+
+   Convert the original dataset into the binary inputs expected by the model by running the `ssd_preprocess.py` script.
+
+   ```shell
+   python ssd_preprocess.py \
+       --image_folder_path $dataset/val2017 \
+       --bin_folder_path val2017_ssd_bin
+   ```
+
+   - Parameters:
+
+     - --image_folder_path: path to the original validation images (.jpg).
+     - --bin_folder_path: output path for the binary files (.bin).
+
+   One binary file is generated per image.
+
+3. Generate the dataset info file (use the torch 1.8 environment).
+
+   Run the `get_info.py` script to generate the image info file; the line format it must produce is illustrated in the sketch after this step.
+   ```shell
+   python get_info.py jpg $dataset/val2017 coco2017_ssd_jpg.info
+   ```
+
+   - Parameters:
+
+     - First argument: format of the dataset files.
+     - Second argument: relative path to the preprocessed data files.
+     - Third argument: name of the info file to generate.
+
+   After the script completes, `coco2017_ssd_jpg.info` is generated in the current directory.
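+   The info file is consumed later by `ssd_postprocess.py`, which parses each
+   line as the whitespace-separated fields `index path width height`. The
+   snippet below is an illustrative sketch of that format only (it is not the
+   actual `get_info.py`); the image directory is an assumed example path.
+
+   ```python
+   import os
+   from PIL import Image
+
+   # Sketch: write one "<index> <path> <width> <height>" line per image.
+   image_dir = "/root/datasets/coco/val2017"   # assumed dataset location
+   names = sorted(n for n in os.listdir(image_dir) if n.endswith(".jpg"))
+   with open("coco2017_ssd_jpg.info", "w") as info:
+       for idx, name in enumerate(names):
+           path = os.path.join(image_dir, name)
+           with Image.open(path) as img:
+               width, height = img.size
+           info.write("{} {} {} {}\n".format(idx, path, width, height))
+   ```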
+## Model Inference
+
+1. Model conversion.
+
+   Use PyTorch to convert the .pth weight file into a .ts (TorchScript) file.
+   1. Get the weight file.
+
+      Download the trained weights: [ssd300_coco_20200307-a92d2092.pth](http://download.openmmlab.com/mmdetection/v2.0/ssd/ssd300_coco/ssd300_coco_20200307-a92d2092.pth)
+
+   2. Export the TorchScript model.
+
+      Use `export.py` to export the .ts file (use the torch 1.8 environment).
+
+      ```
+      python export.py \
+          --checkpoint ./ssd300_coco_20200307-a92d2092.pth \
+          --mmdet_path ./mmdetection \
+          --shape=300 \
+          --mean 123.675 116.28 103.53 \
+          --std 1 1 1
+      ```
+
+      - Parameters:
+
+        - checkpoint: path to the original .pth file
+        - mmdet_path: path to the cloned mmdetection repository
+        - shape: input image size
+
+2. Run inference and verification.
+
+   1. Run inference (use the torch 2.0.1 environment).
+      ```shell
+      python3 acc_dataset.py --ts_path ./ssd300_coco.ts --img_bin_path ./coco2017_bin --save_dir ./pyinfer_res_npu
+      ```
+
+      - Parameters:
+
+        - ts_path: path to the exported .ts file
+        - img_bin_path: path to the folder of .bin files produced by preprocessing
+        - save_dir: directory in which to save the inference results
+
+   2. Verify accuracy.
+
+      Run coco_eval.py to evaluate mAP accuracy (use the torch 1.8 environment):
+
+      ```shell
+      det_path=postprocess_out
+      python ssd_postprocess.py \
+          --bin_data_path=out/2022_*/ \
+          --score_threshold=0.02 \
+          --test_annotation=coco2017_ssd_jpg.info \
+          --nms_pre 200 \
+          --det_results_path ${det_path}
+      python txt_to_json.py --npu_txt_path ${det_path}
+      python coco_eval.py --ground_truth /root/datasets/coco/annotations/instances_val2017.json
+      ```
+
+      - Parameters:
+
+        - --bin_data_path: path where the inference results are stored.
+        - --score_threshold: score threshold.
+        - --test_annotation: info file of the original images.
+        - --nms_pre: maximum number of boxes kept per image before NMS.
+        - --det_results_path: output path of the post-processing.
+        - --npu_txt_path: path of the post-processed .txt results.
+        - --ground_truth: path to the instances_val2017.json file.
+
+# Inference Performance & Accuracy
+
+Inference is invoked through the ACL interface; refer to the following data for performance and accuracy.
+
+|                | mAP      |
+| -------------- | -------- |
+| 310P3 accuracy | mAP=25.4 |
+
+
+| Throughput | 310*4    | 310P3    | 310B1 |
+| ---------- | -------- | -------- | ----- |
+| bs1        | 179.194  | 298.5514 | 75.42 |
+| bs4        | 207.596  | 337.0112 | 77.9  |
+| bs8        | 211.7312 | 323.5662 | 79.77 |
+| bs16       | 211.288  | 318.1392 | 77.84 |
+| bs32       | 200.2948 | 318.7303 | 79.78 |
+| bs64       | 196.4192 | 313.0790 | 48.36 |
+| Best batch | 211.7312 | 337.0112 | 79.77 |
\ No newline at end of file
diff --git a/AscendIE/TorchAIE/built-in/cv/detection/ssd/onnx_helper.py b/AscendIE/TorchAIE/built-in/cv/detection/ssd/onnx_helper.py
new file mode 100644
index 0000000000000000000000000000000000000000..9abd220bafabe507e127b871daa316658aa8c066
--- /dev/null
+++ b/AscendIE/TorchAIE/built-in/cv/detection/ssd/onnx_helper.py
@@ -0,0 +1,245 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import os
+
+import torch
+
+
+def dynamic_clip_for_onnx(x1, y1, x2, y2, max_shape):
+    """Clip boxes dynamically for onnx.
+
+    Since torch.clamp cannot have dynamic `min` and `max`, we scale the
+    boxes by 1/max_shape and clamp in the range [0, 1].
+
+    Args:
+        x1 (Tensor): The x1 for bounding boxes.
+        y1 (Tensor): The y1 for bounding boxes.
+        x2 (Tensor): The x2 for bounding boxes.
+        y2 (Tensor): The y2 for bounding boxes.
+        max_shape (Tensor, torch.Size, list or tuple): The (H, W) of the
+            original image.
+    Returns:
+        tuple(Tensor): The clipped x1, y1, x2, y2.
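+
+    Example (illustrative sketch added here for clarity, not part of the
+    original OpenMMLab code; assumes a 300x300 network input)::
+
+        x1 = torch.tensor([-5.0, 120.0])
+        y1 = torch.tensor([10.0, -3.0])
+        x2 = torch.tensor([310.0, 150.0])
+        y2 = torch.tensor([290.0, 320.0])
+        x1, y1, x2, y2 = dynamic_clip_for_onnx(x1, y1, x2, y2, (300, 300))
+        # x1 -> [0., 120.], y1 -> [10., 0.], x2 -> [300., 150.], y2 -> [290., 300.]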
+ """ + # assert isinstance( + # max_shape, + # torch.Tensor), '`max_shape` should be tensor of (h,w) for onnx, got {}'.format(max_shape.__class__.__name__) + + assert isinstance(max_shape, (torch.Tensor, torch.Size, list, tuple)), '`max_shape` should be ' + \ + 'torch.Tensor/torch.Size/list/tuple of (h, w) for onnx, got {}'.format(max_shape.__class__.__name__) + if not isinstance(max_shape, torch.Tensor): + max_shape = torch.tensor(max_shape, dtype=x1.dtype, device=x1.device) + else: + max_shape = max_shape.type_as(x1) + + # scale by 1/max_shape + x1 = x1 / max_shape[1] + y1 = y1 / max_shape[0] + x2 = x2 / max_shape[1] + y2 = y2 / max_shape[0] + + # clamp [0, 1] + x1 = torch.clamp(x1, 0, 1) + y1 = torch.clamp(y1, 0, 1) + x2 = torch.clamp(x2, 0, 1) + y2 = torch.clamp(y2, 0, 1) + + # scale back + x1 = x1 * max_shape[1] + y1 = y1 * max_shape[0] + x2 = x2 * max_shape[1] + y2 = y2 * max_shape[0] + return x1, y1, x2, y2 + + +def get_k_for_topk(k, size): + """Get k of TopK for onnx exporting. + + The K of TopK in TensorRT should not be a Tensor, while in ONNX Runtime + it could be a Tensor.Due to dynamic shape feature, we have to decide + whether to do TopK and what K it should be while exporting to ONNX. + If returned K is less than zero, it means we do not have to do + TopK operation. + + Args: + k (int or Tensor): The set k value for nms from config file. + size (Tensor or torch.Size): The number of elements of \ + TopK's input tensor + Returns: + tuple: (int or Tensor): The final K for TopK. + """ + ret_k = -1 + if k <= 0 or size <= 0: + return ret_k + if torch.onnx.is_in_onnx_export(): + is_trt_backend = os.environ.get('ONNX_BACKEND') == 'MMCVTensorRT' + if is_trt_backend: + # TensorRT does not support dynamic K with TopK op + if 0 < k < size: + ret_k = k + else: + # Always keep topk op for dynamic input in onnx for ONNX Runtime + ret_k = torch.where(k < size, k, size) + elif k < size: + ret_k = k + else: + # ret_k is -1 + pass + return ret_k + + +def add_dummy_nms_for_onnx(boxes, + scores, + max_output_boxes_per_class=1000, + iou_threshold=0.5, + score_threshold=0.05, + pre_top_k=-1, + after_top_k=-1, + labels=None): + """Create a dummy onnx::NonMaxSuppression op while exporting to ONNX. + + This function helps exporting to onnx with batch and multiclass NMS op. + It only supports class-agnostic detection results. That is, the scores + is of shape (N, num_bboxes, num_classes) and the boxes is of shape + (N, num_boxes, 4). + + Args: + boxes (Tensor): The bounding boxes of shape [N, num_boxes, 4] + scores (Tensor): The detection scores of shape + [N, num_boxes, num_classes] + max_output_boxes_per_class (int): Maximum number of output + boxes per class of nms. Defaults to 1000. + iou_threshold (float): IOU threshold of nms. Defaults to 0.5 + score_threshold (float): score threshold of nms. + Defaults to 0.05. + pre_top_k (bool): Number of top K boxes to keep before nms. + Defaults to -1. + after_top_k (int): Number of top K boxes to keep after nms. + Defaults to -1. + labels (Tensor, optional): It not None, explicit labels would be used. + Otherwise, labels would be automatically generated using + num_classed. Defaults to None. + + Returns: + tuple[Tensor, Tensor]: dets of shape [N, num_det, 5] + and class labels of shape [N, num_det]. 
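+
+    Example (illustrative sketch added here for clarity, not part of the
+    original OpenMMLab code; the shapes follow the SSD300 export in this
+    repository and the threshold values are assumptions)::
+
+        boxes = torch.rand(1, 8732, 4)      # (N, num_boxes, 4)
+        scores = torch.rand(1, 8732, 80)    # (N, num_boxes, num_classes)
+        dets, labels = add_dummy_nms_for_onnx(
+            boxes, scores,
+            max_output_boxes_per_class=200,
+            iou_threshold=0.45,
+            score_threshold=0.02,
+            pre_top_k=200,
+            after_top_k=200)
+        # dets: (1, 200, 5) as [x1, y1, x2, y2, score]; labels: (1, 200)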
+ """ + max_output_boxes_per_class = torch.LongTensor([max_output_boxes_per_class]) + iou_threshold = torch.tensor([iou_threshold], dtype=torch.float32) + score_threshold = torch.tensor([score_threshold], dtype=torch.float32) + batch_size = scores.shape[0] + num_class = scores.shape[2] + + if pre_top_k > 0: + nms_pre = torch.tensor(pre_top_k, device=scores.device, dtype=torch.long) + nms_pre = get_k_for_topk(nms_pre, boxes.shape[1]) + + if nms_pre > 0: + max_scores, _ = scores.max(-1) + _, topk_inds = max_scores.topk(nms_pre) + batch_inds = torch.arange(batch_size).view( + -1, 1).expand_as(topk_inds).long() + # Avoid onnx2tensorrt issue in https://github.com/NVIDIA/TensorRT/issues/1134 # noqa: E501 + # transformed_inds = boxes.shape[1] * batch_inds + topk_inds + transformed_inds = (boxes.shape[1] * batch_inds.int()) + topk_inds.int() + transformed_inds = transformed_inds.long() + boxes = boxes.reshape(-1, 4)[transformed_inds, :].reshape( + batch_size, -1, 4) + scores = scores.reshape(-1, num_class)[transformed_inds, :].reshape( + batch_size, -1, num_class) + if labels is not None: + labels = labels.reshape(-1, 1)[transformed_inds].reshape( + batch_size, -1) + + scores = scores.permute(0, 2, 1) + num_box = boxes.shape[1] + # turn off tracing to create a dummy output of nms + state = torch._C._get_tracing_state() + # dummy indices of nms's output + num_fake_det = 2 + batch_inds = torch.randint(batch_size, (num_fake_det, 1)) + cls_inds = torch.randint(num_class, (num_fake_det, 1)) + box_inds = torch.randint(num_box, (num_fake_det, 1)) + indices = torch.cat([batch_inds, cls_inds, box_inds], dim=1) + output = indices + setattr(DummyONNXNMSop, 'output', output) + + # open tracing + torch._C._set_tracing_state(state) + selected_indices = DummyONNXNMSop.apply(boxes, scores, + max_output_boxes_per_class, + iou_threshold, score_threshold) + + batch_inds, cls_inds = selected_indices[:, 0], selected_indices[:, 1] + box_inds = selected_indices[:, 2] + if labels is None: + labels = torch.arange(num_class, dtype=torch.long).to(scores.device) + labels = labels.view(1, num_class, 1).expand_as(scores) + scores = scores.reshape(-1, 1) + boxes = boxes.reshape(batch_size, -1).repeat(1, num_class).reshape(-1, 4) + # pos_inds = (num_class * batch_inds + cls_inds) * num_box + box_inds # original + pos_inds = (num_class * batch_inds.int()) + cls_inds.int() + pos_inds = (pos_inds * num_box.int()) + box_inds.int() + pos_inds = pos_inds.long() + # pos_inds = (batch_inds.new_tensor(num_class) * batch_inds + cls_inds) * batch_inds.new_tensor(num_box) + box_inds + mask = scores.new_zeros(scores.shape) + # Avoid onnx2tensorrt issue in https://github.com/NVIDIA/TensorRT/issues/1134 # noqa: E501 + # PyTorch style code: mask[batch_inds, box_inds] += 1 + mask[pos_inds, :] += 1 + scores = scores * mask + boxes = boxes * mask + + scores = scores.reshape(batch_size, -1) + boxes = boxes.reshape(batch_size, -1, 4) + labels = labels.reshape(batch_size, -1) + + if boxes.dtype != torch.float: + boxes = boxes.float() + scores = scores.float() + + if after_top_k > 0: + nms_after = torch.tensor( + after_top_k, device=scores.device, dtype=torch.long) + nms_after = get_k_for_topk(nms_after, num_box * num_class) + + if nms_after > 0: + _, topk_inds = scores.topk(nms_after) + batch_inds = torch.arange(batch_size).view(-1, 1).expand_as(topk_inds).long() + # Avoid onnx2tensorrt issue in https://github.com/NVIDIA/TensorRT/issues/1134 # noqa: E501 + batch_inds = scores.shape[1] * batch_inds + # transformed_inds = batch_inds + topk_inds + 
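+            # (added note) the flattened gather index below is computed with
+            # int32 arithmetic and cast back to int64 only for the actual
+            # indexing, matching the onnx2tensorrt workaround referenced above.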
transformed_inds = batch_inds.int() + topk_inds.int() + transformed_inds = transformed_inds.long() + scores = scores.reshape(-1, 1)[transformed_inds, :].reshape( + batch_size, -1) + boxes = boxes.reshape(-1, 4)[transformed_inds, :].reshape( + batch_size, -1, 4) + labels = labels.reshape(-1, 1)[transformed_inds, :].reshape( + batch_size, -1) + + scores = scores.unsqueeze(2) + dets = torch.cat([boxes, scores], dim=2) + return dets, labels + + +class DummyONNXNMSop(torch.autograd.Function): + """DummyONNXNMSop. + + This class is only for creating onnx::NonMaxSuppression. + """ + + @staticmethod + def forward(ctx, boxes, scores, max_output_boxes_per_class, iou_threshold, + score_threshold): + + return DummyONNXNMSop.output + + @staticmethod + def symbolic(g, boxes, scores, max_output_boxes_per_class, iou_threshold, + score_threshold): + return g.op( + 'NonMaxSuppression', + boxes, + scores, + max_output_boxes_per_class, + iou_threshold, + score_threshold, + outputs=1) diff --git a/AscendIE/TorchAIE/built-in/cv/detection/ssd/ssd_postprocess.py b/AscendIE/TorchAIE/built-in/cv/detection/ssd/ssd_postprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..09dc47196e648d85a7351a7fe00533c0d8a527f9 --- /dev/null +++ b/AscendIE/TorchAIE/built-in/cv/detection/ssd/ssd_postprocess.py @@ -0,0 +1,294 @@ +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""coco postprocess""" + +import os +import numpy as np +import argparse +import cv2 +import warnings +import torch +import time +try: + from torch import npu_batch_nms as NMSOp + NMS_ON_NPU = True +except: + from torchvision.ops import batched_nms as NMSOp + NMS_ON_NPU = False + +CLASSES = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', + 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', + 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', + 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', + 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', + 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', + 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', + 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', + 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', + 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', + 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', + 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', + 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', + 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'] + + +def coco_postprocess(bbox, image_size, net_input_width, net_input_height): + """ + This function is postprocessing for FasterRCNN output. 
+ + Before calling this function, reshape the raw output of FasterRCNN to + following form + numpy.ndarray: + [x, y, width, height, confidence, probability of 80 classes] + shape: (100,) + The postprocessing restore the bounding rectangles of FasterRCNN output + to origin scale and filter with non-maximum suppression. + + :param bbox: a numpy array of the FasterRCNN output + :param image_path: a string of image path + :return: three list for best bound, class and score + """ + w = image_size[0] + h = image_size[1] + scale_w = net_input_width / w + scale_h = net_input_height / h + + # cal predict box on the image src + pbox = bbox.copy() + pbox[:, 0] = (bbox[:, 0]) / scale_w + pbox[:, 1] = (bbox[:, 1]) / scale_h + pbox[:, 2] = (bbox[:, 2]) / scale_w + pbox[:, 3] = (bbox[:, 3]) / scale_h + return pbox + + +def np_clip_bbox(bboxes, max_shape): + x1, y1, x2, y2 = bboxes[:, 0], bboxes[:, 1], bboxes[:, 2], bboxes[:, 3] + h, w = max_shape + x1 = x1.clip(min=0, max=w) + y1 = y1.clip(min=0, max=h) + x2 = x2.clip(min=0, max=w) + y2 = y2.clip(min=0, max=h) + bboxes = np.stack([x1, y1, x2, y2], axis=-1) + return bboxes + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--bin_data_path", + default="/onnx/ssd/cpp_infer_res_cpu") + parser.add_argument("--test_annotation", default="./coco2017_ssd_jpg.info") + parser.add_argument("--det_results_path", default="./postprocess_out_cpu/") + parser.add_argument("--net_out_num", default=2, type=int) + parser.add_argument("--num_pred_box", default=8732, type=int) + parser.add_argument("--nms_pre", default=200, type=int) + parser.add_argument("--net_input_width", default=300, type=int) + parser.add_argument("--net_input_height", default=300, type=int) + parser.add_argument("--min_bbox_size", default=0.01, type=float) + parser.add_argument("--score_threshold", default=0.02, type=float) + parser.add_argument("--nms", default=True, type=bool) + parser.add_argument("--iou_threshold", default=0.45, type=float) + parser.add_argument("--max_per_img", default=200, type=int) + parser.add_argument("--ifShowDetObj", action="store_true", default=True, + help="if input the para means True, neither False.") + parser.add_argument("--start", default=0, type=float) + parser.add_argument("--end", default=1, type=float) + parser.add_argument("--device", default=0, type=int) + parser.add_argument("--clear_cache", action='store_true') + flags = parser.parse_args() + + # generate dict according to annotation file for query resolution + # load width and height of input images + img_size_dict = dict() + with open(flags.test_annotation)as f: + for line in f: + temp = line.split(" ") + img_file_path = temp[1] + img_name = temp[1].split("/")[-1].split(".")[0] + img_width = int(temp[2]) + img_height = int(temp[3]) + img_size_dict[img_name] = (img_width, img_height, img_file_path) + + # read bin file for generate predict result + bin_path = flags.bin_data_path # 推理结果保存路径 + det_results_path = flags.det_results_path + os.makedirs(det_results_path, exist_ok=True) + total_img = set([name[:name.rfind('_')] + for name in os.listdir(bin_path) if "bin" in name]) + total_img = sorted(total_img) # list of img names (str) + num_img = len(total_img) # 5000 + start = int(flags.start * num_img) + end = int(flags.end * num_img) + task_len = end - start + 1 + + finished = 0 + time_start = time.time() + for img_id in range(start, end): + # for img_id, bin_file in enumerate(sorted(total_img)): + bin_file = total_img[img_id] + path_base = os.path.join(bin_path, bin_file) 
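+        # (added note) every image is expected to contribute two raw output
+        # files under bin_data_path: '<name>_0.bin' with the predicted boxes
+        # (num_pred_box x 4) and '<name>_1.bin' with the per-class scores
+        # (num_pred_box x 80); both are loaded and reshaped below.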
+ det_results_file = os.path.join(det_results_path, bin_file + ".txt") + if os.path.exists(det_results_file) and not flags.clear_cache: + continue + + # load all detected output tensor + bbox_file = path_base + "_" + str(0) + ".bin" + score_file = path_base + "_" + str(1) + ".bin" + assert os.path.exists( + bbox_file), '[ERROR] file `{}` not exist'.format(bbox_file) + assert os.path.exists( + score_file), '[ERROR] file `{}` not exist'.format(score_file) + bboxes = np.fromfile(bbox_file, dtype="float32").reshape( + flags.num_pred_box, 4) + scores = np.fromfile(score_file, dtype="float32").reshape( + flags.num_pred_box, 80) + + bboxes = torch.from_numpy(bboxes) + scores = torch.from_numpy(scores) + try: + bboxes = bboxes.npu(flags.device) + scores = scores.npu(flags.device) + except: + warnings.warn('npu is not available, running on cpu') + + max_scores, _ = scores.max(-1) # shape of [8732], torch.float32 + keep_inds = (max_scores > flags.score_threshold).nonzero( + as_tuple=False).view(-1) + bboxes = bboxes[keep_inds, :] + scores = scores[keep_inds, :] + + if flags.nms_pre > 0 and flags.nms_pre < bboxes.shape[0]: + max_scores, _ = scores.max(-1) # shape: torch.Size([2738]) dtype:torch.float32 + _, topk_inds = max_scores.topk(flags.nms_pre) + bboxes = bboxes[topk_inds, :] # shape: torch.Size([200, 4]) + scores = scores[topk_inds, :] # shape: torch.Size([200, 80]) + + # clip bbox border + bboxes[:, 0::2].clamp_(min=0, max=flags.net_input_width - 1) + bboxes[:, 1::2].clamp_(min=0, max=flags.net_input_height - 1) + + # remove small bbox + bboxes_width_height = bboxes[:, 2:] - bboxes[:, :2] + valid_bboxes = bboxes_width_height > flags.min_bbox_size + keep_inds = (valid_bboxes[:, 0] & valid_bboxes[:, 1] + ).nonzero(as_tuple=False).view(-1) + bboxes = bboxes[keep_inds, :] + scores = scores[keep_inds, :] + + # rescale bbox to original image size + original_img_info = img_size_dict[bin_file] + rescale_factor = torch.tensor([ + original_img_info[0] / flags.net_input_width, + original_img_info[1] / flags.net_input_height] * 2, + dtype=bboxes.dtype, device=bboxes.device) + bboxes *= rescale_factor + + if flags.nms: + if NMS_ON_NPU: + # repeat bbox for each class + # (N, 4) -> (B, N, 80, 4), where B = 1 is the batchsize + bboxes = bboxes[None, :, None, :].repeat(1, 1, 80, 1) + # (N, 80) -> (B, N, 80), where B = 1 is the batchsize + scores = scores[None, :, :] + + # bbox batched nms + bboxes, scores, labels, num_total_bboxes = \ + NMSOp( + bboxes.half(), scores.half(), + score_threshold=flags.score_threshold, + iou_threshold=flags.iou_threshold, + max_size_per_class=flags.max_per_img, + max_total_size=flags.max_per_img) + bboxes = bboxes[0, :num_total_bboxes, :] + scores = scores[0, :num_total_bboxes] + class_idxs = labels[0, :num_total_bboxes] + else: + # repeat bbox and class idx for each class + bboxes = bboxes[:, None, :].repeat( + 1, 80, 1) # (N, 4) -> (N, 80, 4) + class_idxs = torch.arange(80, dtype=torch.long, device=bboxes.device + )[None, :].repeat(bboxes.shape[0], 1) # (80) -> (N, 80) + + # reshape bbox for torch nms + bboxes = bboxes.view(-1, 4) + scores = scores.view(-1) + class_idxs = class_idxs.view(-1) + + # bbox batched nms + keep_inds = NMSOp(bboxes, scores, class_idxs, + flags.iou_threshold) + bboxes = bboxes[keep_inds] + scores = scores[keep_inds] + class_idxs = class_idxs[keep_inds] + else: + # repeat bbox and class idx for each class + bboxes = bboxes[:, None, :].repeat( + 1, 80, 1) # (N, 4) -> (N, 80, 4) + class_idxs = torch.arange(80, dtype=torch.long, device=bboxes.device + 
)[None, :].repeat(bboxes.shape[0], 1) # (80) -> (N, 80) + + # reshape bbox for torch nms + bboxes = bboxes.view(-1, 4) + scores = scores.view(-1) + class_idxs = class_idxs.view(-1) + + # keep topk max_per_img bbox + if flags.max_per_img > 0 and flags.max_per_img < bboxes.shape[0]: + _, topk_inds = scores.topk(flags.max_per_img) + bboxes = bboxes[topk_inds, :] + scores = scores[topk_inds] + class_idxs = class_idxs[topk_inds] + + # move to cpu if running on npu + if bboxes.device != 'cpu': + bboxes = bboxes.cpu() + scores = scores.cpu() + class_idxs = class_idxs.cpu() + + # convert to numpy.ndarray + bboxes = bboxes.numpy() + scores = scores.numpy() + class_idxs = class_idxs.numpy() + + # make det result file + if flags.ifShowDetObj == True: + imgCur = cv2.imread(original_img_info[2]) + + det_results_str = '' + for idx in range(bboxes.shape[0]): + x1, y1, x2, y2 = bboxes[idx, :] + predscore = scores[idx] + class_ind = class_idxs[idx] + + class_name = CLASSES[int(class_ind)] + det_results_str += "{} {} {} {} {} {}\n".format( + class_name, predscore, x1, y1, x2, y2) + if flags.ifShowDetObj == True: + imgCur = cv2.rectangle(imgCur, (int(x1), int( + y1)), (int(x2), int(y2)), (0, 255, 0), 1) + imgCur = cv2.putText(imgCur, class_name + '|' + str(predscore), + (int(x1), int(y1)), cv2.FONT_HERSHEY_SIMPLEX, + 0.5, (0, 0, 255), 1) + + if flags.ifShowDetObj == True: + cv2.imwrite(os.path.join(det_results_path, bin_file + + '.jpg'), imgCur, [int(cv2.IMWRITE_JPEG_QUALITY), 70]) + + with open(det_results_file, "w") as detf: + detf.write(det_results_str) + + finished += 1 + speed = finished / (time.time() - time_start) + print('processed {:5d}/{:<5d} images, speed: {:.2f}FPS'.format( + finished, task_len, speed), end='\r')
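+
+# (added note) each per-image .txt written above lists one detection per line
+# as '<class_name> <score> <x1> <y1> <x2> <y2>'; this plain-text format is what
+# the txt_to_json.py step in the README converts before coco_eval.py computes
+# the mAP.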