Vehicle V2 Density Map - Optimization Options

Model size / speed / accuracy trade-offs | Current: ResNet50 + FPN | Target: faster inference

Current (V2): 25M params | 408 img/s | MAE 1.52

Smallest option: 2.5M params | MobileNetV3-Small (10x smaller)

Fastest path: 2-4x | TensorRT (no retrain)

Best balance: MobileNetV3-Large (5x smaller)

Option Comparison

Option	Backbone	Params	Est. img/s	Est. MAE	Model Size	Recommend
Current V2	ResNet50	25M	408	1.52	102 MB	—
A	ResNet18	11M	~700	~1.60	~45 MB	★★★
B	EfficientNet-B0	5M	~800	~1.65	~22 MB	★★★
C (recommended)	MobileNetV3-Large	5M	~1000	~1.70	~20 MB	★★★ best
D	MobileNetV3-Small	2.5M	~1500	~1.90	~10 MB	★★
E	GhostNet v2	4.9M	~1200	~1.75	~20 MB	★★
F	TinyViT-5M	5M	~900	~1.65	~20 MB	★★
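
To make the table easier to compare, the estimates can be turned into a speedup over V2 and a relative MAE change. The following is a minimal sketch that copies the table values as-is; the `OPTIONS` dictionary and the `tradeoff` helper are illustrative names, not part of the project code.

```python
# Estimated (img/s, MAE) per option, copied from the comparison table above.
BASELINE = ("Current V2", 408, 1.52)
OPTIONS = {
    "A ResNet18": (700, 1.60),
    "B EfficientNet-B0": (800, 1.65),
    "C MobileNetV3-Large": (1000, 1.70),
    "D MobileNetV3-Small": (1500, 1.90),
    "E GhostNet v2": (1200, 1.75),
    "F TinyViT-5M": (900, 1.65),
}


def tradeoff(img_s, mae, base_img_s=BASELINE[1], base_mae=BASELINE[2]):
    """Return (speedup over baseline, relative MAE increase in percent)."""
    speedup = img_s / base_img_s
    mae_increase_pct = (mae - base_mae) / base_mae * 100.0
    return speedup, mae_increase_pct


if __name__ == "__main__":
    for name, (img_s, mae) in OPTIONS.items():
        speedup, mae_pct = tradeoff(img_s, mae)
        print(f"{name:22s} {speedup:4.2f}x faster, MAE +{mae_pct:4.1f}%")
```

For option C this works out to roughly 2.45x the throughput for about 12% higher MAE, using only the estimates in the table.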

[Figure: Speed vs Accuracy scatter plot]

Recommendation: MobileNetV3-Large

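Why we recommend it
Same architecture as the Fire/Smoke model, so code can be reused
Built into timm, with complete pretrained weights
Depthwise-separable convolutions are very fast, at only a small cost in MAE
The most mature "small vision" baseline in the community
Estimated training time on 5090-2: 1-2 hours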

Architecture changes

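import timm

# Before: ResNet50 + FPN
# (built via timm here for comparison; this is an assumption, the V2 code
# may construct its ResNet50 differently)
backbone = timm.create_model(
    "resnet50", pretrained=True, features_only=True,
    out_indices=(1, 2, 3, 4),  # layer1-4
)
# layer1-4 outputs: 256, 512, 1024, 2048 channels
# 4-level FPN

# After: MobileNetV3-Large + simplified FPN
backbone = timm.create_model(
    "mobilenetv3_large_100.ra_in1k",
    pretrained=True, features_only=True
)
# Feature outputs (out_indices 0-4): 16, 24, 40, 112, 960 channels
# Use blocks 2/4/6 (40, 112, 960 channels) for 3-level FPN
# Or just blocks.4 (112 channels) for single-level decoder (even faster)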

Simplify the decoder for further speedup

Current FPN: 4 lateral + 3 smooth + 3 head = 10 conv layers
Minimal version: 1 lateral + 1 conv + upsample + 1 head = 4 layers (3 conv + 1 upsample), saving roughly another 30% of inference time
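
The minimal decoder described above could look roughly like the following PyTorch sketch. The class name `MinimalDensityDecoder`, the 64 intermediate channels, the 2x upsampling factor, and the final ReLU are all assumptions chosen for illustration; the input width of 112 matches the `blocks.4` feature map. Summing the density map to get a count is the usual convention for density-map counting and is likewise assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MinimalDensityDecoder(nn.Module):
    """Single-level decoder: 1 lateral + 1 conv + upsample + 1 head."""

    def __init__(self, in_channels=112, mid_channels=64, upsample=2):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.conv = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
        self.head = nn.Conv2d(mid_channels, 1, kernel_size=1)
        self.upsample = upsample

    def forward(self, feat):
        x = F.relu(self.lateral(feat))
        x = F.relu(self.conv(x))
        x = F.interpolate(
            x, scale_factor=self.upsample, mode="bilinear", align_corners=False
        )
        # ReLU keeps the predicted density non-negative.
        return F.relu(self.head(x))


decoder = MinimalDensityDecoder()
feat = torch.randn(2, 112, 16, 16)   # e.g. a stride-16 map of a 256x256 input
density = decoder(feat)              # one-channel density map per image
count = density.sum(dim=(1, 2, 3))   # assumed convention: count = sum of density
```

Only three convolutions run per image here, compared with ten in the current FPN, which is where the additional time saving would come from.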

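Alternative path: optimize without retraining

Optimization	Speedup	Difficulty	Accuracy impact	Notes
TensorRT fp16 engine	2-3x	Low	None	`torch.onnx.export` → `trtexec --fp16`
TensorRT INT8	a further 1.5-2x	Medium	+1-3% MAE	Requires calibration data (100-500 images)
Reduce input to 256x256	2.3x	Very low	+5-10% MAE	Only change the preprocessing resize size
torch.compile	1.2-1.5x	Very low	None	Add one line: `model = torch.compile(model)`
ONNX Runtime	1.3-1.7x	Low	None	CPU/edge deployment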

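Most direct combination

If the target is DeepStream deployment, convert the existing V2 directly to a TensorRT INT8 engine with no retraining: 408 img/s → about 2000 img/s (roughly 5x speedup). INT8 needs a calibration set of 100-500 images; the conversion itself takes about half an hour.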
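
The 5x figure can be sanity-checked against the speedup ranges in the table. The sketch below multiplies the FP16 range by the additional INT8 range, on the assumption that the INT8 speedup stacks on top of FP16; the function name `int8_throughput_range` is illustrative.

```python
BASE_IMG_S = 408            # current V2 throughput from the summary above
FP16_SPEEDUP = (2.0, 3.0)   # "TensorRT fp16 engine" row
INT8_EXTRA = (1.5, 2.0)     # "TensorRT INT8" row, assumed to stack on FP16


def int8_throughput_range(base=BASE_IMG_S, fp16=FP16_SPEEDUP, int8=INT8_EXTRA):
    """Multiply the two speedup ranges to bound INT8 throughput in img/s."""
    low = base * fp16[0] * int8[0]
    high = base * fp16[1] * int8[1]
    return low, high


low, high = int8_throughput_range()
print(f"INT8 estimate: {low:.0f}-{high:.0f} img/s")
```

Under these assumptions the range is 1224-2448 img/s, so ~2000 img/s falls in the upper half of what the table implies and should be confirmed by measurement.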

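Choosing an option by use case

Scenario	Recommended option	Target
Highest fps (multiple cameras processed concurrently)	MobileNetV3-Large + TensorRT INT8	3000+ img/s
Lowest latency (real-time single frame)	MobileNetV3-Small + TensorRT	< 1 ms/frame
Edge deployment (Jetson Nano/NX)	MobileNetV3-Small + INT8	Low power, 100-300 img/s
Server deployment (save GPU resources)	Current V2 + TensorRT	~2000 img/s, no retraining needed
Accuracy first	ResNet18 + TensorRT	900+ img/s, MAE only +0.08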
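
The targets in this table can be translated into deployment terms with simple arithmetic. The sketch below is illustrative only: the 30 fps camera rate is an assumption, and the latency helper assumes batch size 1, since batched throughput does not translate directly into per-frame latency.

```python
def cameras_supported(img_per_s, camera_fps=30.0):
    """Number of camera streams one model instance can serve at camera_fps."""
    return int(img_per_s // camera_fps)


def min_throughput_for_latency(latency_ms):
    """img/s needed at batch size 1 so each frame takes at most latency_ms."""
    return 1000.0 / latency_ms


print(cameras_supported(3000))            # 100
print(cameras_supported(2000))            # 66
print(min_throughput_for_latency(1.0))    # 1000.0
```

At an assumed 30 fps per camera, 3000 img/s covers about 100 streams, and a budget of under 1 ms per frame requires more than 1000 img/s at batch size 1.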

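Implementation paths

Path	Time	Result
Path 1: TensorRT optimization (zero retraining)	30 minutes	V2 → ~2000 img/s, MAE roughly unchanged (INT8 may cost 1-3%)
Path 2: train MobileNetV3-Large	1-2 hours	V3 → ~1000 img/s, MAE ~1.7
Path 3: combine both ⭐	2-3 hours	V3+TRT → ~3000 img/s, MAE ~1.7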

Next Action

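Recommendation:

1. Start with Path 1 (TensorRT): spend half an hour verifying the existing model's actual deployment speed
2. If 2000 img/s is enough → stop and deploy directly
3. If it is still not fast enough, or the target is edge deployment → do Path 2 and train the MobileNetV3 version
4. Finally, convert that model to TRT INT8 → Path 3, the final version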
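
Since every throughput number in this report is an estimate, the first step is a measurement. The helper below is a minimal, framework-agnostic sketch of such a check: `measure_img_per_s` and `fast_enough` are hypothetical names, and `infer` stands for whatever callable runs the deployed engine. For GPU inference, `infer` should block until the result is ready (for example by synchronizing the device), otherwise only the launch time is measured.

```python
import time


def measure_img_per_s(infer, batch, batch_size, warmup=10, iters=50):
    """Time `infer(batch)` and return throughput in images per second."""
    for _ in range(warmup):          # warm-up runs are excluded from timing
        infer(batch)
    start = time.perf_counter()
    for _ in range(iters):
        infer(batch)
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def fast_enough(img_per_s, required_img_per_s=2000):
    """Decision from the plan above: stop and deploy if the target is met."""
    return img_per_s >= required_img_per_s


if __name__ == "__main__":
    # Stand-in workload; replace with the real engine call.
    dummy_batch = list(range(32))
    rate = measure_img_per_s(lambda b: sum(b), dummy_batch, batch_size=32)
    print(f"{rate:.0f} img/s, fast enough: {fast_enough(rate)}")
```

The 2000 img/s default mirrors the threshold in the recommendation; the real requirement depends on the deployment.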

Generated 2026-04-14 | Benchmark estimates based on GB10/RTX 5090 profiles | Related: Models Guide | Current V2 Report