RAI Models - Reproduction Guide
Complete training methodology, data sources, hyperparameters, and paths for reproducing all 4 models
1. Overview
| Model | Task | Data Source | Backbone | Machine | Metric |
| --- | --- | --- | --- | --- | --- |
| Vehicle V1 | 5-class regression counting | CVAT Project 24 (TSP6K + COCO) | ResNet50 | gx10 (GB10) | MAE 1.744 |
| Vehicle V2 | 5-class density map counting | CVAT Project 24 (same) | ResNet50 + FPN | gx10 (GB10) | MAE 1.523 |
| Person | Person density map | CVAT Project 7 (Mapillary + diverse) | ResNet50 + FPN | gx10 (GB10) | MAE 1.798 |
| Fire/Smoke | Multi-label classification | CVAT Project 23 (5 sources) | MobileNetV3-Large | 5090-2 (RTX 5090) | mAP 0.9856 |
2. Environment Setup
Python & CUDA
Python 3.12
CUDA 13.0 (cu128 PyTorch wheels work on CUDA 13)
torch==2.11.0
torchvision==0.26.0
Required packages
pip install torch==2.11.0 torchvision==0.26.0 \
timm==1.0.26 \
scikit-learn \
opencv-python-headless \
scipy \
pillow \
numpy \
requests \
tqdm
Hardware used
| Machine | GPU | Used for |
| --- | --- | --- |
| gx10 | NVIDIA GB10 (128GB unified) | Vehicle V1, V2, Person |
| 5090-2 | 2× RTX 5090 (32GB each) | Fire/Smoke v20260413 |
3. Vehicle V1 — Direct Regression
Architecture
import torch
from torch import nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 5)  # 5 classes
# Output: (B, 5) raw scalars in log1p space
# Loss: SmoothL1Loss on log1p(counts)
# Inference: torch.expm1(output).clamp(min=0)
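Since targets live in log1p space, the decode must invert the encode exactly; a quick round-trip check:

```python
import torch

counts = torch.tensor([[3.0, 0.0, 1.0, 7.0, 2.0]])  # GT counts for one image
targets = torch.log1p(counts)                        # regression targets
decoded = torch.expm1(targets).clamp(min=0)          # inference-time decode
```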
Data
| Source | CVAT Project 24 vehicle_seg |
| --- | --- |
| Classes | car, truck, bus, motorcycle, bicycle |
| Train | 22,758 images (46 tasks) |
| Val | 2,077 images (5 tasks) |
| Test | 1,794 images (4 tasks, unlabeled, do not use) |
| Sources within P24 | TSP6K (13 tasks, 6,000 frames) + COCO (42 tasks, 20,629 frames) |
| Annotation format | Polygon (count shapes per class per frame) |
Hyperparameters
img_size = 384
batch_size = 32
epochs = 25 # (best was @ epoch 15)
lr = 1e-4
weight_decay = 1e-4
optimizer = AdamW
scheduler = OneCycleLR(max_lr=1e-4, pct_start=0.1, anneal_strategy="cos")
augmentation = RandomResizedCrop(384, scale=(0.7, 1.0)) + HFlip + ColorJitter(0.2)
loss = SmoothL1Loss on log1p(counts)
amp_dtype = bfloat16
memory_format = channels_last
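The recipe above (channels_last inputs, bf16 autocast, SmoothL1 on log1p counts, per-batch OneCycleLR stepping) can be sketched as a single training epoch. `train_one_epoch` is a hypothetical helper, not taken from `train_counting.py`:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, scheduler, device="cuda"):
    """One V1 epoch: channels_last inputs, bf16 autocast, SmoothL1 on log1p counts."""
    criterion = nn.SmoothL1Loss()
    model.train()
    for images, counts in loader:
        images = images.to(device, memory_format=torch.channels_last)
        targets = torch.log1p(counts.to(device))  # regress in log1p space
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = criterion(model(images), targets)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch
    return loss.item()
```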
4. Vehicle V2 — Density Map
Architecture (ResNet50 + FPN Decoder)
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class CountingDecoder(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.lat4 = nn.Conv2d(2048, 256, 1)  # lateral 1x1 projections
        self.lat3 = nn.Conv2d(1024, 256, 1)
        self.lat2 = nn.Conv2d(512, 256, 1)
        self.lat1 = nn.Conv2d(256, 256, 1)
        # smooth: one Conv3x3 + ReLU per merged level
        self.smooth = nn.ModuleList(
            nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(3)
        )
        # head: Conv(256,128) + Conv(128,64) + Conv(64, num_classes)
        self.head = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        c1 = self.layer1(self.stem(x))  # 1/4
        c2 = self.layer2(c1)            # 1/8
        c3 = self.layer3(c2)            # 1/16
        c4 = self.layer4(c3)            # 1/32
        # top-down FPN: upsample + add + smooth
        p4 = self.lat4(c4)
        p3 = self.smooth[0](self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest"))
        p2 = self.smooth[1](self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest"))
        p1 = self.smooth[2](self.lat1(c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest"))
        return F.relu(self.head(p1))    # (B, C, H/4, W/4), non-negative
Data
Same as V1. Additional data preparation: extract polygon centroids per frame per class, save to centroids.jsonl:
{
"image": "images/Train/job_XXX_frame_N.jpg",
"subset": "Train",
"job_id": XXX,
"frame": N,
"points": {
"car": [[x, y], ...],
"truck": [[x, y], ...],
"bus": [...], "motorcycle": [...], "bicycle": [...]
}
}
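centroids.jsonl holds one JSON object per line; a minimal loader (`load_centroids` is a hypothetical helper, not one of the training scripts):

```python
import json

def load_centroids(path):
    # one record per line (JSON Lines)
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```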
GT Density Map Generation
import numpy as np
from scipy.ndimage import gaussian_filter

CLASSES = ["car", "truck", "bus", "motorcycle", "bicycle"]

def make_gt_density(points, map_h, map_w, img_w, img_h, sigma=4.0):
density = np.zeros((5, map_h, map_w), dtype=np.float32)
sx, sy = map_w / img_w, map_h / img_h
for ci, cname in enumerate(CLASSES):
for px, py in points.get(cname, []):
cx, cy = int(round(px * sx)), int(round(py * sy))
if 0 <= cx < map_w and 0 <= cy < map_h:
density[ci, cy, cx] += 1.0
if density[ci].sum() > 0:
density[ci] = gaussian_filter(density[ci], sigma=sigma)
return density
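A sanity check on the generator above: Gaussian smoothing redistributes mass but, away from the map borders, preserves the per-class total, so summing a GT map recovers the count. A minimal check with one centroid:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

density = np.zeros((96, 96), dtype=np.float32)
density[48, 48] = 1.0                       # one centroid at the map center
smoothed = gaussian_filter(density, sigma=4.0)
# smoothed.sum() stays ~1.0: the count is preserved
```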
Hyperparameters
img_size = 384
map_stride = 4 # output is 96x96
sigma = 4.0 # Gaussian kernel sigma (vehicles are relatively large)
batch_size = 24
epochs = 25 # (best was @ epoch 19)
lr = 1e-4
loss = MSE(pred_map, gt_map) * 1000 + 0.1 * L1(pred.sum(), gt.sum())
# density_scale = 1000 (MSE on density values which are small)
# count_weight = 0.1 (auxiliary count consistency)
# Person model: same but num_classes=1, sigma=2.5 (people are smaller)
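The composite loss above as one hedged sketch (`density_loss` is illustrative; the names `density_scale` and `count_weight` mirror the comments):

```python
import torch
import torch.nn.functional as F

def density_loss(pred, gt, density_scale=1000.0, count_weight=0.1):
    # MSE on the small-valued density maps, scaled for usable gradient magnitude
    map_loss = F.mse_loss(pred, gt) * density_scale
    # auxiliary L1 between predicted and GT per-class total counts
    count_loss = F.l1_loss(pred.sum(dim=(2, 3)), gt.sum(dim=(2, 3)))
    return map_loss + count_weight * count_loss
```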
5. Person — Density Map
Identical architecture to Vehicle V2, but:
- num_classes=1 (person only)
- sigma=2.5 (people are smaller than vehicles)
- max_epochs=50 with patience=5 early stopping
- per-frame progress report every 5 epochs
Data
| Source | CVAT Project 7 person |
| --- | --- |
| Train | 13,393 frames (202 tasks) |
| Val | 871 frames (78 tasks) — actually 2,031 per centroids.jsonl |
| Test | 4,091 frames (233 tasks) |
| Sources within P7 | Mapillary, hotel/airport surveillance, factory cleanrooms, reservoirs/bridges, Taipower, diverse field sites |
| Annotation format | MIXED: rectangle + polygon |
Centroid extraction (handles both shape types)
import numpy as np

def centroid(shape):
    # CVAT stores points as a flat list [x1, y1, x2, y2, ...]
    pts = shape["points"]
    if shape["type"] == "polygon":
        return (np.mean(pts[0::2]), np.mean(pts[1::2]))        # mean of vertices
    elif shape["type"] == "rectangle":
        return ((pts[0] + pts[2]) / 2, (pts[1] + pts[3]) / 2)  # box center
    return None, None  # unsupported shape type
6. Fire/Smoke — Multi-Label Classification
Architecture
import timm
model = timm.create_model(
"mobilenetv3_large_100.ra_in1k",
pretrained=True,
num_classes=2, # [smoke, fire] multi-label
drop_rate=0.3,
)
# Output: (B, 2) raw logits
# Apply torch.sigmoid() for probabilities (NOT softmax - it's multi-label)
# Loss: BCEWithLogitsLoss with pos_weight for class imbalance
Data
| Source | CVAT Project 23 fire_smoke_tags |
| --- | --- |
| Total | 211,024 frames across 302 tasks |
| Train | 179,001 (smoke_only:95,676 / fire_only:63,026 / smoke+fire:13,582 / negative:6,717) |
| Val | 14,765 |
| Test | 17,258 |
| Annotation format | CVAT tags (not shapes) — smoke / fire per frame |
Data sources within P23
| Source | Type | Contribution |
| --- | --- | --- |
| ForestFireSmoke | Wildfire dataset | Large general smoke/fire |
| Azimjaan FireSmoke | Web-scraped | Train + Test |
| Zenodo Indoor | Indoor scenes | Train + Val + Test |
| SK (東海/大甲) | Field cameras | 76 tasks, 4,005 frames — domain-specific factory smoke |
| Others (Vietnam, misc) | Various | Diverse augmentation |
Hyperparameters
img_size = 224
batch_size = 96
epochs = 15
lr = 5e-4
weight_decay = 0.05
drop_rate = 0.3
mixup_alpha = 0.2
patience = 4
augmentation:
- Resize(256, 256) + RandomCrop(224)
- RandomHorizontalFlip + RandomVerticalFlip(p=0.2)
- ColorJitter(0.4) p=0.8
- GaussianBlur(kernel=5, sigma=0.1-2.0) p=0.2
- RandomGrayscale(p=0.05)
- RandomRotation(15) p=0.3
- RandomErasing(p=0.25, scale=(0.02, 0.15))
loss = BCEWithLogitsLoss(pos_weight=[neg/pos for smoke, fire])
# Computed from training data: ~[0.64, 1.34]
optimizer = AdamW
scheduler = OneCycleLR(max_lr=5e-4, pct_start=0.1)
amp_dtype = float16
memory_format = channels_last
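The pos_weight values can be rederived from the split counts in the Data table above (neg/pos per label, counting smoke+fire frames as positives for both):

```python
import torch

total = 179_001              # training frames
smoke_pos = 95_676 + 13_582  # smoke_only + smoke+fire
fire_pos = 63_026 + 13_582   # fire_only + smoke+fire
pos_weight = torch.tensor([
    (total - smoke_pos) / smoke_pos,  # smoke: ~0.64
    (total - fire_pos) / fire_pos,    # fire:  ~1.34
])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```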
Inference thresholds
| Scenario | Smoke threshold | Fire threshold |
| --- | --- | --- |
| Full-dataset optimal | 0.004 | 0.057 |
| SK field recommended | 0.32 (old) / 0.029 (new) | 0.057 |
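Applying these thresholds at inference is per-label sigmoid followed by independent comparisons (a sketch using the full-dataset optimal values; `predict` is a hypothetical helper):

```python
import torch

THRESHOLDS = torch.tensor([0.004, 0.057])  # [smoke, fire], full-dataset optimal

def predict(logits, thresholds=THRESHOLDS):
    # sigmoid per label, then independent thresholds (multi-label, no softmax)
    probs = torch.sigmoid(logits)
    return probs > thresholds

flags = predict(torch.tensor([[-6.0, 3.0]]))  # -> [[False, True]]: fire only
```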
7. CVAT Data Export
CVAT Server: https://raicvat.intemotech.com
Auth: POST /api/auth/login with username/password → get Token
Use Authorization: Token xxx header for all API calls
Fast bulk download (recommended for large datasets)
Method 1: CVAT chunk API (per task ZIP)
GET /api/tasks/{task_id}/data?type=chunk&number={N}&quality=original
→ Returns ZIP of ~500 frames per chunk
Method 2: Direct from CVAT server filesystem (fastest)
Location: /var/lib/docker/volumes/cvat_cvat_data/_data/data/{data_id}/raw/
Each task has a data_id (different from task_id)
Get via Django ORM: Task.objects.filter(project_id=N).values('id','data_id')
tar + stream over SSH for 100x speedup vs API
Method 3: Per-frame API (slow, 500 img/min max)
GET /api/jobs/{job_id}/data?type=frame&number={N}&quality=original
Fetching annotations
# Get all tasks in a project
GET /api/tasks?project_id=N&page_size=100
# Per-task annotations (returns tags + shapes + tracks)
GET /api/jobs/{job_id}/annotations
# Get label definitions
GET /api/labels?project_id=N
Annotation types encountered
| Project | Shape types | Notes |
| --- | --- | --- |
| P7 person | polygon + rectangle (mixed) | Different annotators used different tools |
| P8 car_flow | polygon + rectangle (mixed) | Same |
| P23 fire_smoke | tags (frame-level labels) | Not shapes |
| P24 vehicle_seg | polygon (mostly) + mask (496 cases) | Fixed via mask→polygon conversion |
8. Key File Paths
On gx10 (NVIDIA GB10)
~/vehicle_counting/
├── build_dataset.py # Download images + build labels.jsonl
├── build_centroids.py # Extract polygon centroids
├── train_counting.py # V1 regression training
├── train_density.py # V2 density map training
├── compare_reports.py # V1 vs V2 evaluation
├── benchmark_throughput.py # Speed benchmark
├── dataset/
│ ├── labels.jsonl # For V1 (counts per image)
│ ├── centroids.jsonl # For V2 (points per image)
│ └── images/{subset}/job_{jid}_frame_{fn}.jpg
└── runs/
├── 20260412_081835/best_full.pt # V1 (90MB)
└── 20260412_094521_density/best_full.pt # V2 (102MB)
~/person_counting/
├── person_build_dataset.py
├── train_person_density.py
├── eval_person.py
├── dataset/
│ ├── centroids.jsonl
│ └── images/{subset}/job_{jid}_frame_{fn}.jpg
└── runs/20260412_222716/best_full.pt # Person (102MB)
On 5090-2 (RTX 5090)
~/fire_smoke/
├── fire_smoke_export.py # CVAT API per-frame export (slow)
├── fire_smoke_export_parallel.py # Parallel chunk (medium)
├── build_manifest_from_tar.py # Manifest matching tar structure
├── train_fire_smoke.py # Original (gx10 paths)
├── train_fire_smoke_local.py # 5090-2 adapted
├── eval_sk.py # SK overall test
├── eval_sk_perchannel.py # Per-channel + Grad-CAM
├── dataset/
│ ├── manifest.csv # path,smoke,fire,split,task_id,data_id
│ └── images/{data_id}/raw/frame_NNNNNN.{jpg|png}
└── runs/
└── fire_smoke_v20260413/
├── best.pt # state_dict (17MB)
├── best_full.pt # full model (16MB)
└── summary.json
CVAT Server (AWS t3.xlarge, Tokyo)
# SSH: ssh -i ~/.ssh/cvat-tokyo-key.pem ubuntu@57.183.9.144
# Data volume: /var/lib/docker/volumes/cvat_cvat_data/_data/data/{data_id}/raw/
# Django ORM access:
docker exec cvat_server python3 -c "
import django, os
os.environ['DJANGO_SETTINGS_MODULE'] = 'cvat.settings.production'
django.setup()
from cvat.apps.engine.models import Task
# ..."
Pre-trained models (for reuse)
Related Reports
Reproduction Checklist
- Set up a Python 3.12 env with torch 2.11 + timm + scikit-learn + opencv-python-headless + scipy
- Get CVAT API credentials (token via POST /api/auth/login)
- Export data from CVAT using chunk API or direct server filesystem access
- Build manifests — match path format to download method (task_id/filename OR data_id/raw/filename)
- For density map models: extract polygon centroids, configure sigma (4.0 for vehicles, 2.5 for persons)
- For classification: compute pos_weight from training label distribution
- Train with AMP bf16/fp16 + channels_last + OneCycleLR + AdamW
- Save as full model (torch.save(model, ...)) for easy loading without the class definition
- Evaluate with sklearn average_precision_score (classification) or MAE/RMSE (counting)
- Consider TensorRT fp16/INT8 for deployment (2-5x speedup)
Generated 2026-04-14 | Trained on NVIDIA GB10 (gx10) and RTX 5090 (5090-2) | CVAT: raicvat.intemotech.com