RAI Models - Reproduction Guide

Complete training methodology, data sources, hyperparameters, and paths for reproducing all 4 models

Contents

1. Overview
2. Environment Setup
3. Vehicle V1 (Regression)
4. Vehicle V2 (Density Map)
5. Person (Density Map)
6. Fire/Smoke (Classification)
7. CVAT Data Export
8. All File Paths

1. Overview

| Model | Task | Data Source | Backbone | Machine | Metric |
|---|---|---|---|---|---|
| Vehicle V1 | 5-class regression counting | CVAT Project 24 (TSP6K + COCO) | ResNet50 | gx10 (GB10) | MAE 1.744 |
| Vehicle V2 | 5-class density map counting | CVAT Project 24 (same) | ResNet50 + FPN | gx10 (GB10) | MAE 1.523 |
| Person | Person density map | CVAT Project 7 (Mapillary + diverse) | ResNet50 + FPN | gx10 (GB10) | MAE 1.798 |
| Fire/Smoke | Multi-label classification | CVAT Project 23 (5 sources) | MobileNetV3-Large | 5090-2 (RTX 5090) | mAP 0.9856 |

2. Environment Setup

Python & CUDA

Python 3.12
CUDA 13.0 (cu128 PyTorch wheels work on CUDA 13)
torch==2.11.0
torchvision==0.26.0

Required packages

pip install torch==2.11.0 torchvision==0.26.0 \
            timm==1.0.26 \
            scikit-learn \
            opencv-python-headless \
            scipy \
            pillow \
            numpy \
            requests \
            tqdm

Hardware used

| Machine | GPU | Used for |
|---|---|---|
| gx10 | NVIDIA GB10 (128GB unified) | Vehicle V1, V2, Person |
| 5090-2 | 2× RTX 5090 (32GB each) | Fire/Smoke v20260413 |

3. Vehicle V1 — Direct Regression

Architecture

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 5)  # one output per vehicle class

# Output: (B, 5) raw scalars in log1p space
# Loss: SmoothL1Loss on log1p(counts)
# Inference: torch.expm1(output).clamp(min=0)
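The log1p encode/decode round trip can be illustrated in NumPy (equivalent to the `torch.expm1(...).clamp(min=0)` line above; the input values here are made up for illustration):

```python
import numpy as np

# Targets are log1p(counts); predictions decode with expm1, clamped at zero.
raw = np.array([[1.0986, -0.5, 2.3026, 0.0, 0.6931]])  # fake (B, 5) head output
counts = np.clip(np.expm1(raw), 0, None)
print(np.round(counts))  # -> [[2. 0. 9. 0. 1.]]
```

Note the clamp: the head can emit negative values, which decode to negative counts unless clipped.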

Data

| Field | Value |
|---|---|
| Source | CVAT Project 24 `vehicle_seg` |
| Classes | car, truck, bus, motorcycle, bicycle |
| Train | 22,758 images (46 tasks) |
| Val | 2,077 images (5 tasks) |
| Test | 1,794 images (4 tasks, unlabeled, do not use) |
| Sources within P24 | TSP6K (13 tasks, 6,000 frames) + COCO (42 tasks, 20,629 frames) |
| Annotation format | Polygon (count shapes per class per frame) |

Hyperparameters

img_size = 384
batch_size = 32
epochs = 25  # (best was @ epoch 15)
lr = 1e-4
weight_decay = 1e-4
optimizer = AdamW
scheduler = OneCycleLR(max_lr=1e-4, pct_start=0.1, anneal_strategy="cos")
augmentation = RandomResizedCrop(384, scale=(0.7, 1.0)) + HFlip + ColorJitter(0.2)
loss = SmoothL1Loss on log1p(counts)
amp_dtype = bfloat16
memory_format = channels_last
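Putting the recipe together, the training loop might look roughly like this (a sketch under the settings above; data loading, validation, and checkpointing are elided, and the autocast device is parameterized so the sketch also runs on CPU):

```python
import torch
from torch import nn, optim

def train_v1(model, loader, epochs=25, lr=1e-4, device="cuda"):
    """Regression training sketch: SmoothL1 on log1p(counts), AMP + OneCycleLR."""
    model = model.to(device, memory_format=torch.channels_last)
    criterion = nn.SmoothL1Loss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=len(loader),
        pct_start=0.1, anneal_strategy="cos")
    for _ in range(epochs):
        for images, counts in loader:
            images = images.to(device, memory_format=torch.channels_last)
            targets = torch.log1p(counts.to(device).float())  # train in log1p space
            with torch.autocast(device_type=device, dtype=torch.bfloat16):
                loss = criterion(model(images), targets)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```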

4. Vehicle V2 — Density Map

Architecture (ResNet50 + FPN Decoder)

import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class CountingDecoder(nn.Module):
    """ResNet50 backbone with an FPN-style top-down decoder for density maps."""

    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.lat4 = nn.Conv2d(2048, 256, 1)  # lateral 1x1 projections
        self.lat3 = nn.Conv2d(1024, 256, 1)
        self.lat2 = nn.Conv2d(512, 256, 1)
        self.lat1 = nn.Conv2d(256, 256, 1)
        # smooth: 3x Conv3x3 + ReLU (one per merged level)
        self.smooth = nn.ModuleList(
            nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(3))
        # head: Conv(256,128) + Conv(128,64) + Conv(64, num_classes)
        self.head = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1))

    def forward(self, x):
        c1 = self.layer1(self.stem(x))  # 1/4
        c2 = self.layer2(c1)            # 1/8
        c3 = self.layer3(c2)            # 1/16
        c4 = self.layer4(c3)            # 1/32
        # Top-down FPN: upsample, add lateral, smooth
        p4 = self.lat4(c4)
        p3 = self.smooth[0](self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:]))
        p2 = self.smooth[1](self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:]))
        p1 = self.smooth[2](self.lat1(c1) + F.interpolate(p2, size=c1.shape[-2:]))
        return F.relu(self.head(p1))    # (B, num_classes, H/4, W/4), non-negative

Data

Same as V1. Additional data preparation: extract polygon centroids per frame per class, save to centroids.jsonl:

{
  "image": "images/Train/job_XXX_frame_N.jpg",
  "subset": "Train",
  "job_id": XXX,
  "frame": N,
  "points": {
    "car": [[x, y], ...],
    "truck": [[x, y], ...],
    "bus": [...], "motorcycle": [...], "bicycle": [...]
  }
}
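Each record is a single JSON object per line, so the file can be streamed without loading it all at once (a minimal reader; field names as in the example above):

```python
import io
import json

def read_centroids(fp):
    """Yield (image_path, points_per_class) from a centroids.jsonl stream."""
    for line in fp:
        rec = json.loads(line)
        yield rec["image"], rec["points"]

# Tiny in-memory example in the same format:
sample = ('{"image": "images/Train/a.jpg", "subset": "Train", '
          '"points": {"car": [[10, 20]], "truck": []}}\n')
for image, points in read_centroids(io.StringIO(sample)):
    print(image, sum(len(v) for v in points.values()))  # -> images/Train/a.jpg 1
```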

GT Density Map Generation

import numpy as np
from scipy.ndimage import gaussian_filter

CLASSES = ["car", "truck", "bus", "motorcycle", "bicycle"]

def make_gt_density(points, map_h, map_w, img_w, img_h, sigma=4.0):
    """Rasterize per-class centroids onto the output grid, then Gaussian-blur."""
    density = np.zeros((len(CLASSES), map_h, map_w), dtype=np.float32)
    sx, sy = map_w / img_w, map_h / img_h  # image -> map coordinate scale
    for ci, cname in enumerate(CLASSES):
        for px, py in points.get(cname, []):
            cx, cy = int(round(px * sx)), int(round(py * sy))
            if 0 <= cx < map_w and 0 <= cy < map_h:
                density[ci, cy, cx] += 1.0
        if density[ci].sum() > 0:
            density[ci] = gaussian_filter(density[ci], sigma=sigma)
    return density
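A useful sanity check: blurring preserves mass (gaussian_filter's default reflect boundary keeps the total in-map for interior points), so each channel of the GT map should still sum to that class's object count:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# One object rasterized at the center of a 96x96 map, blurred with sigma=4:
density = np.zeros((96, 96), dtype=np.float32)
density[48, 48] = 1.0
density = gaussian_filter(density, sigma=4.0)
print(round(float(density.sum()), 3))  # -> 1.0 (the count is preserved)
```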

Hyperparameters

img_size = 384
map_stride = 4  # output is 96x96
sigma = 4.0  # Gaussian kernel sigma (vehicles are relatively large)
batch_size = 24
epochs = 25  # (best was @ epoch 19)
lr = 1e-4
loss = MSE(pred_map, gt_map) * 1000 + 0.1 * L1(pred.sum(), gt.sum())
# density_scale = 1000 (MSE on density values which are small)
# count_weight = 0.1 (auxiliary count consistency)

# Person model: same but num_classes=1, sigma=2.5 (people are smaller)
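The loss line above can be written out as a function (a sketch, with `density_scale` and `count_weight` named as in the comments):

```python
import torch
import torch.nn.functional as F

def density_loss(pred, gt, density_scale=1000.0, count_weight=0.1):
    """MSE on the (small-valued) density maps, scaled up, plus an L1
    count-consistency term on the per-image, per-class sums."""
    pixel = F.mse_loss(pred, gt) * density_scale
    count = F.l1_loss(pred.sum(dim=(2, 3)), gt.sum(dim=(2, 3)))
    return pixel + count_weight * count
```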

5. Person — Density Map

Identical architecture to Vehicle V2, but with num_classes=1 and sigma=2.5; the data differs:

Data

| Field | Value |
|---|---|
| Source | CVAT Project 7 `person` |
| Train | 13,393 frames (202 tasks) |
| Val | 871 frames (78 tasks) — actually 2,031 per centroids.jsonl |
| Test | 4,091 frames (233 tasks) |
| Sources within P7 | Mapillary, hotel/airport surveillance (旅館/機場監控), factory cleanrooms (工廠無塵室), reservoirs/bridges (水庫/橋樑), Taipower, diverse field sites (多元現場) |
| Annotation format | MIXED: rectangle + polygon |

Centroid extraction (handles both shape types)

import numpy as np

def centroid(shape):
    """Centroid of a CVAT shape; points are a flat [x1, y1, x2, y2, ...] list."""
    pts = shape["points"]
    if shape["type"] == "polygon":
        return (float(np.mean(pts[0::2])), float(np.mean(pts[1::2])))
    elif shape["type"] == "rectangle":
        return ((pts[0] + pts[2]) / 2, (pts[1] + pts[3]) / 2)
    return None, None
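Under CVAT's flat `[x1, y1, x2, y2, ...]` point layout, the helper behaves like this (the function is restated here so the snippet runs standalone):

```python
import numpy as np

def centroid(shape):
    """Centroid of a CVAT shape: polygon -> mean of vertices, rectangle -> box midpoint."""
    pts = shape["points"]
    if shape["type"] == "polygon":
        return (float(np.mean(pts[0::2])), float(np.mean(pts[1::2])))
    if shape["type"] == "rectangle":
        return ((pts[0] + pts[2]) / 2, (pts[1] + pts[3]) / 2)
    return None, None

print(centroid({"type": "polygon", "points": [0, 0, 4, 0, 4, 4, 0, 4]}))  # -> (2.0, 2.0)
print(centroid({"type": "rectangle", "points": [10, 10, 20, 30]}))        # -> (15.0, 20.0)
```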

6. Fire/Smoke — Multi-Label Classification

Architecture

import timm
model = timm.create_model(
    "mobilenetv3_large_100.ra_in1k",
    pretrained=True,
    num_classes=2,  # [smoke, fire] multi-label
    drop_rate=0.3,
)

# Output: (B, 2) raw logits
# Apply torch.sigmoid() for probabilities (NOT softmax - it's multi-label)
# Loss: BCEWithLogitsLoss with pos_weight for class imbalance

Data

| Field | Value |
|---|---|
| Source | CVAT Project 23 `fire_smoke_tags` |
| Total | 211,024 frames across 302 tasks |
| Train | 179,001 (smoke_only: 95,676 / fire_only: 63,026 / smoke+fire: 13,582 / negative: 6,717) |
| Val | 14,765 |
| Test | 17,258 |
| Annotation format | CVAT tags (not shapes) — smoke / fire per frame |

Data sources within P23

| Source | Type | Contribution |
|---|---|---|
| ForestFireSmoke | Wildfire dataset | Large general smoke/fire |
| Azimjaan FireSmoke | Web-scraped | Train + Test |
| Zenodo Indoor | Indoor scenes | Train + Val + Test |
| SK (東海/大甲) | Field cameras | 76 tasks, 4,005 frames — domain-specific factory smoke |
| Other (Vietnam, misc) | Various | Diverse augmentation |

Hyperparameters

img_size = 224
batch_size = 96
epochs = 15
lr = 5e-4
weight_decay = 0.05
drop_rate = 0.3
mixup_alpha = 0.2
patience = 4

augmentation:
  - Resize(256, 256) + RandomCrop(224)
  - RandomHorizontalFlip + RandomVerticalFlip(p=0.2)
  - ColorJitter(0.4) p=0.8
  - GaussianBlur(kernel=5, sigma=0.1-2.0) p=0.2
  - RandomGrayscale(p=0.05)
  - RandomRotation(15) p=0.3
  - RandomErasing(p=0.25, scale=(0.02, 0.15))

loss = BCEWithLogitsLoss(pos_weight=[neg/pos for smoke, fire])
# Computed from training data: ~[0.64, 1.34]

optimizer = AdamW
scheduler = OneCycleLR(max_lr=5e-4, pct_start=0.1)
amp_dtype = float16
memory_format = channels_last
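The pos_weight values follow directly from the Train split breakdown in the data table (smoke positives = smoke_only + smoke+fire, and likewise for fire):

```python
import torch

# Train split: 179,001 frames (smoke_only 95,676 / fire_only 63,026 /
# smoke+fire 13,582 / negative 6,717), per the data table above.
n_train = 179_001
smoke_pos = 95_676 + 13_582  # smoke_only + smoke+fire
fire_pos = 63_026 + 13_582   # fire_only + smoke+fire
pos_weight = torch.tensor([(n_train - smoke_pos) / smoke_pos,
                           (n_train - fire_pos) / fire_pos])
print([round(w, 2) for w in pos_weight.tolist()])  # -> [0.64, 1.34]
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```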

Inference thresholds

| Scenario | Smoke threshold | Fire threshold |
|---|---|---|
| Full-dataset optimal | 0.004 | 0.057 |
| SK field recommended | 0.32 (old) / 0.029 (new) | 0.057 |
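At inference, the two sigmoid probabilities are thresholded independently (multi-label); a sketch using the full-dataset optimal values:

```python
import torch

THRESHOLDS = torch.tensor([0.004, 0.057])  # [smoke, fire], full-dataset optimal

def predict(logits):
    """(B, 2) raw logits -> (B, 2) boolean [smoke, fire] flags."""
    probs = torch.sigmoid(logits)  # per-class probabilities, NOT softmax
    return probs > THRESHOLDS

print(predict(torch.tensor([[0.0, 0.0]])).tolist())  # -> [[True, True]]
```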

7. CVAT Data Export

CVAT Server: https://raicvat.intemotech.com
Auth: POST /api/auth/login with username/password → get Token
Use Authorization: Token xxx header for all API calls
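A minimal login helper with requests (a sketch; the `key` field name follows CVAT's dj-rest-auth login response and may differ on other server versions):

```python
import requests

CVAT = "https://raicvat.intemotech.com"

def get_session(username, password):
    """Log in once, return a Session with the Token header preset."""
    r = requests.post(f"{CVAT}/api/auth/login",
                      json={"username": username, "password": password})
    r.raise_for_status()
    s = requests.Session()
    s.headers["Authorization"] = f"Token {r.json()['key']}"  # assumed field name
    return s
```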

Fast bulk download (recommended for large datasets)

Method 1: CVAT chunk API (per task ZIP)
GET /api/tasks/{task_id}/data?type=chunk&number={N}&quality=original
→ Returns ZIP of ~500 frames per chunk

Method 2: Direct from CVAT server filesystem (fastest)
  Location: /var/lib/docker/volumes/cvat_cvat_data/_data/data/{data_id}/raw/
  Each task has a data_id (different from task_id)
  Get via Django ORM: Task.objects.filter(project_id=N).values('id','data_id')
  tar + stream over SSH for 100x speedup vs API

Method 3: Per-frame API (slow, 500 img/min max)
GET /api/jobs/{job_id}/data?type=frame&number={N}&quality=original

Fetching annotations

# Get all tasks in a project
GET /api/tasks?project_id=N&page_size=100

# Per-task annotations (returns tags + shapes + tracks)
GET /api/jobs/{job_id}/annotations

# Get label definitions
GET /api/labels?project_id=N
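Listing every task in a project means following DRF-style pagination; a sketch on top of the `GET /api/tasks` endpoint above (assumes an authenticated session object):

```python
def list_tasks(session, base, project_id):
    """Collect all tasks in a project, following CVAT's paginated 'next' links."""
    tasks, url = [], f"{base}/api/tasks?project_id={project_id}&page_size=100"
    while url:
        page = session.get(url).json()
        tasks.extend(page["results"])
        url = page.get("next")  # None on the last page
    return tasks
```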

Annotation types encountered

| Project | Shape types | Notes |
|---|---|---|
| P7 person | polygon + rectangle (mixed) | Different annotators used different tools |
| P8 car_flow | polygon + rectangle (mixed) | Same |
| P23 fire_smoke | tags (frame-level labels) | Not shapes |
| P24 vehicle_seg | polygon (mostly) + mask (496 cases) | Fixed via mask→polygon conversion |

8. Key File Paths

On gx10 (NVIDIA GB10)

~/vehicle_counting/
├── build_dataset.py          # Download images + build labels.jsonl
├── build_centroids.py        # Extract polygon centroids
├── train_counting.py         # V1 regression training
├── train_density.py          # V2 density map training
├── compare_reports.py        # V1 vs V2 evaluation
├── benchmark_throughput.py   # Speed benchmark
├── dataset/
│   ├── labels.jsonl          # For V1 (counts per image)
│   ├── centroids.jsonl       # For V2 (points per image)
│   └── images/{subset}/job_{jid}_frame_{fn}.jpg
└── runs/
    ├── 20260412_081835/best_full.pt          # V1 (90MB)
    └── 20260412_094521_density/best_full.pt  # V2 (102MB)

~/person_counting/
├── person_build_dataset.py
├── train_person_density.py
├── eval_person.py
├── dataset/
│   ├── centroids.jsonl
│   └── images/{subset}/job_{jid}_frame_{fn}.jpg
└── runs/20260412_222716/best_full.pt         # Person (102MB)

On 5090-2 (RTX 5090)

~/fire_smoke/
├── fire_smoke_export.py       # CVAT API per-frame export (slow)
├── fire_smoke_export_parallel.py  # Parallel chunk (medium)
├── build_manifest_from_tar.py # Manifest matching tar structure
├── train_fire_smoke.py        # Original (gx10 paths)
├── train_fire_smoke_local.py  # 5090-2 adapted
├── eval_sk.py                 # SK overall test
├── eval_sk_perchannel.py      # Per-channel + Grad-CAM
├── dataset/
│   ├── manifest.csv            # path,smoke,fire,split,task_id,data_id
│   └── images/{data_id}/raw/frame_NNNNNN.{jpg|png}
└── runs/
    └── fire_smoke_v20260413/
        ├── best.pt            # state_dict (17MB)
        ├── best_full.pt       # full model (16MB)
        └── summary.json

CVAT Server (AWS t3.xlarge, Tokyo)

# SSH: ssh -i ~/.ssh/cvat-tokyo-key.pem ubuntu@57.183.9.144
# Data volume: /var/lib/docker/volumes/cvat_cvat_data/_data/data/{data_id}/raw/
# Django ORM access:
docker exec cvat_server python3 -c "
import django, os
os.environ['DJANGO_SETTINGS_MODULE'] = 'cvat.settings.production'
django.setup()
from cvat.apps.engine.models import Task
# ..."

Pre-trained models (for reuse)

| Model | URL |
|---|---|
| Vehicle V1 (90MB fp32) | download |
| Vehicle V2 (51MB fp16) | download |
| Person (51MB fp16) | download |
| Fire/Smoke v20260413 (16MB) | download |

Related Reports

| Report | Link |
|---|---|
| All models guide | rai-models-guide |
| Vehicle V1 vs V2 | vehicle-counting-report |
| Car Flow cross-domain | car-flow-eval |
| Person counting | person-counting-report |
| Fire/Smoke SK per-channel | fire-smoke-sk-eval |
| Vehicle model optimization options | vehicle-model-options |

Reproduction Checklist

  1. Setup Python 3.12 env with torch 2.11 + timm + scikit-learn + opencv-python-headless + scipy
  2. Get CVAT API credentials (token via POST /api/auth/login)
  3. Export data from CVAT using chunk API or direct server filesystem access
  4. Build manifests — match path format to download method (task_id/filename OR data_id/raw/filename)
  5. For density map models: extract polygon centroids, configure sigma (4.0 for vehicles, 2.5 for persons)
  6. For classification: compute pos_weight from training label distribution
  7. Train with AMP bf16/fp16 + channels_last + OneCycleLR + AdamW
  8. Save as full model (torch.save(model, ...)) for easy loading without class definition
  9. Evaluate with sklearn average_precision_score (classification) or MAE/RMSE (counting)
  10. Consider TensorRT fp16/INT8 for deployment (2-5x speedup)

Generated 2026-04-14 | Trained on NVIDIA GB10 (gx10) and RTX 5090 (5090-2) | CVAT: raicvat.intemotech.com