RAI Models - Reproduction Guide
Complete training methodology, data sources, hyperparameters, and paths for reproducing all 4 models
1. Overview
| Model | Task | Data Source | Backbone | Machine | Metric |
| --- | --- | --- | --- | --- | --- |
| Vehicle V1 | 5-class regression counting | CVAT Project 24 (TSP6K + COCO) | ResNet50 | gx10 (GB10) | MAE 1.744 |
| Vehicle V2 | 5-class density map counting | CVAT Project 24 (same) | ResNet50 + FPN | gx10 (GB10) | MAE 1.523 |
| Person | Person density map | CVAT Project 7 (Mapillary + diverse) | ResNet50 + FPN | gx10 (GB10) | MAE 1.798 |
| Fire/Smoke | Multi-label classification | CVAT Project 23 (5 sources) | MobileNetV3-Large | 5090-2 (RTX 5090) | mAP 0.9856 |
2. Environment Setup
Python & CUDA
Python 3.12
CUDA 13.0 (cu128 PyTorch wheels work on CUDA 13)
torch==2.11.0
torchvision==0.26.0
Required packages
pip install torch==2.11.0 torchvision==0.26.0 \
timm==1.0.26 \
scikit-learn \
opencv-python-headless \
scipy \
pillow \
numpy \
requests \
tqdm
Hardware used
| Machine | GPU | Used for |
| --- | --- | --- |
| gx10 | NVIDIA GB10 (128GB unified) | Vehicle V1, V2, Person |
| 5090-2 | 2× RTX 5090 (32GB each) | Fire/Smoke v20260413 |
3. Vehicle V1 — Direct Regression
Architecture
import torch
from torch import nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 5)  # 5 classes
# Output: (B, 5) raw scalars in log1p space
# Loss: SmoothL1Loss on log1p(counts)
# Inference: torch.expm1(output).clamp(min=0)
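Since targets live in log1p space, the decode must invert the encode exactly; a quick round-trip check:

```python
import torch

counts = torch.tensor([[3.0, 0.0, 1.0, 7.0, 2.0]])  # GT counts for one image
targets = torch.log1p(counts)                        # regression targets
decoded = torch.expm1(targets).clamp(min=0)          # inference-time decode
```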
Data
| Source | CVAT Project 24 vehicle_seg |
| --- | --- |
| Classes | car, truck, bus, motorcycle, bicycle |
| Train | 22,758 images (46 tasks) |
| Val | 2,077 images (5 tasks) |
| Test | 1,794 images (4 tasks, unlabeled, do not use) |
| Sources within P24 | TSP6K (13 tasks, 6,000 frames) + COCO (42 tasks, 20,629 frames) |
| Annotation format | Polygon (count shapes per class per frame) |
Hyperparameters
img_size = 384
batch_size = 32
epochs = 25 # (best was @ epoch 15)
lr = 1e-4
weight_decay = 1e-4
optimizer = AdamW
scheduler = OneCycleLR(max_lr=1e-4, pct_start=0.1, anneal_strategy="cos")
augmentation = RandomResizedCrop(384, scale=(0.7, 1.0)) + HFlip + ColorJitter(0.2)
loss = SmoothL1Loss on log1p(counts)
amp_dtype = bfloat16
memory_format = channels_last
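The recipe above (channels_last inputs, bf16 autocast, SmoothL1 on log1p counts, per-batch OneCycleLR stepping) can be sketched as a single training epoch. `train_one_epoch` is a hypothetical helper, not taken from `train_counting.py`:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, scheduler, device="cuda"):
    """One V1 epoch: channels_last inputs, bf16 autocast, SmoothL1 on log1p counts."""
    criterion = nn.SmoothL1Loss()
    model.train()
    for images, counts in loader:
        images = images.to(device, memory_format=torch.channels_last)
        targets = torch.log1p(counts.to(device))  # regress in log1p space
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = criterion(model(images), targets)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch
    return loss.item()
```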
4. Vehicle V2 — Density Map
Architecture (ResNet50 + FPN Decoder)
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class CountingDecoder(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.lat4 = nn.Conv2d(2048, 256, 1)  # lateral 1x1 projections
        self.lat3 = nn.Conv2d(1024, 256, 1)
        self.lat2 = nn.Conv2d(512, 256, 1)
        self.lat1 = nn.Conv2d(256, 256, 1)
        # smooth: one Conv3x3 + ReLU per merged level
        self.smooth = nn.ModuleList(
            nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(3)
        )
        # head: Conv(256,128) + Conv(128,64) + Conv(64, num_classes)
        self.head = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        c1 = self.layer1(self.stem(x))  # 1/4
        c2 = self.layer2(c1)            # 1/8
        c3 = self.layer3(c2)            # 1/16
        c4 = self.layer4(c3)            # 1/32
        # top-down FPN: upsample + add + smooth
        p4 = self.lat4(c4)
        p3 = self.smooth[0](self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest"))
        p2 = self.smooth[1](self.lat2(c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest"))
        p1 = self.smooth[2](self.lat1(c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest"))
        return F.relu(self.head(p1))    # (B, C, H/4, W/4), non-negative
Data
Same as V1. Additional data preparation: extract polygon centroids per frame per class, save to centroids.jsonl:
{
"image": "images/Train/job_XXX_frame_N.jpg",
"subset": "Train",
"job_id": XXX,
"frame": N,
"points": {
"car": [[x, y], ...],
"truck": [[x, y], ...],
"bus": [...], "motorcycle": [...], "bicycle": [...]
}
}
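centroids.jsonl holds one JSON object per line; a minimal loader (`load_centroids` is a hypothetical helper, not one of the training scripts):

```python
import json

def load_centroids(path):
    # one record per line (JSON Lines)
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```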
GT Density Map Generation
import numpy as np
from scipy.ndimage import gaussian_filter

CLASSES = ["car", "truck", "bus", "motorcycle", "bicycle"]

def make_gt_density(points, map_h, map_w, img_w, img_h, sigma=4.0):
density = np.zeros((5, map_h, map_w), dtype=np.float32)
sx, sy = map_w / img_w, map_h / img_h
for ci, cname in enumerate(CLASSES):
for px, py in points.get(cname, []):
cx, cy = int(round(px * sx)), int(round(py * sy))
if 0 <= cx < map_w and 0 <= cy < map_h:
density[ci, cy, cx] += 1.0
if density[ci].sum() > 0:
density[ci] = gaussian_filter(density[ci], sigma=sigma)
return density
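A sanity check on the generator above: Gaussian smoothing redistributes mass but, away from the map borders, preserves the per-class total, so summing a GT map recovers the count. A minimal check with one centroid:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

density = np.zeros((96, 96), dtype=np.float32)
density[48, 48] = 1.0                       # one centroid at the map center
smoothed = gaussian_filter(density, sigma=4.0)
# smoothed.sum() stays ~1.0: the count is preserved
```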
Hyperparameters
img_size = 384
map_stride = 4 # output is 96x96
sigma = 4.0 # Gaussian kernel sigma (vehicles are relatively large)
batch_size = 24
epochs = 25 # (best was @ epoch 19)
lr = 1e-4
loss = MSE(pred_map, gt_map) * 1000 + 0.1 * L1(pred.sum(), gt.sum())
# density_scale = 1000 (MSE on density values which are small)
# count_weight = 0.1 (auxiliary count consistency)
# Person model: same but num_classes=1, sigma=2.5 (people are smaller)
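The composite loss above as one hedged sketch (`density_loss` is illustrative; the names `density_scale` and `count_weight` mirror the comments):

```python
import torch
import torch.nn.functional as F

def density_loss(pred, gt, density_scale=1000.0, count_weight=0.1):
    # MSE on the small-valued density maps, scaled for usable gradient magnitude
    map_loss = F.mse_loss(pred, gt) * density_scale
    # auxiliary L1 between predicted and GT per-class total counts
    count_loss = F.l1_loss(pred.sum(dim=(2, 3)), gt.sum(dim=(2, 3)))
    return map_loss + count_weight * count_loss
```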
5. Person — Density Map
Identical architecture to Vehicle V2, but:
- num_classes=1 (person only)
- sigma=2.5 (people are smaller than vehicles)
- max_epochs=50 with patience=5 early stopping
- per-frame progress report every 5 epochs
Data
| Source | CVAT Project 7 person |
| --- | --- |
| Train | 13,393 frames (202 tasks) |
| Val | 871 frames (78 tasks) — actually 2,031 per centroids.jsonl |
| Test | 4,091 frames (233 tasks) |
| Sources within P7 | Mapillary, hotel/airport surveillance, factory cleanrooms, reservoirs/bridges, Taipower, diverse field sites |
| Annotation format | MIXED: rectangle + polygon |
Centroid extraction (handles both shape types)
import numpy as np

def centroid(shape):
    # CVAT stores points as a flat list [x1, y1, x2, y2, ...]
    pts = shape["points"]
    if shape["type"] == "polygon":
        return (np.mean(pts[0::2]), np.mean(pts[1::2]))        # mean of vertices
    elif shape["type"] == "rectangle":
        return ((pts[0] + pts[2]) / 2, (pts[1] + pts[3]) / 2)  # box center
    return None, None  # unsupported shape type
6. Fire/Smoke — Multi-Label Classification
Architecture
import timm
model = timm.create_model(
"mobilenetv3_large_100.ra_in1k",
pretrained=True,
num_classes=2, # [smoke, fire] multi-label
drop_rate=0.3,
)
# Output: (B, 2) raw logits
# Apply torch.sigmoid() for probabilities (NOT softmax - it's multi-label)
# Loss: BCEWithLogitsLoss with pos_weight for class imbalance
Data
| Source | CVAT Project 23 fire_smoke_tags |
| --- | --- |
| Total | 211,024 frames across 302 tasks |
| Train | 179,001 (smoke_only:95,676 / fire_only:63,026 / smoke+fire:13,582 / negative:6,717) |
| Val | 14,765 |
| Test | 17,258 |
| Annotation format | CVAT tags (not shapes) — smoke / fire per frame |
Data sources within P23
| Source | Type | Contribution |
| --- | --- | --- |
| ForestFireSmoke | Wildfire dataset | Large general smoke/fire |
| Azimjaan FireSmoke | Web-scraped | Train + Test |
| Zenodo Indoor | Indoor scenes | Train + Val + Test |
| SK (東海/大甲) | Field cameras | 76 tasks, 4,005 frames — domain-specific factory smoke |
| Others (Vietnam, misc) | Various | Diverse augmentation |
Hyperparameters
img_size = 224
batch_size = 96
epochs = 15
lr = 5e-4
weight_decay = 0.05
drop_rate = 0.3
mixup_alpha = 0.2
patience = 4
augmentation:
- Resize(256, 256) + RandomCrop(224)
- RandomHorizontalFlip + RandomVerticalFlip(p=0.2)
- ColorJitter(0.4) p=0.8
- GaussianBlur(kernel=5, sigma=0.1-2.0) p=0.2
- RandomGrayscale(p=0.05)
- RandomRotation(15) p=0.3
- RandomErasing(p=0.25, scale=(0.02, 0.15))
loss = BCEWithLogitsLoss(pos_weight=[neg/pos for smoke, fire])
# Computed from training data: ~[0.64, 1.34]
optimizer = AdamW
scheduler = OneCycleLR(max_lr=5e-4, pct_start=0.1)
amp_dtype = float16
memory_format = channels_last
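The pos_weight values can be rederived from the split counts in the Data table above (neg/pos per label, counting smoke+fire frames as positives for both):

```python
import torch

total = 179_001              # training frames
smoke_pos = 95_676 + 13_582  # smoke_only + smoke+fire
fire_pos = 63_026 + 13_582   # fire_only + smoke+fire
pos_weight = torch.tensor([
    (total - smoke_pos) / smoke_pos,  # smoke: ~0.64
    (total - fire_pos) / fire_pos,    # fire:  ~1.34
])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```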
Inference thresholds
| Scenario | Smoke threshold | Fire threshold |
| --- | --- | --- |
| Full-dataset optimal | 0.004 | 0.057 |
| SK field recommended | 0.32 (old) / 0.029 (new) | 0.057 |
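Applying these thresholds at inference is per-label sigmoid followed by independent comparisons (a sketch using the full-dataset optimal values; `predict` is a hypothetical helper):

```python
import torch

THRESHOLDS = torch.tensor([0.004, 0.057])  # [smoke, fire], full-dataset optimal

def predict(logits, thresholds=THRESHOLDS):
    # sigmoid per label, then independent thresholds (multi-label, no softmax)
    probs = torch.sigmoid(logits)
    return probs > thresholds

flags = predict(torch.tensor([[-6.0, 3.0]]))  # -> [[False, True]]: fire only
```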
7. CVAT Data Export
CVAT Server: https://raicvat.intemotech.com
Auth: POST /api/auth/login with username/password → get Token
Use Authorization: Token xxx header for all API calls
Fast bulk download (recommended for large datasets)
Method 1: CVAT chunk API (per task ZIP)
GET /api/tasks/{task_id}/data?type=chunk&number={N}&quality=original
→ Returns ZIP of ~500 frames per chunk
Method 2: Direct from CVAT server filesystem (fastest)
Location: /var/lib/docker/volumes/cvat_cvat_data/_data/data/{data_id}/raw/
Each task has a data_id (different from task_id)
Get via Django ORM: Task.objects.filter(project_id=N).values('id','data_id')
tar + stream over SSH for 100x speedup vs API
Method 3: Per-frame API (slow, 500 img/min max)
GET /api/jobs/{job_id}/data?type=frame&number={N}&quality=original
Fetching annotations
# Get all tasks in a project
GET /api/tasks?project_id=N&page_size=100
# Per-task annotations (returns tags + shapes + tracks)
GET /api/jobs/{job_id}/annotations
# Get label definitions
GET /api/labels?project_id=N
Annotation types encountered
| Project | Shape types | Notes |
| --- | --- | --- |
| P7 person | polygon + rectangle (mixed) | Different annotators used different tools |
| P8 car_flow | polygon + rectangle (mixed) | Same |
| P23 fire_smoke | tags (frame-level labels) | Not shapes |
| P24 vehicle_seg | polygon (mostly) + mask (496 cases) | Fixed via mask→polygon conversion |
8. Key File Paths
On gx10 (NVIDIA GB10)
~/vehicle_counting/
├── build_dataset.py # Download images + build labels.jsonl
├── build_centroids.py # Extract polygon centroids
├── train_counting.py # V1 regression training
├── train_density.py # V2 density map training
├── compare_reports.py # V1 vs V2 evaluation
├── benchmark_throughput.py # Speed benchmark
├── dataset/
│ ├── labels.jsonl # For V1 (counts per image)
│ ├── centroids.jsonl # For V2 (points per image)
│ └── images/{subset}/job_{jid}_frame_{fn}.jpg
└── runs/
├── 20260412_081835/best_full.pt # V1 (90MB)
└── 20260412_094521_density/best_full.pt # V2 (102MB)
~/person_counting/
├── person_build_dataset.py
├── train_person_density.py
├── eval_person.py
├── dataset/
│ ├── centroids.jsonl
│ └── images/{subset}/job_{jid}_frame_{fn}.jpg
└── runs/20260412_222716/best_full.pt # Person (102MB)
On 5090-2 (RTX 5090)
~/fire_smoke/
├── fire_smoke_export.py # CVAT API per-frame export (slow)
├── fire_smoke_export_parallel.py # Parallel chunk (medium)
├── build_manifest_from_tar.py # Manifest matching tar structure
├── train_fire_smoke.py # Original (gx10 paths)
├── train_fire_smoke_local.py # 5090-2 adapted
├── eval_sk.py # SK overall test
├── eval_sk_perchannel.py # Per-channel + Grad-CAM
├── dataset/
│ ├── manifest.csv # path,smoke,fire,split,task_id,data_id
│ └── images/{data_id}/raw/frame_NNNNNN.{jpg|png}
└── runs/
└── fire_smoke_v20260413/
├── best.pt # state_dict (17MB)
├── best_full.pt # full model (16MB)
└── summary.json
CVAT Server (AWS t3.xlarge, Tokyo)
# SSH: ssh -i ~/.ssh/cvat-tokyo-key.pem ubuntu@57.183.9.144
# Data volume: /var/lib/docker/volumes/cvat_cvat_data/_data/data/{data_id}/raw/
# Django ORM access:
docker exec cvat_server python3 -c "
import django, os
os.environ['DJANGO_SETTINGS_MODULE'] = 'cvat.settings.production'
django.setup()
from cvat.apps.engine.models import Task
# ..."
Pre-trained models (for reuse)
Related Reports
Reproduction Checklist
- Set up a Python 3.12 env with torch 2.11 + timm + scikit-learn + opencv-python-headless + scipy
- Get CVAT API credentials (token via POST /api/auth/login)
- Export data from CVAT using chunk API or direct server filesystem access
- Build manifests — match path format to download method (task_id/filename OR data_id/raw/filename)
- For density map models: extract polygon centroids, configure sigma (4.0 for vehicles, 2.5 for persons)
- For classification: compute pos_weight from training label distribution
- Train with AMP bf16/fp16 + channels_last + OneCycleLR + AdamW
- Save as full model (torch.save(model, ...)) for easy loading without the class definition
- Evaluate with sklearn average_precision_score (classification) or MAE/RMSE (counting)
- Consider TensorRT fp16/INT8 for deployment (2-5x speedup)
Generated 2026-04-14 | Trained on NVIDIA GB10 (gx10) and RTX 5090 (5090-2) | CVAT: raicvat.intemotech.com