CVPR 2026 Workshop DriveX · Archival Track · Poster

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Rui Gan¹, Junyi Ma¹, Pei Li², Xingyou Yang¹, Kai Chen³, Sikai Chen¹, Bin Ran¹

¹University of Wisconsin–Madison ²University of Wyoming ³Columbia University

Overview

Abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving.

Highlights

First Infrastructure-Side Crash VQA. 250 expert-annotated surveillance clips with phase-aware dense captions and 13K multiple-choice QA pairs across 7 categories.
4-Stage Annotation Pipeline. VLM-assisted drafting → human expert refinement → LLM-driven VQA generation → verification & augmentation.
+16.1% via Fine-Tuning. Domain-specific fine-tuning yields substantial gains; a 3B fine-tuned model surpasses all 8B zero-shot baselines.
Systematic Error Taxonomy. Transition analysis traces persistent failures to visual token budget, frozen encoder, and pretraining distribution mismatch.

Benchmark Construction Pipeline

We develop a scalable 4-stage pipeline that transforms raw surveillance footage into a structured VQA benchmark. The pipeline combines VLM-assisted draft captioning with explicit phase boundaries, human expert refinement using a standardized correction template, LLM-driven QA generation with counterfactual distractors, and a final verification and augmentation pass. Approximately 90% of VLM drafts require substantial human correction, underscoring the necessity of expert oversight in safety-critical annotation.

QA Taxonomy

CrashSight organizes 13K QA pairs into a two-tier, seven-category taxonomy. Tier 1 (Crash Understanding) covers scene identification, involved parties, and post-crash outcomes through phase-local recognition. Tier 2 (Crash Reasoning) requires cross-phase temporal integration and causal inference for crash mechanics, fault determination, and temporal sequence tasks. A dedicated robustness category probes hallucination resistance with four distinct question types.

Video Demos

Example crash clips from CrashSight, showing phase-aware temporal decomposition from infrastructure cameras.

Side-impact collision at an urban intersection.

Rear-end collision on a multi-lane roadway.

Crash involving a vulnerable road user.

Multi-vehicle incident requiring temporal sequence reasoning.

Dataset Statistics

The benchmark contains 13,016 QA pairs with approximately uniform answer position distribution (23.3–27.8%) after option shuffling, eliminating position bias. Involved Parties (IP) is the largest category at 28.0% of all questions, reflecting the complexity of entity identification. Side-impact and T-bone collisions are the most prevalent accident types, consistent with the composition of the TAD source corpus.

Key Results

We benchmark 8 VLM configurations across four model families. Domain-specific fine-tuning yields up to +16.1 average accuracy improvement, with a fine-tuned 3B model (74.7%) surpassing all zero-shot baselines including InternVL3-8B (68.7%). A persistent human–AI gap of 18.3 points remains, concentrated in visually demanding categories such as Involved Parties and Crash Mechanics. Best zero-shot in underline, best overall in bold.

Model	Size	SI	IP	CM	FD	PCO	TS	Rob	AVG
Zero-Shot Models
LLaVA-OneVision	0.5B	59.4	36.0	36.7	50.6	48.0	52.4	18.9	41.5
LLaVA-NeXT-Video	7B	75.5	52.3	51.6	58.6	57.1	42.9	66.1	58.6
Qwen2.5-VL	3B	66.8	50.6	52.0	54.5	62.8	64.3	71.1	58.6
Qwen2.5-VL	7B	67.7	51.9	55.4	66.7	74.2	66.7	73.3	62.9
InternVL3	2B	71.8	52.1	59.2	70.1	71.2	81.0	71.1	64.2
InternVL3	8B	72.4	58.1	61.1	71.3	80.3	85.7	82.2	68.7
Fine-Tuned Models (Ours)
Qwen2.5-VL (FT)	3B	84.0	61.6	69.3	71.3	78.8	76.0	97.2	74.7
Qwen2.5-VL (FT)	7B	80.6	63.2	68.7	83.9	84.3	76.2	97.8	76.4
Human Expert	–	95.1	94.7	93.8	94.5	95.1	94.8	99.2	94.7

Qualitative Analysis

We identify two dominant persistent failure modes that fine-tuning cannot resolve. Temporal reasoning failures arise when sparse uniform frame sampling omits short but causally decisive pre-crash interactions, causing incomplete event reconstruction. Spatial grounding failures occur when bounded pixel resolution and a frozen visual encoder prevent the model from discriminating fine-grained entity details under oblique surveillance viewpoints.

Citation

@inproceedings{gan2026crashsight,
  title={CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark
         for Traffic Crash Scene Understanding and Reasoning},
  author={Gan, Rui and Ma, Junyi and Li, Pei and Yang, Xingyou
          and Chen, Kai and Chen, Sikai and Ran, Bin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision
             and Pattern Recognition Workshops (CVPRW)},
  year={2026}
}