CVPR 2026 Workshop DriveX · Archival Track · Poster

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Rui Gan1, Junyi Ma1, Pei Li2, Xingyou Yang1, Kai Chen3, Sikai Chen1, Bin Ran1

1University of Wisconsin–Madison   2University of Wyoming   3Columbia University

Overview

[Figure: CrashSight overview]

Abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving.

Highlights

  • First Infrastructure-Side Crash VQA. 250 expert-annotated surveillance clips with phase-aware dense captions and 13K multiple-choice QA pairs across 7 categories.
  • 4-Stage Annotation Pipeline. VLM-assisted drafting → human expert refinement → LLM-driven VQA generation → verification & augmentation.
  • +16.1 Points via Fine-Tuning. Domain-specific fine-tuning yields substantial gains; a 3B fine-tuned model surpasses all 8B zero-shot baselines.
  • Systematic Error Taxonomy. Transition analysis traces persistent failures to visual token budget, frozen encoder, and pretraining distribution mismatch.

Benchmark Construction Pipeline

We develop a scalable 4-stage pipeline that transforms raw surveillance footage into a structured VQA benchmark. The pipeline combines VLM-assisted draft captioning with explicit phase boundaries, human expert refinement using a standardized correction template, LLM-driven QA generation with counterfactual distractors, and a final verification and augmentation pass. Approximately 90% of VLM drafts require substantial human correction, underscoring the necessity of expert oversight in safety-critical annotation.
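The four stages form a simple sequential data flow per clip; a minimal sketch in Python, where the stage functions are hypothetical stand-ins for the paper's actual tooling:

```python
# Hypothetical sketch of the 4-stage CrashSight annotation pipeline.
# Stage names follow the paper; the function arguments are stand-ins,
# not the authors' implementation.

def build_benchmark(clips, draft_caption, expert_refine, generate_qa, verify):
    """Run each clip through the four annotation stages and collect QA pairs."""
    qa_pairs = []
    for clip in clips:
        draft = draft_caption(clip)          # Stage 1: VLM-assisted draft with phase boundaries
        caption = expert_refine(draft)       # Stage 2: human expert correction via template
        candidates = generate_qa(caption)    # Stage 3: LLM-driven QA with counterfactual distractors
        qa_pairs.extend(verify(candidates))  # Stage 4: verification & augmentation
    return qa_pairs
```

The explicit stage boundaries make the ~90% human-correction rate easy to measure: one simply logs how often Stage 2 alters the Stage 1 draft.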

[Figure: 4-stage CrashSight annotation pipeline]

QA Taxonomy

CrashSight organizes 13K QA pairs into a two-tier, seven-category taxonomy. Tier 1 (Crash Understanding) covers scene identification, involved parties, and post-crash outcomes through phase-local recognition. Tier 2 (Crash Reasoning) requires cross-phase temporal integration and causal inference for crash mechanics, fault determination, and temporal sequence tasks. A dedicated robustness category probes hallucination resistance with four distinct question types.
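The two-tier structure can be written down as a plain data structure; the grouping below follows the categories named in the text, while the dictionary layout itself is illustrative:

```python
# The seven CrashSight categories, grouped by tier (category names from the
# text; this dictionary layout is illustrative, not an official schema).
TAXONOMY = {
    "Tier 1: Crash Understanding": [
        "Scene Identification",
        "Involved Parties",
        "Post-Crash Outcomes",
    ],
    "Tier 2: Crash Reasoning": [
        "Crash Mechanics",
        "Fault Determination",
        "Temporal Sequence",
    ],
    "Robustness": [
        "Hallucination Resistance",  # probed via four distinct question types
    ],
}
assert sum(len(v) for v in TAXONOMY.values()) == 7  # seven-category taxonomy
```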

[Figure: two-tier QA taxonomy]

Video Demos

Example crash clips from CrashSight, showing phase-aware temporal decomposition from infrastructure cameras.

Side-impact collision at an urban intersection.

Rear-end collision on a multi-lane roadway.

Crash involving a vulnerable road user.

Multi-vehicle incident requiring temporal sequence reasoning.

Dataset Statistics

The benchmark contains 13,016 QA pairs with approximately uniform answer position distribution (23.3–27.8%) after option shuffling, eliminating position bias. Involved Parties (IP) is the largest category at 28.0% of all questions, reflecting the complexity of entity identification. Side-impact and T-bone collisions are the most prevalent accident types, consistent with the composition of the TAD source corpus.
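Option shuffling of this kind is straightforward to verify empirically; a sketch, assuming four options per question (the helper below is not the authors' code):

```python
import random

def shuffle_options(options, answer_idx, rng):
    """Shuffle MCQ options; return the new option order and new answer position."""
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(answer_idx)

# Empirical check: over 13,016 questions the correct answer should land in
# each of the four positions roughly 25% of the time.
rng = random.Random(0)
counts = [0] * 4
for _ in range(13016):
    _, pos = shuffle_options(["w", "x", "y", "z"], 0, rng)
    counts[pos] += 1
shares = [c / sum(counts) for c in counts]  # each share close to 0.25
```

The observed 23.3–27.8% spread in the benchmark is consistent with this kind of uniform shuffle at 13K samples.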

[Figure: dataset statistics]

Key Results

We benchmark 8 VLM configurations across four model families. Domain-specific fine-tuning yields up to a +16.1-point gain in average accuracy, with a fine-tuned 3B model (74.7%) surpassing all zero-shot baselines, including InternVL3-8B (68.7%). A persistent human–AI gap of 18.3 points remains, concentrated in visually demanding categories such as Involved Parties and Crash Mechanics. Best model result per column in bold.

Column abbreviations: SI = Scene Identification, IP = Involved Parties, CM = Crash Mechanics, FD = Fault Determination, PCO = Post-Crash Outcomes, TS = Temporal Sequence, Rob = Robustness, AVG = average accuracy.

| Model | Size | SI | IP | CM | FD | PCO | TS | Rob | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Zero-Shot Models** | | | | | | | | | |
| LLaVA-OneVision | 0.5B | 59.4 | 36.0 | 36.7 | 50.6 | 48.0 | 52.4 | 18.9 | 41.5 |
| LLaVA-NeXT-Video | 7B | 75.5 | 52.3 | 51.6 | 58.6 | 57.1 | 42.9 | 66.1 | 58.6 |
| Qwen2.5-VL | 3B | 66.8 | 50.6 | 52.0 | 54.5 | 62.8 | 64.3 | 71.1 | 58.6 |
| Qwen2.5-VL | 7B | 67.7 | 51.9 | 55.4 | 66.7 | 74.2 | 66.7 | 73.3 | 62.9 |
| InternVL3 | 2B | 71.8 | 52.1 | 59.2 | 70.1 | 71.2 | 81.0 | 71.1 | 64.2 |
| InternVL3 | 8B | 72.4 | 58.1 | 61.1 | 71.3 | 80.3 | **85.7** | 82.2 | 68.7 |
| **Fine-Tuned Models (Ours)** | | | | | | | | | |
| Qwen2.5-VL (FT) | 3B | **84.0** | 61.6 | **69.3** | 71.3 | 78.8 | 76.0 | 97.2 | 74.7 |
| Qwen2.5-VL (FT) | 7B | 80.6 | **63.2** | 68.7 | **83.9** | **84.3** | 76.2 | **97.8** | **76.4** |
| Human Expert | – | 95.1 | 94.7 | 93.8 | 94.5 | 95.1 | 94.8 | 99.2 | 94.7 |
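Accuracies such as those above are typically obtained by parsing the chosen option letter out of each model's free-form response and comparing it against the gold label. A minimal sketch (the parsing heuristic is an assumption, not the paper's exact protocol):

```python
import re

def extract_choice(response, valid="ABCD"):
    """Pull the first standalone option letter out of a free-form response."""
    m = re.search(rf"\b([{valid}])\b", response.strip())
    return m.group(1) if m else None

def accuracy(preds, golds):
    """Fraction of questions where the parsed choice matches the gold letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(extract_choice("The correct answer is B, a rear-end collision."))  # → B
```

Per-category scores (SI, IP, etc.) follow by grouping questions by taxonomy category before averaging.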

Qualitative Analysis

We identify two dominant persistent failure modes that fine-tuning cannot resolve. Temporal reasoning failures arise when sparse uniform frame sampling omits short but causally decisive pre-crash interactions, causing incomplete event reconstruction. Spatial grounding failures occur when bounded pixel resolution and a frozen visual encoder prevent the model from discriminating fine-grained entity details under oblique surveillance viewpoints.
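The first failure mode has a simple arithmetic core: with a fixed frame budget, uniform sampling can step right over a short event. A toy check (clip length, budget, and event frames are illustrative numbers, not from the paper):

```python
def uniform_sample_hits(num_frames, budget, event_start, event_end):
    """Return True if uniform sampling picks any frame inside [event_start, event_end]."""
    step = num_frames / budget
    sampled = [int(i * step) for i in range(budget)]
    return any(event_start <= f <= event_end for f in sampled)

# A 10 s clip at 30 fps with a 16-frame budget samples roughly every 0.6 s,
# so a ~0.3 s pre-crash interaction (frames 100-108) can fall between samples:
print(uniform_sample_hits(300, 16, 100, 108))  # → False: the 9-frame event is missed
```

Whether a causally decisive interaction survives sampling thus depends on where it falls relative to the sampling grid, which is why denser or event-aware sampling is a natural direction for improvement.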

[Figure: qualitative failure analysis]

Citation

@inproceedings{gan2026crashsight,
  title={CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark
         for Traffic Crash Scene Understanding and Reasoning},
  author={Gan, Rui and Ma, Junyi and Li, Pei and Yang, Xingyou
          and Chen, Kai and Chen, Sikai and Ran, Bin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision
             and Pattern Recognition Workshops (CVPRW)},
  year={2026}
}
© 2026 Rui Gan · University of Wisconsin–Madison