Abstract
TL;DR: EgoNight is the first benchmark for nighttime egocentric vision, with day–night aligned videos and 3,658 VQA pairs. SOTA MLLMs struggle to generalize from day to night.
Existing egocentric vision benchmarks focus on daytime, overlooking low-light conditions common in real-world settings. We introduce EgoNight, the first benchmark for nighttime egocentric vision, centered on visual question answering (VQA). EgoNight features day–night aligned videos, combining synthetic (Blender) and real-world data to ensure aligned scenes and actions, improving annotation quality and enabling direct comparison across lighting conditions. EgoNight-VQA includes 3,658 QA pairs from 90 videos across 12 QA types. Evaluations show significant performance drops for state-of-the-art MLLMs when moving from day to night.
Dataset
EgoNight comprises two subsets with day–night aligned videos and one subset with night-only videos:
EgoNight-Sofia
Real-world indoor/outdoor egocentric recordings
EgoNight-Synthetic
Blender-rendered synthetic scenes with Infinigen
EgoNight-Oxford
Night-only video sequences from Oxford-day-and-night
Main Pipeline
EgoNight main pipeline.
Statistics
Dataset statistics.
Benchmark & Question Types
EgoNight-VQA covers 12 question types for fine-grained evaluation:
- Object Recognition
- Text Recognition
- Spatial Reasoning
- Scene Sequence
- Navigation
- Counting of Static
- Action Recognition
- Non-Common-Sense Reasoning
- Lighting Recognition
- Lighting Dynamic
- Dynamic Detection
- Counting of Dynamic
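Evaluation over these question types reduces to per-type exact-match accuracy on the QA pairs. The sketch below illustrates this with a hypothetical record layout (the field names `video`, `qa_type`, `answer`, and `prediction` are illustrative, not the official EgoNight schema):

```python
from collections import defaultdict

# Hypothetical EgoNight-VQA records; field names are assumptions.
qa_pairs = [
    {"video": "sofia_001", "qa_type": "Object Recognition",
     "question": "What object is on the table?",
     "answer": "lamp", "prediction": "lamp"},
    {"video": "sofia_001", "qa_type": "Counting of Static",
     "question": "How many chairs are visible?",
     "answer": "3", "prediction": "2"},
]

def per_type_accuracy(records):
    """Exact-match accuracy, broken down by QA type."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["qa_type"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            hits[r["qa_type"]] += 1
    return {t: hits[t] / totals[t] for t in totals}

print(per_type_accuracy(qa_pairs))
# → {'Object Recognition': 1.0, 'Counting of Static': 0.0}
```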
QA Examples
Auxiliary Tasks
Depth Estimation & Day–Night Retrieval
Beyond VQA, EgoNight introduces two auxiliary tasks: egocentric depth estimation at night and day–night correspondence retrieval. Depth is evaluated only on EgoNight-Synthetic, where ground truth is available from the Blender-rendered scenes.
Depth Estimation
We estimate scene geometry from monocular egocentric frames by predicting depth and inpainting regions occluded by the wearer, yielding a complete background depth map.
Egocentric depth estimation at night on EgoNight-Synthetic.
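Since the Blender scenes provide dense ground-truth depth, predictions can be scored with standard monocular-depth metrics. A minimal sketch of absolute relative error (AbsRel), assuming a hypothetical wearer mask that excludes pixels occluded by the camera wearer:

```python
# AbsRel over flattened depth maps; toy values, not EgoNight data.

def abs_rel(pred, gt, wearer_mask):
    """Mean absolute relative error over valid (non-wearer, gt > 0) pixels."""
    errs = [abs(p - g) / g
            for p, g, m in zip(pred, gt, wearer_mask)
            if not m and g > 0]
    return sum(errs) / len(errs)

pred = [2.0, 4.5, 1.0, 3.0]         # predicted depth (metres), flattened
gt   = [2.0, 5.0, 0.0, 3.0]         # rendered ground truth; 0 = invalid
mask = [False, False, False, True]  # True = wearer pixel, ignored
print(abs_rel(pred, gt, mask))  # → 0.05
```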
Retrieval (Spatial & Temporal)
We perform cross-condition (day–night) retrieval using features robust to illumination changes: spatial retrieval matches frames from the same location, while temporal retrieval aligns frames corresponding to the same moment in time.
Day–night correspondence retrieval qualitative results.
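The retrieval protocol can be sketched as nearest-neighbor matching under cosine similarity between illumination-robust features; the feature vectors below are toy examples, not actual EgoNight embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(night_feats, day_feats):
    """For each night frame, return the index of its best-matching day frame."""
    return [max(range(len(day_feats)), key=lambda j: cosine(q, day_feats[j]))
            for q in night_feats]

day   = [[1.0, 0.0], [0.0, 1.0]]  # features of two day frames
night = [[0.9, 0.1], [0.2, 0.8]]  # corresponding night frames
print(retrieve(night, day))  # → [0, 1]
```

Spatial retrieval compares frames by location; temporal retrieval applies the same matching but scores alignment to the same moment in time.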
Results
State-of-the-art MLLMs show substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions.
Leaderboard on EgoNight-VQA
Accuracies (%) of OpenQA results across three datasets and three difficulty levels.
| Model | Syn. Easy | Syn. Med. | Syn. Hard | Sofia Easy | Sofia Med. | Sofia Hard | Oxf. Easy | Oxf. Med. | Oxf. Hard | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| — | 29.30 | 26.87 | 18.87 | 32.04 | 29.35 | 31.69 | 39.72 | 37.13 | 40.72 | 30.93 |
| Gemini 2.5 Pro | 31.05 | 24.81 | 16.51 | 38.24 | 26.81 | 28.87 | 36.75 | 36.81 | 27.88 | 30.60 |
| InternVL3-8B | 20.21 | 15.50 | 16.98 | 24.03 | 21.74 | 20.42 | 22.90 | 20.85 | 16.36 | 20.06 |
| Qwen2.5-VL-72B | 18.39 | 15.25 | 12.26 | 24.03 | 17.03 | 20.42 | 24.81 | 22.80 | 16.36 | 18.99 |
| GLM-4.1V-9B-Base | 19.09 | 13.70 | 15.57 | 18.60 | 18.48 | 16.20 | 17.15 | 22.15 | 18.79 | 18.20 |
| VideoLLaMA3-7B | 16.85 | 13.44 | 14.62 | 11.11 | 10.87 | 9.15 | 12.26 | 10.46 | 9.15 | 13.64 |
| Qwen2.5-VL-7B | 13.01 | 13.95 | 13.68 | 15.44 | 12.68 | 12.68 | 13.74 | 13.36 | 12.73 | 13.44 |
| Qwen2.5-VL-3B | 14.69 | 10.34 | 7.08 | 15.50 | 13.04 | 12.68 | 17.18 | 11.40 | 12.12 | 13.41 |
| LLaVA-NeXT-Video-7B | 6.36 | 11.37 | 1.89 | 13.95 | 9.78 | 14.79 | 3.05 | 2.61 | 3.03 | 7.28 |
| EgoGPT | 15.79 | 13.55 | 12.04 | 12.41 | 12.13 | 10.36 | 12.37 | 13.58 | 13.68 | 14.29 |
Day–Night Performance Gap
Citation
@inproceedings{zhang2026egonight,
title={EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark},
author={Zhang, Deheng and Fu, Yuqian and Yang, Runyi and Miao, Yang and Qian, Tianwen and Zheng, Xu and Sun, Guolei and Chhatkuli, Ajad and Huang, Xuanjing and Jiang, Yu-Gang and Van Gool, Luc and Paudel, Danda Pani},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
Contact
For questions and collaboration, please reach out: