Abstract
TL;DR: EgoNight is the first benchmark for nighttime egocentric vision, with day–night aligned videos and 3,658 VQA pairs. SOTA MLLMs struggle to generalize from day to night.
Existing egocentric vision benchmarks focus on daytime, overlooking low-light conditions common in real-world settings. We introduce EgoNight, the first benchmark for nighttime egocentric vision, centered on visual question answering (VQA). EgoNight features day–night aligned videos, combining synthetic (Blender) and real-world data to ensure aligned scenes and actions, improving annotation quality and enabling direct comparison across lighting conditions. EgoNight-VQA includes 3,658 QA pairs from 90 videos across 12 QA types. Evaluations show significant performance drops for state-of-the-art MLLMs when moving from day to night.
Dataset
EgoNight comprises two subsets with day–night aligned videos and one subset with night-only videos:
EgoNight-Sofia
Real-world indoor/outdoor egocentric recordings
EgoNight-Synthetic
Blender-rendered synthetic scenes with Infinigen
EgoNight-Oxford
Night-only video sequences from Oxford-day-and-night
Main Pipeline
EgoNight main pipeline.
Statistics
Dataset statistics.
Benchmark & Question Types
EgoNight-VQA covers 12 question types for fine-grained evaluation:
- Object Recognition
- Text Recognition
- Spatial Reasoning
- Scene Sequence
- Navigation
- Counting of Static
- Action Recognition
- Non-Common-Sense Reasoning
- Lighting Recognition
- Lighting Dynamic
- Dynamic Detection
- Counting of Dynamic
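Evaluation over these question types reduces to per-type exact-match accuracy on the QA pairs. The sketch below illustrates this with a hypothetical record layout (the field names `video`, `qa_type`, `answer`, and `prediction` are illustrative, not the official EgoNight schema):

```python
from collections import defaultdict

# Hypothetical EgoNight-VQA records; field names are assumptions.
qa_pairs = [
    {"video": "sofia_001", "qa_type": "Object Recognition",
     "question": "What object is on the table?",
     "answer": "lamp", "prediction": "lamp"},
    {"video": "sofia_001", "qa_type": "Counting of Static",
     "question": "How many chairs are visible?",
     "answer": "3", "prediction": "2"},
]

def per_type_accuracy(records):
    """Exact-match accuracy, broken down by QA type."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["qa_type"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            hits[r["qa_type"]] += 1
    return {t: hits[t] / totals[t] for t in totals}

print(per_type_accuracy(qa_pairs))
# → {'Object Recognition': 1.0, 'Counting of Static': 0.0}
```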
QA Examples
Auxiliary Tasks
Depth Estimation & Day–Night Retrieval
Beyond VQA, EgoNight introduces two auxiliary tasks: egocentric depth estimation at night and day–night correspondence retrieval. Depth is evaluated only on EgoNight-Synthetic, where ground truth is available from the Blender-rendered scenes.
Depth Estimation
We estimate scene geometry from monocular egocentric frames by predicting depth and inpainting regions occluded by the wearer, yielding a complete background depth map.
Egocentric depth estimation at night on EgoNight-Synthetic.
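Since the Blender scenes provide dense ground-truth depth, predictions can be scored with standard monocular-depth metrics. A minimal sketch of absolute relative error (AbsRel), assuming a hypothetical wearer mask that excludes pixels occluded by the camera wearer:

```python
# AbsRel over flattened depth maps; toy values, not EgoNight data.

def abs_rel(pred, gt, wearer_mask):
    """Mean absolute relative error over valid (non-wearer, gt > 0) pixels."""
    errs = [abs(p - g) / g
            for p, g, m in zip(pred, gt, wearer_mask)
            if not m and g > 0]
    return sum(errs) / len(errs)

pred = [2.0, 4.5, 1.0, 3.0]         # predicted depth (metres), flattened
gt   = [2.0, 5.0, 0.0, 3.0]         # rendered ground truth; 0 = invalid
mask = [False, False, False, True]  # True = wearer pixel, ignored
print(abs_rel(pred, gt, mask))  # → 0.05
```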
Retrieval (Spatial & Temporal)
We perform cross-condition (day–night) retrieval using features robust to illumination changes: spatial retrieval matches frames from the same location, while temporal retrieval aligns frames corresponding to the same moment in time.
Day–night correspondence retrieval qualitative results.
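The retrieval protocol can be sketched as nearest-neighbor matching under cosine similarity between illumination-robust features; the feature vectors below are toy examples, not actual EgoNight embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(night_feats, day_feats):
    """For each night frame, return the index of its best-matching day frame."""
    return [max(range(len(day_feats)), key=lambda j: cosine(q, day_feats[j]))
            for q in night_feats]

day   = [[1.0, 0.0], [0.0, 1.0]]  # features of two day frames
night = [[0.9, 0.1], [0.2, 0.8]]  # corresponding night frames
print(retrieve(night, day))  # → [0, 1]
```

Spatial retrieval compares frames by location; temporal retrieval applies the same matching but scores alignment to the same moment in time.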
Results
State-of-the-art MLLMs show substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions.
Leaderboard on EgoNight-VQA
Accuracies (%) of OpenQA results across three datasets and three difficulty levels.
| Model | Syn. Easy | Syn. Med. | Syn. Hard | Sofia Easy | Sofia Med. | Sofia Hard | Oxf. Easy | Oxf. Med. | Oxf. Hard | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| — | 29.30 | 26.87 | 18.87 | 32.04 | 29.35 | 31.69 | 39.72 | 37.13 | 40.72 | 30.93 |
| Gemini 2.5 Pro | 31.05 | 24.81 | 16.51 | 38.24 | 26.81 | 28.87 | 36.75 | 36.81 | 27.88 | 30.60 |
| InternVL3-8B | 20.21 | 15.50 | 16.98 | 24.03 | 21.74 | 20.42 | 22.90 | 20.85 | 16.36 | 20.06 |
| Qwen2.5-VL-72B | 18.39 | 15.25 | 12.26 | 24.03 | 17.03 | 20.42 | 24.81 | 22.80 | 16.36 | 18.99 |
| GLM-4.1V-9B-Base | 19.09 | 13.70 | 15.57 | 18.60 | 18.48 | 16.20 | 17.15 | 22.15 | 18.79 | 18.20 |
| VideoLLaMA3-7B | 16.85 | 13.44 | 14.62 | 11.11 | 10.87 | 9.15 | 12.26 | 10.46 | 9.15 | 13.64 |
| Qwen2.5-VL-7B | 13.01 | 13.95 | 13.68 | 15.44 | 12.68 | 12.68 | 13.74 | 13.36 | 12.73 | 13.44 |
| Qwen2.5-VL-3B | 14.69 | 10.34 | 7.08 | 15.50 | 13.04 | 12.68 | 17.18 | 11.40 | 12.12 | 13.41 |
| LLaVA-NeXT-Video-7B | 6.36 | 11.37 | 1.89 | 13.95 | 9.78 | 14.79 | 3.05 | 2.61 | 3.03 | 7.28 |
| EgoGPT | 15.79 | 13.55 | 12.04 | 12.41 | 12.13 | 10.36 | 12.37 | 13.58 | 13.68 | 14.29 |
Day–Night Performance Gap
Citation
@inproceedings{zhang2026egonight,
title={EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark},
author={Zhang, Deheng and Fu, Yuqian and Yang, Runyi and Miao, Yang and Qian, Tianwen and Zheng, Xu and Sun, Guolei and Chhatkuli, Ajad and Huang, Xuanjing and Jiang, Yu-Gang and Van Gool, Luc and Paudel, Danda Pani},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
Contact
For questions and collaboration, please reach out: