EgoNight

Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

ICLR 2026

*Equal Contribution     †Corresponding Author

Abstract

TL;DR: EgoNight is the first benchmark for nighttime egocentric vision, with day–night aligned videos and 3,658 VQA pairs. SOTA MLLMs struggle to generalize from day to night.

Existing egocentric vision benchmarks focus on daytime, overlooking low-light conditions common in real-world settings. We introduce EgoNight, the first benchmark for nighttime egocentric vision, centered on visual question answering (VQA). EgoNight features day–night aligned videos, combining synthetic (Blender) and real-world data to ensure aligned scenes and actions, improving annotation quality and enabling direct comparison across lighting conditions. EgoNight-VQA includes 3,658 QA pairs from 90 videos across 12 QA types. Evaluations show significant performance drops for state-of-the-art MLLMs when moving from day to night.

EgoNight Teaser

Dataset

EgoNight comprises two subsets with day–night aligned videos, and one subset with night-only videos:

EgoNight-Sofia

Real-world indoor/outdoor egocentric recordings


EgoNight-Synthetic

Blender-rendered synthetic scenes with Infinigen


EgoNight-Oxford

Night-only video sequences from the Oxford-day-and-night dataset


Main Pipeline


EgoNight main pipeline.

Statistics


Dataset statistics.

Benchmark & Question Types

EgoNight-VQA covers 12 question types for fine-grained evaluation:

  • Object Recognition
  • Text Recognition
  • Spatial Reasoning
  • Scene Sequence
  • Navigation
  • Counting of Static
  • Action Recognition
  • Non-Common-Sense Reasoning
  • Lighting Recognition
  • Lighting Dynamic
  • Dynamic Detection
  • Counting of Dynamic
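Per-type scores like the twelve above are typically obtained by grouping QA pairs by type and averaging correctness. Here is a minimal sketch of that aggregation; the field names (`qa_type`, `prediction`, `answer`) and the exact-match scoring are our assumptions for illustration, not the official EgoNight evaluation protocol (the benchmark uses open-ended QA, which is usually judged more leniently).

```python
from collections import defaultdict

def per_type_accuracy(records):
    """Aggregate exact-match accuracy (%) per question type.

    `records` is a list of dicts with hypothetical keys
    'qa_type', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["qa_type"]] += 1
        # Case-insensitive exact match as a stand-in for the real judge.
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["qa_type"]] += 1
    return {t: 100.0 * correct[t] / total[t] for t in total}

records = [
    {"qa_type": "Object Recognition", "prediction": "a lamp", "answer": "A lamp"},
    {"qa_type": "Object Recognition", "prediction": "a chair", "answer": "a table"},
    {"qa_type": "Navigation", "prediction": "turn left", "answer": "turn left"},
]
print(per_type_accuracy(records))
# {'Object Recognition': 50.0, 'Navigation': 100.0}
```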

QA Examples (Object Recognition)


Auxiliary Tasks

Depth Estimation & Day–Night Retrieval

Beyond VQA, EgoNight introduces two auxiliary tasks: egocentric depth estimation at night and day–night correspondence retrieval. The depth task is evaluated on EgoNight-Synthetic only, where Blender-rendered scenes provide ground-truth depth.

Depth Estimation

We estimate scene geometry from monocular egocentric frames by predicting depth and inpainting regions occluded by the wearer, yielding a complete background depth map.


Egocentric depth estimation at night on EgoNight-Synthetic.
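A common way to score monocular depth predictions against such ground truth is the mean absolute relative error (AbsRel), computed over valid pixels only. The metric choice and masking convention below are our assumptions for illustration, not necessarily the benchmark's exact protocol:

```python
import numpy as np

def abs_rel(pred, gt, mask=None):
    """Mean absolute relative depth error over valid pixels.

    Pixels with gt == 0 (e.g. no rendered ground truth, or the
    wearer's body) are excluded by the default mask.
    """
    if mask is None:
        mask = gt > 0
    p, g = pred[mask], gt[mask]
    return float(np.mean(np.abs(p - g) / g))

gt = np.array([[1.0, 2.0], [4.0, 0.0]])   # 0 marks an invalid pixel
pred = np.array([[1.1, 1.8], [4.0, 3.0]])
print(abs_rel(pred, gt))  # (0.1/1 + 0.2/2 + 0/4) / 3 ≈ 0.0667
```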

Retrieval (Spatial & Temporal)

We perform cross-condition (day–night) retrieval using features robust to illumination changes: spatial retrieval matches frames from the same location, while temporal retrieval aligns frames corresponding to the same moment in time.


Day–night correspondence retrieval qualitative results.
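Spatial retrieval reduces to nearest-neighbor search between night-frame and day-frame embeddings. The sketch below matches each night query to its most similar day frame by cosine similarity and scores Recall@1; the feature extractor and evaluation details are assumptions, not EgoNight's actual pipeline:

```python
import numpy as np

def retrieve(night_feats, day_feats):
    """Return the index of the best-matching day frame for each
    night frame, using cosine similarity on L2-normalized features."""
    n = night_feats / np.linalg.norm(night_feats, axis=1, keepdims=True)
    d = day_feats / np.linalg.norm(day_feats, axis=1, keepdims=True)
    return (n @ d.T).argmax(axis=1)

def recall_at_1(matches, gt_indices):
    """Fraction of queries whose top-1 match is the correct frame."""
    return float(np.mean(matches == gt_indices))

# Toy features: night frames are perturbed copies of the day frames.
day = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
night = np.array([[0.9, 0.1], [0.1, 0.9], [1.1, 0.9]])
matches = retrieve(night, day)
print(matches, recall_at_1(matches, np.arange(3)))  # [0 1 2] 1.0
```

Temporal retrieval follows the same pattern, except the ground-truth index pairs frames by timestamp rather than by location.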

Results

State-of-the-art MLLMs show substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions.

Leaderboard on EgoNight-VQA

Accuracies (%) of OpenQA results across three datasets and three difficulty levels.

| Model | Synthetic Easy | Synthetic Medium | Synthetic Hard | Sofia Easy | Sofia Medium | Sofia Hard | Oxford Easy | Oxford Medium | Oxford Hard | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 29.30 | 26.87 | 18.87 | 32.04 | 29.35 | 31.69 | 39.72 | 37.13 | 40.72 | 30.93 |
| Gemini 2.5 Pro | 31.05 | 24.81 | 16.51 | 38.24 | 26.81 | 28.87 | 36.75 | 36.81 | 27.88 | 30.60 |
| InternVL3-8B | 20.21 | 15.50 | 16.98 | 24.03 | 21.74 | 20.42 | 22.90 | 20.85 | 16.36 | 20.06 |
| Qwen2.5-VL-72B | 18.39 | 15.25 | 12.26 | 24.03 | 17.03 | 20.42 | 24.81 | 22.80 | 16.36 | 18.99 |
| GLM-4.1V-9B-Base | 19.09 | 13.70 | 15.57 | 18.60 | 18.48 | 16.20 | 17.15 | 22.15 | 18.79 | 18.20 |
| VideoLLaMA3-7B | 16.85 | 13.44 | 14.62 | 11.11 | 10.87 | 9.15 | 12.26 | 10.46 | 9.15 | 13.64 |
| Qwen2.5-VL-7B | 13.01 | 13.95 | 13.68 | 15.44 | 12.68 | 12.68 | 13.74 | 13.36 | 12.73 | 13.44 |
| Qwen2.5-VL-3B | 14.69 | 10.34 | 7.08 | 15.50 | 13.04 | 12.68 | 17.18 | 11.40 | 12.12 | 13.41 |
| LLaVA-NeXT-Video-7B | 6.36 | 11.37 | 1.89 | 13.95 | 9.78 | 14.79 | 3.05 | 2.61 | 3.03 | 7.28 |
| EgoGPT | 15.79 | 13.55 | 12.04 | 12.41 | 12.13 | 10.36 | 12.37 | 13.58 | 13.68 | 14.29 |

Day–Night Performance Gap


Citation

@inproceedings{zhang2026egonight,
  title={EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark},
  author={Zhang, Deheng and Fu, Yuqian and Yang, Runyi and Miao, Yang and Qian, Tianwen and Zheng, Xu and Sun, Guolei and Chhatkuli, Ajad and Huang, Xuanjing and Jiang, Yu-Gang and Van Gool, Luc and Paudel, Danda Pani},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

Contact

For questions and collaboration, please reach out to the authors.