EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Deheng Zhang1*, Yuqian Fu1*, Runyi Yang1, Yang Miao1, Tianwen Qian2, Xu Zheng1,3, Guolei Sun4, Ajad Chhatkuli1, Xuanjing Huang5, Yu-Gang Jiang5, Luc Van Gool1, Danda Pani Paudel1

1 INSAIT, Sofia University    2 East China Normal University    3 HKUST(GZ)    4 Nankai University    5 Fudan University

ICLR 2026

Overview

EgoNight is the first benchmark suite for nighttime egocentric vision, centered on VQA and extended with retrieval and depth estimation.

  • Most egocentric benchmarks emphasize daytime scenes and miss low-light deployment realities.
  • Nighttime assistive agents face visibility drop, uneven lighting, and noisy first-person motion.
  • Aligned day-night videos enable cleaner annotation and fair illumination-gap analysis.
Dataset statistics: 90 videos, 3,658 QA pairs, 12 QA types.

Figure: EgoNight teaser overview.

Dataset Design

EgoNight combines aligned synthetic and real-world sources with diverse scenes, illumination types, and easy/medium/hard difficulty levels.

  • Synthetic: 50 aligned day-night pairs
  • Sofia: 20 aligned day-night pairs
  • Oxford: 20 night videos

Figure: EgoNight QA taxonomy and examples.

Annotation Pipeline

  • Generate nighttime captions with QA-type-specific prompts.
  • Create diverse candidate QA pairs from clips and captions.
  • Use day-augmented synthesis for paired QA types.

All 3,658 QA pairs are manually verified. Total annotation effort: 300+ hours.

Figure: EgoNight data and annotation pipeline.
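The three-stage pipeline above can be sketched as follows. This is a minimal illustrative mock-up, not the authors' actual code: the function names, prompt text, and stub outputs are all hypothetical, standing in for the MLLM calls used in practice.

```python
# Hypothetical sketch of the three-stage EgoNight annotation pipeline.
# All function bodies are stubs; real stages would call an MLLM.

QA_TYPES = ["lighting", "navigation", "non-common-sense"]  # 3 of the 12 QA types

def caption_clip(clip_id: str, qa_type: str) -> str:
    """Stage 1: nighttime caption with a QA-type-specific prompt (stubbed)."""
    prompt = f"Describe the night clip, focusing on {qa_type} cues."
    return f"[caption for {clip_id} | prompt: {prompt}]"

def propose_qa(caption: str, qa_type: str) -> dict:
    """Stage 2: a candidate QA pair derived from the clip caption (stubbed)."""
    return {"type": qa_type, "question": f"({qa_type}) ...?", "context": caption}

def day_augment(qa: dict, day_caption: str) -> dict:
    """Stage 3: for paired QA types, attach the aligned daytime view."""
    return dict(qa, day_context=day_caption)

# Every surviving candidate is then manually verified before entering the benchmark.
candidates = []
for qa_type in QA_TYPES:
    cap = caption_clip("clip_0001", qa_type)
    candidates.append(day_augment(propose_qa(cap, qa_type), "[aligned day caption]"))

print(len(candidates))  # one candidate per QA type in this toy run
```

The day-augmentation stage is only meaningful where an aligned daytime recording exists, which is why it applies to the paired (Synthetic and Sofia) QA types.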

Main VQA Results

Even top MLLMs struggle under nighttime egocentric settings.

Category      | Best model   | Avg. Acc (%)
Closed-source | GPT-4.1      | 30.93
Open-source   | InternVL3-8B | 20.06
Egocentric    | EgoGPT       | 14.29
  • Day-to-night drop: 32.8% on Synthetic, 25.0% on Sofia.
  • New QA types (lighting/navigation/non-common-sense) remain hardest.
Figure: Day-night performance gap across QA types.
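The day-to-night drop reported above is just the difference between per-split accuracies. A minimal sketch of that computation, on made-up toy records rather than benchmark data:

```python
# Toy sketch: per-split accuracy and day-to-night drop from
# (prediction, answer) pairs. Records are invented, not benchmark results.

def accuracy(records):
    """Fraction of records whose predicted choice matches the answer."""
    return sum(p == a for p, a in records) / len(records)

day   = [("A", "A"), ("B", "B"), ("C", "D"), ("A", "A")]   # 3/4 correct
night = [("A", "B"), ("B", "B"), ("C", "D"), ("A", "C")]   # 1/4 correct

drop = (accuracy(day) - accuracy(night)) * 100  # gap in percentage points
print(f"day-to-night drop: {drop:.1f} pts")     # → 50.0 pts on this toy data
```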

Day-night Correspondence Retrieval

  • Spatial retrieval (R@1, Night->Day): GPT-4.1 leads on both Synthetic (54.1) and Sofia (84.5).
  • Temporal localization (mIoU, Night->Day): feature encoders are more stable than MLLMs.
  • GPT-4.1 drops from Day->Day to Night->Day by 21.5 points (Synthetic) and 8.0 points (Sofia).
Model        | Spatial N->D R@1 (Syn/Sofia) | Temporal N->D mIoU (Syn/Sofia)
GPT-4.1      | 54.1 / 84.5                  | 10.0 / 15.5
Percep. Enc. | 41.6 / 80.9                  | 32.9 / 33.4
DINOv2       | 28.7 / 74.5                  | 33.7 / 33.1
InternVL3-8B | 27.7 / 56.3                  | 9.9 / 13.3
Figure: Qualitative day-night retrieval examples.
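For the feature-encoder baselines, spatial Night->Day R@1 amounts to nearest-neighbor retrieval over frame embeddings. A self-contained sketch under that assumption, using tiny toy vectors in place of real DINOv2-style features:

```python
# Sketch of Night->Day spatial retrieval with frozen features: rank all day
# frames by cosine similarity to each night query and count how often the
# aligned day frame ranks first (R@1). Vectors are toy data, not model features.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recall_at_1(night_feats, day_feats):
    """night_feats[i] is aligned with day_feats[i]; return R@1 over night queries."""
    hits = 0
    for i, q in enumerate(night_feats):
        best = max(range(len(day_feats)), key=lambda j: cosine(q, day_feats[j]))
        hits += (best == i)
    return hits / len(night_feats)

night = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]
day   = [[0.9, 0.0], [0.0, 0.9], [0.2, 0.9]]
print(round(recall_at_1(night, day), 2))  # → 0.67 (2 of 3 queries retrieved)
```

The same ranking idea extends to temporal localization, except that scores are aggregated over candidate time windows and evaluated with mIoU instead of R@1.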

Depth Estimation at Night

  • Best overall: UniK3D (night AbsRel 0.253, δ1 0.254).
  • Fisheye-aware methods outperform generic depth estimators.
  • A clear day-night gap persists across all methods.
Method         | AbsRel D/N    | δ1 D/N
Depth Anything | 0.297 / 0.302 | 0.249 / 0.237
VGGTStream     | 0.293 / 0.298 | 0.234 / 0.232
DAC            | 0.245 / 0.292 | 0.255 / 0.216
UniK3D         | 0.224 / 0.253 | 0.280 / 0.254
Figure: Nighttime egocentric depth estimation examples.
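The two reported metrics follow the standard monocular-depth definitions: AbsRel is the mean relative error (lower is better), and δ1 is the fraction of pixels whose prediction/ground-truth ratio stays below 1.25 (higher is better). A minimal sketch on toy per-pixel depths:

```python
# Standard monocular depth metrics on toy values, not benchmark data.
# AbsRel = mean(|pred - gt| / gt); delta1 = mean(max(pred/gt, gt/pred) < 1.25).

def abs_rel(pred, gt):
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, thresh=1.25):
    return sum(max(p / g, g / p) < thresh for p, g in zip(pred, gt)) / len(gt)

gt   = [1.0, 2.0, 4.0, 8.0]   # ground-truth depths in meters
pred = [1.1, 1.8, 5.5, 8.4]   # predictions; the 5.5 m pixel fails the 1.25 test

print(round(abs_rel(pred, gt), 3), round(delta1(pred, gt), 2))  # → 0.156 0.75
```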

Takeaways

  • Nighttime egocentric vision is far from solved: the best model (GPT-4.1) averages only 30.93% VQA accuracy.
  • Aligned day-night videos make the illumination gap directly measurable across VQA, retrieval, and depth estimation.
  • EgoNight offers a common testbed for developing more robust low-light egocentric models.

Copyright (c) 2026 EgoNight authors. All rights reserved.