EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Deheng Zhang1*, Yuqian Fu1*, Runyi Yang1, Yang Miao1, Tianwen Qian2, Xu Zheng1,3, Guolei Sun4, Ajad Chhatkuli1, Xuanjing Huang5, Yu-Gang Jiang5, Luc Van Gool1, Danda Pani Paudel1

1 INSAIT, Sofia University    2 East China Normal University    3 HKUST(GZ)    4 Nankai University    5 Fudan University

ICLR 2026

Overview

EgoNight is the first benchmark suite for nighttime egocentric vision, centered on VQA and extended with retrieval and depth estimation.

  • Most egocentric benchmarks emphasize daytime scenes and miss low-light deployment realities.
  • Nighttime assistive agents face visibility drop, uneven lighting, and noisy first-person motion.
  • Aligned day-night videos enable cleaner annotation and fair illumination-gap analysis.
Dataset statistics: 90 videos, 3,658 QA pairs, 12 QA types.

Figure: EgoNight teaser overview.

Dataset Design

EgoNight combines aligned synthetic and real-world sources with diverse scenes, illumination types, and easy/medium/hard difficulty levels.

  • Synthetic: 50 aligned day-night pairs
  • Sofia: 20 aligned day-night pairs
  • Oxford: 20 night videos

Figure: EgoNight QA taxonomy and examples.

Annotation Pipeline

  • Generate nighttime captions with QA-type-specific prompts.
  • Create diverse candidate QA pairs from clips and captions.
  • Use day-augmented synthesis for paired QA types.

All 3,658 QA pairs are manually verified. Total annotation effort: 300+ hours.

Figure: EgoNight data and annotation pipeline.
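The three-stage pipeline above can be sketched as follows. This is a minimal illustrative mock-up, not the authors' actual code: the function names, prompt text, and stub outputs are all hypothetical, standing in for the MLLM calls used in practice.

```python
# Hypothetical sketch of the three-stage EgoNight annotation pipeline.
# All function bodies are stubs; real stages would call an MLLM.

QA_TYPES = ["lighting", "navigation", "non-common-sense"]  # 3 of the 12 QA types

def caption_clip(clip_id: str, qa_type: str) -> str:
    """Stage 1: nighttime caption with a QA-type-specific prompt (stubbed)."""
    prompt = f"Describe the night clip, focusing on {qa_type} cues."
    return f"[caption for {clip_id} | prompt: {prompt}]"

def propose_qa(caption: str, qa_type: str) -> dict:
    """Stage 2: a candidate QA pair derived from the clip caption (stubbed)."""
    return {"type": qa_type, "question": f"({qa_type}) ...?", "context": caption}

def day_augment(qa: dict, day_caption: str) -> dict:
    """Stage 3: for paired QA types, attach the aligned daytime view."""
    return dict(qa, day_context=day_caption)

# Every surviving candidate is then manually verified before entering the benchmark.
candidates = []
for qa_type in QA_TYPES:
    cap = caption_clip("clip_0001", qa_type)
    candidates.append(day_augment(propose_qa(cap, qa_type), "[aligned day caption]"))

print(len(candidates))  # one candidate per QA type in this toy run
```

The day-augmentation stage is only meaningful where an aligned daytime recording exists, which is why it applies to the paired (Synthetic and Sofia) QA types.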

Main VQA Results

Even top MLLMs struggle under nighttime egocentric settings.

Category      | Best model   | Avg. Acc (%)
Closed-source | GPT-4.1      | 30.93
Open-source   | InternVL3-8B | 20.06
Egocentric    | EgoGPT       | 14.29
  • Day-to-night drop: 32.8% on Synthetic, 25.0% on Sofia.
  • New QA types (lighting/navigation/non-common-sense) remain hardest.
Figure: Day-night performance gap across QA types.
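The day-to-night drop reported above is just the difference between per-split accuracies. A minimal sketch of that computation, on made-up toy records rather than benchmark data:

```python
# Toy sketch: per-split accuracy and day-to-night drop from
# (prediction, answer) pairs. Records are invented, not benchmark results.

def accuracy(records):
    """Fraction of records whose predicted choice matches the answer."""
    return sum(p == a for p, a in records) / len(records)

day   = [("A", "A"), ("B", "B"), ("C", "D"), ("A", "A")]   # 3/4 correct
night = [("A", "B"), ("B", "B"), ("C", "D"), ("A", "C")]   # 1/4 correct

drop = (accuracy(day) - accuracy(night)) * 100  # gap in percentage points
print(f"day-to-night drop: {drop:.1f} pts")     # → 50.0 pts on this toy data
```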

Day-night Correspondence Retrieval

  • Spatial retrieval (R@1, Night->Day): GPT-4.1 leads on both Synthetic (54.1) and Sofia (84.5).
  • Temporal localization (mIoU, Night->Day): feature encoders are more stable than MLLMs.
  • GPT-4.1 drops from Day->Day to Night->Day by 21.5 points (Synthetic) and 8.0 points (Sofia).
Model        | Spatial N->D R@1 (Syn/Sofia) | Temporal N->D mIoU (Syn/Sofia)
GPT-4.1      | 54.1 / 84.5                  | 10.0 / 15.5
Percep. Enc. | 41.6 / 80.9                  | 32.9 / 33.4
DINOv2       | 28.7 / 74.5                  | 33.7 / 33.1
InternVL3-8B | 27.7 / 56.3                  | 9.9 / 13.3
Figure: Qualitative day-night retrieval examples.
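For the feature-encoder baselines, spatial Night->Day R@1 amounts to nearest-neighbor retrieval over frame embeddings. A self-contained sketch under that assumption, using tiny toy vectors in place of real DINOv2-style features:

```python
# Sketch of Night->Day spatial retrieval with frozen features: rank all day
# frames by cosine similarity to each night query and count how often the
# aligned day frame ranks first (R@1). Vectors are toy data, not model features.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recall_at_1(night_feats, day_feats):
    """night_feats[i] is aligned with day_feats[i]; return R@1 over night queries."""
    hits = 0
    for i, q in enumerate(night_feats):
        best = max(range(len(day_feats)), key=lambda j: cosine(q, day_feats[j]))
        hits += (best == i)
    return hits / len(night_feats)

night = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]
day   = [[0.9, 0.0], [0.0, 0.9], [0.2, 0.9]]
print(round(recall_at_1(night, day), 2))  # → 0.67 (2 of 3 queries retrieved)
```

The same ranking idea extends to temporal localization, except that scores are aggregated over candidate time windows and evaluated with mIoU instead of R@1.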

Depth Estimation at Night

  • Best overall: UniK3D (night AbsRel 0.253, δ1 0.254).
  • Fisheye-aware methods outperform generic depth estimators.
  • A clear day-night gap persists across all methods.
Method         | AbsRel D/N    | δ1 D/N
Depth Anything | 0.297 / 0.302 | 0.249 / 0.237
VGGTStream     | 0.293 / 0.298 | 0.234 / 0.232
DAC            | 0.245 / 0.292 | 0.255 / 0.216
UniK3D         | 0.224 / 0.253 | 0.280 / 0.254
Figure: Nighttime egocentric depth estimation examples.
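The two reported metrics follow the standard monocular-depth definitions: AbsRel is the mean relative error (lower is better), and δ1 is the fraction of pixels whose prediction/ground-truth ratio stays below 1.25 (higher is better). A minimal sketch on toy per-pixel depths:

```python
# Standard monocular depth metrics on toy values, not benchmark data.
# AbsRel = mean(|pred - gt| / gt); delta1 = mean(max(pred/gt, gt/pred) < 1.25).

def abs_rel(pred, gt):
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, thresh=1.25):
    return sum(max(p / g, g / p) < thresh for p, g in zip(pred, gt)) / len(gt)

gt   = [1.0, 2.0, 4.0, 8.0]   # ground-truth depths in meters
pred = [1.1, 1.8, 5.5, 8.4]   # predictions; the 5.5 m pixel fails the 1.25 test

print(round(abs_rel(pred, gt), 3), round(delta1(pred, gt), 2))  # → 0.156 0.75
```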

Takeaways

  • Nighttime egocentric vision is far from solved: the best model (GPT-4.1) averages only 30.93% VQA accuracy.
  • Aligned day-night videos make the illumination gap directly measurable across VQA, retrieval, and depth estimation.
  • EgoNight offers a common testbed for developing more robust low-light egocentric models.

Copyright (c) 2026 EgoNight authors. All rights reserved.