LLM-DepthEval: Evaluating GPT-4o for Monocular Depth Estimation
April 6, 2025 ยท View on GitHub
๐ Project Overview
Large language models (LLMs) with multimodal capabilities, such as GPT-4o, have demonstrated impressive generalization across diverse tasks involving both text and visual data. However, these models are primarily trained on general multimodal datasets without explicit fine-tuning for specific visual tasks such as monocular depth estimation.
This project aims to investigate whether GPT-4o inherently possesses the ability to estimate depth from single RGB images without specific training in monocular depth estimation. More broadly, we seek to evaluate if GPT-4o's multimodal training has led to an emergent "world model," allowing it to implicitly grasp physical relationshipsโparticularly spatial and depth relationsโthrough joint learning from RGB images and textual information.
๐ฏ Objectives
Determine if GPT-4o can perform monocular depth estimation without explicit fine-tuning.
Explore the potential of GPT-4o as an implicit "world model," reflecting an understanding of physical relationships within visual scenes.
Assess whether GPT-4o's multimodal learning enables the model to infer accurate spatial relationships from purely visual and textual contexts.
๐ Pipeline Overview

๐ Project Structure
(To be expanded as the project progresses.)
๐ Quick Start
Install dependencies:
pip install -r requirements.txt
๐ Methodology
We will evaluate GPT-4o through:
Qualitative analyses of GPT-4o-generated depth maps.
Quantitative benchmarks against established monocular depth estimation datasets.
Comparative analysis with traditional depth estimation models.
๐ Future Directions
Extend analysis to other physical understanding tasks.
Investigate transfer learning capabilities of GPT-4o across diverse multimodal tasks.
๐ค Collaborators Wanted
We are actively looking for passionate collaborators to join this exploration into the capabilities of GPT-4o in depth estimation and multimodal understanding. Whether your expertise is in computer vision, machine learning, multimodal models, or evaluation methodologies, we'd love to connect with you.
๐ Acknowledgements
This project draws inspiration from the GPT-ImgEval repository, particularly in its automated visual evaluation workflow.
We adapted and extended the pipeline to support automatic image downloading, tailored specifically for macOS systems. Our automation utilizes a combination of AppleScript and PyAutoGUI to streamline screenshot capturing and clipboard handling.
โ ๏ธ Note: The automatic processing pipeline is currently tested and supported only on macOS.