[WACV 2024 Survey Paper] Multimodal Large Language Models for Autonomous Driving

IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving

A Survey on Multimodal Large Language Models for Autonomous Driving

We have added new references from CVPR 2024 to this repo; some of them come from 自动驾驶之心.

💥 News: MAPLM (Tencent, UIUC) and LaMPilot (Purdue University) from our team have been accepted to CVPR 2024.

News: The LLVM-AD Workshop was successfully held at WACV 2024.

WACV 2024 Proceedings | Arxiv | Workshop | Report by 机器之心

Summary of the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD)

Abstract

With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems built on large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite this immense potential, there is still no comprehensive understanding of the key challenges, opportunities, and future directions for applying LLMs to driving systems. In this repo, we present a systematic investigation of this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models using LLMs, and the history of autonomous driving. Then, we give an overview of existing MLLM tools for driving, transportation, and map systems, together with existing datasets and benchmarks. Moreover, we summarize the works presented at the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), the first workshop of its kind on LLMs in autonomous driving. To further promote the development of this field, we also discuss several important open problems in using MLLMs in autonomous driving systems that need to be solved by both academia and industry.

Awesome Papers

MLLM for Perception & Planning & Control for Autonomous Driving

Please ping us if you find any interesting new papers in this area. We will add them to the table, and all of them will be included in the next version of the survey paper.

| Model | Year | Backbone | Task | Modality | Learning | Input | Output |
|---|---|---|---|---|---|---|---|
| Driving with LLMs | 2023 | LLaMA | Perception, Control | Vision, Language | Finetuning | Vector, Query | Response / Actions |
| Talk2BEV | 2023 | Flan5XXL, Vicuna-13b | Perception, Planning | Vision, Language | In-context learning | Image, Query | Response |
| GAIA-1 | 2023 | - | Planning | Vision, Language | Pretraining | Video, Prompt | Video |
| DiLu | 2023 | GPT-3.5, GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
| Drive as You Speak | 2023 | GPT-4 | Planning | Language | In-context learning | Text | Code |
| Receive, Reason, and React | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
| Drive Like a Human | 2023 | GPT-3.5 | Planning, Control | Language | In-context learning | Text | Action |
| GPT-Driver | 2023 | GPT-3.5 | Planning | Vision, Language | In-context learning | Text | Trajectory |
| SurrealDriver | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Text / Action |
| LanguageMPC | 2023 | GPT-3.5 | Planning | Language | In-context learning | Text | Action |
| DriveGPT4 | 2023 | Llama 2 | Planning, Control | Vision, Language | In-context learning | Image, Text, Action | Text / Action |
| Domain Knowledge Distillation from LLMs | 2023 | GPT-3.5 | Text Generation | Language | In-context learning | Text | Concept |
| LaMPilot | 2023 | GPT-4 / LLaMA-2 / PaLM2 | Planning (Code Generation) | Language | In-context learning | Text | Code as action |
| Language Agent | 2023 | GPT-3.5 | Planning | Language | Training | Text | Action |
| LMDrive | 2023 | CARLA + LLaVA | Planning, Control | Vision, Language | Training | RGB Image, LiDAR, Text | Control Signal |
| On the Road with GPT-4V(ision) | 2023 | GPT-4Vision | Perception | Vision, Language | In-context learning | RGB Image, Text | Text Description |
| DriveLLM | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
| DriveMLM | 2023 | LLaMA+Q-Former | Perception, Planning | Vision, Language | Training | RGB Image, LiDAR, Text | Decision State |
| DriveLM | 2023 | GVQA | Perception, Planning | Vision, Language | Training | RGB Image, Text | Text / Action |
| LangProp | 2024 | IL, DAgger, RL + ChatGPT | Planning (Code/Action Generation) | CARLA simulator Vision, Language | Training | CARLA simulator, Text | Code as action |
| LimSim++ | 2024 | LimSim, GPT-4 | Planning | Simulator BEV, Language | In-context learning | Simulator Vision, Language | Text / Action |
| DriveVLM | 2024 | Qwen-VL | Planning | Sequence of Images, Language | Training | Vision, Language | Text / Action |
| RAG-Driver | 2024 | Vicuna1.5-7B | Planning, Control | Video, Language | Training | Vision, Language | Text / Action |
| ChatSim | 2024 | GPT-4 | Perception (Image Editing) | Image, Language | In-context learning | Vision, Language | Image |
| VLP | 2024 | CLIP Text Encoder | Planning | Image, Language | Training | Vision, Language | Text / Action |

Datasets

This table is inspired by the "Comparison and stats" section of DriveLM.

| Dataset | Base Dataset | Language Form | Perspectives | Scale | Release? |
|---|---|---|---|---|---|
| BDD-X 2018 | BDD | Description | Planning Description & Justification | 8M frames, 20k text strings | ✔️ |
| HAD HRI Advice 2019 | HDD | Advice | Goal-oriented & stimulus-driven advice | 5,675 video clips, 45k text strings | ✔️ |
| Talk2Car 2019 | nuScenes | Description | Goal Point Description | 30K frames, 10K text strings | ✔️ |
| SUTD-TrafficQA 2021 | Self-Collected | QA | QA | 10k frames, 62k text strings | ✔️ |
| DRAMA 2022 | Self-Collected | Description | QA + Captions | 18k frames, 100k text strings | ✔️ |
| nuScenes-QA 2023 | nuScenes | QA | Perception Result | 30K frames, 460K generated QA pairs | nuScenes-QA |
| Reason2Drive 2023 | nuScenes, Waymo, ONCE | QA | Perception, Prediction and Reasoning | 600K video-text pairs | Reason2Drive |
| Rank2Tell 2023 | Self-Collected | QA | Risk Localization and Ranking | 116 video clips (20s each) | Rank2Tell |
| DriveLM 2023 | nuScenes | QA + Scene Description | Perception, Prediction and Planning with Logic | 30K frames, 360k annotated QA pairs | DriveLM |
| MAPLM 2023 | THMA | QA + Scene Description | Perception, Prediction and HD Map Annotation | 2M frames, 16M annotated HD map Description + 13K released QA pairs | MAPLM |
| LingoQA 2023 | Collected by Wayve | QA | Perception, and Planning | 28K frames, 419.9K QA + Captioning | LingoQA |

Other Survey Papers

| Paper | Year | Focus |
|---|---|---|
| Vision Language Models in Autonomous Driving and Intelligent Transportation Systems | 2023 | Vision-Language Models for Transportation Systems |
| LLM4Drive: A Survey of Large Language Models for Autonomous Driving | 2023 | Language Models for Autonomous Driving |
| Towards Knowledge-driven Autonomous Driving | 2023 | How large language models, world models, and neural rendering can contribute to a more holistic, adaptive, and intelligent autonomous driving system |
| Applications of Large Scale Foundation Models for Autonomous Driving | 2023 | Large Scale Foundation Models (LLMs, VLMs, VFMs, World Models) for Autonomous Driving |
| Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies | 2023 | Closed-Loop Autonomous Driving |
| A Survey on Autonomous Driving Datasets: Data Statistic, Annotation, and Outlook | 2024 | Autonomous Driving Datasets |
| A Survey for Foundation Models in Autonomous Driving | 2024 | Multimodal Foundation Models for Autonomous Driving |

Papers Accepted by WACV 2024 LLVM-AD

A Survey on Multimodal Large Language Models for Autonomous Driving

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

A Game of Bundle Adjustment - Learning Efficient Convergence (accepted as a tech report for the authors' ICCV 2023 paper)

VLAAD: Vision and Language Assistant for Autonomous Driving

A Safer Vision-based Autonomous Planning System for Quadrotor UAVs with Dynamic Obstacle Trajectory Prediction and Its Application with LLMs

Human-Centric Autonomous Systems With LLMs for User Command Reasoning

NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations

Latency Driven Spatially Sparse Optimization for Multi-Branch CNNs for Semantic Segmentation

LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization

Future Directions Section

Social Behavior for Autonomous Driving (UIUC, Purdue University)

Personalized Autonomous Driving (Purdue University, UIUC)

Hardware Support for LLMs in Autonomous Driving (SambaNova Systems)

LLMs for HD Maps (Tencent)

Code as Action for Autonomous Driving (Purdue University, UIUC)

Citation

If the survey and our workshop inspire you, please cite our work:

@inproceedings{cui2024survey,
  title={A survey on multimodal large language models for autonomous driving},
  author={Cui, Can and Ma, Yunsheng and Cao, Xu and Ye, Wenqian and Zhou, Yang and Liang, Kaizhao and Chen, Jintai and Lu, Juanwu and Yang, Zichong and Liao, Kuei-Da and others},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={958--979},
  year={2024}
}