Empowering multimodal road traffic profiling with Vision Language Models and frequency spectrum fusion

Haolong Xiang, Xiaolong Xu*, Guangdong Wang, Xuyun Zhang, Xiaoyong Li, Qi Zhang, Amin Beheshti, Wei Fan*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

With the rapid urbanization of the modern era, smart traffic profiling based on multimodal data sources has played a significant role in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic profiling at the road level utilize single-modality data, i.e., they mainly focus on image processing with deep vision models or on auxiliary analysis of textual data. However, the joint modeling and multimodal fusion of the textual and visual modalities have rarely been studied in road traffic profiling, which largely hinders the accurate prediction or classification of traffic conditions. To address this issue, we propose a novel multimodal learning and fusion framework for road traffic profiling, named TraffiCFUS. Specifically, given the traffic images, our TraffiCFUS framework first introduces Vision Language Models (VLMs) to generate text and then creates tailored prompt instructions for refining this text according to the specific scene requirements of road traffic profiling. Next, we apply the discrete Fourier transform to convert the multimodal data from the spatial domain to the frequency domain and perform a cross-modal spectrum transform to filter out information irrelevant to traffic profiling. Furthermore, the processed spatial multimodal data are combined to generate a fusion loss and an interaction loss with contrastive learning. Finally, extensive experiments on four real-world datasets demonstrate superior performance compared with state-of-the-art approaches.
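The frequency-domain step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the magnitude-based filtering rule, and the averaging fusion are all assumptions standing in for the paper's learned cross-modal spectrum transform.

```python
import numpy as np

def spectrum_filter_fuse(img_feat, txt_feat, keep_ratio=0.5):
    """Sketch of frequency-domain filtering and fusion of two modality
    feature vectors (hypothetical stand-in for TraffiCFUS's cross-modal
    spectrum transform).

    Each modality's features are moved to the frequency domain with a
    discrete Fourier transform, the weakest frequency components are
    zeroed out (a simple proxy for 'filtering out irrelevant
    information'), and the filtered signals are mapped back to the
    spatial domain and averaged.
    """
    def filter_one(x):
        spec = np.fft.fft(x)                       # spatial -> frequency domain
        k = max(1, int(len(x) * keep_ratio))       # number of components to keep
        weakest = np.argsort(np.abs(spec))[:-k]    # indices of discarded components
        spec[weakest] = 0.0
        return np.fft.ifft(spec).real              # frequency -> spatial domain

    # naive fusion of the two filtered modalities by averaging
    return 0.5 * (filter_one(np.asarray(img_feat, dtype=float))
                  + filter_one(np.asarray(txt_feat, dtype=float)))
```

In the paper, the filtering and the fusion are learned jointly with contrastive objectives; the fixed magnitude threshold and plain average here only illustrate the data flow.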

Original language: English
Title of host publication: IJCAI 2025
Subtitle of host publication: Proceedings of the 34th International Joint Conference on Artificial Intelligence
Editors: James Kwok
Place of Publication: Montreal
Publisher: International Joint Conferences on Artificial Intelligence Organization
Pages: 2694-2702
Number of pages: 9
ISBN (Electronic): 9781956792065
DOIs
Publication status: Published - 2025
Event: 34th International Joint Conference on Artificial Intelligence, IJCAI 2025 - Montreal, Canada
Duration: 16 Aug 2025 - 22 Aug 2025

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
ISSN (Print): 1045-0823

Conference

Conference: 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
Country/Territory: Canada
City: Montreal
Period: 16/08/25 - 22/08/25
