TY - GEN
T1 - Empowering multimodal road traffic profiling with Vision Language Models and frequency spectrum fusion
AU - Xiang, Haolong
AU - Xu, Xiaolong
AU - Wang, Guangdong
AU - Zhang, Xuyun
AU - Li, Xiaoyong
AU - Zhang, Qi
AU - Beheshti, Amin
AU - Fan, Wei
PY - 2025
Y1 - 2025
AB - With the rapid urbanization in the modern era, smart traffic profiling based on multimodal sources of data has been playing a significant role in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic profiling on the road level usually utilize single-modality data, i.e., they mainly focus on image processing with deep vision models or auxiliary analysis on the textual data. However, the joint modeling and multimodal fusion of the textual and visual modalities have been rarely studied in road traffic profiling, which largely hinders the accurate prediction or classification of traffic conditions. To address this issue, we propose a novel multimodal learning and fusion framework for road traffic profiling, named TraffiCFUS. Specifically, given the traffic images, our TraffiCFUS framework first introduces Vision Language Models (VLMs) to generate text and then creates tailored prompt instructions for refining this text according to the specific scene requirements of road traffic profiling. Next, we apply the discrete Fourier transform to convert multimodal data from the spatial domain to the frequency domain and perform a cross-modal spectrum transform to filter out irrelevant information for traffic profiling. Furthermore, the processed spatial multimodal data is combined to generate fusion loss and interaction loss with contrastive learning. Finally, extensive experiments on four real-world datasets illustrate superior performance compared with the state-of-the-art approaches.
UR - https://www.scopus.com/pages/publications/105021808769
U2 - 10.24963/ijcai.2025/300
DO - 10.24963/ijcai.2025/300
M3 - Conference proceeding contribution
AN - SCOPUS:105021808769
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 2694
EP - 2702
BT - IJCAI 2025
A2 - Kwok, James
PB - International Joint Conferences on Artificial Intelligence Organization
CY - Montreal
T2 - 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
Y2 - 16 August 2025 through 22 August 2025
ER -