Spatial-temporal interleaved network for efficient action recognition

Shengqin Jiang, Haokui Zhang*, Yuankai Qi, Qingshan Liu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Decomposing 3D convolutions considerably reduces the computational complexity of 3D convolutional neural networks, yet simply stacking the decomposed layers restricts network performance. To this end, we propose a spatial-temporal interleaved network for efficient action recognition, revisiting the structure of 3D neural networks for this task from the following perspectives. First, to enhance the learning of robust spatial-temporal features, we propose an interleaved feature interaction module that comprehensively explores cross-layer features and captures the most discriminative information among them. Second, to keep the network lightweight, we introduce a boosted parallel pseudo-3D module that avoids a substantial number of computations at the lower and middle levels while enhancing temporal and spatial features in parallel at the high levels. Furthermore, we exploit a spatial-temporal differential attention mechanism that suppresses redundant features across different dimensions at a nearly negligible parameter cost. Finally, extensive experiments on four action recognition benchmarks demonstrate the accuracy and efficiency of the proposed method. In particular, our method attains a 15.2% improvement in Top-1 accuracy over our baseline, a stack of full 3D convolutional layers, on the Something-Something V1 dataset while using only 18.2% of its parameters.
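To illustrate why the decomposition reduces complexity, the sketch below compares the parameter count of a full 3×3×3 3D convolution with a pseudo-3D factorization into a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution. The channel widths (64→64) are hypothetical and chosen only for illustration; this is a generic (2+1)D-style counting exercise, not the paper's exact module.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Parameter count of a 3D convolution kernel (bias omitted)."""
    return c_in * c_out * kt * kh * kw

# Hypothetical channel widths for illustration only.
c_in, c_out = 64, 64

# Full 3x3x3 3D convolution.
full = conv3d_params(c_in, c_out, 3, 3, 3)

# Pseudo-3D factorization: 1x3x3 spatial conv + 3x1x1 temporal conv.
spatial = conv3d_params(c_in, c_out, 1, 3, 3)
temporal = conv3d_params(c_out, c_out, 3, 1, 1)
factored = spatial + temporal

print(full, factored, factored / full)
# → 110592 49152 0.4444444444444444
```

Here the factorized pair uses roughly 44% of the full kernel's parameters; savings grow further with larger kernels or when the intermediate channel width is reduced, which is why simply stacking such layers is cheap but, as the abstract notes, limits accuracy without additional feature interaction.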

Original language: English
Pages (from-to): 178-187
Number of pages: 10
Journal: IEEE Transactions on Industrial Informatics
Volume: 21
Issue number: 1
Early online date: 26 Sept 2024
DOIs
Publication status: Published - Jan 2025
