TY - JOUR
T1 - Spatial-temporal interleaved network for efficient action recognition
AU - Jiang, Shengqin
AU - Zhang, Haokui
AU - Qi, Yuankai
AU - Liu, Qingshan
PY - 2025/1
Y1 - 2025/1
N2 - Decomposing 3D convolutions considerably reduces the computational complexity of 3D convolutional neural networks, yet simply stacking the decomposed convolutions limits network performance. To this end, we propose a spatial-temporal interleaved network for efficient action recognition, which revisits the structure of 3D neural networks for action recognition from the following perspectives. To enhance the learning of robust spatial-temporal features, we first propose an interleaved feature interaction module that comprehensively explores cross-layer features and captures the most discriminative information among them. To keep the network lightweight, a boosted parallel pseudo-3D module is introduced to avoid a substantial amount of computation at the lower and middle levels while enhancing temporal and spatial features in parallel at the high levels. Furthermore, we exploit a spatial-temporal differential attention mechanism that suppresses redundant features along different dimensions while adding a nearly negligible number of parameters. Finally, extensive experiments on four action recognition benchmarks demonstrate the advantages and efficiency of the proposed method. In particular, our method attains a 15.2% improvement in Top-1 accuracy over our baseline, a stack of full 3D convolutional layers, on the Something-Something V1 dataset while using only 18.2% of its parameters.
AB - Decomposing 3D convolutions considerably reduces the computational complexity of 3D convolutional neural networks, yet simply stacking the decomposed convolutions limits network performance. To this end, we propose a spatial-temporal interleaved network for efficient action recognition, which revisits the structure of 3D neural networks for action recognition from the following perspectives. To enhance the learning of robust spatial-temporal features, we first propose an interleaved feature interaction module that comprehensively explores cross-layer features and captures the most discriminative information among them. To keep the network lightweight, a boosted parallel pseudo-3D module is introduced to avoid a substantial amount of computation at the lower and middle levels while enhancing temporal and spatial features in parallel at the high levels. Furthermore, we exploit a spatial-temporal differential attention mechanism that suppresses redundant features along different dimensions while adding a nearly negligible number of parameters. Finally, extensive experiments on four action recognition benchmarks demonstrate the advantages and efficiency of the proposed method. In particular, our method attains a 15.2% improvement in Top-1 accuracy over our baseline, a stack of full 3D convolutional layers, on the Something-Something V1 dataset while using only 18.2% of its parameters.
UR - http://www.scopus.com/inward/record.url?scp=85205440103&partnerID=8YFLogxK
U2 - 10.1109/TII.2024.3450021
DO - 10.1109/TII.2024.3450021
M3 - Article
AN - SCOPUS:85205440103
SN - 1551-3203
VL - 21
SP - 178
EP - 187
JO - IEEE Transactions on Industrial Informatics
JF - IEEE Transactions on Industrial Informatics
IS - 1
ER -