TY - JOUR
T1 - D3D
T2 - Dual 3-D convolutional network for real-time action recognition
AU - Jiang, Shengqin
AU - Qi, Yuankai
AU - Zhang, Haokui
AU - Bai, Zongwen
AU - Lu, Xiaobo
AU - Wang, Peng
PY - 2021/7
Y1 - 2021/7
N2 - Three-dimensional convolutional neural networks (3-D CNNs) have been explored to learn spatio-temporal information for video-based human action recognition. However, the expensive computational cost and memory demand of standard 3-D CNNs hinder their application in practical scenarios. In this article, we address these limitations by proposing a novel dual 3-D convolutional network (D3DNet) with two complementary lightweight branches. A coarse branch maintains a large temporal receptive field via a fast temporal downsampling strategy and simulates expensive 3-D convolutions using a combination of more efficient spatial and temporal convolutions. Meanwhile, a fine branch progressively downsamples the video in the temporal domain and adopts 3-D convolutional units with reduced channel capacities to capture multiresolution spatio-temporal information. Instead of learning the two branches independently, a shallow spatio-temporal downsampling module is shared between them for efficient low-level feature learning. In addition, lateral connections are learned to effectively fuse information from the two branches at multiple stages. The proposed network strikes a good balance between inference speed and action recognition performance. Using RGB information only, it achieves competitive performance on five popular video-based action recognition datasets, with an inference speed of 3200 FPS on a single NVIDIA RTX 2080Ti card.
AB - Three-dimensional convolutional neural networks (3-D CNNs) have been explored to learn spatio-temporal information for video-based human action recognition. However, the expensive computational cost and memory demand of standard 3-D CNNs hinder their application in practical scenarios. In this article, we address these limitations by proposing a novel dual 3-D convolutional network (D3DNet) with two complementary lightweight branches. A coarse branch maintains a large temporal receptive field via a fast temporal downsampling strategy and simulates expensive 3-D convolutions using a combination of more efficient spatial and temporal convolutions. Meanwhile, a fine branch progressively downsamples the video in the temporal domain and adopts 3-D convolutional units with reduced channel capacities to capture multiresolution spatio-temporal information. Instead of learning the two branches independently, a shallow spatio-temporal downsampling module is shared between them for efficient low-level feature learning. In addition, lateral connections are learned to effectively fuse information from the two branches at multiple stages. The proposed network strikes a good balance between inference speed and action recognition performance. Using RGB information only, it achieves competitive performance on five popular video-based action recognition datasets, with an inference speed of 3200 FPS on a single NVIDIA RTX 2080Ti card.
UR - http://www.scopus.com/inward/record.url?scp=85104185602&partnerID=8YFLogxK
U2 - 10.1109/TII.2020.3018487
DO - 10.1109/TII.2020.3018487
M3 - Article
AN - SCOPUS:85104185602
SN - 1551-3203
VL - 17
SP - 4584
EP - 4593
JO - IEEE Transactions on Industrial Informatics
JF - IEEE Transactions on Industrial Informatics
IS - 7
ER -