Anticipating future actions is a key component of intelligence, especially in real-time systems such as robots and autonomous cars. While recent work has addressed the prediction of raw RGB pixel values in future video frames, we focus on predicting further into the future by predicting a summary of the moving pixels across a sequence of frames, which we call a dynamic image. More precisely, given a dynamic image, we predict the motion evolution through the next unseen video frames. Since this representation summarizes a sequence of frames, we can predict one second further into the future than previous work in this field. We use convolutional LSTMs to train our network on dynamic images in an unsupervised manner. Since our final goal is to predict the next action in a complex task such as an assembly task, we use labelled actions for the recognition step on top of the predicted dynamic images. We show the effectiveness of our method at predicting the next human action in this task through a two-step process: predicting the next dynamic image and then recognizing the action it represents.
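As an illustration of how a sequence of frames can be summarized into a single dynamic image, the sketch below collapses a clip with a weighted temporal sum. The abstract does not state how the dynamic images are computed, so the linear weights used here (a simplified form of approximate rank pooling) and the function name `dynamic_image` are assumptions for illustration only.

```python
import numpy as np


def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a clip of shape (T, H, W, C) into one image of shape (H, W, C).

    Assumption: frames are combined with linear weights alpha_t = 2t - T - 1
    (t = 1..T), so later frames count positively and earlier ones negatively.
    The paper may use a different weighting.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    t = np.arange(1, num_frames + 1)
    alpha = 2.0 * t - num_frames - 1.0  # weights sum to zero
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))


# Hypothetical usage: a random 8-frame clip of 64x64 RGB images.
clip = np.random.rand(8, 64, 64, 3)
summary = dynamic_image(clip)  # shape (64, 64, 3)
```

Because the weights sum to zero, pixels that do not change across the clip cancel out, and only the moving content remains in the summary.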
|Number of pages||9|
|Journal||arXiv.org e-Print archive|
|Publication status||E-pub ahead of print - 2017|