Abstract
We propose a novel method for crowd video classification based on a two-stream convolutional architecture that combines a spatial network and a temporal network. The method addresses the key challenge of capturing complementary information: appearance from still frames and motion between frames. The motion input is a flow field computed from the video by dense optical flow. We train and evaluate the method on a benchmark crowd video dataset. The experimental results show that training with dense optical flow yields a significant improvement in performance and that the method outperforms the reference methods.
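
To illustrate how the appearance and motion streams can complement each other, the following minimal sketch combines per-class scores from the two networks by late fusion. It is an assumption made for illustration, since the abstract does not state how the streams are combined: the fusion rule (a weighted average of softmax probabilities), the function names `softmax` and `fuse_two_streams`, and the parameter `temporal_weight` are all hypothetical, and the input scores stand in for the outputs of trained spatial and temporal networks.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Late fusion (assumed): weighted average of per-stream probabilities.

    spatial_logits  -- class scores from the appearance (still-frame) stream
    temporal_logits -- class scores from the motion (optical-flow) stream
    temporal_weight -- hypothetical mixing weight in [0, 1] for the motion stream
    Returns the index of the predicted class and the fused probabilities.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    w = temporal_weight
    fused = [(1.0 - w) * s + w * t for s, t in zip(p_spatial, p_temporal)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return predicted, fused
```

With equal weights, a class that the motion stream favors strongly can override a weaker preference of the appearance stream, which is the sense in which the two sources of information are complementary. In practice the weight would be chosen on held-out data.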