This paper presents a new approach for the temporal detection of short human activities in untrimmed videos. To the best of our knowledge, most existing methods for temporal action detection are trained on public action datasets that feature actions lasting up to tens or hundreds of seconds. However, in manufacturing, transportation, and other safety-critical settings, it is often desirable to automatically detect, classify, and monitor fine-grained actions. We propose a new Dilated Convolutional Temporal Prediction Network that uses 1-D dilated convolutions in a Residual network (ResNet)-like architecture to generate action proposals on the order of fractions of a second. The new architecture is used as part of an action monitoring pipeline in subway cars. Experiments demonstrate that the proposed model outperforms the state of the art on the task of temporal action proposal generation on a real-world video dataset, while achieving a processing speed fast enough for online monitoring.
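To illustrate the core building block named above, the following is a minimal pure-Python sketch of a 1-D dilated convolution wrapped in a ResNet-style skip connection. It is not the authors' implementation: the function names, the kernel size of 3, the doubling dilation rates, and the ReLU activation are all assumptions chosen for illustration, since the abstract does not specify them.

```python
def dilated_conv1d(x, kernel, dilation):
    """Same-length 1-D dilated convolution with zero padding.

    Kernel taps are spaced `dilation` frames apart, so a small kernel
    covers a wider temporal window without adding parameters.
    """
    span = dilation * (len(kernel) - 1)  # frames between first and last tap
    left = span // 2                     # centre the kernel on frame t
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - left + j * dilation
            if 0 <= idx < len(x):        # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x, kernel, dilation):
    """Hypothetical ResNet-style block: x + ReLU(dilated_conv(x))."""
    y = dilated_conv1d(x, kernel, dilation)
    return [xi + max(0.0, yi) for xi, yi in zip(x, y)]


def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame after stacking the blocks."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


# Assumed configuration: three stacked blocks with dilations 1, 2, 4.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
print(residual_block(signal, [1.0, 1.0, 1.0], dilation=2))
print(receptive_field(3, [1, 2, 4]))
```

Under these assumptions, the receptive field grows with the sum of the dilation rates while each layer keeps the same number of weights, which is one plausible reason a dilated design could cover short actions at fine temporal resolution while staying fast.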