Abstract
Automated detection of unusual activities in surveillance videos remains a significant challenge due to the massive volume of recorded footage and the rarity of anomalous events. Here we report a novel deep learning framework that addresses this problem by integrating three core components: three-dimensional Convolutional Neural Networks (3D-CNNs) for extracting spatiotemporal features, Long Short-Term Memory (LSTM) networks for capturing sequential dependencies, and an attention mechanism for emphasizing the most salient regions of the video. The study aimed to design a robust model that classifies surveillance video clips as “usual” or “unusual” with high accuracy while handling class imbalance and environmental variation. The proposed model was trained and evaluated on three large-scale benchmark datasets, UCF-Crime, XD-Violence, and CCTV-Fights, which represent real-world anomalies under diverse conditions. Experimental results show that the framework achieved an overall accuracy of 97.41% on UCF-Crime, 98.11% on XD-Violence, and 98.50% on CCTV-Fights, alongside consistently high precision, recall, and F1-scores. These findings indicate that combining spatiotemporal modelling with attention-driven context aggregation substantially improves anomaly detection performance over existing baselines, offering a more scalable and effective approach for advancing current surveillance systems.
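The abstract describes a 3D-CNN → LSTM → attention pipeline for binary clip classification. The following is a minimal PyTorch sketch of that architecture, not the paper's implementation: all layer sizes, kernel shapes, the additive attention form, and the hidden dimension are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class AnomalyDetector(nn.Module):
    """Illustrative 3D-CNN + LSTM + attention pipeline for binary
    video classification (usual vs. unusual). All layer sizes are
    assumptions for this sketch, not the paper's configuration."""

    def __init__(self, hidden_dim=256):
        super().__init__()
        # 3D-CNN backbone: extracts spatiotemporal features from clips
        # shaped (batch, channels, frames, height, width).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time
        )
        # LSTM over the per-frame feature sequence captures
        # sequential dependencies across frames.
        self.lstm = nn.LSTM(64, hidden_dim, batch_first=True)
        # Simple additive attention: one scalar weight per time step.
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, 2)  # usual / unusual

    def forward(self, clips):
        feats = self.backbone(clips)                # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)       # (B, 64, T)
        feats = feats.transpose(1, 2)               # (B, T, 64)
        hidden, _ = self.lstm(feats)                # (B, T, H)
        weights = torch.softmax(self.attn(hidden), dim=1)  # (B, T, 1)
        context = (weights * hidden).sum(dim=1)     # attention pooling
        return self.classifier(context)             # (B, 2) logits

# Example: a batch of 2 clips, each 16 RGB frames of 112x112 pixels.
model = AnomalyDetector()
logits = model(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 2])
```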
Keywords
3D Convolutional Neural Networks
Fight Detection
Surveillance Videos
Video Anomaly Detection
Violence Detection