In this paper, we propose a deep learning architecture that integrates shallow features into deep features to strengthen the network's ability to express motion information at a deep level. The architecture incorporates attention mechanisms that direct the network toward the most salient deep features while integrating the sequence along both the channel and temporal dimensions.

To emphasize the motion of the people in the video while preserving their basic contours, we extract optical flow in both the horizontal and vertical directions from the video sequence. By concatenating the grayscale frame difference with these two flow fields, we generate a new three-channel input that effectively captures and represents the dynamic content of the video.

To increase the weight of motion information in the network, we design a downsampling module that extracts shallow features, which are then fused with the deep features extracted by MobileNet's Blocks. Next, a channel attention module updates the channel weights of each frame in the video sequence, and a ConvLSTM module strengthens the temporal correlation of the sequence. Together, these two modules redistribute the network's attention: the channel attention module focuses on channel-level information, while the ConvLSTM module emphasizes the temporal dimension. Finally, we apply 3D convolution and global pooling to compress the feature maps, which are then fed into fully connected layers to perform violence detection.

Experiments on three publicly available benchmark datasets show that the proposed model achieves strong recognition accuracy.
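As a concrete illustration of the input construction, the following is a minimal sketch using OpenCV. The Farneback dense-flow estimator and its parameter values are assumptions made for illustration; the text above does not name a specific flow algorithm.

```python
import cv2
import numpy as np

def motion_input(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Build the three-channel motion representation for one frame pair:
    channels 0/1 = horizontal/vertical optical flow, channel 2 = grayscale
    frame difference. Farneback flow is an assumption, not the paper's stated choice."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    diff = cv2.absdiff(curr_gray, prev_gray).astype(np.float32)
    # Stack u (horizontal), v (vertical), and the frame difference: H x W x 3.
    return np.dstack([flow[..., 0], flow[..., 1], diff])
```

Applying this to consecutive grayscale frame pairs yields a motion clip that replaces the raw RGB frames as the network's input.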
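The overall forward pass could then be organized as in the PyTorch sketch below. Several choices here are assumptions for illustration only: torchvision's MobileNetV2 `features` stands in for MobileNet's Blocks, the fusion is element-wise addition, the channel attention is a squeeze-and-excitation style block, a single ConvLSTM cell models the temporal dimension, and the channel widths and shallow-branch layout are likewise illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze: global average pool
        return x * w[:, :, None, None]          # excite: per-channel weights

class ConvLSTMCell(nn.Module):
    """Minimal single-layer ConvLSTM cell."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ViolenceNet(nn.Module):
    def __init__(self, num_classes: int = 2, hid_ch: int = 256):
        super().__init__()
        # Shallow branch: strided convolutions downsample the motion input to
        # the same (1280, 7, 7) shape MobileNetV2 produces for 224x224 frames.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1280, 3, stride=8, padding=1))
        self.deep = mobilenet_v2(weights=None).features  # stands in for MobileNet's Blocks
        self.attn = ChannelAttention(1280)
        self.convlstm = ConvLSTMCell(1280, hid_ch)
        self.head = nn.Sequential(
            nn.Conv3d(hid_ch, 128, kernel_size=3, padding=1),  # 3D convolution
            nn.AdaptiveAvgPool3d(1))                           # global pooling
        self.fc = nn.Linear(128, num_classes)                  # fully connected classifier

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        h = c = None
        hidden_states = []
        for t in range(T):
            x = clip[:, t]
            feat = self.deep(x) + self.shallow(x)   # fuse shallow and deep features
            feat = self.attn(feat)                  # re-weight channels per frame
            if h is None:
                h = feat.new_zeros(B, self.convlstm.hid_ch, *feat.shape[2:])
                c = h.clone()
            h, c = self.convlstm(feat, (h, c))      # propagate temporal correlation
            hidden_states.append(h)
        seq = torch.stack(hidden_states, dim=2)     # (B, hid_ch, T, H', W')
        z = self.head(seq).flatten(1)               # compress feature size
        return self.fc(z)                           # violence / non-violence logits
```

Under these assumptions, a clip of shape `(2, 16, 3, 224, 224)` yields `(2, 2)` logits, e.g. `ViolenceNet()(torch.randn(2, 16, 3, 224, 224))`.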