Abstract
Temporal action detection is an important task in computer vision that aims to identify and localize specific actions or events in a video. Extracting rich feature information and fusing it efficiently and accurately remains a central problem in temporal action detection. To address this problem, this paper proposes a Combining Local and Global Attention in Transformer (TransCLGA) model, which extracts rich feature information at each temporal scale and performs regulated multi-scale fusion. In the backbone network, a Hybrid Attention Module combines local multi-head attention with global multi-head attention, exploiting the Transformer's strength in processing global information while compensating for its weakness in local feature extraction. A Channel Convolutional Block is also introduced in the backbone to reduce the interfering information introduced by global-local information extraction and to enhance the feature representation. In the neck of the model, we design a gated feature fusion pyramid that integrates information across temporal scales by selectively retaining key information, which in turn supports accurate prediction by the action detection head. The model was evaluated on two datasets, THUMOS14 and ActivityNet1.3, with excellent results.
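
As an illustration of the gated fusion idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a per-channel sigmoid gate that blends a fine temporal scale with a nearest-neighbour upsampled coarse scale; the function names (`gated_fuse`, `upsample_nearest`) and the parameters (`w_fine`, `w_coarse`, `bias`) are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    """Logistic function used as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(seq, factor):
    """Repeat each time step `factor` times along the temporal axis."""
    return [list(step) for step in seq for _ in range(factor)]


def gated_fuse(fine, coarse, w_fine, w_coarse, bias):
    """Blend two temporal scales with a per-channel gate (assumed form).

    fine:   T x C nested list (higher temporal resolution)
    coarse: (T / factor) x C nested list (lower temporal resolution)
    w_fine, w_coarse, bias: length-C lists of hypothetical gate parameters
    """
    if len(fine) % len(coarse) != 0:
        raise ValueError("fine length must be a multiple of coarse length")
    factor = len(fine) // len(coarse)
    coarse_up = upsample_nearest(coarse, factor)

    fused = []
    for f_t, c_t in zip(fine, coarse_up):
        row = []
        for ch, (f, c) in enumerate(zip(f_t, c_t)):
            # Gate near 1 keeps the fine-scale value, near 0 the coarse one.
            g = sigmoid(w_fine[ch] * f + w_coarse[ch] * c + bias[ch])
            row.append(g * f + (1.0 - g) * c)
        fused.append(row)
    return fused


if __name__ == "__main__":
    fine = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    coarse = [[0.0, 0.0], [10.0, 10.0]]
    zeros = [0.0, 0.0]
    # With all gate parameters at zero the gate is 0.5 everywhere,
    # so the result is the plain average of the two scales.
    print(gated_fuse(fine, coarse, zeros, zeros, zeros))
```

In a trained model the gate parameters would be learned, so each channel and time step can favour whichever scale carries the more useful information; the fixed values here only demonstrate the mechanics.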

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2026 Bin Zhang, Yinfeng Fang, Xuguang Zhang, Yun Zhang
Academic society: China Instrument and Control Society
Publisher: China Instrument and Control Society