TransCLGA: Combining Local and Global Attention in Transformer for Temporal Action Detection

Bin Zhang; Yinfeng Fang; Xuguang Zhang; Yun Zhang

doi:10.15878/j.instr.202600298

Vol. 13 No. 1 (2026), Articles

Published 2026-05-06

Bin Zhang

https://orcid.org/0009-0009-1824-7435

Yinfeng Fang

Xuguang Zhang

College of Media Engineering, Communication University of Zhejiang

https://orcid.org/0000-0001-8684-5802

Yun Zhang

Keywords

Temporal action detection
Vision Transformer
Gated mechanism
Self-attention
Action recognition

How to Cite

Zhang, B., Fang, Y., Zhang, X., & Zhang, Y. (2026). TransCLGA: Combining Local and Global Attention in Transformer for Temporal Action Detection. Instrumentation, 13(1). https://doi.org/10.15878/j.instr.202600298

Funding data

National Natural Science Foundation of China
Grant number 61771418

Abstract

Temporal action detection is an important task in computer vision that aims to identify and localize specific actions or events in a video. Extracting feature information fully and fusing it efficiently and accurately remain central problems in temporal action detection. To address them, this paper proposes TransCLGA (Combining Local and Global Attention in Transformer), a model that extracts rich feature information at each temporal scale and performs regulated multi-scale fusion. In the backbone, a Hybrid Attention Module combines local multi-head attention with global multi-head attention, exploiting the Transformer's strength in processing global information while compensating for its weakness in local feature extraction. A Channel Convolutional Block is also introduced in the backbone to reduce the interference introduced by global-local information extraction and to enhance the feature representation. In the neck, a gated feature fusion pyramid integrates information from different temporal scales by selectively retaining key information, which enables accurate prediction by the action detection head. The model was evaluated on two datasets, THUMOS14 and ActivityNet1.3, with excellent results.
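As a rough illustration of the two ideas named in the abstract, the Python sketch below contrasts windowed (local) with full (global) self-attention and blends two temporal scales with a sigmoid gate. It is not the authors' implementation. It assumes a single head with no learned projections, averages the local and global branches, and builds a toy two-level pyramid by average pooling. The names `attention`, `hybrid_attention`, `gated_fuse`, the `window` parameter, and the scalar gate weights `w_a`, `w_b`, `bias` are all hypothetical and chosen only for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, window=None):
    """Single-head dot-product self-attention over T feature vectors.

    window=None lets every time step attend to all others (global);
    an integer restricts step i to steps j with |i - j| <= window (local).
    """
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        if window is None:
            keys = list(range(T))
        else:
            keys = list(range(max(0, i - window), min(T, i + window + 1)))
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in keys]
        weights = softmax(scores)
        out.append([sum(w * x[j][k] for w, j in zip(weights, keys)) for k in range(d)])
    return out

def hybrid_attention(x, window=1):
    """Assumption: the local and global branches are simply averaged."""
    local = attention(x, window)
    global_ = attention(x)
    return [[(l + g) / 2 for l, g in zip(lv, gv)] for lv, gv in zip(local, global_)]

def gated_fuse(a, b, w_a=1.0, w_b=1.0, bias=0.0):
    """Element-wise gate g = sigmoid(w_a*a + w_b*b + bias); output g*a + (1-g)*b.

    The scalar gate weights are hypothetical stand-ins for learned parameters.
    """
    fused = []
    for av, bv in zip(a, b):
        row = []
        for ai, bi in zip(av, bv):
            g = 1.0 / (1.0 + math.exp(-(w_a * ai + w_b * bi + bias)))
            row.append(g * ai + (1.0 - g) * bi)
        fused.append(row)
    return fused

# Toy input: 4 time steps, 2 channels.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fine = hybrid_attention(features, window=1)

# Toy two-level pyramid: average adjacent steps, then repeat to restore length.
coarse = [[(p + q) / 2 for p, q in zip(fine[i], fine[i + 1])] for i in range(0, len(fine), 2)]
upsampled = [v for v in coarse for _ in range(2)]
fused = gated_fuse(fine, upsampled)
```

Because the gate lies between 0 and 1, each fused value is a convex combination of the fine-scale and coarse-scale values, which is one simple way to read "selectively retaining key information". A real model would learn the gate and use multi-head attention with learned projections.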
