Journal of System Simulation
Abstract
Most current dense video description models adopt two-stage methods, which are inefficient, ignore audio and semantic information, and produce incomplete descriptions. To address these problems, a dense video description method that fuses multi-modal and semantic information was proposed. An adaptive R(2+1)D network was proposed to extract visual features, a semantic detector was designed to generate semantic information, and audio features were added as a complement; a multi-scale deformable attention module was established, and a parallel prediction head was applied to accelerate convergence and improve the accuracy of the model. The experimental results show that the model performs well on the two benchmark datasets, with the BLEU4 metric reaching 2.17.
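The abstract only names the architectural components; the following is a minimal sketch, assuming a PyTorch-style implementation, of how visual, audio, and semantic features might be projected into a shared space, encoded, and decoded by parallel prediction heads. All module names, dimensions, and fusion choices here are illustrative assumptions and not the authors' implementation; in particular, a standard transformer encoder stands in for the paper's multi-scale deformable attention module.

```python
import torch
import torch.nn as nn

class MultiModalFusionEncoder(nn.Module):
    """Illustrative sketch: fuse visual, audio, and semantic features and
    predict event locations and caption logits with parallel heads.
    Dimensions and module choices are assumptions, not the paper's design."""

    def __init__(self, d_visual=512, d_audio=128, d_semantic=300,
                 d_model=256, n_heads=8, n_layers=2,
                 n_events=10, vocab_size=10000):
        super().__init__()
        # Project each modality into the shared model dimension.
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.semantic_proj = nn.Linear(d_semantic, d_model)

        # Placeholder for the multi-scale deformable attention module:
        # a plain transformer encoder over the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

        # Parallel prediction heads: one event query per candidate event,
        # decoded simultaneously rather than in a two-stage pipeline.
        self.event_queries = nn.Parameter(torch.randn(n_events, d_model))
        self.loc_head = nn.Linear(d_model, 2)               # (start, end) per event
        self.caption_head = nn.Linear(d_model, vocab_size)  # token logits per event

    def forward(self, visual, audio, semantic):
        # visual: (B, Tv, d_visual), audio: (B, Ta, d_audio), semantic: (B, Ts, d_semantic)
        tokens = torch.cat([
            self.visual_proj(visual),
            self.audio_proj(audio),
            self.semantic_proj(semantic),
        ], dim=1)                        # (B, Tv+Ta+Ts, d_model)
        memory = self.encoder(tokens)

        # Simple mean-pooled conditioning of the event queries on the memory;
        # the paper instead attends over multi-scale features deformably.
        queries = self.event_queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        pooled = queries + memory.mean(dim=1, keepdim=True)

        return self.loc_head(pooled), self.caption_head(pooled)


if __name__ == "__main__":
    model = MultiModalFusionEncoder()
    loc, cap = model(torch.randn(2, 64, 512),   # visual tokens
                     torch.randn(2, 32, 128),   # audio tokens
                     torch.randn(2, 8, 300))    # semantic tokens
    print(loc.shape, cap.shape)  # (2, 10, 2), (2, 10, 10000)
```

The parallel heads illustrate why a single-stage design can be faster than the two-stage proposal-then-caption pipelines criticized in the abstract: all candidate events are localized and described in one forward pass.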
Recommended Citation
Li, Xiang and Sang, Haifeng (2024) "Dense Video Description Method Based on Multi-modal Fusion in Transformer Network," Journal of System Simulation: Vol. 36: Iss. 5, Article 2.
DOI: 10.16182/j.issn1004731x.joss.23-0017
Available at: https://dc-china-simulation.researchcommons.org/journal/vol36/iss5/2
First Page
1061
Last Page
1071
CLC
TP391
Included in
Artificial Intelligence and Robotics Commons, Computer Engineering Commons, Numerical Analysis and Scientific Computing Commons, Operations Research, Systems Engineering and Industrial Engineering Commons, Systems Science Commons