Institute of Computing Technology, Chinese Academy IR
Rethink video retrieval representation for video captioning | |
Tian, Mingkai1; Li, Guorong1; Qi, Yuankai2; Wang, Shuhui3; Sheng, Quan Z.2; Huang, Qingming1 | |
2024-12-01 | |
Journal | PATTERN RECOGNITION |
ISSN | 0031-3203 |
Volume | 156 Pages: 13 |
Abstract | Video captioning, a challenging task targeting the automatic generation of accurate and comprehensive descriptions based on video content, has witnessed substantial success recently, driven by bridging video representations and textual semantics. Inspired by the nature of the video retrieval task, which learns visual features strongly related to text queries, we propose to take advantage of visual representation learning from the video retrieval framework to tackle video captioning tasks and construct adequate multi-grained cross-modal matching while extracting visual features. However, a simple direct application of recent video retrieval models fails to capture sufficient temporal details, and the rich visual features of local patch tokens of video frames lack semantic information essential for captioning tasks. These deficiencies arise primarily because these models lack fine-grained interactions between video frames and offer only weak textual supervision over frame patch tokens. To increase the attention on temporal details, we propose a learnable token shift module, which flexibly captures subtle movements in local regions across the temporal sequence. Furthermore, we devise a Refineformer, which learns to integrate local video patch tokens strongly related to desired captions via a cross-attention mechanism. Extensive experiments on MSVD, MSR-VTT and VATEX demonstrate the favorable performance of our method. Code will be available at https://github.com/tiesanguaixia/IVRC. |
Keywords | Video captioning ; Video-text retrieval ; Token shift ; Cross-attention |
DOI | 10.1016/j.patcog.2024.110744 |
Indexed By | SCI |
Language | English |
Funding Project | National Natural Science Foundation of China[62272438] ; National Natural Science Foundation of China[62236008] ; National Natural Science Foundation of China[U21B2038] ; National Natural Science Foundation of China[61931008] ; Key Deployment Program of the Chinese Academy of Sciences, China[KGFZD145-23-18] ; Fundamental Research Funds for Central Universities, China[E2ET1104] |
WOS Research Area | Computer Science ; Engineering |
WOS Subject | Computer Science, Artificial Intelligence ; Engineering, Electrical & Electronic |
WOS ID | WOS:001361490600001 |
Publisher | ELSEVIER SCI LTD |
Document Type | Journal article |
Identifier | http://119.78.100.204/handle/2XEOYT63/41171 |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English) |
Corresponding Author | Li, Guorong |
Affiliation | 1.Univ Chinese Acad Sci, Sch Comp Sci & Technol, Key Lab Big Data Min & Knowledge Management, Beijing, Peoples R China 2.Macquarie Univ, Sch Comp, Sydney, NSW, Australia 3.Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing, Peoples R China |
Recommended Citation GB/T 7714 | Tian, Mingkai,Li, Guorong,Qi, Yuankai,et al. Rethink video retrieval representation for video captioning[J]. PATTERN RECOGNITION,2024,156:13. |
APA | Tian, Mingkai,Li, Guorong,Qi, Yuankai,Wang, Shuhui,Sheng, Quan Z.,&Huang, Qingming.(2024).Rethink video retrieval representation for video captioning.PATTERN RECOGNITION,156,13. |
MLA | Tian, Mingkai,et al."Rethink video retrieval representation for video captioning".PATTERN RECOGNITION 156(2024):13. |
Files in This Item | There are no files associated with this item. |
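
The abstract names two mechanisms: a token shift across the temporal sequence and cross-attention over local patch tokens. The sketch below illustrates the general idea of each in plain Python and is not the authors' implementation. It assumes a fixed (non-learnable) channel shift in place of the paper's learnable module, and a single attention head with no learned projections. The names `temporal_shift`, `cross_attention`, and `fold` are hypothetical.

```python
import math


def temporal_shift(tokens, fold):
    """Fixed temporal shift (assumption: stands in for the learnable module).

    tokens[t][n] is the channel list of patch n in frame t.
    Channels [0, fold) are taken from the previous frame, channels
    [fold, 2*fold) from the next frame; the rest stay in place.
    Frames outside the clip contribute zeros.
    """
    num_frames = len(tokens)
    if 2 * fold > len(tokens[0][0]):
        raise ValueError("2 * fold must not exceed the channel count")
    shifted = []
    for t in range(num_frames):
        frame = []
        for n, token in enumerate(tokens[t]):
            new = list(token)
            for c in range(fold):
                new[c] = tokens[t - 1][n][c] if t > 0 else 0.0
            for c in range(fold, 2 * fold):
                new[c] = tokens[t + 1][n][c] if t < num_frames - 1 else 0.0
            frame.append(new)
        shifted.append(frame)
    return shifted


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, patches):
    """Single-head scaled dot-product cross-attention.

    Each query attends over all patch tokens and returns their
    attention-weighted average (no learned projections, by assumption).
    """
    dim = len(patches[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * p[c] for w, p in zip(weights, patches)) for c in range(dim)]
        )
    return outputs
```

In this sketch, shifting lets each patch token see a slice of its temporal neighbours before attention is applied, and the query vectors play the role of caption-side tokens selecting relevant patches.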
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.