Rethink video retrieval representation for video captioning

doi:10.1016/j.patcog.2024.110744

	Tian, Mingkai 1; Li, Guorong 1; Qi, Yuankai 2; Wang, Shuhui 3; Sheng, Quan Z. 2; Huang, Qingming 1
	2024-12-01
Journal	PATTERN RECOGNITION
ISSN	0031-3203
Volume	156 Pages: 13
Abstract	Video captioning, a challenging task targeting the automatic generation of accurate and comprehensive descriptions based on video content, has witnessed substantial success recently, driven by bridging video representations and textual semantics. Inspired by the nature of the video retrieval task, which learns visual features strongly related to text queries, we propose to take advantage of visual representation learning from the video retrieval framework to tackle video captioning tasks and construct adequate multi-grained cross-modal matching while extracting visual features. However, a simple direct application of recent video retrieval models fails to capture sufficient temporal details, and the rich visual features of local patch tokens of video frames lack semantic information essential for captioning tasks. These deficiencies arise primarily because these models lack fine-grained interactions between video frames and offer only weak textual supervision over frame patch tokens. To increase attention to temporal details, we propose a learnable token shift module, which flexibly captures subtle movements in local regions across the temporal sequence. Furthermore, we devise a Refineformer, which learns to integrate local video patch tokens strongly related to desired captions via a cross-attention mechanism. Extensive experiments on MSVD, MSR-VTT and VATEX demonstrate the favorable performance of our method. Code will be available at https://github.com/tiesanguaixia/IVRC.
Keywords	Video captioning ; Video-text retrieval ; Token shift ; Cross-attention
DOI	10.1016/j.patcog.2024.110744
Indexed by	SCI
Language	English
WOS research area	Computer Science ; Engineering
WOS category	Computer Science, Artificial Intelligence ; Engineering, Electrical & Electronic
WOS accession number	WOS:001361490600001
Publisher	ELSEVIER SCI LTD
Citation statistics	Times cited: 7 [WOS]
Document type	Journal article
Item identifier	http://119.78.100.204/handle/2XEOYT63/41171
Collection	Institute of Computing Technology, Chinese Academy of Sciences: Journal Papers (English)
Corresponding author	Li, Guorong
Author affiliations	1. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Key Lab Big Data Min & Knowledge Management, Beijing, Peoples R China; 2. Macquarie Univ, Sch Comp, Sydney, NSW, Australia; 3. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Recommended citation (GB/T 7714)	Tian, Mingkai, Li, Guorong, Qi, Yuankai, et al. Rethink video retrieval representation for video captioning[J]. PATTERN RECOGNITION, 2024, 156: 13.
APA	Tian, Mingkai, Li, Guorong, Qi, Yuankai, Wang, Shuhui, Sheng, Quan Z., & Huang, Qingming. (2024). Rethink video retrieval representation for video captioning. PATTERN RECOGNITION, 156, 13.
MLA	Tian, Mingkai, et al. "Rethink video retrieval representation for video captioning". PATTERN RECOGNITION 156 (2024): 13.