Rethink video retrieval representation for video captioning
Tian, Mingkai1; Li, Guorong1; Qi, Yuankai2; Wang, Shuhui3; Sheng, Quan Z.2; Huang, Qingming1
2024-12-01
Journal: PATTERN RECOGNITION
ISSN: 0031-3203
Volume: 156; Pages: 13
Abstract: Video captioning, a challenging task that aims to automatically generate accurate and comprehensive descriptions of video content, has recently seen substantial success driven by bridging video representations and textual semantics. Inspired by the nature of the video retrieval task, which learns visual features strongly related to text queries, we propose to take advantage of visual representation learning from the video retrieval framework to tackle video captioning tasks and to construct adequate multi-grained cross-modal matching while extracting visual features. However, a direct application of recent video retrieval models fails to capture sufficient temporal details, and the rich visual features of local patch tokens of video frames lack the semantic information essential for captioning tasks. These deficiencies arise primarily because these models lack fine-grained interactions between video frames and offer only weak textual supervision over frame patch tokens. To increase the attention on temporal details, we propose a learnable token shift module, which flexibly captures subtle movements in local regions across the temporal sequence. Furthermore, we devise a Refineformer, which learns to integrate local video patch tokens strongly related to desired captions via a cross-attention mechanism. Extensive experiments on MSVD, MSR-VTT and VATEX demonstrate the favorable performance of our method. Code will be available at https://github.com/tiesanguaixia/IVRC.
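The two mechanisms named in the abstract, shifting token channels along the time axis and cross-attending over patch tokens, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `fold` and `alpha` parameters, the zero padding at sequence boundaries, and the projection-free single-head attention are all assumptions chosen for illustration. In particular, `alpha` is a fixed number standing in for whatever the paper's module learns.

```python
import math

def temporal_token_shift(tokens, fold=1, alpha=1.0):
    """Shift part of each patch token's channels along the time axis.

    tokens[t][n] is a list of C floats (frame t, patch n).
    The first `fold` channels are taken from the previous frame and the
    next `fold` channels from the following frame; the rest are untouched.
    `alpha` blends shifted and original values (hypothetical stand-in for
    a learned gate). Out-of-range frames contribute zeros.
    """
    T = len(tokens)
    out = [[list(tok) for tok in frame] for frame in tokens]
    for t in range(T):
        for n, tok in enumerate(tokens[t]):
            for c in range(min(2 * fold, len(tok))):
                src = t - 1 if c < fold else t + 1
                shifted = tokens[src][n][c] if 0 <= src < T else 0.0
                out[t][n][c] = alpha * shifted + (1.0 - alpha) * tok[c]
    return out

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, no projections.

    Each query produces a weighted average of `values`, with weights
    given by the softmax of its scaled dot products with `keys`.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# Toy video: T=3 frames, N=1 patch per frame, C=3 channels.
video = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]], [[7.0, 8.0, 9.0]]]
shifted = temporal_token_shift(video, fold=1)

# Flatten the patch tokens and let one hypothetical query attend over them.
patches = [tok for frame in shifted for tok in frame]
refined = cross_attention([[0.1, 0.2, 0.3]], patches, patches)
```

In this sketch the shift lets each frame's tokens carry a few channels from neighbouring frames at no extra parameter cost, and the attention step returns one refined vector per query as a convex combination of the patch tokens.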
Keywords: Video captioning; Video-text retrieval; Token shift; Cross-attention
DOI: 10.1016/j.patcog.2024.110744
Indexed in: SCI
Language: English
Funding: National Natural Science Foundation of China[62272438] ; National Natural Science Foundation of China[62236008] ; National Natural Science Foundation of China[U21B2038] ; National Natural Science Foundation of China[61931008] ; Key Deployment Program of the Chinese Academy of Sciences, China[KGFZD145-23-18] ; Fundamental Research Funds for Central Universities, China[E2ET1104]
WOS research area: Computer Science ; Engineering
WOS category: Computer Science, Artificial Intelligence ; Engineering, Electrical & Electronic
WOS accession number: WOS:001361490600001
Publisher: ELSEVIER SCI LTD
Document type: Journal article
Item identifier: http://119.78.100.204/handle/2XEOYT63/41171
Collection: Institute of Computing Technology, Chinese Academy of Sciences, journal papers (English)
Corresponding author: Li, Guorong
Author affiliations:
1. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Key Lab Big Data Min & Knowledge Management, Beijing, Peoples R China
2. Macquarie Univ, Sch Comp, Sydney, NSW, Australia
3. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Recommended citation
GB/T 7714
Tian, Mingkai,Li, Guorong,Qi, Yuankai,et al. Rethink video retrieval representation for video captioning[J]. PATTERN RECOGNITION,2024,156:13.
APA Tian, Mingkai,Li, Guorong,Qi, Yuankai,Wang, Shuhui,Sheng, Quan Z.,&Huang, Qingming.(2024).Rethink video retrieval representation for video captioning.PATTERN RECOGNITION,156,13.
MLA Tian, Mingkai,et al."Rethink video retrieval representation for video captioning".PATTERN RECOGNITION 156(2024):13.