Institutional Repository (IR), Institute of Computing Technology, Chinese Academy of Sciences
| Title | CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning |
| Authors | Luo, Huaishao1; Ji, Lei2,3,4; Zhong, Ming5; Chen, Yang5; Lei, Wen5; Duan, Nan4; Li, Tianrui1 |
| Date Issued | 2022-10-07 |
| Journal | NEUROCOMPUTING |
| ISSN | 0925-2312 |
| Volume | 508 |
| Pages | 293-304 |
| Abstract | Video clip retrieval and captioning play an essential role in multimodal research and are fundamental problems for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the image-text pre-trained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies: 1) Are image features enough for video-text retrieval and captioning? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of a CLIP-based model? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to its hyper-parameters? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for multimodal understanding and generation tasks. (c) 2022 Elsevier B.V. All rights reserved. |
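The abstract's question of whether image features alone suffice corresponds to the simplest retrieval variant the paper studies: averaging per-frame CLIP image embeddings into a single video embedding and comparing it to the text embedding. A minimal sketch of that parameter-free mean-pooling similarity, using random arrays as stand-ins for CLIP encoder outputs (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def video_text_similarity(frame_embs, text_emb):
    """Parameter-free 'mean pooling' similarity: average the per-frame
    image embeddings into one video embedding, then compare it to the
    text embedding with cosine similarity."""
    video_emb = np.mean(frame_embs, axis=0)
    return cosine_sim(video_emb, text_emb)

# Stand-in embeddings; in practice these would come from CLIP's
# image and text encoders.
rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 512))   # 12 sampled frames, 512-dim features
caption = rng.normal(size=512)        # one caption embedding
score = video_text_similarity(frames, caption)
print(round(score, 4))
```

The paper's other similarity calculators (sequence-type and tight-type) add learnable temporal modules on top of the frame embeddings; mean pooling is the baseline against which those are compared.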
| Keywords | Video retrieval; Video captioning; CLIP |
| DOI | 10.1016/j.neucom.2022.07.028 |
| Indexed By | SCI |
| Language | English |
| Funding Project | National Science Foundation of China [62176221]; National Science Foundation of China [61876158]; National Science Foundation of China [61806170] |
| WOS Research Area | Computer Science |
| WOS Subject | Computer Science, Artificial Intelligence |
| WOS Record | WOS:000848021200006 |
| Publisher | ELSEVIER |
| Document Type | Journal article |
| Identifier | http://119.78.100.204/handle/2XEOYT63/19441 |
| Collection | Institute of Computing Technology, CAS: Journal Papers (English) |
| Corresponding Authors | Luo, Huaishao; Ji, Lei |
| Affiliations | 1. Southwest Jiaotong Univ, Chengdu, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China; 3. Univ Chinese Acad Sci, Beijing, Peoples R China; 4. Microsoft Res Asia, Beijing, Peoples R China; 5. Microsoft STCA, Beijing, Peoples R China |
| Recommended Citation (GB/T 7714) | Luo, Huaishao, Ji, Lei, Zhong, Ming, et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning[J]. NEUROCOMPUTING, 2022, 508: 293-304. |
| APA | Luo, Huaishao, Ji, Lei, Zhong, Ming, Chen, Yang, Lei, Wen, ... & Li, Tianrui. (2022). CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. NEUROCOMPUTING, 508, 293-304. |
| MLA | Luo, Huaishao, et al. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning". NEUROCOMPUTING 508 (2022): 293-304. |
| Files in This Item | No files associated with this item. |