Institute of Computing Technology, Chinese Academy of Sciences IR
CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning | |
Luo, Huaishao1; Ji, Lei2,3,4; Zhong, Ming5; Chen, Yang5; Lei, Wen5; Duan, Nan4; Li, Tianrui1 | |
2022-10-07 | |
Journal | NEUROCOMPUTING |
ISSN | 0925-2312 |
Volume | 508, Pages: 293-304 |
Abstract | Video clip retrieval and captioning tasks play an essential role in multimodal research and are the fundamental research problem for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of visual concepts learning from web collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies including 1) Whether image feature is enough for video-text retrieval and captioning? 2) How a post-pretraining on a large-scale video-text dataset based on the CLIP affect the performance? 3) What is the practical mechanism to model temporal dependency between video frames? And 4) The Hyper-parameters sensitivity of the model. Extensive experimental results present that the CLIP4Clip model transferred from the CLIP can achieve SOTA results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo for multimodal understanding and generation tasks. (c) 2022 Elsevier B.V. All rights reserved. |
Keywords | Video retrieval; Video captioning; CLIP |
DOI | 10.1016/j.neucom.2022.07.028 |
Indexed By | SCI |
Language | English |
Funding Project | National Science Foundation of China[62176221] ; National Science Foundation of China[61876158] ; National Science Foundation of China[61806170] |
WOS Research Area | Computer Science |
WOS Subject | Computer Science, Artificial Intelligence |
WOS ID | WOS:000848021200006 |
Publisher | ELSEVIER |
Document Type | Journal article |
Identifier | http://119.78.100.204/handle/2XEOYT63/19441 |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English) |
Corresponding Author | Luo, Huaishao; Ji, Lei |
Affiliation | 1.Southwest Jiaotong Univ, Chengdu, Peoples R China 2.Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China 3.Univ Chinese Acad Sci, Beijing, Peoples R China 4.Microsoft Res Asia, Beijing, Peoples R China 5.Microsoft STCA, Beijing, Peoples R China |
Recommended Citation GB/T 7714 | Luo, Huaishao,Ji, Lei,Zhong, Ming,et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning[J]. NEUROCOMPUTING,2022,508:293-304. |
APA | Luo, Huaishao.,Ji, Lei.,Zhong, Ming.,Chen, Yang.,Lei, Wen.,...&Li, Tianrui.(2022).CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning.NEUROCOMPUTING,508,293-304. |
MLA | Luo, Huaishao,et al."CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning".NEUROCOMPUTING 508(2022):293-304. |