Institutional Repository (IR), Institute of Computing Technology, Chinese Academy of Sciences
| Title | CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning |
| Authors | Luo, Huaishao1; Ji, Lei2,3,4; Zhong, Ming5; Chen, Yang5; Lei, Wen5; Duan, Nan4; Li, Tianrui1 |
| Date Issued | 2022-10-07 |
| Journal | NEUROCOMPUTING |
| ISSN | 0925-2312 |
| Volume | 508 |
| Pages | 293-304 |
| Abstract | Video clip retrieval and captioning play an essential role in multimodal research and are fundamental problems for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the image-text pre-trained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies: 1) Are image features enough for video-text retrieval and captioning? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of a CLIP-based model? 3) What is a practical mechanism for modeling temporal dependency between video frames? 4) How sensitive is the model to its hyper-parameters? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for multimodal understanding and generation tasks. (c) 2022 Elsevier B.V. All rights reserved. |
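The abstract's question of whether image features alone suffice corresponds to the simplest retrieval variant the paper studies: averaging per-frame CLIP image embeddings into a single video embedding and comparing it to the text embedding. A minimal sketch of that parameter-free mean-pooling similarity, using random arrays as stand-ins for CLIP encoder outputs (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def video_text_similarity(frame_embs, text_emb):
    """Parameter-free 'mean pooling' similarity: average the per-frame
    image embeddings into one video embedding, then compare it to the
    text embedding with cosine similarity."""
    video_emb = np.mean(frame_embs, axis=0)
    return cosine_sim(video_emb, text_emb)

# Stand-in embeddings; in practice these would come from CLIP's
# image and text encoders.
rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 512))   # 12 sampled frames, 512-dim features
caption = rng.normal(size=512)        # one caption embedding
score = video_text_similarity(frames, caption)
print(round(score, 4))
```

The paper's other similarity calculators (sequence-type and tight-type) add learnable temporal modules on top of the frame embeddings; mean pooling is the baseline against which those are compared.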
| Keywords | Video retrieval; Video captioning; CLIP |
| DOI | 10.1016/j.neucom.2022.07.028 |
| Indexed By | SCI |
| Language | English |
| Funding Project | National Science Foundation of China [62176221]; National Science Foundation of China [61876158]; National Science Foundation of China [61806170] |
| WOS Research Area | Computer Science |
| WOS Subject | Computer Science, Artificial Intelligence |
| WOS Record | WOS:000848021200006 |
| Publisher | ELSEVIER |
| Document Type | Journal article |
| Identifier | http://119.78.100.204/handle/2XEOYT63/19441 |
| Collection | Institute of Computing Technology, CAS: Journal Papers (English) |
| Corresponding Authors | Luo, Huaishao; Ji, Lei |
| Affiliations | 1. Southwest Jiaotong Univ, Chengdu, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China; 3. Univ Chinese Acad Sci, Beijing, Peoples R China; 4. Microsoft Res Asia, Beijing, Peoples R China; 5. Microsoft STCA, Beijing, Peoples R China |
| Recommended Citation (GB/T 7714) | Luo, Huaishao, Ji, Lei, Zhong, Ming, et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning[J]. NEUROCOMPUTING, 2022, 508: 293-304. |
| APA | Luo, Huaishao, Ji, Lei, Zhong, Ming, Chen, Yang, Lei, Wen, ... & Li, Tianrui. (2022). CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. NEUROCOMPUTING, 508, 293-304. |
| MLA | Luo, Huaishao, et al. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning". NEUROCOMPUTING 508 (2022): 293-304. |
| Files in This Item | No files associated with this item. |