Focus and Align: Learning Tube Tokens for Video-Language Pre-Training
Zhu, Yongqing [1,2]; Li, Xiangyang [1,2]; Zheng, Mao [3]; Yang, Jiahao [1,2]; Wang, Zihan [1,2]; Guo, Xiaoqian [1,2]; Chai, Zifeng [3]; Yuan, Yuchen [3]; Jiang, Shuqiang [1,2]
2023
Journal: IEEE TRANSACTIONS ON MULTIMEDIA
ISSN: 1520-9210
Volume: 25; Pages: 8036-8050
Abstract: Video-language pre-training (VLP) has attracted increasing attention for cross-modality understanding tasks. To enhance visual representations, recent works attempt to adopt transformer-based architectures as video encoders. These works usually focus on the visual representations of the sampled frames. Compared with frame representations, frame patches incorporate more fine-grained spatio-temporal information, which could lead to a better understanding of video contents. However, how to exploit the spatio-temporal information within frame patches for VLP has been less investigated. In this work, we propose a method to learn tube tokens to model the key spatio-temporal information from frame patches. To this end, multiple semantic centers are introduced to focus on the underlying patterns of frame patches. Based on each semantic center, the spatio-temporal information within frame patches is integrated into a unique tube token. Complementary to frame representations, tube tokens provide detailed clues of video contents. Furthermore, to better align the generated tube tokens and the contents of descriptions, a local alignment mechanism is introduced. The experiments based on a variety of downstream tasks demonstrate the effectiveness of the proposed method.
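As a rough illustration of the idea summarized in the abstract, the sketch below pools frame patches into one token per semantic center and then scores how well those tokens match word features. It is not the authors' implementation: the function names `tube_tokens` and `local_alignment`, the scaled dot-product softmax weighting, and the max-cosine matching rule are all assumptions chosen for simplicity, and the features are random placeholders.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def tube_tokens(patches, centers):
    """Pool patches into one token per center (assumed attention-style pooling).

    patches: N patch vectors gathered from all sampled frames.
    centers: K hypothetical semantic-center vectors (learnable in a real model).
    Returns K vectors, each a softmax-weighted sum of all patches.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for c in centers:
        weights = softmax([dot(c, p) * scale for p in patches])
        tokens.append(
            [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        )
    return tokens


def local_alignment(tokens, words):
    """Assumed local score: mean over tokens of the best cosine match to a word."""
    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)
    return sum(max(cosine(t, w) for w in words) for t in tokens) / len(tokens)


# Toy usage with random placeholder features (4 frames x 9 patches, 3 centers).
random.seed(0)
DIM = 8
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4 * 9)]
centers = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
words = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

tokens = tube_tokens(patches, centers)
score = local_alignment(tokens, words)
```

Because each token draws on patches from every sampled frame, it can follow a pattern across both space and time, which is the property the abstract attributes to tube tokens. In a trained model the centers and the alignment objective would be learned jointly rather than sampled at random.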
Keywords: Electron tubes; Semantics; Visualization; Feature extraction; Task analysis; Transformers; Detectors; Local alignment mechanism; semantic centers; tube tokens; video-language pre-training
DOI: 10.1109/TMM.2022.3231108
Indexed by: SCI
Language: English
Funding: National Natural Science Foundation of China
WOS research area: Computer Science; Telecommunications
WOS category: Computer Science, Information Systems; Computer Science, Software Engineering; Telecommunications
WOS accession number: WOS:001125902000019
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Citation statistics
Times cited: 1 [WOS]
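Document type: Journal article
Item identifier: http://119.78.100.204/handle/2XEOYT63/38438
Collection: Journal Articles of the Institute of Computing Technology, Chinese Academy of Sciences (English)
Corresponding author: Jiang, Shuqiang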
Author affiliations:
1. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
2. Univ Chinese Acad Sci, Beijing 100049, Peoples R China
3. Tencent, Dept Machine Learning Platform, Beijing 100193, Peoples R China
Recommended citation
GB/T 7714
Zhu, Yongqing,Li, Xiangyang,Zheng, Mao,et al. Focus and Align: Learning Tube Tokens for Video-Language Pre-Training[J]. IEEE TRANSACTIONS ON MULTIMEDIA,2023,25:8036-8050.
APA Zhu, Yongqing.,Li, Xiangyang.,Zheng, Mao.,Yang, Jiahao.,Wang, Zihan.,...&Jiang, Shuqiang.(2023).Focus and Align: Learning Tube Tokens for Video-Language Pre-Training.IEEE TRANSACTIONS ON MULTIMEDIA,25,8036-8050.
MLA Zhu, Yongqing,et al."Focus and Align: Learning Tube Tokens for Video-Language Pre-Training".IEEE TRANSACTIONS ON MULTIMEDIA 25(2023):8036-8050.
Files in this item
There are no files associated with this item.
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.