Consistent multimodal pre-training for visual tokenization
Pan, Ting (1,2,3); Tang, Lulu (3); Wang, Xinlong (3); Liu, Xin (4); Shan, Shiguang (1,2)
2025-09-28
Journal: SCIENCE CHINA-INFORMATION SCIENCES
ISSN: 1674-733X
Volume: 68; Issue: 10; Pages: 15
Abstract: Multimodal large language models (MLLMs) have recently demonstrated notable progress in understanding diverse visual context. Nevertheless, the overall performance of these large vision-language connecting models is highly related to a smaller vision-language pre-trained (CLIP) model at low resolution. Currently, this nesting vision-language alignment paradigm has hindered the development of a distinct vision foundation model for domain-specific multimodal tasks (e.g., OCR and document perception). In this paper, we explore a native high-resolution vision foundation model that is specifically designed for both image-level and region-level multimodal language tasks, clearly substituting the low-resolution CLIP models. Specifically, we introduce TAP-v2, a novel visual tokenizer that encodes general-purpose contextual information to enable comprehensive perception across diverse visual content.
Keywords: foundation model; multimodal representation learning; visual tokenization
DOI: 10.1007/s11432-024-4603-x
Indexed by: SCI
Language: English
Funding: National Key R&D Program of China [2022ZD0116302]
WOS Research Area: Computer Science; Engineering
WOS Category: Computer Science, Information Systems; Engineering, Electrical & Electronic
WOS Accession Number: WOS:001585658600001
Publisher: SCIENCE PRESS
Document type: Journal article
Item identifier: http://119.78.100.204/handle/2XEOYT63/41677
Collection: Institute of Computing Technology, Chinese Academy of Sciences, journal papers (English)
Corresponding authors: Liu, Xin; Shan, Shiguang
Affiliations: 1. Chinese Acad Sci, Inst Comp Technol, Beijing 100086, Peoples R China
2. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100049, Peoples R China
3. Beijing Acad Artificial Intelligence, Beijing, Peoples R China
4. SeetaCloud, Nanjing 210000, Peoples R China
Recommended citation
GB/T 7714: Pan, Ting, Tang, Lulu, Wang, Xinlong, et al. Consistent multimodal pre-training for visual tokenization[J]. SCIENCE CHINA-INFORMATION SCIENCES, 2025, 68(10): 15.
APA: Pan, Ting, Tang, Lulu, Wang, Xinlong, Liu, Xin, & Shan, Shiguang. (2025). Consistent multimodal pre-training for visual tokenization. SCIENCE CHINA-INFORMATION SCIENCES, 68(10), 15.
MLA: Pan, Ting, et al. "Consistent multimodal pre-training for visual tokenization". SCIENCE CHINA-INFORMATION SCIENCES 68.10 (2025): 15.