Institute of Computing Technology, Chinese Academy of Sciences — Institutional Repository
| Title | Consistent multimodal pre-training for visual tokenization |
| Authors | Pan, Ting1,2,3; Tang, Lulu3; Wang, Xinlong3; Liu, Xin4; Shan, Shiguang1,2 |
| Date Issued | 2025-09-28 |
| Journal | SCIENCE CHINA-INFORMATION SCIENCES |
| ISSN | 1674-733X |
| Volume | 68 |
| Issue | 10 |
| Pages | 15 |
| Abstract | Multimodal large language models (MLLMs) have recently demonstrated notable progress in understanding diverse visual contexts. Nevertheless, the overall performance of these vision-language models remains strongly tied to a smaller vision-language pre-trained (CLIP) model operating at low resolution. This nested vision-language alignment paradigm has hindered the development of a distinct vision foundation model for domain-specific multimodal tasks (e.g., OCR and document perception). In this paper, we explore a native high-resolution vision foundation model designed for both image-level and region-level multimodal language tasks, directly substituting for low-resolution CLIP models. Specifically, we introduce TAP-v2, a novel visual tokenizer that encodes general-purpose contextual information to enable comprehensive perception across diverse visual content. |
| Keywords | foundation model; multimodal representation learning; visual tokenization |
| DOI | 10.1007/s11432-024-4603-x |
| Indexed By | SCI |
| Language | English |
| Funding Project | National Key R&D Program of China [2022ZD0116302] |
| WOS Research Area | Computer Science ; Engineering |
| WOS Subject | Computer Science, Information Systems ; Engineering, Electrical & Electronic |
| WOS ID | WOS:001585658600001 |
| Publisher | SCIENCE PRESS |
| Document Type | Journal article |
| Identifier | http://119.78.100.204/handle/2XEOYT63/41677 |
| Collection | Institute of Computing Technology, CAS: Journal Papers (English) |
| Corresponding Authors | Liu, Xin; Shan, Shiguang |
| Affiliations | 1. Chinese Acad Sci, Inst Comp Technol, Beijing 100086, Peoples R China; 2. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100049, Peoples R China; 3. Beijing Acad Artificial Intelligence, Beijing, Peoples R China; 4. SeetaCloud, Nanjing 210000, Peoples R China |
| Recommended Citation (GB/T 7714) | Pan, Ting, Tang, Lulu, Wang, Xinlong, et al. Consistent multimodal pre-training for visual tokenization[J]. SCIENCE CHINA-INFORMATION SCIENCES, 2025, 68(10): 15. |
| APA | Pan, Ting, Tang, Lulu, Wang, Xinlong, Liu, Xin, & Shan, Shiguang. (2025). Consistent multimodal pre-training for visual tokenization. SCIENCE CHINA-INFORMATION SCIENCES, 68(10), 15. |
| MLA | Pan, Ting, et al. "Consistent multimodal pre-training for visual tokenization". SCIENCE CHINA-INFORMATION SCIENCES 68.10 (2025): 15. |
| Files in This Item | No files are associated with this item. |
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.