Institute of Computing Technology, Chinese Academy of Sciences — Institutional Repository
| Title | Consistent multimodal pre-training for visual tokenization |
| Authors | Pan, Ting1,2,3; Tang, Lulu3; Wang, Xinlong3; Liu, Xin4; Shan, Shiguang1,2 |
| Date Issued | 2025-09-28 |
| Journal | SCIENCE CHINA-INFORMATION SCIENCES |
| ISSN | 1674-733X |
| Volume | 68 |
| Issue | 10 |
| Pages | 15 |
| Abstract | Multimodal large language models (MLLMs) have recently demonstrated notable progress in understanding diverse visual contexts. Nevertheless, the overall performance of these vision-language models remains strongly tied to a smaller vision-language pre-trained (CLIP) model operating at low resolution. This nested vision-language alignment paradigm has hindered the development of a distinct vision foundation model for domain-specific multimodal tasks (e.g., OCR and document perception). In this paper, we explore a native high-resolution vision foundation model designed for both image-level and region-level multimodal language tasks, directly substituting for low-resolution CLIP models. Specifically, we introduce TAP-v2, a novel visual tokenizer that encodes general-purpose contextual information to enable comprehensive perception across diverse visual content. |
| Keywords | foundation model; multimodal representation learning; visual tokenization |
| DOI | 10.1007/s11432-024-4603-x |
| Indexed By | SCI |
| Language | English |
| Funding Project | National Key R&D Program of China [2022ZD0116302] |
| WOS Research Area | Computer Science ; Engineering |
| WOS Subject | Computer Science, Information Systems ; Engineering, Electrical & Electronic |
| WOS ID | WOS:001585658600001 |
| Publisher | SCIENCE PRESS |
| Document Type | Journal article |
| Identifier | http://119.78.100.204/handle/2XEOYT63/41677 |
| Collection | Institute of Computing Technology, CAS: Journal Papers (English) |
| Corresponding Authors | Liu, Xin; Shan, Shiguang |
| Affiliations | 1. Chinese Acad Sci, Inst Comp Technol, Beijing 100086, Peoples R China; 2. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100049, Peoples R China; 3. Beijing Acad Artificial Intelligence, Beijing, Peoples R China; 4. SeetaCloud, Nanjing 210000, Peoples R China |
| Recommended Citation (GB/T 7714) | Pan, Ting, Tang, Lulu, Wang, Xinlong, et al. Consistent multimodal pre-training for visual tokenization[J]. SCIENCE CHINA-INFORMATION SCIENCES, 2025, 68(10): 15. |
| APA | Pan, Ting, Tang, Lulu, Wang, Xinlong, Liu, Xin, & Shan, Shiguang. (2025). Consistent multimodal pre-training for visual tokenization. SCIENCE CHINA-INFORMATION SCIENCES, 68(10), 15. |
| MLA | Pan, Ting, et al. "Consistent multimodal pre-training for visual tokenization". SCIENCE CHINA-INFORMATION SCIENCES 68.10 (2025): 15. |
| Files in This Item | No files are associated with this item. |
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.