Patching the visual ability of large multimodal models by collaborating with small models

doi:10.1007/s11704-025-41126-5


	Liang, Hao 1,2; Zhang, Xiaolong 1,2; Kan, Meina 1,2; Shan, Shiguang 1,2,3; Chen, Xilin 1,2
	2026-02-12
Journal	FRONTIERS OF COMPUTER SCIENCE
ISSN	2095-2228
Volume	20 Issue: 9 Pages: 17
Abstract	Large multimodal models (LMMs) have demonstrated significant success across various tasks but fall short on some basic visual functions, such as inaccurate object counting and imprecise localization. These limitations restrict the application of LMMs in broad scenarios. To enhance the capabilities of LMMs, we propose a novel method to patch their visual perceptual abilities by collaborating with small task-specific models. Our method begins with utilizing an LMM to decompose the user query into a series of visual functions. For each function, the appropriate model, either the LMM itself or a small task-specific model, is invoked. To determine whether to patch the LMM with a small task-specific model, we design a novel question-answering-based reinforcement learning strategy to optimize the decision process. Finally, the LMM generates the answer utilizing the visual perceptual results. The proposed method is evaluated on two standard visual question-answering datasets and two specialized datasets. The experimental results demonstrate that our method effectively enhances the visual abilities of LMMs.
Keywords	model collaboration; patching visual ability; large multimodal models
DOI	10.1007/s11704-025-41126-5
Indexed by	SCI
Language	English
WOS Research Area	Computer Science
WOS Category	Computer Science, Information Systems; Computer Science, Software Engineering; Computer Science, Theory & Methods
WOS Accession Number	WOS:001690415600001
Publisher	HIGHER EDUCATION PRESS
Document type	Journal article
Item identifier	http://119.78.100.204/handle/2XEOYT63/42796
Collection	Institute of Computing Technology, Chinese Academy of Sciences
Corresponding author	Kan, Meina
Author affiliations	1. Chinese Acad Sci, Inst Comp Technol, State Key Lab AI Safety, Beijing 100190, Peoples R China; 2. Univ Chinese Acad Sci, Beijing 100049, Peoples R China; 3. Peng Cheng Natl Lab, Shenzhen 518055, Peoples R China
Recommended citation
GB/T 7714	Liang, Hao, Zhang, Xiaolong, Kan, Meina, et al. Patching the visual ability of large multimodal models by collaborating with small models[J]. FRONTIERS OF COMPUTER SCIENCE, 2026, 20(9): 17.
APA	Liang, Hao, Zhang, Xiaolong, Kan, Meina, Shan, Shiguang, & Chen, Xilin. (2026). Patching the visual ability of large multimodal models by collaborating with small models. FRONTIERS OF COMPUTER SCIENCE, 20(9), 17.
MLA	Liang, Hao, et al. "Patching the visual ability of large multimodal models by collaborating with small models". FRONTIERS OF COMPUTER SCIENCE 20.9 (2026): 17.
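
Illustrative sketch (not from the paper). The abstract describes a pipeline in which a query is decomposed into visual functions, each function is routed either to the LMM or to a small task-specific model, and the routing decision is optimized by reinforcement learning using question-answering feedback. The Python below is a minimal toy rendering of that idea under stated assumptions: the decomposition is taken as already done, the models are stub callables, and the routing policy is a per-function Bernoulli policy trained with a plain REINFORCE update. Every name here (`PatchPolicy`, `run_pipeline`, `train_toy`, the `"count"` function) is hypothetical; the record does not specify the authors' actual policy, reward, or model interfaces.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a handler maps (image_id, argument) to a perception result.
Handler = Callable[[str, str], str]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class PatchPolicy:
    """Assumed policy form: P(use small model for a function) = sigmoid(logit)."""

    def __init__(self, functions: List[str], lr: float = 0.5) -> None:
        self.logits: Dict[str, float] = {f: 0.0 for f in functions}
        self.lr = lr

    def prob_small(self, function: str) -> float:
        return sigmoid(self.logits[function])

    def sample(self, function: str, rng: random.Random) -> bool:
        return rng.random() < self.prob_small(function)

    def update(self, function: str, used_small: bool, reward: float,
               baseline: float = 0.5) -> None:
        # REINFORCE for a Bernoulli policy: d/dlogit log pi(a) = a - p.
        p = self.prob_small(function)
        grad_log_prob = (1.0 if used_small else 0.0) - p
        self.logits[function] += self.lr * (reward - baseline) * grad_log_prob


def run_pipeline(image_id: str, steps: List[Tuple[str, str]], policy: PatchPolicy,
                 lmm: Handler, small_models: Dict[str, Handler],
                 rng: random.Random) -> Tuple[List[str], List[Tuple[str, bool]]]:
    """Run each visual function with the model chosen by the policy."""
    results: List[str] = []
    choices: List[Tuple[str, bool]] = []
    for function, argument in steps:
        use_small = function in small_models and policy.sample(function, rng)
        handler = small_models[function] if use_small else lmm
        results.append(handler(image_id, argument))
        choices.append((function, use_small))
    return results, choices


def train_toy(episodes: int = 50, seed: int = 0) -> PatchPolicy:
    """Toy setup (assumption): the small counter is right, the LMM miscounts."""
    rng = random.Random(seed)
    lmm: Handler = lambda image_id, arg: "2"            # stub LMM, wrong count
    small_models: Dict[str, Handler] = {
        "count": lambda image_id, arg: "3",             # stub counter, correct
    }
    policy = PatchPolicy(["count"])
    steps = [("count", "apples")]                        # assumed decomposition output
    truth = "3"
    for _ in range(episodes):
        results, choices = run_pipeline("img-0", steps, policy, lmm, small_models, rng)
        # QA-style reward: 1 if the final answer matches the ground truth.
        reward = 1.0 if results[-1] == truth else 0.0
        for function, used_small in choices:
            policy.update(function, used_small, reward)
    return policy
```

In this toy setting, calling `train_toy()` pushes the probability of routing `"count"` to the small model above 0.5, because a correct answer rewards choosing the small model and a wrong answer penalizes choosing the LMM. A real system would replace the stubs with actual model calls and would likely condition the decision on the query and image rather than on the function name alone.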