CSpace
Patching the visual ability of large multimodal models by collaborating with small models
Liang, Hao1,2; Zhang, Xiaolong1,2; Kan, Meina1,2; Shan, Shiguang1,2,3; Chen, Xilin1,2
2026-02-12
Journal: FRONTIERS OF COMPUTER SCIENCE
ISSN: 2095-2228
Volume: 20, Issue: 9, Pages: 17
Abstract: Large multimodal models (LMMs) have demonstrated significant success across various tasks but fall short on some basic visual functions, such as inaccurate object counting and imprecise localization. These limitations restrict the application of LMMs in broad scenarios. To enhance the capabilities of LMMs, we propose a novel method to patch their visual perceptual abilities by collaborating with small task-specific models. Our method begins with utilizing an LMM to decompose the user query into a series of visual functions. For each function, the appropriate model, either the LMM itself or a small task-specific model, is invoked. To determine whether to patch the LMM with a small task-specific model, we design a novel question-answering-based reinforcement learning strategy to optimize the decision process. Finally, the LMM generates the answer utilizing the visual perceptual results. The proposed method is evaluated on two standard visual question-answering datasets and two specialized datasets. The experimental results demonstrate that our method effectively enhances the visual abilities of LMMs.
Keywords: model collaboration; patching visual ability; large multimodal models
DOI: 10.1007/s11704-025-41126-5
Indexed by: SCI
Language: English
WOS Research Area: Computer Science
WOS Categories: Computer Science, Information Systems; Computer Science, Software Engineering; Computer Science, Theory & Methods
WOS Accession Number: WOS:001690415600001
Publisher: HIGHER EDUCATION PRESS
Document Type: Journal article
Identifier: http://119.78.100.204/handle/2XEOYT63/42796
Collection: Institute of Computing Technology, Chinese Academy of Sciences
Corresponding Author: Kan, Meina
Affiliations: 1. Chinese Acad Sci, Inst Comp Technol, State Key Lab AI Safety, Beijing 100190, Peoples R China
2. Univ Chinese Acad Sci, Beijing 100049, Peoples R China
3. Peng Cheng Natl Lab, Shenzhen 518055, Peoples R China
Recommended Citation
GB/T 7714
Liang, Hao,Zhang, Xiaolong,Kan, Meina,et al. Patching the visual ability of large multimodal models by collaborating with small models[J]. FRONTIERS OF COMPUTER SCIENCE,2026,20(9):17.
APA Liang, Hao,Zhang, Xiaolong,Kan, Meina,Shan, Shiguang,&Chen, Xilin.(2026).Patching the visual ability of large multimodal models by collaborating with small models.FRONTIERS OF COMPUTER SCIENCE,20(9),17.
MLA Liang, Hao,et al."Patching the visual ability of large multimodal models by collaborating with small models".FRONTIERS OF COMPUTER SCIENCE 20.9(2026):17.