Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering
Yu, Ting1; Fu, Kunhao1; Wang, Shuhui2; Huang, Qingming3; Yu, Jun4
2025-02-01
Journal: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
ISSN: 1051-8215
Volume: 35; Issue: 2; Pages: 1615-1630
Abstract: Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.
Keywords: Cognition; Computational modeling; Visualization; Context modeling; Data models; Adaptation models; Accuracy; Question answering (information retrieval); Transformers; Feature extraction; Video question answering; discriminative unimodal comprehension; cross-modal interaction; domain-specific heuristics; video-language foundation models; entity-action relationships; context-aware reasoning
DOI: 10.1109/TCSVT.2024.3475510
Indexed by: SCI
Language: English
Funding: Zhejiang Provincial Natural Science Foundation of China [LY23F020005]; National Natural Science Foundation of China [62002314]
WOS Research Area: Engineering
WOS Category: Engineering, Electrical & Electronic
WOS Accession Number: WOS:001422045800012
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Document Type: Journal article
Item Identifier: http://119.78.100.204/handle/2XEOYT63/40741
Collection: Institute of Computing Technology, Chinese Academy of Sciences, Journal Papers (English)
Corresponding Author: Yu, Ting
Affiliations:
1. Hangzhou Normal Univ, Sch Informat Sci & Technol, Hangzhou 311121, Peoples R China
2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
3. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China
4. Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518055, Peoples R China
Recommended Citation
GB/T 7714
Yu, Ting,Fu, Kunhao,Wang, Shuhui,et al. Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,2025,35(2):1615-1630.
APA Yu, Ting,Fu, Kunhao,Wang, Shuhui,Huang, Qingming,&Yu, Jun.(2025).Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering.IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,35(2),1615-1630.
MLA Yu, Ting,et al."Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering".IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 35.2(2025):1615-1630.