Institute of Computing Technology, Chinese Academy IR
Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering | |
Yu, Ting1; Fu, Kunhao1; Wang, Shuhui2; Huang, Qingming3; Yu, Jun4 | |
2025-02-01 | |
Journal | IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY |
ISSN | 1051-8215 |
Volume | 35, Issue: 2, Pages: 1615-1630 |
Abstract | Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA. |
Keywords | Cognition; Computational modeling; Visualization; Context modeling; Data models; Adaptation models; Accuracy; Question answering (information retrieval); Transformers; Feature extraction; Video question answering; discriminative unimodal comprehension; cross-modal interaction; domain-specific heuristics; video-language foundation models; entity-action relationships; context-aware reasoning |
DOI | 10.1109/TCSVT.2024.3475510 |
Indexed By | SCI |
Language | English |
Funding Project | Zhejiang Provincial Natural Science Foundation of China[LY23F020005] ; National Natural Science Foundation of China[62002314] |
WOS Research Area | Engineering |
WOS Subject | Engineering, Electrical & Electronic |
WOS ID | WOS:001422045800012 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
Document Type | Journal article |
Identifier | http://119.78.100.204/handle/2XEOYT63/40741 |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Papers (English) |
Corresponding Author | Yu, Ting |
Affiliation | 1. Hangzhou Normal Univ, Sch Informat Sci & Technol, Hangzhou 311121, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China; 3. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China; 4. Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518055, Peoples R China |
Recommended Citation (GB/T 7714) | Yu, Ting, Fu, Kunhao, Wang, Shuhui, et al. Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35(2): 1615-1630. |
APA | Yu, Ting, Fu, Kunhao, Wang, Shuhui, Huang, Qingming, & Yu, Jun. (2025). Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 35(2), 1615-1630. |
MLA | Yu, Ting, et al. "Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering." IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 35.2 (2025): 1615-1630. |