Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering

doi:10.1109/TCSVT.2024.3475510

Authors	Yu, Ting (1); Fu, Kunhao (1); Wang, Shuhui (2); Huang, Qingming (3); Yu, Jun (4)
Publication date	2025-02-01
Journal	IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
ISSN	1051-8215
Volume	35 Issue: 2 Pages: 1615-1630
Abstract	Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.
Keywords	Cognition; Computational modeling; Visualization; Context modeling; Data models; Adaptation models; Accuracy; Question answering (information retrieval); Transformers; Feature extraction; Video question answering; discriminative unimodal comprehension; cross-modal interaction; domain-specific heuristics; video-language foundation models; entity-action relationships; context-aware reasoning
DOI	10.1109/TCSVT.2024.3475510
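Indexed by	SCI
Language	English
WOS Research Area	Engineering
WOS Category	Engineering, Electrical & Electronic
WOS Accession Number	WOS:001422045800012
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Citation statistics	Times cited: 3 [WOS]
Document type	Journal article
Item identifier	http://119.78.100.204/handle/2XEOYT63/40741
Collection	Institute of Computing Technology, Chinese Academy of Sciences: Journal Papers (English)
Corresponding author	Yu, Ting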
Author affiliations	1. Hangzhou Normal Univ, Sch Informat Sci & Technol, Hangzhou 311121, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China; 3. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China; 4. Harbin Inst Technol, Dept Comp Sci & Technol, Shenzhen 518055, Peoples R China
Recommended citation (GB/T 7714)	Yu, Ting,Fu, Kunhao,Wang, Shuhui,et al. Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,2025,35(2):1615-1630.
APA	Yu, Ting,Fu, Kunhao,Wang, Shuhui,Huang, Qingming,&Yu, Jun.(2025).Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering.IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,35(2),1615-1630.
MLA	Yu, Ting,et al."Prompting Video-Language Foundation Models With Domain-Specific Fine-Grained Heuristics for Video Question Answering".IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 35.2(2025):1615-1630.