A Summarize-then-Search Method for Long Video Question Answering: Related Work

26 May 2024

This paper is available on arXiv under a CC 4.0 license.


(1) Jiwan Chung, MIR Lab Yonsei University (https://jiwanchung.github.io/);

(2) Youngjae Yu, MIR Lab Yonsei University (https://jiwanchung.github.io/).

Movie Summarization. Movies are typical examples of long videos with clear narrative structures. Gorinski et al. [7] generate a shorter version of a screenplay by framing the task as finding an optimal chain in a graph of movie scenes. TRIPOD [23] is a screenplay dataset containing turning point annotations. The same work also proposes an automatic model for identifying turning points in movie narratives. Papalampidi et al. [24] later use the TV series CSI to demonstrate the usefulness of turning points in automatic movie summarization. Lee et al. [15] further improve turning point identification with dialogue features and a transformer architecture.

Long Video QA. The task of video question answering has been studied extensively in the literature, in the form of both open-ended QA [9] and multiple-choice problems [28, 29]. Several approaches have been proposed to address this task, ranging from RNN-based attention networks [9, 30, 36, 38] to memory networks [12, 22, 27] and transformers [4, 6]. Recently, multimodal models pre-trained on large-scale video datasets (VideoQA [31], VIOLET [5], MERLOT [33], and MERLOT-Reserve [34]) have also shown promising performance in video question answering.

However, long video QA has received relatively little attention despite its importance. MovieQA [27] formulates QA over entire movies, which typically span about two hours. DramaQA [3] uses a single TV series as visual context and tasks a solver with understanding video clips ranging from one to twenty minutes in length.