Post

ChapVidMR: Chapter-based Video Moment Retrieval

ChapVidMR: Chapter-based Video Moment Retrieval

ICVGIP 2024DOI


Video Moment Retrieval (VMR) is the task of finding the specific timestamp in a video that corresponds to a natural language query. You search “the part where they explain the offside rule” and the model returns 4:32–5:10.

Standard VMR has a quiet assumption baked in: each query maps to a single, contiguous moment. That works for short clips but breaks down for long-form video — lectures, documentaries, YouTube deep dives. In a two-hour video, “the part where they explain X” might actually span three separate segments, with digressions in between.

ChapVidMR (Chapter-based Video Moment Retrieval) addresses this.

The Core Idea

YouTube chapters give us something valuable for free: human-annotated, semantically coherent video segments. Creators add chapters because they naturally divide their content into meaningful units. We exploited this structure as a supervisory signal — rather than trying to localize arbitrary temporal spans, we link natural language queries to sets of chapters.

This reframes VMR as a multi-label retrieval problem: a query can and should match multiple relevant moments rather than a single one.

Dataset

No existing VMR dataset supported multi-moment retrieval at this scale, so we built one. Using YouTube-8M as a base and GPT-4o to generate natural language queries aligned to chapter content, we produced a dataset with 12,500+ datapoints covering diverse video content and query types.

The pipeline combined:

  • Video Language Models for visual-semantic alignment
  • Audio modality for content that doesn’t appear visually (narration, off-screen explanation)
  • Prompt engineering to generate varied, naturalistic queries per chapter

Why Chapters Work

The key insight is that chapters provide temporal locality with semantic coherence — something that arbitrary timestamp supervision doesn’t. A chapter boundary is a human judgment that the content has meaningfully shifted. Using chapter-aligned retrieval means the model learns to match queries to semantically complete units rather than arbitrary windows.

This also makes evaluation cleaner: instead of computing IoU between predicted and ground-truth timestamps (which is sensitive to annotation disagreement), chapter-level retrieval can be evaluated as a multi-label classification problem with well-defined positive sets.

Results

The paper was accepted at ICVGIP 2024 (Indian Conference on Vision, Graphics and Image Processing). Full results and methodology at the DOI link.

This post is licensed under CC BY 4.0 by the author.