YT-RAG: A Multimodal Retrieval-Augmented Generation Framework for YouTube Video Understanding
DOI:
https://doi.org/10.47392/IRJASH.2026.018
Keywords:
Multimodal RAG, Video Understanding, YouTube, Dual-Modality Retrieval, Agentic Retrieval, FastEmbed, Semantic Notes
Abstract
The vast body of video content on platforms such as YouTube represents one of the richest yet most query-inaccessible knowledge repositories in existence. We present YT-RAG, a multimodal Retrieval-Augmented Generation (RAG) system that enables natural-language conversation with any YouTube video by independently indexing its spoken transcript as 384-dimensional text embeddings and its visual frames as 512-dimensional image embeddings using FastEmbed, with no GPU infrastructure required. Three principal contributions define the system: (i) a parallel dual-modality ingestion pipeline that runs transcript and frame extraction as concurrent asyncio tasks; (ii) an agentic retrieval loop powered by Google Gemini 2.0 Flash’s native tool-call API, which makes retrieval conditional on model judgment rather than mandatory; and (iii) a semantic user-notes channel that provides a personalised third retrieval layer absent from all prior video RAG literature. Empirical evaluation across six YouTube content categories (Comedy, Podcast, Cooking, News, Tutorials, and Coding) shows that aggregate dual-modality retrieval achieves a Hit@5 approximately 4× higher than text-only retrieval and 8× higher than image-only retrieval. A systematic anti-correlation in per-category modality strengths confirms that the two channels retrieve complementary evidence. Deployed as a Chrome extension operating natively alongside the YouTube player, YT-RAG is training-free, containerised, and built entirely on official APIs without agentic frameworks.
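As an illustration of contribution (i), the following is a minimal sketch of running transcript and frame extraction as concurrent asyncio tasks. It is not the authors' implementation: `extract_transcript`, `extract_frames`, and `ingest` are hypothetical names, and the extractors are stubs that only simulate I/O so that the scheduling pattern can be shown in isolation.

```python
import asyncio


# Hypothetical stand-in: a real extractor would fetch and chunk the transcript.
async def extract_transcript(video_id: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate network I/O
    return [f"{video_id}: transcript chunk {i}" for i in range(3)]


# Hypothetical stand-in: a real extractor would sample frames from the video.
async def extract_frames(video_id: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate download and decoding
    return [f"{video_id}: frame {i}" for i in range(2)]


async def ingest(video_id: str) -> dict[str, list[str]]:
    # Create both tasks first so they are scheduled concurrently,
    # then wait for both; gather returns results in argument order.
    transcript_task = asyncio.create_task(extract_transcript(video_id))
    frames_task = asyncio.create_task(extract_frames(video_id))
    transcript, frames = await asyncio.gather(transcript_task, frames_task)
    return {"text": transcript, "image": frames}


if __name__ == "__main__":
    result = asyncio.run(ingest("demo_video"))
    print(len(result["text"]), "text chunks,", len(result["image"]), "frames")
```

In a full pipeline, the two result lists would then be embedded separately (text and image) before indexing. Because both stages here are I/O-bound, the total wait is roughly that of the slower task rather than the sum of the two.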
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.