YT-RAG: A Multimodal Retrieval-Augmented Generation Framework for YouTube Video Understanding

Gangadhari Swapna; Chebrolu Yogesh; G. J. S. Hari Teja

doi:10.47392/IRJASH.2026.018

Authors

Gangadhari Swapna, Associate Professor, Dept. of CSE, RGUKT RK-VALLEY, India
Chebrolu Yogesh, UG Scholar, Dept. of CSE, RGUKT RK-VALLEY, India
G. J. S. Hari Teja, UG Scholar, Dept. of CSE, RGUKT RK-VALLEY, India

Keywords:

Multimodal RAG, Video Understanding, YouTube, Dual-Modality Retrieval, Agentic Retrieval, FastEmbed, Semantic Notes

Abstract

The vast body of video content on platforms such as YouTube is one of the richest yet least queryable knowledge repositories in existence. We present YT-RAG, a multimodal Retrieval-Augmented Generation (RAG) system that enables natural-language conversation with any YouTube video. It independently indexes the spoken transcript as 384-dimensional text embeddings and the visual frames as 512-dimensional image embeddings using FastEmbed, with no GPU infrastructure required. Three principal contributions define the system: (i) a parallel dual-modality ingestion pipeline that runs transcript and frame extraction as concurrent asyncio tasks; (ii) an agentic retrieval loop powered by Google Gemini 2.0 Flash’s native tool-call API, which makes retrieval conditional on model judgment rather than mandatory; and (iii) a semantic user-notes channel that provides a personalised third retrieval layer absent from all prior video RAG literature. Empirical evaluation across six YouTube content categories (Comedy, Podcast, Cooking, News, Tutorials, and Coding) shows that aggregate dual-modality retrieval achieves a Hit@5 approximately 4× higher than text-only retrieval and 8× higher than image-only retrieval. A systematic anti-correlation in per-category modality strengths confirms that the two channels retrieve complementary evidence. Deployed as a Chrome extension operating natively alongside the YouTube player, YT-RAG is training-free, containerised, and built entirely on official APIs without agentic frameworks.
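The concurrent ingestion idea in contribution (i) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names ingest_transcript, ingest_frames, and ingest_video are hypothetical, and the bodies are stubs that only simulate I/O. A real pipeline would fetch the transcript, sample frames, and embed both.

```python
import asyncio

# Hypothetical stand-ins (assumptions, not the paper's code): a real system
# would fetch the transcript and extract frames, then embed each modality.
async def ingest_transcript(video_id: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulates I/O such as a transcript fetch
    return [f"{video_id}: transcript chunk {i}" for i in range(3)]

async def ingest_frames(video_id: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulates I/O such as frame extraction
    return [f"{video_id}: frame {i}" for i in range(2)]

async def ingest_video(video_id: str) -> dict[str, list[str]]:
    # Schedule both modalities concurrently; asyncio.gather returns the
    # results in the same order as the awaitables were passed in.
    transcript, frames = await asyncio.gather(
        ingest_transcript(video_id),
        ingest_frames(video_id),
    )
    return {"text": transcript, "image": frames}

result = asyncio.run(ingest_video("demo"))
```

Because both coroutines spend their time waiting on I/O, running them under asyncio.gather lets the waits overlap instead of adding up, which is the presumed motivation for the parallel design.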
