Knowledge Ingestion Pipeline
How source material becomes searchable knowledge
Pallava KB · April 2026
What the pipeline does
Any source — a PDF book, a YouTube lecture, a scanned inscription image, or a web article —
is automatically converted into structured, semantically searchable knowledge chunks.
Once indexed, the content is instantly available to the AI when answering questions,
generating research, or translating inscriptions.
📄 PDF: Academic books, papers, excavation reports. OCR for scanned pages.
🎬 Video: YouTube, Vimeo lectures. Audio transcribed + keyframes described.
🖼️ Image: Stone inscriptions, manuscript photos. Claude Vision extracts text.
🌐 Web URL: Wikipedia, blogs, online articles on Pallava history.
📝 Word Doc: Research notes, manuscripts, field reports in .docx format.
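In code terms, the front door of the pipeline is a simple router that inspects the source and hands it to the matching ingester. The sketch below is illustrative only; the ingester function names are assumptions, not the project's actual API.

```python
from pathlib import Path

# Hypothetical ingester functions -- names are illustrative only.
def ingest_pdf(path): ...
def ingest_video(url): ...
def ingest_image(path): ...
def ingest_web(url): ...
def ingest_docx(path): ...

VIDEO_HOSTS = ("youtube.com", "youtu.be", "vimeo.com", "dailymotion.com")

def route_source(source: str):
    """Pick an ingester based on the URL host or file extension."""
    if source.startswith(("http://", "https://")):
        if any(host in source for host in VIDEO_HOSTS):
            return ingest_video(source)
        return ingest_web(source)
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return ingest_pdf(source)
    if suffix in (".png", ".jpg", ".jpeg", ".tif", ".tiff"):
        return ingest_image(source)
    if suffix == ".docx":
        return ingest_docx(source)
    raise ValueError(f"Unsupported source type: {source}")
```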
Video ingestion — step by step
Step 1 · Register the source
The video URL and label are immediately saved to the Source Library with status
pending. A unique job ID is returned so the UI can track progress in real time.
Nothing is downloaded yet.
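A minimal sketch of this registration step, assuming a JSON-backed source library; the file path and field names are illustrative, not the actual schema.

```python
import json
import uuid
from pathlib import Path

LIBRARY = Path("corpus/source_library.json")  # assumed location, for illustration

def register_video(url: str, label: str) -> str:
    """Record the source with status 'pending' and return a job ID for progress tracking."""
    job_id = uuid.uuid4().hex
    entries = json.loads(LIBRARY.read_text()) if LIBRARY.exists() else []
    entries.append({
        "id": job_id,
        "type": "video",
        "url": url,
        "label": label,
        "status": "pending",   # updated as the pipeline progresses
    })
    LIBRARY.parent.mkdir(parents=True, exist_ok=True)
    LIBRARY.write_text(json.dumps(entries, indent=2))
    return job_id
```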
Step 2 · Download the video
The full video file is downloaded to a temporary folder on the server —
like saving a video file from YouTube. yt-dlp supports 1,000+ sites,
including YouTube, Vimeo, and Dailymotion.
The raw video file is deleted after processing. Only keyframe screenshots are kept permanently.
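A sketch of the download step using yt-dlp's Python API. The output template and format selection below are assumptions, not the pipeline's exact settings.

```python
import tempfile
import yt_dlp  # pip install yt-dlp

def download_video(url: str) -> str:
    """Download the video to a temporary folder and return the local file path."""
    tmp_dir = tempfile.mkdtemp(prefix="ingest_")
    opts = {
        "outtmpl": f"{tmp_dir}/%(id)s.%(ext)s",  # save as <video id>.<ext> in the temp dir
        "format": "best[height<=720]/best",       # modest resolution is enough for keyframes (assumption)
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```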
Step 3 · Transcribe the audio
OpenAI's Whisper model runs locally on the server and converts all spoken audio
into timestamped text segments — like automatic subtitles. It handles English,
Tamil, Telugu, Sanskrit, and 90+ other languages.
A 1-hour video takes ~25–40 minutes to transcribe on CPU. Output example:
00:02:15 → "The vimana rises in three tiers above the garbhagriha…"
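Transcription with the open-source whisper package looks roughly like this; the model size is an assumption (larger models are slower on CPU but more accurate on Tamil and Sanskrit terms).

```python
import whisper  # pip install openai-whisper

def transcribe(video_path: str) -> list[dict]:
    """Return timestamped segments: [{'start': sec, 'end': sec, 'text': ...}, ...]."""
    model = whisper.load_model("medium")      # model size is an assumption
    result = model.transcribe(video_path)     # language is auto-detected
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```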
Step 4 · Extract keyframes
One screenshot is taken every 30 seconds from the video and saved permanently
as PNG images. For a 1-hour video, this produces approximately 120 keyframe images.
These are stored in corpus/video_frames/{id}/
and are viewable in the Source Library.
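With OpenCV, keyframe extraction amounts to seeking to each 30-second mark and writing the frame out as a PNG. A minimal sketch, using the frame_NNNN.png naming mentioned later in this document:

```python
import cv2  # pip install opencv-python
from pathlib import Path

def extract_keyframes(video_path: str, out_dir: str, interval_sec: int = 30) -> list[str]:
    """Save one PNG every `interval_sec` seconds; return the saved file paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, int(fps * interval_sec))       # number of frames between screenshots
    saved = []
    for i, frame_no in enumerate(range(0, total, step)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)   # jump to the next 30-second mark
        ok, frame = cap.read()
        if not ok:
            break
        path = f"{out_dir}/frame_{i:04d}.png"
        cv2.imwrite(path, frame)
        saved.append(path)
    cap.release()
    return saved
```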
Step 5 · Filter frames with Claude Haiku
Every keyframe is shown to Claude Haiku — a fast, inexpensive model — with a
simple yes/no question: "Does this frame show a temple, inscription, sculpture,
or historically significant subject?" Frames showing presenter slides with
plain text, talking heads, logos, or blank screens are filtered out.
Only frames that pass this filter proceed to the expensive Opus description step.
Results are cached — re-ingesting the same video skips all Haiku calls at zero cost.
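The yes/no check is one short vision call per frame. A sketch using the Anthropic Python SDK; the model ID and the exact prompt wording are assumptions.

```python
import base64
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUESTION = ("Does this frame show a temple, inscription, sculpture, or historically "
            "significant subject? Answer only YES or NO.")

def frame_is_significant(png_path: str) -> bool:
    """Ask Claude Haiku a yes/no question about one keyframe."""
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed model ID
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": image_b64}},
                {"type": "text", "text": QUESTION},
            ],
        }],
    )
    return response.content[0].text.strip().upper().startswith("YES")
```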
Step 6 · Describe frames with Claude Opus
Each frame that passed the Haiku filter is sent to Claude Opus — the most
capable model — along with the transcript text spoken ±30 seconds around
that moment. Opus produces a structured scholarly description:
CAPTION: Shore Temple viewed from the south-east corner
SUBJECT: Shore Temple, Mahabalipuram
LOCATION: Mahabalipuram, Tamil Nadu
DETAILS: The twin shrines dedicated to Shiva rise above
the Bay of Bengal shoreline. Carved from granite in the late 7th–early 8th century CE
under Narasimhavarman II, the structure shows characteristic Pallava dravidian vimana
with a square garbhagriha and pyramidal superstructure…
The description is enriched by the spoken narration — if the lecturer is explaining
what is on screen, Opus uses that context to write a more accurate, detailed description.
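A sketch of the Opus call, assuming the ±30-second transcript window has already been assembled into a string (the correlation section below shows how that window can be selected). The model ID and prompt are illustrative.

```python
import base64
import anthropic

client = anthropic.Anthropic()

PROMPT = """You are describing a keyframe from a lecture on Pallava history.
The speaker said the following around this moment:

{transcript_window}

Describe the frame for a scholarly knowledge base using exactly this format:
CAPTION: ...
SUBJECT: ...
LOCATION: ...
DETAILS: ..."""

def describe_frame(png_path: str, transcript_window: str) -> str:
    """Produce a structured scholarly description of one keyframe."""
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model ID
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": image_b64}},
                {"type": "text", "text": PROMPT.format(transcript_window=transcript_window)},
            ],
        }],
    )
    return response.content[0].text
```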
Step 7 · Merge transcript and descriptions
Transcript segments and keyframe descriptions are merged in time order
into a single document, preserving the narrative flow of the video.
Example output — Shore Temple lecture at 00:02:45
00:02:15 · Transcript: "The vimana rises in three tiers above the garbhagriha…"
00:02:28 · Transcript: "Notice the sculptural panels on the outer wall…"
00:03:00 · Keyframe description (Opus):
CAPTION: Shore Temple viewed from south-east
DETAILS: Three registers of carved panels showing Vaishnava iconography, consistent with late Pallava temple design as described by the speaker…
00:03:15 · Transcript: "This arrangement is unique to the Shore Temple complex…"
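The merge itself is essentially a sort by timestamp over the two lists produced in Steps 3 and 6. A minimal sketch; the field names are assumptions.

```python
def merge_timeline(transcript_segments: list[dict], frame_descriptions: list[dict]) -> str:
    """Interleave transcript segments and keyframe descriptions into one time-ordered document."""
    events = [
        {"time": seg["start"], "kind": "Transcript", "text": seg["text"]}
        for seg in transcript_segments
    ] + [
        {"time": frame["timestamp"], "kind": "Keyframe description", "text": frame["description"]}
        for frame in frame_descriptions
    ]
    events.sort(key=lambda e: e["time"])
    lines = []
    for e in events:
        minutes, seconds = divmod(int(e["time"]), 60)
        hours, minutes = divmod(minutes, 60)
        lines.append(f"[{hours:02d}:{minutes:02d}:{seconds:02d}] {e['kind']}: {e['text']}")
    return "\n".join(lines)
```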
Step 8 · Chunk, embed, and index
The combined document is split into overlapping chunks of ~400 words.
Each chunk is converted into a numerical vector using the multilingual embedding model
(intfloat/multilingual-e5-large), which understands Tamil, Telugu, Sanskrit,
and English. Chunks are stored in ChromaDB — the same vector database used for PDFs.
From this point, asking any question in the Ask or Research tab will automatically
retrieve relevant video segments — including both the spoken words and the visual descriptions.
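Embedding and indexing with sentence-transformers and ChromaDB might look like the sketch below; the storage path and collection name are assumptions. Note that e5 models expect a "passage: " prefix on indexed text.

```python
import chromadb  # pip install chromadb
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("intfloat/multilingual-e5-large")
client = chromadb.PersistentClient(path="corpus/chroma")      # assumed storage path
collection = client.get_or_create_collection("pallava_kb")    # assumed collection name

def index_chunks(doc_id: str, chunks: list[str]) -> None:
    """Embed ~400-word chunks and store them alongside their source ID."""
    vectors = embedder.encode([f"passage: {c}" for c in chunks])   # e5 expects a 'passage:' prefix
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        embeddings=[v.tolist() for v in vectors],
        documents=chunks,
        metadatas=[{"source": doc_id, "chunk": i} for i in range(len(chunks))],
    )
```

At question time, the query would be encoded the same way (with a "query: " prefix for e5) and passed to collection.query to retrieve the nearest chunks.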
How audio and visuals are correlated
The pipeline does not treat audio and video as separate data streams.
At the description step (Step 6), Opus receives both the keyframe image
and the transcript spoken within ±30 seconds of that frame.
This means if a speaker is explaining what they are showing on screen,
the description will incorporate both observations.
Example: A frame showing the Shore Temple at 00:03:00 is described
alongside transcript context from 00:02:30 to 00:03:30 — so if the speaker says
"the outer prakara has three registers of Vaishnava panels" while the camera
shows the temple wall, Opus uses both pieces of evidence in its description.
Limitation: If the speaker discusses something off-screen
(e.g. describing a different site while showing a map), the audio and visual
context may not align. The ±30 second window is fixed and works best for
lecture-style videos where the speaker narrates what is on screen.
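Selecting the transcript window for a frame is a simple filter over the timestamped segments. A sketch, assuming the segment format produced in the transcription step:

```python
def transcript_window(segments: list[dict], frame_time: float, window_sec: float = 30.0) -> str:
    """Return the transcript text spoken within ±window_sec of the keyframe timestamp."""
    nearby = [
        seg["text"] for seg in segments
        if seg["end"] >= frame_time - window_sec and seg["start"] <= frame_time + window_sec
    ]
    return " ".join(nearby)
```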
What is stored permanently
Keyframe images (corpus/video_frames/{id}/frame_NNNN.png): one PNG per 30 seconds. Viewable in the Source Library. Never deleted.
Frame manifest cache (corpus/video_frames/{id}/manifest.json): Haiku and Opus results per frame. Re-ingesting the same URL skips all API calls.
Video metadata (corpus/video_frames/{id}/meta.json): title and original URL saved from yt-dlp for reference.
Knowledge chunks (corpus/knowledge.json + ChromaDB): transcript and keyframe descriptions as indexed chunks, in the same format as PDFs.
The raw video file is not kept. Only the screenshots and extracted text are stored.
The original URL is preserved in the Source Library so the video can always be re-downloaded if needed.
Cost and time — 1-hour video
| Step | Tool | Time | Cost |
| --- | --- | --- | --- |
| Download | yt-dlp | 1–3 min | Free |
| Transcription | Whisper (local CPU) | 25–40 min | Free |
| Keyframe extraction | OpenCV | < 1 min | Free |
| Frame classification (~120 frames) | Claude Haiku | 3–5 min | ~$0.05–0.10 |
| Frame description (~20–40 frames) | Claude Opus | 5–10 min | ~$1.50–3.00 |
| Total (1-hour video) | | ~35–60 min | ~$1.50–3.00 |
Re-ingestion is nearly free. If you ingest the same video again
(e.g. to update tags), the manifest cache means all Haiku and Opus calls are skipped.
Only the Whisper transcription re-runs.
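The cache behaviour can be as simple as reading manifest.json before making any API call. The schema below is an assumption about the file's shape, not its documented format.

```python
import json
from pathlib import Path

def load_manifest(frames_dir: str) -> dict:
    """Load cached Haiku/Opus results for this video, keyed by frame filename (assumed schema)."""
    path = Path(frames_dir) / "manifest.json"
    return json.loads(path.read_text()) if path.exists() else {}

def describe_with_cache(frames_dir: str, frame_name: str, describe_fn) -> str:
    """Reuse a cached description when the frame was already processed; otherwise call the model."""
    manifest = load_manifest(frames_dir)
    entry = manifest.get(frame_name, {})
    if "description" in entry:
        return entry["description"]           # cache hit: no Haiku/Opus call needed
    description = describe_fn(frame_name)     # cache miss: call the model and record the result
    manifest[frame_name] = {**entry, "description": description}
    (Path(frames_dir) / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return description
```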
Other source types — how they differ
PDFs, images, and web URLs follow a simpler version of the same pipeline.
The key difference is that video is the only source type that combines
two independent signals — audio transcription and visual description — into one document.
| Source | Text extraction | Visual description | Typical cost |
| --- | --- | --- | --- |
| PDF (digital) | PyMuPDF (free) | Haiku filter + Opus for significant images | $0.50–3.00 / 200-page book |
| PDF (scanned) | Haiku OCR per page | Haiku filter + Opus for significant images | $1.00–5.00 / 200-page book |
| Image / inscription | Opus Vision OCR | Included in OCR step | ~$0.01–0.05 / image |
| Web URL | HTML scraper (free) | None | Free |
| Video (1 hr) | Whisper transcription (free) | Haiku filter + Opus for significant frames | $1.50–3.00 |