Knowledge Ingestion Pipeline
How source material becomes searchable knowledge
Pallava KB · April 2026
What the pipeline does
Any source — a PDF book, a YouTube lecture, a scanned inscription image, or a web article —
is automatically converted into structured, semantically searchable knowledge chunks.
Once indexed, the content is instantly available to the AI when answering questions,
generating research, or translating inscriptions.
📄 PDF: Academic books, papers, excavation reports. OCR for scanned pages.
🎬 Video: YouTube, Vimeo lectures. Audio transcribed + keyframes described.
🖼️ Image: Stone inscriptions, manuscript photos. Claude Vision extracts text.
🌐 Web URL: Wikipedia, blogs, online articles on Pallava history.
📝 Word Doc: Research notes, manuscripts, field reports in .docx format.
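In code terms, the front door of the pipeline is a simple router that inspects the source and hands it to the matching ingester. The sketch below is illustrative only; the ingester function names are assumptions, not the project's actual API.

```python
from pathlib import Path

# Hypothetical ingester functions -- names are illustrative only.
def ingest_pdf(path): ...
def ingest_video(url): ...
def ingest_image(path): ...
def ingest_web(url): ...
def ingest_docx(path): ...

VIDEO_HOSTS = ("youtube.com", "youtu.be", "vimeo.com", "dailymotion.com")

def route_source(source: str):
    """Pick an ingester based on the URL host or file extension."""
    if source.startswith(("http://", "https://")):
        if any(host in source for host in VIDEO_HOSTS):
            return ingest_video(source)
        return ingest_web(source)
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return ingest_pdf(source)
    if suffix in (".png", ".jpg", ".jpeg", ".tif", ".tiff"):
        return ingest_image(source)
    if suffix == ".docx":
        return ingest_docx(source)
    raise ValueError(f"Unsupported source type: {source}")
```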
Video ingestion — step by step
Step 1 · Register the source
The video URL and label are immediately saved to the Source Library with status
pending. A unique job ID is returned so the UI can track progress in real time.
Nothing is downloaded yet.
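A minimal sketch of this registration step, assuming a JSON-backed source library; the file path and field names are illustrative, not the actual schema.

```python
import json
import uuid
from pathlib import Path

LIBRARY = Path("corpus/source_library.json")  # assumed location, for illustration

def register_video(url: str, label: str) -> str:
    """Record the source with status 'pending' and return a job ID for progress tracking."""
    job_id = uuid.uuid4().hex
    entries = json.loads(LIBRARY.read_text()) if LIBRARY.exists() else []
    entries.append({
        "id": job_id,
        "type": "video",
        "url": url,
        "label": label,
        "status": "pending",   # updated as the pipeline progresses
    })
    LIBRARY.parent.mkdir(parents=True, exist_ok=True)
    LIBRARY.write_text(json.dumps(entries, indent=2))
    return job_id
```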
Step 2 · Download the video
The full video file is downloaded to a temporary folder on the server —
like saving a video file from YouTube. yt-dlp supports 1,000+ sites,
including YouTube, Vimeo, and Dailymotion.
The raw video file is deleted after processing. Only keyframe screenshots are kept permanently.
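A sketch of the download step using yt-dlp's Python API. The output template and format selection below are assumptions, not the pipeline's exact settings.

```python
import tempfile
import yt_dlp  # pip install yt-dlp

def download_video(url: str) -> str:
    """Download the video to a temporary folder and return the local file path."""
    tmp_dir = tempfile.mkdtemp(prefix="ingest_")
    opts = {
        "outtmpl": f"{tmp_dir}/%(id)s.%(ext)s",  # save as <video id>.<ext> in the temp dir
        "format": "best[height<=720]/best",       # modest resolution is enough for keyframes (assumption)
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```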
Step 3 · Transcribe the audio
OpenAI's Whisper model runs locally on the server and converts all spoken audio
into timestamped text segments — like automatic subtitles. It handles English,
Tamil, Telugu, Sanskrit, and 90+ other languages.
A 1-hour video takes ~25–40 minutes to transcribe on CPU. Output example:
00:02:15 → "The vimana rises in three tiers above the garbhagriha…"
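Transcription with the open-source whisper package looks roughly like this; the model size is an assumption (larger models are slower on CPU but more accurate on Tamil and Sanskrit terms).

```python
import whisper  # pip install openai-whisper

def transcribe(video_path: str) -> list[dict]:
    """Return timestamped segments: [{'start': sec, 'end': sec, 'text': ...}, ...]."""
    model = whisper.load_model("medium")      # model size is an assumption
    result = model.transcribe(video_path)     # language is auto-detected
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```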
Step 4 · Extract keyframes
One screenshot is taken every 30 seconds from the video and saved permanently
as PNG images. For a 1-hour video, this produces approximately 120 keyframe images.
These are stored in corpus/video_frames/{id}/
and are viewable in the Source Library.
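With OpenCV, keyframe extraction amounts to seeking to each 30-second mark and writing the frame out as a PNG. A minimal sketch, using the frame_NNNN.png naming mentioned later in this document:

```python
import cv2  # pip install opencv-python
from pathlib import Path

def extract_keyframes(video_path: str, out_dir: str, interval_sec: int = 30) -> list[str]:
    """Save one PNG every `interval_sec` seconds; return the saved file paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, int(fps * interval_sec))       # number of frames between screenshots
    saved = []
    for i, frame_no in enumerate(range(0, total, step)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)   # jump to the next 30-second mark
        ok, frame = cap.read()
        if not ok:
            break
        path = f"{out_dir}/frame_{i:04d}.png"
        cv2.imwrite(path, frame)
        saved.append(path)
    cap.release()
    return saved
```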
Step 5 · Filter frames with Claude Haiku
Every keyframe is shown to Claude Haiku — a fast, inexpensive model — with a
simple yes/no question: "Does this frame show a temple, inscription, sculpture,
or historically significant subject?" Frames showing presenter slides with
plain text, talking heads, logos, or blank screens are filtered out.
Only frames that pass this filter proceed to the expensive Opus description step.
Results are cached — re-ingesting the same video skips all Haiku calls at zero cost.
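The yes/no check is one short vision call per frame. A sketch using the Anthropic Python SDK; the model ID and the exact prompt wording are assumptions.

```python
import base64
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUESTION = ("Does this frame show a temple, inscription, sculpture, or historically "
            "significant subject? Answer only YES or NO.")

def frame_is_significant(png_path: str) -> bool:
    """Ask Claude Haiku a yes/no question about one keyframe."""
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed model ID
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": image_b64}},
                {"type": "text", "text": QUESTION},
            ],
        }],
    )
    return response.content[0].text.strip().upper().startswith("YES")
```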
Step 6 · Describe frames with Claude Opus
Each frame that passed the Haiku filter is sent to Claude Opus — the most
capable model — along with the transcript text spoken ±30 seconds around
that moment. Opus produces a structured scholarly description:
CAPTION: Shore Temple viewed from the south-east corner
SUBJECT: Shore Temple, Mahabalipuram
LOCATION: Mahabalipuram, Tamil Nadu
DETAILS: The twin shrines dedicated to Shiva rise above
the Bay of Bengal shoreline. Carved from granite in the late 7th–early 8th century CE
under Narasimhavarman II, the structure shows characteristic Pallava dravidian vimana
with a square garbhagriha and pyramidal superstructure…
The description is enriched by the spoken narration — if the lecturer is explaining
what is on screen, Opus uses that context to write a more accurate, detailed description.
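A sketch of the Opus call, assuming the ±30-second transcript window has already been assembled into a string (the correlation section below shows how that window can be selected). The model ID and prompt are illustrative.

```python
import base64
import anthropic

client = anthropic.Anthropic()

PROMPT = """You are describing a keyframe from a lecture on Pallava history.
The speaker said the following around this moment:

{transcript_window}

Describe the frame for a scholarly knowledge base using exactly this format:
CAPTION: ...
SUBJECT: ...
LOCATION: ...
DETAILS: ..."""

def describe_frame(png_path: str, transcript_window: str) -> str:
    """Produce a structured scholarly description of one keyframe."""
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model ID
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": image_b64}},
                {"type": "text", "text": PROMPT.format(transcript_window=transcript_window)},
            ],
        }],
    )
    return response.content[0].text
```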
Step 7 · Merge transcript and descriptions
Transcript segments and keyframe descriptions are merged in time order
into a single document, preserving the narrative flow of the video.
Example output — Shore Temple lecture at 00:02:45
00:02:15 · Transcript: "The vimana rises in three tiers above the garbhagriha…"
00:02:28 · Transcript: "Notice the sculptural panels on the outer wall…"
00:03:00 · Keyframe description (Opus):
CAPTION: Shore Temple viewed from south-east
DETAILS: Three registers of carved panels showing Vaishnava iconography, consistent with late Pallava temple design as described by the speaker…
00:03:15 · Transcript: "This arrangement is unique to the Shore Temple complex…"
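The merge itself is essentially a sort by timestamp over the two lists produced in Steps 3 and 6. A minimal sketch; the field names are assumptions.

```python
def merge_timeline(transcript_segments: list[dict], frame_descriptions: list[dict]) -> str:
    """Interleave transcript segments and keyframe descriptions into one time-ordered document."""
    events = [
        {"time": seg["start"], "kind": "Transcript", "text": seg["text"]}
        for seg in transcript_segments
    ] + [
        {"time": frame["timestamp"], "kind": "Keyframe description", "text": frame["description"]}
        for frame in frame_descriptions
    ]
    events.sort(key=lambda e: e["time"])
    lines = []
    for e in events:
        minutes, seconds = divmod(int(e["time"]), 60)
        hours, minutes = divmod(minutes, 60)
        lines.append(f"[{hours:02d}:{minutes:02d}:{seconds:02d}] {e['kind']}: {e['text']}")
    return "\n".join(lines)
```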
Step 8 · Chunk, embed, and index
The combined document is split into overlapping chunks of ~400 words.
Each chunk is converted into a numerical vector using the multilingual embedding model
(intfloat/multilingual-e5-large), which understands Tamil, Telugu, Sanskrit,
and English. Chunks are stored in ChromaDB — the same vector database used for PDFs.
From this point, asking any question in the Ask or Research tab will automatically
retrieve relevant video segments — including both the spoken words and the visual descriptions.
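Embedding and indexing with sentence-transformers and ChromaDB might look like the sketch below; the storage path and collection name are assumptions. Note that e5 models expect a "passage: " prefix on indexed text.

```python
import chromadb  # pip install chromadb
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("intfloat/multilingual-e5-large")
client = chromadb.PersistentClient(path="corpus/chroma")      # assumed storage path
collection = client.get_or_create_collection("pallava_kb")    # assumed collection name

def index_chunks(doc_id: str, chunks: list[str]) -> None:
    """Embed ~400-word chunks and store them alongside their source ID."""
    vectors = embedder.encode([f"passage: {c}" for c in chunks])   # e5 expects a 'passage:' prefix
    collection.add(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        embeddings=[v.tolist() for v in vectors],
        documents=chunks,
        metadatas=[{"source": doc_id, "chunk": i} for i in range(len(chunks))],
    )
```

At question time, the query would be encoded the same way (with a "query: " prefix for e5) and passed to collection.query to retrieve the nearest chunks.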
How audio and visuals are correlated
The pipeline does not treat audio and video as separate data streams.
At the description step (Step 6), Opus receives both the keyframe image
and the transcript spoken within ±30 seconds of that frame.
This means if a speaker is explaining what they are showing on screen,
the description will incorporate both observations.
Example: A frame showing the Shore Temple at 00:03:00 is described
alongside transcript context from 00:02:30 to 00:03:30 — so if the speaker says
"the outer prakara has three registers of Vaishnava panels" while the camera
shows the temple wall, Opus uses both pieces of evidence in its description.
Limitation: If the speaker discusses something off-screen
(e.g. describing a different site while showing a map), the audio and visual
context may not align. The ±30 second window is fixed and works best for
lecture-style videos where the speaker narrates what is on screen.
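Selecting the transcript window for a frame is a simple filter over the timestamped segments. A sketch, assuming the segment format produced in the transcription step:

```python
def transcript_window(segments: list[dict], frame_time: float, window_sec: float = 30.0) -> str:
    """Return the transcript text spoken within ±window_sec of the keyframe timestamp."""
    nearby = [
        seg["text"] for seg in segments
        if seg["end"] >= frame_time - window_sec and seg["start"] <= frame_time + window_sec
    ]
    return " ".join(nearby)
```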
What is stored permanently
Keyframe images (corpus/video_frames/{id}/frame_NNNN.png): one PNG per 30 seconds. Viewable in the Source Library. Never deleted.
Frame manifest cache (corpus/video_frames/{id}/manifest.json): Haiku and Opus results per frame. Re-ingesting the same URL skips all API calls.
Video metadata (corpus/video_frames/{id}/meta.json): title and original URL saved from yt-dlp for reference.
Knowledge chunks (corpus/knowledge.json + ChromaDB): transcript and keyframe descriptions as indexed chunks, in the same format as PDFs.
The raw video file is not kept. Only the screenshots and extracted text are stored.
The original URL is preserved in the Source Library so the video can always be re-downloaded if needed.
Cost and time — 1-hour video
| Step | Tool | Time | Cost |
| --- | --- | --- | --- |
| Download | yt-dlp | 1–3 min | Free |
| Transcription | Whisper (local CPU) | 25–40 min | Free |
| Keyframe extraction | OpenCV | < 1 min | Free |
| Frame classification (~120 frames) | Claude Haiku | 3–5 min | ~$0.05–0.10 |
| Frame description (~20–40 frames) | Claude Opus | 5–10 min | ~$1.50–3.00 |
| Total (1-hour video) | | ~35–60 min | ~$1.50–3.00 |
Re-ingestion is nearly free. If you ingest the same video again
(e.g. to update tags), the manifest cache means all Haiku and Opus calls are skipped.
Only the Whisper transcription re-runs.
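The cache behaviour can be as simple as reading manifest.json before making any API call. The schema below is an assumption about the file's shape, not its documented format.

```python
import json
from pathlib import Path

def load_manifest(frames_dir: str) -> dict:
    """Load cached Haiku/Opus results for this video, keyed by frame filename (assumed schema)."""
    path = Path(frames_dir) / "manifest.json"
    return json.loads(path.read_text()) if path.exists() else {}

def describe_with_cache(frames_dir: str, frame_name: str, describe_fn) -> str:
    """Reuse a cached description when the frame was already processed; otherwise call the model."""
    manifest = load_manifest(frames_dir)
    entry = manifest.get(frame_name, {})
    if "description" in entry:
        return entry["description"]           # cache hit: no Haiku/Opus call needed
    description = describe_fn(frame_name)     # cache miss: call the model and record the result
    manifest[frame_name] = {**entry, "description": description}
    (Path(frames_dir) / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return description
```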
Other source types — how they differ
PDFs, images, and web URLs follow a simpler version of the same pipeline.
The key difference is that video is the only source type that combines
two independent signals — audio transcription and visual description — into one document.
| Source | Text extraction | Visual description | Typical cost |
| --- | --- | --- | --- |
| PDF (digital) | PyMuPDF (free) | Haiku filter + Opus for significant images | $0.50–3.00 / 200-page book |
| PDF (scanned) | Haiku OCR per page | Haiku filter + Opus for significant images | $1.00–5.00 / 200-page book |
| Image / inscription | Opus Vision OCR | Included in OCR step | ~$0.01–0.05 / image |
| Web URL | HTML scraper (free) | None | Free |
| Video (1 hr) | Whisper transcription (free) | Haiku filter + Opus for significant frames | $1.50–3.00 |