Project Roadmap

Vision Doc — Agent J v1.0 · April 2026
Multi-agent architecture vision · Design principles & decisions · Architecture assessment Q&A (7 questions)
Corpus Completeness 35%
Milestones: 0% · 30% (current) · 70% (Phase 2 unlocks) · 90% (Phase 3 unlocks) · 100% ✓
Phase 1 Corpus Building & Visual Training Foundation In Progress
Upload critical Pallava PDFs
Minakshi, Sastri, Rea, Jouveau-Dubreuil and other primary sources
Tag all sources in Source Library
Tag Pallava-relevant files as "pallava" — already doing this
Add author names in Description field
Feeds the References section in the About Pallavas page
Build Indian Visual Training Repository (Gallery)
536 Pallava sculptural art images from wisdomlib.org crawled, Vision-described, and stored in indian_visual_gallery ChromaDB collection. Gallery UI at /gallery-ui.
Complete wisdomlib gallery crawl (all 536 images)
Run: py -3 crawl_wisdomlib_gallery.py — completes training corpus for Pallava visual recognition. ~$1.50 with Sonnet.
Consolidate shared code into corpus_engine.py
Remove duplicated Vision prompt, parser, and registry I/O from gallery.py and crawl script. Single source of truth before adding new features.
Modify pdf_ingester.py to save page images on first ingest
Save rendered page PNGs to corpus/pdf_pages/{fingerprint}/ during Pass 2. Eliminates need to re-ingest PDFs when requirements evolve. Zero API cost — PyMuPDF renders locally.
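A minimal sketch of the save step, assuming the Pass 2 page loop shape — the manifest here records only page numbers and filenames; the full version per Backlog #9 also records has_image, description, and caption:

    import json
    import pathlib
    import fitz  # PyMuPDF — renders locally, zero API cost

    def save_page_images(pdf_path: str, fingerprint: str, out_root: str = "corpus/pdf_pages"):
        out_dir = pathlib.Path(out_root) / fingerprint
        out_dir.mkdir(parents=True, exist_ok=True)
        manifest = []
        doc = fitz.open(pdf_path)
        for page_num, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=150)            # rasterise the page locally
            png_path = out_dir / f"page_{page_num:03d}.png"
            pix.save(str(png_path))
            manifest.append({"page_num": page_num, "file": png_path.name})
        (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")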
Upload secondary and cross-corpus sources
Bhagavatam, general Hindu history, architecture references, etc.
Note authoritative sources for weighting
Mentally note which books are high-quality — you'll assign star ratings in Phase 3
Switch image describer from Opus → Sonnet
Once critical PDFs are ingested — saves ~80% API cost per PDF
Phase 2 Visual Recognition & Retrieval Infrastructure Unlocks at 70%
POST /identify — multimodal image recognition endpoint
Upgrade /identify-temple to handle all Indian art. Upload image → Claude Vision identifies dynasty/style/period → searches indian_visual_gallery for matching training images → searches pallava_knowledge for relevant PDF passages → synthesized scholarly response with sources.
Register PDF page images in gallery
After pdf_ingester.py saves page PNGs: run registration script to add image entries into image_registry.json from saved pdf_pages/ folder. Links PDF scholarly text with visual images.
Identify panel in gallery UI
Upload box in gallery.html → POST /identify → results panel showing: dynasty/style card, matched training images grid, relevant PDF passages, full synthesis. Works for any uploaded photo.
Tag filter in retriever
Add "Pallava Sources Only" mode to Ask + Research dropdowns. ChromaDB already stores tags as a comma-separated string per chunk (field: tags, written by knowledge_store.py on every ingest). Implementation: ChromaDB's metadata where filter supports only exact-match operators ($eq, $in, etc.); $contains applies to where_document, not metadata, so a comma-separated tags string can't be filtered server-side. Instead, over-fetch from the collection.query() call in rag/retriever.py and keep only chunks whose tags contain "pallava" (sketch below). UI: a mode toggle on the Ask panel (index.html) and Research page. No re-ingestion needed — tags are already in ChromaDB metadata for all existing chunks.
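A sketch of that post-query filter, assuming the standard collection.query() result shape; parsing the comma-separated tags string is the only logic added:

    def filter_by_tag(results: dict, tag: str, n_results: int) -> dict:
        # Keep only chunks whose comma-separated tags field contains the tag
        keep = [
            i for i, meta in enumerate(results["metadatas"][0])
            if tag in [t.strip() for t in (meta.get("tags") or "").split(",")]
        ][:n_results]
        return {
            key: [[results[key][0][i] for i in keep]]
            for key in ("ids", "documents", "metadatas", "distances")
            if results.get(key)
        }

Call collection.query() with an inflated n_results (e.g. 4×) so enough Pallava-tagged chunks survive the filter.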
Source weighting (1–5 stars)
Assign authority ratings per source. Add a weight field (1.0–5.0) to source_library.py catalog entries. Editable via a star-rating control in Library UI (library.html). At query time, rag/retriever.py multiplies each chunk's similarity score by its source weight before ranking — same pattern as the existing author_precedence.json boost. Weight defaults to 3.0 (neutral) for all existing sources. Requires the Retroactive ChromaDB patch below to propagate weights to existing chunks.
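A minimal re-ranking sketch, assuming cosine distances from ChromaDB (similarity ≈ 1 − distance) and a {fingerprint: weight} dict loaded from the catalog; both shapes are assumptions, not the existing retriever code:

    def rank_with_weights(results: dict, weights: dict) -> list:
        scored = []
        for doc, meta, dist in zip(results["documents"][0],
                                   results["metadatas"][0],
                                   results["distances"][0]):
            weight = weights.get(meta.get("fingerprint", ""), 3.0)  # 3.0 = neutral default
            scored.append(((1.0 - dist) * weight, doc, meta))
        return sorted(scored, key=lambda item: item[0], reverse=True)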
Retroactive ChromaDB patch
After star weights are assigned, backfill weight and tags into all existing ChromaDB chunks — no re-ingestion, no API cost. ChromaDB supports in-place metadata updates via collection.update(ids=[...], metadatas=[...]). Script: load source_library/catalog.json, group chunk IDs by fingerprint (already stored in chunk ID as know_{fingerprint}_chunk_N), call collection.update() in batches. Run once after weights are assigned; all future ingests automatically include the weight at index time.
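A sketch of that one-time script — collection.get()/update() are real ChromaDB calls, but the catalog.json entry shape (fingerprint, weight, tags fields) is assumed. Existing metadata is merged rather than replaced so other fields survive:

    import json
    import chromadb

    client = chromadb.PersistentClient(path="corpus/chroma_db")
    col = client.get_collection("pallava_knowledge")
    with open("source_library/catalog.json", encoding="utf-8") as f:
        catalog = json.load(f)

    # {fingerprint: patch} built from the catalog (field names assumed)
    patches = {e["fingerprint"]: {"weight": e.get("weight", 3.0), "tags": e.get("tags", "")}
               for e in catalog}

    all_ids = col.get()["ids"]  # IDs look like know_{fingerprint}_chunk_N
    for start in range(0, len(all_ids), 500):  # batched in-place updates
        batch = all_ids[start:start + 500]
        existing = col.get(ids=batch)["metadatas"]
        metas = []
        for cid, meta in zip(batch, existing):
            fp = cid.removeprefix("know_").split("_chunk_")[0]
            merged = dict(meta or {})
            merged.update(patches.get(fp, {}))  # merge, don't drop other fields
            metas.append(merged)
        col.update(ids=batch, metadatas=metas)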
Conflict surfacing in research output
Flag when sources disagree with a visible warning; present both views
Expand visual corpus to other Indian dynasties
Add Chola, Gupta, Chalukya, Hoysala, Pala galleries using the same crawl infrastructure. Dynasty is a first-class tag — gallery and /identify handle all dynasties automatically.
Phase 3 Knowledge Refinement & Cross-Modal Linking Unlocks at 90%
Assign star weights to all sources
Rate each book 1–5 based on authority and Pallava relevance
Pin trusted sources to About Pallavas sections
e.g. Art & Architecture section always prefers Jouveau-Dubreuil
Update inscription collection to multilingual embeddings
Backlog #6: update pallava_inscriptions to intfloat/multilingual-e5-large. Delete old collection, run py -3 vector_store.py --build. Only 9 entries — fast.
Regenerate all Wikipedia sections
Use improved retriever with weights + tag filter for better quality
Cross-corpus research queries
e.g. How Pallava sculptures relate to Bhagavatam stories
GET /gallery/image/{key}/related — cross-link images to corpus
For any gallery image, return related passages from pallava_knowledge ChromaDB. Clicking an image in gallery shows matching scholarly text from PDFs alongside it.
Video keyframe analysis
Extract keyframes from uploaded video → run each frame through POST /identify → deduplicate → return annotated timeline. Same pipeline, no new Vision logic needed.
Surface video keyframe images in research papers
Currently video ingestion captures keyframe PNGs to corpus/video_frames/{fingerprint}/frame_NNNN.png and stores scholarly descriptions as text chunks in ChromaDB — but the research paper image system ({IMAGE:label:page} markers) only knows how to render PDF pages. Video frames never appear as images in generated papers.

What to build: A new API endpoint GET /research/video-frame?source_id=&frame=NNNN that serves PNGs from corpus/video_frames/, alongside a new marker format {VIDEOFRAME:source_label:frame_index} embedded in [KEYFRAME] chunks at ingest time. The research UI renders both marker types — PDF pages and video frames — inline in the paper.
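A sketch of the marker expansion on the rendering side — the endpoint path follows the spec above; the mapping from source_label to source_id and the <img> output are illustrative:

    import re

    VIDEOFRAME_RE = re.compile(r"\{VIDEOFRAME:([^:}]+):(\d+)\}")

    def expand_videoframe_markers(paper_text: str) -> str:
        def to_img(match: re.Match) -> str:
            label, frame = match.group(1), int(match.group(2))
            url = f"/research/video-frame?source_id={label}&frame={frame:04d}"
            return f'<img src="{url}" alt="keyframe {frame} of {label}">'
        return VIDEOFRAME_RE.sub(to_img, paper_text)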

Why Phase 3: Only meaningful once multiple videos are ingested. A single video with thin corpus adds limited value — with 5–10 videos, keyframe images significantly enrich research paper output.
Populate Unicode evidence counts from inscription corpus
Wire up the evidence_count field in corpus/pallava_pua_chart.json by scanning corpus/corpus.json inscription entries and counting how many inscriptions each Pallava character is attested in. A character with ≥3 attestations becomes "Confirmed" on the Unicode Initiative page. Currently all 56 characters show evidence_count=0 — the field exists but was never populated. Required for the Unicode submission to the Unicode Technical Committee. Run after inscription corpus is substantially complete.
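A sketch of that scan — the field names in corpus.json and pallava_pua_chart.json (type, text, characters, char, evidence_count) are assumptions about those files' shape:

    import json

    with open("corpus/corpus.json", encoding="utf-8") as f:
        corpus = json.load(f)
    with open("corpus/pallava_pua_chart.json", encoding="utf-8") as f:
        chart = json.load(f)

    inscriptions = [e["text"] for e in corpus if e.get("type") == "inscription"]

    for entry in chart["characters"]:
        glyph = entry["char"]  # the PUA codepoint for this Pallava character
        entry["evidence_count"] = sum(1 for text in inscriptions if glyph in text)
        if entry["evidence_count"] >= 3:
            entry["status"] = "Confirmed"  # per the ≥3 attestation rule above

    with open("corpus/pallava_pua_chart.json", "w", encoding="utf-8") as f:
        json.dump(chart, f, ensure_ascii=False, indent=2)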
Phase 4 Research & Public Output Post Corpus
Generate full research papers
Comprehensive scholarly papers from complete corpus — text + images + citations
Build PowerPoint presentations
Themed presentations with gallery images + scholarly captions from KB
Export polished Word documents
Final research papers with embedded images and full references
Add PDF export to Research page
Export generated research papers as downloadable PDFs directly from the Research UI.

Current state: Word (.docx) and PowerPoint (.pptx) export already work. docx2pdf is already in requirements.txt — it converts the generated .docx to PDF via Microsoft Word's COM interface.

Windows (dev machine): Use existing docx2pdf — calls Microsoft Word COM, requires Word to be installed. Works on Windows only.

⚠️ EC2 Linux limitation: docx2pdf will not work on Linux — there is no Microsoft Word COM interface. On EC2, replace with either:
  • LibreOffice headless — libreoffice --headless --convert-to pdf file.docx (free, good fidelity)
  • weasyprint — generates PDF directly from HTML/CSS (no Word dependency, pip installable)

Recommended approach on EC2: Generate PDF directly from the paper HTML using weasyprint — avoids the Word → PDF conversion step entirely and works cross-platform.
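A minimal sketch of that path, assuming the Research page's existing paper HTML is available as a string — weasyprint's HTML(...).write_pdf() is its standard entry point:

    from weasyprint import HTML

    def export_pdf(paper_html: str, out_path: str = "research_paper.pdf") -> str:
        HTML(string=paper_html).write_pdf(out_path)  # no Word/LibreOffice dependency
        return out_path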

Defer until: Corpus is ~90%+ complete and research output is polished enough to share externally. Word export covers interim needs.
Add copyright_status to Source Library
Backlog #7: add copyright_status field to source_library.py. Required before public website launch. Values: owned / licensed / public_domain / attributed_third_party / copyright_unknown.
Replace third-party photos with original photographs
Long-term goal per CLAUDE.md: all third-party gallery images to be replaced with researcher's own photos. Track replacement status per image in registry.
Re-transcribe all videos using Whisper small model with GPU acceleration
Why: Currently all videos >20 minutes are transcribed using Whisper base model (140 MB) on CPU. This is a speed compromise — base is ~3× faster on CPU than small, but produces noticeably weaker accuracy for multilingual narration (Tamil, Telugu, Sanskrit) which is common in Pallava archaeology lectures.

When: After EC2 migration. Use a g4dn.xlarge spot instance (NVIDIA T4, CUDA 12) — no need for a permanent GPU server. Spin it up for the batch, run all re-transcriptions, then terminate. Cost: ~$0.22/hr on spot (~$1–2 for a full batch of 10 videos). While stopped, you only pay ~$12/month EBS storage. On GPU, Whisper small transcribes a 1-hour video in ~15–20 min vs 2+ hours on CPU.

Code change (one line): In ingest/video_ingester.py, change:
    WHISPER_MODEL_LONG = "base"  →  WHISPER_MODEL_LONG = "small"

Re-ingest all videos: Go to Source Library → filter by type: video → for each video source, call POST /admin/retry/{id}. This replaces existing transcript chunks with higher-quality ones. Keyframe manifests (corpus/video_frames/{fingerprint}/manifest.json) are preserved — no Vision API cost is incurred, only Whisper re-runs.

Estimated effort: One-time batch operation. ~15–20 min per 1-hour video on GPU (vs 2+ hours on CPU). If you have 10 videos, plan for a 3–4 hour background job. No manual review needed — chunks automatically replace old ones in ChromaDB.
Phase 5 Automated Knowledge Validation Loop Post Multi-Agent
Design & curate the test question set
Write 50+ natural language questions covering all 7 dynasties, inscription types, cross-dynasty comparisons, corpus gap probes, and contradiction probes. This is 1–2 hours of manual work and must be done by you — the query agent fires these questions, so the quality of the test set determines the quality of the validation.
Build Query Agent + Review Agent
Query Agent fires all test questions sequentially to the History Agent via the multi-agent system. Review Agent independently evaluates each answer against corpus chunks — checking citation presence, citation accuracy, hallucination, and ideological bias. Neither agent evaluates its own output. Review Agent produces a per-question scorecard (pass / flag / fail + reason).
Run validation loop — up to 3 rounds
Round 1: fire all N questions, collect scorecards. Between rounds: review flagged/failed questions, fill corpus gaps or refine prompts. Rounds 2–3: re-run only failed questions. Exit when Review Agent passes ≥90% of questions or max 3 rounds reached — whichever comes first. Force-exit after round 3 regardless, with a gap report.
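A control-flow sketch of the loop only — ask() stands in for the Query Agent → History Agent call and review() for the Review Agent scorecard; neither is a real API:

    def run_validation(questions, ask, review, pass_threshold=0.90, max_rounds=3):
        report = {}                    # question -> "pass" | "flag" | "fail"
        pending = list(questions)
        for round_no in range(1, max_rounds + 1):
            failed = []
            for q in pending:
                verdict = review(q, ask(q))
                report[q] = verdict
                if verdict != "pass":  # flagged and failed both re-run
                    failed.append(q)
            passed = sum(v == "pass" for v in report.values())
            if passed / len(questions) >= pass_threshold:
                break                  # exit criterion: ≥90% pass
            pending = failed           # rounds 2-3 re-run only failed questions
        return report                  # after round 3 this feeds the gap report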
Human review & sign-off
Review the consolidated validation report. If positive: loop ends, system is production-ready. If negative with specific feedback: one final manual round of corpus gap-filling or prompt adjustment, then re-run. All findings, gaps, improvements, and remaining limitations documented as the Phase 5 exit deliverable.
Longer-Term Ideas (not yet planned)
Source date/period tagging (Early/Imperial Pallava era)
Per-section source pinning in About Pallavas
Research paper versioning
Gap detection (flag thin corpus areas)
Cross-reference conflict map
Inscription ↔ secondary source linking
Mobile app (iOS/Android) using existing FastAPI backend
CLIP visual embeddings for pixel-level image similarity
Multi-dynasty comparison research papers
Public-facing website with copyright-resolved images
Technical Backlog — Pending Implementation
Fixes #1–5 — Tamil & Indic script support [Completed Mar 2026]
Context: Identified during testing with Pallaṉkōvil copper plates. All 5 fixes applied in one pass. Re-index completed — 4,991 chunks now embedded with multilingual model.

  1. ✅ Fix #1 — Multilingual embedding model · knowledge_store.py + rag/retriever.py
    Replaced English-only multi-qa-mpnet-base-dot-v1 with intfloat/multilingual-e5-large (94 languages: Tamil, Telugu, Kannada, Malayalam, Sanskrit, Bengali, Gujarati, Grantha). Re-index via POST /admin/reindex required — completed.
  2. ✅ Fix #2 — Legacy Indic PDF encoding detection · ingest/pdf_ingester.py
    Pre-2000 Tamil/Telugu/Kannada/Malayalam PDFs use legacy proprietary encodings (TSCII, TAB, ISCII, GIST-SUDA, GIST-DV, etc.) — PyMuPDF extracts them as extended-ASCII garbage (U+0080–U+00FF). Now detected by checking: non-empty text + no proper Indic Unicode across all 8 scripts + >30% chars in 0x80–0xFF range → page is forced through Claude Vision OCR instead. Applies to all Indic scripts, not just Tamil. An illustrative sketch of this heuristic follows the list.
  3. ✅ Fix #3 — Daṇḍa sentence terminators · chunker.py
    Tamil and Sanskrit texts use । (U+0964) and ॥ (U+0965) as sentence terminators, not the Western full stop. Without this, entire Tamil paragraphs were treated as one sentence, creating oversized chunks. Now split correctly.
  4. ✅ Fix #4 — Post-OCR script identity logging · ingest/pdf_ingester.py
    After Claude Vision OCR, server now logs whether output contains: Indic Unicode (native script), IAST/Latin (romanised transliteration), or unrecognised script. Logging only — no behaviour change. Gives visibility into OCR output quality per page.
  5. ✅ Fix #5 — Lacunose marker rule in RAG prompt · rag/answerer.py
    Scholarly editions mark damaged text with [?], [...], [ * ]. Rule #9 added to RAG system prompt: Opus must flag these as editorial conjectures rather than treating them as certain readings.
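An illustrative reconstruction of the Fix #2 heuristic (referenced in item 2 above) — not the actual pdf_ingester.py code; the ranges cover the eight scripts named in Fix #1:

    INDIC_RANGES = [
        (0x0900, 0x097F),    # Devanagari (Sanskrit)
        (0x0980, 0x09FF),    # Bengali
        (0x0A80, 0x0AFF),    # Gujarati
        (0x0B80, 0x0BFF),    # Tamil
        (0x0C00, 0x0C7F),    # Telugu
        (0x0C80, 0x0CFF),    # Kannada
        (0x0D00, 0x0D7F),    # Malayalam
        (0x11300, 0x1137F),  # Grantha
    ]

    def looks_like_legacy_encoding(text: str) -> bool:
        if not text.strip():
            return False  # empty pages go through the normal image-page path
        has_indic = any(lo <= ord(ch) <= hi for ch in text for lo, hi in INDIC_RANGES)
        high_bytes = sum(1 for ch in text if 0x80 <= ord(ch) <= 0xFF)
        return not has_indic and high_bytes / len(text) > 0.30  # force Vision OCR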
🔧 Backlog #6 — Inscription collection embedding model [Phase 3]
Why: pallava_inscriptions still uses ChromaDB's English-only default model (all-MiniLM-L6-v2). Tamil/Sanskrit-script queries won't retrieve inscription results well until this is updated to match the knowledge collection.

Steps to implement:
  1. Open vector_store.py — find where pallava_inscriptions collection is created/opened
  2. Add explicit embedding function: SentenceTransformerEmbeddingFunction(model_name="intfloat/multilingual-e5-large") (see the sketch after these steps)
  3. Delete the old corpus/chroma_db/pallava_inscriptions folder (incompatible vectors)
  4. Run py -3 vector_store.py --build to re-embed all 9 inscriptions (~1 min)
  5. Restart the server
  6. Test: ask a Tamil or Sanskrit question about inscriptions — should now return results
Only 9 entries — fast when ready. Do alongside any inscription corpus expansion.
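A sketch of the collection setup from steps 2–4 — SentenceTransformerEmbeddingFunction is chromadb's bundled SentenceTransformer wrapper; exactly where vector_store.py opens the collection is assumed:

    import chromadb
    from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

    ef = SentenceTransformerEmbeddingFunction(model_name="intfloat/multilingual-e5-large")
    client = chromadb.PersistentClient(path="corpus/chroma_db")
    # Only valid after the old English-vector folder is deleted (step 3)
    col = client.get_or_create_collection("pallava_inscriptions", embedding_function=ef)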
📋 Backlog #7 — Copyright status field in Source Library [Before Phase 4 launch]
Add copyright_status field to source_library.py catalog. Required before public website. Values: owned / licensed / public_domain / attributed_third_party / copyright_unknown.
🔧 Backlog #8 — Code consolidation: corpus_engine.py [Phase 1 — before new features]
Why: ART_VISION_PROMPT, _parse_vision(), and registry I/O are currently duplicated in both crawl_wisdomlib_gallery.py and app/routers/gallery.py. gallery.py also creates its own Anthropic client instead of using the shared dependency. All future features (POST /identify, video analysis, PDF image registration) depend on this shared code.

What to create: corpus_engine.py — single source of truth with: ART_VISION_PROMPT · parse_vision() · load_image_registry() · save_image_registry() · get_gallery_collection() · search_gallery() · describe_image().
Then refactor gallery.py and crawl script to import from it. Must be done before adding POST /identify or video support — otherwise these features will be built on duplicated foundations.
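A skeleton sketch of the module — signatures are assumptions from the function list above; the bodies move verbatim out of gallery.py and the crawl script:

    import json
    import pathlib

    REGISTRY_PATH = pathlib.Path("corpus/image_registry.json")  # assumed location

    ART_VISION_PROMPT = "..."  # moved from gallery.py / crawl_wisdomlib_gallery.py

    def load_image_registry() -> dict:
        if not REGISTRY_PATH.exists():
            return {}
        return json.loads(REGISTRY_PATH.read_text(encoding="utf-8"))

    def save_image_registry(registry: dict) -> None:
        REGISTRY_PATH.write_text(json.dumps(registry, ensure_ascii=False, indent=2),
                                 encoding="utf-8")

    def parse_vision(raw: str) -> dict: ...               # shared Vision-output parser
    def describe_image(image_bytes: bytes) -> dict: ...   # one Vision call + parse_vision()
    def get_gallery_collection(): ...                     # opens indian_visual_gallery
    def search_gallery(query: str, n: int = 8): ...       # similarity search over gallery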
🔧 Backlog #9 — PDF page image saving (no-re-ingest strategy) ✅ Completed April 2026
Why: Every time a new metadata requirement emerges, PDFs are being re-ingested through Claude Vision (Opus ~$3–6 per PDF). PyMuPDF can render pages to PNG locally for free — if saved on first ingest, all future metadata changes are free patches.

What to modify: ingest/pdf_ingester.py Pass 2 — after rendering each page (already done for classify+describe), save PNG to corpus/pdf_pages/{fingerprint}/page_NNN.png and write manifest.json recording page_num, has_image, description, caption.

Result: Re-running Vision on new prompts, adding gallery entries, changing chunking — all free. Pay Claude once per PDF, never again. Disk: ~50–150 MB per 200-page PDF. One-time re-ingest of critical PDFs to populate saved pages, then no more re-ingests.
Two-stage ingestion — Upload Only + Ingest Later Completed April 2026
What was built: POST /ingest/upload (Stage 1 — register file in library, no KB processing) and POST /admin/ingest/{id} (Stage 2 — trigger background ingest for pending files). POST /admin/reset-pending bulk-resets ingestion_status without deleting KB chunks. Library UI: green ⬇ Ingest Now button per pending row + Ingest All Pending bulk button. Home page upload form: radio toggle for Upload & Ingest Now vs Upload Only.
Conventions & Reminders
Tag Pallava files as pallava in Source Library
Add author name in the Description field when uploading — appears in References section
Use Opus for critical PDFs (Minakshi, Sastri, Rea); switch to Sonnet for secondary sources
After Phase 2 is built — set retriever default to Pallava Sources Only
🗺 Periodically update the architecture diagram — run py -3 create_project_docs.py and ask Claude to regenerate architecture.svg whenever new routers, ingestion types, or major features are added
Fixes #1–5 (Indic script support) completed Mar 2026 — multilingual-e5-large re-index done (4,991 chunks). Ready to upload Tamil/Telugu/Kannada documents.
Progress and checkboxes are saved in your browser.