Engineering Challenges & Resolutions
A chronicle of hard problems encountered building the Pallava Knowledge Base & Agent J system — and how each was solved.
Maintained by Sidda Jagadeesh Donthi Siddappa · Project OM · Last updated April 2026
Purpose of this document: This page records every significant engineering challenge encountered during the build — the symptom, root cause, failed attempts, and final resolution. It is intended to demonstrate the depth and complexity of the project to stakeholders, and to serve as a reference for the team when similar problems arise in future.
16 — Total Challenges Documented
4,400× — Git Size Reduction (8.17 GiB → 1.87 MiB)
What Was Happening — Token Budget Exhausted
Before fix (max_tokens = 4,096): internal reasoning (extended thinking) consumed the entire budget — no tokens left for output, so the call returned an empty result with no error.
After fix (max_tokens = 32,000): extended thinking (internal, not visible) fits within the budget, leaving room for the actual description output ✓.
Symptom
Claude Vision API calls for PDF page description returned empty strings with no error. The API returned HTTP 200 (success) but the text content was blank. Entire inscription pages were silently skipped during ingestion.
Root Cause
Claude's extended thinking (adaptive reasoning) consumes tokens internally before emitting any text. With the default max_tokens = 4096, the model exhausted the entire budget on internal reasoning and had zero tokens left to produce the description. No exception was raised — the call silently succeeded with empty output.
Resolution
Set max_tokens ≥ 32,000 in all Vision API calls — giving extended thinking enough budget while still leaving the majority for actual output. Now a documented invariant in CLAUDE.md.
Why It Matters
Without this fix, every image-heavy page — including stone inscription photographs — would have been ingested as empty chunks. The failure was completely invisible at ingest time; it would only have been discovered later when queries returned no results for known topics.
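A guard along these lines — names and the threshold constant are illustrative, not the project's actual code — turns both failure modes (under-budgeted calls and silently empty responses) into loud errors:

```python
# Hypothetical guard enforcing the max_tokens invariant from CLAUDE.md.
MIN_VISION_MAX_TOKENS = 32_000

def check_vision_call(max_tokens: int, response_text: str) -> str:
    """Reject under-budgeted Vision calls and silently-empty responses."""
    if max_tokens < MIN_VISION_MAX_TOKENS:
        raise ValueError(
            f"max_tokens={max_tokens} is below {MIN_VISION_MAX_TOKENS}; "
            "extended thinking can consume the whole budget and return empty text"
        )
    if not response_text.strip():
        raise RuntimeError("Vision call returned HTTP 200 but empty text")
    return response_text
```

Calling this after every Vision response makes the invariant self-enforcing rather than documentation-only.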
Solution — Two-Path Extraction Pipeline
Before & After
Before — raw extraction
ÔÅçϳ>>ÐÑ×ÃÌÑ
Tamil script from a 1960s academic PDF was extracted as garbage bytes — the font's private encoding was not applied. The knowledge base stored unreadable text as "knowledge."
After — Vision OCR fallback
"The Pallava king Mahendravarman I was a prolific builder of rock-cut temples..."
The same page rendered to PNG and described by Claude Vision. Accurate, readable, correctly indexed.
Why It Matters
The most important primary sources — Epigraphia Indica, South Indian Inscriptions, Minakshi's Pallava dynasty study — are all legacy PDFs mixing Tamil and English. Without this fix, the knowledge base would have been built almost entirely from English secondary sources, severely undermining its scholarly value.
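The routing decision between the two extraction paths could rest on a garble heuristic like the sketch below; the threshold and script ranges are illustrative, not the project's exact logic:

```python
# Illustrative heuristic: decide whether a page's extracted text is usable
# or whether the pipeline should fall back to rendering + Vision OCR.
def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """True when too few characters are ASCII or in the Tamil Unicode block."""
    if not text.strip():
        return True  # empty extraction -> always fall back to Vision OCR
    expected = sum(
        1 for c in text
        if c.isascii() or "\u0B80" <= c <= "\u0BFF"  # ASCII or Tamil block
    )
    return expected / len(text) < (1 - threshold)
```

Pages failing the check are rendered to PNG and sent down the Vision path; the rest keep the cheap raw-extraction path.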
The Problem at a Glance
Word document (broken)
Mahendravarman I built the □□□□□□□ at □□□□□□□□□□.
Every IAST character (ā, ī, ū, ṭ, ḍ, ṇ, ś) rendered as an empty box. Scholarly names became unreadable.
PDF export (fixed)
Mahendravarman I built the maṇḍapa at Māmallapuram.
docx2pdf lets Word render with user-installed Noto Serif font, then embeds all fonts into the PDF. Portable and correct on any device.
Attempted Solutions
Attempt 1 — Georgia font
Partial IAST coverage. ā, ī worked; ṭ, ḍ, ṇ, ṣ still showed boxes.
Attempt 2 — Times New Roman
Same gaps. The font covers the basic Latin Extended ranges but lacks the Latin Extended Additional glyphs (ṭ, ḍ, ṇ, ṣ) that IAST requires.
Attempt 3 — Calibri + w:cs XML attribute
Better — most boxes eliminated. A few rare IAST characters still broken.
Attempt 4 — Noto Serif (user-profile install)
Font has full IAST coverage, but installed to user-profile fonts dir — invisible to python-docx renderer running as a different process context.
Final Fix — Two-layer approach
Layer 1: Prompt Claude to avoid IAST characters entirely — use plain English spellings. Layer 2: Convert .docx → .pdf via docx2pdf (Microsoft Word COM), which uses the user-profile font and embeds it in the PDF. Works on any recipient's machine.
Why It Matters
Pallava king names, temple names, and epigraphical terms are almost all IAST-transliterated Sanskrit or Tamil (Narasimhavarman, maṇḍapa, vimāna). An export system that cannot render these is useless for academic or stakeholder presentations.
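As a belt-and-braces complement to Layer 1, IAST diacriticals can also be folded to plain ASCII in post-processing — a minimal standard-library sketch (a fallback idea, not the project's prompt-based approach):

```python
import unicodedata

def strip_iast(text: str) -> str:
    """Fold IAST diacriticals to plain ASCII: 'maṇḍapa' -> 'mandapa'."""
    # NFD splits each accented character into base letter + combining mark;
    # dropping the combining marks leaves the plain-English spelling.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

This guarantees no box glyphs can reach the Word layer even if the prompt instruction is ignored.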
The Problem Illustrated
Before — AI fills in gaps
Source passage: "The king [...] defeated the [...] army in the year [...]"
AI response: "King Narasimhavarman I defeated the Chalukya army in the year 642 CE"
Fabricated. The missing sections were physically destroyed on the stone. The AI invented plausible-sounding dates and names.
After — gaps flagged explicitly
AI response: "The inscription records a military victory by the king; however, the identity of the enemy force and the date are physically damaged on the stone — [uncertain — inscription damaged at this point, confidence: 0]"
Honest and safe.
Root Cause & Resolution
Language models are trained to produce fluent, complete text. When presented with fragments containing gaps ([...], ***), the model's probability distribution strongly favours filling them with contextually plausible content — exactly what a scholar must not do.
Resolution: Extended the RAG answerer prompt with an explicit lacunose instruction — "Do NOT reconstruct or speculate about missing content. State: [uncertain — inscription damaged]." Also added grounding validation that flags invented Pallava king names or dates absent from the retrieved chunks.
Why It Matters
In epigraphy, a fabricated reconstruction of a damaged king's name or regnal date can propagate as fact through secondary literature for decades. This is an AI safety problem specific to the humanities — the stakes of hallucination are historical record integrity, not just user inconvenience.
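The grounding validation could be sketched as a check for dates that appear in the answer but in none of the retrieved chunks; the function name and regex below are illustrative, not the project's code:

```python
import re

def ungrounded_years(answer: str, chunks: list[str]) -> list[str]:
    """Return year mentions in the answer absent from every retrieved chunk."""
    source_text = " ".join(chunks)
    years = re.findall(r"\b\d{3,4}\s*CE\b", answer)
    return [y for y in years if y not in source_text]
```

A non-empty return value flags a likely fabricated date for human review; the same pattern extends to king names checked against a known-entities list.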
How the Precedence System Works
Wikipedia (no star) · Secondary book (no star) · Primary monograph (★ Authoritative) → preferred in conflict resolution
Before & After
Before — no weighting
ChromaDB ranks by semantic distance only. A Wikipedia paragraph and a passage from K.A. Nilakanta Sastri's primary monograph appear with equal weight. Research paper presents contradictory regnal dates side-by-side with no guidance.
After — author_precedence.json
Authoritative sources tagged ★ in the passage block. Prompt instructs Claude: "Prefer ★ sources when sources conflict. Present both views and note which authority you prefer." Research papers acknowledge historiographical disagreement properly.
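The tagging step might look like the following sketch; the AUTHOR_PRECEDENCE map mirrors what author_precedence.json could contain (names and format assumed):

```python
# Hypothetical in-memory mirror of author_precedence.json.
AUTHOR_PRECEDENCE = {"K.A. Nilakanta Sastri": "authoritative"}

def tag_passage(author: str, text: str) -> str:
    """Prefix the passage block with a star if the author is authoritative."""
    star = "★ " if AUTHOR_PRECEDENCE.get(author) == "authoritative" else ""
    return f"[{star}{author}] {text}"
```

Claude then sees the ★ marker inline in the retrieved-passage block and can apply the "prefer ★ sources" instruction.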
The Problem Illustrated
Before — chunker splits only on Latin punctuation
Sanskrit verse: "namo bhagavate vāsudevāya। idaṃ śāsanaṃ..."
The daṇḍa (।) is ignored. The chunker splits the verse mid-sentence on the next Latin period — breaking semantic coherence. Retrieved chunks contain half a verse + the start of another.
After — daṇḍa recognised as sentence boundary
The daṇḍa (। U+0964) and double daṇḍa (॥ U+0965) are now treated as sentence terminators alongside . ! ?. Each Sanskrit verse or prose sentence is a clean, semantically complete chunk.
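The chunker change can be sketched as a sentence splitter whose terminator set includes the daṇḍa characters (a minimal sketch, not the full chunker):

```python
import re

# Split after Latin terminators OR danda (U+0964) / double danda (U+0965),
# followed by whitespace. Lookbehind keeps the terminator with its sentence.
SENTENCE_END = re.compile(r"(?<=[.!?\u0964\u0965])\s+")

def split_sentences(text: str) -> list[str]:
    return [s for s in SENTENCE_END.split(text) if s.strip()]
```

Each Sanskrit verse now ends a chunk at its daṇḍa rather than bleeding into the next.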
The Problem
The Pallava Grantha script (6th–7th century CE) has no standardised Unicode block. Existing Grantha Unicode (U+11300) partially covers it, but many Pallava-specific glyph variants are absent from any standard. Without a stable encoding, characters cannot be stored, searched, or compared digitally.
The Five-Layer Solution
Layer 1 — 📷 Glyph Image: stone inscription PNG + SHA-256 fingerprint →
Layer 2 — 🔡 PUA Encoding: U+E800–U+E8FF, 256 Pallava glyphs →
Layer 3 — 𑌗 Grantha Unicode: U+11300–U+1137F, nearest standard →
Layer 4 — Ā IAST Romanisation: ISO 15919 standard, scholarly citation →
Layer 5 — 📝 Translation: Sanskrit + Telugu, via Claude Vision
Why Each Layer Exists
1. Glyph Image — the ground truth: exactly what is carved on stone. Needed for palaeographic research and visual verification. Example: page17_img1.png
2. PUA Encoding — stable digital identity for each glyph variant, enabling exact comparison and dedup even without Unicode standardisation. The basis of the Unicode initiative proposal. Example: U+E831 (ka-variant)
3. Grantha Unicode — interoperability with other Grantha-script tools and databases where an accepted standard mapping exists. Example: 𑌕 U+11315
4. IAST Romanisation — scholarly standard for citation, text search in English, and cross-reference with academic literature. Examples: ka, ṭa, śa, ṇa
5. Translation — Sanskrit (Devanagari) and Telugu translation: the output layer for researchers who cannot read the original script. Example: क + కి (Claude Vision)
Why It Matters
Without this pipeline, Pallava inscriptions could only be stored as images (unsearchable) or as approximate IAST text (losing glyph distinctiveness). The five-layer system is the scholarly infrastructure that makes the entire knowledge base viable as a research tool. The Layer 2 PUA encoding has been submitted as part of a Unicode standardisation proposal.
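One way to carry all five layers for a single glyph is a flat record; the field names and example values below are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class GlyphRecord:
    image_path: str         # Layer 1: ground-truth stone photograph
    image_sha256: str       # Layer 1: fingerprint for exact dedup
    pua_codepoint: str      # Layer 2: stable ID in U+E800-U+E8FF
    grantha_codepoint: str  # Layer 3: nearest standard Grantha mapping
    iast: str               # Layer 4: ISO 15919 romanisation
    translation: str        # Layer 5: Devanagari/Telugu rendering

ka = GlyphRecord("page17_img1.png", "ab12…", "U+E831", "U+11315", "ka", "क")
```

Every downstream feature (search, comparison, citation, translation) reads from the layer it needs without losing the others.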
Infrastructure & Operations
Before & After
Before — print() calls
print("Ingesting chunk 42...")
uvicorn redirects and closes stdout during request handling. All print() output goes to a closed file descriptor — silently discarded. Debugging is impossible.
After — logging module
log = logging.getLogger(__name__)
log.info("Ingesting chunk 42...")
uvicorn's logging integration routes these correctly. Visible in the terminal window during development and in log files on EC2.
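The replacement pattern, sketched minimally (ingest_chunk is a hypothetical function name):

```python
import logging

# Module-level logger that uvicorn's logging config can route — unlike
# print(), which writes to a stdout uvicorn may have redirected and closed.
log = logging.getLogger(__name__)

def ingest_chunk(chunk_id: int) -> None:
    # %-style lazy formatting: the string is only built if the level passes.
    log.info("Ingesting chunk %d...", chunk_id)
```

Handlers and levels are then controlled centrally (terminal in dev, files on EC2) without touching call sites.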
What Was Happening
start_all.bat tries to set the API key and start the agents → cmd.exe's nested-quote parsing breaks the set command → the History Agent, Personal Agent, and Agent J all start with no API key in their environment → every Claude API call fails with "Connection error" — a misleading symptom.
Why It Was Deceptive
The symptom was "Connection error" from Agent J — pointing to network issues. Only by tracing the process tree with Windows WMIC and inspecting each child process's own environment (not the start command) was it discovered that the API key was present in the start command string but never actually set in the child process environment.
Root cause: cmd /k "cd /d "c:\...\pallava-translator" && set ANTHROPIC_API_KEY=..." — the nested double-quotes cause cmd.exe to terminate the quoted string at the first inner ", never reaching the set command.
Resolution — Helper .bat Files
Before — nested quotes (broken)
start "History Agent" cmd /k "cd /d "c:\...\pallava-translator" && set ANTHROPIC_API_KEY=sk-ant-..."
Inner quotes break outer string. set never executes.
After — helper scripts
set ANTHROPIC_API_KEY=sk-ant-... (in parent, once)
start "History Agent" cmd /k "c:\..._start_history.bat"
No nested quotes. Environment variable inherited cleanly by all three child processes.
The Timeout Chain
Agent J (60 s timeout ✗) → calls → Personal Agent (30 s timeout ✗) → calls → History Agent (60 s timeout ✗) → calls → Claude Opus (90–120 s generation)
Every link in the chain had a timeout shorter than Opus's generation time. The outermost timeout fired first, cancelling the request — even though the History Agent was generating correctly.
Fix: All three clients raised to 240 seconds — 2× safety margin over the 90–120s observed generation time.
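A startup sanity check along these lines — the function and names are illustrative, not the project's code — would have caught the misconfiguration before any request was made:

```python
def validate_timeouts(chain: list[tuple[str, float]],
                      downstream_seconds: float,
                      margin: float = 2.0) -> list[str]:
    """Return names of chain links whose HTTP timeout is too tight.

    Every client's timeout must exceed the slowest downstream step
    (here, Opus generation) by the given safety margin.
    """
    floor = downstream_seconds * margin
    return [name for name, timeout in chain if timeout < floor]
```

With the observed 120 s worst case and a 2× margin, the pre-fix values all fail and the post-fix 240 s value passes.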
Root Cause & Fix
reload=True (broken)
On file change, uvicorn spawns a new worker forked from the watchfiles process — not from the original shell. It does not inherit ANTHROPIC_API_KEY set via setlocal in the .bat file. Every code change → Claude API fails.
reload=False (fixed)
Single process, inherits environment from the parent shell at startup and keeps it forever. For development changes, agents are restarted manually via _start_*.bat. Now a documented invariant in CLAUDE.md.
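The invariant can be pinned in the entry point itself; the module path and options below are assumptions, not the project's actual file:

```python
# Entry-point sketch. reload=False is the documented invariant: a single
# worker inherits ANTHROPIC_API_KEY from the parent shell and keeps it.
UVICORN_OPTS = dict(host="127.0.0.1", port=8000, reload=False)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("history_agent.app:app", **UVICORN_OPTS)
```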
The Size Reduction
Before: 8,371 PNG files (corpus/pdf_pages/) — 8.17 GiB of git history.
After: 1.87 MiB (4,400× smaller).
Why Standard Fixes Didn't Work
Once a file is committed to git, it lives in the pack objects permanently — even if you later add it to .gitignore and run git rm --cached. The working tree entry is removed, but every clone and push still transfers the full blob history.
git filter-branch rewrote history but left orphaned objects; garbage collection had not run yet, so pushes still transferred 8 GiB.
Final fix: Deleted the entire .git directory, ran git init, staged only the 165 code files, and pushed a fresh single-commit history. 1.87 MiB. The PNG files remain on disk — they are regenerated on demand during ingestion and have no place in version control.
How Git LFS Works
chroma.sqlite3 (197 MB) rejected by GitHub → Git LFS (.gitattributes: track "chroma_db/**") → GitHub LFS server stores 636 MB, leaving only a tiny pointer file in git → git clone ✓ auto-downloads the LFS objects.
EC2 Impact
Before LFS: EC2 deployment required manually copying 241 MB of ChromaDB files via scp after cloning — a fragile, error-prone step. After LFS: git clone fetches everything automatically. The app is fully operational immediately after clone, with no manual file transfers for the vector store.
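For reference, running git lfs track "chroma_db/**" records the rule in .gitattributes as a line like:

```
chroma_db/** filter=lfs diff=lfs merge=lfs -text
```

Committing .gitattributes alongside the tracked files ensures every clone applies the same LFS filtering.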
Before & After
Before — same hash strategy for all types
SHA-256 of first 64 KB for everything. Two scanned PDFs with identical publisher boilerplate cover pages → same hash → second book silently skipped as "already ingested."
Also: re-ingesting a URL whose Wikipedia content had been updated → same URL content hash (stale) → update blocked.
After — strategy per source type
Files (PDF, DOCX): SHA-256 of the full binary — no more cover-page collisions.
URLs: SHA-256 of the URL string itself — stable key, forces explicit skip_if_exists=False to update a URL source. Prevents accidental re-ingestion of unchanged content while allowing intentional updates.
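The per-type strategy can be sketched in a few lines (dedup_key is a hypothetical name for illustration):

```python
import hashlib

def dedup_key(source_type: str, payload) -> str:
    """payload is the URL string for 'url' sources, raw bytes for files."""
    if source_type == "url":
        # Key is the URL itself — stable across content updates.
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Key is the FULL binary — identical cover pages no longer collide.
    return hashlib.sha256(payload).hexdigest()
```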
Hub-and-Spoke Architecture (Enforced at Code Level)
Agent J — Master Orchestrator (Port 8002)
→ allowed: History Agent — Pallava KB (Port 8000)
→ allowed: Personal Agent — Gmail + Tasks (Port 8001)
Gmail OAuth (sensitive credential) is reachable only from the Personal Agent. Every other inter-agent call is blocked — no import path exists.
Why This Matters
The History Agent is intended to be public-facing eventually. A user could craft a malicious ingestion source (prompt injection via a poisoned PDF or URL) that instructs the History Agent to perform actions on their behalf. If the History Agent could directly call the Personal Agent — which has Gmail access — a prompt injection attack could silently send emails.
The hub-and-spoke topology means the History Agent has no code path to reach Gmail. Even if fully compromised, it cannot reach the Personal Agent's tools. The security boundary is enforced at the import level, not just by configuration.
Before & After
Risk — hardcoded Windows path
C:\Users\siddi\OneDrive\...\source_library\
Hardcoded into catalog.json entries. Would break silently on EC2 Linux — PDF downloads, page image rendering, and research paper image embedding would all fail with FileNotFoundError.
Fix — env var abstraction
PALLAVA_LIBRARY_DIR = os.environ.get("PALLAVA_LIBRARY_DIR", default_path)
On EC2: set PALLAVA_LIBRARY_DIR=/mnt/pallava-data/source_library/ in Parameter Store. Zero code changes. One-time path migration script rewrites existing catalog entries at deploy time (~15 min).
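A minimal sketch of the resolution helper (DEFAULT_LIBRARY_DIR and the function name are illustrative):

```python
import os
from pathlib import Path

# Illustrative development default; on EC2 the env var takes over.
DEFAULT_LIBRARY_DIR = Path.home() / "source_library"

def library_dir() -> Path:
    """Resolve the source-library root from the environment, else a default."""
    return Path(os.environ.get("PALLAVA_LIBRARY_DIR", str(DEFAULT_LIBRARY_DIR)))
```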
| # | Challenge | Domain | Difficulty | Status |
|---|-----------|--------|------------|--------|
| 01 | Claude Vision max_tokens silent failure | AI / Claude API | Hard | Resolved |
| 02 | Tamil multilingual extraction from legacy PDFs | Text Extraction | Hard | Resolved |
| 03 | IAST diacriticals rendering as boxes in Word | Unicode / Export | Hard | Resolved |
| 04 | RAG hallucination on lacunose inscriptions | AI Safety / RAG | Hard | Resolved |
| 05 | Author precedence for conflicting sources | RAG / Research | Medium | Resolved |
| 06 | Daṇḍa characters breaking chunk boundaries | Chunking / Sanskrit | Medium | Resolved |
| 07 | Five-layer Pallava script encoding pipeline | Unicode / Script | Hard | Resolved |
| 08 | uvicorn stdout closure killing all logging | FastAPI / Ops | Medium | Resolved |
| 09 | API key not inherited by Windows child processes | Windows / Process | Hard | Resolved |
| 10 | httpx.ReadTimeout on Opus paper generation | Multi-Agent / Net | Medium | Resolved |
| 11 | uvicorn reload=True breaking key inheritance | FastAPI / Process | Medium | Resolved |
| 12 | Git history bloated to 8.17 GiB by PDF images | Git / Storage | Hard | Resolved |
| 13 | ChromaDB exceeding GitHub 100 MB file limit | Git LFS | Medium | Resolved |
| 14 | SHA-256 dedup colliding across source types | Data Integrity | Medium | Resolved |
| 15 | Multi-agent security topology design | Architecture | Hard | Resolved |
| 16 | Windows absolute paths for EC2 portability | Portability / AWS | Medium | Resolved |
16 challenges documented. 8 classified Hard, 8 Medium. All resolved. This document will be updated as new challenges are encountered during Phase 2 (corpus expansion), Phase 3 (AWS migration), and Phase 4 (public research output).