Engineering Challenges & Resolutions
A chronicle of hard problems encountered building the Pallava Knowledge Base & Agent J system — and how each was solved.
Maintained by Sidda Jagadeesh Donthi Siddappa · Project OM · Last updated April 2026
Purpose of this document: This page records every significant engineering challenge encountered during the build — the symptom, root cause, failed attempts, and final resolution. It is intended to demonstrate the depth and complexity of the project to stakeholders, and to serve as a reference for the team when similar problems arise in future.
16 — Total Challenges Documented
4,400× — Git Size Reduction (8.17 GiB → 1.87 MiB)
What Was Happening — Token Budget Exhausted
Before fix (max_tokens = 4,096): internal reasoning (extended thinking) consumed the entire budget — no tokens left for output, so the call returned an empty result with no error.
After fix (max_tokens = 32,000): extended thinking (internal, not visible) fits within the budget, leaving room for the actual description output ✓.
Symptom
Claude Vision API calls for PDF page description returned empty strings with no error. The API returned HTTP 200 (success) but the text content was blank. Entire inscription pages were silently skipped during ingestion.
Root Cause
Claude's extended thinking (adaptive reasoning) consumes tokens internally before emitting any text. With the default max_tokens = 4096, the model exhausted the entire budget on internal reasoning and had zero tokens left to produce the description. No exception was raised — the call silently succeeded with empty output.
Resolution
Set max_tokens ≥ 32,000 in all Vision API calls — giving extended thinking enough budget while still leaving the majority for actual output. Now a documented invariant in CLAUDE.md.
Why It Matters
Without this fix, every image-heavy page — including stone inscription photographs — would have been ingested as empty chunks. The failure was completely invisible at ingest time; it would only have been discovered later when queries returned no results for known topics.
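A guard along these lines — names and the threshold constant are illustrative, not the project's actual code — turns both failure modes (under-budgeted calls and silently empty responses) into loud errors:

```python
# Hypothetical guard enforcing the max_tokens invariant from CLAUDE.md.
MIN_VISION_MAX_TOKENS = 32_000

def check_vision_call(max_tokens: int, response_text: str) -> str:
    """Reject under-budgeted Vision calls and silently-empty responses."""
    if max_tokens < MIN_VISION_MAX_TOKENS:
        raise ValueError(
            f"max_tokens={max_tokens} is below {MIN_VISION_MAX_TOKENS}; "
            "extended thinking can consume the whole budget and return empty text"
        )
    if not response_text.strip():
        raise RuntimeError("Vision call returned HTTP 200 but empty text")
    return response_text
```

Calling this after every Vision response makes the invariant self-enforcing rather than documentation-only.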
Solution — Two-Path Extraction Pipeline
Before & After
Before — raw extraction
ÔÅçϳ>>ÐÑ×ÃÌÑ
Tamil script from a 1960s academic PDF was extracted as garbage bytes — the font's private encoding was not applied. The knowledge base stored unreadable text as "knowledge."
After — Vision OCR fallback
"The Pallava king Mahendravarman I was a prolific builder of rock-cut temples..."
The same page rendered to PNG and described by Claude Vision. Accurate, readable, correctly indexed.
Why It Matters
The most important primary sources — Epigraphia Indica, South Indian Inscriptions, Minakshi's Pallava dynasty study — are all legacy PDFs mixing Tamil and English. Without this fix, the knowledge base would have been built almost entirely from English secondary sources, severely undermining its scholarly value.
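The routing decision between the two extraction paths could rest on a garble heuristic like the sketch below; the threshold and script ranges are illustrative, not the project's exact logic:

```python
# Illustrative heuristic: decide whether a page's extracted text is usable
# or whether the pipeline should fall back to rendering + Vision OCR.
def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """True when too few characters are ASCII or in the Tamil Unicode block."""
    if not text.strip():
        return True  # empty extraction -> always fall back to Vision OCR
    expected = sum(
        1 for c in text
        if c.isascii() or "\u0B80" <= c <= "\u0BFF"  # ASCII or Tamil block
    )
    return expected / len(text) < (1 - threshold)
```

Pages failing the check are rendered to PNG and sent down the Vision path; the rest keep the cheap raw-extraction path.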
The Problem at a Glance
Word document (broken)
Mahendravarman I built the □□□□□□□ at □□□□□□□□□□.
Every IAST character (ā, ī, ū, ṭ, ḍ, ṇ, ś) rendered as an empty box. Scholarly names became unreadable.
PDF export (fixed)
Mahendravarman I built the maṇḍapa at Māmallapuram.
docx2pdf lets Word render with user-installed Noto Serif font, then embeds all fonts into the PDF. Portable and correct on any device.
Attempted Solutions
Attempt 1 — Georgia font
Partial IAST coverage. ā, ī worked; ṭ, ḍ, ṇ, ṣ still showed boxes.
Attempt 2 — Times New Roman
Same gaps. The font covers the basic Latin Extended ranges but lacks the Latin Extended Additional glyphs (ṭ, ḍ, ṇ, ṣ) that IAST requires.
Attempt 3 — Calibri + w:cs XML attribute
Better — most boxes eliminated. A few rare IAST characters still broken.
Attempt 4 — Noto Serif (user-profile install)
Font has full IAST coverage, but installed to user-profile fonts dir — invisible to python-docx renderer running as a different process context.
Final Fix — Two-layer approach
Layer 1: Prompt Claude to avoid IAST characters entirely — use plain English spellings. Layer 2: Convert .docx → .pdf via docx2pdf (Microsoft Word COM), which uses the user-profile font and embeds it in the PDF. Works on any recipient's machine.
Why It Matters
Pallava king names, temple names, and epigraphical terms are almost all IAST-transliterated Sanskrit or Tamil (Narasimhavarman, maṇḍapa, vimāna). An export system that cannot render these is useless for academic or stakeholder presentations.
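As a belt-and-braces complement to Layer 1, IAST diacriticals can also be folded to plain ASCII in post-processing — a minimal standard-library sketch (a fallback idea, not the project's prompt-based approach):

```python
import unicodedata

def strip_iast(text: str) -> str:
    """Fold IAST diacriticals to plain ASCII: 'maṇḍapa' -> 'mandapa'."""
    # NFD splits each accented character into base letter + combining mark;
    # dropping the combining marks leaves the plain-English spelling.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

This guarantees no box glyphs can reach the Word layer even if the prompt instruction is ignored.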
The Problem Illustrated
Before — AI fills in gaps
Source passage: "The king [...] defeated the [...] army in the year [...]"
AI response: "King Narasimhavarman I defeated the Chalukya army in the year 642 CE"
Fabricated. The missing sections were physically destroyed on the stone. The AI invented plausible-sounding dates and names.
After — gaps flagged explicitly
AI response: "The inscription records a military victory by the king; however, the identity of the enemy force and the date are physically damaged on the stone — [uncertain — inscription damaged at this point, confidence: 0]"
Honest and safe.
Root Cause & Resolution
Language models are trained to produce fluent, complete text. When presented with fragments containing gaps ([...], ***), the model's probability distribution strongly favours filling them with contextually plausible content — exactly what a scholar must not do.
Resolution: Extended the RAG answerer prompt with an explicit lacunose instruction — "Do NOT reconstruct or speculate about missing content. State: [uncertain — inscription damaged]." Also added grounding validation that flags invented Pallava king names or dates absent from the retrieved chunks.
Why It Matters
In epigraphy, a fabricated reconstruction of a damaged king's name or regnal date can propagate as fact through secondary literature for decades. This is an AI safety problem specific to the humanities — the stakes of hallucination are historical record integrity, not just user inconvenience.
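The grounding validation could be sketched as a check for dates that appear in the answer but in none of the retrieved chunks; the function name and regex below are illustrative, not the project's code:

```python
import re

def ungrounded_years(answer: str, chunks: list[str]) -> list[str]:
    """Return year mentions in the answer absent from every retrieved chunk."""
    source_text = " ".join(chunks)
    years = re.findall(r"\b\d{3,4}\s*CE\b", answer)
    return [y for y in years if y not in source_text]
```

A non-empty return value flags a likely fabricated date for human review; the same pattern extends to king names checked against a known-entities list.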
How the Precedence System Works
Wikipedia (no star) · Secondary book (no star) · Primary monograph (★ Authoritative) → preferred in conflict resolution
Before & After
Before — no weighting
ChromaDB ranks by semantic distance only. A Wikipedia paragraph and a passage from K.A. Nilakanta Sastri's primary monograph appear with equal weight. Research paper presents contradictory regnal dates side-by-side with no guidance.
After — author_precedence.json
Authoritative sources tagged ★ in the passage block. Prompt instructs Claude: "Prefer ★ sources when sources conflict. Present both views and note which authority you prefer." Research papers acknowledge historiographical disagreement properly.
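The tagging step might look like the following sketch; the AUTHOR_PRECEDENCE map mirrors what author_precedence.json could contain (names and format assumed):

```python
# Hypothetical in-memory mirror of author_precedence.json.
AUTHOR_PRECEDENCE = {"K.A. Nilakanta Sastri": "authoritative"}

def tag_passage(author: str, text: str) -> str:
    """Prefix the passage block with a star if the author is authoritative."""
    star = "★ " if AUTHOR_PRECEDENCE.get(author) == "authoritative" else ""
    return f"[{star}{author}] {text}"
```

Claude then sees the ★ marker inline in the retrieved-passage block and can apply the "prefer ★ sources" instruction.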
The Problem Illustrated
Before — chunker splits only on Latin punctuation
Sanskrit verse: "namo bhagavate vāsudevāya। idaṃ śāsanaṃ..."
The daṇḍa (।) is ignored. The chunker splits the verse mid-sentence on the next Latin period — breaking semantic coherence. Retrieved chunks contain half a verse + the start of another.
After — daṇḍa recognised as sentence boundary
The daṇḍa (। U+0964) and double daṇḍa (॥ U+0965) are now treated as sentence terminators alongside . ! ?. Each Sanskrit verse or prose sentence is a clean, semantically complete chunk.
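The chunker change can be sketched as a sentence splitter whose terminator set includes the daṇḍa characters (a minimal sketch, not the full chunker):

```python
import re

# Split after Latin terminators OR danda (U+0964) / double danda (U+0965),
# followed by whitespace. Lookbehind keeps the terminator with its sentence.
SENTENCE_END = re.compile(r"(?<=[.!?\u0964\u0965])\s+")

def split_sentences(text: str) -> list[str]:
    return [s for s in SENTENCE_END.split(text) if s.strip()]
```

Each Sanskrit verse now ends a chunk at its daṇḍa rather than bleeding into the next.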
The Problem
The Pallava Grantha script (6th–7th century CE) has no standardised Unicode block. Existing Grantha Unicode (U+11300) partially covers it, but many Pallava-specific glyph variants are absent from any standard. Without a stable encoding, characters cannot be stored, searched, or compared digitally.
The Five-Layer Solution
Layer 1 — 📷 Glyph Image: stone inscription PNG + SHA-256 fingerprint →
Layer 2 — 🔡 PUA Encoding: U+E800–U+E8FF, 256 Pallava glyphs →
Layer 3 — 𑌗 Grantha Unicode: U+11300–U+1137F, nearest standard →
Layer 4 — Ā IAST Romanisation: ISO 15919 standard, scholarly citation →
Layer 5 — 📝 Translation: Sanskrit + Telugu, via Claude Vision
Why Each Layer Exists
1. Glyph Image — the ground truth: exactly what is carved on stone. Needed for palaeographic research and visual verification. Example: page17_img1.png
2. PUA Encoding — stable digital identity for each glyph variant, enabling exact comparison and dedup even without Unicode standardisation. The basis of the Unicode initiative proposal. Example: U+E831 (ka-variant)
3. Grantha Unicode — interoperability with other Grantha-script tools and databases where an accepted standard mapping exists. Example: 𑌕 U+11315
4. IAST Romanisation — scholarly standard for citation, text search in English, and cross-reference with academic literature. Examples: ka, ṭa, śa, ṇa
5. Translation — Sanskrit (Devanagari) and Telugu translation: the output layer for researchers who cannot read the original script. Example: क + కి (Claude Vision)
Why It Matters
Without this pipeline, Pallava inscriptions could only be stored as images (unsearchable) or as approximate IAST text (losing glyph distinctiveness). The five-layer system is the scholarly infrastructure that makes the entire knowledge base viable as a research tool. The Layer 2 PUA encoding has been submitted as part of a Unicode standardisation proposal.
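One way to carry all five layers for a single glyph is a flat record; the field names and example values below are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class GlyphRecord:
    image_path: str         # Layer 1: ground-truth stone photograph
    image_sha256: str       # Layer 1: fingerprint for exact dedup
    pua_codepoint: str      # Layer 2: stable ID in U+E800-U+E8FF
    grantha_codepoint: str  # Layer 3: nearest standard Grantha mapping
    iast: str               # Layer 4: ISO 15919 romanisation
    translation: str        # Layer 5: Devanagari/Telugu rendering

ka = GlyphRecord("page17_img1.png", "ab12…", "U+E831", "U+11315", "ka", "क")
```

Every downstream feature (search, comparison, citation, translation) reads from the layer it needs without losing the others.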
Infrastructure & Operations
Before & After
Before — print() calls
print("Ingesting chunk 42...")
uvicorn redirects and closes stdout during request handling. All print() output goes to a closed file descriptor — silently discarded. Debugging is impossible.
After — logging module
log = logging.getLogger(__name__)
log.info("Ingesting chunk 42...")
uvicorn's logging integration routes these correctly. Visible in the terminal window during development and in log files on EC2.
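The replacement pattern, sketched minimally (ingest_chunk is a hypothetical function name):

```python
import logging

# Module-level logger that uvicorn's logging config can route — unlike
# print(), which writes to a stdout uvicorn may have redirected and closed.
log = logging.getLogger(__name__)

def ingest_chunk(chunk_id: int) -> None:
    # %-style lazy formatting: the string is only built if the level passes.
    log.info("Ingesting chunk %d...", chunk_id)
```

Handlers and levels are then controlled centrally (terminal in dev, files on EC2) without touching call sites.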
What Was Happening
start_all.bat tries to set the API key and start the agents → cmd.exe's nested-quote parsing breaks the set command → the History Agent, Personal Agent, and Agent J all start with no API key in their environment → every Claude API call fails with "Connection error" — a misleading symptom.
Why It Was Deceptive
The symptom was "Connection error" from Agent J — pointing to network issues. Only by tracing the process tree with Windows WMIC and inspecting each child process's own environment (not the start command) was it discovered that the API key was present in the start command string but never actually set in the child process environment.
Root cause: cmd /k "cd /d "c:\...\pallava-translator" && set ANTHROPIC_API_KEY=..." — the nested double-quotes cause cmd.exe to terminate the quoted string at the first inner ", never reaching the set command.
Resolution — Helper .bat Files
Before — nested quotes (broken)
start "History Agent" cmd /k "cd /d "c:\...\pallava-translator" && set ANTHROPIC_API_KEY=sk-ant-..."
Inner quotes break outer string. set never executes.
After — helper scripts
set ANTHROPIC_API_KEY=sk-ant-... (in parent, once)
start "History Agent" cmd /k "c:\..._start_history.bat"
No nested quotes. Environment variable inherited cleanly by all three child processes.
The Timeout Chain
Agent J (60 s timeout ✗) → calls → Personal Agent (30 s timeout ✗) → calls → History Agent (60 s timeout ✗) → calls → Claude Opus (90–120 s generation)
Every link in the chain had a timeout shorter than Opus's generation time. The outermost timeout fired first, cancelling the request — even though the History Agent was generating correctly.
Fix: All three clients raised to 240 seconds — 2× safety margin over the 90–120s observed generation time.
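A startup sanity check along these lines — the function and names are illustrative, not the project's code — would have caught the misconfiguration before any request was made:

```python
def validate_timeouts(chain: list[tuple[str, float]],
                      downstream_seconds: float,
                      margin: float = 2.0) -> list[str]:
    """Return names of chain links whose HTTP timeout is too tight.

    Every client's timeout must exceed the slowest downstream step
    (here, Opus generation) by the given safety margin.
    """
    floor = downstream_seconds * margin
    return [name for name, timeout in chain if timeout < floor]
```

With the observed 120 s worst case and a 2× margin, the pre-fix values all fail and the post-fix 240 s value passes.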
Root Cause & Fix
reload=True (broken)
On file change, uvicorn spawns a new worker forked from the watchfiles process — not from the original shell. It does not inherit ANTHROPIC_API_KEY set via setlocal in the .bat file. Every code change → Claude API fails.
reload=False (fixed)
Single process, inherits environment from the parent shell at startup and keeps it forever. For development changes, agents are restarted manually via _start_*.bat. Now a documented invariant in CLAUDE.md.
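The invariant can be pinned in the entry point itself; the module path and options below are assumptions, not the project's actual file:

```python
# Entry-point sketch. reload=False is the documented invariant: a single
# worker inherits ANTHROPIC_API_KEY from the parent shell and keeps it.
UVICORN_OPTS = dict(host="127.0.0.1", port=8000, reload=False)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("history_agent.app:app", **UVICORN_OPTS)
```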
The Size Reduction
Before: 8,371 PNG files (corpus/pdf_pages/) — 8.17 GiB of git history.
After: 1.87 MiB (4,400× smaller).
Why Standard Fixes Didn't Work
Once a file is committed to git, it lives in the pack objects permanently — even if you later add it to .gitignore and run git rm --cached. The working tree entry is removed, but every clone and push still transfers the full blob history.
git filter-branch rewrote history but left orphaned objects; garbage collection had not run yet, so pushes still transferred 8 GiB.
Final fix: Deleted the entire .git directory, ran git init, staged only the 165 code files, and pushed a fresh single-commit history. 1.87 MiB. The PNG files remain on disk — they are regenerated on demand during ingestion and have no place in version control.
How Git LFS Works
chroma.sqlite3 (197 MB) rejected by GitHub → Git LFS (.gitattributes: track "chroma_db/**") → GitHub LFS server stores 636 MB, leaving only a tiny pointer file in git → git clone ✓ auto-downloads the LFS objects.
EC2 Impact
Before LFS: EC2 deployment required manually copying 241 MB of ChromaDB files via scp after cloning — a fragile, error-prone step. After LFS: git clone fetches everything automatically. The app is fully operational immediately after clone, with no manual file transfers for the vector store.
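For reference, running git lfs track "chroma_db/**" records the rule in .gitattributes as a line like:

```
chroma_db/** filter=lfs diff=lfs merge=lfs -text
```

Committing .gitattributes alongside the tracked files ensures every clone applies the same LFS filtering.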
Before & After
Before — same hash strategy for all types
SHA-256 of first 64 KB for everything. Two scanned PDFs with identical publisher boilerplate cover pages → same hash → second book silently skipped as "already ingested."
Also: re-ingesting a URL whose Wikipedia content had been updated → same URL content hash (stale) → update blocked.
After — strategy per source type
Files (PDF, DOCX): SHA-256 of the full binary — no more cover-page collisions.
URLs: SHA-256 of the URL string itself — stable key, forces explicit skip_if_exists=False to update a URL source. Prevents accidental re-ingestion of unchanged content while allowing intentional updates.
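The per-type strategy can be sketched in a few lines (dedup_key is a hypothetical name for illustration):

```python
import hashlib

def dedup_key(source_type: str, payload) -> str:
    """payload is the URL string for 'url' sources, raw bytes for files."""
    if source_type == "url":
        # Key is the URL itself — stable across content updates.
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Key is the FULL binary — identical cover pages no longer collide.
    return hashlib.sha256(payload).hexdigest()
```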
Hub-and-Spoke Architecture (Enforced at Code Level)
Agent J — Master Orchestrator (Port 8002)
→ allowed: History Agent — Pallava KB (Port 8000)
→ allowed: Personal Agent — Gmail + Tasks (Port 8001)
Gmail OAuth (sensitive credential) is reachable only from the Personal Agent. Every other inter-agent call is blocked — no import path exists.
Why This Matters
The History Agent is intended to be public-facing eventually. A user could craft a malicious ingestion source (prompt injection via a poisoned PDF or URL) that instructs the History Agent to perform actions on their behalf. If the History Agent could directly call the Personal Agent — which has Gmail access — a prompt injection attack could silently send emails.
The hub-and-spoke topology means the History Agent has no code path to reach Gmail. Even if fully compromised, it cannot reach the Personal Agent's tools. The security boundary is enforced at the import level, not just by configuration.
Before & After
Risk — hardcoded Windows path
C:\Users\siddi\OneDrive\...\source_library\
Hardcoded into catalog.json entries. Would break silently on EC2 Linux — PDF downloads, page image rendering, and research paper image embedding would all fail with FileNotFoundError.
Fix — env var abstraction
PALLAVA_LIBRARY_DIR = os.environ.get("PALLAVA_LIBRARY_DIR", default_path)
On EC2: set PALLAVA_LIBRARY_DIR=/mnt/pallava-data/source_library/ in Parameter Store. Zero code changes. One-time path migration script rewrites existing catalog entries at deploy time (~15 min).
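A minimal sketch of the resolution helper (DEFAULT_LIBRARY_DIR and the function name are illustrative):

```python
import os
from pathlib import Path

# Illustrative development default; on EC2 the env var takes over.
DEFAULT_LIBRARY_DIR = Path.home() / "source_library"

def library_dir() -> Path:
    """Resolve the source-library root from the environment, else a default."""
    return Path(os.environ.get("PALLAVA_LIBRARY_DIR", str(DEFAULT_LIBRARY_DIR)))
```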
| # | Challenge | Domain | Difficulty | Status |
|---|-----------|--------|------------|--------|
| 01 | Claude Vision max_tokens silent failure | AI / Claude API | Hard | Resolved |
| 02 | Tamil multilingual extraction from legacy PDFs | Text Extraction | Hard | Resolved |
| 03 | IAST diacriticals rendering as boxes in Word | Unicode / Export | Hard | Resolved |
| 04 | RAG hallucination on lacunose inscriptions | AI Safety / RAG | Hard | Resolved |
| 05 | Author precedence for conflicting sources | RAG / Research | Medium | Resolved |
| 06 | Daṇḍa characters breaking chunk boundaries | Chunking / Sanskrit | Medium | Resolved |
| 07 | Five-layer Pallava script encoding pipeline | Unicode / Script | Hard | Resolved |
| 08 | uvicorn stdout closure killing all logging | FastAPI / Ops | Medium | Resolved |
| 09 | API key not inherited by Windows child processes | Windows / Process | Hard | Resolved |
| 10 | httpx.ReadTimeout on Opus paper generation | Multi-Agent / Net | Medium | Resolved |
| 11 | uvicorn reload=True breaking key inheritance | FastAPI / Process | Medium | Resolved |
| 12 | Git history bloated to 8.17 GiB by PDF images | Git / Storage | Hard | Resolved |
| 13 | ChromaDB exceeding GitHub 100 MB file limit | Git LFS | Medium | Resolved |
| 14 | SHA-256 dedup colliding across source types | Data Integrity | Medium | Resolved |
| 15 | Multi-agent security topology design | Architecture | Hard | Resolved |
| 16 | Windows absolute paths for EC2 portability | Portability / AWS | Medium | Resolved |
16 challenges documented. 8 classified Hard, 8 Medium. All resolved. This document will be updated as new challenges are encountered during Phase 2 (corpus expansion), Phase 3 (AWS migration), and Phase 4 (public research output).