Engineering Challenges & Resolutions
A chronicle of hard problems encountered building the Pallava Knowledge Base & Agent J system — and how each was solved.
Purpose of this document: This page records every significant engineering challenge encountered during the build — the symptom, root cause, failed attempts, and final resolution. It is intended to demonstrate the depth and complexity of the project to stakeholders, and to serve as a reference for the team when similar problems arise in future.
16 · Total Challenges Documented
16 · Fully Resolved
8 · Classified Hard
8 · Classified Medium
5 · Engineering Domains
4,400× · Git Size Reduction (8 GB → 1.9 MB)
Contents
  1. Claude Vision max_tokens silent failure
  2. Tamil / multilingual text extraction from legacy PDFs
  3. IAST diacritical characters rendering as boxes in Word exports
  4. RAG hallucination on lacunose inscription passages
  5. Author precedence — conflicting secondary sources
  6. Daṇḍa characters breaking chunk boundaries
  7. Five-layer Pallava script encoding pipeline
  8. uvicorn stdout closure killing all logging
  9. API key not inherited by child processes (Windows)
  10. httpx.ReadTimeout on Opus research paper generation
  11. uvicorn reload=True preventing API key inheritance
  12. Git history bloated to 8.17 GiB by rendered PDF images
  13. ChromaDB exceeding GitHub's 100 MB file limit
  14. SHA-256 deduplication colliding across source types
  15. Multi-agent security topology
  16. Windows absolute paths for EC2 portability
AI & Model Layer
01
Claude Vision max_tokens silent failure
Hard Resolved AI / Claude API
What Was Happening — Token Budget Exhausted
Before fix  ·  max_tokens = 4,096: internal reasoning (extended thinking, not visible) used the whole budget. No budget remaining for text output → empty result, no error.
After fix  ·  max_tokens = 32,000: internal reasoning, then the actual description output ✓
Symptom

Claude Vision API calls for PDF page description returned empty strings with no error. The API returned HTTP 200 (success) but the text content was blank. Entire inscription pages were silently skipped during ingestion.

Root Cause

Claude's extended thinking (adaptive reasoning) consumes output tokens internally before any text is emitted. With max_tokens = 4096, the value the pipeline used before the fix, the model spent the entire budget on internal reasoning and had no tokens left to produce the description. No exception was raised: the call returned successfully with empty output.

Resolution

Set max_tokens ≥ 32,000 in all Vision API calls — giving extended thinking enough budget while still leaving the majority for actual output. Now a documented invariant in CLAUDE.md.
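A guard along the following lines can turn this silent failure into a loud one. It is a minimal sketch, assuming the response object exposes a list of `content` blocks (each with a `type`, and a `text` attribute on text blocks) and a `stop_reason` field, as Anthropic Messages API responses do. `EmptyVisionOutput` and `extract_description` are hypothetical names, not the project's code.

```python
from types import SimpleNamespace

MIN_VISION_MAX_TOKENS = 32_000  # invariant stated in the resolution above


class EmptyVisionOutput(RuntimeError):
    """Raised when a Vision call succeeds but carries no text."""


def extract_description(response) -> str:
    """Join the text blocks of a response and refuse to return an empty result."""
    text = "".join(
        block.text
        for block in response.content
        if getattr(block, "type", None) == "text"
    )
    if not text.strip():
        raise EmptyVisionOutput(
            f"no text returned (stop_reason={response.stop_reason!r}); "
            "check that max_tokens leaves room after extended thinking"
        )
    return text


# Stand-in for a response whose budget was spent entirely on thinking.
exhausted = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(type="thinking", thinking="...")],
)
```

Calling `extract_description(exhausted)` raises `EmptyVisionOutput`, so the page is reported at ingest time and not stored as an empty chunk.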

Why It Matters

Without this fix, every image-heavy page — including stone inscription photographs — would have been ingested as empty chunks. The failure was completely invisible at ingest time; it would only have been discovered later when queries returned no results for known topics.

02
Tamil / multilingual text extraction from legacy PDFs
Hard Resolved Text Extraction Encoding
Solution — Two-Path Extraction Pipeline
PDF page → extract text with PyMuPDF → check how much of the text is garbled
✓ <15% garbled → Path A: use text directly (English / clean Unicode)
✗ >15% garbled → Path B: Claude Vision OCR (Tamil / legacy encoding / scanned)
Either path → clean chunk for ChromaDB
Before & After
Before — raw extraction
ÔÅçϳ>>ÐÑ×ÃÌÑ

Tamil script from a 1960s academic PDF was extracted as meaningless characters because the font's private encoding was not applied during extraction. The knowledge base stored unreadable text as "knowledge."
After — Vision OCR fallback
"The Pallava king Mahendravarman I was a prolific builder of rock-cut temples..."

The same page rendered to PNG and described by Claude Vision. Accurate, readable, correctly indexed.
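The routing decision between the two paths can be sketched as below. The document does not say how "garbled" is measured, so the heuristic here is an assumption: it counts Latin-1 Supplement characters (typical of mis-decoded legacy fonts), control characters, private-use characters and U+FFFD as suspicious. `garbled_ratio` and `choose_path` are hypothetical names, and sending empty pages to Path B is also an assumption.

```python
import unicodedata

GARBLED_THRESHOLD = 0.15  # the 15% cut-off from the pipeline description


def garbled_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like encoding debris."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # no text layer at all (e.g. a scanned page): treat as unusable
    suspicious = 0
    for ch in visible:
        code = ord(ch)
        if (
            0x80 <= code <= 0xFF                           # Latin-1 Supplement debris
            or ch == "\ufffd"                              # replacement character
            or unicodedata.category(ch) in ("Cc", "Co")    # control / private use
        ):
            suspicious += 1
    return suspicious / len(visible)


def choose_path(text: str, threshold: float = GARBLED_THRESHOLD) -> str:
    """Return "A" (use extracted text) or "B" (fall back to Vision OCR)."""
    return "B" if garbled_ratio(text) > threshold else "A"
```

With this heuristic, correctly encoded Tamil Unicode text is not counted as garbled, so only pages whose text layer is actually broken pay for a Vision call.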
Why It Matters

The most important primary sources — Epigraphia Indica, South Indian Inscriptions, Minakshi's Pallava dynasty study — are all legacy PDFs mixing Tamil and English. Without this fix, the knowledge base would have been built almost entirely from English secondary sources, severely undermining its scholarly value.

03
IAST diacritical characters rendering as boxes in Word exports
Hard Resolved Unicode Document Export
The Problem at a Glance
Word document (broken)
Mahendravarman I built the □□□□□□□ at □□□□□□□□□□.

Every IAST character (ā, ī, ū, ṭ, ḍ, ṇ, ś) rendered as an empty box. Scholarly names became unreadable.
PDF export (fixed)
Mahendravarman I built the maṇḍapa at Māmallapuram.

docx2pdf lets Word render with user-installed Noto Serif font, then embeds all fonts into the PDF. Portable and correct on any device.
Attempted Solutions
Attempt 1 — Georgia font
Partial IAST coverage. ā, ī worked; ṭ, ḍ, ṇ, ṣ still showed boxes.
Attempt 2 — Times New Roman
Same gaps. Font covers Latin Extended but not Combining Diacritical Marks for IAST.
Attempt 3 — Calibri + w:cs XML attribute
Better — most boxes eliminated. A few rare IAST characters still broken.
Attempt 4 — Noto Serif (user-profile install)
Font has full IAST coverage, but it was installed to the user-profile fonts directory, where the process rendering the python-docx output (running in a different process context) could not see it.
Final Fix — Two-layer approach
Layer 1: Prompt Claude to avoid IAST characters entirely and use plain English spellings.
Layer 2: Convert .docx → .pdf via docx2pdf (Microsoft Word COM), which uses the user-profile font and embeds it in the PDF. Works on any recipient's machine.
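The two layers can be sketched as follows. This is an illustration under assumptions, not the project's code: `to_plain_spelling` is a hypothetical safety net for Layer 1 that simply drops combining marks (so ś becomes s, not "sh"), and `export_pdf` wraps the `convert` function from docx2pdf, which needs Microsoft Word installed.

```python
import unicodedata


def to_plain_spelling(text: str) -> str:
    """Drop combining marks so IAST letters fall back to plain Latin letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def export_pdf(docx_path: str, pdf_path: str) -> None:
    """Layer 2: let Microsoft Word render the document and write a PDF."""
    # Imported lazily: docx2pdf is third-party and only works where Word is present.
    from docx2pdf import convert

    convert(docx_path, pdf_path)
```

For example, `to_plain_spelling("maṇḍapa at Māmallapuram")` gives `"mandapa at Mamallapuram"`, which renders in any font even if the model ignores the Layer 1 prompt.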
Why It Matters

Pallava king names, temple names, and epigraphical terms are almost all IAST-transliterated Sanskrit or Tamil (Narasimhavarman, maṇḍapa, vimāna). An export system that cannot render these is useless for academic or stakeholder presentations.

04
RAG hallucination on lacunose (damaged) inscription passages
Hard Resolved RAG AI Safety
The Problem Illustrated
Before — AI fills in gaps
Source passage: "The king [...] defeated the [...] army in the year [...]"

AI response: "King Narasimhavarman I defeated the Chalukya army in the year 642 CE"

Fabricated. The missing sections were physically destroyed on the stone. The AI invented plausible-sounding dates and names.
After — gaps flagged explicitly
AI response: "The inscription records a military victory by the king, however the identity of the enemy force and the date are physically damaged on the stone — [uncertain — inscription damaged at this point, confidence: 0]"

Honest and safe.
Root Cause & Resolution

Language models are trained to produce fluent, complete text. When presented with fragments containing gaps ([...], ***), the model's probability distribution strongly favours filling them with contextually plausible content — exactly what a scholar must not do.

Resolution: Extended the RAG answerer prompt with an explicit lacunose instruction — "Do NOT reconstruct or speculate about missing content. State: [uncertain — inscription damaged]." Also added grounding validation that flags invented Pallava king names or dates absent from the retrieved chunks.
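The grounding validation can be pictured with a sketch like the one below. It assumes a simple rule: any year or watched name that appears in the answer but in none of the retrieved chunks is flagged. `KNOWN_NAMES` is a hypothetical watch-list and `find_ungrounded` a hypothetical helper; the project's actual validator may work differently.

```python
import re

# Hypothetical watch-list; a real one would be drawn from the knowledge base.
KNOWN_NAMES = ("Mahendravarman", "Narasimhavarman", "Chalukya")
YEAR_PATTERN = re.compile(r"\b\d{3,4}\b")


def find_ungrounded(answer: str, chunks: list[str]) -> list[str]:
    """Return years and watched names present in the answer but in no chunk."""
    source_text = " ".join(chunks)
    candidates = set(YEAR_PATTERN.findall(answer))
    candidates.update(name for name in KNOWN_NAMES if name in answer)
    return sorted(c for c in candidates if c not in source_text)


chunks = ["The king [...] defeated the [...] army in the year [...]"]
answer = "King Narasimhavarman I defeated the Chalukya army in the year 642 CE"
flags = find_ungrounded(answer, chunks)
```

Here `flags` lists the year and both names, since none of them occurs in the damaged source passage. A non-empty result would mark the answer for rejection or regeneration.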

Why It Matters

In epigraphy, a fabricated reconstruction of a damaged king's name or regnal date can propagate as fact through secondary literature for decades. This is an AI safety problem specific to the humanities — the stakes of hallucination are historical record integrity, not just user inconvenience.

05
Author precedence — conflicting secondary sources in research papers
Medium Resolved RAG Research
How the Precedence System Works
Wikipedia · no star
Secondary book · no star
Primary monograph · ★ Authoritative, preferred in conflict resolution
Before & After
Before — no weighting
ChromaDB ranks by semantic distance only. A Wikipedia paragraph and a passage from K.A. Nilakanta Sastri's primary monograph appear with equal weight. Research paper presents contradictory regnal dates side-by-side with no guidance.
After — author_precedence.json
Authoritative sources tagged ★ in the passage block. Prompt instructs Claude: "Prefer ★ sources when sources conflict. Present both views and note which authority you prefer." Research papers acknowledge historiographical disagreement properly.
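The tagging step might look like the sketch below. The layout of author_precedence.json is not given in this document, so the mapping of author name to an authoritative flag is an assumption, as are the `format_passages` helper and the passage dictionaries.

```python
import json

# Assumed shape of author_precedence.json: author name -> authoritative flag.
PRECEDENCE_JSON = '{"K.A. Nilakanta Sastri": true, "Wikipedia": false}'


def format_passages(passages: list[dict], precedence: dict[str, bool]) -> str:
    """Build the passage block, prefixing authoritative sources with a star."""
    lines = []
    for passage in passages:
        star = "★ " if precedence.get(passage["author"], False) else ""
        lines.append(f"{star}[{passage['author']}] {passage['text']}")
    return "\n".join(lines)


precedence = json.loads(PRECEDENCE_JSON)
block = format_passages(
    [
        {"author": "K.A. Nilakanta Sastri", "text": "passage A"},
        {"author": "Wikipedia", "text": "passage B"},
    ],
    precedence,
)
```

Unknown authors default to no star, so a new source is never treated as authoritative by accident.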
Text & Encoding
06
Daṇḍa characters breaking Sanskrit chunk boundaries
Medium Resolved Chunking Sanskrit
The Problem Illustrated
Before — chunker splits only on Latin punctuation
Sanskrit verse: "namo bhagavate vāsudevāya। idaṃ śāsanaṃ..."

The daṇḍa (।) is ignored. The chunker splits the verse mid-sentence on the next Latin period — breaking semantic coherence. Retrieved chunks contain half a verse + the start of another.
After — daṇḍa recognised as sentence boundary
The daṇḍa (। U+0964) and double daṇḍa (॥ U+0965) are now treated as sentence terminators alongside . ! ?. Each Sanskrit verse or prose sentence is a clean, semantically complete chunk.
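A minimal sketch of such a splitter is shown below, assuming sentences are separated by whitespace after the terminator. `split_sentences` is a hypothetical name, and the sample string is illustrative only.

```python
import re

# Terminators: Latin . ! ? plus danda (U+0964) and double danda (U+0965).
SENTENCE_END = re.compile(r"(?<=[.!?\u0964\u0965])\s+")


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows any recognised sentence terminator."""
    return [part for part in SENTENCE_END.split(text.strip()) if part]


sample = "namo bhagavate vāsudevāya। idaṃ śāsanaṃ॥ The grant follows."
sentences = split_sentences(sample)
```

The lookbehind keeps each terminator attached to its own sentence, so a verse ending in । or ॥ stays intact as one unit.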
07
Five-layer Pallava script encoding pipeline
Hard Resolved Unicode Script Engineering
The Problem

The Pallava Grantha script (6th–7th century CE) has no standardised Unicode block. Existing Grantha Unicode (U+11300) partially covers it, but many Pallava-specific glyph variants are absent from any standard. Without a stable encoding, characters cannot be stored, searched, or compared digitally.

The Five-Layer Solution
Layer 1 · Glyph Image · Stone inscription PNG · SHA-256 fingerprint
Layer 2 · PUA Encoding · U+E800–U+E8FF · 256 Pallava glyphs
Layer 3 · Grantha Unicode · U+11300–U+1137F · Nearest standard
Layer 4 · IAST Romanisation · ISO 15919 standard · Scholarly citation
Layer 5 · Translation · Sanskrit + Telugu · Claude Vision
Why Each Layer Exists
  1. Glyph Image: The ground truth, exactly what is carved on stone. Needed for palaeographic research and visual verification. Example: page17_img1.png
  2. PUA Encoding: Stable digital identity for each glyph variant. Enables exact comparison and dedup even without Unicode standardisation. The basis of the Unicode initiative proposal. Example: U+E831 (ka-variant)
  3. Grantha Unicode: Interoperability with other Grantha-script tools and databases where an accepted standard mapping exists. Example: 𑌕 U+11315
  4. IAST Romanisation: Scholarly standard for citation, text search in English, and cross-reference with academic literature. Example: ka, ṭa, śa, ṇa
  5. Translation: Sanskrit (Devanagari) and Telugu translation, the output layer for researchers who cannot read the original script. Example: क + కి (Claude Vision)
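One way to hold the five layers together is a single record per glyph, sketched below. The field names and the `GlyphRecord` class are hypothetical. Only the code point ranges and the use of SHA-256 come from the description above.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

PUA_START, PUA_END = 0xE800, 0xE8FF            # Layer 2 range
GRANTHA_START, GRANTHA_END = 0x11300, 0x1137F  # Layer 3 range


def fingerprint(image_bytes: bytes) -> str:
    """Layer 1: SHA-256 fingerprint of the glyph image."""
    return hashlib.sha256(image_bytes).hexdigest()


@dataclass(frozen=True)
class GlyphRecord:
    image_sha256: str        # Layer 1
    pua: str                 # Layer 2, one private-use code point
    grantha: Optional[str]   # Layer 3, None when no accepted mapping exists
    iast: str                # Layer 4
    translation: str = ""    # Layer 5

    def __post_init__(self) -> None:
        if len(self.pua) != 1 or not PUA_START <= ord(self.pua) <= PUA_END:
            raise ValueError("pua must be one code point in U+E800..U+E8FF")
        if self.grantha is not None and (
            len(self.grantha) != 1
            or not GRANTHA_START <= ord(self.grantha) <= GRANTHA_END
        ):
            raise ValueError("grantha must be one code point in U+11300..U+1137F")


record = GlyphRecord(
    image_sha256=fingerprint(b"placeholder image bytes"),
    pua=chr(0xE831),
    grantha=chr(0x11315),
    iast="ka",
)
```

Validating the ranges at construction time keeps a mistyped code point from entering the store, where it would quietly break exact comparison.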
Why It Matters

Without this pipeline, Pallava inscriptions could only be stored as images (unsearchable) or as approximate IAST text (losing glyph distinctiveness). The five-layer system is the scholarly infrastructure that makes the entire knowledge base viable as a research tool. The Layer 2 PUA encoding has been submitted as part of a Unicode standardisation proposal.

Infrastructure & Operations
08
uvicorn stdout closure killing all logging
Medium Resolved FastAPI Logging
Before & After
Before — print() calls
print("Ingesting chunk 42...")

uvicorn redirects and closes stdout during request handling. All print() output goes to a closed file descriptor — silently discarded. Debugging is impossible.
After — logging module
import logging
log = logging.getLogger(__name__)
log.info("Ingesting chunk 42...")

uvicorn's logging integration routes these correctly. Visible in the terminal window during development and in log files on EC2.
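Where an explicit handler is preferred over relying on uvicorn's own logging configuration, a setup along these lines is one option. It is a sketch under assumptions: the logger name `pallava.ingest` and the `configure_logging` helper are hypothetical.

```python
import io
import logging


def configure_logging(stream=None, level: int = logging.INFO) -> logging.Logger:
    """Attach one stream handler to a named logger and return the logger."""
    logger = logging.getLogger("pallava.ingest")  # hypothetical logger name
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(
        logging.Formatter("%(levelname)s %(name)s: %(message)s")
    )
    logger.handlers = [handler]  # replace, so repeated calls do not duplicate output
    logger.propagate = False     # this handler is the only destination
    return logger


buffer = io.StringIO()
log = configure_logging(buffer)
log.info("Ingesting chunk %d...", 42)
```

Because the handler writes to stderr by default and not to stdout, the messages do not depend on what the server does with stdout.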
09
API key not inherited by child processes (Windows)
Hard Resolved Windows Process Management
What Was Happening
start_all.bat: tries to set API key + start agents
↓ cmd.exe nested-quote parsing breaks the set command
History Agent · Personal Agent · Agent J: no API key in env
↓ Every Claude API call fails with "Connection error", a misleading symptom
Why It Was Deceptive

The symptom was "Connection error" from Agent J — pointing to network issues. Only by tracing the process tree with Windows WMIC and inspecting each child process's own environment (not the start command) was it discovered that the API key was present in the start command string but never actually set in the child process environment.

Root cause: cmd /k "cd /d "c:\...\pallava-translator" && set ANTHROPIC_API_KEY=..." — the nested double-quotes cause cmd.exe to terminate the quoted string at the first inner ", never reaching the set command.

Resolution — Helper .bat Files
Before — nested quotes (broken)
start "History Agent" cmd /k "cd /d "c:\...\pallava-translator" && set ANTHROPIC_API_KEY=sk-ant-..."

Inner quotes break outer string. set never executes.
After — helper scripts
set ANTHROPIC_API_KEY=sk-ant-... (in parent, once)
start "History Agent" cmd /k "c:\..._start_history.bat"

No nested quotes. Environment variable inherited cleanly by all three child processes.
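Since the original symptom pointed at the network, a fail-fast check at agent startup would have named the real cause immediately. The sketch below is one way to do that. `require_api_key` is a hypothetical helper, not part of the project as described.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from this process's environment or stop with a clear error."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set in this process's environment; "
            "check how the parent .bat file launched this agent"
        )
    return key
```

Calling `require_api_key()` once before the server starts replaces a misleading "Connection error" on the first request with an explicit message at launch.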
10
httpx.ReadTimeout on Opus research paper generation
Medium Resolved Multi-Agent Networking
The Timeout Chain
Agent J → calls → Personal Agent → calls → History Agent → calls → Claude Opus (90–120 sec)
Client timeouts along the chain: 60s ✗ · 30s ✗ · 60s ✗
Every link in the chain had a timeout shorter than Opus's generation time, so an upstream client timed out and cancelled the request even though the History Agent was generating correctly.
Fix: All three clients raised to 240 seconds — 2× safety margin over the 90–120s observed generation time.
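The arithmetic behind the new value, and one way to apply it with httpx, is sketched below. Raising only the read timeout while keeping a short default for connect, write and pool is an assumption, since the document does not say which timeout fields were changed. `read_timeout_seconds` and `make_client` are hypothetical names.

```python
OBSERVED_MAX_GENERATION_S = 120  # upper end of the 90-120 s range reported above
SAFETY_FACTOR = 2


def read_timeout_seconds(
    observed_max_s: float = OBSERVED_MAX_GENERATION_S,
    safety_factor: float = SAFETY_FACTOR,
) -> float:
    """Slowest observed generation multiplied by the safety margin."""
    return observed_max_s * safety_factor


def make_client():
    """Build an httpx client that waits long enough for a slow generation."""
    import httpx  # third-party, imported lazily

    # 10 s default for connect/write/pool, long read timeout for the response.
    timeout = httpx.Timeout(10.0, read=read_timeout_seconds())
    return httpx.Client(timeout=timeout)
```

Using the same helper in all three agents keeps the timeouts equal along the chain, so no intermediate hop gives up before the model has finished.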
11
uvicorn reload=True preventing API key inheritance
Medium Resolved FastAPI Process Management
Root Cause & Fix
reload=True (broken)
On file change, uvicorn starts a new worker from its reloader (watchfiles) process, not from the original shell. In this setup the new worker did not inherit the ANTHROPIC_API_KEY set inside the .bat file's setlocal scope, so every code change caused Claude API calls to fail.
reload=False (fixed)
Single process, inherits environment from the parent shell at startup and keeps it forever. For development changes, agents are restarted manually via _start_*.bat. Now a documented invariant in CLAUDE.md.
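The invariant can be pinned in the launch code itself, as in the sketch below. The import string "main:app" and the host value are placeholders, and `server_settings` is a hypothetical helper. Port 8000 is the History Agent port given elsewhere in this document.

```python
def server_settings(port: int = 8000) -> dict:
    """Settings shared by every agent launcher; reload stays off by design."""
    return {"host": "127.0.0.1", "port": port, "reload": False}


def main() -> None:
    import uvicorn  # third-party, imported lazily

    # "main:app" is a placeholder import string for the FastAPI application.
    uvicorn.run("main:app", **server_settings())
```

Keeping `reload` in one shared function means a developer has to change it deliberately in one place instead of toggling it per agent.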
Git & Version Control
12
Git history bloated to 8.17 GiB by rendered PDF page images
Hard Resolved Git Storage
The Size Reduction
Before: 8,371 PNG files (corpus/pdf_pages/), 8.17 GiB
After: 1.87 MiB (4,400× smaller)
Why Standard Fixes Didn't Work

Once a file is committed to git, its blob stays in the repository's object store for as long as any commit in the history references it, even if you later add it to .gitignore and run git rm --cached. That command removes the file from the index and from future commits, but every clone still transfers the full blob history.

git filter-branch rewrote history but left orphaned objects; garbage collection had not run yet, so pushes still transferred 8 GiB.

Final fix: Deleted the entire .git directory, ran git init, staged only the 165 code files, and pushed a fresh single-commit history. 1.87 MiB. The PNG files remain on disk — they are regenerated on demand during ingestion and have no place in version control.

13
ChromaDB exceeding GitHub's 100 MB single-file limit
Medium Resolved Git LFS Storage
How Git LFS Works
chroma.sqlite3 (197 MB) → rejected by GitHub
Git LFS: .gitattributes, track "chroma_db/**"
GitHub LFS server: stores 636 MB, plus a tiny pointer file in git
git clone ✓ auto-downloads LFS objects
EC2 Impact

Before LFS: EC2 deployment required manually copying 241 MB of ChromaDB files via scp after cloning — a fragile, error-prone step. After LFS: git clone fetches everything automatically. The app is fully operational immediately after clone, with no manual file transfers for the vector store.
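A small audit run before pushing can catch files that would hit the limit, or that belong in LFS. The sketch below is an illustration, not part of the project as described. `find_oversized_files` is a hypothetical helper, and the limit is taken here as 100 MiB.

```python
import os

GITHUB_FILE_LIMIT = 100 * 1024 * 1024  # per-file limit, taken here as 100 MiB


def find_oversized_files(root: str, limit_bytes: int = GITHUB_FILE_LIMIT) -> list:
    """Return (relative_path, size) for every file under root above the limit."""
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store; it is not part of the working tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                oversized.append((os.path.relpath(path, root), size))
    return sorted(oversized)
```

Any path the function reports is a candidate for an LFS tracking rule or for .gitignore before the first commit that would include it.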

System Architecture
14
SHA-256 deduplication colliding across source types
Medium Resolved Data Integrity Ingestion
Before & After
Before — same hash strategy for all types
SHA-256 of first 64 KB for everything. Two scanned PDFs with identical publisher boilerplate cover pages → same hash → second book silently skipped as "already ingested."

Also: re-ingesting a URL whose Wikipedia content had been updated → same URL content hash (stale) → update blocked.
After — strategy per source type
Files (PDF, DOCX): SHA-256 of the full binary — no more cover-page collisions.

URLs: SHA-256 of the URL string itself — stable key, forces explicit skip_if_exists=False to update a URL source. Prevents accidental re-ingestion of unchanged content while allowing intentional updates.
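The per-type strategy can be sketched as two small functions. The `file:` and `url:` prefixes are an addition for illustration, so that keys of different source types can never be equal. They are not described in this document, and `file_key` and `url_key` are hypothetical names.

```python
import hashlib


def file_key(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-256 of the full file contents, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return "file:" + digest.hexdigest()


def url_key(url: str) -> str:
    """SHA-256 of the URL string itself, independent of the page content."""
    return "url:" + hashlib.sha256(url.encode("utf-8")).hexdigest()
```

Two PDFs that share the same first 64 KB now get different keys as soon as any later byte differs, while a URL keeps the same key no matter how its content changes.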
15
Multi-agent security topology design
Hard Resolved Multi-Agent Architecture
Hub-and-Spoke Architecture (Enforced at Code Level)
Agent J (Master Orchestrator, Port 8002) → History Agent (Pallava KB, Port 8000): allowed
Agent J (Master Orchestrator, Port 8002) → Personal Agent (Gmail + Tasks, Port 8001): allowed
Personal Agent → History Agent: allowed
History Agent → Personal Agent: blocked (no import path exists)
Gmail OAuth (sensitive credential): Personal Agent only
Why This Matters

The History Agent is intended to be public-facing eventually. A user could craft a malicious ingestion source (prompt injection via a poisoned PDF or URL) that instructs the History Agent to perform actions on their behalf. If the History Agent could directly call the Personal Agent — which has Gmail access — a prompt injection attack could silently send emails.

The hub-and-spoke topology means the History Agent has no code path to reach Gmail. Even if fully compromised, it cannot reach the Personal Agent's tools. The security boundary is enforced at the import level, not just by configuration.
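The topology can also be written down as data, which is useful for a regression test that fails if someone adds a forbidden call path. The sketch below is illustrative only: the project enforces the boundary by the absence of an import path, not by a runtime table, and the agent identifiers and `is_call_allowed` are hypothetical.

```python
# Caller -> set of agents it may call, following the diagram above.
ALLOWED_CALLS = {
    "agent_j": {"history_agent", "personal_agent"},
    "personal_agent": {"history_agent"},
    "history_agent": set(),  # public-facing: may call no other agent
}


def is_call_allowed(caller: str, callee: str) -> bool:
    """True only for edges listed in the topology; unknown callers get nothing."""
    return callee in ALLOWED_CALLS.get(caller, set())
```

Defaulting unknown callers to an empty set keeps the table deny-by-default, matching the intent that the History Agent has no route to Gmail.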

16
Windows absolute paths for EC2 portability (pre-emptive)
Medium Resolved — by design Portability AWS
Before & After
Risk — hardcoded Windows path
C:\Users\siddi\OneDrive\...\source_library\

Hardcoded into catalog.json entries. Would break silently on EC2 Linux — PDF downloads, page image rendering, and research paper image embedding would all fail with FileNotFoundError.
Fix — env var abstraction
PALLAVA_LIBRARY_DIR = os.environ.get("PALLAVA_LIBRARY_DIR", default_path)

On EC2: set PALLAVA_LIBRARY_DIR=/mnt/pallava-data/source_library/ in Parameter Store. Zero code changes. One-time path migration script rewrites existing catalog entries at deploy time (~15 min).
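A fuller version of the abstraction might resolve every catalog entry against the configured directory, as sketched below. Storing catalog entries as names relative to the library directory is an assumption, and `library_dir`, `resolve_entry` and the default value are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def library_dir(
    env: Optional[Mapping[str, str]] = None,
    default: str = "source_library",
) -> Path:
    """Library root from PALLAVA_LIBRARY_DIR, falling back to a default path."""
    env = os.environ if env is None else env
    return Path(env.get("PALLAVA_LIBRARY_DIR", default))


def resolve_entry(relative_name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Join a catalog entry (stored relative to the library) onto the root."""
    return library_dir(env) / relative_name
```

With relative entries, the same catalog.json works on Windows and on EC2, and only the environment variable differs between the two.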
Summary
# | Challenge | Domain | Difficulty | Status
01 | Claude Vision max_tokens silent failure | AI / Claude API | Hard | Resolved
02 | Tamil multilingual extraction from legacy PDFs | Text Extraction | Hard | Resolved
03 | IAST diacriticals rendering as boxes in Word | Unicode / Export | Hard | Resolved
04 | RAG hallucination on lacunose inscriptions | AI Safety / RAG | Hard | Resolved
05 | Author precedence for conflicting sources | RAG / Research | Medium | Resolved
06 | Daṇḍa characters breaking chunk boundaries | Chunking / Sanskrit | Medium | Resolved
07 | Five-layer Pallava script encoding pipeline | Unicode / Script | Hard | Resolved
08 | uvicorn stdout closure killing all logging | FastAPI / Ops | Medium | Resolved
09 | API key not inherited by Windows child processes | Windows / Process | Hard | Resolved
10 | httpx.ReadTimeout on Opus paper generation | Multi-Agent / Net | Medium | Resolved
11 | uvicorn reload=True breaking key inheritance | FastAPI / Process | Medium | Resolved
12 | Git history bloated to 8.17 GiB by PDF images | Git / Storage | Hard | Resolved
13 | ChromaDB exceeding GitHub 100 MB file limit | Git LFS | Medium | Resolved
14 | SHA-256 dedup colliding across source types | Data Integrity | Medium | Resolved
15 | Multi-agent security topology design | Architecture | Hard | Resolved
16 | Windows absolute paths for EC2 portability | Portability / AWS | Medium | Resolved
16 challenges documented. 8 classified Hard, 8 Medium. All resolved. This document will be updated as new challenges are encountered during Phase 2 (corpus expansion), Phase 3 (AWS migration), and Phase 4 (public research output).