This is part two of my PyCon India 2025 notes. The first post covered MCP and multi-agent patterns. This one focuses on the production side: making AI systems fast, secure, and cost-efficient.
Prompt Caching — 80% Cost Reduction Is Real
The caching talk resonated because I've implemented exactly this pattern at work. The core idea: separate your prompts into static and dynamic components.
The Strategy
Static prefix — instructions, schemas, rules. These don't change between requests.
```python
# This part gets cached — same across all documents
system_prompt = """You are a data extraction expert.
Extract information in the following JSON format...
[detailed schema and rules]"""

# This part varies per request
user_prompt = f"Extract data from: {document_content}"
```

Three levels of caching:
- Prompt-level caching — cache the static instruction prefix. AWS Bedrock supports this natively with ephemeral caching.
- Session-level caching — when processing batches from the same source, cache source-specific schemas across the batch. Companies saw 2-5x throughput improvements.
- Cache breakpoints — cache intermediate results between pipeline steps (extraction → validation → formatting). 15-25% overall time reduction.
The rule: avoid caching ephemeral reasoning or user state changes. Only cache what's truly stable.
The numbers: organizations implementing all three levels saw 30-60% reduction in API costs and 40-70% faster response times.
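As a concrete sketch of the prompt-level pattern, here is how the static prefix can be marked cacheable, assuming an Anthropic-style Messages API where a `cache_control` field on a system block enables ephemeral caching (the model name is a placeholder, and no request is actually sent here):

```python
# Static prefix: identical across requests, so the provider can cache it
SYSTEM_PROMPT = """You are a data extraction expert.
Extract information in the following JSON format...
[detailed schema and rules]"""

def build_request(document_content: str) -> dict:
    """Assemble a request with a cacheable static prefix (sketch only)."""
    return {
        "model": "claude-sonnet",  # placeholder model name
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark prefix cacheable
        }],
        "messages": [{
            # Dynamic part: changes per request, never cached
            "role": "user",
            "content": f"Extract data from: {document_content}",
        }],
    }

request = build_request("...")
```

The key design point is that the cacheable block must be byte-identical between requests; any variable content belongs after it, in the user message.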
Security Layers for AI Systems
The security talks followed a clear pattern: defense in depth, same as traditional systems but with AI-specific concerns.
The Graduated Approach
Layer 1 — Basic protections:
- Rate limits per user and globally. Companies reduced API abuse by 90%+ with proper rate limiting.
- Topic filters — reject documents that don't match expected formats. 95% reduction in processing errors from irrelevant uploads.
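A token bucket is one common way to implement the per-user limits mentioned above. This toy in-process class (`TokenBucket` is my naming, not from the talk) illustrates the idea; real deployments would use a shared store like Redis or an API gateway:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative, in-process only)."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(sum(bucket.allow() for _ in range(20)))  # roughly 10 allowed from a cold burst
```

One bucket per user plus one global bucket covers both limits from the bullet above.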
Layer 2 — Content safety:
- Services like AWS Bedrock Guardrails or NeMo Guardrails
- Detect inappropriate content, malicious prompts, potential security threats
- 80% reduction in security incidents after implementation
Layer 3 — Prompt attack guardrails:
- Catch injection attempts and malicious prompt modifications
- 99%+ attack prevention rate with proper implementation
The layered approach reduced overall risk by 95% in the case studies presented.
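For intuition, a layer-3 check can be as simple as screening prompts for known injection phrasing before they reach the model. This keyword sketch is deliberately naive; production guardrail services use trained classifiers, not pattern lists:

```python
import re

# Toy patterns for common injection phrasings (illustrative, not exhaustive)
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"disregard .{0,30}system prompt",
    r"reveal .{0,30}(system prompt|instructions)",
]

def looks_like_injection(text: str) -> bool:
    """Flag text matching any known injection pattern."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)

print(looks_like_injection("Please ignore all previous instructions"))  # True
print(looks_like_injection("Summarize this invoice"))                   # False
```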
PII/PHI Handling
This was particularly relevant for anyone processing financial or medical documents:
- Redaction — remove entirely. SSN "123-45-6789" becomes "[REDACTED]"
- Anonymization — replace with safe placeholders. "John Smith" becomes "Employee_001"
- Pseudonymization — consistent but unlinkable identifiers. Same person always becomes "ID_7429" across documents
The balance: preserve enough structure for processing while removing identifiers. Companies implementing these strategies saw 50-80% reduction in privacy incidents, and processing time actually improved 20-30% due to simpler data structures.
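The pseudonymization variant can be sketched with a salted hash: the same input always maps to the same opaque ID, but the ID cannot be linked back without the secret. The salt value and `ID_` format here are illustrative, not from the talk:

```python
import hashlib

SALT = b"rotate-me-per-project"  # hypothetical secret, kept out of source control

def pseudonymize(name: str) -> str:
    """Map a name to a consistent but unlinkable identifier."""
    digest = hashlib.sha256(SALT + name.encode("utf-8")).hexdigest()
    return f"ID_{digest[:12]}"

# Same person yields the same ID across documents
assert pseudonymize("John Smith") == pseudonymize("John Smith")
print(pseudonymize("John Smith"))
```

Because the mapping is deterministic, downstream joins and deduplication still work on the pseudonymized data, which is part of why processing can get faster rather than slower.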
Python Performance: 40x with Queue-Based Multiprocessing
The performance talk had the most dramatic benchmarks. The setup: processing large numerical datasets (think 8000×200000 matrices).
The Three Approaches
Sequential — baseline. Simple, debuggable, slow.
Naive multiprocessing — basic multiprocessing.Pool. Better, but hits limitations with memory sharing, recursion depth, and GC pressure.
Queue-based multiprocessing — asynchronous task distribution with proper resource management. This is where the magic happens.
The Benchmarks
| Dataset | Sequential | Naive MP | Queue-based |
|---|---|---|---|
| 8000×200000 | 1097s | 25s | 17.9s (61x) |
| 16000×200000 | 3496s | 84s | 27.9s (125x) |
| 800×20000 | 199s | 5.2s | 3.5s (57x) |
Key Techniques
The queue-based approach combined several optimizations:
- Cross-process variable sharing via `multiprocessing.Manager()` — 50-80% reduction in memory duplication
- Garbage collection optimization — strategic `gc.collect()` calls eliminated memory-related crashes
- Numba JIT compilation — 10-100x speedup for numerical bottlenecks
- Recursion pruning — converting recursive algorithms to iterative, 99% reduction in stack overflow crashes
- Runtime datatype decisions — switching between arrays and sparse matrices based on data density
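The queue-based pattern itself can be sketched in plain Python with `multiprocessing.Queue`: workers pull tasks until they see a sentinel, so work is distributed dynamically instead of in fixed chunks. The sum-of-squares kernel and the `process_chunks` name are mine, standing in for the talk's real numerical workloads:

```python
import multiprocessing as mp

def worker(task_q, result_q):
    """Pull (index, chunk) tasks until a None sentinel arrives."""
    while True:
        item = task_q.get()
        if item is None:
            break
        idx, chunk = item
        # Toy numerical kernel standing in for the real workload
        result_q.put((idx, sum(x * x for x in chunk)))

def process_chunks(chunks, n_workers=4):
    """Fan tasks out to worker processes and gather results in order."""
    task_q = mp.Queue()
    result_q = mp.Queue()
    n_workers = min(n_workers, len(chunks)) or 1
    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for task in enumerate(chunks):
        task_q.put(task)
    for _ in workers:
        task_q.put(None)  # one sentinel per worker
    results = [result_q.get() for _ in chunks]
    for w in workers:
        w.join()
    return [total for _, total in sorted(results)]

if __name__ == "__main__":
    print(process_chunks([list(range(1000)) for _ in range(8)]))
```

Because idle workers immediately grab the next task, uneven chunk sizes don't leave cores waiting, which is a big part of the gap between naive and queue-based multiprocessing.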
The overarching lesson: profile before optimizing. Teams that measured first saw 3x better results than those who guessed at bottlenecks.
DuckDB Is the New Pandas
This wasn't a full talk but came up in multiple sessions: DuckDB is increasingly replacing Pandas for analytical workloads.
Why it's winning:
- Up to 20x faster for data analysis tasks
- Automatically uses all CPU cores
- Handles datasets larger than memory
- SQL syntax — more widely known than Pandas API, and LLMs generate it better
- Runs in browsers via WebAssembly
When to reach for DuckDB over Pandas:
- Datasets over 1GB
- Complex aggregations and joins
- Performance-critical pipelines
- Teams with SQL expertise
I haven't migrated anything yet, but I'm planning to benchmark it against our Pandas-heavy ETL workflows.
My Takeaways
Three things I'm implementing immediately:
- Audit our caching strategy — we already do prompt caching, but session-level caching for batch processing is an easy win we're leaving on the table
- Add formal PII detection — we handle sensitive documents but haven't formalized the anonymization pipeline
- Profile before touching performance code — the 40x improvement story was a good reminder that intuition about bottlenecks is usually wrong
The best production AI systems aren't the ones with the fanciest models. They're the ones with proper caching, security layers, and performance engineering around the model.
Notes and interpretations from PyCon India 2025. Metrics cited are from conference presentations.