This is part two of my PyCon India 2025 notes. The first post covered MCP and multi-agent patterns. This one focuses on the production side: making AI systems fast, secure, and cost-efficient.
Prompt Caching — 80% Cost Reduction Is Real
The caching talk resonated because I've implemented exactly this pattern at work. The core idea: separate your prompts into static and dynamic components.
The Strategy
Static prefix — instructions, schemas, rules. These don't change between requests.
```python
# This part gets cached — same across all documents
system_prompt = """You are a data extraction expert.
Extract information in the following JSON format...
[detailed schema and rules]"""

# This part varies per request
user_prompt = f"Extract data from: {document_content}"
```

Three levels of caching:
- Prompt-level caching — cache the static instruction prefix. AWS Bedrock supports this natively with ephemeral caching.
- Session-level caching — when processing batches from the same source, cache source-specific schemas across the batch. Companies saw 2-5x throughput improvements.
- Cache breakpoints — cache intermediate results between pipeline steps (extraction → validation → formatting). 15-25% overall time reduction.
The rule: avoid caching ephemeral reasoning or user state changes. Only cache what's truly stable.
The numbers: organizations implementing all three levels saw 30-60% reduction in API costs and 40-70% faster response times.
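As a concrete sketch of the prompt-level pattern, here is how the static prefix can be marked cacheable, assuming an Anthropic-style Messages API where a `cache_control` field on a system block enables ephemeral caching (the model name is a placeholder, and no request is actually sent here):

```python
# Static prefix: identical across requests, so the provider can cache it
SYSTEM_PROMPT = """You are a data extraction expert.
Extract information in the following JSON format...
[detailed schema and rules]"""

def build_request(document_content: str) -> dict:
    """Assemble a request with a cacheable static prefix (sketch only)."""
    return {
        "model": "claude-sonnet",  # placeholder model name
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark prefix cacheable
        }],
        "messages": [{
            # Dynamic part: changes per request, never cached
            "role": "user",
            "content": f"Extract data from: {document_content}",
        }],
    }

request = build_request("...")
```

The key design point is that the cacheable block must be byte-identical between requests; any variable content belongs after it, in the user message.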
Security Layers for AI Systems
The security talks followed a clear pattern: defense in depth, same as traditional systems but with AI-specific concerns.
The Graduated Approach
Layer 1 — Basic protections:
- Rate limits per user and globally. Companies reduced API abuse by 90%+ with proper rate limiting.
- Topic filters — reject documents that don't match expected formats. 95% reduction in processing errors from irrelevant uploads.
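A token bucket is one common way to implement the per-user limits mentioned above. This toy in-process class (`TokenBucket` is my naming, not from the talk) illustrates the idea; real deployments would use a shared store like Redis or an API gateway:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative, in-process only)."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(sum(bucket.allow() for _ in range(20)))  # roughly 10 allowed from a cold burst
```

One bucket per user plus one global bucket covers both limits from the bullet above.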
Layer 2 — Content safety:
- Services like AWS Bedrock Guardrails or NeMo Guardrails
- Detect inappropriate content, malicious prompts, potential security threats
- 80% reduction in security incidents after implementation
Layer 3 — Prompt attack guardrails:
- Catch injection attempts and malicious prompt modifications
- 99%+ attack prevention rate with proper implementation
The layered approach reduced overall risk by 95% in the case studies presented.
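For intuition, a layer-3 check can be as simple as screening prompts for known injection phrasing before they reach the model. This keyword sketch is deliberately naive; production guardrail services use trained classifiers, not pattern lists:

```python
import re

# Toy patterns for common injection phrasings (illustrative, not exhaustive)
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"disregard .{0,30}system prompt",
    r"reveal .{0,30}(system prompt|instructions)",
]

def looks_like_injection(text: str) -> bool:
    """Flag text matching any known injection pattern."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)

print(looks_like_injection("Please ignore all previous instructions"))  # True
print(looks_like_injection("Summarize this invoice"))                   # False
```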
PII/PHI Handling
This was particularly relevant for anyone processing financial or medical documents:
- Redaction — remove entirely. SSN "123-45-6789" becomes "[REDACTED]"
- Anonymization — replace with safe placeholders. "John Smith" becomes "Employee_001"
- Pseudonymization — consistent but unlinkable identifiers. Same person always becomes "ID_7429" across documents
The balance: preserve enough structure for processing while removing identifiers. Companies implementing these strategies saw 50-80% reduction in privacy incidents, and processing time actually improved 20-30% due to simpler data structures.
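The pseudonymization variant can be sketched with a salted hash: the same input always maps to the same opaque ID, but the ID cannot be linked back without the secret. The salt value and `ID_` format here are illustrative, not from the talk:

```python
import hashlib

SALT = b"rotate-me-per-project"  # hypothetical secret, kept out of source control

def pseudonymize(name: str) -> str:
    """Map a name to a consistent but unlinkable identifier."""
    digest = hashlib.sha256(SALT + name.encode("utf-8")).hexdigest()
    return f"ID_{digest[:12]}"

# Same person yields the same ID across documents
assert pseudonymize("John Smith") == pseudonymize("John Smith")
print(pseudonymize("John Smith"))
```

Because the mapping is deterministic, downstream joins and deduplication still work on the pseudonymized data, which is part of why processing can get faster rather than slower.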
Python Performance: 40x with Queue-Based Multiprocessing
The performance talk had the most dramatic benchmarks. The setup: processing large numerical datasets (think 8000×200000 matrices).
The Three Approaches
Sequential — baseline. Simple, debuggable, slow.
Naive multiprocessing — basic multiprocessing.Pool. Better, but hits limitations with memory sharing, recursion depth, and GC pressure.
Queue-based multiprocessing — asynchronous task distribution with proper resource management. This is where the magic happens.
The Benchmarks
| Dataset | Sequential | Naive MP | Queue-based |
|---|---|---|---|
| 8000×200000 | 1097s | 25s | 17.9s (61x) |
| 16000×200000 | 3496s | 84s | 27.9s (125x) |
| 800×20000 | 199s | 5.2s | 3.5s (57x) |
Key Techniques
The queue-based approach combined several optimizations:
- Cross-process variable sharing via `multiprocessing.Manager()` — 50-80% reduction in memory duplication
- Garbage collection optimization — strategic `gc.collect()` calls eliminated memory-related crashes
- Numba JIT compilation — 10-100x speedup for numerical bottlenecks
- Recursion pruning — converting recursive algorithms to iterative, 99% reduction in stack overflow crashes
- Runtime datatype decisions — switching between arrays and sparse matrices based on data density
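The queue-based pattern itself can be sketched in plain Python with `multiprocessing.Queue`: workers pull tasks until they see a sentinel, so work is distributed dynamically instead of in fixed chunks. The sum-of-squares kernel and the `process_chunks` name are mine, standing in for the talk's real numerical workloads:

```python
import multiprocessing as mp

def worker(task_q, result_q):
    """Pull (index, chunk) tasks until a None sentinel arrives."""
    while True:
        item = task_q.get()
        if item is None:
            break
        idx, chunk = item
        # Toy numerical kernel standing in for the real workload
        result_q.put((idx, sum(x * x for x in chunk)))

def process_chunks(chunks, n_workers=4):
    """Fan tasks out to worker processes and gather results in order."""
    task_q = mp.Queue()
    result_q = mp.Queue()
    n_workers = min(n_workers, len(chunks)) or 1
    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for task in enumerate(chunks):
        task_q.put(task)
    for _ in workers:
        task_q.put(None)  # one sentinel per worker
    results = [result_q.get() for _ in chunks]
    for w in workers:
        w.join()
    return [total for _, total in sorted(results)]

if __name__ == "__main__":
    print(process_chunks([list(range(1000)) for _ in range(8)]))
```

Because idle workers immediately grab the next task, uneven chunk sizes don't leave cores waiting, which is a big part of the gap between naive and queue-based multiprocessing.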
The overarching lesson: profile before optimizing. Teams that measured first saw 3x better results than those who guessed at bottlenecks.
DuckDB Is the New Pandas
This wasn't a full talk but came up in multiple sessions: DuckDB is increasingly replacing Pandas for analytical workloads.
Why it's winning:
- Up to 20x faster for data analysis tasks
- Automatically uses all CPU cores
- Handles datasets larger than memory
- SQL syntax — more widely known than Pandas API, and LLMs generate it better
- Runs in browsers via WebAssembly
When to reach for DuckDB over Pandas:
- Datasets over 1GB
- Complex aggregations and joins
- Performance-critical pipelines
- Teams with SQL expertise
I haven't migrated anything yet, but I'm planning to benchmark it against our Pandas-heavy ETL workflows.
My Takeaways
Three things I'm implementing immediately:
- Audit our caching strategy — we already do prompt caching, but session-level caching for batch processing is an easy win we're leaving on the table
- Add formal PII detection — we handle sensitive documents but haven't formalized the anonymization pipeline
- Profile before touching performance code — the 40x improvement story was a good reminder that intuition about bottlenecks is usually wrong
The best production AI systems aren't the ones with the fanciest models. They're the ones with proper caching, security layers, and performance engineering around the model.
Notes and interpretations from PyCon India 2025. Metrics cited are from conference presentations.