Production · Internal
Distributed Web Crawler
Large-scale distributed crawler processing 10M pages per day across 300+ e-commerce websites using a leader-follower architecture with 400 nodes spanning AWS, GCP, and third-party proxies.
Go · Perl · RabbitMQ · PostgreSQL · Redis · AWS · GCP · Distributed Systems
What it does
A production web crawler built to extract structured product data from 300+ e-commerce websites at scale. One leader node orchestrates work across 400 follower nodes distributed across multiple cloud providers and proxy networks, processing 10 million pages per day.
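The leader/follower split above can be sketched in Go. This is a minimal, in-process stand-in: a buffered channel plays the role of the RabbitMQ work queue, and goroutines play the role of follower nodes. The `CrawlTask` fields and the fake "crawl" step are illustrative, not the production schema.

```go
package main

import (
	"fmt"
	"sync"
)

// CrawlTask is a unit of work the leader hands to a follower.
// The fields are illustrative, not the production schema.
type CrawlTask struct {
	Site string
	URL  string
}

// dispatch fans tasks out to nWorkers followers. A buffered channel
// stands in for the RabbitMQ queue; results flow back on a second channel.
func dispatch(tasks []CrawlTask, nWorkers int) []string {
	queue := make(chan CrawlTask, len(tasks))
	results := make(chan string, len(tasks))
	var wg sync.WaitGroup

	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range queue {
				// A real follower would fetch and parse the page here.
				results <- fmt.Sprintf("worker %d crawled %s", id, t.URL)
			}
		}(w)
	}
	for _, t := range tasks {
		queue <- t
	}
	close(queue)
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	tasks := []CrawlTask{
		{Site: "example-shop.com", URL: "https://example-shop.com/p/1"},
		{Site: "example-shop.com", URL: "https://example-shop.com/p/2"},
	}
	for _, r := range dispatch(tasks, 4) {
		fmt.Println(r)
	}
}
```

In the real system the queue is durable and acknowledgment-based, so a follower crash requeues the task instead of losing it; channels only approximate that behavior.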
Architecture highlights
- Leader-follower architecture: single leader dispatches crawl tasks to 400 worker nodes via RabbitMQ
- Multi-cloud deployment across AWS, GCP, and third-party forward proxy networks for IP diversity
- 10M pages crawled per day with configurable per-site rate limiting and politeness policies
- Adaptive request handling: request pacing, header variation, and fingerprint management
- Intelligent proxy rotation that shifts traffic toward endpoints with higher observed success rates
- Health check system monitoring node availability, success rates, and proxy quality
- PostgreSQL for crawl state management, Redis for distributed caching and rate limiting
Backend patterns
- Go for the high-performance crawler core — concurrent HTTP clients with connection pooling
- Perl for the extraction rule engine — pattern matching across 1000+ site-specific rulesets
- RabbitMQ for reliable task distribution with acknowledgment-based delivery
- Automated QA system detecting data extraction anomalies and flagging degraded sites
- Crawl scheduling with priority queues — high-value sites crawled more frequently
- Cost optimization: automatic decommissioning of low-value sites based on API usage metrics