Production · Internal

Distributed Web Crawler

Large-scale distributed crawler processing 10M pages per day across 300+ e-commerce websites using a leader-follower architecture with 400 nodes spanning AWS, GCP, and third-party proxies.

Go · Perl · RabbitMQ · PostgreSQL · Redis · AWS · GCP · Distributed Systems

What it does

A production web crawler built to extract structured product data from 300+ e-commerce websites at scale. One leader node orchestrates work across 400 follower nodes distributed across multiple cloud providers and proxy networks, processing 10 million pages per day.
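The leader-follower pattern described above can be sketched in-process: a leader fans crawl tasks out over a shared channel and followers pull work and report results. This is a minimal illustration only; in the real system the channel would be a RabbitMQ queue and the followers separate nodes, and the `Task`/`Dispatch` names are assumptions, not the project's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a single crawl assignment sent from the leader to a follower.
type Task struct {
	URL  string
	Site string
}

// Dispatch fans tasks out to nFollowers workers over a shared channel and
// collects one result per task. An in-process stand-in for queue-based
// distribution: followers pull at their own pace, so faster nodes
// naturally take more work.
func Dispatch(tasks []Task, nFollowers int, crawl func(Task) string) []string {
	work := make(chan Task)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < nFollowers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range work { // pull-based: each follower takes the next task
				results <- crawl(t)
			}
		}()
	}

	go func() {
		for _, t := range tasks {
			work <- t
		}
		close(work)
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	tasks := []Task{{URL: "https://example.com/p/1", Site: "example.com"}}
	out := Dispatch(tasks, 4, func(t Task) string { return "crawled " + t.URL })
	fmt.Println(len(out))
}
```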

Architecture highlights

  • Leader-follower architecture: a single leader dispatches crawl tasks to 400 follower nodes via RabbitMQ
  • Multi-cloud deployment across AWS, GCP, and third-party forward proxy networks for IP diversity
  • 10M pages crawled per day with configurable per-site rate limiting and politeness policies
  • Adaptive request handling: request pacing, header variation, and fingerprint management
  • Intelligent proxy rotation that shifts traffic away from slow or blocked endpoints toward better-performing ones
  • Health check system monitoring node availability, success rates, and proxy quality
  • PostgreSQL for crawl state management, Redis for distributed caching and rate limiting
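The per-site rate limiting above amounts to a token-bucket per site: each site accrues request tokens at its configured rate, up to a burst cap. Below is a minimal in-process sketch of that accounting; in production the counters would live in Redis so all followers share one budget, and the `SiteLimiter`/`Allow` names are illustrative, not the project's actual API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SiteLimiter keeps one token bucket per site. Tokens refill continuously
// at `rate` per second up to `burst`; each request consumes one token.
type SiteLimiter struct {
	mu       sync.Mutex
	rate     float64 // tokens added per second
	burst    float64 // bucket capacity
	buckets  map[string]float64
	lastSeen map[string]time.Time
}

func NewSiteLimiter(rate, burst float64) *SiteLimiter {
	return &SiteLimiter{
		rate:     rate,
		burst:    burst,
		buckets:  map[string]float64{},
		lastSeen: map[string]time.Time{},
	}
}

// Allow reports whether one more request to site fits its budget,
// refilling the bucket based on time elapsed since the last request.
func (l *SiteLimiter) Allow(site string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	tokens, seen := l.buckets[site]
	if !seen {
		tokens = l.burst // first request: bucket starts full
	} else {
		tokens += now.Sub(l.lastSeen[site]).Seconds() * l.rate
		if tokens > l.burst {
			tokens = l.burst
		}
	}
	l.lastSeen[site] = now

	if tokens < 1 {
		l.buckets[site] = tokens
		return false
	}
	l.buckets[site] = tokens - 1
	return true
}

func main() {
	lim := NewSiteLimiter(2, 3) // 2 requests/sec, burst of 3
	fmt.Println(lim.Allow("shop.example"))
}
```

Backing the same counters with Redis (e.g. a Lua script doing the refill-and-decrement atomically) gives every follower a consistent view of each site's remaining budget.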

Backend patterns

  • Go for the high-performance crawler core — concurrent HTTP clients with connection pooling
  • Perl for the extraction rule engine — pattern matching across 1000+ site-specific rulesets
  • RabbitMQ for reliable task distribution with acknowledgment-based delivery
  • Automated QA system detecting data extraction anomalies and flagging degraded sites
  • Crawl scheduling with priority queues — high-value sites crawled more frequently
  • Cost optimization: automatic decommissioning of low-value sites based on API usage metrics