System Design Cases · 10 lessons · 22 quiz questions

Design: Web Crawler

10-session plan from basic BFS crawling to distributed petabyte-scale web crawling.

What You Will Learn

  • Core Concepts & Requirements
  • URL Frontier Design
  • URL Deduplication
  • Fetcher & robots.txt
  • HTML Parsing & Link Extraction
  • Content Deduplication
  • Data Model & Storage
  • Fault Tolerance & Recrawl
  • Distributed Architecture
  • Mock Interview

Overview

Design: Web Crawler: Requirements & Scope

Functional Requirements

Before designing anything, clarify what the system needs to do:

  • Core features: What are the must-have features?
  • User types: Who uses the system? End users, admins, API consumers?
  • Scale: How many users? What is the read/write ratio?

Non-Functional Requirements

  • Availability: 99.9% (about 8.76 hours of downtime per year) or 99.99%?
  • Latency: < 100ms for reads? < 500ms for writes?
  • Consistency: Strong or eventual? What is the CAP trade-off?
  • Durability: Can we lose data? How much?

Clarifying Questions to Ask

  • What is the expected DAU (Daily Active Users)?
  • What is the read-to-write ratio?
  • Do we need real-time updates?
  • What is our budget constraint?
  • What geography do we serve?

Interview Tip

"I always spend the first 5 minutes clarifying requirements. This shows structured thinking and prevents designing the wrong system. I explicitly separate functional from non-functional requirements."
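
The availability targets listed under the non-functional requirements translate into a yearly downtime budget with simple arithmetic. The sketch below assumes a 365-day year; the function name `downtime_hours_per_year` is illustrative and not part of the course material.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the choice between 99.9% and 99.99% is worth raising during requirements clarification.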

Sample Quiz Questions

1. Why is BFS (Breadth-First Search) preferred over DFS for web crawling?

· Difficulty: easy
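
As background for this question, here is a minimal sketch of breadth-first crawl order. It assumes a hypothetical in-memory link graph (`LINKS`) in place of real fetching, and the function name `bfs_crawl_order` is illustrative only. A FIFO queue is what makes the traversal breadth-first: pages nearest the seed are visited before deeper ones.

```python
from collections import deque


def bfs_crawl_order(seed, links, max_pages=100):
    """Return the order in which pages are visited, breadth-first from seed."""
    frontier = deque([seed])  # FIFO queue: oldest discovered URL goes first
    seen = {seed}             # URLs already queued, to avoid revisiting
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
print(bfs_crawl_order("a", LINKS))  # ['a', 'b', 'c', 'd', 'e']
```

Swapping `popleft()` for `pop()` would turn the same loop into a depth-first traversal, which makes the two strategies easy to compare on a small graph.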

2. What is the purpose of a Bloom filter in a web crawler?

· Difficulty: medium
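
For readers who have not met the data structure, the following is a minimal Bloom filter sketch used as a "have I seen this URL?" set. The sizes (`num_bits`, `num_hashes`) are arbitrary illustrative defaults, and deriving several bit positions from one SHA-256 digest is one common approach, not necessarily the one the course uses. A Bloom filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one bit per slot

    def _positions(self, item):
        # Double hashing: derive num_hashes positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)  # True: added items are always found
```

With these defaults the filter occupies 128 KiB regardless of how many URLs are added; the false-positive rate rises as it fills, so real deployments size it from the expected URL count.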

3. A website generates infinite unique URLs like /products?page=1, /products?page=2 ... /products?page=1000000. How do you handle this?

· Difficulty: hard
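
One possible mitigation, sketched under assumptions: group URLs by host, path, and the names of their query parameters, then cap how many URLs per group the crawler accepts. The class name `TrapGuard` and the cap value are hypothetical; practical crawlers usually combine a rule like this with depth limits and per-host budgets.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


class TrapGuard:
    """Cap the number of accepted URLs that share the same URL pattern."""

    def __init__(self, max_per_pattern=50):
        self.max_per_pattern = max_per_pattern
        self.counts = Counter()

    @staticmethod
    def pattern(url):
        # Ignore parameter values so ?page=1 and ?page=2 share one pattern.
        parts = urlsplit(url)
        names = sorted(
            {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
        )
        return (parts.netloc.lower(), parts.path, tuple(names))

    def allow(self, url):
        key = self.pattern(url)
        if self.counts[key] >= self.max_per_pattern:
            return False  # pattern budget exhausted: skip this URL
        self.counts[key] += 1
        return True


guard = TrapGuard(max_per_pattern=3)
allowed = [
    guard.allow(f"https://example.com/products?page={n}") for n in range(1, 6)
]
print(allowed)  # [True, True, True, False, False]
```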

+ 19 more questions available in the full app.
