Search Engines

  1. Web Crawling
    1. Role of Web Crawlers
      1. Discovering New Content
      2. Updating Existing Content
      3. Monitoring Web Changes
    2. How Crawlers Work
      1. Starting with Seed URLs
        1. Seed List Selection
          1. Importance of Seed Diversity
          2. Quality vs. Quantity
      2. Fetching Page Content
        1. HTTP Requests and Responses
        2. Handling Different Content Types
        3. Dealing with Dynamic Content
          1. JavaScript Rendering
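As a sketch of how a crawler might handle different content types after fetching a page, the snippet below routes a response by its media type. The handler names and the dispatch policy are illustrative assumptions, not a real crawler's API:

```python
# Sketch of per-type handling for fetched responses; the handler labels and
# dispatch rules are hypothetical, chosen only to illustrate the idea.
def media_type(content_type_header: str) -> str:
    """Strip parameters such as '; charset=utf-8' from a Content-Type value."""
    return content_type_header.split(";")[0].strip().lower()

def handle_response(content_type: str, body: bytes) -> str:
    """Route a fetched body to a processing step based on its media type."""
    mt = media_type(content_type)
    if mt == "text/html":
        return "parse-links"      # extract outlinks and feed the frontier
    if mt == "application/pdf":
        return "extract-text"     # hand off to a document-to-text converter
    if mt.startswith("image/"):
        return "skip"             # a text-oriented crawler ignores images
    return "skip"                 # unknown types are skipped by default

print(handle_response("text/html; charset=utf-8", b"<html>...</html>"))  # parse-links
```

Note that dynamic, JavaScript-rendered content would not be covered by a dispatcher like this; it typically requires a headless browser to execute scripts before the HTML can be parsed.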
    3. Crawling Policies and Etiquette
      1. Robots Exclusion Protocol
        1. robots.txt Syntax and Directives
        2. Limitations and Enforcement
        3. User-agent Specifications
      2. Sitemap Protocol
        1. sitemap.xml Structure and Usage
        2. Prioritizing URLs
        3. Sitemap Index Files
      3. Politeness Policies
        1. Rate Limiting
        2. Respecting Server Load
        3. Crawl Delay Implementation
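The robots.txt and crawl-delay checks above can be demonstrated with Python's standard-library `urllib.robotparser`. The robots.txt content and user-agent names are made up for illustration; the parser is fed a string here rather than fetching over HTTP:

```python
import urllib.robotparser

# A sample robots.txt (hypothetical rules), parsed from a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2

User-agent: FastBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Directives are grouped per user-agent; '*' is the fallback group.
print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("FastBot", "https://example.com/anything"))        # False

# A polite crawler sleeps at least this long between requests to the host.
print(rp.crawl_delay("MyCrawler"))  # 2
```

Note that robots.txt is advisory only: nothing enforces it server-side, which is the "limitations and enforcement" point in the outline.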
    4. Crawling Strategies
      1. Breadth-First Crawling
        1. Implementation Details
        2. Advantages and Disadvantages
      2. Depth-First Crawling
        1. Use Cases
        2. Memory Requirements
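The difference between breadth-first and depth-first crawling comes down to how the frontier is consumed: FIFO gives breadth-first order, LIFO gives depth-first. A minimal sketch over a toy in-memory link graph (the URLs are hypothetical stand-ins for fetched pages):

```python
from collections import deque

# Toy link graph standing in for fetched pages (hypothetical URLs).
LINKS = {
    "a.com": ["a.com/1", "a.com/2"],
    "a.com/1": ["a.com/deep"],
    "a.com/2": [],
    "a.com/deep": ["a.com"],  # back-link: the visited set prevents a loop
}

def crawl(seed: str, depth_first: bool = False) -> list[str]:
    """Traverse the link graph; a deque serves as the crawl frontier."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier:
        # popleft() -> FIFO (breadth-first); pop() -> LIFO (depth-first).
        url = frontier.pop() if depth_first else frontier.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("a.com"))                    # breadth-first visit order
print(crawl("a.com", depth_first=True))  # depth-first visit order
```

The memory trade-off in the outline shows up here: breadth-first keeps every discovered-but-unvisited URL queued, so the frontier can grow very large, while depth-first bounds it closer to the depth of the current path.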
      3. Focused Crawling
        1. Topic-Specific Crawling
        2. Relevance Estimation
        3. Classifier Training
      4. Incremental Crawling
        1. Change Detection
        2. Scheduling Re-crawls
        3. Freshness Optimization
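One common basis for incremental crawling's change detection is a content fingerprint: re-fetch a page, hash the body, and re-index only if the hash differs from the last crawl. A minimal sketch, with the per-URL state kept in a plain dict for illustration:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    """Content hash used to detect whether a page changed since the last crawl."""
    return hashlib.sha256(body).hexdigest()

# Hypothetical per-URL state kept between crawl cycles (a real crawler
# would persist this in a datastore).
last_seen: dict[str, str] = {}

def needs_reindex(url: str, body: bytes) -> bool:
    """True when the page is new or its content hash changed."""
    fp = fingerprint(body)
    changed = last_seen.get(url) != fp
    last_seen[url] = fp
    return changed

print(needs_reindex("example.com/a", b"v1"))  # True  (first visit)
print(needs_reindex("example.com/a", b"v1"))  # False (unchanged)
print(needs_reindex("example.com/a", b"v2"))  # True  (content changed)
```

Observed change frequency per URL can then drive re-crawl scheduling: pages that change often are revisited sooner, which is the freshness-optimization idea in the outline.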
    5. Challenges in Crawling
      1. Scale of the Web
        1. Handling Billions of Pages
        2. Distributed Crawling Architecture
        3. Resource Allocation
      2. Handling Duplicate Content
        1. Duplicate Detection Techniques
        2. Canonicalization
        3. Near-duplicate Identification
      3. Deep Web and Dark Web
        1. Limitations of Crawlers
        2. Accessing Non-linked Content
        3. Password-protected Content
      4. Crawler Traps
        1. Infinite Loops
        2. Session IDs and Calendar Traps
        3. Mitigation Strategies
      5. Content Freshness
        1. Detecting Updates
        2. Balancing Freshness and Coverage
        3. Priority-based Crawling
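Near-duplicate identification is often done by comparing shingle sets: break each document into overlapping word k-grams and measure Jaccard similarity between the sets. A small sketch (the sample texts, the shingle size k=3, and the 0.4 threshold are illustrative choices, not canonical values):

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Word-level k-shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

sim = jaccard(shingles(doc1), shingles(doc2))
print(round(sim, 2))
is_near_dup = sim >= 0.4  # hypothetical threshold, tuned per corpus
print(is_near_dup)
```

At web scale, the pairwise comparison is replaced by sketching techniques such as MinHash so that near-duplicate candidates can be found without comparing every document pair.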