Search Engines
1. Introduction to Search Engines
2. Web Crawling
3. Indexing
4. Query Processing and Information Retrieval
5. Ranking Algorithms
6. Search Engine Architecture and Infrastructure
7. Search User Interface and Experience
8. Search Engine Optimization
9. Business and Societal Impact
10. Future of Search

2. Web Crawling

Role of Web Crawlers
- Discovering New Content
- Updating Existing Content
- Monitoring Web Changes

How Crawlers Work

Starting with Seed URLs
- Seed List Selection
- Importance of Seed Diversity
- Quality vs. Quantity

Following Hyperlinks
- Link Extraction
- Link Prioritization
- Link Graph Construction

Fetching Page Content
- HTTP Requests and Responses
- Handling Different Content Types
- Dealing with Dynamic Content
- JavaScript Rendering
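
To make these steps concrete, here is a minimal sketch (standard library only) that starts from a seed list, fetches pages over HTTP, extracts hyperlinks, and feeds them back into the frontier. The seed URL, user-agent string, and page limit are illustrative placeholders, not values from this outline.

```python
# Minimal fetch-and-extract sketch: start from seed URLs, fetch pages over
# HTTP, extract hyperlinks, and add newly discovered URLs to the frontier.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags (the link-extraction step)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Issue an HTTP GET and return the decoded body (non-HTML is skipped)."""
    req = Request(url, headers={"User-Agent": "toy-crawler/0.1"})
    with urlopen(req, timeout=10) as resp:
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return ""
        return resp.read().decode("utf-8", errors="replace")


def crawl(seeds, max_pages=10):
    """Walk outward from the seed list, following extracted links."""
    frontier = deque(seeds)
    seen = set(seeds)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable host, timeout, HTTP error, ...
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen


if __name__ == "__main__":
    print(crawl(["https://example.com/"]))
```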

Crawling Policies and Etiquette

Robots Exclusion Protocol
- robots.txt Syntax and Directives
- Limitations and Enforcement
- User-agent Specifications
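
The directives above can be sketched with a made-up robots.txt parsed by Python's urllib.robotparser. As the "Limitations and Enforcement" item suggests, the protocol is purely advisory: a polite crawler checks it before every fetch, but nothing technically forces compliance. The file contents and user-agent names are invented for illustration.

```python
# Hypothetical robots.txt illustrating common directives: per-user-agent
# groups, Allow/Disallow rules, Crawl-delay, and a Sitemap hint.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("MyCrawler", "https://example.com/private/x"))       # False
print(parser.can_fetch("MyCrawler", "https://example.com/private/press/"))  # True
print(parser.can_fetch("BadBot", "https://example.com/"))                    # False
print(parser.crawl_delay("MyCrawler"))                                       # 5
```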

Sitemap Protocol
- sitemap.xml Structure and Usage
- Prioritizing URLs
- Sitemap Index Files
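
A hypothetical sitemap.xml, parsed with the standard library, sketches the per-URL fields a crawler may use when prioritizing URLs; the addresses, dates, and priority values are placeholders. Large sites split their URL lists across several such files and point to them from a sitemap index file.

```python
# Hypothetical sitemap.xml showing the protocol's per-URL fields:
# <loc> (required), plus optional <lastmod>, <changefreq>, and <priority>.
import xml.etree.ElementTree as ET

SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/archive/2019/</loc>
    <lastmod>2019-06-01</lastmod>
    <changefreq>yearly</changefreq>
    <priority>0.2</priority>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    print(loc, priority, lastmod)
```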

Politeness Policies
- Rate Limiting
- Respecting Server Load
- Crawl Delay Implementation
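
One way to implement rate limiting and crawl delays is to record the last request time per host and sleep until the configured delay has elapsed. A rough sketch; the delay values are arbitrary assumptions rather than recommendations.

```python
# Per-host politeness sketch: enforce a minimum gap between requests to the
# same host so the crawler does not overload any single server.
import time
from urllib.parse import urlparse


class PolitenessThrottle:
    def __init__(self, default_delay=2.0):
        self.default_delay = default_delay  # seconds between hits to one host
        self.last_hit = {}                  # host -> monotonic timestamp

    def wait(self, url, delay=None):
        """Block until it is polite to fetch `url`, then record the hit."""
        host = urlparse(url).netloc
        delay = self.default_delay if delay is None else delay
        now = time.monotonic()
        ready_at = self.last_hit.get(host, 0.0) + delay
        if now < ready_at:
            time.sleep(ready_at - now)
        self.last_hit[host] = time.monotonic()


# Usage sketch: honor a robots.txt Crawl-delay when one is given.
throttle = PolitenessThrottle()
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait(url, delay=5)  # e.g. delay taken from robots.txt
    # the actual HTTP fetch would go here
```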

Crawling Strategies

Breadth-First Crawling (frontier sketch below)
- Implementation Details
- Advantages and Disadvantages

Depth-First Crawling
- Use Cases
- Memory Requirements
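
The frontier data structure is what separates these two strategies: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order, and everything else stays the same. The toy link graph below stands in for fetched pages and is invented for illustration.

```python
# Frontier sketch: breadth-first vs. depth-first crawling differ only in
# which end of the frontier the next URL is taken from.
from collections import deque


def crawl_order(link_graph, seeds, breadth_first=True):
    """Return the visit order over a toy link graph (dict: url -> out-links)."""
    frontier = deque(seeds)
    visited, order = set(seeds), []
    while frontier:
        # BFS: FIFO queue (popleft). DFS: LIFO stack (pop).
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order


graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
}
print(crawl_order(graph, ["a"], breadth_first=True))   # ['a', 'b', 'c', 'd', 'e']
print(crawl_order(graph, ["a"], breadth_first=False))  # ['a', 'c', 'e', 'b', 'd']
```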

Focused Crawling
- Topic-Specific Crawling
- Relevance Estimation
- Classifier Training

Incremental Crawling
- Change Detection
- Scheduling Re-crawls
- Freshness Optimization
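
Change detection for incremental crawling can be as simple as remembering a content fingerprint per URL and comparing it on the next visit; conditional HTTP requests (If-Modified-Since or If-None-Match) achieve a similar effect at the protocol level. A rough sketch; the helper names in the usage comment are hypothetical.

```python
# Change-detection sketch for incremental crawling: fingerprint each page
# and only reprocess it when the fingerprint changes between visits.
import hashlib

fingerprints = {}  # url -> SHA-256 digest of the last content seen


def has_changed(url, content):
    """Return True if `content` differs from what was last stored for `url`."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = fingerprints.get(url) != digest
    fingerprints[url] = digest
    return changed


# Usage sketch (fetch, reindex, and lengthen_interval are placeholders):
# if has_changed(url, fetch(url)):
#     reindex(url)             # page changed, so schedule re-indexing
# else:
#     lengthen_interval(url)   # unchanged, so re-crawl it less often
```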

Challenges in Crawling

Scale of the Web
- Handling Billions of Pages
- Distributed Crawling Architecture
- Resource Allocation

Handling Duplicate Content
- Duplicate Detection Techniques
- Canonicalization
- Near-duplicate Identification
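
Exact duplicates can be caught by hashing whole pages, but near-duplicates need a similarity measure. The sketch below uses word shingles and Jaccard similarity, one common approach; the example texts and the threshold mentioned in the comment are arbitrary assumptions.

```python
# Near-duplicate sketch: compare pages by the Jaccard similarity of their
# word shingles (overlapping k-word sequences) instead of exact hashes.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "web crawlers discover new content by following hyperlinks"
page_b = "web crawlers discover fresh content by following hyperlinks"
print(round(jaccard(shingles(page_a), shingles(page_b)), 2))
# Pages scoring above some threshold (say 0.9) are treated as near-duplicates
# and only the canonical version is kept for indexing.
```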

Deep Web and Dark Web
- Limitations of Crawlers
- Accessing Non-linked Content
- Password-protected Content

Crawler Traps
- Infinite Loops
- Session IDs and Calendar Traps
- Mitigation Strategies
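
Two of the mitigation strategies above can be sketched directly: normalize away session-style query parameters so one page does not appear under endless URL variants, and cap URL path depth so calendar-like pages cannot trap the crawler in an infinite chain of "next" links. The parameter names and depth limit below are illustrative assumptions.

```python
# Trap-mitigation sketch: strip session-style query parameters during URL
# normalization and reject URLs whose path depth exceeds a per-site limit.
from urllib.parse import urlparse, urlencode, urlunparse, parse_qsl

SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}  # illustrative
MAX_PATH_DEPTH = 8                                                # illustrative


def normalize(url):
    """Drop session-style parameters so one page yields one canonical URL."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))


def looks_like_trap(url):
    """Reject suspiciously deep paths (e.g. endless calendar navigation)."""
    depth = len([p for p in urlparse(url).path.split("/") if p])
    return depth > MAX_PATH_DEPTH


print(normalize("https://example.com/item?id=7&sessionid=abc123"))
# -> https://example.com/item?id=7
print(looks_like_trap("https://example.com/cal/2024/01/02/03/04/05/06/07/08/"))
# -> True
```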

Content Freshness
- Detecting Updates
- Balancing Freshness and Coverage
- Priority-based Crawling
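
Balancing freshness against coverage is often framed as priority-based scheduling: each URL gets a next-due time based on how often it changes and how much it matters, and the crawler always refreshes whichever URL is due next. A minimal sketch with an invented scoring formula; the URLs, rates, and weights are assumptions for illustration.

```python
# Priority-based re-crawl sketch: a heap ordered by the time each URL is
# next "due", where fast-changing or important pages come due sooner.
import heapq
import time


def recrawl_interval(change_rate, importance):
    """Shorter interval for pages that change often or are important."""
    base = 24 * 3600  # one day, an arbitrary baseline
    return base / (1 + 10 * change_rate) / (1 + importance)


schedule = []  # heap of (due_timestamp, url)
now = time.time()
for url, change_rate, importance in [
    ("https://example.com/news", 0.9, 1.0),    # changes constantly
    ("https://example.com/about", 0.01, 0.2),  # almost never changes
]:
    heapq.heappush(schedule, (now + recrawl_interval(change_rate, importance), url))

# The crawl loop always takes whichever URL is due next, so fast-changing
# pages are refreshed often without starving the rest of the frontier.
due_at, url = heapq.heappop(schedule)
print(url, "due in", round(due_at - now), "seconds")
```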