Batch Data Processing Systems
ETL Process Design
    Data Extraction Methods
        Full Extraction
        Incremental Extraction
        Change Data Capture
        API-Based Extraction
    Data Transformation Techniques
        Data Cleansing Operations
        Data Type Conversions
        Business Rule Implementation
        Data Enrichment Processes
        Aggregation and Summarization
    Data Loading Strategies
        Full Load Operations
        Incremental Load Patterns
        Upsert Operations
        Bulk Loading Techniques
ELT Process Implementation
    Extract and Load First Approach
        Raw Data Preservation
        Storage-First Strategy
    In-Database Transformations
        SQL-Based Transformations
        Stored Procedure Usage
        Database Engine Optimization
    ELT vs. ETL Trade-offs
        Processing Location Decisions
        Resource Utilization Patterns
        Scalability Considerations
Apache Hadoop Ecosystem
    Hadoop Distributed File System
        HDFS Architecture Components
        NameNode and DataNode Roles
        Block Storage and Replication
        Fault Tolerance Mechanisms
        File System Operations
    MapReduce Programming Model
        Map Phase Processing
        Shuffle and Sort Operations
        Reduce Phase Aggregation
        Job Execution Workflow
        Performance Tuning Strategies
    Hadoop Ecosystem Tools
        Apache Hive for SQL Queries
        Apache Pig for Data Flow
        Apache HBase for NoSQL Storage
        Apache Sqoop for Data Transfer
Apache Spark Framework
    Core Spark Concepts
        Resilient Distributed Datasets
        DataFrame API
        Dataset API
        Spark Context and Sessions
    Spark Architecture
        Driver Program Role
        Executor Process Management
        Cluster Manager Integration
        Task Distribution and Execution
    Spark SQL Engine
        Structured Data Processing
        Catalyst Optimizer
        Hive Integration
        JDBC/ODBC Connectivity
    Spark Performance Optimization
        Data Partitioning Strategies
        Caching and Persistence
        Resource Allocation Tuning
        Shuffle Operation Optimization
        Broadcast Variables Usage