Core Spark Concepts
Spark Application Architecture
Driver Program
Driver Responsibilities
SparkContext Management
Cluster Communication
SparkSession
Unified Entry Point
Configuration Management
Resource Access
Cluster Manager Integration
Resource Allocation
Cluster Manager Types
Deployment Coordination
Executors and Workers
Task Execution Model
Memory Management
Resource Utilization
Inter-Executor Communication
Resilient Distributed Datasets
RDD Fundamentals
Core Abstraction Concept
Distributed Collection Model
Immutability Principle
RDD Properties
Fault Tolerance through Lineage
Lazy Evaluation
Partitioning Strategy
Hash Partitioning
Range Partitioning
Custom Partitioning
Persistence Options
RDD Creation Methods
Parallelizing Collections
Local Data Parallelization
Collection Distribution
External Data Sources
Text Files
Sequence Files
Hadoop InputFormats
Database Connections
RDD Operations
Transformations
Narrow Transformations
map Operations
filter Operations
flatMap Operations
sample Operations
union Operations
Wide Transformations
groupByKey Operations
reduceByKey Operations
sortByKey Operations
join Operations
cogroup Operations
repartition Operations
Actions
Collection Actions
collect Operations
take Operations
first Operations
Aggregation Actions
count Operations
reduce Operations
aggregate Operations
Output Actions
saveAsTextFile Operations
foreach Operations
Shared Variables
Broadcast Variables
Read-Only Shared Data
Memory Efficiency
Implementation Patterns
Best Practices
Accumulators
Write-Only Variables
Distributed Counters
Custom Accumulators
Usage Limitations
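
Illustrative sketch: several entries in this outline (Immutability Principle, Lazy Evaluation, Fault Tolerance through Lineage, Transformations, Actions) describe behaviour that is easier to see in code. The block below is a toy, single-process model written in plain Python; it is not Spark's API or implementation. The class name ToyRDD, its methods, and the contiguous chunking scheme are all assumptions made for illustration. It only aims to show that transformations return new objects and record lineage without computing anything, while actions trigger the actual work.

```python
from functools import reduce
from math import ceil


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not part of Spark)."""

    def __init__(self, partitions, lineage):
        # Each partition is a zero-argument function returning an iterator,
        # so no records are produced until an action asks for them.
        self._partitions = partitions
        self.lineage = lineage

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        data = list(data)
        size = ceil(len(data) / num_partitions) if data else 0
        # Split the local collection into contiguous chunks.
        chunks = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
        return cls([lambda c=c: iter(c) for c in chunks], ("parallelize",))

    def _derive(self, name, fn):
        # Transformations build a new ToyRDD; the parent is left untouched.
        parts = [lambda p=p: fn(p()) for p in self._partitions]
        return ToyRDD(parts, self.lineage + (name,))

    # Transformations (lazy): only extend the lineage.
    def map(self, f):
        return self._derive("map", lambda it: (f(x) for x in it))

    def filter(self, pred):
        return self._derive("filter", lambda it: (x for x in it if pred(x)))

    # Actions (eager): walk the lineage and produce a result.
    def collect(self):
        return [x for p in self._partitions for x in p()]

    def count(self):
        return sum(1 for p in self._partitions for _ in p())

    def reduce(self, f):
        return reduce(f, self.collect())


calls = []  # records every element the map function actually touches


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=2)
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)  # still empty: nothing has run yet
result = evens_squared.collect()   # the action triggers the computation
calls_after_collect = list(calls)

print(evens_squared.lineage)  # ('parallelize', 'map', 'filter')
print(calls_before_action)    # []
print(result)                 # [4, 16, 36]
```

In this toy model, calling a second action such as evens_squared.count() reruns square on every element, because nothing is cached. That recomputation from lineage is the behaviour that the Persistence Options entry in the outline is about avoiding.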