Spark SQL and Structured APIs
Structured API Foundation
  Motivation for Structure
    Performance Optimization
    Code Generation
    Query Planning
  API Comparison
    RDD vs DataFrame vs Dataset
    Type Safety Considerations
    Performance Characteristics
DataFrame API
  DataFrame Concepts
    Tabular Data Abstraction
    Schema Definition
    Catalyst Integration
  Schema Management
    StructType Definition
    StructField Configuration
    Data Type System
      Primitive Types
      Complex Types
      Null Handling
  DataFrame Creation
    From RDDs
    From External Sources
    From Collections
    Schema Inference
Dataset API
  Type Safety Features
    Compile-Time Type Checking
    Runtime Type Safety
  Encoder System
    Automatic Encoders
    Custom Encoders
    Serialization Optimization
  DataFrame-Dataset Relationship
    Conversion Methods
    API Compatibility
    Performance Implications
DataFrame and Dataset Operations
  Untyped Transformations
    Column Selection
      select Operations
      withColumn Operations
      drop Operations
    Row Filtering
      filter Operations
      where Operations
    Grouping and Aggregation
      groupBy Operations
      agg Operations
      pivot Operations
    Joining Operations
      Inner Joins
      Outer Joins
      Cross Joins
      Broadcast Joins
    Ordering Operations
      orderBy Operations
      sort Operations
  Typed Transformations
    Functional Operations
      map Operations
      flatMap Operations
      filter Operations
    Grouping Operations
      groupByKey Operations
      mapGroups Operations
      flatMapGroups Operations
  Action Operations
    Data Collection
      show Operations
      collect Operations
      take Operations
    Aggregation Actions
      count Operations
      reduce Operations
    Output Actions
      write Operations
      foreach Operations
SQL Interface
  SQL Query Execution
    SQL Context
    Query Registration
    Result Processing
  View Management
    Temporary Views
    Global Temporary Views
    View Lifecycle
  User-Defined Functions
    UDF Registration
    UDF Usage in SQL
    Performance Considerations
Data Sources and I/O
  Data Source API
    DataFrameReader
    DataFrameWriter
    Options Configuration
  File Formats
    Parquet
      Columnar Storage Benefits
      Schema Evolution
      Predicate Pushdown
    ORC
      Optimization Features
      Compression Options
    JSON
      Schema Inference
      Nested Data Handling
    CSV
      Header Processing
      Type Inference
    Avro
      Schema Registry Integration
      Evolution Support
  External Systems
    File Systems
      HDFS Integration
      S3 Connectivity
      Local File System
    Relational Databases
      JDBC Connectivity
      MySQL Integration
      PostgreSQL Integration
      SQL Server Integration
    NoSQL Systems
      Cassandra Integration
      HBase Connectivity
      MongoDB Support
Query Optimization
  Catalyst Optimizer
    Logical Plan Optimization
    Physical Plan Generation
    Rule-Based Optimization
    Cost-Based Optimization
  Tungsten Execution Engine
    Whole-Stage Code Generation
    Vectorized Processing
    Memory Management
    Binary Processing