Performance Analysis and Optimization
Performance Modeling
  Roofline Model
    Arithmetic Intensity
    Memory Bandwidth Limits
    Compute Limits
  Performance Bounds
    Theoretical Peak Performance
    Memory Bandwidth Limits
    Latency Considerations
  Scalability Analysis
    Strong Scaling
    Weak Scaling
    Efficiency Metrics
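
The Roofline Model and Scalability Analysis topics above reduce to a few short formulas, sketched here in Python. This is a minimal illustration, not a profiling tool: the device figures (10,000 GFLOP/s peak, 500 GB/s bandwidth) and all function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Roofline and scaling sketch. All hardware numbers below are hypothetical.

def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def classify(peak_gflops, bandwidth_gbs, intensity):
    """Label a kernel by which roof limits it at this intensity."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


def strong_scaling_efficiency(t1, tn, n):
    """Fixed total problem size: speedup (t1 / tn) divided by n devices."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1, tn):
    """Problem size grows with n: ideal run time is constant, so t1 / tn."""
    return t1 / tn


# Hypothetical device: 10,000 GFLOP/s peak, 500 GB/s memory bandwidth.
PEAK, BW = 10_000.0, 500.0

# Single-precision SAXPY (y = a*x + y): 2 FLOPs per element and
# 12 bytes moved per element (read x, read y, write y; 4 bytes each).
saxpy_intensity = 2 / 12

if __name__ == "__main__":
    print("ridge point (FLOP/byte):", ridge_point(PEAK, BW))
    print("SAXPY bound (GFLOP/s):", attainable_gflops(PEAK, BW, saxpy_intensity))
    print("SAXPY is", classify(PEAK, BW, saxpy_intensity))
    print("strong scaling on 4 GPUs:", strong_scaling_efficiency(100.0, 30.0, 4))
    print("weak scaling:", weak_scaling_efficiency(10.0, 12.5))
```

Under these assumed numbers SAXPY sits far to the left of the ridge point, so raising the compute roof would not help it; only reducing bytes moved per FLOP or raising bandwidth would.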
Bottleneck Analysis
  Memory-Bound vs. Compute-Bound
    Identification Techniques
    Optimization Strategies
    Trade-off Analysis
  Communication Bottlenecks
    Host-Device Transfer
    Inter-GPU Communication
    Synchronization Overhead
  Resource Utilization
    Occupancy Analysis
    Warp Efficiency
    Memory Throughput
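
For the Occupancy Analysis topic above, the following Python sketch estimates theoretical occupancy: the number of warps resident on one streaming multiprocessor (SM) divided by the maximum the SM can hold. The per-SM limits used as defaults are assumptions for illustration, and the function ignores the register and shared-memory allocation granularity that real hardware applies, so treat the result as a rough upper estimate rather than what a vendor occupancy calculator would report.

```python
# Simplified theoretical-occupancy estimate. The default per-SM limits are
# assumed values for illustration; real limits depend on the GPU.

def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=65536, warp_size=32):
    """Return resident warps / max warps for one SM (0.0 to 1.0)."""
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # How many blocks fit under each independent resource limit.
    limits = [max_blocks_per_sm, max_warps_per_sm // warps_per_block]
    if regs_per_thread > 0:
        regs_per_block = regs_per_thread * warps_per_block * warp_size
        limits.append(regs_per_sm // regs_per_block)
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)

    # The scarcest resource decides how many blocks are resident.
    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    # 256 threads, 32 registers per thread, no shared memory.
    print(theoretical_occupancy(256, 32, 0))
    # Doubling register use halves the number of blocks that fit.
    print(theoretical_occupancy(256, 64, 0))
    # 48 KiB of shared memory per block allows only one resident block.
    print(theoretical_occupancy(256, 32, 48 * 1024))
```

Comparing calls that differ in one argument shows which resource is the limiter for a given launch configuration, which is the usual first question in occupancy analysis.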
Advanced Optimization Techniques
  Kernel Fusion
    Reducing Memory Traffic
    Eliminating Intermediate Results
    Implementation Strategies
  Memory Optimization
    Data Layout Transformation
    Memory Pooling
    Prefetching Strategies
  Instruction-Level Optimization
    Loop Unrolling
    Vectorization
    Instruction Scheduling
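
The Kernel Fusion topic above can be illustrated without a GPU. In this Python sketch, each list comprehension stands in for one elementwise kernel, and the returned counter tallies element reads and writes as a stand-in for global-memory traffic. The function names and the scale-then-add example are hypothetical; the point is only that fusing the two passes removes the intermediate array and halves the counted traffic.

```python
# Kernel-fusion analogy in plain Python. Each list comprehension plays the
# role of one elementwise kernel; "traffic" counts element reads and writes.

def unfused_scale_add(x, a, b):
    """Two passes: scale, then add. Materialises an intermediate list."""
    traffic = 0
    tmp = [a * v for v in x]      # pass 1: read x, write tmp
    traffic += 2 * len(x)
    out = [v + b for v in tmp]    # pass 2: read tmp, write out
    traffic += 2 * len(x)
    return out, traffic


def fused_scale_add(x, a, b):
    """One pass: the scaled value never leaves the loop body."""
    out = [a * v + b for v in x]  # read x, write out
    return out, 2 * len(x)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(unfused_scale_add(data, 2.0, 1.0))
    print(fused_scale_add(data, 2.0, 1.0))
```

Both versions return the same values. For a chain of k elementwise passes over n elements, this counting gives 2kn accesses unfused against 2n fused, which is why fusion pays off most for memory-bound kernels.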