GPU Programming
1. Introduction to Parallel Computing and GPU Architecture
2. GPU Programming Models and APIs
3. Fundamentals of CUDA Programming
4. Intermediate CUDA Programming
5. Performance Optimization and Profiling
6. Advanced CUDA Programming
7. OpenCL Programming
8. Alternative GPU Programming Frameworks
9. Parallel Algorithms and Patterns
10. Applications and Case Studies
11. Performance Analysis and Optimization
12. Debugging and Testing
5. Performance Optimization and Profiling

5.1. Performance Analysis Methodology (sketch below)
  5.1.1. Bottleneck Identification
  5.1.2. Performance Metrics
  5.1.3. Optimization Workflow
  5.1.4. Measurement Techniques
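
A minimal sketch of the measurement techniques in 5.1.4: it times a hypothetical copy kernel with CUDA events and derives achieved memory bandwidth as a basic performance metric (5.1.2). The kernel, sizes, and variable names are assumptions for illustration, not prescribed by this outline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical bandwidth-bound kernel: element-wise copy.
__global__ void copyKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;                       // ~16M floats
    const size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                    // contents left uninitialized; only timing matters here
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    cudaEventRecord(start);
    copyKernel<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Achieved bandwidth: bytes read plus bytes written per unit time.
    double gbps = (2.0 * bytes) / (ms * 1e-3) / 1e9;
    printf("time %.3f ms, achieved bandwidth %.1f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Comparing the printed figure against the device's theoretical peak (topic 5.2.2.1) is one quick way to decide whether a kernel is bandwidth-bound or compute-bound.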
5.2. Memory Optimization (sketch below)
  5.2.1. Memory Access Patterns
    5.2.1.1. Coalesced Access Optimization
    5.2.1.2. Stride Minimization
    5.2.1.3. Cache-Friendly Patterns
  5.2.2. Memory Bandwidth Utilization
    5.2.2.1. Theoretical vs. Achieved Bandwidth
    5.2.2.2. Memory Throughput Analysis
    5.2.2.3. Bandwidth-Bound vs. Compute-Bound
  5.2.3. Memory Hierarchy Optimization
    5.2.3.1. Cache Utilization
    5.2.3.2. Shared Memory Usage
    5.2.3.3. Register Optimization
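
The following sketch illustrates the access-pattern and shared-memory topics under 5.2: the first kernel reads with a coalesced pattern, the second with a large stride, and the third stages a tile in shared memory (with padding to avoid bank conflicts) so that a matrix transpose keeps both its global reads and writes coalesced. Kernel names, the tile size, and the assumption of dimensions divisible by the tile size are illustrative choices.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Coalesced: consecutive threads touch consecutive addresses (5.2.1.1).
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are `stride` elements apart, so each warp
// touches many separate memory segments (5.2.1.2, 5.2.2.2).
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Shared-memory tiled transpose: global reads and writes are both coalesced,
// and the +1 padding avoids shared-memory bank conflicts (5.2.3.2, 5.2.1.3).
// Assumes width and height are multiples of TILE and a (TILE, TILE) block.
__global__ void transposeTiled(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Swap block coordinates for the output tile.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

Running the strided kernel with stride 32 versus stride 1 and measuring achieved bandwidth as in 5.1 makes the cost of uncoalesced access directly visible.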
5.3. Compute Optimization (sketch below)
  5.3.1. Occupancy Maximization
    5.3.1.1. Occupancy Definition
    5.3.1.2. Limiting Factors
    5.3.1.3. Occupancy Calculator Usage
    5.3.1.4. Trade-offs Analysis
  5.3.2. Instruction Throughput
    5.3.2.1. Warp Scheduling
    5.3.2.2. Latency Hiding
    5.3.2.3. Instruction Mix Optimization
  5.3.3. Branch Divergence Minimization
    5.3.3.1. Divergence Causes
    5.3.3.2. Mitigation Strategies
    5.3.3.3. Predication Techniques
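
A minimal sketch of the occupancy topics under 5.3.1, using the CUDA runtime occupancy API rather than the spreadsheet calculator, plus a kernel pair illustrating branch divergence within a warp and a branch-free reformulation in the spirit of 5.3.3. The kernels and the launch configuration are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Divergent: threads in the same warp take different paths (5.3.3.1),
// so the warp executes both paths serially.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;   // even lanes
    else            x[i] = x[i] + 1.0f;   // odd lanes
}

// Branch-free reformulation: both expressions are computed and one is
// selected, a form of predication (5.3.3.3).
__global__ void predicated(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float even = x[i] * 2.0f;
    float odd  = x[i] + 1.0f;
    x[i] = (i % 2 == 0) ? even : odd;
}

int main() {
    // Occupancy calculator usage via the runtime API (5.3.1.3).
    int blockSize = 0, minGridSize = 0, activeBlocks = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, predicated, 0, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, predicated, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (activeBlocks * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("suggested block size %d, theoretical occupancy %.0f%%\n",
           blockSize, occupancy * 100.0f);
    return 0;
}
```

In practice the compiler often predicates small branches like this on its own; the branch-efficiency metrics reported by Nsight Compute (section 5.4) show whether divergence actually costs anything.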
5.4. Profiling Tools and Techniques (sketch below)
  5.4.1. NVIDIA Nsight Systems
    5.4.1.1. Timeline Analysis
    5.4.1.2. API Tracing
    5.4.1.3. System-Wide Profiling
  5.4.2. NVIDIA Nsight Compute
    5.4.2.1. Kernel Analysis
    5.4.2.2. Performance Metrics
    5.4.2.3. Roofline Analysis
  5.4.3. Command-Line Profiling
    5.4.3.1. ncu Usage
    5.4.3.2. Metric Collection
    5.4.3.3. Automated Analysis
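
A small sketch of how a timeline profile (5.4.1) is commonly made more readable: NVTX ranges label phases of the program so they appear as named spans in Nsight Systems. It assumes the NVTX v3 headers that ship with recent CUDA toolkits; the phase names and kernel are illustrative, and the nsys/ncu invocations in the comments are typical examples, not an exhaustive reference.

```cuda
// Build (assumption): nvcc -o app app.cu
// Timeline profile (5.4.1), e.g.:        nsys profile -o report ./app
// Kernel-level analysis (5.4.2, 5.4.3):  ncu --set full ./app
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

__global__ void work(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;

    nvtxRangePushA("allocate");            // named span in the Nsight Systems timeline
    cudaMalloc(&d_x, n * sizeof(float));
    nvtxRangePop();

    nvtxRangePushA("compute");
    work<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d_x);
    return 0;
}
```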
5.5. Advanced Optimization Techniques (sketch below)
  5.5.1. Asynchronous Operations
    5.5.1.1. CUDA Streams
    5.5.1.2. Concurrent Execution
    5.5.1.3. Memory Transfer Overlap
  5.5.2. Multi-GPU Optimization
    5.5.2.1. Load Balancing
    5.5.2.2. Communication Minimization
    5.5.2.3. Scaling Strategies
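
The asynchronous-execution topics under 5.5.1 are sketched below: pinned host memory, two CUDA streams, and chunked cudaMemcpyAsync calls so that the transfers for one chunk can overlap with the kernel for another. The chunk count, sizes, and kernel are illustrative assumptions, and the overlap actually achieved depends on the device's copy engines.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22;
    const int chunks = 4;
    const int chunkN = n / chunks;                 // assumes n divides evenly
    const size_t chunkBytes = chunkN * sizeof(float);

    float *h_x, *d_x;
    cudaMallocHost(&h_x, n * sizeof(float));       // pinned memory, needed for async copies to overlap
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t s = streams[c % 2];
        float* hp = h_x + c * chunkN;
        float* dp = d_x + c * chunkN;

        // Copy in, compute, copy out: chunks are independent, so chunk c's
        // transfers can overlap with chunk c-1's kernel in the other stream.
        cudaMemcpyAsync(dp, hp, chunkBytes, cudaMemcpyHostToDevice, s);
        scale<<<(chunkN + 255) / 256, 256, 0, s>>>(dp, chunkN);
        cudaMemcpyAsync(hp, dp, chunkBytes, cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFreeHost(h_x);
    cudaFree(d_x);
    return 0;
}
```

The Nsight Systems timeline (5.4.1) is the easiest way to confirm that copies and kernels really do overlap rather than serialize.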