Performance Optimization and Profiling
Performance Analysis Methodology
Bottleneck Identification
Performance Metrics
Optimization Workflow
Measurement Techniques
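A common measurement technique is timing kernels with CUDA events, which record timestamps on the GPU itself and so avoid host-side scheduling noise. A minimal sketch (the kernel name `myKernel` is a placeholder, not from the outline):

```cuda
// Sketch: timing a kernel with CUDA events. `myKernel` is a hypothetical
// example kernel used only to have something to time.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // enqueue start marker
    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);                   // enqueue stop marker
    cudaEventSynchronize(stop);              // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

In practice you would launch the kernel once to warm up, then average over several timed runs.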
Memory Optimization
Memory Access Patterns
Coalesced Access Optimization
Stride Minimization
Cache-Friendly Patterns
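The access-pattern distinction above can be sketched with two copy kernels. In the coalesced version, consecutive threads of a warp read consecutive addresses, so the 32 loads combine into a few wide memory transactions; in the strided version each thread touches a different cache line:

```cuda
// Sketch: coalesced vs. strided global memory access.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];        // thread i -> element i: coalesced
}

__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;     // a large stride scatters the warp
        out[j] = in[j];               // across many cache lines
    }
}
```

Profiling both with the same data size makes the cost of the stride directly visible as a drop in memory throughput.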
Memory Bandwidth Utilization
Theoretical vs. Achieved Bandwidth
Memory Throughput Analysis
Bandwidth-Bound vs. Compute-Bound
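Achieved bandwidth is computed from bytes actually moved divided by kernel time, and compared against the device's theoretical peak. A sketch, assuming a simple copy kernel (one read and one write per element):

```cuda
// Sketch: achieved vs. theoretical memory bandwidth for a copy kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);

    // One read + one write per element.
    double achieved = 2.0 * n * sizeof(float) / (ms / 1e3) / 1e9;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is in kHz, memoryBusWidth in bits; factor 2 for DDR.
    double peak = 2.0 * prop.memoryClockRate * 1e3 *
                  (prop.memoryBusWidth / 8.0) / 1e9;
    printf("achieved %.1f GB/s of %.1f GB/s peak (%.0f%%)\n",
           achieved, peak, 100.0 * achieved / peak);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A kernel well below peak despite coalesced access is likely latency- or compute-bound rather than bandwidth-bound.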
Memory Hierarchy Optimization
Cache Utilization
Shared Memory Usage
Register Optimization
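Shared memory usage is classically illustrated by a tiled matrix transpose: the tile is loaded and stored with coalesced global accesses, and the row-wise/column-wise swap happens in on-chip memory. A sketch, assuming the matrix width is a multiple of the tile size:

```cuda
// Sketch: tiled transpose using shared memory. The +1 padding shifts
// each row into a different bank, avoiding 32-way bank conflicts on
// the column-wise reads. Assumes width is a multiple of TILE.
#define TILE 32

__global__ void transposeTiled(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swap block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```

Without the staging tile, either the loads or the stores would be strided by `width` and the kernel would fall far below peak bandwidth.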
Compute Optimization
Occupancy Maximization
Occupancy Definition
Limiting Factors
Occupancy Calculator Usage
Trade-offs Analysis
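Besides the spreadsheet-style occupancy calculator, the runtime API can report occupancy directly for a compiled kernel. A sketch (`myKernel` is a placeholder):

```cuda
// Sketch: querying occupancy through the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *p) { p[threadIdx.x] += 1.0f; }

int main() {
    int blockSize = 256;
    int numBlocks = 0;   // active blocks per SM at this configuration
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, blockSize, 0 /* dynamic shared mem bytes */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = (double)(numBlocks * blockSize) /
                       prop.maxThreadsPerMultiProcessor;
    printf("occupancy at block size %d: %.0f%%\n",
           blockSize, occupancy * 100.0);

    // Or let the runtime suggest a block size that maximizes occupancy:
    int minGridSize = 0, bestBlockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlockSize,
                                       myKernel, 0, 0);
    printf("suggested block size: %d\n", bestBlockSize);
    return 0;
}
```

Note the trade-off: higher occupancy is not always faster, since fewer registers per thread can force spills.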
Instruction Throughput
Warp Scheduling
Latency Hiding
Instruction Mix Optimization
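Latency hiding can also come from instruction-level parallelism within a thread, not just from other warps. A sketch of a grid-stride reduction (a hypothetical example, with tail handling omitted) where four independent accumulators break the dependency chain of a single running sum:

```cuda
// Sketch: exposing instruction-level parallelism. Four independent
// accumulator chains let the scheduler issue loads and adds without
// waiting on a single serial dependency. Tail elements when n is not
// a multiple of 4*stride are omitted for brevity.
__global__ void sumILP(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;  // independent chains
    for (int j = i; j + 3 * stride < n; j += 4 * stride) {
        a0 += in[j];
        a1 += in[j + stride];
        a2 += in[j + 2 * stride];
        a3 += in[j + 3 * stride];
    }
    atomicAdd(out, a0 + a1 + a2 + a3);  // combine (shared-memory reduction
                                        // would be cheaper; kept simple here)
}
```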
Branch Divergence Minimization
Divergence Causes
Mitigation Strategies
Predication Techniques
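The divergence and predication points can be sketched with one kernel written both ways. In the first version, a warp whose elements straddle the threshold executes both branch paths serially; the second computes a single selected value, which the compiler can lower to predicated instructions with no branch at all:

```cuda
// Sketch: branch divergence vs. a branch-free (predicable) rewrite.
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] > 0.0f)            // data-dependent branch: threads in
            data[i] = sqrtf(data[i]);  // one warp may take different paths
        else
            data[i] = 0.0f;
    }
}

__global__ void predicated(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        data[i] = (v > 0.0f) ? sqrtf(v) : 0.0f;  // selection, no divergence
    }
}
```

The bounds check `i < n` also diverges, but only in the last warp, which is why it is generally harmless.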
Profiling Tools and Techniques
NVIDIA Nsight Systems
Timeline Analysis
API Tracing
System-Wide Profiling
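Timeline analysis in Nsight Systems becomes much more readable when application phases are annotated with NVTX ranges, which appear as named bars alongside the kernel and API rows. A sketch (the helper `step` is hypothetical; link with `-lnvToolsExt`):

```cuda
// Sketch: NVTX annotations for the Nsight Systems timeline.
// `step` stands in for application code that launches kernels.
#include <nvToolsExt.h>
#include <cuda_runtime.h>

void step(float *d, int n);   // hypothetical helper launching kernels

void iterate(float *d, int n, int iters) {
    for (int it = 0; it < iters; ++it) {
        nvtxRangePushA("solver step");   // open a named range
        step(d, n);
        nvtxRangePop();                  // close it
    }
}
```

Profiling the annotated binary with `nsys profile ./app` then shows each "solver step" range on the timeline, making gaps between GPU work easy to spot.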
NVIDIA Nsight Compute
Kernel Analysis
Performance Metrics
Roofline Analysis
Command-Line Profiling
ncu Usage
Metric Collection
Automated Analysis
Advanced Optimization Techniques
Asynchronous Operations
CUDA Streams
Concurrent Execution
Memory Transfer Overlap
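Transfer/compute overlap follows a standard pattern: split the data into chunks, give each chunk its own stream, and issue copy-kernel-copy per stream so one chunk's transfer overlaps another's computation. A minimal sketch (`scale` is a placeholder kernel); pinned host memory is required for the async copies to actually overlap:

```cuda
// Sketch: overlapping H2D/D2H copies with kernel execution via streams.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22, chunks = 2, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host buffer
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < chunks; ++c) {
        int off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();                 // wait for both streams

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

On the Nsight Systems timeline, the copy of chunk 1 should appear under the kernel of chunk 0 when overlap is working.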
Multi-GPU Optimization
Load Balancing
Communication Minimization
Scaling Strategies
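The simplest load-balancing strategy is a static split: divide the problem evenly across all visible devices, launch on each, then synchronize them all. A sketch (`scale` is a placeholder kernel; assumes the size divides evenly and at most 8 GPUs):

```cuda
// Sketch: static work partitioning across multiple GPUs.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    int devs = 0;
    cudaGetDeviceCount(&devs);
    if (devs > 8) devs = 8;            // sketch caps at 8 devices
    const int n = 1 << 22;
    int slice = n / devs;              // assumes an even split

    float *buf[8];
    for (int d = 0; d < devs; ++d) {
        cudaSetDevice(d);              // subsequent calls target GPU d
        cudaMalloc(&buf[d], slice * sizeof(float));
        scale<<<(slice + 255) / 256, 256>>>(buf[d], slice);  // async launch
    }
    for (int d = 0; d < devs; ++d) {   // wait for every GPU, then clean up
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(buf[d]);
    }
    return 0;
}
```

Because kernel launches are asynchronous, the first loop returns immediately and all GPUs compute concurrently; dynamic schemes hand out smaller work items as devices finish, at the cost of more coordination traffic.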