Preface
This series documents my problem-solving journey on LeetGPU. For each problem, I’ll provide the complete approach from basic implementation to optimization. If you’re new to CUDA, I recommend familiarizing yourself with:
- CUDA kernel writing (`__global__`)
- Thread hierarchy (Grid / Block / Thread)
- Device memory allocation and transfer
Problem Description
Given two floating-point arrays A and B, add their corresponding elements and store the result in array C:
$$ C_i = A_i + B_i \quad (i = 0, 1, 2, \dots, N-1) $$

Input/Output
- Input: Device pointers `A`, `B` and data size `N`
- Output: Device pointer `C`
Solution v0: Basic Implementation
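A minimal v0 kernel consistent with the analysis below might look like this (a sketch; the kernel name `vector_add` and the launch parameters are illustrative):

```cuda
// v0: one thread per element.
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;  // global index
    if (idx < N) {                                    // boundary check
        C[idx] = A[idx] + B[idx];
    }
}

// Launch sketch: one thread per element, rounding the grid up to cover N.
// int threadsPerBlock = 256;
// int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
// vector_add<<<blocks, threadsPerBlock>>>(A, B, C, N);
```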
Code Analysis
- Each thread handles one element addition
- One-dimensional thread layout: `blockDim.x * blockIdx.x + threadIdx.x` computes the global index
- Boundary check `if (idx < N)` prevents out-of-bounds access
Performance Characteristics
| Advantages | Disadvantages |
|---|---|
| Clean implementation, easy to understand | Grid configuration tightly coupled with data size |
| Naturally parallel, no data dependencies | Excessive threads increase scheduling overhead |
| Coalesced memory access | |
Optimization 1: Grid-Stride Loop
v0's problem is that the grid configuration is tightly coupled with the data size. To achieve good occupancy, we typically want to fix the launched thread count based on the device's SM count. A grid-stride loop fully decouples the two:
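A grid-stride version might be sketched as follows (the kernel name `vector_add_v1` is illustrative):

```cuda
// v1: grid-stride loop — each thread handles multiple elements,
// stepping by the total number of launched threads.
__global__ void vector_add_v1(const float* A, const float* B, float* C, int N) {
    int stride = gridDim.x * blockDim.x;  // total thread count
    for (int idx = blockDim.x * blockIdx.x + threadIdx.x; idx < N; idx += stride) {
        C[idx] = A[idx] + B[idx];
    }
}
```

Because the loop condition handles the boundary, any grid size processes all N elements correctly.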
Core Concept
- Each thread processes multiple data elements, stride = total thread count
- Any launch configuration can process all data
- Can even degrade to `<<<1, 1>>>` for serial execution
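On the host side, the launch configuration can then be sized from the device rather than from N. A sketch (the blocks-per-SM factor of 4 and the kernel name `vector_add_v1` are illustrative assumptions):

```cuda
// Host-side sketch: size the grid from the device's SM count,
// independent of the data size N.
int device;
cudaGetDevice(&device);

int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int threadsPerBlock = 256;
int blocks = 4 * numSMs;  // enough blocks to keep every SM busy (factor is a tuning choice)
vector_add_v1<<<blocks, threadsPerBlock>>>(A, B, C, N);
```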
Optimization 2: Vectorized Loading
Further reduce loop overhead by processing 4 elements at once:
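One plausible implementation uses `float4` loads and stores, with a scalar tail loop for the leftover `N % 4` elements (a sketch; the kernel name `vector_add_v2` is illustrative and assumes the pointers are 16-byte aligned, as `cudaMalloc` guarantees):

```cuda
// v2: vectorized loads/stores — each iteration processes 4 floats via float4.
__global__ void vector_add_v2(const float* A, const float* B, float* C, int N) {
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    int n4 = N / 4;  // number of complete float4 groups
    const float4* A4 = reinterpret_cast<const float4*>(A);
    const float4* B4 = reinterpret_cast<const float4*>(B);
    float4* C4 = reinterpret_cast<float4*>(C);

    // Main loop: one 16-byte load/store per array per iteration.
    for (int i = tid; i < n4; i += stride) {
        float4 a = A4[i];
        float4 b = B4[i];
        C4[i] = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }

    // Tail: handle the remaining N % 4 elements one at a time.
    for (int i = n4 * 4 + tid; i < N; i += stride) {
        C[i] = A[i] + B[i];
    }
}
```

The tail loop matters here: with N = (1 << 28) + 3, three elements fall outside the last complete `float4` group.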
Performance Improvement
In v1, processing 4 elements requires 4 loop iterations, 4 boundary checks, and 4 stride increments; v2 needs only 1 iteration, significantly reducing ALU and instruction-issue overhead.
Performance Comparison
`N = (1 << 28) + 3`
| Version | Total Instructions | Relative to v0 |
|---|---|---|
| v0 | 268,435,456 | 100% (baseline) |
| v1 | 139,657,216 | 📉 48% reduction |
| v2 | 48,758,784 | 📉 82% reduction |
Since this kernel is completely memory-bandwidth bound, instruction count reduction doesn’t improve execution time. However, for compute-intensive kernels, v2 is expected to bring significant performance gains.
Summary
This problem is an entry-level GPU parallel computing question. Key points:
- Index Calculation - Master one-dimensional thread index mapping
- Boundary Check - Prevent out-of-bounds access
- Coalesced Access - Consecutive threads access consecutive memory
- Grid-Stride Loop - Decouple grid configuration from data size
- Vectorization - Reduce loop overhead
More complex optimization techniques will be introduced on this basis.