Preface
This series documents my problem-solving journey on LeetGPU. For each problem, I’ll provide the complete approach from basic implementation to optimization. If you’re new to CUDA, I recommend familiarizing yourself with:
- CUDA kernel writing (`__global__`)
- Thread hierarchy (Grid / Block / Thread)
- Device memory allocation and transfer
Problem Description
Given two floating-point arrays A and B, add their corresponding elements and store the result in array C:
$$ C_i = A_i + B_i \quad (i = 0, 1, 2, \dots, N-1) $$

Input/Output
- Input: Device pointers `A`, `B` and data size `N`
- Output: Device pointer `C`
Solution v0: Basic Implementation
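A minimal v0 kernel consistent with the analysis below might look like this (a sketch; the kernel name `vector_add` and the launch parameters are illustrative):

```cuda
// v0: one thread per element.
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;  // global index
    if (idx < N) {                                    // boundary check
        C[idx] = A[idx] + B[idx];
    }
}

// Launch sketch: one thread per element, rounding the grid up to cover N.
// int threadsPerBlock = 256;
// int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
// vector_add<<<blocks, threadsPerBlock>>>(A, B, C, N);
```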
Code Analysis
- Each thread handles one element addition
- One-dimensional thread layout: `blockDim.x * blockIdx.x + threadIdx.x` computes the global index
- Boundary check `if (idx < N)` prevents out-of-bounds access
Performance Characteristics
| Advantages | Disadvantages |
|---|---|
| Clean implementation, easy to understand | Grid configuration tightly coupled with data size |
| Naturally parallel, no data dependencies | Excessive threads increase scheduling overhead |
| Coalesced memory access | |
Optimization 1: Grid-Stride Loop
v0's problem is that the grid configuration is tightly coupled with the data size. To achieve good occupancy, we typically want to fix the launched thread count based on the device's SM count. A grid-stride loop fully decouples the two:
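A grid-stride version might be sketched as follows (the kernel name `vector_add_v1` is illustrative):

```cuda
// v1: grid-stride loop — each thread handles multiple elements,
// stepping by the total number of launched threads.
__global__ void vector_add_v1(const float* A, const float* B, float* C, int N) {
    int stride = gridDim.x * blockDim.x;  // total thread count
    for (int idx = blockDim.x * blockIdx.x + threadIdx.x; idx < N; idx += stride) {
        C[idx] = A[idx] + B[idx];
    }
}
```

Because the loop condition handles the boundary, any grid size processes all N elements correctly.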
Core Concept
- Each thread processes multiple data elements, stride = total thread count
- Any launch configuration can process all data
- Can even degrade to `<<<1, 1>>>` for serial execution
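On the host side, the launch configuration can then be sized from the device rather than from N. A sketch (the blocks-per-SM factor of 4 and the kernel name `vector_add_v1` are illustrative assumptions):

```cuda
// Host-side sketch: size the grid from the device's SM count,
// independent of the data size N.
int device;
cudaGetDevice(&device);

int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int threadsPerBlock = 256;
int blocks = 4 * numSMs;  // enough blocks to keep every SM busy (factor is a tuning choice)
vector_add_v1<<<blocks, threadsPerBlock>>>(A, B, C, N);
```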
Optimization 2: Vectorized Loading
Further reduce loop overhead by processing 4 elements at once:
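One plausible implementation uses `float4` loads and stores, with a scalar tail loop for the leftover `N % 4` elements (a sketch; the kernel name `vector_add_v2` is illustrative and assumes the pointers are 16-byte aligned, as `cudaMalloc` guarantees):

```cuda
// v2: vectorized loads/stores — each iteration processes 4 floats via float4.
__global__ void vector_add_v2(const float* A, const float* B, float* C, int N) {
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    int n4 = N / 4;  // number of complete float4 groups
    const float4* A4 = reinterpret_cast<const float4*>(A);
    const float4* B4 = reinterpret_cast<const float4*>(B);
    float4* C4 = reinterpret_cast<float4*>(C);

    // Main loop: one 16-byte load/store per array per iteration.
    for (int i = tid; i < n4; i += stride) {
        float4 a = A4[i];
        float4 b = B4[i];
        C4[i] = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }

    // Tail: handle the remaining N % 4 elements one at a time.
    for (int i = n4 * 4 + tid; i < N; i += stride) {
        C[i] = A[i] + B[i];
    }
}
```

The tail loop matters here: with N = (1 << 28) + 3, three elements fall outside the last complete `float4` group.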
Performance Improvement
In v1, processing 4 elements requires 4 loop iterations, 4 boundary checks, and 4 stride increments; v2 needs only 1 iteration, significantly reducing ALU and instruction-issue overhead.
Performance Comparison
`N = (1 << 28) + 3`
| Version | Total Instructions | Relative to v0 |
|---|---|---|
| v0 | 268,435,456 | 100% (baseline) |
| v1 | 139,657,216 | 📉 48% reduction |
| v2 | 48,758,784 | 📉 82% reduction |
Since this kernel is completely memory-bandwidth bound, instruction count reduction doesn’t improve execution time. However, for compute-intensive kernels, v2 is expected to bring significant performance gains.
Summary
This problem is an entry-level GPU parallel computing question. Key points:
- Index Calculation - Master one-dimensional thread index mapping
- Boundary Check - Prevent out-of-bounds access
- Coalesced Access - Consecutive threads access consecutive memory
- Grid-Stride Loop - Decouple grid configuration from data size
- Vectorization - Reduce loop overhead
More complex optimization techniques will be introduced on this basis.