[Vector Addition] LeetGPU Problem 1 - Detailed Explanation

This article provides a detailed explanation of the first problem 'Vector Addition' on LeetGPU, covering solution approach and performance optimization strategies.

Preface

This series documents my problem-solving journey on LeetGPU. For each problem, I’ll provide the complete approach from basic implementation to optimization. If you’re new to CUDA, I recommend familiarizing yourself with:

  • CUDA kernel writing (__global__)
  • Thread hierarchy (Grid / Block / Thread)
  • Device memory allocation and transfer

Problem Description

Given two floating-point arrays A and B, add their corresponding elements and store the result in array C:

$$ C_i = A_i + B_i \quad (i = 0, 1, 2, \dots, N-1) $$

Input/Output

  • Input: Device pointers A, B and data size N
  • Output: Device pointer C

Solution v0: Basic Implementation

__global__ void vector_add_v0(const float* A, const float* B, float* C, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx];
}

Code Analysis

  • Each thread handles one element addition
  • One-dimensional thread layout: blockDim.x * blockIdx.x + threadIdx.x computes global index
  • Boundary check if (idx < N) prevents out-of-bounds access

Performance Characteristics

Advantages:

  • Clean implementation, easy to understand
  • Naturally parallel, no data dependencies
  • Coalesced memory access

Disadvantages:

  • Grid configuration tightly coupled with data size
  • Excessive threads increase scheduling overhead

Optimization 1: Grid-Stride Loop

v0’s problem is that the grid configuration is tightly coupled with the data size. To achieve good occupancy, we typically want to fix the launched thread count based on the device’s SM count. A grid-stride loop decouples the two:

__global__ void vector_add_v1(const float* A, const float* B, float* C, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int step = blockDim.x * gridDim.x;  // total thread count
    for (int i = idx; i < N; i += step)
        C[i] = A[i] + B[i];
}

Core Concept

  • Each thread processes multiple data elements, stride = total thread count
  • Any launch configuration can process all data
  • Can even degrade to <<<1, 1>>> for serial execution

Optimization 2: Vectorized Loading

Further reduce loop overhead by processing 4 elements at once:

__global__ void vector_add_v2(const float* A, const float* B, float* C, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int step = blockDim.x * gridDim.x;
    const float4* a4 = (const float4*)A;
    const float4* b4 = (const float4*)B;
    float4* c4 = (float4*)C;
    int N4 = N / 4;

    for (int i = idx; i < N4; i += step) {
        float4 at = a4[i];
        float4 bt = b4[i];
        c4[i] = make_float4(at.x + bt.x, at.y + bt.y, at.z + bt.z, at.w + bt.w);
    }

    // Handle remaining tail elements
    int tail = 4 * N4;
    for (int i = idx + tail; i < N; i += step)
        C[i] = A[i] + B[i];
}

Performance Improvement

In v1, processing 4 elements takes 4 loop iterations, 4 boundary checks, and 4 stride increments. v2 needs only 1 iteration, and each float4 access issues a single 128-bit memory instruction instead of four 32-bit ones, significantly reducing ALU and instruction-issue overhead.

Performance Comparison

N = (1 << 28) + 3

Version    Total Instructions    Relative to v0
v0         268,435,456           100% (baseline)
v1         139,657,216           📉 48% reduction
v2         48,758,784            📉 82% reduction

Since this kernel is completely memory-bandwidth bound, reducing the instruction count does not shorten execution time. For compute-intensive kernels, however, vectorization like v2’s can bring significant performance gains.
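To quantify “memory-bandwidth bound”: each element addition reads 8 bytes (one float from A, one from B), writes 4 bytes to C, and performs a single FLOP, so the arithmetic intensity is

$$ \frac{1\ \text{FLOP}}{12\ \text{bytes}} \approx 0.083\ \text{FLOP/byte} $$

This is far below the ridge point of any modern GPU (roughly on the order of 10 FLOP/byte, a device-dependent estimate), so the memory system, not the ALUs, determines the execution time.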


Summary

This problem is an entry-level GPU parallel computing question. Key points:

  1. Index Calculation - Master one-dimensional thread index mapping
  2. Boundary Check - Prevent out-of-bounds access
  3. Coalesced Access - Consecutive threads access consecutive memory
  4. Grid-Stride Loop - Decouple grid configuration from data size
  5. Vectorization - Reduce loop overhead

More complex optimization techniques will be introduced on this basis.
