Kernel Programming

Launch Configuration

While an almost arbitrarily large number of workitems can be executed per kernel launch, the hardware can only support executing a limited number of wavefronts at one time.

To alleviate this, the compiler calculates the "occupancy" of each compiled kernel (the number of wavefronts that can execute simultaneously on the GPU) and passes this information to the hardware; the hardware then launches a limited number of wavefronts at once, based on the kernel's occupancy values.

The rest of the wavefronts are not launched until hardware resources become available, which means that a kernel with better occupancy will see more of its wavefronts executing simultaneously (which often leads to better performance). Suffice it to say, knowing the occupancy of your kernels is important for getting the best performance.

Like CUDA.jl, AMDGPU.jl has the ability to calculate kernel occupancy, with the launch_configuration function:

julia
kernel = @roc launch=false mykernel(args...)
occupancy = AMDGPU.launch_configuration(kernel)
@show occupancy.gridsize
@show occupancy.groupsize

Specifically, launch_configuration calculates the occupancy of mykernel(args...), and then calculates an optimal groupsize based on the occupancy. This value can then be used to select the groupsize for the kernel:

julia
@roc groupsize=occupancy.groupsize mykernel(args...)
AMDGPU.@roc Macro
julia
@roc [kwargs...] func(args...)

High-level interface for launching kernels on the GPU. Upon the first call the kernel will be compiled; subsequent calls will re-use the compiled object.

Several keyword arguments are supported:

  • launch::Bool = true: whether to launch the kernel. If false, then returns a compiled kernel which can be launched by calling it and passing arguments.

  • Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.

  • Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.

source
AMDGPU.Runtime.HIPKernel Type
julia
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch a compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

  • gridsize::ROCDim = 1: Size of the grid.

  • groupsize::ROCDim = 1: Size of the workgroup.

  • shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.

  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
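
Putting these together, a kernel compiled with launch=false can be launched later with explicit launch parameters (a sketch, reusing the mykernel and args placeholders from the examples above):

julia
kernel = @roc launch=false mykernel(args...)
# Launch on the current stream with an explicit grid/group size.
kernel(args...; gridsize=4, groupsize=256, stream=AMDGPU.stream())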

source
AMDGPU.Compiler.hipfunction Function
julia
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile Julia function f to a HIP kernel given a tuple tt of the argument types that it accepts.

The following kwargs are supported:

  • name::Union{String, Nothing} = nothing: A unique name to give a compiled kernel.

  • unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
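
Since @roc forwards compilation keywords to hipfunction, these can be set at the launch site. A sketch (assuming a kernel mykernel that uses floating-point atomics and needs strict CAS-loop semantics):

julia
@roc groupsize=256 unsafe_fp_atomics=false mykernel(args...)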

source

Atomics

AMDGPU.jl relies on Atomix.jl for atomics.

Example of a kernel that computes atomic max:

julia
using AMDGPU

function ker_atomic_max!(target, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    idx = indices[i]
    v = source[i]
    AMDGPU.@atomic max(target[idx], v)
    return
end

n, bins = 1024, 32
source = ROCArray(rand(UInt32, n))
indices = ROCArray(rand(1:bins, n))
target = ROCArray(zeros(UInt32, bins))
@roc groupsize=256 gridsize=4 ker_atomic_max!(target, source, indices)
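
To verify the result, the same reduction can be computed on the CPU and compared (a host-side sanity check added here for illustration; it is not part of the original example):

julia
expected = zeros(UInt32, bins)
for (i, v) in zip(Array(indices), Array(source))
    expected[i] = max(expected[i], v)
end
@assert Array(target) == expected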

Wave Matrix Multiply Accumulate (WMMA)

Performs the computation D = A ⋅ B + C. Currently only RDNA 3 is supported, and only the following types:

  • FP16 ⋅ FP16 + FP32 -> FP32;

  • BFP16 ⋅ BFP16 + FP32 -> FP32.

All WMMA functionality is in the AMDGPU.Device.WMMA submodule. The tile dimensions are fixed at 16×16×16 (WMMA.M, WMMA.N, WMMA.K).

Layout types

Two layout types control how matrices are read from and written to memory:

  • WMMA.ColMajor — column-major (Julia/Fortran) order: element (row, col) is at ptr[col * stride + row].

  • WMMA.RowMajor — row-major (C) order: element (row, col) is at ptr[row * stride + col].

API

AMDGPU.Device.WMMA.Fragment Type
julia
Fragment{M, N, T, L}

A fragment of a matrix for WMMA operations.

  • M, N: logical matrix dimensions this fragment represents a piece of

  • T: element type

  • L: number of elements stored per thread

For wave32 mode on RDNA 3:

  • A fragment (16xK): 16 elements per thread (8 VGPRs for FP16)

  • B fragment (Kx16): 16 elements per thread (8 VGPRs for FP16)

  • C/D fragment (16x16): 8 elements per thread (8 VGPRs for FP32, or 8 VGPRs holding 16 FP16)

source
AMDGPU.Device.WMMA.fill_c Function
julia
fill_c(::Type{Float32}, x::Float32)

Create and return a C fragment filled with the given value x.

source
AMDGPU.Device.WMMA.load_a Function
julia
load_a(ptr::LLVMPtr{T}, stride::Int32, layout) where T

Load matrix A (M×K) from memory and return the resulting fragment. stride is the leading dimension in number of elements.

  • ColMajor: column-major storage, ptr[col * stride + row]

  • RowMajor: row-major storage, ptr[row * stride + col]

source
AMDGPU.Device.WMMA.load_b Function
julia
load_b(ptr::LLVMPtr{T}, stride::Int32, layout) where T

Load matrix B (K×N) from memory and return the resulting fragment. stride is the leading dimension in number of elements.

  • ColMajor: column-major storage, ptr[col * stride + row]

  • RowMajor: row-major storage, ptr[row * stride + col]

source
AMDGPU.Device.WMMA.load_c Function
julia
load_c(ptr::LLVMPtr{T}, stride::Int32, layout) where T

Load matrix C (M×N) from memory and return a FragmentC_F32. stride is the leading dimension in number of elements. T may be Float32, Float16, or BFloat16; non-Float32 values are widened to Float32 on load.

  • ColMajor: column-major storage, ptr[col * stride + row]

  • RowMajor: row-major storage, ptr[row * stride + col]

source
AMDGPU.Device.WMMA.store_d Function
julia
store_d(ptr::LLVMPtr{T}, frag::FragmentC_F32, stride::Int32, layout) where T

Store the result matrix D to the memory location given by ptr. T may be Float32, Float16, or BFloat16; fragment values are narrowed from Float32 on store.

Arguments

  • ptr: Address to store the matrix to.

  • frag: Corresponding fragment.

  • stride: Leading dimension of the matrix for ptr in number of elements.

  • layout: ColMajor (default) or RowMajor.

source
AMDGPU.Device.WMMA.mma Function
julia
mma(
    a::FragmentA{T}, b::FragmentB{T}, c::FragmentC_F32,
) where T <: Union{Float16, BFloat16}

Perform matrix multiply-accumulate operation D = A ⋅ B + C with loaded fragments. A and B can be either in Float16 or in BFloat16.

source

load_c and store_d accept pointer types Float32, Float16, and BFloat16. When T is Float16 or BFloat16, values are widened to Float32 on load and narrowed back on store, so the FragmentC_F32 accumulator type is always Float32 regardless of the backing buffer type.
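
For example, the accumulator can stay in Float32 while the C and D buffers are half precision (a hedged sketch; a_frag, b_frag, c_ptr, d_ptr, and stride are assumed to be set up as in the full example below, with c_ptr and d_ptr pointing to Float16 storage):

julia
c_frag = WMMA.load_c(c_ptr, stride, WMMA.ColMajor)  # Float16 widened to Float32 on load
c_frag = WMMA.mma(a_frag, b_frag, c_frag)           # accumulate in Float32
WMMA.store_d(d_ptr, c_frag, stride, WMMA.ColMajor)  # narrowed back to Float16 on store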

Example

Below is a matrix multiplication kernel using WMMA with column-major inputs. Pass WMMA.RowMajor instead to load from row-major (C-style) buffers.

julia
using AMDGPU
using AMDGPU.Device: WMMA

function wmma_kernel!(C, A::AbstractArray{T}, B, M::Int32, N::Int32, K::Int32, layout) where T
    tile_row = (workgroupIdx().x - Int32(1)) * Int32(WMMA.M)
    tile_col = (workgroupIdx().y - Int32(1)) * Int32(WMMA.N)

    C_ptr = pointer(C)
    A_ptr = pointer(A)
    B_ptr = pointer(B)

    c_frag = WMMA.fill_c(Float32, 0f0)
    k = Int32(0)
    while k < K
        a_ptr, a_stride = _a_tile(A_ptr, layout, tile_row, k, M, K, T)
        b_ptr, b_stride = _b_tile(B_ptr, layout, tile_col, k, N, K, T)

        a_frag = WMMA.load_a(a_ptr, a_stride, layout)
        b_frag = WMMA.load_b(b_ptr, b_stride, layout)
        c_frag = WMMA.mma(a_frag, b_frag, c_frag)

        k += Int32(WMMA.K)
    end

    c_ptr = C_ptr + (tile_col * M + tile_row) * Int32(sizeof(Float32))
    WMMA.store_d(c_ptr, c_frag, M, WMMA.ColMajor)
    return
end

# Tile pointer + stride helpers — dispatched on layout, DCE'd by the compiler.
_a_tile(ptr, ::Type{WMMA.ColMajor}, tile_row, k, M, K, ::Type{T}) where T =
    ptr + (k * M + tile_row) * Int32(sizeof(T)), M
_a_tile(ptr, ::Type{WMMA.RowMajor}, tile_row, k, M, K, ::Type{T}) where T =
    ptr + (tile_row * K + k) * Int32(sizeof(T)), K

_b_tile(ptr, ::Type{WMMA.ColMajor}, tile_col, k, N, K, ::Type{T}) where T =
    ptr + (tile_col * K + k) * Int32(sizeof(T)), K
_b_tile(ptr, ::Type{WMMA.RowMajor}, tile_col, k, N, K, ::Type{T}) where T =
    ptr + (k * N + tile_col) * Int32(sizeof(T)), N

M, N, K = 32, 32, 32
A_host = Float16.(rand(M, K))
B_host = Float16.(rand(K, N))
A, B = ROCArray(A_host), ROCArray(B_host)
C = ROCArray(zeros(Float32, M, N))

tiles_m, tiles_n = M ÷ WMMA.M, N ÷ WMMA.N
@roc gridsize=(tiles_m, tiles_n) groupsize=32 wmma_kernel!(
    C, A, B, Int32(M), Int32(N), Int32(K), WMMA.ColMajor)

@assert maximum(abs.(Float32.(C) .- (Float32.(A) * Float32.(B)))) < 0.1

Device Intrinsics

Wavefront-Level Primitives

AMDGPU.Device.wavefrontsize Function
julia
wavefrontsize()::Cuint

Get the wavefront size of the device that executes the current kernel.

source
AMDGPU.Device.activelane Function
julia
activelane()::Cuint

Get the ID of the current lane within a wavefront (warp).

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = i
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Array(x)
1×8 Matrix{Int32}:
 0  1  2  3  4  5  6  7
source
AMDGPU.Device.ballot Function
julia
ballot(predicate::Bool)::UInt64

Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.

julia
julia> function ker!(x)
           x[1] = AMDGPU.Device.ballot(true)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Culong}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 0x00000000ffffffff
source
AMDGPU.Device.ballot_sync Function
julia
ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.ballot_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt64}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
source
AMDGPU.Device.activemask Function
julia
activemask()::UInt64

Get the mask of all active lanes in a warp.
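
A minimal sketch (an illustrative kernel, not from the original docstring; with 8 launched workitems, expect the low 8 bits of the mask to be set):

julia
function ker!(x)
    x[1] = AMDGPU.Device.activemask()
    return
end

x = ROCArray{UInt64}(undef, 1)
@roc groupsize=8 ker!(x)
bitstring(Array(x)[1]) # low 8 bits set when 8 lanes are active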

source
AMDGPU.Device.bpermute Function
julia
bpermute(addr::Integer, val::Cint)::Cint

Read data stored in val from the lane VGPR (vector general purpose register) given by addr.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.

Example below shifts all values in the wavefront by 1 to the "left".

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           # `addr` points to the next immediate lane.
           addr::Cint = ((i + 0x1) % 0x8) * 0x4 # VGPRs are 4 bytes wide
           # Read data from the next immediate lane.
           x[i + 0x1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0
source
AMDGPU.Device.permute Function
julia
permute(addr::Integer, val::Cint)::Cint

Put data stored in val to the lane VGPR (vector general purpose register) given by addr.

Example below shifts all values in the wavefront by 1 to the "right".

julia
julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
           # Put data into the next immediate lane.
           x[i + 1] = AMDGPU.Device.permute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 7  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl Function
julia
shfl(val, lane::Cint, width::Cuint = wavefrontsize())

Read data stored in val from another lane (a higher-level op than bpermute).

If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           x[i + 0x1] = AMDGPU.Device.shfl(i, i + 0x1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0

If width is less than wavefront size then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           ws::Cuint = 0x4 # <-- Notice width = 4.
           x[i + 0x1] = AMDGPU.Device.shfl(i, i + 0x1, ws)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  0  5  6  7  4
source
AMDGPU.Device.shfl_sync Function
julia
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane ID.
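
A hedged sketch mirroring the shfl example above, but with an explicit full participation mask (an illustrative kernel, not from the original docstring):

julia
function ker!(x)
    i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
    mask = typemax(UInt64) # all lanes participate
    x[i + 0x1] = AMDGPU.Device.shfl_sync(mask, i, i + 0x1)
    return
end

x = ROCArray{Cint}(undef, 1, 8)
@roc groupsize=8 ker!(x)
# Expected to match the shfl example: each lane reads from the next lane.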

source
AMDGPU.Device.shfl_up Function
julia
shfl_up(val, δ::Cint, width::Cuint = wavefrontsize())

Same as shfl, but instead of specifying a lane ID, accepts δ, which is subtracted from the current lane ID; i.e. read from a lane with a lower ID relative to the caller.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           x[i + 0x1] = AMDGPU.Device.shfl_up(i, Cint(0x1))
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 0  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl_up_sync Function
julia
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with lower ID relative to the caller.

source
AMDGPU.Device.shfl_down Function
julia
shfl_down(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying a lane ID, accepts δ, which is added to the current lane ID; i.e. read from a lane with a higher ID relative to the caller.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           ws::Cuint = Cuint(0x8)
           x[i + 0x1] = AMDGPU.Device.shfl_down(i, Cint(0x1), ws)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  7
source
AMDGPU.Device.shfl_down_sync Function
julia
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with higher ID relative to the caller.

source
AMDGPU.Device.shfl_xor Function
julia
shfl_xor(val, lane_mask::Cint, width::Cuint = wavefrontsize())

Same as shfl, but instead of specifying a lane ID, performs a bitwise XOR of the caller's lane ID with lane_mask.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           x[i + 0x1] = AMDGPU.Device.shfl_xor(i, Cint(0x1))
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  0  3  2  5  4  7  6
source
AMDGPU.Device.shfl_xor_sync Function
julia
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane according to a bitwise XOR of the caller's lane ID with the lane_mask.

source
AMDGPU.Device.any_sync Function
julia
any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return true if and only if predicate evaluates to true for any of them.

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.any_sync(mask, i == 0)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
source
AMDGPU.Device.all_sync Function
julia
all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return true if and only if predicate evaluates to true for all of them.

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.all_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
source