Kernel Programming

Launch Configuration

While an almost arbitrarily large number of workitems can be executed per kernel launch, the hardware can only support executing a limited number of wavefronts at one time.

To alleviate this, the compiler calculates the "occupancy" of each compiled kernel (the number of wavefronts that can execute simultaneously on the GPU) and passes this information to the hardware; the hardware then launches a limited number of wavefronts at once, based on the kernel's occupancy values.

The rest of the wavefronts are not launched until hardware resources become available, which means that a kernel with better occupancy will see more of its wavefronts executing simultaneously (which often leads to better performance). Suffice it to say, knowing the occupancy of your kernels is important for getting the best performance.

Like CUDA.jl, AMDGPU.jl has the ability to calculate kernel occupancy, with the launch_configuration function:

julia
kernel = @roc launch=false mykernel(args...)
occupancy = AMDGPU.launch_configuration(kernel)
@show occupancy.gridsize
@show occupancy.groupsize

Specifically, launch_configuration calculates the occupancy of mykernel(args...), and then calculates an optimal groupsize based on the occupancy. This value can then be used to select the groupsize for the kernel:

julia
@roc groupsize=occupancy.groupsize mykernel(args...)
AMDGPU.@roc Macro
julia
@roc [kwargs...] func(args...)

High-level interface for launching kernels on the GPU. Upon the first call the kernel will be compiled; subsequent calls will re-use the compiled object.

Several keyword arguments are supported:

  • launch::Bool = true: whether to launch the kernel. If false, then returns a compiled kernel which can be launched by calling it and passing arguments.

  • Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.

  • Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.

source
AMDGPU.Runtime.HIPKernel Type
julia
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch a compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

  • gridsize::ROCDim = 1: Size of the grid.

  • groupsize::ROCDim = 1: Size of the workgroup.

  • shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.

  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
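
Putting these together, a kernel compiled with launch=false can be launched later with explicit launch parameters (a sketch, reusing the mykernel and args placeholders from the examples above):

julia
kernel = @roc launch=false mykernel(args...)
# Launch on the current stream with an explicit grid/group size.
kernel(args...; gridsize=4, groupsize=256, stream=AMDGPU.stream())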

source
AMDGPU.Compiler.hipfunction Function
julia
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile Julia function f to a HIP kernel given a tuple tt of the argument types that it accepts.

The following kwargs are supported:

  • name::Union{String, Nothing} = nothing: A unique name to give a compiled kernel.

  • unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
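
Since @roc forwards compilation keywords to hipfunction, these can be set at the launch site. A sketch (assuming a kernel mykernel that uses floating-point atomics and needs strict CAS-loop semantics):

julia
@roc groupsize=256 unsafe_fp_atomics=false mykernel(args...)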

source

Atomics

AMDGPU.jl relies on Atomix.jl for atomics.

Example of a kernel that computes atomic max:

julia
using AMDGPU

function ker_atomic_max!(target, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    idx = indices[i]
    v = source[i]
    AMDGPU.@atomic max(target[idx], v)
    return
end

n, bins = 1024, 32
source = ROCArray(rand(UInt32, n))
indices = ROCArray(rand(1:bins, n))
target = ROCArray(zeros(UInt32, bins))
@roc groupsize=256 gridsize=4 ker_atomic_max!(target, source, indices)
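
To verify the result, the same reduction can be computed on the CPU and compared (a host-side sanity check added here for illustration; it is not part of the original example):

julia
expected = zeros(UInt32, bins)
for (i, v) in zip(Array(indices), Array(source))
    expected[i] = max(expected[i], v)
end
@assert Array(target) == expected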

Wave Matrix Multiply Accumulate (WMMA)

Performs the computation D = A ⋅ B + C. Currently only RDNA 3 is supported, and only the following types:

  • FP16 ⋅ FP16 + FP32 -> FP32;

  • BFP16 ⋅ BFP16 + FP32 -> FP32.

All WMMA functionality is in the AMDGPU.Device.WMMA submodule. The tile dimensions are fixed at 16×16×16 (WMMA.M, WMMA.N, WMMA.K).

Layout types

Two layout types control how matrices are read from and written to memory:

  • WMMA.ColMajor — column-major (Julia/Fortran) order: element (row, col) is at ptr[col * stride + row].

  • WMMA.RowMajor — row-major (C) order: element (row, col) is at ptr[row * stride + col].

API

AMDGPU.Device.WMMA.Fragment Type
julia
Fragment{M, N, T, L}

A fragment of a matrix for WMMA operations.

  • M, N: logical matrix dimensions this fragment represents a piece of

  • T: element type

  • L: number of elements stored per thread

For wave32 mode on RDNA 3:

  • A fragment (16xK): 16 elements per thread (8 VGPRs for FP16)

  • B fragment (Kx16): 16 elements per thread (8 VGPRs for FP16)

  • C/D fragment (16x16): 8 elements per thread (8 VGPRs for FP32, or 8 VGPRs holding 16 FP16)

source
AMDGPU.Device.WMMA.fill_c Function
julia
fill_c(::Type{Float32}, x::Float32)

Create and return a C fragment filled with the given value x.

source
AMDGPU.Device.WMMA.load_a Function
julia
load_a(ptr::LLVMPtr{T}, stride::Int32, layout) where T

Load matrix A (M×K) from memory and return the resulting fragment. stride is the leading dimension in number of elements.

  • ColMajor: column-major storage, ptr[col * stride + row]

  • RowMajor: row-major storage, ptr[row * stride + col]

source
AMDGPU.Device.WMMA.load_b Function
julia
load_b(ptr::LLVMPtr{T}, stride::Int32, layout) where T

Load matrix B (K×N) from memory and return the resulting fragment. stride is the leading dimension in number of elements.

  • ColMajor: column-major storage, ptr[col * stride + row]

  • RowMajor: row-major storage, ptr[row * stride + col]

source
AMDGPU.Device.WMMA.load_c Function
julia
load_c(ptr::LLVMPtr{T}, stride::Int32, layout) where T

Load matrix C (M×N) from memory and return a FragmentC_F32. stride is the leading dimension in number of elements. T may be Float32, Float16, or BFloat16; non-Float32 values are widened to Float32 on load.

  • ColMajor: column-major storage, ptr[col * stride + row]

  • RowMajor: row-major storage, ptr[row * stride + col]

source
AMDGPU.Device.WMMA.store_d Function
julia
store_d(ptr::LLVMPtr{T}, frag::FragmentC_F32, stride::Int32, layout) where T

Store the result matrix D to the memory location given by ptr. T may be Float32, Float16, or BFloat16; fragment values are narrowed from Float32 on store.

Arguments

  • ptr: Address to store the matrix to.

  • frag: Corresponding fragment.

  • stride: Leading dimension of the matrix for ptr in number of elements.

  • layout: ColMajor (default) or RowMajor.

source
AMDGPU.Device.WMMA.mma Function
julia
mma(
    a::FragmentA{T}, b::FragmentB{T}, c::FragmentC_F32,
) where T <: Union{Float16, BFloat16}

Perform matrix multiply-accumulate operation D = A ⋅ B + C with loaded fragments. A and B can be either in Float16 or in BFloat16.

source

load_c and store_d accept pointer types Float32, Float16, and BFloat16. When T is Float16 or BFloat16, values are widened to Float32 on load and narrowed back on store, so the FragmentC_F32 accumulator type is always Float32 regardless of the backing buffer type.
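
For example, the accumulator can stay in Float32 while the C and D buffers are half precision (a hedged sketch; a_frag, b_frag, c_ptr, d_ptr, and stride are assumed to be set up as in the full example below, with c_ptr and d_ptr pointing to Float16 storage):

julia
c_frag = WMMA.load_c(c_ptr, stride, WMMA.ColMajor)  # Float16 widened to Float32 on load
c_frag = WMMA.mma(a_frag, b_frag, c_frag)           # accumulate in Float32
WMMA.store_d(d_ptr, c_frag, stride, WMMA.ColMajor)  # narrowed back to Float16 on store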

Example

Below is a matrix multiplication kernel using WMMA with column-major inputs. Pass WMMA.RowMajor instead to load from row-major (C-style) buffers.

julia
using AMDGPU
using AMDGPU.Device: WMMA

function wmma_kernel!(C, A::AbstractArray{T}, B, M::Int32, N::Int32, K::Int32, layout) where T
    tile_row = (workgroupIdx().x - Int32(1)) * Int32(WMMA.M)
    tile_col = (workgroupIdx().y - Int32(1)) * Int32(WMMA.N)

    C_ptr = pointer(C)
    A_ptr = pointer(A)
    B_ptr = pointer(B)

    c_frag = WMMA.fill_c(Float32, 0f0)
    k = Int32(0)
    while k < K
        a_ptr, a_stride = _a_tile(A_ptr, layout, tile_row, k, M, K, T)
        b_ptr, b_stride = _b_tile(B_ptr, layout, tile_col, k, N, K, T)

        a_frag = WMMA.load_a(a_ptr, a_stride, layout)
        b_frag = WMMA.load_b(b_ptr, b_stride, layout)
        c_frag = WMMA.mma(a_frag, b_frag, c_frag)

        k += Int32(WMMA.K)
    end

    c_ptr = C_ptr + (tile_col * M + tile_row) * Int32(sizeof(Float32))
    WMMA.store_d(c_ptr, c_frag, M, WMMA.ColMajor)
    return
end

# Tile pointer + stride helpers — dispatched on layout, DCE'd by the compiler.
_a_tile(ptr, ::Type{WMMA.ColMajor}, tile_row, k, M, K, ::Type{T}) where T =
    ptr + (k * M + tile_row) * Int32(sizeof(T)), M
_a_tile(ptr, ::Type{WMMA.RowMajor}, tile_row, k, M, K, ::Type{T}) where T =
    ptr + (tile_row * K + k) * Int32(sizeof(T)), K

_b_tile(ptr, ::Type{WMMA.ColMajor}, tile_col, k, N, K, ::Type{T}) where T =
    ptr + (tile_col * K + k) * Int32(sizeof(T)), K
_b_tile(ptr, ::Type{WMMA.RowMajor}, tile_col, k, N, K, ::Type{T}) where T =
    ptr + (k * N + tile_col) * Int32(sizeof(T)), N

M, N, K = 32, 32, 32
A_host = Float16.(rand(M, K))
B_host = Float16.(rand(K, N))
A, B = ROCArray(A_host), ROCArray(B_host)
C = ROCArray(zeros(Float32, M, N))

tiles_m, tiles_n = M ÷ WMMA.M, N ÷ WMMA.N
@roc gridsize=(tiles_m, tiles_n) groupsize=32 wmma_kernel!(
    C, A, B, Int32(M), Int32(N), Int32(K), WMMA.ColMajor)

@assert maximum(abs.(Float32.(C) .- (Float32.(A) * Float32.(B)))) < 0.1

Device Intrinsics

Wavefront-Level Primitives

AMDGPU.Device.wavefrontsize Function
julia
wavefrontsize()::Cuint

Get the wavefront size of the device that executes the current kernel.

source
AMDGPU.Device.activelane Function
julia
activelane()::Cuint

Get the ID of the current lane within a wavefront (warp).

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = i
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Array(x)
1×8 Matrix{Int32}:
 0  1  2  3  4  5  6  7
source
AMDGPU.Device.ballot Function
julia
ballot(predicate::Bool)::UInt64

Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.

julia
julia> function ker!(x)
           x[1] = AMDGPU.Device.ballot(true)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Culong}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 0x00000000ffffffff
source
AMDGPU.Device.ballot_sync Function
julia
ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.ballot_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt64}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
source
AMDGPU.Device.activemask Function
julia
activemask()::UInt64

Get the mask of all active lanes in a warp.
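
A minimal sketch (an illustrative kernel, not from the original docstring; with 8 launched workitems, expect the low 8 bits of the mask to be set):

julia
function ker!(x)
    x[1] = AMDGPU.Device.activemask()
    return
end

x = ROCArray{UInt64}(undef, 1)
@roc groupsize=8 ker!(x)
bitstring(Array(x)[1]) # low 8 bits set when 8 lanes are active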

source
AMDGPU.Device.bpermute Function
julia
bpermute(addr::Integer, val::Cint)::Cint

Read data stored in val from the lane VGPR (vector general purpose register) given by addr.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.

Example below shifts all values in the wavefront by 1 to the "left".

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           # `addr` points to the next immediate lane.
           addr::Cint = ((i + 0x1) % 0x8) * 0x4 # VGPRs are 4 bytes wide
           # Read data from the next immediate lane.
           x[i + 0x1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0
source
AMDGPU.Device.permute Function
julia
permute(addr::Integer, val::Cint)::Cint

Put data stored in val to the lane VGPR (vector general purpose register) given by addr.

Example below shifts all values in the wavefront by 1 to the "right".

julia
julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
           # Put data into the next immediate lane.
           x[i + 1] = AMDGPU.Device.permute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 7  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl Function
julia
shfl(val, lane::Cint, width::Cuint = wavefrontsize())

Read data stored in val from another lane (a higher-level op than bpermute).

If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           x[i + 0x1] = AMDGPU.Device.shfl(i, i + 0x1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0

If width is less than wavefront size then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           ws::Cuint = 0x4 # <-- Notice width = 4.
           x[i + 0x1] = AMDGPU.Device.shfl(i, i + 0x1, ws)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  0  5  6  7  4
source
AMDGPU.Device.shfl_sync Function
julia
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane ID.
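
A hedged sketch mirroring the shfl example above, but with an explicit full participation mask (an illustrative kernel, not from the original docstring):

julia
function ker!(x)
    i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
    mask = typemax(UInt64) # all lanes participate
    x[i + 0x1] = AMDGPU.Device.shfl_sync(mask, i, i + 0x1)
    return
end

x = ROCArray{Cint}(undef, 1, 8)
@roc groupsize=8 ker!(x)
# Expected to match the shfl example: each lane reads from the next lane.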

source
AMDGPU.Device.shfl_up Function
julia
shfl_up(val, δ::Cint, width::Cuint = wavefrontsize())

Same as shfl, but instead of specifying a lane ID, accepts δ, which is subtracted from the current lane ID; i.e. read from a lane with a lower ID relative to the caller.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           x[i + 0x1] = AMDGPU.Device.shfl_up(i, Cint(0x1))
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 0  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl_up_sync Function
julia
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with lower ID relative to the caller.

source
AMDGPU.Device.shfl_down Function
julia
shfl_down(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying a lane ID, accepts δ, which is added to the current lane ID; i.e. read from a lane with a higher ID relative to the caller.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           ws::Cuint = Cuint(0x8)
           x[i + 0x1] = AMDGPU.Device.shfl_down(i, Cint(0x1), ws)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  7
source
AMDGPU.Device.shfl_down_sync Function
julia
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with higher ID relative to the caller.

source
AMDGPU.Device.shfl_xor Function
julia
shfl_xor(val, lane_mask::Cint, width::Cuint = wavefrontsize())

Same as shfl, but instead of specifying a lane ID, performs a bitwise XOR of the caller's lane ID with lane_mask.

julia
julia> function ker!(x)
           i::Cint = unsafe_trunc(Cint, AMDGPU.Device.activelane())
           x[i + 0x1] = AMDGPU.Device.shfl_xor(i, Cint(0x1))
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  0  3  2  5  4  7  6
source
AMDGPU.Device.shfl_xor_sync Function
julia
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane according to a bitwise XOR of the caller's lane ID with the lane_mask.

source
AMDGPU.Device.any_sync Function
julia
any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return true if and only if predicate evaluates to true for any of them.

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.any_sync(mask, i == 0)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
source
AMDGPU.Device.all_sync Function
julia
all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return true if and only if predicate evaluates to true for all of them.

julia
julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.all_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
source