Kernel Programming

Launch Configuration

While an almost arbitrarily large number of workitems can be executed per kernel launch, the hardware can only support executing a limited number of wavefronts at one time.

To alleviate this, the compiler calculates the "occupancy" of each compiled kernel (which is the number of wavefronts that can be simultaneously executing on the GPU), and passes this information to the hardware; the hardware then launches a limited number of wavefronts at once, based on the kernel's "occupancy" values.

The rest of the wavefronts are not launched until hardware resources become available, which means that a kernel with better occupancy will see more of its wavefronts executing simultaneously (which often leads to better performance). Suffice to say, it's important to know the occupancy of kernels if you want the best performance.

Like CUDA.jl, AMDGPU.jl has the ability to calculate kernel occupancy, with the launch_configuration function:

kernel = @roc launch=false mykernel(args...)
occupancy = AMDGPU.launch_configuration(kernel)
@show occupancy.gridsize
@show occupancy.groupsize

Specifically, launch_configuration calculates the occupancy of mykernel(args...), and then calculates an optimal groupsize based on the occupancy. This value can then be used to select the groupsize for the kernel:

@roc groupsize=occupancy.groupsize mykernel(args...)
AMDGPU.@rocMacro
@roc [kwargs...] func(args...)

High-level interface for launching kernels on GPU. Upon a first call it will be compiled, subsequent calls will re-use the compiled object.

Several keyword arguments are supported:

  • launch::Bool = true: whether to launch the kernel. If false, then returns a compiled kernel which can be launched by calling it and passing arguments.
  • Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.
  • Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.
source
AMDGPU.Runtime.HIPKernelType
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

  • gridsize::ROCDim = 1: Size of the grid.
  • groupsize::ROCDim = 1: Size of the workgroup.
  • shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.
  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
source
AMDGPU.Compiler.hipfunctionFunction
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile Julia function f to a HIP kernel given a tuple of argument's types tt that it accepts.

The following kwargs are supported:

  • name::Union{String, Nothing} = nothing: A unique name to give a compiled kernel.
  • unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
source

Atomics

AMDGPU.jl relies on Atomix.jl for atomics.

Example of a kernel that computes atomic max:

using AMDGPU

function ker_atomic_max!(target, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    idx = indices[i]
    v = source[i]
    AMDGPU.@atomic max(target[idx], v)
    return
end

n, bins = 1024, 32
source = ROCArray(rand(UInt32, n))
indices = ROCArray(rand(1:bins, n))
target = ROCArray(zeros(UInt32, bins))
@roc groupsize=256 gridsize=4 ker_atomic_max!(target, source, indices)

Device Intrinsics

Wavefront-Level Primitives

AMDGPU.Device.activelaneFunction
activelane()::Cuint

Get id of the current lane within a wavefront/warp.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = i
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Array(x)
1×8 Matrix{Int32}:
 0  1  2  3  4  5  6  7
source
AMDGPU.Device.ballotFunction
ballot(predicate::Bool)::UInt64

Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.

julia> function ker!(x)
           x[1] = AMDGPU.Device.ballot(true)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Culong}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 0x00000000ffffffff
source
AMDGPU.Device.ballot_syncFunction
ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.ballot_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt64}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
source
AMDGPU.Device.bpermuteFunction
bpermute(addr::Integer, val::Cint)::Cint

Read data stored in val from the lane VGPR (vector general purpose register) given by addr.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.

Example below shifts all values in the wavefront by 1 to the "left".

julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
           # Read data from the next immediate lane.
           x[i + 1] = AMDGPU.Device.bpermute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0
source
AMDGPU.Device.permuteFunction
permute(addr::Integer, val::Cint)::Cint

Put data stored in val to the lane VGPR (vector general purpose register) given by addr.

Example below shifts all values in the wavefront by 1 to the "right".

julia> function ker!(x)
           i::Cint = AMDGPU.Device.activelane()
           # `addr` points to the next immediate lane.
           addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
           # Put data into the next immediate lane.
           x[i + 1] = AMDGPU.Device.permute(addr, i)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Cint}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 7  0  1  2  3  4  5  6
source
AMDGPU.Device.shflFunction
shfl(val, lane, width = wavefrontsize())

Read data stored in val from a lane (this is a more high-level op than bpermute).

If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).

julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  0

If width is less than wavefront size then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.

julia> function ker!(x)
           i::UInt32 = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl(i, i + 1, 4) # <-- Notice width = 4.
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{UInt32}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  0  5  6  7  4
source
AMDGPU.Device.shfl_syncFunction
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane ID.

source
AMDGPU.Device.shfl_upFunction
shfl_up(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is subtracted from the current lane ID. I.e. read from a lane with lower ID relative to the caller.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_up(i, 1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 0  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl_up_syncFunction
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with lower ID relative to the caller.

source
AMDGPU.Device.shfl_downFunction
shfl_down(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is added to the current lane ID. I.e. read from a lane with higher ID relative to the caller.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_down(i, 1, 8)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  2  3  4  5  6  7  7
source
AMDGPU.Device.shfl_down_syncFunction
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with higher ID relative to the caller.

source
AMDGPU.Device.shfl_xorFunction
shfl_xor(val, lane_mask, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, performs bitwise XOR of the caller's lane ID with the lane_mask.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           x[i + 1] = AMDGPU.Device.shfl_xor(i, 1)
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1, 8);

julia> @roc groupsize=8 ker!(x);

julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
 1  0  3  2  5  4  7  6
source
AMDGPU.Device.shfl_xor_syncFunction
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane according to a bitwise XOR of the caller's lane ID with the lane_mask.

source
AMDGPU.Device.any_syncFunction
any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.any_sync(mask, i == 0)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
source
AMDGPU.Device.all_syncFunction
all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

julia> function ker!(x)
           i = AMDGPU.Device.activelane()
           if i % 2 == 0
               mask = 0x0000000055555555 # Only even threads.
               x[1] = AMDGPU.Device.all_sync(mask, true)
           end
           return
       end
ker! (generic function with 1 method)

julia> x = ROCArray{Bool}(undef, 1);

julia> @roc groupsize=32 ker!(x);

julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 1
source