Kernel Programming
Launch Configuration
While an almost arbitrarily large number of workitems can be enqueued per kernel launch, the hardware can only execute a limited number of wavefronts at a time.
To handle this, the compiler calculates the "occupancy" of each compiled kernel (the number of wavefronts that can execute on the GPU simultaneously) and passes this information to the hardware; the hardware then launches at most that many wavefronts at once.
The remaining wavefronts are not launched until hardware resources become available, so a kernel with better occupancy has more of its wavefronts executing simultaneously, which often leads to better performance. Suffice it to say, knowing the occupancy of your kernels is important if you want the best performance.
Like CUDA.jl, AMDGPU.jl can calculate kernel occupancy with the launch_configuration function:
kernel = @roc launch=false mykernel(args...)
occupancy = AMDGPU.launch_configuration(kernel)
@show occupancy.gridsize
@show occupancy.groupsize
Specifically, launch_configuration calculates the occupancy of mykernel(args...), and then calculates an optimal groupsize based on that occupancy. This value can then be used to select the groupsize for the kernel:
@roc groupsize=occupancy.groupsize mykernel(args...)
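Putting it together, a common pattern (a sketch; n and mykernel stand in for your problem size and kernel) is to take the suggested groupsize and derive the gridsize from the number of workitems:

kernel = @roc launch=false mykernel(args...)
occupancy = AMDGPU.launch_configuration(kernel)
groupsize = min(n, occupancy.groupsize)  # don't request larger groups than the problem size
gridsize = cld(n, groupsize)             # enough workgroups to cover all n workitems
kernel(args...; groupsize, gridsize)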
AMDGPU.@roc — Macro
@roc [kwargs...] func(args...)
High-level interface for launching kernels on the GPU. Upon the first call the kernel is compiled; subsequent calls reuse the compiled object.
Several keyword arguments are supported:
- launch::Bool = true: whether to launch the kernel. If false, returns a compiled kernel which can be launched by calling it and passing arguments.
- Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.
- Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.
AMDGPU.Runtime.HIPKernel — Type
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)
Launch a compiled HIPKernel by passing arguments to it.
The following kwargs are supported:
- gridsize::ROCDim = 1: Size of the grid.
- groupsize::ROCDim = 1: Size of the workgroup.
- shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.
- stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
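For example, a kernel compiled with @roc launch=false can be launched with explicit parameters (a sketch; the values are illustrative):

kernel = @roc launch=false mykernel(args...)
kernel(args...; gridsize=4, groupsize=256, stream=AMDGPU.stream())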
AMDGPU.Compiler.hipfunction — Function
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)
Compile Julia function f to a HIP kernel given a tuple of argument types tt that it accepts.
The following kwargs are supported:
- name::Union{String, Nothing} = nothing: A unique name to give the compiled kernel.
- unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. For single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
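A minimal usage sketch (the rocconvert step for obtaining device-side argument types is an assumption, not part of this docstring):

function fill_ones!(x)
    x[workitemIdx().x] = 1f0
    return
end

x = ROCArray{Float32}(undef, 64)
tt = Tuple{typeof(AMDGPU.rocconvert(x))}  # device-side type of x (assumed pattern)
kern = AMDGPU.Compiler.hipfunction(fill_ones!, tt)
kern(AMDGPU.rocconvert(x); groupsize=64)  # launched via the HIPKernel call interface above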
Atomics
AMDGPU.jl relies on Atomix.jl for atomics.
An example of a kernel that computes an atomic max:
using AMDGPU
function ker_atomic_max!(target, source, indices)
    # 1-based global workitem index.
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    idx = indices[i]
    v = source[i]
    # Atomically replace target[idx] with the maximum of its current value and v.
    AMDGPU.@atomic max(target[idx], v)
    return
end
n, bins = 1024, 32
source = ROCArray(rand(UInt32, n))
indices = ROCArray(rand(1:bins, n))
target = ROCArray(zeros(UInt32, bins))
# 4 workgroups × 256 workitems each covers all 1024 elements.
@roc groupsize=256 gridsize=4 ker_atomic_max!(target, source, indices)
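To sanity-check the result, the same per-bin maximum can be computed on the CPU (a host-side sketch, not part of the original example):

host_target = zeros(UInt32, bins)
for (idx, v) in zip(Array(indices), Array(source))
    host_target[idx] = max(host_target[idx], v)
end
@assert host_target == Array(target)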
Device Intrinsics
Wavefront-Level Primitives
AMDGPU.Device.wavefrontsize — Function
wavefrontsize()::Cuint
Get the wavefront size of the device that executes the current kernel.
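A minimal sketch of querying it from inside a kernel (names are illustrative; the value is typically 64 on GCN/CDNA devices and 32 on RDNA):

function ker_wavefrontsize!(x)
    x[1] = AMDGPU.Device.wavefrontsize()
    return
end

x = ROCArray{Cuint}(undef, 1)
@roc ker_wavefrontsize!(x)
Array(x)  # e.g. Cuint[64]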
AMDGPU.Device.activelane — Function
activelane()::Cuint
Get the ID of the current lane within a wavefront/warp.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
x[i + 1] = i
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Array(x)
1×8 Matrix{Int32}:
0 1 2 3 4 5 6 7
AMDGPU.Device.ballot — Function
ballot(predicate::Bool)::UInt64
Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.
julia> function ker!(x)
x[1] = AMDGPU.Device.ballot(true)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Culong}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
0x00000000ffffffff
AMDGPU.Device.ballot_sync — Function
ballot_sync(mask::UInt64, predicate::Bool)::UInt64
Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
if i % 2 == 0
mask = 0x0000000055555555 # Only even threads.
x[1] = AMDGPU.Device.ballot_sync(mask, true)
end
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt64}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> bitstring(Array(x)[1])
"0000000000000000000000000000000001010101010101010101010101010101"
AMDGPU.Device.activemask — Function
activemask()::UInt64
Get the mask of all active lanes in a warp.
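A sketch (assuming all 8 launched lanes are active when the call executes, the low 8 bits of the mask should be set):

function ker_activemask!(x)
    x[1] = AMDGPU.Device.activemask()
    return
end

x = ROCArray{UInt64}(undef, 1)
@roc groupsize=8 ker_activemask!(x)
Array(x)  # expected: 0x00000000000000ff, i.e. the low 8 bits set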
AMDGPU.Device.bpermute — Function
bpermute(addr::Integer, val::Cint)::Cint
Read data stored in val from the lane VGPR (vector general purpose register) given by addr.
The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.
The example below shifts all values in the wavefront by 1 to the "left".
julia> function ker!(x)
i::Cint = AMDGPU.Device.activelane()
# `addr` points to the next immediate lane.
addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
# Read data from the next immediate lane.
x[i + 1] = AMDGPU.Device.bpermute(addr, i)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 0
AMDGPU.Device.permute — Function
permute(addr::Integer, val::Cint)::Cint
Put data stored in val into the lane VGPR (vector general purpose register) given by addr.
The example below shifts all values in the wavefront by 1 to the "right".
julia> function ker!(x)
i::Cint = AMDGPU.Device.activelane()
# `addr` points to the next immediate lane.
addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
# Put data into the next immediate lane.
x[i + 1] = AMDGPU.Device.permute(addr, i)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Cint}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
7 0 1 2 3 4 5 6
AMDGPU.Device.shfl — Function
shfl(val, lane, width = wavefrontsize())
Read data stored in val from a lane (this is a higher-level op than bpermute).
If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).
julia> function ker!(x)
i::UInt32 = AMDGPU.Device.activelane()
x[i + 1] = AMDGPU.Device.shfl(i, i + 1)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt32}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 0
If width is less than the wavefront size, then each subsection of the wavefront behaves as a separate entity with a starting logical lane ID of 0.
julia> function ker!(x)
i::UInt32 = AMDGPU.Device.activelane()
x[i + 1] = AMDGPU.Device.shfl(i, i + 1, 4) # <-- Notice width = 4.
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{UInt32}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> Int.(x)
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 0 5 6 7 4
AMDGPU.Device.shfl_sync — Function
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())
Synchronize threads according to a mask and read data stored in val from a lane ID.
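A sketch mirroring the shfl example above, with a mask covering all 8 launched lanes (the mask value is an assumption for this launch size):

function ker_shfl_sync!(x)
    i::UInt32 = AMDGPU.Device.activelane()
    mask = 0x00000000000000ff  # lanes 0 through 7
    x[i + 1] = AMDGPU.Device.shfl_sync(mask, i, i + 1)
    return
end

x = ROCArray{UInt32}(undef, 1, 8)
@roc groupsize=8 ker_shfl_sync!(x)
Int.(x)  # expected, as with shfl: 1 2 3 4 5 6 7 0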
AMDGPU.Device.shfl_up — Function
shfl_up(val, δ, width = wavefrontsize())
Same as shfl, but instead of a lane ID, accepts δ, which is subtracted from the current lane ID; i.e. reads from a lane with a lower ID relative to the caller.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
x[i + 1] = AMDGPU.Device.shfl_up(i, 1)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0 0 1 2 3 4 5 6
AMDGPU.Device.shfl_up_sync — Function
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())
Synchronize threads according to a mask and read data stored in val from a lane with a lower ID relative to the caller.
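A sketch, mirroring the shfl_up example above with a full 8-lane mask (an assumption for this launch size):

function ker_shfl_up_sync!(x)
    i = AMDGPU.Device.activelane()
    x[i + 1] = AMDGPU.Device.shfl_up_sync(0x00000000000000ff, i, 1)
    return
end

x = ROCArray{Int}(undef, 1, 8)
@roc groupsize=8 ker_shfl_up_sync!(x)
Array(x)  # expected, as with shfl_up: 0 0 1 2 3 4 5 6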
AMDGPU.Device.shfl_down — Function
shfl_down(val, δ, width = wavefrontsize())
Same as shfl, but instead of a lane ID, accepts δ, which is added to the current lane ID; i.e. reads from a lane with a higher ID relative to the caller.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
x[i + 1] = AMDGPU.Device.shfl_down(i, 1, 8)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 2 3 4 5 6 7 7
AMDGPU.Device.shfl_down_sync — Function
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())
Synchronize threads according to a mask and read data stored in val from a lane with a higher ID relative to the caller.
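A sketch, mirroring the shfl_down example above with a full 8-lane mask (an assumption for this launch size):

function ker_shfl_down_sync!(x)
    i = AMDGPU.Device.activelane()
    x[i + 1] = AMDGPU.Device.shfl_down_sync(0x00000000000000ff, i, 1, 8)
    return
end

x = ROCArray{Int}(undef, 1, 8)
@roc groupsize=8 ker_shfl_down_sync!(x)
Array(x)  # expected, as with shfl_down: 1 2 3 4 5 6 7 7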
AMDGPU.Device.shfl_xor — Function
shfl_xor(val, lane_mask, width = wavefrontsize())
Same as shfl, but instead of a lane ID, performs a bitwise XOR of the caller's lane ID with the lane_mask to determine the source lane.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
x[i + 1] = AMDGPU.Device.shfl_xor(i, 1)
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Int}(undef, 1, 8);
julia> @roc groupsize=8 ker!(x);
julia> x
1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
1 0 3 2 5 4 7 6
AMDGPU.Device.shfl_xor_sync — Function
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())
Synchronize threads according to a mask and read data stored in val from a lane determined by a bitwise XOR of the caller's lane ID with the lane_mask.
AMDGPU.Device.any_sync — Function
any_sync(mask::UInt64, predicate::Bool)::Bool
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
if i % 2 == 0
mask = 0x0000000055555555 # Only even threads.
x[1] = AMDGPU.Device.any_sync(mask, i == 0)
end
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Bool}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
1
AMDGPU.Device.all_sync — Function
all_sync(mask::UInt64, predicate::Bool)::Bool
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
julia> function ker!(x)
i = AMDGPU.Device.activelane()
if i % 2 == 0
mask = 0x0000000055555555 # Only even threads.
x[1] = AMDGPU.Device.all_sync(mask, true)
end
return
end
ker! (generic function with 1 method)
julia> x = ROCArray{Bool}(undef, 1);
julia> @roc groupsize=32 ker!(x);
julia> x
1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
1