Fast Prefix Sum Implementation Using Subgroups in GLSL Compute Shaders

蓬森莉 · 2025-6-2 22:54:53

This post uses the subgroup features of Vulkan 1.1 to speed up prefix-sum computation in a compute shader. References:
Vulkan Subgroup Tutorial - Khronos Blog - The Khronos Group Inc
Single-pass Parallel Prefix Scan with Decoupled Look-back | Research

Background

Compute model

flowchart TD
    subgraph Subgroup["Subgroup"]
        Inv0["invocation 0"]
        Inv1["invocation 1"]
        InvDots["..."]
        Inv31["invocation 31"]
    end
    subgraph Workgroup["Workgroup"]
        SG0["Subgroup 0"]
        SG1["Subgroup 1"]
        SGDots["..."]
        SGM["Subgroup m"]
    end
    subgraph Dispatch["Dispatch"]
        WG0["Workgroup 0"]
        WG1["Workgroup 1"]
        WGDots["..."]
        WGN["Workgroup n"]
    end
    %% horizontal layout
    WG0 --- WG1 --- WGDots --- WGN
    SG0 --- SG1 --- SGDots --- SGM
    Inv0 --- Inv1 --- InvDots --- Inv31

shared memory

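shared variables are visible to every invocation of a single work group. This post uses them to record the prefix-sum results of the individual subgroups.
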
subgroup

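On a GPU, threads usually execute in small groups (typically 32 or 64 threads). This post uses subgroupInclusiveAdd to compute the prefix sum within a single subgroup; see https://www.khronos.org/blog/vulkan-subgroup-tutorial for details.
Suppose there are 8 blocks with the following active states: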

id : 0 1 2 3 4 5 6 7
val: 0 1 0 1 1 0 0 1
//subgroupInclusiveAdd
val: 0 1 1 2 3 3 3 4
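
To make the example above concrete, here is a minimal sketch of a compute shader that applies subgroupInclusiveAdd to a buffer of 0/1 flags. The buffer names (flags, scan) and the bindings are assumptions for illustration and do not come from the original post. Each subgroup scans independently, so with a 32-wide subgroup the eight example values all fall into the first subgroup.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 32) in;

// Hypothetical buffers: one 0/1 "active" flag per block in, one scan value out.
layout(std430, set = 0, binding = 0) readonly buffer Flags { uint flags[]; };
layout(std430, set = 0, binding = 1) writeonly buffer Scan { uint scan[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Out-of-range invocations contribute 0 but still take part in the scan,
    // so every invocation of the subgroup executes the subgroup operation.
    uint v = (i < uint(flags.length())) ? flags[i] : 0u;

    // Sum of v over this invocation and all lower-numbered invocations
    // of the same subgroup.
    uint inclusive = subgroupInclusiveAdd(v);

    // The exclusive variant would be inclusive - v,
    // which is what subgroupExclusiveAdd(v) returns.
    if (i < uint(scan.length())) {
        scan[i] = inclusive;
    }
}
```

The result is only a prefix sum within each subgroup; combining subgroups is what the rest of the post is about.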


Overview of the flow

Goal: compute the prefix sum of data with size = n.

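Split the problem into work_group_nums = (n + 1023) / 1024 work_group prefix sums with local_size = 1024. Each work_group has 1024 invocations, which are in turn split into 32 sub_group prefix sums (sub_group_size = 32 on NVIDIA).
subgroupInclusiveAdd computes the prefix sum inside each of the 32 sub_groups. The last result of each sub_group (subgroup-local id = 31, i.e. gl_SubgroupInvocationID == gl_SubgroupSize - 1) is stored in shared uint sg_offset[32]; (shared variables are shared within the current work_group).
subgroupInclusiveAdd then computes the prefix sum of sg_offset and writes it back into sg_offset in place. The last entry, sg_offset[gl_SubgroupSize - 1] (index 31 here, because there are 32 sub_groups of 32 invocations), is the total of the current work_group, and it is stored in ss_wg_offset_[gl_WorkGroupID.x].
The final pass computes one more prefix sum over ss_wg_offset_. The elements are now per-work_group totals rather than invocations inside one work_group, so subgroupInclusiveAdd cannot be applied across them. Instead, the values are accumulated in a manual loop and written with atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], final_res);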
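
The final pass described above can be sketched in a simpler form than the original's. The version below is an assumption-laden alternative, not the author's method: it runs as a separate dispatch with a single invocation, after the main pass has finished writing ss_wg_offset_ (the host would need a barrier between the two dispatches). The binding and the push-constant block Params are hypothetical. It turns the per-work_group totals into exclusive offsets in place.

```glsl
#version 450

layout(local_size_x = 1) in;

// ss_wg_offset_ follows the name used in the text; the binding and the
// explicit element count are assumptions for this sketch.
layout(std430, set = 0, binding = 0) buffer WorkgroupTotals {
    uint ss_wg_offset_[];
};
layout(push_constant) uniform Params {
    uint work_group_nums; // (n + 1023) / 1024, computed on the host
} pc;

void main() {
    // Serial exclusive scan: entry w becomes the sum of the totals of
    // work_groups 0 .. w-1, i.e. the base offset of work_group w.
    uint running = 0u;
    for (uint w = 0u; w < pc.work_group_nums; ++w) {
        uint total = ss_wg_offset_[w];
        ss_wg_offset_[w] = running;
        running += total;
    }
}
```

A serial loop is only reasonable while work_group_nums stays small. The decoupled look-back paper listed in the references describes how to fold this step into a single pass.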

Implementation details

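layout(local_size_x = 1024, local_size_y = 1, local_size_z = 1) in;
// Shared memory that holds per-subgroup results across subgroups
// (32 slots: 1024 invocations / subgroup size 32)
shared uint sg_offset[32];

void main() {
    // sub_group_id
    uint sg_id = gl_LocalInvocationIndex / gl_SubgroupSize;
    // Whether the previous block has active voxels. Invocation 0 has no
    // previous block, so guard against the unsigned underflow of x - 1.
    uint prev_inv_actives = (gl_GlobalInvocationID.x > 0u &&
        invocationActives(gl_GlobalInvocationID.x - 1u) > 0) ? 1u : 0u;
    // Prefix sum within the sub_group
    uint wg_offset = subgroupInclusiveAdd(prev_inv_actives);
    // sg_offset stores the final prefix sum of each of the 32 sub_groups
    if (gl_SubgroupInvocationID == gl_SubgroupSize - 1u) {
        sg_offset[sg_id] = wg_offset;
    }
    barrier();
    if (sg_id == 0u) {
        // Prefix-sum sg_offset once, updating it in place
        // (relies on gl_SubgroupSize == 32 so the index stays in bounds)
        sg_offset[gl_SubgroupInvocationID] =
            subgroupInclusiveAdd(sg_offset[gl_SubgroupInvocationID]);
        // Store the result in ss_wg_offset_; the encode step is omitted
        atomicExchange(ss_wg_offset_[gl_WorkGroupID.x], your_value_encode);
    }
    barrier();
    // Simple final pass, omitted
    barrier();
}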
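
The excerpt above leaves out how each invocation combines the two scan levels. The following is a self-contained sketch of the work_group-level part only, under stated assumptions: the buffers in_vals and local_prefix and all bindings are hypothetical, the total is written with a plain store instead of the encoded atomicExchange, subgroups are full and map to consecutive invocation ranges, and gl_SubgroupSize is at least 32 so that one subgroup can scan all per-subgroup totals.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 1024, local_size_y = 1, local_size_z = 1) in;

// Hypothetical bindings; only sg_offset and ss_wg_offset_ are names from the post.
layout(std430, set = 0, binding = 0) readonly buffer Input { uint in_vals[]; };
layout(std430, set = 0, binding = 1) writeonly buffer Local { uint local_prefix[]; };
layout(std430, set = 0, binding = 2) writeonly buffer Totals { uint ss_wg_offset_[]; };

// One slot per subgroup (at most 32 when gl_SubgroupSize >= 32).
shared uint sg_offset[32];

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint v = (gid < uint(in_vals.length())) ? in_vals[gid] : 0u;

    // Level 1: inclusive scan inside each subgroup.
    uint sg_prefix = subgroupInclusiveAdd(v);

    // The last invocation of each subgroup holds that subgroup's total.
    if (gl_SubgroupInvocationID == gl_SubgroupSize - 1u) {
        sg_offset[gl_SubgroupID] = sg_prefix;
    }
    barrier();

    // Level 2: subgroup 0 scans the per-subgroup totals in place.
    if (gl_SubgroupID == 0u) {
        bool valid = gl_SubgroupInvocationID < gl_NumSubgroups;
        uint t = valid ? sg_offset[gl_SubgroupInvocationID] : 0u;
        uint scanned = subgroupInclusiveAdd(t);
        if (valid) {
            sg_offset[gl_SubgroupInvocationID] = scanned;
        }
    }
    barrier();

    // Combine: add the totals of all earlier subgroups of this work_group.
    uint base = (gl_SubgroupID > 0u) ? sg_offset[gl_SubgroupID - 1u] : 0u;
    if (gid < uint(local_prefix.length())) {
        local_prefix[gid] = base + sg_prefix;
    }

    // One invocation publishes the work_group total for the final pass.
    if (gl_LocalInvocationIndex == 0u) {
        ss_wg_offset_[gl_WorkGroupID.x] = sg_offset[gl_NumSubgroups - 1u];
    }
}
```

Using gl_NumSubgroups instead of a hard-coded 32 keeps the sketch valid on 64-wide subgroups, where a 1024-invocation work_group has only 16 subgroups. The value written to local_prefix is an inclusive prefix within the work_group; adding the work_group's base offset from the final pass gives the global result.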


Tip: GLSL does not provide an atomicRead, but one can be emulated with atomicCompSwap(target, 0, 0), which returns the current value without changing it.
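
A minimal sketch of that tip, assuming a hypothetical buffer variable flag and an output variable observed (neither appears in the original post):

```glsl
#version 450

layout(local_size_x = 1) in;

// Hypothetical buffers for illustration only.
layout(std430, set = 0, binding = 0) buffer Flag { uint flag; };
layout(std430, set = 0, binding = 1) writeonly buffer Result { uint observed; };

void main() {
    // atomicCompSwap(mem, compare, data) writes data only when mem == compare
    // and always returns the value that was in mem before the call.
    // With compare == data == 0 the stored value never changes:
    //   flag == 0 -> 0 is replaced by 0
    //   flag != 0 -> nothing is written
    observed = atomicCompSwap(flag, 0u, 0u);
}
```

atomicAdd(flag, 0u) and atomicOr(flag, 0u) also return the previous value and leave it unchanged, so they can serve the same purpose.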


襁壮鸢 · 2025-11-27 03:22:59

Thanks for posting original work; 程序园 is better because of you.

抑卞枯 · 2025-11-30 08:44:16

Reposting excellent software security tools and documents is encouraged!

恿深疏 · 6 days ago

yyds. Thanks a lot for sharing.

笃扇 · 6 days ago

Thanks, downloaded and saved.

殳世英 · 5 days ago

Thanks for sharing.

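求几少 · 4 days ago

People who understand the technology and are willing to share it actively and selflessly are getting rarer. Much appreciated.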
