Vulkan Cooperative Matrix 简明教程

慢秤发表于 7 天前

1. 引言

本教程介绍 Vulkan 中的 Cooperative Matrix（协作矩阵）技术，包括工作原理、Vulkan 扩展 API、Shader 函数详解，以及在 NCNN 深度学习框架中的实际应用。
在阅读本文前您需要一定的GPU编程知识，以及Vulkan Compute基本知识，这里推荐白牛大佬的：如何火急火燎地上手 Vulkan Compute - 知乎
本文参考了以下资料：
vulkan tensorcore 笔记 - 知乎
Tencent/ncnn: ncnn is a high-performance neural network inference framework optimized for the mobile platform
为以上内容作者nihui大佬献上赞歌~
本文作者也只是初学者，如有不到之处欢迎批评指正。
本文配套代码：futz12/cm_gemm_example
2. 矩阵分块基础

在深入学习 Cooperative Matrix 之前，理解矩阵分块（Matrix Tiling/Blocking）是必要的。矩阵分块是高性能矩阵计算的核心技术，也是 Cooperative Matrix 的工作基础。
2.1 为什么需要矩阵分块

大矩阵计算的挑战

考虑两个大矩阵相乘：\(C = A \times B\)，其中 \(A\) 是 \(4096 \times 4096\)，\(B\) 是 \(4096 \times 4096\)。
计算量：

\[\text{FLOP} = 2 \times 4096 \times 4096 \times 4096 \approx 137 \text{ GFLOP}\]
内存访问量（朴素实现）：

\[\text{读取 A} = 4096 \times 4096 \times 4 \text{ bytes} \times 4096 = 256 \text{ GB}\]

\[\text{读取 B} = 4096 \times 4096 \times 4 \text{ bytes} \times 4096 = 256 \text{ GB}\]

\[\text{写入 C} = 4096 \times 4096 \times 4 \text{ bytes} = 64 \text{ MB}\]
朴素实现中，每个输出元素 \(C_{ij}\) 需要读取 A 的一整行和 B 的一整列，导致大量重复的内存访问。
分块解决方案

将大矩阵划分为小块，每次只处理一个小块：

\
分块矩阵乘法：

\
分块示意（以 \(128 \times 128\) 矩阵，\(16 \times 16\) 分块为例）：

\
每个 \(\boxed{A_{ij}}\) 是 \(32 \times 32\) 的子矩阵。
2.2 分块的优势

数据复用

分块矩阵乘法的核心优势是数据复用：在计算单个输出块时，A 的行数据和 B 的列数据被多次复用。
以计算 \(C_{00}\) 为例（假设 \(C_{00}\) 是 \(32 \times 32\) 的块，共 1024 个输出元素）：

\
数据需求分析：
计算 \(32 \times 32 = 1024\) 个输出元素需要：

\[\begin{array}{c|c|c|c}\text{数据} & \text{范围} & \text{元素数量} & \text{说明} \\\hline\text{A 的行} & \text{第 0 到 31 行} & 32 \times 128 = 4096 & \text{每行 128 个元素} \\\text{B 的列} & \text{第 0 到 31 列} & 128 \times 32 = 4096 & \text{每列 128 个元素}\end{array}\]
关键观察：

[*]A 的每个元素 \(a_{ik}\) 会参与该行所有 32 个输出元素的计算：\(c_{i0}, c_{i1}, \ldots, c_{i31}\)
[*]B 的每个元素 \(b_{kj}\) 会参与该列所有 32 个输出元素的计算：\(c_{0j}, c_{1j}, \ldots, c_{31j}\)
[*]即每个 A 元素被复用 32 次，每个 B 元素也被复用 32 次
内存访问对比（计算 1024 个输出元素）：

\[\begin{array}{c|c|c|c|c}\text{方式} & \text{A 的访问次数} & \text{B 的访问次数} & \text{总访问次数} & \text{说明} \\\hline\text{不分块} & 1024 \times 128 = 131072 & 1024 \times 128 = 131072 & 262144 & \text{每个输出元素独立读取 A 行和 B 列} \\\text{分块} & 4096 & 4096 & 8192 & \text{A 和 B 的每个元素只加载 1 次到 Shared Memory}\end{array}\]
复用倍数：

\[\text{内存访问减少倍数} = \frac{262144}{8192} = 32 \text{ 倍}\]
分块计算通过将 A 和 B 的数据缓存到 Shared Memory，使得每个数据元素只需加载一次，就能被 32 个输出元素复用，内存访问量减少 32 倍。
缓存友好

内存层次结构：

\[\begin{array}{c|c|c}\text{层次} & \text{容量} & \text{延迟} \\\hline\text{寄存器} & \sim 256 \text{ KB} & 1 \text{ cycle} \\\text{L1 缓存} & \sim 32 \text{ KB} & \sim 4 \text{ cycles} \\\text{L2 缓存} & \sim 2 \text{ MB} & \sim 40 \text{ cycles} \\\text{全局内存} & \sim 8 \text{ GB} & \sim 400 \text{ cycles}\end{array}\]
分块策略：选择块大小使数据能放入高速缓存

\[\text{块大小} \leq \text{缓存大小}\]
例如：L1 缓存 32KB，FP32 数据：

\[\text{最大块元素数} = \frac{32 \times 1024}{4} = 8192 \text{ 元素}\]

\[\text{可选块大小} = 64 \times 64 = 4096 \text{ 元素（安全）}\]
2.3 分块矩阵乘法示例

具体数值示例

计算 \(C = A \times B\)，其中 \(A\) 是 \(4 \times 4\)，\(B\) 是 \(4 \times 4\)，使用 \(2 \times 2\) 分块：
矩阵 A 分块：

\
矩阵 B 分块：

\
计算 \(C_{00}\)：

\

\[= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \times \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} \times \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}\]

\[= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 15 & 18 \\ 21 & 24 \end{bmatrix} = \begin{bmatrix} 16 & 20 \\ 24 & 28 \end{bmatrix}\]
分块计算流程图

\[\begin{array}{c}\text{计算 } C_{00} \\\downarrow \\\begin{array}{|c|c|}\hline\text{Step 1} & \text{加载 } A_{00}, B_{00} \\\hline\text{Step 2} & C_{00}^{(1)} = A_{00} \times B_{00} \\\hline\text{Step 3} & \text{加载 } A_{01}, B_{10} \\\hline\text{Step 4} & C_{00}^{(2)} = A_{01} \times B_{10} \\\hline\text{Step 5} & C_{00} = C_{00}^{(1)} + C_{00}^{(2)} \\\hline\end{array}\end{array}\]
2.4 GPU 上的分块实现

线程块映射

在 GPU 上，每个线程块负责计算一个或多个输出块：

\[\begin{array}{c|c}\text{GPU 概念} & \text{矩阵分块对应} \\\hline\text{Workgroup / Thread Block} & \text{计算一个或多个输出块 } C_{ij} \\\text{Subgroup / Warp} & \text{协作计算一个子块} \\\text{Thread} & \text{持有子块中的部分元素}\end{array}\]
映射示意（\(4 \times 4\) 输出块，\(2 \times 2\) 子块）：

\

[*]线程块 0（红色）：计算 \(C_{00}\)
[*]线程块 1（蓝色）：计算 \(C_{01}\)
[*]线程块 2（绿色）：计算 \(C_{10}\)
[*]线程块 3（橙色）：计算 \(C_{11}\)
Shared Memory 使用

每个线程块使用 Shared Memory 缓存当前计算所需的数据块：
// Shared Memory 声明
shared float A_tile;
shared float B_tile;

// 分块计算
for (int k = 0; k < K; k += TILE_K) {
// 协作加载 A 和 B 的当前块到 Shared Memory
A_tile = A;
B_tile = B[(k + ty) * N + col];

barrier();// 同步

// 计算当前块的贡献
for (int i = 0; i < TILE_K; i++) {
   sum += A_tile * B_tile;
}

barrier();// 同步
}2.5 分块大小与硬件的关系

分块大小的选择不是任意的，而是由硬件特性决定的。
Tensor Core 的固定尺寸

现代 GPU 配备了专门的矩阵计算单元（NVIDIA 称为 Tensor Core，Intel 称为 XMX），这些硬件单元在单周期内完成固定尺寸的矩阵乘加运算：

\[\text{矩阵计算单元操作} = A_{M \times K} \times B_{K \times N} + C_{M \times N}\]
常见矩阵计算单元尺寸：

\[\begin{array}{c|c|c|c|c}M \times N \times K & \text{A 尺寸} & \text{B 尺寸} & \text{C/D 尺寸} & \text{硬件} \\\hline16 \times 8 \times 16 & 16\times16 & 16\times8 & 16\times8 & \text{NVIDIA Volta/Turing/Ampere} \\16 \times 16 \times 16 & 16\times16 & 16\times16 & 16\times16 & \text{NVIDIA Ampere} \\8 \times 8 \times 16 & 8\times16 & 16\times8 & 8\times8 & \text{Intel XMX}\end{array}\]
分块大小的约束

由于矩阵计算单元的尺寸是固定的，分块大小必须满足：

\[\text{分块大小} = n \times \text{硬件单元尺寸}\]
例如，如果硬件支持 16×8×16 的 Tensor Core 操作，那么分块大小应该是 16、8、16 的整数倍。
分块大小选择的核心原则：匹配硬件矩阵计算单元的固定尺寸。这为后续 Cooperative Matrix 的设计奠定了基础——它封装了这些硬件细节，让开发者无需手动处理。
2.6 从分块到 Cooperative Matrix

理解矩阵分块后，Cooperative Matrix 的概念就自然清晰了：

\[\begin{array}{c|c|c}\text{概念} & \text{传统分块} & \text{Cooperative Matrix} \\\hline\text{分块单位} & \text{手动管理} & \text{固定尺寸 (16×8 等)} \\\text{数据存储} & \text{Shared Memory} & \text{线程寄存器（分布式）} \\\text{计算单元} & \text{CUDA Core} & \text{Tensor Core} \\\text{线程协作} & \text{手动同步} & \text{Subgroup 隐式同步} \\\text{编程模型} & \text{显式管理} & \text{声明式 API}\end{array}\]
Cooperative Matrix 本质上是将矩阵分块技术与 Tensor Core 硬件结合，通过标准化的 API 让开发者无需关心底层细节。
3. Cooperative Matrix 原理

3.1 什么是 Cooperative Matrix

Cooperative Matrix（协作矩阵）是 Vulkan 提供的一种特殊矩阵类型，其存储和计算分布在某个作用域（通常是 Subgroup）内的所有调用之间。这些调用协同工作，高效地执行矩阵乘法运算。
Cooperative Matrix 的核心特点：

[*]分布式存储：矩阵数据分布在 Subgroup 的所有线程中，每个线程只持有部分数据
[*]协同计算：所有线程共同参与矩阵运算，充分利用 Tensor Core 硬件加速
[*]中等尺寸：矩阵尺寸通常为 8×8、16×8、16×16 等，适合 Tensor Core 的计算单元
[*]透明优化：矩阵乘加操作由硬件（Tensor Core）执行，开发者无需手动管理线程间数据分配
3.1.1 Cooperative Matrix 的意义

问题背景：传统 GPU 矩阵乘法的困境
在传统 GPU 编程中，矩阵乘法的实现面临以下挑战：

[*]内存访问效率低：每个线程独立计算输出元素，导致大量重复的内存访问
[*]无法利用专用硬件：现代 GPU 配备了 Tensor Core 等矩阵计算专用单元，但传统编程模型无法直接使用
[*]优化复杂度高：需要精心设计 tiling、共享内存管理、bank conflict 避免等，代码复杂且难以移植
Cooperative Matrix 的解决方案
Cooperative Matrix 通过以下方式解决这些问题：

\[\begin{array}{c|c|c}\text{问题} & \text{传统方法} & \text{Cooperative Matrix} \\\hline\text{内存访问} & \text{每线程独立访问，重复读取} & \text{协作加载，数据复用} \\\text{硬件利用} & \text{仅使用 CUDA/Shader Core} & \text{直接使用 Tensor Core} \\\text{编程复杂度} & \text{手动 tiling、共享内存} & \text{声明式编程，驱动优化} \\\text{可移植性} & \text{需针对不同 GPU 优化} & \text{驱动处理硬件差异}\end{array}\]
性能提升：相比传统实现，Cooperative Matrix 可获得显著性能提升。根据 NVIDIA 的分析，Tensor Core 矩阵乘法可在 1 个周期内完成，而传统 FFMA 需要多个周期，具体加速比取决于矩阵尺寸和内存访问模式。
3.1.2 矩阵乘法的数学定义

设矩阵乘法运算 \(D = A \times B + C\)，其中：

\
元素级计算：

\
计算复杂度分析：

\[\text{乘法次数} = M \times N \times K\]

\[\text{加法次数} = M \times N \times K\]

\[\text{总 FLOP} = 2 \times M \times N \times K\]
3.2 Tensor Core 加速原理

现代 GPU（如 NVIDIA RTX 系列、Intel Arc 系列）配备了专门的矩阵计算单元，称为 Tensor Core（NVIDIA）或 XMX（Intel）。这些硬件单元能够在单个时钟周期内完成小矩阵的乘加运算。
3.2.1 Tensor Core 的工作方式

Tensor Core 在单周期内完成小矩阵乘加运算。典型的 16×8×16 运算：

\[\begin{bmatrix}d_{0,0} & d_{0,1} & \cdots & d_{0,7} \\d_{1,0} & d_{1,1} & \cdots & d_{1,7} \\\vdots & \vdots & \ddots & \vdots \\d_{15,0} & d_{15,1} & \cdots & d_{15,7}\end{bmatrix}=\begin{bmatrix}a_{0,0} & a_{0,1} & \cdots & a_{0,15} \\a_{1,0} & a_{1,1} & \cdots & a_{1,15} \\\vdots & \vdots & \ddots & \vdots \\a_{15,0} & a_{15,1} & \cdots & a_{15,15}\end{bmatrix}\times\begin{bmatrix}b_{0,0} & b_{0,1} & \cdots & b_{0,7} \\b_{1,0} & b_{1,1} & \cdots & b_{1,7} \\\vdots & \vdots & \ddots & \vdots \\b_{15,0} & b_{15,1} & \cdots & b_{15,7}\end{bmatrix}+\begin{bmatrix}c_{0,0} & c_{0,1} & \cdots & c_{0,7} \\c_{1,0} & c_{1,1} & \cdots & c_{1,7} \\\vdots & \vdots & \ddots & \vdots \\c_{15,0} & c_{15,1} & \cdots & c_{15,7}\end{bmatrix}\]
计算复杂度分析：

[*]乘法次数：\(M \times N \times K = 16 \times 8 \times 16 = 2048\) 次
[*]加法次数：\(M \times N \times K = 2048\) 次
[*]吞吐量：4096 FLOP/周期（Ampere 架构数据）
注意：Tensor Core 的"1周期"指的是吞吐量而非延迟。实际执行中，Tensor Core 操作有流水线延迟（通常 4-16 周期），但由于流水线设计，可以每个周期发射新的 Tensor Core 指令。
3.2.2 支持的矩阵尺寸

\[\begin{array}{c|c|c|c|c}M \times N \times K & \text{A 类型} & \text{B 类型} & \text{C/D 类型} & \text{说明} \\\hline16 \times 8 \times 16 & \text{FP16} & \text{FP16} & \text{FP16/FP32} & \text{最常用，NVIDIA GPU} \\16 \times 8 \times 8 & \text{FP16} & \text{FP16} & \text{FP16/FP32} & \text{较小尺寸} \\8 \times 8 \times 16 & \text{FP16} & \text{FP16} & \text{FP16/FP32} & \text{通用支持} \\16 \times 16 \times 16 & \text{FP16} & \text{FP16} & \text{FP16/FP32} & \text{较大尺寸，高性能} \\16 \times 8 \times 8 & \text{BF16} & \text{BF16} & \text{FP32} & \text{NVIDIA Ampere+}\end{array}\]
3.2.3 Tensor Core 性能数据

以下是 NVIDIA Ampere 架构的典型延迟数据（来源：NVIDIA 官方分析）：

\[\begin{array}{c|c|c}\text{操作类型} & \text{周期数} & \text{说明} \\\hline\text{全局内存访问} & \sim 380 \text{ 周期} & \text{访问 HBM/GDDR 显存} \\\text{L2 缓存访问} & \sim 200 \text{ 周期} & \text{片上二级缓存} \\\text{L1 缓存/共享内存} & \sim 34 \text{ 周期} & \text{每个 SM 的本地存储} \\\text{FFMA（标量乘加）} & 4 \text{ 周期} & \text{CUDA Core 执行} \\\text{Tensor Core 操作} & 1 \text{ 周期吞吐} & \text{流水线延迟 4-16 周期}\end{array}\]
性能对比示例（32×32 矩阵乘法）：

\[\begin{array}{c|c|c}\text{实现方式} & \text{周期数} & \text{计算方式} \\\hline\text{传统 CUDA Core} & \sim 504 & \text{全局内存 + 8次共享内存 + 8次FFMA} \\\text{Tensor Core} & \sim 235 & \text{全局内存 + 1次共享内存 + 1次Tensor Core}\end{array}\]

\[\text{加速比} = \frac{504}{235} \approx 2.1\times\]
实际加速比取决于矩阵尺寸、内存访问模式和 GPU 架构。大矩阵（如 GEMM）可获得更高加速比，因为 Tensor Core 的优势在计算密集型操作中更明显。
3.3 与传统实现的对比

3.3.1 传统标量实现

每个线程独立计算一个或多个输出元素：

\
线程 \(t\) 计算的输出元素集合：\(\{(i, j) \mid t = f(i, j)\}\)
传统线程计算示意（以 4×4 输出矩阵，2×2 微块为例）：

\
不同颜色代表不同线程计算的 2×2 微块，每个线程独立完成：

[*]线程 0（红色）：\(d_{0,0}, d_{0,1}, d_{1,0}, d_{1,1}\)（左上角 2×2 块）
[*]线程 1（蓝色）：\(d_{0,2}, d_{0,3}, d_{1,2}, d_{1,3}\)（右上角 2×2 块）
[*]线程 2（绿色）：\(d_{2,0}, d_{2,1}, d_{3,0}, d_{3,1}\)（左下角 2×2 块）
[*]线程 3（橙色）：\(d_{2,2}, d_{2,3}, d_{3,2}, d_{3,3}\)（右下角 2×2 块）
传统实现的问题：

\[\begin{array}{c|c}\text{问题} & \text{具体表现} \\\hline\text{内存访问效率低} & \text{每个线程独立读取 A 的行和 B 的列，大量重复访问} \\\text{无法利用 Tensor Core} & \text{CUDA Core 一次只能执行一次乘加} \\\text{计算密度低} & \text{内存带宽成为瓶颈，计算单元利用率低} \\\text{优化复杂} & \text{需要手动实现 tiling、共享内存、避免 bank conflict}\end{array}\]
3.3.2 Cooperative Matrix 实现

Subgroup 内所有线程协作完成矩阵块计算：

\[\mathbf{D}_{M \times N} = \mathbf{A}_{M \times K} \times \mathbf{B}_{K \times N} + \mathbf{C}_{M \times N}\]
每个线程持有矩阵的部分元素，协同完成计算。
Cooperative Matrix 协作计算示意（以 16×8 输出矩阵为例）：

\
所有 32 线程协作计算此矩阵块
数据分布（每个线程持有 4 个元素）：

\[\begin{array}{c|c}\text{线程} & \text{持有元素} \\\hlineT_0 & d_{0,0}, d_{0,1}, d_{0,2}, d_{0,3} \\T_1 & d_{0,4}, d_{0,5}, d_{0,6}, d_{0,7} \\\vdots & \vdots \\T_{31} & d_{12,4}, d_{12,5}, d_{12,6}, d_{12,7}\end{array}\]
Cooperative Matrix 的优势：

\[\begin{array}{c|c}\text{优势} & \text{说明} \\\hline\text{内存合并访问} & \text{所有线程协作加载，内存访问模式优化} \\\text{Tensor Core 加速} & \text{直接利用专用矩阵计算单元} \\\text{高计算密度} & \text{单次操作完成大量计算} \\\text{代码简洁} & \text{声明式编程，无需手动优化}\end{array}\]
3.3.3 性能对比

\[\begin{array}{c|c|c}\text{特性} & \text{传统实现} & \text{Cooperative Matrix} \\\hline\text{计算单元} & \text{CUDA Core / Shader Core} & \text{Tensor Core / XMX} \\\text{线程模型} & \text{每个线程独立计算输出元素} & \text{线程协作计算矩阵块} \\\text{内存访问} & \text{需精心设计避免 bank conflict} & \text{数据布局由硬件处理，但仍需合理设计} \\\text{代码复杂度} & \text{高，需手动 tiling 和共享内存管理} & \text{低，声明式编程} \\\text{性能可移植性} & \text{低，需针对不同 GPU 优化} & \text{高，驱动处理硬件差异} \\\text{计算吞吐量} & \text{基准} & \text{显著提升（取决于矩阵尺寸）}\end{array}\]
3.4 分布式存储模型

3.4.1 数据分布原理

Cooperative Matrix 的核心特性是数据分布在 Subgroup 的所有线程中。
设 Subgroup 大小为 \(S\)，协作矩阵尺寸为 \(M \times N\)。每个线程 \(t \in [0, S)\) 持有的元素数量：

\[\text{elements}_t = \frac{M \times N}{S}\]
典型配置：

\[\begin{array}{c|c|c|c|c}\text{矩阵尺寸} & \text{元素总数} & \text{Subgroup 大小} & \text{每线程元素数} & \text{典型 GPU} \\\hline16 \times 8 & 128 & 32 & 4 & \text{NVIDIA} \\16 \times 16 & 256 & 32 & 8 & \text{NVIDIA} \\8 \times 8 & 64 & 16 & 4 & \text{Intel} \\8 \times 16 & 128 & 64 & 2 & \text{AMD}\end{array}\]
2.4.2 数据分布示意

以 16×8 矩阵、32 线程 Subgroup 为例：

\[\text{总元素数} = 16 \times 8 = 128\]

\[\text{每线程元素数} = \frac{128}{32} = 4\]
矩阵元素分布示意：

\[\begin{bmatrix}\color{red}{e_0} & \color{red}{e_1} & \color{red}{e_2} & \color{red}{e_3} & \color{blue}{e_4} & \color{blue}{e_5} & \color{blue}{e_6} & \color{blue}{e_7} \\\color{green}{e_8} & \color{green}{e_9} & \color{green}{e_{10}} & \color{green}{e_{11}} & \color{orange}{e_{12}} & \color{orange}{e_{13}} & \color{orange}{e_{14}} & \color{orange}{e_{15}} \\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\\color{purple}{e_{120}} & \color{purple}{e_{121}} & \color{purple}{e_{122}} & \color{purple}{e_{123}} & \color{cyan}{e_{124}} & \color{cyan}{e_{125}} & \color{cyan}{e_{126}} & \color{cyan}{e_{127}}\end{bmatrix}\]
线程与元素对应关系：

\[\begin{array}{c|c|c}\text{线程 ID} & \text{颜色} & \text{持有元素} \\\hlineT_0 & \color{red}{\text{红}} & e_0, e_1, e_2, e_3 \\T_1 & \color{blue}{\text{蓝}} & e_4, e_5, e_6, e_7 \\T_2 & \color{green}{\text{绿}} & e_8, e_9, e_{10}, e_{11} \\T_3 & \color{orange}{\text{橙}} & e_{12}, e_{13}, e_{14}, e_{15} \\\vdots & \vdots & \vdots \\T_{30} & \color{purple}{\text{紫}} & e_{120}, e_{121}, e_{122}, e_{123} \\T_{31} & \color{cyan}{\text{青}} & e_{124}, e_{125}, e_{126}, e_{127}\end{array}\]
元素分配公式：

\[\begin{aligned}\text{Thread 0} &\rightarrow \text{elem} \\\text{Thread 1} &\rightarrow \text{elem} \\&\vdots \\\text{Thread 31} &\rightarrow \text{elem}\end{aligned}\]
Vulkan 规范没有规定具体的元素分配方式，这是由硬件实现决定的。程序员不需要关心具体分配，硬件会自动处理。
2.4.3 尺寸匹配要求

矩阵尺寸必须与 Subgroup 大小匹配：\((M \times N) \mod S = 0\)
硬件会验证矩阵元素总数能被 Subgroup 大小整除。不匹配的组合会导致运行时错误或未定义行为。
3.5 计算过程详解

3.5.1 矩阵乘加的执行流程

Cooperative Matrix 执行 \(D = A \times B + C\) 的过程：
步骤 1：数据加载
所有线程调用 coopMatLoad，从全局内存加载矩阵数据：

\[\text{coopMatLoad}(M, \text{buffer}, \text{offset}, \text{stride}, \text{layout})\]
硬件根据 Tensor Core 的数据布局要求将数据分配到各线程的寄存器中。
数据加载示意：

\[\begin{array}{c}\text{全局内存中的大矩阵 A (M × K)} \\\left[\begin{array}{ccccc}\cdot & \cdot & \cdot & \cdot & \cdot \\\cdot & {\boxed{\begin{array}{ccc} a_{0,0} & a_{0,1} & \cdots \\ a_{1,0} & a_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{array}}} & \cdot \\\cdot & \cdot & \cdot & \cdot & \cdot\end{array}\right] \\\big\downarrow \\ext{coopMatLoad(offset, stride)} \\\big\downarrow \\\underbrace{\begin{bmatrix} a_{0,0} & a_{0,1} & \cdots \\ a_{1,0} & a_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}}_{\text{加载的子矩阵块}} \\\big\downarrow \quad \text{硬件按 Tensor Core 布局分配} \\\underbrace{\begin{array}{|c|c|c|c|}\hlineT_0 & T_1 & \cdots & T_{31} \\\hlinee_0..e_3 & e_4..e_7 & \cdots & e_{124}..e_{127} \\\hline\end{array}}_{\text{数据分布到各线程寄存器}}\end{array}\]
步骤 2：矩阵乘加
所有线程调用 coopMatMulAdd，Tensor Core 执行计算：

\
等价于：

\
矩阵乘加示意：

\[\begin{array}{ccc}\text{线程寄存器中的数据} & \rightarrow & \text{Tensor Core 计算} \\\begin{array}{|c|}\hlineT_0: a_0..a_3, b_0..b_3, c_0..c_3 \\T_1: a_4..a_7, b_4..b_7, c_4..c_7 \\\vdots \\T_{31}: a_{124}..a_{127}, b_{124}..b_{127}, c_{124}..c_{127} \\\hline\end{array}&\rightarrow&\begin{array}{|c|}\hline\text{Tensor Core} \\D = A \times B + C \\\text{流水线执行} \\\hline\end{array}\end{array}\]
步骤 3：结果存储
所有线程调用 coopMatStore，将结果写回全局内存：

\[\text{coopMatStore}(D, \text{buffer}, \text{offset}, \text{stride}, \text{layout})\]
结果存储示意：

\[\begin{array}{c}\underbrace{\begin{array}{|c|c|c|c|}\hlineT_0 & T_1 & \cdots & T_{31} \\\hlined_0..d_3 & d_4..d_7 & \cdots & d_{124}..d_{127} \\\hline\end{array}}_{\text{各线程持有结果的一部分}} \\\big\downarrow \\\text{coopMatStore(offset, stride)} \\\big\downarrow \\\underbrace{\begin{bmatrix} d_{0,0} & d_{0,1} & \cdots \\ d_{1,0} & d_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}}_{\text{写入的子矩阵块}} \\\big\downarrow \\\text{全局内存中的大矩阵 D (M × N)} \\\left[\begin{array}{ccccc}\cdot & \cdot & \cdot & \cdot & \cdot \\\cdot & {\boxed{\begin{array}{ccc} d_{0,0} & d_{0,1} & \cdots \\ d_{1,0} & d_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{array}}} & \cdot \\\cdot & \cdot & \cdot & \cdot & \cdot\end{array}\right]\end{array}\]
3.5.2 Uniform Control Flow 要求

Subgroup 内的所有线程必须同时调用 CM 函数，否则行为未定义！

\[\forall t \in
为什么需要 Uniform Control Flow？

[*]矩阵存储分布在所有线程：数据分散在各线程中，需要所有线程参与
[*]协作计算：Tensor Core 需要所有线程协同工作
[*]硬件同步：这些函数在硬件层面隐式执行同步操作
正确示例：
// ✅ 正确：所有线程都调用
coopmat<float16_t, gl_ScopeSubgroup, 16, 8, gl_MatrixUseA> matA;
coopMatLoad(matA, buffer, offset, stride, layout);// 所有 32 线程都调用错误示例：
// ❌ 错误：只有部分线程调用
if (gl_SubgroupInvocationID < 16) {
coopMatLoad(matA, buffer, offset, stride, layout);// 未定义行为！
}3.5.3 与普通变量赋值的对比

\[\begin{array}{c|c|c}\text{操作类型} & \text{普通变量} & \text{Cooperative Matrix} \\\hline\text{声明} & \texttt{float x;} & \texttt{coopmat m;} \\& \text{每个线程有独立的 x} & \text{所有线程共享一个逻辑矩阵 m} \\\text{赋值} & \texttt{x = 1.0;} & \texttt{m = coopmat(1.0);} \\& \text{每个线程的 x 都变成 1.0} & \text{矩阵所有元素都变成 1.0} \\\text{加载} & \texttt{x = buffer;} & \texttt{coopMatLoad(m, buffer, ...);} \\& \text{每个线程加载一个值} & \text{所有线程协作加载整个矩阵} \\\text{计算} & \texttt{y = a * b + c;} & \texttt{m = coopMatMulAdd(A, B, C);} \\& \text{每个线程独立计算} & \text{所有线程协作计算（使用 Tensor Core）}\end{array}\]
4. Vulkan 扩展介绍

4.1 VK_KHR_cooperative_matrix

VK_KHR_cooperative_matrix 是 Khronos 标准化的协作矩阵扩展，提供了跨厂商的标准接口。
扩展启用方式：
// 在 GLSL shader 中启用扩展
#extension GL_KHR_cooperative_matrix : require扩展特性：

[*]支持查询设备支持的矩阵尺寸和类型组合
[*]支持 Subgroup 和 Workgroup 作用域
[*]支持多种数据类型：FP16、FP32、BF16、INT8、INT16、INT32
[*]提供 Robust Buffer Access 选项
4.2 VK_NV_cooperative_matrix

VK_NV_cooperative_matrix 是 NVIDIA 的厂商扩展，是 KHR 扩展的前身。

\[\begin{array}{c|c|c}\text{特性} & \text{VK\_KHR\_cooperative\_matrix} & \text{VK\_NV\_cooperative\_matrix} \\\hline\text{状态} & \text{Khronos 标准} & \text{NVIDIA 厂商扩展} \\\text{矩阵类型} & \texttt{coopmat} & \texttt{fcoopmatNV} \\\text{Use 参数} & \text{显式指定（gl\_MatrixUseA/B/Accumulator）} & \text{隐式（通过尺寸推断）} \\\text{加载函数} & \texttt{coopMatLoad()} & \texttt{coopMatLoadNV()} \\\text{存储函数} & \texttt{coopMatStore()} & \texttt{coopMatStoreNV()} \\\text{乘加函数} & \texttt{coopMatMulAdd()} & \texttt{coopMatMulAddNV()}\end{array}\]
推荐：优先使用 VK_KHR_cooperative_matrix，除非需要兼容旧版驱动。
4.3 API 查询方法

在使用 Cooperative Matrix 之前，必须查询设备支持的矩阵尺寸和类型组合：
// 查询支持的 cooperative matrix 属性数量
uint32_t propertyCount = 0;
vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, nullptr);

// 分配空间并获取属性
std::vector<VkCooperativeMatrixPropertiesKHR> properties(propertyCount);
vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, properties.data());

// 遍历支持的矩阵配置
for (const auto& prop : properties) {
// prop.MSize, prop.NSize, prop.KSize - 矩阵尺寸
// prop.AType, prop.BType, prop.CType, prop.ResultType - 数据类型
// prop.scope - 作用域（Subgroup 或 Workgroup）
}VkCooperativeMatrixPropertiesKHR 结构体：
typedef struct VkCooperativeMatrixPropertiesKHR {
VkStructureType sType;       // 结构体类型
void*          pNext;       // 扩展指针
uint32_t       MSize;       // 矩阵 A 的行数，矩阵 C 的行数
uint32_t       NSize;       // 矩阵 B 的列数，矩阵 C 的列数
uint32_t       KSize;       // 矩阵 A 的列数，矩阵 B 的行数
VkComponentTypeKHR AType;       // 矩阵 A 的元素类型
VkComponentTypeKHR BType;       // 矩阵 B 的元素类型
VkComponentTypeKHR CType;       // 矩阵 C 的元素类型
VkComponentTypeKHR ResultType;    // 结果矩阵的元素类型
VkBool32       saturatingAccumulation; // 是否饱和累加
VkScopeKHR       scope;       // 作用域
} VkCooperativeMatrixPropertiesKHR;VkComponentTypeKHR 支持的类型：

\[\begin{array}{c|c|c}\text{枚举值} & \text{对应类型} & \text{说明} \\\hline\texttt{VK\_COMPONENT\_TYPE\_FLOAT16\_KHR} & \text{FP16} & \text{半精度浮点数} \\\texttt{VK\_COMPONENT\_TYPE\_FLOAT32\_KHR} & \text{FP32} & \text{单精度浮点数} \\\texttt{VK\_COMPONENT\_TYPE\_BFLOAT16\_KHR} & \text{BF16} & \text{Brain Float16} \\\texttt{VK\_COMPONENT\_TYPE\_SINT8\_KHR} & \text{int8} & \text{8位有符号整数} \\\texttt{VK\_COMPONENT\_TYPE\_UINT8\_KHR} & \text{uint8} & \text{8位无符号整数} \\\texttt{VK\_COMPONENT\_TYPE\_SINT16\_KHR} & \text{int16} & \text{16位有符号整数} \\\texttt{VK\_COMPONENT\_TYPE\_UINT16\_KHR} & \text{uint16} & \text{16位无符号整数} \\\texttt{VK\_COMPONENT\_TYPE\_SINT32\_KHR} & \text{int32} & \text{32位有符号整数} \\\texttt{VK\_COMPONENT\_TYPE\_UINT32\_KHR} & \text{uint32} & \text{32位无符号整数}\end{array}\]
VkScopeKHR 作用域：

\[\begin{array}{c|c|c}\text{枚举值} & \text{说明} & \text{使用场景} \\\hline\texttt{VK\_SCOPE\_SUBGROUP\_KHR} & \text{Subgroup 作用域} & \text{最常用，矩阵分布在单个 Subgroup 内} \\\texttt{VK\_SCOPE\_WORKGROUP\_KHR} & \text{Workgroup 作用域} & \text{较大矩阵，需要跨 Subgroup 协作} \\\texttt{VK\_SCOPE\_DEVICE\_KHR} & \text{设备作用域} & \text{跨 Workgroup（较少使用）} \\\texttt{VK\_SCOPE\_QUEUE\_FAMILY\_KHR} & \text{队列族作用域} & \text{跨队列（较少使用）}\end{array}\]
5. Shader 函数详解

5.1 矩阵类型声明

KHR 扩展矩阵类型

// KHR 扩展的矩阵类型声明
coopmat<ElementType, Scope, Rows, Cols, Use> matrixName;

// 示例：声明一个 16×8 的矩阵 A（用于乘法左操作数）
coopmat<float16_t, gl_ScopeSubgroup, 16, 8, gl_MatrixUseA> matA;

// 示例：声明一个 8×8 的矩阵 B（用于乘法右操作数）
coopmat<float16_t, gl_ScopeSubgroup, 8, 8, gl_MatrixUseB> matB;

// 示例：声明一个 16×8 的累加器矩阵 C
coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> matC;模板参数说明：

\[\begin{array}{c|c|c}\text{参数} & \text{类型} & \text{说明} \\\hline\text{ElementType} & \text{类型} & \text{矩阵元素类型：float16\_t, float, bfloat16\_t, int8\_t, uint8\_t, int16\_t, uint16\_t, int32\_t, uint32\_t} \\\text{Scope} & \text{常量} & \text{作用域：gl\_ScopeSubgroup, gl\_ScopeWorkgroup} \\\text{Rows} & \text{常量整数} & \text{矩阵行数（必须是设备支持的尺寸）} \\\text{Cols} & \text{常量整数} & \text{矩阵列数（必须是设备支持的尺寸）} \\\text{Use} & \text{常量} & \text{矩阵用途：gl\_MatrixUseA, gl\_MatrixUseB, gl\_MatrixUseAccumulator}\end{array}\]
NV 扩展矩阵类型

// NV 扩展的矩阵类型声明
fcoopmatNV<BitWidth, Scope, Rows, Cols> matrixName;

// 示例：FP16 矩阵（16位精度）
fcoopmatNV<16, gl_ScopeSubgroup, 16, 8> matA;

// 示例：FP32 矩阵（32位精度，用于累加器）
fcoopmatNV<32, gl_ScopeSubgroup, 16, 8> matC;BitWidth 精度说明：

[*]16 - FP16（半精度浮点）
[*]32 - FP32（单精度浮点）
5.2 Matrix Use 类型

Matrix Use 参数决定了矩阵在乘加运算中的角色：

\[\underbrace{A_{M \times K}}_{\text{MatrixA}} \times\underbrace{B_{K \times N}}_{\text{MatrixB}} +\underbrace{C_{M \times N}}_{\text{Accumulator}} =\underbrace{D_{M \times N}}_{\text{Accumulator}}\]
矩阵乘加示意（以 \(M=16, K=16, N=8\) 为例）：

\[\underbrace{\begin{bmatrix}a_{0,0} & \cdots & a_{0,15} \\\vdots & \ddots & \vdots \\a_{15,0} & \cdots & a_{15,15}\end{bmatrix}}_{A: 16 \times 16}\times\underbrace{\begin{bmatrix}b_{0,0} & \cdots & b_{0,7} \\\vdots & \ddots & \vdots \\b_{15,0} & \cdots & b_{15,7}\end{bmatrix}}_{B: 16 \times 8}+\underbrace{\begin{bmatrix}c_{0,0} & \cdots & c_{0,7} \\\vdots & \ddots & \vdots \\c_{15,0} & \cdots & c_{15,7}\end{bmatrix}}_{C: 16 \times 8}=\underbrace{\begin{bmatrix}d_{0,0} & \cdots & d_{0,7} \\\vdots & \ddots & \vdots \\d_{15,0} & \cdots & d_{15,7}\end{bmatrix}}_{D: 16 \times 8}\]
单个元素计算示意：

\
例如 \(d_{0,0}\) 的计算：

\

\[\begin{array}{c|c|c|c}\text{Use 类型} & \text{角色} & \text{尺寸约束} & \text{典型数据类型} \\\hline\texttt{gl\_MatrixUseA} & \text{乘法左操作数} & M \times K & \text{FP16, BF16, INT8} \\\texttt{gl\_MatrixUseB} & \text{乘法右操作数} & K \times N & \text{FP16, BF16, INT8} \\\texttt{gl\_MatrixUseAccumulator} & \text{累加器（加法操作数和结果）} & M \times N & \text{FP16, FP32}\end{array}\]
5.3 Scope 参数

Scope 参数定义了矩阵数据分布的范围：
gl_ScopeSubgroup：矩阵分布在单个 Subgroup 内（最常用）
gl_ScopeWorkgroup：矩阵分布在多个 Subgroup 间
5.4 加载和存储函数

coopMatLoad / coopMatLoadNV

从内存加载数据到 Cooperative Matrix：
// KHR 扩展语法
void coopMatLoad(
out coopmat<ElementType, Scope, M, N, Use> matrix,// 输出矩阵
in uint[] buffer,                               // 源缓冲区
uint offset,                                     // 起始偏移（元素数）
uint stride,                                     // 行/列步长
uint layout                                     // 布局方式
);

// 示例：从 buffer 加载矩阵
coopmat<float16_t, gl_ScopeSubgroup, 16, 8, gl_MatrixUseA> matA;
coopMatLoad(matA, inputData, 0, K, gl_CooperativeMatrixLayoutRowMajor);coopMatStore / coopMatStoreNV

将 Cooperative Matrix 数据存储到内存：
// KHR 扩展语法
void coopMatStore(
in coopmat<ElementType, Scope, M, N, Use> matrix,// 输入矩阵
out uint[] buffer,                               // 目标缓冲区
uint offset,                                     // 起始偏移
uint stride,                                     // 行/列步长
uint layout                                     // 布局方式
);

// 示例：存储结果矩阵
coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> result;
coopMatStore(result, outputData, 0, N, gl_CooperativeMatrixLayoutRowMajor);5.5 矩阵乘加函数

执行矩阵乘加运算：\(D = A \times B + C\)
// KHR 扩展语法
coopmat<ResultType, Scope, M, N, gl_MatrixUseAccumulator> coopMatMulAdd(
coopmat A,
coopmat<BType, Scope, K, N, gl_MatrixUseB> B,
coopmat<CType, Scope, M, N, gl_MatrixUseAccumulator> C
);

// 示例：完整的矩阵乘加
coopmat<float16_t, gl_ScopeSubgroup, 16, 8, gl_MatrixUseA> matA;
coopmat<float16_t, gl_ScopeSubgroup, 8, 8, gl_MatrixUseB> matB;
coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> matC;
coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> result;

// 初始化累加器
matC = coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator>(0.0f);

// 执行乘加
result = coopMatMulAdd(matA, matB, matC);5.6 布局参数

内存布局：
行主序（gl_CooperativeMatrixLayoutRowMajor）

\[\text{addr}(i, j) = \text{base} + i \times \text{stride} + j\]
内存排列示意（以 3×4 矩阵为例）：

\[\begin{bmatrix}a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3}\end{bmatrix}\]
内存中的存储顺序：

\[\]
即按行依次存储：第 0 行 → 第 1 行 → 第 2 行
列主序（gl_CooperativeMatrixLayoutColumnMajor）

\[\text{addr}(i, j) = \text{base} + j \times \text{stride} + i\]
内存排列示意（以 3×4 矩阵为例）：

\[\begin{bmatrix}a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3}\end{bmatrix}\]
内存中的存储顺序：

\[\]
即按列依次存储：第 0 列 → 第 1 列 → 第 2 列 → 第 3 列
5.7 其他相关函数

矩阵初始化

// 使用标量初始化所有元素
coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> mat(0.0f);

// 拷贝构造
coopmat<float, gl_ScopeSubgroup, 16, 8, gl_MatrixUseAccumulator> mat2 = mat1;矩阵的行数 M 和列数 N 在类型声明时已确定，编译时已知，无需运行时查询。
6. GEMM 完整实现示例

本章通过一个完整的 GEMM 示例项目 cm_gemm_example，讲解 Cooperative Matrix 的实际应用。该示例包含主机端代码和 Shader 代码，实现了带有 Double Buffer 优化的矩阵乘法。
6.1 项目结构

cm_gemm_example/
├── main.cpp       # 主机端代码
└── CMakeLists.txt # 构建配置主要功能模块：

\[\begin{array}{c|c|c}\text{模块} & \text{文件} & \text{功能} \\\hline\text{设备初始化} & \text{main.cpp} & \text{创建 Instance、Device、Queue} \\\text{CM 属性查询} & \text{main.cpp} & \text{查询支持的矩阵尺寸} \\\text{权重重排} & \text{main.cpp} & \text{预处理 B 矩阵} \\\text{Pipeline 创建} & \text{main.cpp} & \text{编译 Shader、创建 Pipeline} \\\text{GEMM 计算} & \text{Shader} & \text{执行矩阵乘法}\end{array}\]
6.2 主机端：查询 Cooperative Matrix 属性

在使用 Cooperative Matrix 之前，必须查询设备支持的矩阵尺寸和类型组合：
void queryCooperativeMatrixProperties() { PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR =       (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)    vkGetInstanceProcAddr(instance, "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR");    uint32_t propertyCount = 0; VK_CHECK(vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, nullptr));    std::vector properties(propertyCount); for (auto& prop : properties) {    prop.sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;    prop.pNext = nullptr; }    VK_CHECK(vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, properties.data()));    for (const auto& prop : properties) {    if (prop.scope == VK_SCOPE_SUBGROUP_KHR) {          std::cout

页: [1]

程序园's Archiver

Vulkan Cooperative Matrix 简明教程