cpu: rv64: brgemm: add bias fusion for rv64 brgemm kernel #5150

Open
zhangjian29 wants to merge 2 commits into uxlfoundation:main from zhangjian29:add-bias-fusion-for-brgemm-kernel

Conversation

@zhangjian29
Contributor

Description

This PR introduces fused bias addition in the RV64 BRGEMM JIT kernel, eliminating the separate scalar bias loop that previously ran after the BRGEMM computation. When bias is present, the bias vector is now added to the accumulators inside the kernel using RVV vector operations before storing results to C, following the same pattern as rvv_gemm_f32.

This initial version provides:

  1. Fused bias addition in the BRGEMM JIT micro-kernel (4-column main path and single-column N-tail)
  2. Bias pointer threaded through brgemm_kernel_execute with correct per-M-tile offset
  3. Automatic bias fusion in both rvv_brgemm_convolution_fwd and rvv_brgemm_inner_product_fwd callers
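
Conceptually, the caller-side change looks like the sketch below. The call shape follows the `brgemm_kernel_execute(brg_kernel, A, B, C, n, beta)` form quoted in the review thread; `with_bias`, `M`, `ldc`, and `bias` are illustrative names, not copied from the source:

```cpp
// Before: BRGEMM first, then a separate scalar pass over the stored output.
brgemm_kernel_execute(brg_kernel, A, B, C, n, beta);
if (with_bias)
    for (int m = 0; m < M; ++m)
        for (int j = 0; j < n; ++j)
            C[m * ldc + j] += bias[m]; // one bias element per row

// After: the bias pointer is threaded into the kernel, which adds it to the
// accumulators with RVV vfadd_vv before the C store; the scalar pass is gone.
brgemm_kernel_execute(brg_kernel, A, B, C, n, beta,
        with_bias ? bias : nullptr);
```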

Key Features

  • Fused Bias in JIT Kernel: Bias vector is loaded once per N-group and added to all 4 accumulators using vfadd_vv before the C-store phase, avoiding a separate scalar pass over the output
  • Runtime Null-Check: Bias pointer is checked at runtime (beq reg_bias, x0); when no bias is present, the overhead is a single branch instruction
  • Data Types: f32
  • Zero API Change: The brgemm_kernel_execute signature adds ptr_bias = nullptr as a default parameter, so existing callers (e.g., Winograd) are unaffected
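
A hedged sketch of the extended entry point; the parameter list is inferred from the description and the call site quoted in the review thread, so the exact types and names are assumptions:

```cpp
struct brgemm_kernel_t; // opaque kernel handle, placeholder declaration

// The trailing defaulted parameter is the only signature change, so existing
// callers such as the Winograd path compile and behave exactly as before.
void brgemm_kernel_execute(const brgemm_kernel_t *brg_kernel, const float *A,
        const float *B, float *C, int n, float beta,
        const float *ptr_bias = nullptr);
```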

Implementation Details

The bias vector has length M (one element per output channel/row). In the JIT kernel:

  1. Bias pointer is loaded from brgemm_kernel_params_t::ptr_bias (offset 56) into callee-saved register s4
  2. After the K-loop completes and before C-store, bias is loaded into v_tmp (LMUL=m4) and added to all accumulator vectors (v_c0-v_c3) via vfadd_vv
  3. The normal C-store logic (beta branch) then proceeds unchanged
  4. For the single-column N-tail, the same pattern applies to v_c0
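
In C-equivalent terms, the emitted tail behaves like the sketch below, where each scalar loop stands in for a single RVV instruction and the array shapes are illustrative:

```cpp
// Models the post-K-loop bias step for one 4-column N-group.
// acc[c] stands for accumulators v_c0..v_c3, vl for the active vector length.
static void add_bias_to_accumulators(float acc[4][32], int vl,
        const float *bias /* params->ptr_bias, held in s4 */) {
    if (bias == nullptr) return; // beq reg_bias, x0: one branch when no bias
    // vle32.v v_tmp, (bias) with LMUL=m4 loads the bias slice once, ...
    for (int c = 0; c < 4; ++c) // ...then vfadd_vv v_cX, v_cX, v_tmp per column
        for (int i = 0; i < vl; ++i)
            acc[c][i] += bias[i];
    // the unchanged beta-branch C-store runs next; the single-column N-tail
    // applies the same add to v_c0 only
}
```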

Bias is applied only on the first K-block (kb == 0); subsequent K-blocks pass nullptr. In convolution, bias is passed only on the first kernel-position call; in inner product, on the first brgemm_kernel_execute call.
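
A minimal sketch of that gating, assuming `kb` indexes K-blocks at the call site and that the A/B block pointers are schematic:

```cpp
for (int kb = 0; kb < num_k_blocks; ++kb) {
    const float beta_val = (kb == 0) ? 0.0f : 1.0f; // overwrite, then accumulate
    const float *bias_arg = (kb == 0 && with_bias) ? bias : nullptr; // once only
    brgemm_kernel_execute(brg_kernel, A_blocks[kb], B_blocks[kb], C, n,
            beta_val, bias_arg);
}
```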

For the inner product split-M path (when MB < nthr), each thread handles a subset of M tiles with the correctly offset bias pointer (bia + m_offset), eliminating the per-thread scalar bias loop entirely.
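
Schematically, using the `bia + m_offset` expression from the paragraph above (the surrounding strides and names are assumptions):

```cpp
// Each thread owns the M tiles starting at row m_offset; shifting the bias
// pointer makes the kernel add exactly this thread's slice, with no scalar
// bias loop afterwards.
const float *bias_thr = with_bias ? bia + m_offset : nullptr;
brgemm_kernel_execute(brg_kernel, A + m_offset * lda, B, C + m_offset * ldc,
        n, beta_val, bias_thr);
```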

Checklist

General

  • Do all unit and benchdnn tests pass?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data?

All experiments are performed on a Spacemit X60 platform with VLEN=128. We compare:

  1. baseline: upstream main branch (bias added in separate scalar loop after BRGEMM)
  2. this PR: fused bias addition inside BRGEMM JIT kernel

Correctness Evaluation

Test commands:

```
ONEDNN_VERBOSE=1 ./benchdnn --conv --batch=tests/benchdnn/inputs/conv/shapes_mobilenet
ONEDNN_VERBOSE=1 ./benchdnn --ip --dir=FWD_B --batch=tests/benchdnn/inputs/ip/shapes_alexnet
ONEDNN_VERBOSE=1 ./benchdnn --ip --dir=FWD_B --batch=tests/benchdnn/inputs/ip/shapes_transformer_lt
```

All brgemm:rvv, gemm:rvv and jit_1x1:rvv tests pass.

Single-Core Performance

Inner Product (brgemm:rvv)

| Layer | Shape | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- | --- |
| Encoder_MM_1*36 | IC=1024, OC=1024, MB=40 | 9.80 | 9.02 | +7.9% |
| Encoder_MM_8*6 | IC=4096, OC=1024, MB=40 | 38.78 | 37.23 | +4.0% |
| Decoder_vocabulary*40 | IC=1024, OC=32768, MB=4 | 318.31 | 311.93 | +2.0% |
| resnet:ip1 | IC=2048, OC=1000, MB=1 | 3.56 | 3.61 | parity |
| Alexnet:ip3 | IC=4096, OC=1000, MB=1 | 7.55 | 7.64 | parity |

IP Transformer_lt total: 464.02 ms → 443.12 ms (+4.5%)

Convolution (brgemm:rvv)

| Layer | Shape | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- | --- |
| res2a_branch2b*3 | IC=64, OC=64, 56x56, k3 | 550.13 | 543.76 | +1.2% |
| res3a_branch1 | IC=256, OC=512, s2 | 1166.48 | 1167.82 | parity |
| res3a_branch2a | IC=256, OC=128, s2 | 297.26 | 298.07 | parity |

Convolution performance is largely unchanged because the bias loop overhead was already small relative to the GEMM compute time.

8-Core Performance

Inner Product (brgemm:rvv, split-M path)

| Layer | Shape | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- | --- |
| resnet:ip1 | IC=2048, OC=1000, MB=1 | 2.69 | 1.67 | +38.0% |
| Alexnet:ip3 | IC=4096, OC=1000, MB=1 | 3.50 | 3.53 | parity |

The 8-core split-M path shows +38% improvement for resnet:ip1 (MB=1, OC=1000). When MB < nthr, each thread handles a subset of M tiles. The previous scalar bias loop ran per-thread over all MB rows; fusing it into the kernel eliminates this overhead and replaces scalar ops with vector ops (LMUL=m4).

Convolution (brgemm:rvv)

| Layer | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- |
| res2a_branch2b*3 | 74.39 | 74.23 | parity |
| res3a_branch1 | 305.36 | 303.84 | +0.5% |
| res3a_branch2a | 80.65 | 80.06 | +0.7% |
| res3a_branch2b*4 | 102.35 | 100.36 | +1.9% |

Dispatch Coverage

  • IP Transformer_lt: 6/7 layers (86%) use brgemm:rvv
  • IP Alexnet: 1/3 layers (33%) use brgemm:rvv
  • Conv ResNet-50: 4/20 layers (20%) use brgemm:rvv (remaining: jit_1x1:rvv, gemm:rvv, wino:rvv)

Key Observations

  • Inner product benefits most from bias fusion, especially the split-M multi-core path where per-thread scalar bias loops are eliminated
  • Convolution shows minimal change because the scalar bias loop was already a small fraction of total compute time
  • The fused approach reduces memory traffic: bias is loaded once per N-group and consumed immediately, rather than requiring a separate pass over the output

Future Plans

  1. Extend bias fusion to f16 data type when Zvfh support is added to the BRGEMM kernel
  2. Investigate fusing additional post-ops (e.g., ReLU) into the BRGEMM kernel

@zhangjian29 zhangjian29 marked this pull request as ready for review May 13, 2026 05:57
@zhangjian29 zhangjian29 requested a review from a team as a code owner May 13, 2026 05:57
Comment thread on src/cpu/rv64/rvv_brgemm_conv.cpp (outdated)

```cpp
const float beta_val = first_kpos ? 0.0f : 1.0f;
brgemm_kernel_execute(
        brg_kernel, A, B, C, valid_ow, beta_val);
const float *bias_ptr = first_kpos && jcp.with_bias
```
Contributor


This misses bias for padded convolution because first_kpos can be true for a BRGEMM call that does not cover the full OW range. For example, ./benchdnn --conv --mode=C --dir=FWD_B --dt=f32 --stag=acdb --wtag=cdba --dtag=acdb mb1ic16ih20iw20oc16oh20ow20kh3kw3sh1sw1ph1pw1 will fail.

Contributor Author


You're right, thanks for catching this. In padded convolutions the first BRGEMM call may only cover a partial OW range (e.g., with pw=1, kw=3: kw=0 covers OW[1..OW-1]), so positions like OW[0] never receive bias.

  • Fix: Instead of fusing bias into the first BRGEMM call, the output is now initialized to bias values when !with_sum && with_bias, then all BRGEMM calls accumulate with beta=1. This covers every OW position regardless of padding. For with_sum && with_bias, a scalar bias add runs after the loop (same as the original code on main).

The first_kpos variable and bias pointer passing are removed from the conv path entirely. The JIT kernel's bias fusion is still used for the inner product path, where K-split doesn't have this partial-coverage issue.
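
A hedged sketch of the reworked conv path, assuming an M x N output tile with one bias value per row as in the description (all names illustrative):

```cpp
// Initialize the output to bias so every OW position receives it, padded or
// not; every kernel-position call then purely accumulates on top.
if (with_bias && !with_sum) {
    for (int m = 0; m < M; ++m)
        for (int j = 0; j < N; ++j)
            C[m * ldc + j] = bias[m];
}
// beta = 1 for all calls when the output was pre-initialized; without bias
// the original rule (beta = 0 on the first call, 1 afterwards) still applies.
brgemm_kernel_execute(brg_kernel, A, B, C, valid_ow, /*beta=*/1.0f);
```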

@zhangjian29 zhangjian29 force-pushed the add-bias-fusion-for-brgemm-kernel branch from b40044b to da74b01 Compare May 13, 2026 12:25
@zhangjian29 zhangjian29 requested a review from xiazhuozhao May 13, 2026 12:53