cpu: rv64: brgemm: add bias fusion for rv64 brgemm kernel #5150

Open
zhangjian29 wants to merge 2 commits into uxlfoundation:main from zhangjian29:add-bias-fusion-for-brgemm-kernel

Conversation

@zhangjian29
Contributor

Description

This PR introduces fused bias addition in the RV64 BRGEMM JIT kernel, eliminating the separate scalar bias loop that previously ran after the BRGEMM computation. When bias is present, the bias vector is now added to the accumulators inside the kernel using RVV vector operations before storing results to C, following the same pattern as rvv_gemm_f32.

This initial version provides:

  1. Fused bias addition in the BRGEMM JIT micro-kernel (4-column main path and single-column N-tail)
  2. Bias pointer threaded through brgemm_kernel_execute with correct per-M-tile offset
  3. Automatic bias fusion in both rvv_brgemm_convolution_fwd and rvv_brgemm_inner_product_fwd callers
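
Conceptually, the caller-side change looks like the sketch below. The call shape follows the `brgemm_kernel_execute(brg_kernel, A, B, C, n, beta)` form quoted in the review thread; `with_bias`, `M`, `ldc`, and `bias` are illustrative names, not copied from the source:

```cpp
// Before: BRGEMM first, then a separate scalar pass over the stored output.
brgemm_kernel_execute(brg_kernel, A, B, C, n, beta);
if (with_bias)
    for (int m = 0; m < M; ++m)
        for (int j = 0; j < n; ++j)
            C[m * ldc + j] += bias[m]; // one bias element per row

// After: the bias pointer is threaded into the kernel, which adds it to the
// accumulators with RVV vfadd_vv before the C store; the scalar pass is gone.
brgemm_kernel_execute(brg_kernel, A, B, C, n, beta,
        with_bias ? bias : nullptr);
```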

Key Features

  • Fused Bias in JIT Kernel: Bias vector is loaded once per N-group and added to all 4 accumulators using vfadd_vv before the C-store phase, avoiding a separate scalar pass over the output
  • Runtime Null-Check: Bias pointer is checked at runtime (beq reg_bias, x0); when no bias is present, the overhead is a single branch instruction
  • Data Types: f32
  • Zero API Change: The brgemm_kernel_execute signature adds ptr_bias = nullptr as a default parameter, so existing callers (e.g., Winograd) are unaffected
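
A hedged sketch of the extended entry point; the parameter list is inferred from the description and the call site quoted in the review thread, so the exact types and names are assumptions:

```cpp
struct brgemm_kernel_t; // opaque kernel handle, placeholder declaration

// The trailing defaulted parameter is the only signature change, so existing
// callers such as the Winograd path compile and behave exactly as before.
void brgemm_kernel_execute(const brgemm_kernel_t *brg_kernel, const float *A,
        const float *B, float *C, int n, float beta,
        const float *ptr_bias = nullptr);
```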

Implementation Details

The bias vector has length M (one element per output channel/row). In the JIT kernel:

  1. Bias pointer is loaded from brgemm_kernel_params_t::ptr_bias (offset 56) into callee-saved register s4
  2. After the K-loop completes and before C-store, bias is loaded into v_tmp (LMUL=m4) and added to all accumulator vectors (v_c0-v_c3) via vfadd_vv
  3. The normal C-store logic (beta branch) then proceeds unchanged
  4. For the single-column N-tail, the same pattern applies to v_c0
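
In C-equivalent terms, the emitted tail behaves like the sketch below, where each scalar loop stands in for a single RVV instruction and the array shapes are illustrative:

```cpp
// Models the post-K-loop bias step for one 4-column N-group.
// acc[c] stands for accumulators v_c0..v_c3, vl for the active vector length.
static void add_bias_to_accumulators(float acc[4][32], int vl,
        const float *bias /* params->ptr_bias, held in s4 */) {
    if (bias == nullptr) return; // beq reg_bias, x0: one branch when no bias
    // vle32.v v_tmp, (bias) with LMUL=m4 loads the bias slice once, ...
    for (int c = 0; c < 4; ++c) // ...then vfadd_vv v_cX, v_cX, v_tmp per column
        for (int i = 0; i < vl; ++i)
            acc[c][i] += bias[i];
    // the unchanged beta-branch C-store runs next; the single-column N-tail
    // applies the same add to v_c0 only
}
```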

Bias is applied only on the first K-block (kb == 0); subsequent K-blocks pass nullptr. In convolution, bias is passed only on the first kernel-position call; in inner product, on the first brgemm_kernel_execute call.
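
A minimal sketch of that gating, assuming `kb` indexes K-blocks at the call site and that the A/B block pointers are schematic:

```cpp
for (int kb = 0; kb < num_k_blocks; ++kb) {
    const float beta_val = (kb == 0) ? 0.0f : 1.0f; // overwrite, then accumulate
    const float *bias_arg = (kb == 0 && with_bias) ? bias : nullptr; // once only
    brgemm_kernel_execute(brg_kernel, A_blocks[kb], B_blocks[kb], C, n,
            beta_val, bias_arg);
}
```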

For the inner product split-M path (when MB < nthr), each thread handles a subset of M tiles with the correctly offset bias pointer (bia + m_offset), eliminating the per-thread scalar bias loop entirely.
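
Schematically, using the `bia + m_offset` expression from the paragraph above (the surrounding strides and names are assumptions):

```cpp
// Each thread owns the M tiles starting at row m_offset; shifting the bias
// pointer makes the kernel add exactly this thread's slice, with no scalar
// bias loop afterwards.
const float *bias_thr = with_bias ? bia + m_offset : nullptr;
brgemm_kernel_execute(brg_kernel, A + m_offset * lda, B, C + m_offset * ldc,
        n, beta_val, bias_thr);
```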

Checklist

General

  • Do all unit and benchdnn tests pass?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data?

All experiments are performed on a Spacemit X60 platform with VLEN=128. We compare:

  1. baseline: upstream main branch (bias added in separate scalar loop after BRGEMM)
  2. this PR: fused bias addition inside BRGEMM JIT kernel

Correctness Evaluation

Test commands:

```
ONEDNN_VERBOSE=1 ./benchdnn --conv --batch=tests/benchdnn/inputs/conv/shapes_mobilenet
ONEDNN_VERBOSE=1 ./benchdnn --ip --dir=FWD_B --batch=tests/benchdnn/inputs/ip/shapes_alexnet
ONEDNN_VERBOSE=1 ./benchdnn --ip --dir=FWD_B --batch=tests/benchdnn/inputs/ip/shapes_transformer_lt
```

All brgemm:rvv, gemm:rvv and jit_1x1:rvv tests pass.

Single-Core Performance

Inner Product (brgemm:rvv)

| Layer | Shape | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- | --- |
| Encoder_MM_1*36 | IC=1024, OC=1024, MB=40 | 9.80 | 9.02 | +7.9% |
| Encoder_MM_8*6 | IC=4096, OC=1024, MB=40 | 38.78 | 37.23 | +4.0% |
| Decoder_vocabulary*40 | IC=1024, OC=32768, MB=4 | 318.31 | 311.93 | +2.0% |
| resnet:ip1 | IC=2048, OC=1000, MB=1 | 3.56 | 3.61 | parity |
| Alexnet:ip3 | IC=4096, OC=1000, MB=1 | 7.55 | 7.64 | parity |

IP Transformer_lt total: 464.02 ms → 443.12 ms (+4.5%)

Convolution (brgemm:rvv)

| Layer | Shape | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- | --- |
| res2a_branch2b*3 | IC=64, OC=64, 56x56, k3 | 550.13 | 543.76 | +1.2% |
| res3a_branch1 | IC=256, OC=512, s2 | 1166.48 | 1167.82 | parity |
| res3a_branch2a | IC=256, OC=128, s2 | 297.26 | 298.07 | parity |

Convolution performance is largely unchanged because the bias loop overhead was already small relative to the GEMM compute time.

8-Core Performance

Inner Product (brgemm:rvv, split-M path)

| Layer | Shape | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- | --- |
| resnet:ip1 | IC=2048, OC=1000, MB=1 | 2.69 | 1.67 | +38.0% |
| Alexnet:ip3 | IC=4096, OC=1000, MB=1 | 3.50 | 3.53 | parity |

The 8-core split-M path shows +38% improvement for resnet:ip1 (MB=1, OC=1000). When MB < nthr, each thread handles a subset of M tiles. The previous scalar bias loop ran per-thread over all MB rows; fusing it into the kernel eliminates this overhead and replaces scalar ops with vector ops (LMUL=m4).

Convolution (brgemm:rvv)

| Layer | Baseline (ms) | This PR (ms) | Improvement |
| --- | --- | --- | --- |
| res2a_branch2b*3 | 74.39 | 74.23 | parity |
| res3a_branch1 | 305.36 | 303.84 | +0.5% |
| res3a_branch2a | 80.65 | 80.06 | +0.7% |
| res3a_branch2b*4 | 102.35 | 100.36 | +1.9% |

Dispatch Coverage

  • IP Transformer_lt: 6/7 layers (86%) use brgemm:rvv
  • IP Alexnet: 1/3 layers (33%) use brgemm:rvv
  • Conv ResNet-50: 4/20 layers (20%) use brgemm:rvv (remaining: jit_1x1:rvv, gemm:rvv, wino:rvv)

Key Observations

  • Inner product benefits most from bias fusion, especially the split-M multi-core path where per-thread scalar bias loops are eliminated
  • Convolution shows minimal change because the scalar bias loop was already a small fraction of total compute time
  • The fused approach reduces memory traffic: bias is loaded once per N-group and consumed immediately, rather than requiring a separate pass over the output

Future Plans

  1. Extend bias fusion to f16 data type when Zvfh support is added to the BRGEMM kernel
  2. Investigate fusing additional post-ops (e.g., ReLU) into the BRGEMM kernel

@zhangjian29 zhangjian29 marked this pull request as ready for review May 13, 2026 05:57
@zhangjian29 zhangjian29 requested a review from a team as a code owner May 13, 2026 05:57
Comment thread on src/cpu/rv64/rvv_brgemm_conv.cpp (outdated)

```cpp
const float beta_val = first_kpos ? 0.0f : 1.0f;
brgemm_kernel_execute(
        brg_kernel, A, B, C, valid_ow, beta_val);
const float *bias_ptr = first_kpos && jcp.with_bias
```
Contributor


This misses bias for padded convolution because first_kpos can be true for a BRGEMM call that does not cover the full OW range. For example, ./benchdnn --conv --mode=C --dir=FWD_B --dt=f32 --stag=acdb --wtag=cdba --dtag=acdb mb1ic16ih20iw20oc16oh20ow20kh3kw3sh1sw1ph1pw1 will fail.

Contributor Author


You're right, thanks for catching this. In padded convolutions the first BRGEMM call may only cover a partial OW range (e.g., with pw=1, kw=3: kw=0 covers OW[1..OW-1]), so positions like OW[0] never receive bias.

  • Fix: Instead of fusing bias into the first BRGEMM call, the output is now initialized to bias values when !with_sum && with_bias, then all BRGEMM calls accumulate with beta=1. This covers every OW position regardless of padding. For with_sum && with_bias, a scalar bias add runs after the loop (same as the original code on main).

The first_kpos variable and bias pointer passing are removed from the conv path entirely. The JIT kernel's bias fusion is still used for the inner product path, where K-split doesn't have this partial-coverage issue.
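
A hedged sketch of the reworked conv path, assuming an M x N output tile with one bias value per row as in the description (all names illustrative):

```cpp
// Initialize the output to bias so every OW position receives it, padded or
// not; every kernel-position call then purely accumulates on top.
if (with_bias && !with_sum) {
    for (int m = 0; m < M; ++m)
        for (int j = 0; j < N; ++j)
            C[m * ldc + j] = bias[m];
}
// beta = 1 for all calls when the output was pre-initialized; without bias
// the original rule (beta = 0 on the first call, 1 afterwards) still applies.
brgemm_kernel_execute(brg_kernel, A, B, C, valid_ow, /*beta=*/1.0f);
```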

@zhangjian29 zhangjian29 force-pushed the add-bias-fusion-for-brgemm-kernel branch from b40044b to da74b01 Compare May 13, 2026 12:25
@zhangjian29 zhangjian29 requested a review from xiazhuozhao May 13, 2026 12:53