perf(shader-compiler): Compiler performance / incremental compilation optimizations (7 items) #3002

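# ShaderLab compiler performance / incremental compilation optimization issue

> Preprocessor / SemanticAnalyze / CodeGen perform a large amount of repeated computation and emit redundant code in multi-Pass, multi-Shader scenarios. This issue collects 7 optimizations that can be pursued independently. Each one contrasts current and expected behavior in code / pseudocode and includes a before/after comparison of the generated GLSL.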

## TODO

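- [ ] [Overload-directed function emission](#1-overload-directed-function-emission): driven by call sites, emit only the overloads whose signatures match and drop unused ones
- [ ] [Dead-branch function residue elimination](#2-dead-branch-function-residue): run reachability analysis on the call graph after macro evaluation; unreachable functions are not emitted
- [ ] [Cross-Pass caching for Preprocessor `#include`](#3-cross-pass-caching-for-preprocessor-include): the existing string-level chunk cache does not prevent downstream rescans; the cache needs to move up to the token / AST level
- [ ] [Shared symbol cache for SemanticAnalyze / CodeGen](#4-shared-symbol-cache-for-semanticanalyze--codegen): hoist global-symbol codegen into a shared stage reused by vert / frag
- [ ] [Merge Preprocessor + Lexer into a single LL scan](#5-merge-preprocessor--lexer-into-a-single-ll-scan): the Preprocessor regex pass and the Lexer main scan currently read the source twice; merge them into one streaming LL pass
- [ ] [Short-circuit empty `#if/#endif` blocks](#6-short-circuit-empty-ifendif-blocks): strip empty blocks at the Preprocessor output stage so they never reach the LL stage
- [ ] [Visitor-level symbol lookup memo](#7-visitor-level-symbol-lookup-memo): SymbolTable is already hash-based, but the visitor layer repeatedly looks up the same identifiers and needs a scope-local memo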

---

## 1. Overload-directed function emission

### Input source

```glsl
vec4  blend(vec4 a,  vec4 b)  { return a * b; }
vec3  blend(vec3 a,  vec3 b)  { return a * b; }
float blend(float a, float b) { return a * b; }

uniform vec4 albedo;
uniform vec4 baseColor;

void main() {
  // Both arguments are vec4, so only the vec4 overload is called
  gl_FragColor = blend(albedo, baseColor);
}
```

### Current behavior

**Compiler internals**:

```ts
const overloads = symbolTable.lookupOverloads("blend");
for (const fn of overloads) {
  codeGen.emit(generateFunctionCode(fn)); // every overload is emitted
}
```

**What the GPU actually receives**:

```glsl
// All three overloads end up in the GPU code
vec4  blend(vec4 a,  vec4 b)  { return a * b; }
vec3  blend(vec3 a,  vec3 b)  { return a * b; }   // ← dead code
float blend(float a, float b) { return a * b; }   // ← dead code

void main() {
  gl_FragColor = blend(albedo, baseColor);
}
```

### Expected

**Compiler internals**:

```ts
const callExpr = ast.find("blend(albedo, baseColor)");
const argTypes = callExpr.args.map(inferType);          // ["vec4", "vec4"]
const matched  = symbolTable.resolveOverload("blend", argTypes);
codeGen.emit(generateFunctionCode(matched));            // emit only the matching overload
```

**What the GPU actually receives**:

```glsl
vec4 blend(vec4 a, vec4 b) { return a * b; }

void main() {
  gl_FragColor = blend(albedo, baseColor);
}
```

**Benefit**: smaller GPU code and shorter GPU compile time; the effect is most visible for multi-signature utility libraries such as BRDF / Toon / Math.
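
As a rough illustration of call-site-driven selection, the sketch below keeps only the overloads that some call site resolves to. `FnOverload`, `CallSite`, `signatureOf`, and `selectUsedOverloads` are hypothetical names rather than the ShaderLab API, and matching is assumed to be by exact parameter types with no implicit conversions.

```ts
// Hypothetical minimal shapes; the real symbol table entries carry more data.
interface FnOverload {
  name: string;
  paramTypes: string[];
  source: string;
}

interface CallSite {
  callee: string;
  argTypes: string[]; // assumed to be already inferred from the argument expressions
}

// Build a comparable signature string such as "blend(vec4,vec4)".
function signatureOf(name: string, types: string[]): string {
  return `${name}(${types.join(",")})`;
}

// Keep only overloads that at least one call site resolves to (exact type match),
// preserving declaration order.
function selectUsedOverloads(overloads: FnOverload[], calls: CallSite[]): FnOverload[] {
  const used = new Set(calls.map((c) => signatureOf(c.callee, c.argTypes)));
  return overloads.filter((fn) => used.has(signatureOf(fn.name, fn.paramTypes)));
}

const blendOverloads: FnOverload[] = [
  { name: "blend", paramTypes: ["vec4", "vec4"], source: "vec4 blend(vec4 a, vec4 b) { return a * b; }" },
  { name: "blend", paramTypes: ["vec3", "vec3"], source: "vec3 blend(vec3 a, vec3 b) { return a * b; }" },
  { name: "blend", paramTypes: ["float", "float"], source: "float blend(float a, float b) { return a * b; }" },
];
const callSites: CallSite[] = [{ callee: "blend", argTypes: ["vec4", "vec4"] }];
const emittedOverloads = selectUsedOverloads(blendOverloads, callSites);
```

Overloads that are called only from other emitted functions would need the same selection applied transitively over the call graph.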

---

## 2. Dead-branch function residue

### Input source

```glsl
void heavyFn() {
  // Assume 100 lines of heavy computation
  // ...
}

void main() {
  #ifdef DEBUG
    heavyFn();
  #endif
  gl_FragColor = vec4(1.0);
}
```

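### Current behavior (`DEBUG` macro not enabled at runtime)

**Compiler internals**:

```ts
emitGlobal(heavyFn); // function definitions are emitted unconditionally
emitMain(...);
```

**What the GPU actually receives**: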

```glsl
void heavyFn() {
  // 100 lines of dead code kept verbatim
  // ...
}

void main() {
  gl_FragColor = vec4(1.0);
}
```

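### Expected

**Compiler internals**:

```ts
// Collect calls from the AST after macro evaluation, then expand through the call graph.
// Note: the seed should contain only calls reachable from the entry function(s);
// calls inside dead functions must not mark their callees as live.
const reachableCalls  = collectCallExpressions(astAfterPreprocess);
const reachableFuncs  = transitiveClosure(reachableCalls, callGraph);
for (const fn of allFunctions) {
  if (reachableFuncs.has(fn.name)) {
    emitGlobal(fn);
  }
}
```

**What the GPU actually receives**: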

```glsl
void main() {
  gl_FragColor = vec4(1.0);
}
```

**Benefit**: macro-configuration-driven code pruning; in multi-variant setups (shadow caster / forward base / forward add, etc.) every variant gets smaller.
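
The reachability step can be sketched as a worklist traversal seeded from the entry function. `CallGraph` and `reachableFrom` are hypothetical names, and the graph is assumed to be built after macro evaluation, so calls under disabled `#ifdef` branches are already absent.

```ts
// caller -> callees, built from the AST after macro evaluation (assumed input)
type CallGraph = Map<string, string[]>;

// Worklist traversal: returns every function reachable from the given entry points.
function reachableFrom(entries: string[], graph: CallGraph): Set<string> {
  const seen = new Set<string>();
  const work = [...entries];
  while (work.length > 0) {
    const fn = work.pop() as string;
    if (seen.has(fn)) continue;
    seen.add(fn);
    for (const callee of graph.get(fn) ?? []) work.push(callee);
  }
  return seen;
}

const exampleGraph: CallGraph = new Map<string, string[]>([
  ["main", ["shade"]],
  ["shade", ["saturate01"]],
  ["heavyFn", ["saturate01"]], // DEBUG is off, so nothing calls heavyFn
]);
const reachable = reachableFrom(["main"], exampleGraph);
```

Seeding from `main` rather than from every call expression in the AST means the call inside `heavyFn` does not keep anything alive by itself.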

---

## 3. Cross-Pass caching for Preprocessor `#include`

### Input source

```glsl
SubShader "Default" {
  Pass "ShadowCaster" {
    #include "Transform.glsl"
    #include "Light.glsl"
    // ...
  }
  Pass "Forward" {
    #include "Transform.glsl"   // same chunk
    #include "Light.glsl"        // same chunk
    // ...
  }
  Pass "Outline" {
    #include "Transform.glsl"   // same chunk
    // ...
  }
}
```

### Current behavior

**Compiler internals** (only a string-level chunk expansion cache exists; every Pass is fully rescanned downstream):

```ts
// Preprocessor.parse: a single regex expands #include and inlines the chunk text into the source.
// ShaderCompiler._chunkOutputCache only caches the recursive result of nested chunk expansion;
// the cached value is still text, which ends up inlined into every Pass's source string.
class Preprocessor {
  static parse(source, basePath, includeMap, chunkOutputCache) {
    return source.replace(this._includeReg, (_, name) => {
      let cached = chunkOutputCache.get(name);
      if (cached === undefined) {
        cached = this.parse(includeMap[name], ...); // recursive expansion + text cache
        chunkOutputCache.set(name, cached);
      }
      return cached; // ← a string, inlined into the Pass source
    });
  }
}

// Every Pass runs the full downstream pipeline: lex → parse → AST → semanticAnalyze → codegen
for (const pass of subShader.passes) {
  const expanded   = Preprocessor.parse(pass.source, ...);  // text grows by several KB
  const tokens     = new Lexer(expanded, ...).tokenize();    // ← rescan
  const ast        = parser.parse(tokens, ...);              // ← rebuild AST
  const programSrc = codeGen.visitShaderProgram(ast, ...);   // ← redo codegen
}
```

**Compile time breakdown**:

```
Pass "ShadowCaster":
  #include string expansion (cache miss → ~0.5ms)
  expanded ≈ 12KB (Transform.glsl + Light.glsl inlined)
  Lexer + Parser + Codegen ~ 8ms

Pass "Forward":
  #include string expansion (cache hit → ~0.05ms)
  expanded ≈ 12KB (same size)
  Lexer + Parser + Codegen ~ 8ms                            ← full downstream redo

Pass "Outline":
  #include string expansion (cache hit → ~0.05ms)
  expanded ≈ 5KB
  Lexer + Parser + Codegen ~ 4ms                            ← full downstream redo
─────────────────────────────────────────────────
Total: ~20ms (the string-level cache saves < 1ms)
```

> The existing `_chunkOutputCache` only saves the small "recursive chunk expansion" step; **the cost of string growth and the full downstream rescan is not reduced at all**.

### Expected

Move the cache up to the token / AST level and reuse compiled artifacts keyed by `(includeKey, macroEnvHash)`:

```ts
class ShaderCompiler {
  private _tokenCache = new Map<string, BaseToken[]>();
  private _astCache   = new Map<string, ASTNode>();

  private _includeAsTokens(name: string, macroEnv: MacroEnv): BaseToken[] {
    const key = `${name}@${hashMacroEnv(macroEnv)}`;
    let toks = this._tokenCache.get(key);
    if (!toks) {
      const chunkSrc = this._includeMap[name];
      toks = new Lexer(chunkSrc, ...).tokenize();
      this._tokenCache.set(key, toks);
    }
    return toks;
  }
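  // While parsing, splice in the cached token sequence when the token stream hits an #include
  // Same for the AST cache (global-symbol subtrees do not depend on the macro environment and can be reused across Passes)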
}
```

**Compile time breakdown**:

```
Pass "ShadowCaster": lex+parse Transform.glsl + Light.glsl → token/AST cache
Pass "Forward":      token/AST cache hit                              (~0.5ms)
Pass "Outline":      token/AST cache hit                              (~0.3ms)
─────────────────────────────────────────────────
Total: ~9ms (about 2.2× faster)
```

**Benefit**: built-in chunks are included repeatedly in typical projects, so the gain only becomes substantial once downstream lex/parse/AST work is actually reused.
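
One possible shape for the `(includeKey, macroEnvHash)` key is sketched below. `MacroEnv`, `macroEnvKey`, and `IncludeCache` are hypothetical names introduced for illustration, and the whitespace-splitting `lex` is a stand-in for the real Lexer. The key sorts macro entries so that two Passes defining the same macros in a different order share a cache entry.

```ts
// Hypothetical shape: macro name -> replacement text ("" for flag-style macros).
type MacroEnv = Map<string, string>;

// Order-independent, unambiguous key for a macro environment.
function macroEnvKey(env: MacroEnv): string {
  const entries = Array.from(env.entries()).sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
  return JSON.stringify(entries);
}

class IncludeCache<T> {
  private readonly _entries = new Map<string, T>();
  hits = 0;

  // Returns the cached artifact for (name, env), computing it on the first request.
  getOrCompute(name: string, env: MacroEnv, compute: () => T): T {
    const key = `${name}@${macroEnvKey(env)}`;
    if (this._entries.has(key)) {
      this.hits++;
      return this._entries.get(key) as T;
    }
    const value = compute();
    this._entries.set(key, value);
    return value;
  }
}

// Stand-in lexer for the example: splits on whitespace.
const lex = (src: string): string[] => src.split(/\s+/).filter((s) => s.length > 0);

const tokenCache = new IncludeCache<string[]>();
const envForward = new Map<string, string>([["USE_FOG", ""], ["LIGHT_COUNT", "4"]]);
const envShadow = new Map<string, string>([["LIGHT_COUNT", "4"], ["USE_FOG", ""]]); // same macros, other order
const envNoFog = new Map<string, string>([["LIGHT_COUNT", "4"]]);

const first = tokenCache.getOrCompute("Light.glsl", envForward, () => lex("vec3 lightDir ;"));
const second = tokenCache.getOrCompute("Light.glsl", envShadow, () => lex("never evaluated"));
const third = tokenCache.getOrCompute("Light.glsl", envNoFog, () => lex("vec3 lightDir ;"));
```

Keying on the full macro environment is the conservative choice; restricting the key to the macros a chunk actually references would raise the hit rate but requires knowing that set up front.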

---

## 4. Shared symbol cache for SemanticAnalyze / CodeGen

### Input source

```glsl
mat4 renderer_ModelMat;   // global symbol 1
vec4 mainColor;           // global symbol 2

vec3 worldNormal(vec3 n) {
  return (renderer_ModelMat * vec4(n, 0.0)).xyz;
}

Varyings vert(Attributes attr) {
  Varyings v;
  v.v_normal = worldNormal(attr.NORMAL);   // vert uses worldNormal + renderer_ModelMat
  return v;
}

void frag(Varyings v) {
  gl_FragColor = vec4(v.v_normal, 1.0) * mainColor; // frag uses mainColor
}
```

### Current behavior

**Compiler internals**:

```ts
class GLESVisitor {
  visitShaderProgram(ast) {
    const vert = this._vertexMain(ast);    // internally goes through _getGlobalSymbol → codegen
    const frag = this._fragmentMain(ast);  // the same global symbols are codegen'd again
    return { vert, frag };
  }
}
```

**Visit trace**:

```
_vertexMain():
  _getGlobalSymbol("renderer_ModelMat") → codegen mat4 decl   (cost X)
  _getGlobalSymbol("worldNormal")        → codegen function    (cost Y)

_fragmentMain():
  _getGlobalSymbol("renderer_ModelMat") → codegen mat4 decl   (cost X, repeated)
  _getGlobalSymbol("worldNormal")        → codegen function    (cost Y, repeated)
  _getGlobalSymbol("mainColor")          → codegen vec4 decl   (cost Z)
─────────────────────────────────────────────────
Total cost: 2X + 2Y + Z
```

### Expected

**Compiler internals**:

```ts
class GLESVisitor {
  private globalCache = new Map<string, string>();

  visitShaderProgram(ast) {
    this._collectGlobals(ast);                       // shared stage
    const vertBody = this._vertexMainBody(ast);
    const fragBody = this._fragmentMainBody(ast);
    return {
      vert: this._composeOutput(vertBody, this._globalsUsedBy(vertBody)),
      frag: this._composeOutput(fragBody, this._globalsUsedBy(fragBody)),
    };
  }
}
```

**Visit trace**:

```
_collectGlobals():
  cache["renderer_ModelMat"] = codegen(...)  (cost X)
  cache["worldNormal"]       = codegen(...)  (cost Y)
  cache["mainColor"]         = codegen(...)  (cost Z)

_vertexMainBody():    reads from cache (cost ~0)
_fragmentMainBody():  reads from cache (cost ~0)
─────────────────────────────────────────────────
Total cost: X + Y + Z
```

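> The final GLSL output is identical, but the number of symbol codegen calls drops from up to N×k (N global symbols, k shader stages) to N.

**Benefit**: shared uniforms / utility functions / structs are codegen'd only once, the main path gets faster, and the semantic analysis stage can reuse the same cache.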
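
A minimal sketch of the shared stage, assuming global codegen is a pure function of the symbol name. `GlobalCodegenCache` and its `compose` helper are hypothetical, not the existing `GLESVisitor` API, and the per-stage "used globals" lists below mirror the visit trace above rather than a real usage analysis.

```ts
class GlobalCodegenCache {
  private readonly _cache = new Map<string, string>();
  private readonly _generate: (name: string) => string;
  codegenCalls = 0; // counts real codegen invocations, for illustration

  constructor(generate: (name: string) => string) {
    this._generate = generate;
  }

  // Returns the generated code for a global symbol, generating it at most once.
  get(name: string): string {
    let code = this._cache.get(name);
    if (code === undefined) {
      this.codegenCalls++;
      code = this._generate(name);
      this._cache.set(name, code);
    }
    return code;
  }

  // Prepends the declarations a stage uses to that stage's body.
  compose(body: string, usedGlobals: string[]): string {
    const decls = usedGlobals.map((name) => this.get(name));
    return [...decls, body].join("\n");
  }
}

const globalDecls = new Map<string, string>([
  ["renderer_ModelMat", "uniform mat4 renderer_ModelMat;"],
  ["mainColor", "uniform vec4 mainColor;"],
  ["worldNormal", "vec3 worldNormal(vec3 n) { return (renderer_ModelMat * vec4(n, 0.0)).xyz; }"],
]);
const globals = new GlobalCodegenCache((name) => globalDecls.get(name) ?? `/* unknown global ${name} */`);

const vertSrc = globals.compose("void main() { /* vert body */ }", ["renderer_ModelMat", "worldNormal"]);
const fragSrc = globals.compose("void main() { /* frag body */ }", ["renderer_ModelMat", "worldNormal", "mainColor"]);
```

Here five symbol requests across two stages trigger only three codegen calls, matching the X + Y + Z total.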

---

## 5. Merge Preprocessor + Lexer into a single LL scan

### Input source

```glsl
// Vertex transform utilities
#include "Transform.glsl"

#define LIGHT_COUNT 4

/* Main entry */
void main() {
  // Compute color
  gl_FragColor = vec4(1.0);
}
```

### Current behavior

**Two separate phases; the source is scanned twice**:

```ts
// Phase 1: Preprocessor.parse - text-level regex expansion of #include (one scan)
const _includeReg = /\/\*[\s\S]*?\*\/|^[ \t]*#include +"([\w\d./]+)"/gm;
const expanded = source.replace(_includeReg, ...);
//   ↓ Preprocessor only handles #include; #define / comments / #ifdef are all still in the expanded text

// Phase 2: Lexer.tokenize - the main scan, driven by a state machine
//   - skipCommentsAndSpace()  strips comments
//   - _scanDirectives()        handles #define / #ifdef / #endif etc.
//   - _branchStack             maintains the #ifdef branch stack
//   - produces the token stream
const tokens = new Lexer(expanded, macroDefineList).tokenize();
```

The source text is fully scanned twice: the first pass only looks for `#include`, and the second pass does the actual lexical analysis.

### Expected

**A single LL pass handles include inlining + comment stripping + #define collection + token generation**:

```ts
class Lexer {
  *tokenize(source, includeMap, macroDefineList) {
    let i = 0;
    const stack: { src: string; pos: number }[] = [];  // include stack
    let curSrc = source;

    while (true) {
      if (i >= curSrc.length) {
        if (stack.length === 0) break;
        ({ src: curSrc, pos: i } = stack.pop()!);      // leave the include
        continue;
      }

      // Comments: stripped on the fly, never enter the token stream.
      // indexOf returns -1 when the terminator is missing, so clamp to end of source.
      if (curSrc[i] === "/" && curSrc[i + 1] === "/") {
        const eol = curSrc.indexOf("\n", i);
        i = eol === -1 ? curSrc.length : eol;
        continue;
      }
      if (curSrc[i] === "/" && curSrc[i + 1] === "*") {
        const close = curSrc.indexOf("*/", i + 2);
        i = close === -1 ? curSrc.length : close + 2;
        continue;
      }

      // #include: push the current position, switch to the chunk source and keep tokenizing (depth-first inline)
      if (matchDirective(curSrc, i, "#include")) {
        const { path, end } = readIncludeDirective(curSrc, i);
        stack.push({ src: curSrc, pos: end });
        curSrc = includeMap[path];
        i = 0;
        continue;
      }

      // #define: state machine over the token stream (same as the existing Lexer logic)
      // #ifdef / #else / #endif: existing branch-stack logic
      // Ordinary tokens: yielded to the downstream parser

      yield scanToken(curSrc, i); /* advances i */
    }
  }
}
```

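The source text is scanned only once. Include inlining becomes a lexer-level operation that falls out of the token stream, avoiding the Preprocessor's regex pre-scan and the intermediate text representation.

**Benefit**: two passes become one, removing the Preprocessor regex pre-scan and the intermediate string construction; the token cache (item 3) also fits more naturally, since token subsequences can simply be cached per include path.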
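
To make the include-stack idea concrete, here is a small runnable sketch. It is not the ShaderLab Lexer: `scanLexemes` and `IncludeMap` are hypothetical, "tokens" are plain whitespace-delimited lexemes, `#define` / `#ifdef` handling is left out, and `#include` is recognized anywhere rather than only at line start. It also adds a cycle guard, which the pseudocode above does not show.

```ts
type IncludeMap = Map<string, string>;

// Single pass: strips comments, inlines #include depth-first, yields lexemes.
function* scanLexemes(source: string, includes: IncludeMap): Generator<string> {
  const stack: { src: string; pos: number }[] = []; // suspended outer sources
  const active: string[] = [];                      // includes being expanded (cycle guard)
  let src = source;
  let i = 0;

  while (true) {
    if (i >= src.length) {
      const outer = stack.pop();
      if (outer === undefined) return;              // outermost source exhausted
      active.pop();
      src = outer.src;
      i = outer.pos;
      continue;
    }
    if (/\s/.test(src.charAt(i))) { i++; continue; }
    if (src.startsWith("//", i)) {
      const eol = src.indexOf("\n", i);
      i = eol === -1 ? src.length : eol;
      continue;
    }
    if (src.startsWith("/*", i)) {
      const close = src.indexOf("*/", i + 2);
      if (close === -1) throw new Error("unterminated block comment");
      i = close + 2;
      continue;
    }
    if (src.startsWith("#include", i)) {
      const includeRe = /#include[ \t]+"([^"\n]+)"/y; // sticky: must match exactly at lastIndex
      includeRe.lastIndex = i;
      const m = includeRe.exec(src);
      if (m === null) throw new Error("malformed #include");
      const name = m[1] as string;
      const chunk = includes.get(name);
      if (chunk === undefined) throw new Error(`unknown include: ${name}`);
      if (active.includes(name)) throw new Error(`circular include: ${name}`);
      stack.push({ src, pos: includeRe.lastIndex });
      active.push(name);
      src = chunk;
      i = 0;
      continue;
    }
    // Ordinary lexeme: runs until whitespace or the start of a comment.
    let end = i;
    while (end < src.length && !/\s/.test(src.charAt(end)) && !src.startsWith("//", end) && !src.startsWith("/*", end)) end++;
    yield src.slice(i, end);
    i = end;
  }
}

const chunkMap: IncludeMap = new Map([["Transform.glsl", "/* helpers */ mat4 u_mvp ; // trailing"]]);
const lexemes = Array.from(scanLexemes('// header\n#include "Transform.glsl"\nvoid main ( ) { }', chunkMap));
```

The chunk's lexemes appear in the stream exactly where the `#include` sat, without any intermediate expanded string being built.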

---

## 6. Short-circuit empty `#if/#endif` blocks

### Input source

```glsl
#ifdef SCENE_USE_FOG
  vec4 fogParams;
#endif

#ifdef RENDERER_HAS_SKIN
  mat4 boneMatrices[64];
#endif

#ifdef USE_NORMAL_MAP
  sampler2D normalTexture;
#endif

void main() { gl_FragColor = vec4(1.0); }
```

### Current behavior (none of the three macros enabled)

**Preprocessor output**:

```glsl
#ifdef SCENE_USE_FOG
#endif

#ifdef RENDERER_HAS_SKIN
#endif

#ifdef USE_NORMAL_MAP
#endif

void main() { gl_FragColor = vec4(1.0); }
```

The LL parser still has to consume 9 lines of empty blocks (6 `#ifdef/#endif` lines + 3 blank lines).

### Expected

**Preprocessor output**:

```glsl
void main() { gl_FragColor = vec4(1.0); }
```

The LL parser receives the trimmed code directly.

**Compiler internals**:

```ts
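function stripEmptyConditionals(source: string): string {
  let prev: string;
  let curr = source;
  do {
    prev = curr;
    // Matches #if / #ifdef / #ifndef with a single-identifier condition whose body
    // (and optional #else body) is whitespace only. The trailing (?:\n|$) also
    // covers a block that ends the file without a final newline.
    curr = curr.replace(
      /#if(?:def|ndef)?\s+\w+\s*\n\s*(?:#else\s*\n\s*)?#endif\s*(?:\n|$)/g,
      ""
    );
  } while (curr !== prev);  // rescan until stable to handle nesting
  return curr;
}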

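**Benefit**: after macro pruning, large numbers of empty blocks are the norm, so the input to the LL stage becomes noticeably shorter.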
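
As an alternative to the repeated regex scan, a line-based pass with an explicit stack handles nesting in one sweep and also covers `#elif` and compound `#if` conditions. This is only a sketch under simplifying assumptions: `stripEmptyConditionalBlocks` is a hypothetical name, directives are assumed to occupy whole lines, comments are assumed to be stripped already, and blank lines outside removed blocks are left in place.

```ts
function stripEmptyConditionalBlocks(source: string): string {
  const out: string[] = [];
  // One entry per open conditional: where its opening line sits in `out`,
  // and whether anything other than branch directives / blank lines was kept inside it.
  const open: { start: number; hasContent: boolean }[] = [];

  const markParentNonEmpty = (): void => {
    const top = open[open.length - 1];
    if (top !== undefined) top.hasContent = true;
  };

  for (const line of source.split("\n")) {
    const t = line.trim();
    if (/^#if(def|ndef)?\b/.test(t)) {
      open.push({ start: out.length, hasContent: false });
      out.push(line);
    } else if (/^#endif\b/.test(t) && open.length > 0) {
      const block = open.pop() as { start: number; hasContent: boolean };
      if (block.hasContent) {
        out.push(line);
        markParentNonEmpty();      // a kept inner block makes the outer block non-empty
      } else {
        out.length = block.start;  // drop the whole block, including #else / #elif lines
      }
    } else {
      out.push(line);
      const isBranch = /^#(else|elif)\b/.test(t);
      if (!isBranch && t !== "") markParentNonEmpty();
    }
  }
  return out.join("\n");
}

const prunedSource = [
  "#ifdef SCENE_USE_FOG",
  "#endif",
  "",
  "#if defined(A) && defined(B)",
  "  #ifdef C",
  "  #endif",
  "#elif defined(D)",
  "#endif",
  "",
  "#ifdef KEEP",
  "  vec4 kept;",
  "#endif",
  "void main() { gl_FragColor = vec4(1.0); }",
].join("\n");
const strippedSource = stripEmptyConditionalBlocks(prunedSource);
```

Other directives inside a block, such as `#define`, count as content here, so a conditional that only defines a macro is kept.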

---

## 7. Visitor-level symbol lookup memo

### Input source (PCF 9-tap shadow filter with the loop unrolled, a typical amplification case)

```glsl
sampler2D shadowMap;
float sampleShadowMapFiltered9(TEXTURE2D_SHADOW_PARAM(shadowMap), vec3 shadowCoord, vec4 shadowmapSize) {
  float attenuation;
  float fetchesWeights[9];
  vec2 fetchesUV[9];
  sampleShadowComputeSamplesTent5x5(shadowmapSize, shadowCoord.xy, fetchesWeights, fetchesUV);
  attenuation  = fetchesWeights[0] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[0].xy, shadowCoord.z));
  attenuation += fetchesWeights[1] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[1].xy, shadowCoord.z));
  attenuation += fetchesWeights[2] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[2].xy, shadowCoord.z));
  attenuation += fetchesWeights[3] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[3].xy, shadowCoord.z));
  attenuation += fetchesWeights[4] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[4].xy, shadowCoord.z));
  attenuation += fetchesWeights[5] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[5].xy, shadowCoord.z));
  attenuation += fetchesWeights[6] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[6].xy, shadowCoord.z));
  attenuation += fetchesWeights[7] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[7].xy, shadowCoord.z));
  attenuation += fetchesWeights[8] * SAMPLE_TEXTURE2D_SHADOW(shadowMap, vec3(fetchesUV[8].xy, shadowCoord.z));
  return attenuation;
}
```

### Current behavior

**SymbolTable is already hash-based**, so a single `getSymbol` is an O(1) hash lookup followed by a short scan of the overload bucket:

```ts
// common/SymbolTable.ts
class SymbolTable {
  private _table: Map<string, T[]> = new Map();
  getSymbol(symbol, includeMacro): T | undefined {
    const entry = this._table.get(symbol.ident);     // O(1) hash
    if (entry) {
      for (let i = entry.length - 1; i >= 0; i--) {  // match the overload signature within the bucket
        if (entry[i].equal(symbol)) return entry[i];
      }
    }
  }
}
```

**The problem is in the visitor layer**: during codegen / semanticAnalyze, every visit to an identifier AST node goes through `symbolTableStack.lookup()` again, walking the scope stack level by level:

```ts
// common/SymbolTableStack.ts
lookup(symbol, includeMacro): S | undefined {
  for (let i = this.stack.length - 1; i >= 0; i--) {   // walk the scope stack
    const result = this.stack[i].getSymbol(symbol, includeMacro);
    if (result) return result;
  }
}
```

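**Symbol resolution trace** (PCF 9-tap; the function body looks up `shadowMap` and other symbols by the same name over and over):

```
Stmt 1: lookup shadowMap                  → walks up to k stack levels × (hash + bucket) per level
        lookup fetchesWeights             → walks k levels again
        lookup fetchesUV                  → walks k levels again
        lookup SAMPLE_TEXTURE2D_SHADOW   → walks k levels again
        lookup shadowCoord                → walks k levels again
Stmt 2: lookup shadowMap                  → walks k levels again (same result as Stmt 1)
        ... the same 5 symbols walk k levels again
... × 9 lines
─────────────────────────────────────────────────
The same scope stack is traversed 9 × 5 = 45 times
```

> A single lookup is cheap, but dense repeated calls on the hot path (codegen visiting large function bodies, loop unrolling, overload signature matching) amplify the cost. The final GLSL output is exactly the same; only compile time differs.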

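### Expected

**When the visitor enters a function body / statement block it creates a scope-local memo**, so that under the current visibility snapshot each name is looked up only once: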

```ts
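class CodeGenVisitor {
  private _scopeLookupCache = new Map<string, SymbolInfo | undefined>();

  visitFunctionDefinition(node) {
    const prev = this._scopeLookupCache;
    this._scopeLookupCache = new Map();    // entering a new scope: create a local memo

    try {
      // every lookup inside the function body goes through the cached version
      ...
    } finally {
      this._scopeLookupCache = prev;        // leaving the scope: restore
    }
  }

  // Entry point that replaces the bare lookup.
  // Note: the memo is only valid while visibility is unchanged; a new declaration
  // that shadows a cached name must drop or bypass that entry.
  lookupCached(ident: string, type: ESymbolType): SymbolInfo | undefined {
    const key = `${ident}@${type}`;
    // Use has() so that misses (undefined) are memoized too
    if (this._scopeLookupCache.has(key)) return this._scopeLookupCache.get(key);
    const sym = this._symbolTableStack.lookup(...);
    this._scopeLookupCache.set(key, sym);
    return sym;
  }
}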
```

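**Symbol resolution trace**:

```
Stmt 1: 5 symbols → 5 stack traversals, results written to the function-body-level memo
Stmt 2: 5 symbols → 5 Map.get calls (memo hits)
... × 9 lines
─────────────────────────────────────────────────
Only 5 stack traversals; the remaining 40 are O(1) memo hits
```

**Benefit**: lookup-heavy hot paths (semantic analysis, type checking, call resolution, unrolled-loop function bodies) speed up across the board; the effect is especially noticeable for large function bodies (utility functions of tens to hundreds of lines, such as PBR / shadow filters).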
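
The memo and its invalidation rule can be exercised in isolation. The sketch below uses a toy `ScopeStack` in place of `SymbolTableStack` (no overloads, no macro flag); `ScopeStack`, `MemoizedLookup`, and `invalidate` are hypothetical names. It counts stack traversals for the 9 × 5 lookup pattern above and shows why a shadowing declaration has to evict the cached entry.

```ts
interface SymbolInfo {
  ident: string;
  depth: number; // scope depth at which the symbol was declared
}

// Toy stand-in for SymbolTableStack.
class ScopeStack {
  private readonly _scopes: Map<string, SymbolInfo>[] = [new Map()];
  traversals = 0;

  push(): void { this._scopes.push(new Map()); }
  pop(): void { this._scopes.pop(); }

  declare(ident: string): void {
    const depth = this._scopes.length - 1;
    this._scopes[depth]!.set(ident, { ident, depth });
  }

  lookup(ident: string): SymbolInfo | undefined {
    this.traversals++;
    for (let i = this._scopes.length - 1; i >= 0; i--) {
      const hit = this._scopes[i]!.get(ident);
      if (hit !== undefined) return hit;
    }
    return undefined;
  }
}

class MemoizedLookup {
  private readonly _memo = new Map<string, SymbolInfo | undefined>();
  private readonly _stack: ScopeStack;

  constructor(stack: ScopeStack) { this._stack = stack; }

  lookup(ident: string): SymbolInfo | undefined {
    if (this._memo.has(ident)) return this._memo.get(ident); // has() also memoizes misses
    const sym = this._stack.lookup(ident);
    this._memo.set(ident, sym);
    return sym;
  }

  // Must be called when a declaration may change what `ident` resolves to.
  invalidate(ident: string): void { this._memo.delete(ident); }
}

const names = ["shadowMap", "fetchesWeights", "fetchesUV", "SAMPLE_TEXTURE2D_SHADOW", "shadowCoord"];
const scopes = new ScopeStack();
for (const name of names) scopes.declare(name);
scopes.push(); // enter the function body

const memo = new MemoizedLookup(scopes);
for (let stmt = 0; stmt < 9; stmt++) {
  for (const name of names) memo.lookup(name);
}
const traversalsAfterBody = scopes.traversals;

scopes.declare("shadowCoord");   // a local declaration shadows the outer symbol
memo.invalidate("shadowCoord");
const shadowed = memo.lookup("shadowCoord");
```

Without the `invalidate` call the memo would keep returning the outer `shadowCoord`, which is the main correctness risk of this optimization.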

