
Automate corpora testing in CI #4927

Open
mustansir14 wants to merge 51 commits into main from hackathon/detector-tests-in-ci

Conversation


@mustansir14 mustansir14 commented Apr 29, 2026

Motivation

When adding or modifying a detector, the key question is: how much noise will this regex produce against real-world code? Too many false positives means alert fatigue; a regex that's too tight misses real secrets.

Previously this was a fully manual process — download a large corpus locally, run the pipeline, inspect the DuckDB output. There was no enforcement in CI, so it was easy to skip or forget, especially under time pressure.

This PR automates it. Every PR that touches detector regex or keyword changes now gets a comment showing exactly how many unique matches the changed detector produces, compared to the main baseline. The workflow runs once on PR open; reviewers or authors can re-run it manually via workflow_dispatch if needed.

What it tells you

The bench scans a corpus of real-world public code (~2.8 GB compressed, ~29 GB uncompressed) using only the detectors changed in the PR, with verification disabled. It reports unique match counts for the PR build vs. the main baseline:

  • New matches — the regex picks up strings main didn't; could be intentional (broader coverage) or noise (too loose)
  • Removed matches — the regex narrowed; could be intentional (tighter pattern) or a regression (missing real secrets)

PRs that add a new detector will see a 🆕 row with an absolute match count and no baseline comparison.

Example output

[Screenshot: the sticky PR comment showing per-detector match counts for the PR build vs. the main baseline]

What changed

  • .github/workflows/detector-corpora-test.yml — new workflow; triggers on PRs touching pkg/detectors/**; detects which detectors changed, builds PR and main binaries in parallel, runs both scans against the corpus concurrently, posts a sticky comment with the diff
  • scripts/test/detect_changed_detectors.sh — resolves changed detector directories to their proto enum names for --include-detectors scoping; skips detectors whose diff doesn't touch regex patterns or Keywords() so PRs that only change verification or redaction logic don't trigger a bench run
  • scripts/test/detector_corpora_test.sh — streams corpus files (S3 or local), runs trufflehog with --no-verification, outputs JSONL
  • scripts/test/diff_corpora_results.py — diffs two JSONL result sets and renders the Markdown report posted to the PR
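The diff step is essentially a set comparison of unique matches per detector. A minimal sketch of that logic (field names like `DetectorName` and `Raw` are assumptions based on trufflehog's `--json` output, not necessarily the script's exact schema):

```python
import json
from collections import defaultdict

def load_matches(jsonl_text):
    """Group unique raw matches by detector from trufflehog JSONL output."""
    matches = defaultdict(set)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        matches[rec["DetectorName"]].add(rec["Raw"])
    return matches

def diff_results(pr_text, main_text):
    """Return {detector: (new_count, removed_count)} for PR vs. main."""
    pr, main = load_matches(pr_text), load_matches(main_text)
    report = {}
    for det in sorted(set(pr) | set(main)):
        new = pr[det] - main[det]        # matches only the PR build found
        removed = main[det] - pr[det]    # matches only the baseline found
        report[det] = (len(new), len(removed))
    return report
```

A detector with `(0, 0)` but a different total count would still differ in duplicate volume, which is why the report compares unique matches rather than raw line counts.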

Performance

The corpus is ~2.8 GB compressed (~29 GB uncompressed). A typical run takes ~21 minutes.

Three things keep it manageable:

Scoped scanning — Only the detectors changed in the PR are passed via --include-detectors. Scanning the full detector set against 29 GB would be prohibitive; scoping to 1–3 detectors cuts runtime proportionally.

Single S3 stream — Both PR and main binaries consume the same S3 download via a named FIFO. S3 is downloaded and decompressed once; both scans run in parallel against the same stream.
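The single-stream fan-out can be sketched in Python (the actual CI implementation uses shell `tee` with process substitution; the command arguments here are placeholders):

```python
import subprocess

def tee_scan(stream, pr_cmd, main_cmd, chunk_size=1 << 20):
    """Feed one decompressed corpus stream to two scanner processes,
    so the corpus is downloaded and decompressed only once."""
    pr = subprocess.Popen(pr_cmd, stdin=subprocess.PIPE)
    main = subprocess.Popen(main_cmd, stdin=subprocess.PIPE)
    while chunk := stream.read(chunk_size):
        # Blocking writes self-throttle: the stream advances only as
        # fast as the slower consumer drains its pipe.
        pr.stdin.write(chunk)
        main.stdin.write(chunk)
    pr.stdin.close()
    main.stdin.close()
    return pr.wait(), main.wait()
```

Both scans therefore finish roughly when the slower of the two does, instead of paying for two downloads.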

Main scan caching — The main binary always scans the same merge-base commit for a given PR. Results are cached in GitHub Actions keyed by merge-base SHA + detector set. On a manual re-run without a rebase, the entire main side (worktree checkout, go build, S3 download, scan) is skipped.
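The cache key can be sketched as a stable digest over the merge-base SHA plus the sorted detector set (a hypothetical key format; the workflow's actual `actions/cache` key string may differ):

```python
import hashlib

def baseline_cache_key(merge_base_sha, detectors):
    """Stable key: same merge-base + same detector set => cache hit,
    regardless of the order detectors were listed in."""
    det_part = ",".join(sorted(detectors))
    digest = hashlib.sha256(det_part.encode()).hexdigest()[:16]
    return f"corpora-main-{merge_base_sha}-{digest}"
```

Sorting before hashing is what makes a manual re-run hit the cache even if the changed-detector list comes back in a different order.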

Fork PRs

The workflow does not run for fork PRs — S3 credentials are not available to fork-originated workflows. A maintainer needs to trigger it manually for forked PRs.

Running locally

./scripts/test/detector_corpora_test.sh /path/to/contents.jsonl.zstd

Results land in /tmp/corpora_results.jsonl with a DuckDB summary table printed to stdout.
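If you don't have DuckDB handy, an equivalent per-detector summary of the JSONL results can be produced with a few lines of Python (the `DetectorName` field name is assumed from trufflehog's `--json` output):

```python
import json
from collections import Counter

def summarize(jsonl_path):
    """Count findings per detector in a trufflehog JSONL results file."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["DetectorName"]] += 1
    for name, n in counts.most_common():
        print(f"{name}\t{n}")
    return counts
```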


Note

Medium Risk
Introduces a new CI workflow that uses AWS credentials, caching, and PR comment automation; misconfiguration could waste CI time, incur S3 costs, or break PR checks. Also changes the jdbc detector regex (currently intentionally loosened), which could increase false positives if merged as-is.

Overview
Adds a new GitHub Actions workflow (detector-corpora-test.yml) that automatically runs a scoped corpora scan when detector matching logic changes, compares PR vs merge-base results, and posts/updates a sticky PR comment with a Markdown diff report (or a skip message when no relevant changes are detected).

Introduces supporting tooling: detect_changed_detectors.sh to map changed detector dirs to --include-detectors IDs and ignore non-regex/Keywords() edits; detector_corpora_test.sh to stream S3/local corpora and optionally tee a single stream into both PR and main binaries; and diff_corpora_results.py to diff JSONL outputs and render the report.

Minor hygiene: extends .gitignore for Python artifacts; updates pkg/detectors/jdbc regex (noted as a temporary loosening to exercise the new workflow).

Reviewed by Cursor Bugbot for commit ee95bc2.

@mustansir14 mustansir14 requested a review from a team April 29, 2026 08:22
@trufflesecurity trufflesecurity deleted a comment from github-actions Bot Apr 29, 2026
@shahzadhaider1 shahzadhaider1 requested a review from a team as a code owner April 29, 2026 13:16
Comment thread pkg/detectors/stripe/stripe.go Outdated
The bench uses --no-verification, so the engine's overlap-path dedup
(which exists to protect verifiers from duplicate calls) adds noise
without value here — it causes shifts in unrelated detectors when only
one detector's regex changes. Pair --allow-verification-overlap with
--no-verification so each detector's regex behavior is measured
independently.

Also fix the false 'no diff vs main' claim that triggered when
NEW/REMOVED were zero but total counts differed.
@mustansir14 mustansir14 marked this pull request as draft April 30, 2026 08:15
shahzadhaider1 and others added 2 commits May 2, 2026 00:27
awk's END block doesn't run when trufflehog exits before draining stdin
(SIGPIPE kills awk first), leaving the bytes file empty and breaking the
step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a
default of 0 and validate it's an integer before arithmetic. Also fold
unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay
out of CI logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
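The defensive read described in that commit amounts to "default to 0, validate before arithmetic". The same guard sketched in Python (the shell version uses parameter expansion plus an integer check; `path` is a placeholder for the bytes file):

```python
def read_total_bytes(path):
    """Read a byte count written by an upstream pipeline stage.
    If the stage died early (e.g. killed by SIGPIPE) the file may be
    empty or missing; fall back to 0 instead of failing the step."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except OSError:
        return 0
    # Byte counts are non-negative, so a plain digit check suffices.
    return int(text) if text.isdigit() else 0
```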
…pection

Static AST parse of a detector package to extract the strings returned by
its Keywords() method. Used by the upcoming keyword-corpus builder to fan
out per-detector GitHub Code Search queries during the corpora bench.

AST-first because each detector lives in its own package; importing them
dynamically would require codegen or `plugin`. Falls back to a regex over
the function body, then a directory-wide grep, when AST resolution can't
statically resolve the return value (helper calls, build-tagged variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mustansir14 mustansir14 marked this pull request as ready for review May 4, 2026 15:08
@shahzadhaider1 shahzadhaider1 requested a review from bradlarsen May 4, 2026 16:00
Contributor

@bradlarsen bradlarsen left a comment

This looks useful!

Comment thread scripts/test/detector_corpora_test.sh Outdated
Comment on lines +126 to +128
SUM(CASE WHEN Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) verified,
SUM(CASE WHEN NOT Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) unverified,
SUM(CASE WHEN VerificationError IS NOT NULL THEN 1 ELSE 0 END) \"unknown\"
Contributor
Running with --no-verification above makes these values not meaningful, right?

Contributor Author
Agreed! Thanks for catching this. Will remove

Comment thread scripts/test/detector_corpora_test.sh
Comment on lines +64 to +106
local rc=0
if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
# Single S3 download teed to both binaries simultaneously.
unzstd -c "$input" 2>> "$STDERR_FILE" \
| jq -r .content 2>> "$STDERR_FILE" \
| tee >(
"${TRUFFLEHOG_BIN_MAIN}" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
"${main_include_flag[@]}" \
stdin >> "${OUTPUT_JSONL_MAIN}" 2>> "$STDERR_FILE"
) \
| "$TRUFFLEHOG_BIN" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
--print-avg-detector-time \
"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
wait
else
unzstd -c "$input" 2>> "$STDERR_FILE" \
| jq -r .content 2>> "$STDERR_FILE" \
| "$TRUFFLEHOG_BIN" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
--print-avg-detector-time \
"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
fi
Contributor
When I last used trufflehog stdin like this, I found that it would timeout after 1 minute, leaving you with an incomplete scan when dealing with large inputs. The workaround was, confusingly, to specify --archive-timeout=6h (or some similarly large value).

You might want to check that the scan is not being terminated early! This is another reason why you might prefer the json-enumerator input source, which doesn't have this timeout gotcha.

Contributor Author

@mustansir14 mustansir14 May 5, 2026
I don't think this is an issue here. The tests consistently run for 30+ minutes and complete successfully without truncation. I think it's because we decompress the corpus ourselves with unzstd before piping to stdin, so TruffleHog never sees a compressed archive and the archive timeout never applies.

Contributor
The tests consistently go 30+ minutes and successfully complete without truncation.

How long do the actual scans run? If you look closely, do you see that the actual scans (not the builds etc) take just ~1 minute each?

I was just able to reproduce the behavior I was seeing earlier on my local machine, using the latest published trufflehog (3.95.2). When I try to scan a large stream of textual data on stdin, the scan stops after 60s (the default value of --archive-timeout). Adjust the value of --archive-timeout to something shorter and you will see:

time trufflehog --no-update --log-level=4 --no-verification -j --archive-timeout 5s stdin < SEVERAL_GIGABYTES.txt
...
{"level":"error","ts":"2026-05-07T10:09:40-04:00","logger":"trufflehog","msg":"error writing to data channel","source_manager_worker_id":"dwpxz","unit_kind":"unit","unit":"<stdin>","mime":"text/plain; charset=utf-8","timeout":5,"error":"context deadline exceeded"}
...
{"level":"info-0","ts":"2026-05-07T10:09:40-04:00","logger":"trufflehog","msg":"finished scanning","chunks":106929,"bytes":1423438848,"verified_secrets":0,"unverified_secrets":77,"scan_duration":"5.0252335s","trufflehog_version":"3.95.2","verification_caching":{"Hits":0,"Misses":0,"HitsWasted":0,"AttemptsSaved":0,"VerificationTimeSpentMS":0}}
trufflehog --no-update --log-level=4 --no-verification -j --archive-timeout 5  49.61s user 1.97s system 829% cpu 6.219 total

Contributor Author
You are right. This was indeed an issue. See my comment on the PR. Thanks for catching!

@mustansir14 mustansir14 requested a review from bradlarsen May 6, 2026 07:32
Contributor

@bradlarsen bradlarsen left a comment
This looks like it would add some useful test feedback for detector regex-related changes.

My only remaining question is whether the 60s timeouts I've observed with the trufflehog stdin source are affecting the testing done here.

@mustansir14
Contributor Author

mustansir14 commented May 12, 2026

My only remaining question is whether the 60s timeouts I've observed with the trufflehog stdin source are affecting the testing done here.

It turns out you were right. The scans were indeed timing out after 1 minute. The CI wasn't showing this because scanner logs were being piped into a temp file. Once I made it log on CI, I was able to see the error. I added the --archive-timeout=2h flag and now I see in the logs that scans are successfully completing.

This also impacted the running time heavily: with the timeout fixed, the full run took more than an hour. That won't work in CI, so I removed the larger dataset (the 32 GB one); it now runs only on the smaller dataset (2.8 GB compressed, 29 GB uncompressed). That run takes ~21 minutes, which I think is fine.

Even the smaller dataset returned 865 results for a loosened JDBC regex, so I think it would still really benefit us to have it on CI.

@bradlarsen Thanks much for this! Great catch!


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit b37bdab.
