
Automate corpora testing in CI #4927

Open
mustansir14 wants to merge 51 commits into main from hackathon/detector-tests-in-ci

Conversation


@mustansir14 mustansir14 commented Apr 29, 2026

Motivation

When adding or modifying a detector, the key question is: how much noise will this regex produce against real-world code? Too many false positives means alert fatigue; a regex that's too tight misses real secrets.

Previously this was a fully manual process — download a large corpus locally, run the pipeline, inspect the DuckDB output. There was no enforcement in CI, so it was easy to skip or forget, especially under time pressure.

This PR automates it. Every PR that touches detector regex or keyword changes now gets a comment showing exactly how many unique matches the changed detector produces, compared to the main baseline. The workflow runs once on PR open; reviewers or authors can re-run it manually via workflow_dispatch if needed.

What it tells you

The bench scans a corpus of real-world public code (~2.8 GB compressed, ~29 GB uncompressed) using only the detectors changed in the PR, with verification disabled. It reports unique match counts for the PR build vs. the main baseline:

  • New matches — the regex picks up strings main didn't; could be intentional (broader coverage) or noise (too loose)
  • Removed matches — the regex narrowed; could be intentional (tighter pattern) or a regression (missing real secrets)

PRs that add a new detector will see a 🆕 row with an absolute match count and no baseline comparison.

Example output

[Screenshot: the sticky PR comment showing per-detector match counts for the PR build vs. the main baseline]

What changed

  • .github/workflows/detector-corpora-test.yml — new workflow; triggers on PRs touching pkg/detectors/**; detects which detectors changed, builds PR and main binaries in parallel, runs both scans against the corpus concurrently, posts a sticky comment with the diff
  • scripts/test/detect_changed_detectors.sh — resolves changed detector directories to their proto enum names for --include-detectors scoping; skips detectors whose diff doesn't touch regex patterns or Keywords() so PRs that only change verification or redaction logic don't trigger a bench run
  • scripts/test/detector_corpora_test.sh — streams corpus files (S3 or local), runs trufflehog with --no-verification, outputs JSONL
  • scripts/test/diff_corpora_results.py — diffs two JSONL result sets and renders the Markdown report posted to the PR
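The diff step is essentially a set comparison of unique matches per detector. A minimal sketch of that logic (field names like `DetectorName` and `Raw` are assumptions based on trufflehog's `--json` output, not necessarily the script's exact schema):

```python
import json
from collections import defaultdict

def load_matches(jsonl_text):
    """Group unique raw matches by detector from trufflehog JSONL output."""
    matches = defaultdict(set)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        matches[rec["DetectorName"]].add(rec["Raw"])
    return matches

def diff_results(pr_text, main_text):
    """Return {detector: (new_count, removed_count)} for PR vs. main."""
    pr, main = load_matches(pr_text), load_matches(main_text)
    report = {}
    for det in sorted(set(pr) | set(main)):
        new = pr[det] - main[det]        # matches only the PR build found
        removed = main[det] - pr[det]    # matches only the baseline found
        report[det] = (len(new), len(removed))
    return report
```

A detector with `(0, 0)` but a different total count would still differ in duplicate volume, which is why the report compares unique matches rather than raw line counts.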

Performance

The corpus is ~2.8 GB compressed (~29 GB uncompressed). A typical run takes ~21 minutes.

Three things keep it manageable:

Scoped scanning — Only the detectors changed in the PR are passed via --include-detectors. Scanning the full detector set against 29 GB would be prohibitive; scoping to 1–3 detectors cuts runtime proportionally.

Single S3 stream — Both PR and main binaries consume the same S3 download via a named FIFO. S3 is downloaded and decompressed once; both scans run in parallel against the same stream.
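The single-stream fan-out can be sketched in Python (the actual CI implementation uses shell `tee` with process substitution; the command arguments here are placeholders):

```python
import subprocess

def tee_scan(stream, pr_cmd, main_cmd, chunk_size=1 << 20):
    """Feed one decompressed corpus stream to two scanner processes,
    so the corpus is downloaded and decompressed only once."""
    pr = subprocess.Popen(pr_cmd, stdin=subprocess.PIPE)
    main = subprocess.Popen(main_cmd, stdin=subprocess.PIPE)
    while chunk := stream.read(chunk_size):
        # Blocking writes self-throttle: the stream advances only as
        # fast as the slower consumer drains its pipe.
        pr.stdin.write(chunk)
        main.stdin.write(chunk)
    pr.stdin.close()
    main.stdin.close()
    return pr.wait(), main.wait()
```

Both scans therefore finish roughly when the slower of the two does, instead of paying for two downloads.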

Main scan caching — The main binary always scans the same merge-base commit for a given PR. Results are cached in GitHub Actions keyed by merge-base SHA + detector set. On a manual re-run without a rebase, the entire main side (worktree checkout, go build, S3 download, scan) is skipped.
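The cache key can be sketched as a stable digest over the merge-base SHA plus the sorted detector set (a hypothetical key format; the workflow's actual `actions/cache` key string may differ):

```python
import hashlib

def baseline_cache_key(merge_base_sha, detectors):
    """Stable key: same merge-base + same detector set => cache hit,
    regardless of the order detectors were listed in."""
    det_part = ",".join(sorted(detectors))
    digest = hashlib.sha256(det_part.encode()).hexdigest()[:16]
    return f"corpora-main-{merge_base_sha}-{digest}"
```

Sorting before hashing is what makes a manual re-run hit the cache even if the changed-detector list comes back in a different order.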

Fork PRs

The workflow does not run for fork PRs — S3 credentials are not available to fork-originated workflows. A maintainer needs to trigger it manually for forked PRs.

Running locally

./scripts/test/detector_corpora_test.sh /path/to/contents.jsonl.zstd

Results land in /tmp/corpora_results.jsonl with a DuckDB summary table printed to stdout.
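If you don't have DuckDB handy, an equivalent per-detector summary of the JSONL results can be produced with a few lines of Python (the `DetectorName` field name is assumed from trufflehog's `--json` output):

```python
import json
from collections import Counter

def summarize(jsonl_path):
    """Count findings per detector in a trufflehog JSONL results file."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["DetectorName"]] += 1
    for name, n in counts.most_common():
        print(f"{name}\t{n}")
    return counts
```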


Note

Medium Risk
Introduces a new CI workflow that uses AWS credentials, caching, and PR comment automation; misconfiguration could waste CI time, incur S3 costs, or break PR checks. Also changes the jdbc detector regex (currently intentionally loosened), which could increase false positives if merged as-is.

Overview
Adds a new GitHub Actions workflow (detector-corpora-test.yml) that automatically runs a scoped corpora scan when detector matching logic changes, compares PR vs merge-base results, and posts/updates a sticky PR comment with a Markdown diff report (or a skip message when no relevant changes are detected).

Introduces supporting tooling: detect_changed_detectors.sh to map changed detector dirs to --include-detectors IDs and ignore non-regex/Keywords() edits; detector_corpora_test.sh to stream S3/local corpora and optionally tee a single stream into both PR and main binaries; and diff_corpora_results.py to diff JSONL outputs and render the report.

Minor hygiene: extends .gitignore for Python artifacts; updates pkg/detectors/jdbc regex (noted as a temporary loosening to exercise the new workflow).

Reviewed by Cursor Bugbot for commit ee95bc2.

@mustansir14 mustansir14 requested a review from a team April 29, 2026 08:22
@trufflesecurity trufflesecurity deleted a comment from github-actions Bot Apr 29, 2026
@shahzadhaider1 shahzadhaider1 requested a review from a team as a code owner April 29, 2026 13:16
Comment thread pkg/detectors/stripe/stripe.go Outdated
The bench uses --no-verification, so the engine's overlap-path dedup
(which exists to protect verifiers from duplicate calls) adds noise
without value here — it causes shifts in unrelated detectors when only
one detector's regex changes. Pair --allow-verification-overlap with
--no-verification so each detector's regex behavior is measured
independently.

Also fix the false 'no diff vs main' claim that triggered when
NEW/REMOVED were zero but total counts differed.
@mustansir14 mustansir14 marked this pull request as draft April 30, 2026 08:15
shahzadhaider1 and others added 2 commits May 2, 2026 00:27
awk's END block doesn't run when trufflehog exits before draining stdin
(SIGPIPE kills awk first), leaving the bytes file empty and breaking the
step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a
default of 0 and validate it's an integer before arithmetic. Also fold
unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay
out of CI logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
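The defensive read described in that commit amounts to "default to 0, validate before arithmetic". The same guard sketched in Python (the shell version uses parameter expansion plus an integer check; `path` is a placeholder for the bytes file):

```python
def read_total_bytes(path):
    """Read a byte count written by an upstream pipeline stage.
    If the stage died early (e.g. killed by SIGPIPE) the file may be
    empty or missing; fall back to 0 instead of failing the step."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except OSError:
        return 0
    # Byte counts are non-negative, so a plain digit check suffices.
    return int(text) if text.isdigit() else 0
```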
…pection

Static AST parse of a detector package to extract the strings returned by
its Keywords() method. Used by the upcoming keyword-corpus builder to fan
out per-detector GitHub Code Search queries during the corpora bench.

AST-first because each detector lives in its own package; importing them
dynamically would require codegen or `plugin`. Falls back to a regex over
the function body, then a directory-wide grep, when AST resolution can't
statically resolve the return value (helper calls, build-tagged variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mustansir14 mustansir14 marked this pull request as ready for review May 4, 2026 15:08
@shahzadhaider1 shahzadhaider1 requested a review from bradlarsen May 4, 2026 16:00
Contributor

@bradlarsen bradlarsen left a comment

This looks useful!

Comment thread scripts/test/detector_corpora_test.sh Outdated
Comment on lines +126 to +128
SUM(CASE WHEN Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) verified,
SUM(CASE WHEN NOT Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) unverified,
SUM(CASE WHEN VerificationError IS NOT NULL THEN 1 ELSE 0 END) \"unknown\"
Contributor
Running with --no-verification above makes these values not meaningful, right?

Contributor Author
Agreed! Thanks for catching this. Will remove

Comment thread scripts/test/detector_corpora_test.sh
Comment on lines +64 to +106
local rc=0
if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
# Single S3 download teed to both binaries simultaneously.
unzstd -c "$input" 2>> "$STDERR_FILE" \
| jq -r .content 2>> "$STDERR_FILE" \
| tee >(
"${TRUFFLEHOG_BIN_MAIN}" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
"${main_include_flag[@]}" \
stdin >> "${OUTPUT_JSONL_MAIN}" 2>> "$STDERR_FILE"
) \
| "$TRUFFLEHOG_BIN" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
--print-avg-detector-time \
"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
wait
else
unzstd -c "$input" 2>> "$STDERR_FILE" \
| jq -r .content 2>> "$STDERR_FILE" \
| "$TRUFFLEHOG_BIN" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
--print-avg-detector-time \
"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
fi
Contributor
When I last used trufflehog stdin like this, I found that it would timeout after 1 minute, leaving you with an incomplete scan when dealing with large inputs. The workaround was, confusingly, to specify --archive-timeout=6h (or some similarly large value).

You might want to check that the scan is not being terminated early! This is another reason why you might prefer the json-enumerator input source, which doesn't have this timeout gotcha.

Contributor Author

@mustansir14 mustansir14 May 5, 2026
I don't think this is an issue here. The tests consistently run for 30+ minutes and complete successfully without truncation. I think it's because we decompress the corpus ourselves with unzstd before piping to stdin, so TruffleHog never sees a compressed archive and the archive timeout never applies.

Contributor
The tests consistently go 30+ minutes and successfully complete without truncation.

How long do the actual scans run? If you look closely, do you see that the actual scans (not the builds etc) take just ~1 minute each?

I was just able to reproduce the behavior I was seeing earlier on my local machine, using the latest published trufflehog (3.95.2). When I try to scan a large stream of textual data on stdin, the scan stops after 60s (the default value of --archive-timeout). Adjust the value of --archive-timeout to something shorter and you will see:

time trufflehog --no-update --log-level=4 --no-verification -j --archive-timeout 5s stdin < SEVERAL_GIGABYTES.txt
...
{"level":"error","ts":"2026-05-07T10:09:40-04:00","logger":"trufflehog","msg":"error writing to data channel","source_manager_worker_id":"dwpxz","unit_kind":"unit","unit":"<stdin>","mime":"text/plain; charset=utf-8","timeout":5,"error":"context deadline exceeded"}
...
{"level":"info-0","ts":"2026-05-07T10:09:40-04:00","logger":"trufflehog","msg":"finished scanning","chunks":106929,"bytes":1423438848,"verified_secrets":0,"unverified_secrets":77,"scan_duration":"5.0252335s","trufflehog_version":"3.95.2","verification_caching":{"Hits":0,"Misses":0,"HitsWasted":0,"AttemptsSaved":0,"VerificationTimeSpentMS":0}}
trufflehog --no-update --log-level=4 --no-verification -j --archive-timeout 5  49.61s user 1.97s system 829% cpu 6.219 total

Contributor Author
You are right. This was indeed an issue. See my comment on the PR. Thanks for catching!

@mustansir14 mustansir14 requested a review from bradlarsen May 6, 2026 07:32
Contributor

@bradlarsen bradlarsen left a comment
This looks like it would add some useful test feedback for detector regex-related changes.

My only remaining question is whether the 60s timeouts I've observed with the trufflehog stdin source are affecting the testing done here.

@mustansir14
Contributor Author

mustansir14 commented May 12, 2026

My only remaining question is whether the 60s timeouts I've observed with the trufflehog stdin source are affecting the testing done here.

It turns out you were right. The scans were indeed timing out after 1 minute. The CI wasn't showing this because scanner logs were being piped into a temp file. Once I made it log on CI, I was able to see the error. I added the --archive-timeout=2h flag and now I see in the logs that scans are successfully completing.

This also impacted the running time heavily: with the timeout fixed, the full run took more than an hour. That won't work in CI, so I removed the larger dataset (the 32 GB one); it now runs only on the smaller dataset (2.8 GB compressed, 29 GB uncompressed). That run takes ~21 minutes, which I think is fine.

Even the smaller dataset returned 865 results for a loosened JDBC regex, so I think it would still really benefit us to have it on CI.

@bradlarsen Thanks much for this! Great catch!


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit b37bdab.
