Automate corpora testing in CI #4927
…al runs to get PR issue number
The bench uses --no-verification, so the engine's overlap-path dedup (which exists to protect verifiers from duplicate calls) adds noise without value here — it causes shifts in unrelated detectors when only one detector's regex changes. Pair --allow-verification-overlap with --no-verification so each detector's regex behavior is measured independently. Also fix the false 'no diff vs main' claim that triggered when NEW/REMOVED were zero but total counts differed.
awk's END block doesn't run when trufflehog exits before draining stdin (SIGPIPE kills awk first), leaving the bytes file empty and breaking the step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a default of 0 and validate it's an integer before arithmetic. Also fold unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay out of CI logs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
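The defensive read described in the commit message might look roughly like this (a sketch, not the actual script; the file path and function name are assumptions):

```shell
# Sketch: awk's END block may never write the byte-count file if the
# downstream reader exits first (SIGPIPE), so treat an empty, missing,
# or non-numeric file as 0 before doing arithmetic.
read_bytes() {
  local b
  b=$(cat "$1" 2>/dev/null)        # missing file -> empty string
  b="${b:-0}"                      # empty -> 0
  [[ "$b" =~ ^[0-9]+$ ]] || b=0    # non-integer garbage -> 0
  echo "$b"
}

TOTAL_BYTES=0
TOTAL_BYTES=$((TOTAL_BYTES + $(read_bytes /tmp/corpus_bytes)))
```

This avoids the `$((TOTAL_BYTES + ))` syntax error: the arithmetic expansion always receives a valid integer, even when the pipeline dies mid-stream.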
…pection Static AST parse of a detector package to extract the strings returned by its Keywords() method. Used by the upcoming keyword-corpus builder to fan out per-detector GitHub Code Search queries during the corpora bench. AST-first because each detector lives in its own package; importing them dynamically would require codegen or `plugin`. Falls back to a regex over the function body, then a directory-wide grep, when AST resolution can't statically resolve the return value (helper calls, build-tagged variants). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
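The regex fallback mentioned above could be sketched like this (a stand-in for illustration, not the actual extractor; the `Keywords()` method shape is an assumption based on the description):

```shell
# Hedged sketch of the regex fallback (not the AST-first path): scan a
# detector's Go source for the Keywords() body and print its string
# literals, one per line.
extract_keywords() {
  awk '
    /func .*Keywords\(\)/ { in_fn = 1 }       # enter the method body
    in_fn {
      while (match($0, /"[^"]+"/)) {          # emit each "literal"
        print substr($0, RSTART + 1, RLENGTH - 2)
        $0 = substr($0, RSTART + RLENGTH)
      }
      if (/^}/) in_fn = 0                     # leave at the closing brace
    }
  ' "$1"
}
```

For example, `extract_keywords pkg/detectors/aws/aws.go` (path illustrative) would print each keyword on its own line; the AST path is still preferred because a regex can be fooled by comments or helper calls inside the body.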
…urity/trufflehog into hackathon/detector-tests-in-ci
```sql
SUM(CASE WHEN Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) verified,
SUM(CASE WHEN NOT Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) unverified,
SUM(CASE WHEN VerificationError IS NOT NULL THEN 1 ELSE 0 END) "unknown"
```
Running with --no-verification above makes these values not meaningful, right?
Agreed! Thanks for catching this. Will remove.
```shell
local rc=0
if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
  # Single S3 download teed to both binaries simultaneously.
  unzstd -c "$input" 2>> "$STDERR_FILE" \
    | jq -r .content 2>> "$STDERR_FILE" \
    | tee >(
        "${TRUFFLEHOG_BIN_MAIN}" \
          --no-update \
          --no-verification \
          --allow-verification-overlap \
          --log-level=3 \
          --concurrency=8 \
          --json \
          "${main_include_flag[@]}" \
          stdin >> "${OUTPUT_JSONL_MAIN}" 2>> "$STDERR_FILE"
      ) \
    | "$TRUFFLEHOG_BIN" \
        --no-update \
        --no-verification \
        --allow-verification-overlap \
        --log-level=3 \
        --concurrency=8 \
        --json \
        --print-avg-detector-time \
        "${INCLUDE_FLAG[@]}" \
        stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
  rc=$?
  wait
else
  unzstd -c "$input" 2>> "$STDERR_FILE" \
    | jq -r .content 2>> "$STDERR_FILE" \
    | "$TRUFFLEHOG_BIN" \
        --no-update \
        --no-verification \
        --allow-verification-overlap \
        --log-level=3 \
        --concurrency=8 \
        --json \
        --print-avg-detector-time \
        "${INCLUDE_FLAG[@]}" \
        stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
  rc=$?
fi
```
When I last used trufflehog stdin like this, I found that it would timeout after 1 minute, leaving you with an incomplete scan when dealing with large inputs. The workaround was, confusingly, to specify --archive-timeout=6h (or some similarly large value).
You might want to check that the scan is not being terminated early! This is another reason why you might prefer the json-enumerator input source, which doesn't have this timeout gotcha.
I don't think this is an issue in here. The tests consistently go 30+ minutes and successfully complete without truncation. I think it's because we decompress the corpus ourselves with unzstd before piping to stdin, so TruffleHog never sees a compressed archive and the archive timeout never applies.
> The tests consistently go 30+ minutes and successfully complete without truncation.
How long do the actual scans run? If you look closely, do you see that the actual scans (not the builds etc) take just ~1 minute each?
I was just able to reproduce the behavior I was seeing earlier on my local machine, using the latest published trufflehog (3.95.2). When I try to scan a large stream of textual data on stdin, the scan stops after 60s (the default value of --archive-timeout). Adjust the value of --archive-timeout to something shorter and you will see:
```shell
time trufflehog --no-update --log-level=4 --no-verification -j --archive-timeout 5s stdin < SEVERAL_GIGABYTES.txt
...
{"level":"error","ts":"2026-05-07T10:09:40-04:00","logger":"trufflehog","msg":"error writing to data channel","source_manager_worker_id":"dwpxz","unit_kind":"unit","unit":"<stdin>","mime":"text/plain; charset=utf-8","timeout":5,"error":"context deadline exceeded"}
...
{"level":"info-0","ts":"2026-05-07T10:09:40-04:00","logger":"trufflehog","msg":"finished scanning","chunks":106929,"bytes":1423438848,"verified_secrets":0,"unverified_secrets":77,"scan_duration":"5.0252335s","trufflehog_version":"3.95.2","verification_caching":{"Hits":0,"Misses":0,"HitsWasted":0,"AttemptsSaved":0,"VerificationTimeSpentMS":0}}
trufflehog --no-update --log-level=4 --no-verification -j --archive-timeout 5 49.61s user 1.97s system 829% cpu 6.219 total
```
You are right. This was indeed an issue. See my comment on the PR. Thanks for catching!
…e everything works as expected
bradlarsen left a comment
This looks like it would add some useful test feedback for detector regex-related changes.
My only remaining question is whether the 60s timeouts I've observed with the trufflehog stdin source are affecting the testing done here.
…c regex to trigger workflow
It turns out you were right. The scans were indeed timing out after 1 minute. The CI wasn't showing this because scanner logs were being piped into a temp file; once I made it log on CI, I was able to see the error, and I added the workaround. This also impacted the running time heavily: it was running for more than an hour, which won't work in CI, so I removed the larger dataset (the 32 GB one). It now runs only on the smaller dataset (2.8 GB compressed, 29 GB uncompressed), which takes ~21 minutes — that seems fine. Even the smaller dataset returned 865 results for a loosened JDBC regex, so I think it would still really benefit us to have this in CI. @bradlarsen Thanks so much for this! Great catch!
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit b37bdab.

Motivation
When adding or modifying a detector, the key question is: how much noise will this regex produce against real-world code? Too many false positives means alert fatigue; a regex that's too tight misses real secrets.
Previously this was a fully manual process — download a large corpus locally, run the pipeline, inspect the DuckDB output. There was no enforcement in CI, so it was easy to skip or forget, especially under time pressure.
This PR automates it. Every PR that touches detector regex or keyword changes now gets a comment showing exactly how many unique matches the changed detector produces, compared to the main baseline. The workflow runs once on PR open; reviewers or authors can re-run it manually via `workflow_dispatch` if needed.

What it tells you
The bench scans a corpus of real-world public code (~2.8 GB compressed, ~29 GB uncompressed) using only the detectors changed in the PR, with verification disabled. It reports unique match counts for the PR build vs. the main baseline:
PRs that add a new detector will see a 🆕 row with an absolute match count and no baseline comparison.
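A minimal shell stand-in for the kind of unique-match diff the report is built from (the actual diffing lives in `diff_corpora_results.py`; the field names `DetectorName` and `Raw` follow trufflehog's `--json` output, and the file paths are placeholders):

```shell
# Sketch: reduce each JSONL result set to unique (detector, raw match)
# pairs, then diff the two sets. Field names follow trufflehog's --json
# output; this is an illustration, not the actual reporter.
unique_matches() {
  jq -r '[.DetectorName, .Raw] | @tsv' "$1" | sort -u
}
# Matches present in the PR build but not in the main baseline:
#   comm -13 <(unique_matches main.jsonl) <(unique_matches pr.jsonl)
```

Sorting and deduplicating first is what makes the counts comparable across runs: the same secret found in many chunks collapses to one row before the diff.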
Example output
What changed
- `.github/workflows/detector-corpora-test.yml` — new workflow; triggers on PRs touching `pkg/detectors/**`; detects which detectors changed, builds PR and main binaries in parallel, runs both scans against the corpus concurrently, posts a sticky comment with the diff
- `scripts/test/detect_changed_detectors.sh` — resolves changed detector directories to their proto enum names for `--include-detectors` scoping; skips detectors whose diff doesn't touch regex patterns or `Keywords()`, so PRs that only change verification or redaction logic don't trigger a bench run
- `scripts/test/detector_corpora_test.sh` — streams corpus files (S3 or local), runs trufflehog with `--no-verification`, outputs JSONL
- `scripts/test/diff_corpora_results.py` — diffs two JSONL result sets and renders the Markdown report posted to the PR

Performance
The corpus is ~2.8 GB compressed (~29 GB uncompressed). A typical run takes ~21 minutes.
Three things keep it manageable:
Scoped scanning — Only the detectors changed in the PR are passed via
--include-detectors. Scanning the full detector set against 29 GB would be prohibitive; scoping to 1–3 detectors cuts runtime proportionally.Single S3 stream — Both PR and main binaries consume the same S3 download via a named FIFO. S3 is downloaded and decompressed once; both scans run in parallel against the same stream.
Main scan caching — The main binary always scans the same merge-base commit for a given PR. Results are cached in GitHub Actions keyed by
merge-base SHA + detector set. On a manual re-run without a rebase, the entire main side (worktree checkout,go build, S3 download, scan) is skipped.Fork PRs
Excluded for fork PRs — S3 credentials are not available to fork-originated workflows. Maintainers need to run this manually for forked PRs.
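The merge-base cache key for the main-side scan might be computed roughly like this (a sketch; the key format and names are assumptions, not taken from the workflow file):

```shell
# Hedged sketch: key the cached main-side results by merge-base SHA plus
# a hash of the detector set, so re-runs without a rebase (same base,
# same detectors) hit the cache and skip the entire main-side scan.
cache_key() {
  local merge_base="$1" detectors="$2"  # e.g. "$(git merge-base origin/main HEAD)" and "JDBC,AWS"
  local set_hash
  set_hash=$(printf '%s' "$detectors" | sha256sum | cut -d' ' -f1)
  echo "corpora-main-${merge_base}-${set_hash}"
}
```

Including the detector set in the key matters: the main binary is only run with `--include-detectors` scoping, so cached results from one detector set are not valid for another even at the same merge base.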
Running locally
Results land in `/tmp/corpora_results.jsonl` with a DuckDB summary table printed to stdout.

Note
Medium Risk
Introduces a new CI workflow that uses AWS credentials, caching, and PR comment automation; misconfiguration could leak costs/time or fail PR checks. Also changes the `jdbc` detector regex (currently intentionally loosened), which could increase false positives if merged as-is.

Overview
Adds a new GitHub Actions workflow (`detector-corpora-test.yml`) that automatically runs a scoped corpora scan when detector matching logic changes, compares PR vs. merge-base results, and posts/updates a sticky PR comment with a Markdown diff report (or a skip message when no relevant changes are detected).
Introduces supporting tooling: `detect_changed_detectors.sh` to map changed detector dirs to `--include-detectors` IDs and ignore non-regex/`Keywords()` edits; `detector_corpora_test.sh` to stream S3/local corpora and optionally tee a single stream into both PR and main binaries; and `diff_corpora_results.py` to diff JSONL outputs and render the report.
.gitignorefor Python artifacts; updatespkg/detectors/jdbcregex (noted as a temporary loosening to exercise the new workflow).Reviewed by Cursor Bugbot for commit ee95bc2. Bugbot is set up for automated code reviews on this repo. Configure here.