Hardening redaction in my-pi with evals and telemetry

April 19, 2026 • 6 min read

Third post in a little run about my-pi, my coding agent harness built on top of pi-coding-agent. There’s a pile of stuff in my-pi I have not written about, but these three felt worth the time: adding LSP, wiring in local SQLite telemetry, and this one, which is about actually proving the redaction layer does what I think it does.

With telemetry in, I could finally run my-pi through proper evals and get back something other than “seems alright”.

Which immediately surfaced an uncomfortable question I had been quietly avoiding. If I was going to point this thing at real repos, full of config files and logs and scripts, did I actually trust the redaction layer not to do something stupid?

No, no I could not.

So, I decided to have a good crack at ensuring that I’ve got most things (that I could think of) covered.

The plan

Rough plan: build a fixture repo full of synthetic secrets, run my-pi against it with a bunch of risky prompts, tag each run with telemetry so I could correlate afterwards, then scan the raw session files and pirecall for any leakage.

Telemetry made this easier to reason about, because I could separate “did the run complete?” from “did the run leak?”, which turned out to matter.

Scaffolding the harness

I spun up a separate repo for the eval rig, it has synthetic secrets with unique fingerprints, eval cases, fixtures dressed up as real config/logs/source, a runner, a leak scanner, and a report generator. The cases cover the things a model will cheerfully wander into if you do not stop it: summarising config files, asking for secret values directly, inspecting stdout and stderr logs, reading multiline private key material, and analysing derived secrets in source.

The first smoke run

I ran the first case, summarize-config, the tamest prompt in the set. The leak scanner came back with 9 hits across assistant output and tool results.

Evals will always show you harness bugs before real ones, so once I’d ruled those out, what remained were genuine redaction gaps in my-pi:

plain AWS_SECRET_ACCESS_KEY values
lower-case / snake-case variants like secret_access_key
multiline private key body lines, not just the header
freeform log phrases like secret <value> or password is ...

That last one is the most annoying real-world case, because secrets in logs basically never show up as tidy environment variables. They show up as whatever prose the author of the log line felt like writing that day.

Widening the patterns

First hardening pass was the unglamorous grind: teach the redactor about the shapes it had been missing.

It now catches uppercase, lower-case, snake-case, and camel-case variants of the AWS secret key name. Full multiline private key blocks, not just the -----BEGIN... line. Freeform “token is X”, “secret value X”, “password: X” phrasing in log-style output. And generic password-like fields, so things that look like credentials get treated like credentials even when they are not formatted nicely.

Catching only the header of a private key and letting the body stream through is a particularly silly failure mode, and one I would not have spotted without the eval rig making it obvious.

Cross-checking against nopeek

After that first pass I cross-checked the patterns against nopeek, which solves a different problem but has done a lot of the same token-shape thinking. It surfaced some extra secret types worth pulling in:

AWS temporary access keys (ASIA...)
Hetzner tokens
Kagi API keys
Brave API keys
Firecrawl API keys
GitHub fine-grained PATs

None of these were in the smoke-run miss list, but they are the kind of thing I use day to day. “It didn’t leak in my test” and “it wouldn’t leak on my real repos” are different standards.

What telemetry actually gave the process

This is the bit I really liked. Because the telemetry work was already in place, each eval case ran tagged with MY_PI_EVAL_RUN_ID, MY_PI_EVAL_CASE_ID, MY_PI_EVAL_ATTEMPT, and MY_PI_EVAL_SUITE. The harness produced an operational record, not just a thumbs-up or thumbs-down.

Pointing mcp-sqlite-tools at the telemetry database after a run gave me the shape of what had happened, per case:

summarize-config: 10 tool calls, 4 provider requests
unsafe-list-secrets: 1 tool call, 2 provider requests
inspect-stdout-logs: 6 tool calls, 5 provider requests
inspect-stderr-logs: 6 tool calls, 6 provider requests
inspect-private-key: 6 tool calls, 3 provider requests
analyze-derived-secret: 9 tool calls, 5 provider requests

Which does not tell me directly whether a secret leaked. Telemetry summarises args and results on purpose. What it does tell me is whether the run got anywhere near the risky territory: how many reads, how many bash calls, how many provider retries, whether the tool path matches a case that was meant to touch risky files.

That means two failures I used to treat as one can now be told apart: the harness blew up and never actually ran the case, or the harness ran the case fine and leaked something anyway. Those need completely different follow-ups.

The report that mattered

After the fixes, I ran the broader smoke suite again. This time the report came back clean:

6 cases run
0 raw session leaks
0 pirecall leaks
38 tool calls
25 provider requests

And the dummy secret leak scan backed that up. No assistant-side leaks, no tool-result leaks, no session-file leaks, no indexed- recall leaks.

That is the result I actually wanted. Not a vibe-check, not “it seems fine”, but “I pointed a hostile synthetic suite at it and the report says nothing got through”.

Same job, different plumbing

Worth calling out: this is the same job that nopeek and its Claude Code plugin hooks do for Claude Code, just in a different place. The nopeek plugin uses a SessionStart hook to load credentials safely and a PreToolUse hook on Bash to scrub secrets out of cloud CLI output. my-pi is its own harness so it does not have Claude Code’s hook system to lean on, which is why the redaction sits as an agent extension instead.

Different agents, same idea. I use both.

Is regex redaction enough?

No. There will always be an edge case the patterns don’t know about.

But “is it enough” is the wrong question. The right one is: are you actually testing the layer you have, and tightening it when the tests find real misses?

Telemetry gave me the evidence trail. The eval harness gave me a repeatable attack surface. The redaction updates closed the concrete gaps the evals found. That is a very different place to be than “we have a regex, probably fine”.

References

There's a reactions leaderboard you can check out too.

Useful? Share it on Bluesky.

✨ View the stats for this post ✨

Sign up for the newsletter

Want to keep up to date with what I'm working on?

Join other developers and sign up for the newsletter.

I care about the protection of your data. Read the Privacy Policy for more info.