AI Digest.

Agents Go Autonomous with Codex /goal and 96-Agent Swarms While DeepSeek Beats Opus Through Harness Engineering

Today's feed was dominated by the agentic revolution going into overdrive, from Codex's new /goal feature working on a single task for over an hour (and reportedly able to run for days) to 96 concurrent Hermes agents burning through 382M tokens in under three days. Meanwhile, a detailed teardown showed how four small input repairs helped DeepSeek V4 Pro beat Opus 4.7 6 out of 10 times on the author's internal evals, and the dcg safety tool expanded beyond Claude Code to cover Codex and gemini-cli.

Daily Wrap-Up

The most striking thread running through today's posts is that the agent era isn't coming; it's already here and scaling fast. We've moved past "can an agent write a function" to "can 96 agents coordinate on a MacBook Air over hotel wifi." The answer, apparently, is yes, and the conversation has shifted accordingly. The infrastructure layer around agents, from safety guardrails to database runtimes to tool-calling repair layers, is where the real engineering work is happening now. The models themselves are almost beside the point.

The single most technically valuable post today was @MrAhmadAwais's deep dive into why open source models fail at tool calling. The punchline is devastating for model elitism: it's almost always a harness problem, not a model problem. Four predictable input repair patterns and a couple of regex lines turned DeepSeek from "can't do tool calls" into an Opus 4.7 competitor. That's not a model upgrade. That's contract design. Meanwhile, @doodlestein's dcg tool hitting its four-month anniversary and expanding to Codex support is a quiet reminder that the more autonomous agents become, the more we need mechanical safety nets rather than vibes-based trust. When you have dozens of agents running simultaneously, "git reset --hard HEAD" from one rogue agent can wipe out everything the others built.

On the lighter side, watching @AlexFinn describe Codex's /goal feature building an entire extraction shooter game with AI-generated assets in a run of over an hour feels like a glimpse into a very weird future. The fact that the recommendation is to "put on skip all permissions" and let it run for days is either thrilling or terrifying depending on your relationship with autonomous systems. The most practical takeaway for developers: if you're working with open source models and struggling with tool calling, don't blame the model. Implement validate-then-repair at the harness level, starting with the four common failure modes (null instead of omit, stringified arrays, object-for-array wraps, bare strings for arrays). By @MrAhmadAwais's account, those four explain roughly 90% of the "this model can't do tool calls" cases he has seen, at about 30 to 100 lines of code per repair.

Quick Hits

  • @Starlink is pushing Starlink Mini for portable internet. Not AI, but the connectivity infrastructure that makes hotel-wifi agent swarms possible.
  • @usebagel launched Bagel, a lightweight CRM overlay for X that tracks pipeline and follow-ups directly on profiles. Useful for founders doing organic outreach.
  • @Davidstrolder pitched a "talk to your favorite book" chatbot for getting life advice and debating ideas. The literary AI niche continues to find its audience.
  • @TheAhmadOsman sparked debate by suggesting software engineers who can't work around agent limitations should "pivot to being a Starbucks Barista." Spicy take, meet ratio.
  • @ashwingop published Part 3 of his "Company Brain" series on interaction memory, arguing that the most important company knowledge lives in meetings, messages, and emails rather than documents.
  • @ArchiveExplorer highlighted n8n-mcp as the go-to open source automation tool with 19,000 stars and 1,500 documented nodes, calling it "where you start" for Claude automation workflows.
  • @cryptopunk7213 broke down Anthropic's latest usage study, which found that over 75% of personal guidance conversations fall into four domains: health and wellness, personal finance, relationships, and career. He points to Cal AI's $100M+ acquisition by MyFitnessPal as evidence that there's real money in building wrapper experiences around these trends.

Agents Unchained: From Goal-Driven Coding to 96-Agent Swarms

The agent conversation leveled up significantly today, moving from "agents can help you code" to "agents can run autonomously for hours or days." The Codex /goal feature drew the most attention, with @AlexFinn reporting that it worked for over an hour and built him "an entire complex extraction shooter video game," generating all of the visual assets autonomously:

> "You give it a goal, then it works endlessly until the goal is complete. It's like a Ralph loop. Can run for days. If you enable the image gen skill before you run the goal, it will even generate ALL the assets for your game autonomously."

But the real jaw-dropper was @mr_r0b0t casually revealing that he ran 96 concurrent Hermes agents using DeepSeek V4 Pro, burning through "382,745,618 tokens over 171,136 API calls" in under three days, all from an M4 MacBook Air on hotel wifi with an 81% cache hit rate. Meanwhile, @vmiss33 published a practical guide on how they actually use Hermes agents in a multi-agent setup day to day, offering a grounded counterpoint to the hype.

The Shopify ecosystem is also going agentic. @tamir_eden announced open-sourced Shopify admin routines that flip the agent-operator relationship: "Skills made the agent something operators call when they need help. Routines flip it; the agent runs the store on a schedule and only pings you when something actually needs you." The announcement quotes a post from @tobi (Shopify's CEO) that reads simply "So good," which suggests this isn't a fringe experiment. The pattern emerging across all these posts is consistent: agents are moving from reactive tools to proactive autonomous systems, and the humans are increasingly becoming the exception handlers rather than the drivers.

Harness Engineering: Why Open Models Fail at Tool Calling (And How to Fix It)

The most technically dense and arguably most important post of the day came from @MrAhmadAwais, who published a detailed breakdown of making DeepSeek V4 Pro competitive with Opus 4.7 through harness-level repairs rather than model changes. The core insight challenges a widespread assumption in the AI development community:

> "across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly: sending null for an optional field instead of omitting it, emitting ["a","b"] as a json string instead of an actual array, wrapping a single arg in {} where the schema expected an array, passing a bare string where an array was expected."

The architectural decision that made this work was inverting the typical approach: instead of preprocessing inputs before validation, you validate first, let the schema tell you exactly what's broken, then apply targeted repairs only at the failing paths. Inputs that already validate are never touched. The preprocessing approach caused silent corruption: writeFile content that merely happened to be JSON-shaped got rewritten before it hit disk. The validate-first approach uses the schema itself as the prior, spending "repair budget" only where the validator actually disagreed, and if the repaired input still fails, the harness returns a model-readable retry message instead of a raw validator error.
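
The validate-then-repair flow can be sketched in a few dozen lines. This is a minimal illustration, not the author's implementation: the original works against zod schemas in a TypeScript CLI, whereas the tiny schema format, the function names, and the reading of the `{}` case as "empty placeholder becomes empty array" are all assumptions made here for the example. The repair order follows the post, with JSON-array parsing running before bare-string wrapping.

```python
import json
from typing import Any

# Hypothetical schema format: field name -> (kind, required),
# where kind is "string" or "array" (array of strings).
Schema = dict[str, tuple[str, bool]]

_SKIP = object()  # this repair does not apply to the value
_DROP = object()  # this repair says: omit the field entirely


def validate(args: dict[str, Any], schema: Schema) -> list[str]:
    """Return the names of fields that violate the schema (empty list = valid)."""
    issues = []
    for field, (kind, required) in schema.items():
        if field not in args:
            if required:
                issues.append(field)
            continue
        value = args[field]
        if kind == "string":
            ok = isinstance(value, str)
        else:
            ok = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if not ok:
            issues.append(field)
    return issues


def drop_null(value, kind, required):
    # null sent for an optional field instead of omitting it
    return _DROP if value is None and not required else _SKIP


def parse_json_array(value, kind, required):
    # '["a","b"]' sent as a JSON string instead of an actual array
    if kind == "array" and isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return _SKIP
        if isinstance(parsed, list):
            return parsed
    return _SKIP


def empty_object_to_array(value, kind, required):
    # {} sent as a placeholder where an array was expected (assumed: becomes [])
    return [] if kind == "array" and isinstance(value, dict) and not value else _SKIP


def wrap_bare_string(value, kind, required):
    # "foo" sent where ["foo"] was expected
    return [value] if kind == "array" and isinstance(value, str) else _SKIP


# Order matters: parse_json_array must run before wrap_bare_string.
REPAIRS = [drop_null, parse_json_array, empty_object_to_array, wrap_bare_string]


def parse_tool_input(args: dict[str, Any], schema: Schema):
    """Validate first; repair only the fields the validator complained about."""
    issues = validate(args, schema)
    if not issues:
        return args, "valid"  # valid inputs are never touched
    repaired = dict(args)
    for field in issues:
        if field not in repaired:
            continue  # a missing required field cannot be repaired here
        kind, required = schema[field]
        for repair in REPAIRS:
            result = repair(repaired[field], kind, required)
            if result is _SKIP:
                continue
            if result is _DROP:
                del repaired[field]
            else:
                repaired[field] = result
            break
    if validate(repaired, schema):
        return args, "invalid"  # caller should send a model-readable retry message
    return repaired, "repaired"
```

With a schema of `{"paths": ("array", True), "cwd": ("string", False)}`, the call `parse_tool_input({"paths": '["a","b"]', "cwd": None}, schema)` comes back as `({"paths": ["a", "b"]}, "repaired")`, while a string field whose content merely looks like a JSON array passes through unchanged because it was valid to begin with. The returned status string is also a natural hook for the per-model, per-tool repair telemetry the post mentions.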

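Perhaps the most entertaining failure mode was DeepSeek-flash emitting file paths as markdown auto-links, for example "/Users/x/proj/[notes.md](http://notes.md)", which the writeFile tool then tried to create as a literal filename. As @MrAhmadAwais explains, "this is not a hallucination. It's the post-training chat distribution leaking through the tool boundary." The model knows how to format a path; it just hasn't been told this path is going to fopen rather than into a chat bubble. The fix was two regex lines that unwrap only the degenerate case where the link text equals the URL without its protocol, so genuine markdown links pass through untouched. The broader lesson here is that "tool confusion" is a more useful frame than "capability gap," and what looks like a model limitation is often a contract design problem that the harness should solve.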
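
The auto-link unwrapping described in the post can be approximated as follows. This is a sketch under stated assumptions: the post only says the fix is "two regex lines," so the exact pattern, the function name `unwrap_auto_links`, and the tolerance for a trailing slash are all illustrative choices rather than the author's code.

```python
import re

# Matches [text](http://url) or [text](https://url).
# Group 1 is the link text, group 2 is the URL without its protocol.
_AUTO_LINK = re.compile(r"\[([^\]]+)\]\(https?://([^)\s]+)\)")


def unwrap_auto_links(path: str) -> str:
    """Unwrap only degenerate auto-links where the text equals the URL minus protocol."""
    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        # A real link like [click](https://x.com) has different text and URL: keep it.
        return text if text == url.rstrip("/") else match.group(0)

    return _AUTO_LINK.sub(replace, path)
```

Applied to `"/Users/x/proj/[notes.md](http://notes.md)"` this yields `"/Users/x/proj/notes.md"`, while `"[click](https://x.com)"` is returned unchanged. The post attaches this normalization at the schema level (a `pathString()` type instead of `z.string()`), so every path field gets it without per-tool code.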

Agent Safety: Protecting Developers From Their Own Creations

As agents become more autonomous and numerous, the safety tooling around them becomes critical infrastructure. @doodlestein marked four months since releasing dcg (destructive_command_guard) and announced it now supports Codex and gemini-cli alongside Claude Code:

> "Agents just can't be trusted to not occasionally do crazy things that seem sensible to them at the moment, but which are wildly destructive and often irreversible. These bouts of temporary madness often occur soon after compactions, or as a result of context rot caused by excessively long sessions."

The tool is written in Rust for speed since it runs on every single tool call, and uses ast-grep analysis to catch agents that try to work around blocked commands via ad-hoc scripts. The point about context rot causing destructive behavior is particularly relevant given today's trend toward long-running autonomous agent sessions. @mattpocockuk raised the human side of this problem, asking what to do when team members use AI "negligently, not reviewing, not caring, leaning into the slop." The "code is cheap" mindset, he argues, is making pre-existing quality problems worse. These two posts form a natural pair: we need mechanical safety nets for agents AND cultural accountability standards for the humans directing them.
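
To make "mechanical safety net" concrete, here is a toy pre-execution check in the same spirit. It is emphatically not dcg's implementation: dcg is a Rust program with around 50 preset packs and an ast-grep layer for ad-hoc scripts, none of which is reproduced here. The `Rule` type, the two patterns, and the `git stash` suggestion are hypothetical; only the `cargo clean` suggestion echoes an example from the author's own description.

```python
import re
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    pattern: re.Pattern
    reason: str
    alternative: str


# Hypothetical rule set for illustration only.
RULES = [
    Rule(
        re.compile(r"\bgit\s+reset\s+--hard\b"),
        "discards uncommitted work, including changes made by other agents",
        "git stash, or commit first and revert",
    ),
    Rule(
        re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*\s+(\./)?target/?(\s|$)"),
        "deletes the Rust build directory by hand",
        "cargo clean",
    ),
]


def check_command(command: str) -> Optional[str]:
    """Return an explanation if the command is blocked, or None if it may run."""
    for rule in RULES:
        if rule.pattern.search(command):
            return f"Blocked: {rule.reason}. Consider instead: {rule.alternative}"
    return None
```

A hook would call `check_command` before every shell tool call and hand the returned message back to the agent instead of running the command. A plain regex list like this is exactly the "simple rulebook" the author says resourceful models route around with ad-hoc scripts, which is why the real tool adds deeper analysis on top.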

Local Inference Gets Serious: Budget GPUs Hit 128 Tokens/Second

The local inference community had a strong showing today, with concrete benchmarks that make the case for consumer hardware as a viable inference platform. @above_spec reported getting 128 tokens per second on Qwen3.6-35B using an RTX 5060 Ti 16GB (a $429 GPU) with ik_llama.cpp's R4 quant format and no speculative decoding, noting performance "stays consistent from 0 to 139k context." By his account that's faster than a 5070 Ti on mainline llama.cpp, which is a remarkable result for a mid-range card.

@Maor_Elkarat dug into the technical details that make these results possible, pointing out that "weights are only half the story. KV cache is eating your VRAM alive." Understanding the flags and cache management strategies is what separates a frustrating 12GB experience from one that runs "insanely fast." Meanwhile, @luthiraabeykoon shared work on a hardware-level approach: building a transformer using Q4.12 fixed-point math and ROM-backed weights, with a reusable 16-lane streamed MatVec tile time-multiplexed across Q/K/V, MLP, and LM head projections. The stack is deepening at every layer, from silicon to quantization to cache management, all in service of making powerful models run on hardware regular people can actually buy.
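
A quick back-of-the-envelope calculation shows why the KV cache matters as much as the weights. The formula below is the standard estimate for a transformer in which every layer caches full keys and values; the model dimensions are hypothetical round numbers chosen for illustration and are not Qwen3.6's actual configuration, and real cache quantization formats carry some per-block overhead that this ignores.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_elem: float,
) -> float:
    """Estimate KV cache size: keys and values for every layer and every token."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 2**30

# Hypothetical model: 48 layers, 8 KV heads of dimension 128, 128k-token context.
fp16_cache = kv_cache_bytes(48, 8, 128, 131072, 2) / GIB  # 16-bit cache entries
int8_cache = kv_cache_bytes(48, 8, 128, 131072, 1) / GIB  # idealized 8-bit entries

print(f"fp16 KV cache: {fp16_cache:.0f} GiB")  # prints "fp16 KV cache: 24 GiB"
print(f"8-bit KV cache: {int8_cache:.0f} GiB")  # prints "8-bit KV cache: 12 GiB"
```

Under these assumptions a full-context fp16 cache alone would exceed a 16GB card, regardless of how aggressively the weights are quantized, which is the point behind "weights are only half the story." Architectures that cache less per layer (sliding-window or hybrid attention, for example) change the numbers but not the shape of the tradeoff.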

Databases as Agent Runtime and the Thinning Application Layer

@anshublog shared a provocative framing that databases are "moving back to the center of software architecture, not as storage, but as runtime." The argument is that as agents generate workflows dynamically, applications get thinner, and the systems managing "memory, state, coordination, and history" become the critical infrastructure. This maps directly onto the multi-agent patterns we're seeing: when you have 96 agents coordinating work, something needs to manage shared state, and traditional application architecture isn't built for that. The quoted post from @siddontang, "The Database Is No Longer Storage; It Is Becoming the Runtime for AI," frames what could be one of the more consequential architectural shifts of the agent era.

The AI Services Market Finds Its Pricing Model

@lukepierceops laid out a detailed pricing framework for AI consulting that's worth bookmarking if you're in the space. The key insight is that the audit is the wedge: "The audit is what separates you from every 22-year-old with Claude Code who'll build whatever they're told. You're selling the map. The build becomes inevitable once they've seen the map." Meanwhile, @official_taches made waves by announcing they cancelled both Claude Max plans in favor of two Codex Max plans, calling GPT5.5 "the best coding model." The competitive landscape for AI coding tools continues to shift fast, and practitioners are voting with their wallets based on real workflow experience rather than benchmark leaderboards.

Sources

Starlink @Starlink
Starlink Mini offers fast, reliable internet on the go—great for traveling, camping, exploring, boating, RVing, and more. Order online in under 2 minutes.
Bagel | Remember people on X @usebagel
A tiny CRM for organic outreach on X. Bagel helps founders track pipeline, notes, context, and follow-ups directly on every profile.
Archive @ArchiveExplorer
This guy runs an AI consultancy out of Warsaw. for his own client work he built the tool every $10k/mo AI automation builder is secretly running 19,000 stars. 1,500 nodes documented. open source readme still says: "started as a personal tool, now helps tens of thousands of developers" if you're following the guide above - n8n-mcp is where you start → https://t.co/wpz2985hqt like + bookmark. you'll need this when you build your first claude automation
eng_khairallah1 @eng_khairallah1

How to Build & Sell AI Automations That Generate $10K Per Month (Full Course)

Ejaaz @cryptopunk7213
the single best part about these anthropic studies is they literally tell founders what products to build 1 startup already sold for $100M because they identified one of these trends: 75% of these conversations are people asking for advice in: > getting fit > paying back debt, making money > relationship advice > career management startups like Cal AI saw the fitness trend early, created an app that estimated calories from a photo and… went viral then myfitnesspal just bought them for 100M+ whoever can create a wrapper experience for financial management will make a killing. relationship AI is tricky, AI models become SUPER sycophantic because users tend to push back more in those convo’s (ai models suck under pressure) anyway - i love these studies please keep them coming
AnthropicAI @AnthropicAI

About 6% of all conversations are people asking Claude for personal guidance—whether to take a job, how to handle a conflict, if they should move. Over 75% of these conversations fell into four domains: health & wellness, career, relationships, and personal finance. https://t.co/SQamPx0jWt

luthira @luthiraabeykoon
The core design uses Q4.12 fixed-point math and ROM-backed weights. Most of the model becomes one repeated operation: matrix-vector multiply. So we built a reusable 16-lane streamed MatVec tile and time-multiplexed it across Q/K/V, MLP, and LM head. https://t.co/IacUCTnSx2
Anshu Sharma 🌶 @anshublog
“Agents will generate workflows dynamically. Applications will get thinner. And the systems that manage memory, state, coordination, and history will become more important than ever. Which is why I think databases are moving back to the center of software architecture. Not as storage. As runtime.”
siddontang @siddontang

The Database Is No Longer Storage - It Is Becoming the Runtime for AI

Alex Finn @AlexFinn
Pretty incredible You have to try the new '/goal' feature in Codex It worked for over an hour and built me an entire complex extraction shooter video game You give it a goal, then it works endlessly until the goal is complete. It's like a Ralph loop. Can run for days If you enable the image gen skill before you run the goal, it will even generate ALL the assets for your game autonomously. I didn't manually create ANY of the assets you see in the video Recommendations: enable the image gen skill, put on skip all permissions, and give the prompt as much detail as you can. It will accomplish ALL of it This has to be the sickest way to build games/ long running app tasks ever
Matt Pocock @mattpocockuk
What do you do if someone on your team is using AI negligently? I.e. not reviewing, not caring, leaning into the slop. This, of course, was a problem pre-AI. But the "code is cheap" mind virus is making it worse IMO.
Maor Ai @Maor_Elkarat
Stop buying more VRAM. Everyone’s posting Qwen 3.6 configs running insanely fast on 12GB cards. But do you actually understand the flags making it possible? Weights are only half the story. KV cache is eating your VRAM alive. The secret isn’t just 4-bit weights it’s the KV cache sorcery everyone’s missing. Here’s the annotated command & real tricks explained: @elonmusk @grok #Ai
Lex Christopherson @official_taches
I’ve officially cancelled both Claude Max plans and have 2 x Codex Max plans. Codex - particular GPT5.5 is the best coding model.
Luke Pierce @lukepierceops
Yesterday I said stop selling AI for $2-5k. Here's what you should actually be selling instead: Phase 1: Audit ($3K-$5K, 2-4 weeks) Phase 2: Build ($25K-$60K, 6-12 weeks) Phase 3A: Dev Retainer ($3K-$8K/mo, ongoing) Phase 3B: Maintenance ($500-$2k/mo) For mid-market ($10M-$50M ARR), shift up: Audit: $4K-$6K Build: $35K-$75K Retainer: $5K-$10K/mo For enterprise ($50M+): Audit: $7.5K-$15K Build: $75K-$250K+ Retainer: $10K+/mo The audit is the wedge. The audit is what separates you from every 22-year-old with Claude Code who'll build whatever they're told. You're selling the map. The build becomes inevitable once they've seen the map.
Tuki @tamir_eden
Operators, your Monday morning checklist is a pain in the A$$. Agentic-ecommerce operations are quickly becoming the next big leap for 2026 - Shopify shipped a Hermes (@NousResearch) agent skill. We open-sourced and shipped shopify-admin routines + more skills, for any agent: https://t.co/Z29UKCQX66 Skills made the agent something operators call when they need help. Routines flip it - the agent runs the store on a schedule and only pings you when something actually needs you. Come join the growing community of agentic-first @Shopify operators: https://t.co/kya1p5P4JR.
tobi @tobi

So good

Ahmad @TheAhmadOsman
If you’re a “Software Engineer” and you don’t know how to bypass this then please pivot to being a Starbucks Barista because you’re ngmi
cormachayden_ @cormachayden_

software engineers before vs after agents https://t.co/jJp75lO8O7

Ahmad Awais @MrAhmadAwais
how did we make deepseek outperform opus 4.7? i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem. context: spent the two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. I ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run, every shellCommand and readFile call bouncing back with a raw zod issues blob, the model unable to recover because the error wasn't in a form it could read. by the end deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals. a few things i learned that feel general: 1/ the failure modes aren't random they're a small finite compositional set. across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly: - sending `null` for an optional field instead of omitting it - emitting `["a","b"]` as a json *string* instead of an actual array - wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder") - passing a bare string where an array was expected (`"foo"` instead of `["foo"]`) four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time. 2/ the funniest failure mode is also the most revealing. deepseek-flash, when asked to edit or write a file, sometimes emits the path as a *markdown auto-link*: filePath: "/Users/x/proj/[notes.md](http://notes. md)" our writeFile tool obediently trued creating files literally named `[notes.md](http://notes .md)` until we caught it. this is not a hallucination. 
it's the post-training chat distribution leaking through the tool boundary the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where link text equals url-without-protocol real markdown like `[click](https://x .com)` passes through untouched. this is also conditioning of their own tools during RL which were different from all other tools we write and ofc can't predict. "tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level `pathString()` instead of `z.string()` and the leak is plugged for every path field at once. 3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair. my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately, writeFile content that *happened* to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test. then i made it less greedy - parse the input as-is. if it succeeds, ship it. valid inputs are never touched. - on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies. - parse again. on success, log `tool_input_repaired:${toolName}`. on failure, log `tool_input_invalid:${toolName}` and return a model-readable retry message. the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths the schema actually disagreed at. the validator is doing the work of localizing the bug for you. 
it's the same shape as cheap-then-careful everywhere else try the fast path, fall back on evidence. (this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.) 4/ shape invariants and relational invariants need different fixes. the four repairs above all handle shape problems wrong type, missing key, wrong container. but read_file had a *relational* invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling `readFile({ absolutePath, limit: 30 })` and getting an `ERROR:` back. you can't fix this with input repair, because each field is independently valid the bug is in the relationship between them. so i taught the function the model's intent instead. `limit` alone → `offset = 0`. `offset` alone → `limit = 2000` (matches common read tool ops default). then surfaced the decision back to the model in the result: "Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit." no `Error:` prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong. transparency over silent magic wins big. repair where you can. extend semantics where you can't. surface the choice either way. zoom out: a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are linient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions. 
four small repairs (i'm sure more to follow as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be. deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals. imo "skill issue" applies to the harness more often than the model.
MrAhmadAwais @MrAhmadAwais

Wow I just made DeepSeek V4 Pro beat Opus 4.7 6/10 times in our internal evals by auto repairing many of its quirks in tool calling. It’s performing super solid for such a cheap model.

mr-r0b0t @mr_r0b0t
Here's what 96 concurrent @NousResearch Hermes Agents (using 382,745,618 tokens over 171,136 API calls to deepseek-v4-pro) can generate for you in less than 3 days. From your M4 Macbook Air 24GB, on hotel wifi! 81% cache hit rate in case you're wondering! https://t.co/9nSdx3KvxG
Ashwin Gopinath @ashwingop
Company Brain, Part 3: Interaction Memory
vmiss @vmiss33
What I Use Hermes Agent For (And How I Use It)
David Strolder @Davidstrolder
Talk directly to the consciousness of your favorite book. 📚 • Get life advice • Debate their deepest ideas • Vent about your problems
AboveSpec @above_spec
RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used!🤯 Special thanks to @MakJoris for sharing ik_llama.cpp with us! Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵
Browser Use @browser_use
RT @mamagnus00: browser-harness exploded. some said AGI is here. but what’s the right interface? introducing Browser Use Desktop. open-sou…
Jeffrey Emanuel @doodlestein
It's now been around 4 months since my open-source dcg tool was first released, and I know from hearing from tons of users that it has saved countless people from disaster at the hands of overeager Claude Code agents. I've continued to make various performance improvements and added additional preset packs to the project, most recently for the Railway API after the recent and infamous incident where someone blamed Claude for wiping their production database. Because of the way dcg is implemented as a "pre-tool-use hook" in Claude Code, there was no way to use it in Codex, since Codex didn't support that kind of hook at all. Until a week or so ago, when they finally added it. So I'm now pleased to say that the latest version of dcg has full support for Codex (plus it also works for gemini-cli if anyone is really using that outside of the 'Plex!). If you're not familiar with dcg yet, I highly recommend checking it out. It's unthinkable to me now to use any coding agent that doesn't support it; it feels like speeding on the highway without a seatbelt on (or more accurately, with a sharp knife strapped to the steering wheel pointed at your heart). Agents just can't be trusted to not occasionally do crazy things that seem sensible to them at the moment, but which are wildly destructive and often irreversible. These bouts of temporary madness often occur soon after compactions, or as a result of context rot caused by excessively long sessions. Not only does dcg mechanically prevent the agents from being able to do that, it explains to them why it did that specifically, and offers them safe alternatives custom-tailored to the specific commands they tried to run. The more agents you have running at the same time on the same project, the more dcg goes from a nice thing to have to being totally indispensable if you don't want to constantly worry about one rogue agent wiping out the work of the other agents with a misguided "git reset --hard HEAD" command. 
The dcg utility itself is written in hyper-optimized, memory-safe Rust and uses minimal system resources. Because it's totally mechanical (unlike the auto-approve feature in Claude Code, which uses an AI model that adds latency), you can't even notice any delay from it running on every command. dcg is NOT just a cookbook of canned forbidden commands; frontier models are too smart and resourceful to actually be constrained by such a simplistic approach. When they're prevented from running a command one way, they'll try another way; if that also doesn't work, they'll whip up an ad-hoc Bash script or Python program to do what they want. But dcg can detect that as well using its advanced ast-grep mode (which only kicks in when dealing with such heredoc scripts, so that the faster regex-only path can be used when applicable). It's also very quick and easy to expand and customize dcg by creating your own custom preset packs to add to the 50 or so included packs. Just ask Codex to study the existing presets and explain what you want to protect against in your own custom API or tooling, or in a third-party project that's not currently included by default in dcg. So, remember: Friends don't let friends vibe code without dcg. Protect yourself from your agents, and protect them from themselves. You can get it here: https://t.co/aVmEBi9WCd It installs in under a minute on Linux or Mac using the curl-bash one-liner command shown in the README, and automatically detects any supported agent harnesses installed on your machine and configures them for you to use dcg. And if you decide it's not for you, it can be fully uninstalled in seconds using the provided command.
doodlestein @doodlestein

Agent coding life hack: I’m 100% convinced that there are hundreds of thousands of developers out there who would love and use my dcg tool if they only knew about it. dcg: destructive_command_guard This is a free, open-source, highly-optimized rust program that runs using pre-tool hooks in Claude Code (CC) and checks the tool call that CC was about to make to see if it’s potentially destructive; that is, could delete data, lose work, drop tables, etc. Get it here and install with the convenient one-liner: https://t.co/aVmEBi9WCd A tool like dcg has several competing goals that make it a careful balancing act and tough engineering problem: 1. Since it runs for every single tool call, it must be FAST. Hence why it is written in Rust and an extreme amount of focus has been placed on making it as fast as possible. 2. It must avoid annoying false positives that waste your time, add friction, and re-introduce you as the bottleneck unnecessarily. I run dozens of agents at once and don’t want them wasting time waiting for me unless it’s needed. Usually, the messages from dcg are enough to get the agent to be more thoughtful about what it’s doing. 3. It’s not enough to just use a simple rulebook where you look for canned commands like “rm -rf /” or “git reset --hard HEAD.” The models are very resourceful and will use ad-hoc Python or bash scripts or many other ways to get around simple-minded limitations. That’s why dcg has a very elaborate, ast-grep powered layer that kicks in when it detects an ad-hoc (“heredoc”) script. But wherever possible, it uses much faster simd optimized regex. 4. A tool like this should really be expandable and have semantic knowledge of various domains and what constitutes a destructive act in those domains. For instance, if you’re working with s3 buckets on aws, you could have a highly destructive command that doesn’t look like a normal delete. 
That’s why dcg comes out of the box with around 50 presets which can be easily enabled based on your projects’ tech stacks (just ask CC to figure out which packs to turn on for you by analyzing your projects directory). 5. dcg is designed to be very agent friendly. It doesn’t just block commands, it explains why and offers safe alternatives based on an analysis of the specific command used by the agent. For instance, it might stop the agent from deleting your Rust project’s build directories but suggest using “cargo clean” instead. Often, these messages are enough to knock sense into Claude. I really can’t exaggerate just how much time and frustration dcg has already saved me. It should be known and used by everyone who has had these kinds of upsetting experiences with coding agents. dcg is included along with all my other tooling in my https://t.co/N4As0kJTQP project. All free, MIT licensed, with extensive tutorials and other educational resources for people with less experience. Give it a try, you won’t regret it!