Bun Gets AI-Rewritten in Rust as Agent Tooling Wars Heat Up Across Codex, Pi Agent, and Hermes

May 11, 2026 · 20 sources

The AI agent ecosystem is fragmenting fast, with developers benchmarking Codex goal mode, Pi Agent, and Hermes against each other while sharing hard data on error rates and token efficiency. Meanwhile, Bun's 960K-line Rust rewrite lands as proof of AI-assisted large-scale code migration, and hardware discussions around memory demand and analog semiconductors signal infrastructure bottlenecks ahead.

Daily Wrap-Up

The dominant story today isn't any single product launch or model release. It's the emerging picture of what the daily workflow of an AI-augmented developer actually looks like in mid-2026, and how messy it still is. We're seeing real usage data now, not benchmarks, and the numbers are sobering. @kunchenguid reports that 68% of the agent-made changes in his projects contained a mistake. @neural_avb describes how hard it is to evaluate complex agent harnesses with anything beyond unit tests on individual modules. Even the success stories, like @doodlestein taking a site to a perfect score of 100 with react-doctor after roughly 15 hours of autonomous Codex work, quietly reveal how much unsupervised compute time these tools demand. The vibe has shifted from "AI will replace developers" to "AI requires a new kind of developer discipline," and Aaron Levie is hiring for exactly that kind of role at Box.

The most entertaining moment today was the Bun-in-Rust saga. @jarredsumner says he rewrote Bun from Zig to Rust in six days with AI assistance, a 960,000-line rewrite that passes 99.8% of the existing test suite. @trq212 summed up the community reaction: "we're not being ambitious enough." It's a striking demonstration of what AI-assisted code migration looks like at scale, though the promised blog post about the process will probably be more instructive than the raw numbers. On the hardware side, the conversation about memory scaling (25x more memory per accelerator times 25x more accelerators = 625x demand by 2028) is a useful reminder that the AI boom has very physical infrastructure constraints.

The most practical takeaway for developers: invest time in your agent configuration files and evaluation workflows. Whether it's CLAUDE.md rules, no-mistakes tooling, or unit tests for individual agent modules, the evidence today strongly suggests that the quality ceiling of AI coding tools is determined more by your scaffolding than by the underlying model.

Quick Hits

@grok announced Grok Voice Think Fast 1.0, a voice model built for multi-step workflows with "snappy responses and high accuracy." Light on details, heavy on marketing.

@MilksandMatcha shared her path from zero AI knowledge to working at Exa AI, recommending Karpathy's intro to LLMs video and Claude cowork as entry points for non-technical folks.

@Raytargt posted a viral story about a trucker's 19-year-old son allegedly making $31K/month from a Shopify store built with "six AI prompts." The engagement-bait format speaks for itself.

@k1rallik reported that Forza Horizon 6 leaked 10 days early because Microsoft uploaded unencrypted preload files to Steam, the second AAA game this year to suffer this exact failure.

@pashmerepat asked the community if anyone has "a good skill for writing tasteful docs that are a joy to read," a question that resonates with anyone who's watched AI generate technically correct but soulless documentation.

@loganthorneloe retweeted @thdxr's announcement of a library that abstracts over all the LLM providers and tackles the quirks between their API implementations. The multi-provider integration problem is real and growing.

The Agent Tooling Wars: Codex, Pi Agent, and Hermes Go Head to Head

The AI agent space is in a fascinating competitive moment where multiple tools are jockeying for developer mindshare, and users are doing their own benchmarking rather than trusting marketing claims. Today's feed was dominated by hands-on comparisons and configuration experiments across at least three major agent platforms.

The token efficiency race is particularly revealing. @pseudokid ran a direct comparison between Pi Agent and OpenCode, finding that Pi Agent consumed just 1,114 input tokens on a first turn versus 11,585 for OpenCode, even after OpenCode had been trimmed down from its usual 30K or so. The two runs used different models (kimi-k2.6 on Pi Agent, deepseek-v4-flash on OpenCode), which is why a roughly 10x gap in tokens shows up as only a 2x gap in cost ($0.0008 vs $0.0016). That may seem trivial per query, but at scale these differences compound fast. As pseudokid noted: "I think using better models with capped tokens per turn can keep usage nearly the same as uncontrolled DeepSeek V4 Flash? The challenge now is finding the sweet spot. Can't cap tokens if quality drops."
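The arithmetic behind that comparison is easy to check. The sketch below uses only the input-token counts and costs reported in @pseudokid's post (11,585 and 1,114 tokens; $0.0016 and $0.0008). The dictionary layout and function name are illustrative, and the per-million-token rates are merely implied by those rounded figures under the assumption that input tokens dominate the cost of the turn; they are not quoted provider prices.

```python
# First-turn figures as reported in @pseudokid's post.
runs = {
    "OpenCode (deepseek-v4-flash)": {"input_tokens": 11585, "cost_usd": 0.0016},
    "Pi Agent (kimi-k2.6)": {"input_tokens": 1114, "cost_usd": 0.0008},
}

opencode = runs["OpenCode (deepseek-v4-flash)"]
pi = runs["Pi Agent (kimi-k2.6)"]

# How much larger the OpenCode first turn is, in tokens and in dollars.
token_ratio = opencode["input_tokens"] / pi["input_tokens"]
cost_ratio = opencode["cost_usd"] / pi["cost_usd"]


def implied_rate_per_million(run):
    """Effective USD per 1M input tokens implied by one reported turn.

    Assumption: the reported cost is dominated by input tokens.
    """
    return run["cost_usd"] / run["input_tokens"] * 1_000_000


for name, run in runs.items():
    rate = implied_rate_per_million(run)
    print(f"{name}: ~${rate:.2f} per 1M input tokens (implied)")

print(f"token ratio: {token_ratio:.1f}x, cost ratio: {cost_ratio:.1f}x")
```

Under those assumptions the pricier model ends up cheaper per turn only because the harness sends about a tenth of the tokens, which is the trade-off the post is probing.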

On the Codex side, @Saboo_Shubham_ highlighted the Codex /goal mode combined with the Hermes agent as "life-changing," while @doodlestein shared a concrete success story using Codex /goal mode with react-doctor to take his site from a low score to a perfect 100 after about 15 hours of autonomous work. @zeroxkyle took a different approach, force-feeding the Hermes agent every resource he could find on X and having it rank them. Meanwhile, @evilpsycho42 contributed a surprising finding: after copying Codex's exec_command and write_stdin tools into Pi Agent and comparing them against the plain bash tool, the async bash approach "almost lost in every task." These micro-level implementation details matter enormously for agent reliability.

The broader picture here is that we're past the "wow, agents can code" phase and deep into the "which agent architecture actually works for my workflow" phase. The ecosystem is maturing through real-world friction, not theoretical benchmarks.

Agent Quality and the 68% Problem

While agent competition heats up, the quality conversation is getting more honest. The most striking data point today came from @kunchenguid, who shared real project data showing that "a whopping 68% of the changes made by agents had a mistake in it, and got saved by no-mistakes." The biggest gap? Changes made without updating related documentation, followed by problems caught via code review. As Kun put it: "I now can't imagine how I could have kept my codebases in order if I didn't have no-mistakes. They would absolutely have turned into a sloppy mess even with the best models we have today."

This dovetails with @levie's announcement that Box is creating "AI automation engineering" roles, essentially acknowledging that deploying agents in enterprise settings requires dedicated technical staff. Levie framed it clearly: "You need to ensure agents have the right context and data to work with, wire up systems to agents in a safe and secure way, ensure that the agents are producing quality output, design the end-state workflow where and how humans will be in the loop." This isn't a side project; it's a new job category.

The tension between agent capability and agent reliability is the defining challenge of this moment. Models keep getting more capable, but the scaffolding, guardrails, and human-in-the-loop design around them are where the real engineering work lives.

Claude Code Workflows and the CLAUDE.md Meta

A subset of today's discussion focused specifically on Claude Code optimization, with practitioners sharing workflow patterns and configuration strategies. @mattpocockuk described an elegant flow cycling between /grill-with-docs, /prototype, and /rewind commands: "Iterate on the prototype, burning tokens freely until we get a good spot, then /rewind to the question and select 'summarize.' Continue the grilling session, retaining the prototype." It's the kind of multi-step workflow that feels like it was designed by someone who actually uses the tool all day.

@Mnilax extended Karpathy's CLAUDE.md philosophy with a bold claim: he says Karpathy's 4 rules cut his own Claude error rate from 41% to 11% across 30 codebases, and that adding 8 more rules, tuned to failure modes that surfaced after January, brought it down to 3%. The framing is memorable: "A CLAUDE.md does not raise Claude's IQ. It lowers his slop floor. That is the entire game." Whether or not those self-reported numbers hold up to scrutiny, the principle is sound: context engineering through configuration files is becoming a core developer skill.

@neural_avb brought the evaluation perspective, describing the challenge of testing complex harnesses that combine question answering, user profiling, and paper exploration with multiple subagent trajectories. The pragmatic conclusion: "Unit testing is my currently preferred way to iterate quickly on improving complicated harnesses. Another reason I use subagents quite a lot, coz it not only helps in cleaning context history on main agent, but also easier to evaluate." The subagent pattern keeps emerging as both a quality and an evaluation strategy.
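To make the "unit test individual modules" idea concrete, here is a minimal sketch of what such a test might look like for a retriever-style subagent. Everything in it is hypothetical: the word-overlap retriever, the function names, and the sample paragraphs are stand-ins invented for illustration, not @neural_avb's harness. The point is only that a narrow module with deterministic behavior can be checked cheaply, without running the full multi-trajectory agent.

```python
import re


def _words(text):
    """Lowercased alphanumeric tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_paragraphs(question, paragraphs, top_k=2):
    """Hypothetical retriever module: rank paragraphs by word overlap."""
    q = _words(question)
    scored = [(len(q & _words(p)), i) for i, p in enumerate(paragraphs)]
    # Highest overlap first; ties keep the original paragraph order.
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [paragraphs[i] for score, i in ranked[:top_k] if score > 0]


# Invented sample "paper" used only as a test fixture.
PAPER = [
    "We train the model with a contrastive loss.",
    "The dataset contains ten thousand labeled images.",
    "Related work covers earlier retrieval systems.",
]


def test_relevant_paragraph_ranks_first():
    hits = retrieve_paragraphs("Which loss is used to train the model?", PAPER)
    assert hits[0] == PAPER[0]


def test_no_overlap_returns_nothing():
    assert retrieve_paragraphs("quantum gravity", PAPER) == []


# Tiny runner so the sketch works without a test framework installed.
for check in (test_relevant_paragraph_ranks_first, test_no_overlap_returns_nothing):
    check()
print("retriever checks passed")
```

A real retriever would be model-backed and less deterministic, so tests like these would more likely assert on properties (a cited paragraph exists, the result count is bounded) than on exact outputs.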

Bun's 960K-Line Rust Rewrite and What It Means

Perhaps the most jaw-dropping technical story today was the Bun runtime being rewritten from Zig to Rust with heavy AI assistance. @trq212 shared that Jarred Sumner's rewrite passes 99.8% of the existing test suite, adding: "we're not being ambitious enough." The quoted thread from @jarredsumner provided context: "This is a 960,000 LOC rewrite, the code truly works, passing the test suite on Linux and soon other platforms. E2E I started working on this 6 days ago. This would've been a massive amount of work by hand."

Six days for nearly a million lines of working code is a landmark for AI-assisted migration, even with the caveat that Sumner himself noted "it wasn't just 'Claude, rewrite Bun in Rust, make no mistakes.'" The promised blog post detailing the actual process will likely become required reading. This is the kind of project that would have been a multi-quarter effort for a team just two years ago, and it signals that large-scale language migrations may become routine rather than exceptional.

Hardware Bottlenecks: Memory, Analog, and Inference

Today's hardware discussion painted a picture of physical constraints pressing against AI's software ambitions. @antirez shared early results running DS4 on NVIDIA's DGX Spark (GB10/CUDA), reporting 12 tokens/sec inference limited by the system's 270GB/sec memory bandwidth, with prefill around 200 t/s "more aligned to M3 Max." It's a useful real-world data point for anyone evaluating local inference hardware.
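One rough way to read those numbers, assuming single-stream decoding is purely memory-bandwidth-bound (a common simplification, not something @antirez states): each generated token requires streaming a roughly fixed amount of data from memory, so throughput is approximately bandwidth divided by data read per token. The sketch below inverts that relation using the two reported figures. The function names are illustrative, and the doubled-bandwidth scenario is a hypothetical added for contrast.

```python
def implied_gb_read_per_token(bandwidth_gb_s, tokens_per_s):
    """Data streamed per generated token if decode is bandwidth-bound."""
    return bandwidth_gb_s / tokens_per_s


def decode_ceiling_tokens_per_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on decode speed under the same bandwidth-bound assumption."""
    return bandwidth_gb_s / gb_read_per_token


# Reported figures: 270 GB/sec memory bandwidth, 12 tokens/sec decode.
implied_gb_per_token = implied_gb_read_per_token(270, 12)
print(f"implied data read per token: {implied_gb_per_token} GB")

# Hypothetical: the same workload on a system with twice the bandwidth.
doubled = decode_ceiling_tokens_per_s(540, implied_gb_per_token)
print(f"ceiling at 540 GB/sec (hypothetical): {doubled} tokens/sec")
```

Prefill behaves differently because it processes many tokens per pass over the weights and tends to be compute-bound, which is consistent with the post reporting a much higher prefill rate on the same hardware.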

The memory scaling conversation got the most detailed treatment via @fejau_inc quoting @AlexCorrino's analysis: Michael Dell's framing of 25x more memory per accelerator combined with 25x more accelerators yields a staggering 625x memory demand increase by 2028. The argument that memory manufacturers are transitioning from commodity cyclicals to high-margin growth stories mirrors NVIDIA's own trajectory. @vikramskr added a complementary angle from the analog semiconductor world, noting after 15 years in the field that "perhaps its moment is finally here," warning that the nuance in analog design far exceeds digital's transistor-on-or-off paradigm. The AI infrastructure buildout is creating demand across the entire semiconductor stack, not just GPUs.
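The 625x figure is simply the product of two growth factors, which makes it very sensitive to either input. The sketch below reproduces the calculation from the numbers quoted in @AlexCorrino's thread (80GB of HBM on H100, about 2TB per accelerator by 2028, roughly 25x more accelerators). Treating "~2TB" as exactly 2,000 GB is an assumption made here, and the alternative scenarios in the loop are hypothetical inputs for illustration, not forecasts from the thread.

```python
H100_HBM_GB = 80          # quoted in the thread
ACCEL_2028_HBM_GB = 2000  # "~2TB" from the thread, assumed to mean 2,000 GB


def demand_multiplier(memory_per_accelerator_growth, accelerator_count_growth):
    """Total accelerator-memory demand growth as the product of both factors."""
    return memory_per_accelerator_growth * accelerator_count_growth


per_accelerator_growth = ACCEL_2028_HBM_GB / H100_HBM_GB
headline = demand_multiplier(per_accelerator_growth, 25)
print(f"per-accelerator growth: {per_accelerator_growth:.0f}x")
print(f"headline demand growth: {headline:.0f}x")

# Hypothetical sensitivity check: smaller growth in either factor.
for mem_growth, count_growth in [(25, 10), (15, 25), (15, 10)]:
    total = demand_multiplier(mem_growth, count_growth)
    print(f"{mem_growth}x memory * {count_growth}x accelerators = {total}x")
```

Because the factors multiply, the headline number shrinks quickly if either estimate proves optimistic, though even the most conservative row in this sketch is a triple-digit multiple.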

Sources

Sarah Chieng @MilksandMatcha · May 9

I didn't study a lick of AI in college. I had studied computer science at MIT, but because I had been interning and working in quant trading, it was much more statistically focused than anything related to machine learning. Joining Exa AI early was my first exposure, and I still appreciate Jeff Wang for his patience, especially in the early days when I had never even heard of RAG. For anyone else looking to get started, I would really recommend watching Andrei Karpathy's 'intro to LLMs' 1hr comprehensive deep dive. Even though it is now two years old, it remains highly relevant. And while really powerful agents exist, Cursor/Windsurf are still a really good first step for going from something you are more familiar with, like the traditional IDE, to understanding how AI can accelerate your current work. And then, in terms of getting more hands-on, I absolutely love claude cowork, especially if you are non-technical, just to get a glimpse at the power of AI in your day-to-day life.

MilksandMatcha @MilksandMatcha

I summarized @karpathy's 1 hour "Intro to LLM" video into 3 minutes :) Personally, trying to get better at summarizing and communicating through both notes and verbally! Covers: > what is an LLM > how to build an LLM > limitations > future of LLMs my notes summary: https://t.co/RdVgqkWc1w Thanks to @scikud and @HarperSCarroll for their help :)

Grok @grok · May 9

Introducing Grok Voice Think Fast 1.0 — a state-of-the-art voice model built for complex, multi-step workflows with snappy responses and high accuracy. Try it for free today

AVB @neural_avb · May 10

I have been doing some eval work for some narrow harnesses this week. It’s easy to do it when working with verifiable environments, but in free text harnesses it’s very annoying (you can’t blindly trust LLM judges) One of the main reasons to do evals is to do ablation studies around different context engineering modules. I find it much easier to do unit tests (ie separately write evals for individual modules), but truth is that it is not enough. The paper breakdown harness is a peculiar case coz it does question answering, user profiling, and paper exploration simultaneously. For example, it has a retriever subagent to fetch relevant paragraphs so we can cite them later, web search tools, reading paper sections directly, a Redis cache to save auxiliary data, generate diagrams, and tools to interact with website internals: update goals, chat memory, read user profile, query database to create paper recommendations on the fly. There are a ton of trajectories to fulfill user requests and many of them are very good. Writing an end to end harness eval would probably require me to capture both tool call frequency and response quality for a bunch of papers x prompts. Running it one time for a baseline benchmark is fine, but it’s just expensive to do ablation studies with. So yeah, unit testing is my currently preferred way to iterate quickly on improving complicated harnesses. Another reason I use subagents quite a lot coz it not only helps in cleaning context history on main agent, but also easier to evaluate.

neural_avb @neural_avb

https://t.co/3xw3NjttVd

antirez @antirez · May 10

DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this system, at 270GB/sec. But prefill is way more aligned to M3 Max at ~200 t/s. I'll release when more mature, but it is almost sure that it will get merged. https://t.co/LVYSDQ4Hnp

Mnimiy @Mnilax · May 10

Karpathy threw a grenade at every senior engineer who still treats LLMs as a toy. his actual words: the worst thing an expert can do right now is reject them. most experts read it as a threat, but it's advice. his framing: > the gap between "AI tools are bad" and "AI tools are useful when used right" is professional discipline, not capability > agents have cognitive deficits. they fail in ways nothing in the training set anticipated > the experts who reject LLMs lose to experts who learn to wrangle them > "models have so many cognitive deficits. but you can route around them" routing around the deficits is what CLAUDE.md was invented for. Karpathy himself wrote 4 rules. across 30 codebases they took my Claude error rate from 41% down to 11%. solid drop. but his rules pre-date the slop era going public. I bolted on 8 more, tuned to the failure modes that surfaced after January. got it down to 3%. a CLAUDE.md does not raise Claude's IQ. it lowers his slop floor. that is the entire game. open the article underneath. the model is not the bottleneck. your config is.

Mnilax @Mnilax

Karpathy's 4 CLAUDE.md rules cut Claude mistakes from 41% to 11%. After 30 codebases, I added 8 more

Kyle @zeroxkyle · May 10

Over the weekend I set up the Hermes agent, and basically force fed it every resource I could find on X to upgrade it. I then asked it to rank each resource and provide a simple explanation: So yea, here you go. Link to each is below https://t.co/09IpgDvQWV

Jeffrey Emanuel @doodlestein · May 10

This codex /goal mode iteration with react-doctor worked really well. Went from a low score on my site to a perfect score of 100 in around 15 hours of fully-autonomous work. Worth doing for any react web app. https://t.co/KayBb1k2mp

raymel 👋 @pseudokid · May 10

Pi Agent vs OpenCode token usage A lot of people recommended Pi Agent so I decided to check Pi Agent took 1.1k tokens in first turn OpenCode took 11.5k Setup: 1) Trimmed OpenCode (from usual 30k first turn to 11.5k) - 0 MCPs - 2 lightweight plugins (opencode-env-protect and openslimedit) - 8k char AGENTS.md - 11585 deepseek-v4-flash input tokens - for $0.0016 2) Vanilla Pi - 0 MCPs - 0 system prompt - 8k char AGENTS.md - 1114 kimi-k2.6 input tokens - for $0.0008 I think using better models with capped tokens per turn can keep usage nearly the same as uncontrolled DeepSeek V4 Flash? The challenge now is finding the sweet spot. Can't cap tokens if quality drops. We'll see. Video below - OpenCode vs Pi side by side "say hi back" first turns test, with OpenCode Go usage for each

pseudokid @pseudokid

Sorry guys, I'm now a Codex user I love my OpenCode, but it eats my ChatGPT limits thrice than Codex Desktop will ever be in a day I still love TUIs but GUIs aren't bad either https://t.co/eDtze10aAB

Matt Pocock @mattpocockuk · May 10

A flow I just tried and LOVED: 1. /grill-with-docs, talking about a new bit of UI 2. Asks me a question I can't answer unless I prototype 3. /prototype 4. Iterate on the prototype, burning tokens freely until we get a good spot 5. /rewind to the question, and select 'summarize' (Claude Code feature), saying 'summarize what we learned from prototyping' 6. Continue the grilling session, retaining the prototype Smoooooooth

KKY @evilpsycho42 · May 10

You are right @badlogicgames I copied codex exec_command and write_stdin into Pi Agent. Then compared its performance to the plain bash tool. The result surprised me. Async bash almost lost in every task. https://t.co/LzzJjjqYmD

Vikram Sekar @vikramskr · May 10

I have worked 15 years in the world of analog semis, and perhaps its moment is finally here. I cannot begin to explain how much nuance exists in this field. Transistor on or off is >90% of the semi world, analog is more- much more. Semi tourists better brace for impact.

Thariq @trq212 · May 10

Jarred tried rewriting Bun in Rust and it passes 99.8% of the existing test suite we're not being ambitious enough

jarredsumner @jarredsumner

there will be a blog post about this. on what this means for bun, benchmarks, memory usage, maintainability going forward, and also the literal process of doing this (it wasn’t just “claude, rewrite bun in rust. make no mistakes”) this is a 960,000 LOC rewrite, the code truly works, passing the test suite on Linux and soon other platforms. e2e I started working on this 6 days ago. this would’ve been a massive amount of work by hand.

BuBBliK @k1rallik · May 10

FORZA HORIZON 6 LEAKED 10 DAYS EARLY AND MICROSOFT IS TO BLAME Someone forgot to encrypt the preload on Steam. The full game - all 155GB - is now freely available on piracy sites before a single paying customer has touched it. -> The leak happened because files were uploaded to Steam completely unencrypted -> 155GB of assets, cars, map data - everything is out there right now -> This is the SECOND time this year a AAA game leaked this way (Death Stranding 2 in March) -> FH6 doesn't release until May 19 - pirates have a 10-day head start on buyers Valve or Microsoft need to explain how this keeps happening.

videotech @videotech

Forza Horizon 6 got leaked early on PC 4 days early 💀Someone at Microsoft forgot to encrypt the pre-download, what an insane screw up

Raytar @Raytargt · May 10

A trucker I know hasn't spoken in three days. His kid just bought a BMW in cash. He's 19. He drives Charlotte to Jacksonville. Has for 23 years. $54,000 a year. Never complained. Never asked for more. Last Tuesday his son called from a BMW dealership in Miami. $48,000. Paid cash. The kid built a Shopify store with AI. Six prompts in a chat window. Nine weeks ago he had nothing. He FaceTimed his dad — showed him the Shopify dashboard on his phone, then walked outside and pointed the camera at the BMW. $31,000 last month. $12,537 the month before. All from his bedroom. His son made in 9 weeks what he makes in 14 months. He asked one question. "How much did this cost you?" "Twenty-one dollars, dad." He sent me the six prompts that night. They're in the article below. You'll either paste them or scroll past them.

Raytargt @Raytargt

$31,247/month on Shopify. Claude replaced a $3,000/month team. Here's every prompt.

Kun Chen @kunchenguid · May 10

sharing some real data from me using agents on real projects. all my code changes are done with either opus 4.7 or gpt 5.5 a whopping 68% of the changes made by agents had a mistake in it, and got saved by no-mistakes biggest source of gaps was a change made without updating related documentation, followed by problems caught via code review i now can't imagine how i could have kept my codebases in order if i didn't have no-mistakes - they would absolutely have turned into a sloppy mess even with the best models we have today

pash @pashmerepat · May 10

Does anyone have a good skill for writing tasteful docs that are a joy to read?

fejau @fejau_inc · May 11

I attribute a lot of my success in the AI trade to being able to bounce ideas off smart people like Alex - appreciate him and this thread 🫡

AlexCorrino @AlexCorrino

"Memory is cyclical, everyone knows that, and the recent run up in memory names is an obvious bubble." That's the easy, reflexive view. But I think the people who hold it are missing the simple scale of what AI is doing to memory demand. The first clue that there might be more to the memory story came in January of this year when it came out that NVDA's next gen Rubin platform would require 16 TB of NAND per GPU, or 1152 TB per rack, and that required HBM bandwidth for the system would be 70% higher than what had been previously reported. That was the first time it became obvious to outside observers that memory would need to scale exponentially to keep up with already-known GPU demand. One under-appreciated fact is that while GPU compute has largely scaled with Moore's Law (doubling in compute ~every 2 years), memory density and speed hasn't. As GPU compute continues to scale, existing memory manufacturers must produce exponentially more chips. These chips will also need to be faster than ever, which introduces an incredible technical challenge: how can memory manufacturers find the required speed improvements that have eluded them for decades? When you combine this added technical complexity with an exponentially expanding demand for the product, memory starts to look less like the "commodity" everyone knows it to be, and much more like a high-margin proprietary chip. This hasn't even touched on memory's role in inference (compute needed for inference is expanding exponentially as well, and is highly memory-dependent), long context, etc. Agentic AI requires agents to pull massive amounts of data into their context, which increases the number of tokens per "turn" and also the amount of memory required to run them. True agentic systems will require both dramatically higher context, and also many more "turns" or iterations of each task (as they improve an output over and over until it reaches a target quality level). 
Longer context = more memory per workload, and more "turns" = more workload per output. To put a specific number on that, Micron SVP Jeremy Werner said recently on The Circuit that agentic AI is causing context length to grow 30x a year. Michael Dell recently framed the problem in extremely simple terms: H100 had 80GB of HBM; by 2028, accelerators could carry ~2TB. That is 25x more memory per accelerator. Over the same period, he expects roughly 25x more accelerators deployed. That's 25 x 25 = 625x more accelerator memory demand by 2028. Everyone knows memory stocks are cyclical, and they always look cheap right before the bubble bursts. But what if there are structural changes happening in the memory markets that could prove the consensus wrong? Does anyone remember another traditionally cyclical company that has rerated to a growth story due to the demand from AI? Hint: It's now the most valuable company in the world. Reminder: this is not a recommendation to buy or sell any securities. It's a framework for thinking about how the AI buildout may be changing the memory market.

Logan Thorneloe @loganthorneloe · May 11

RT @thdxr: we're working on a library to abstract over all the llm providers there's very few teams that have dealt with the quirks betwee…

Aaron Levie @levie · May 11

As advanced agents move from coding to the rest of knowledge work, it takes a real amount of work and know-how to get right. You need to ensure agents have the right context and data to work with, wire up systems to agents in a safe and secure way, ensure that the agents are producing quality output, design the end-state workflow where and how humans will be in the loop, maintain the agents when there are model and system upgrades, and more. This isn’t a side project or something you can just do on nights and weekends. You need to design and develop robust agents that will be used in mission critical workflows. It’s a highly technical job, very much akin to a forward deployed engineer for internal functions. This is why, at Box, we’re starting to hire for AI automation engineering roles. This a technical role that will partner with the business directly and help augment how they work to drive even more output, and deliver better experiences for employees and ultimately customers. This is just one example of the kind of role that AI will start to open up in the future. I expect most companies will have many flavors of this going forward.

Shubham Saboo @Saboo_Shubham_ · May 11

RT @Saboo_Shubham_: Codex /goal With Hermes Agent is life-changing. I updated my Hermes agent skill to give Codex a goal on the fly using…