Codex /goals Feature Sparks Agent Marathon Sessions While DeepSeek Beats Opus Through Harness Fixes

May 4, 2026 · 39 sources

The AI coding agent ecosystem dominated today's discourse, with OpenAI's Codex /goals feature enabling hour-long autonomous builds and a detailed breakdown of how tool-calling repairs made DeepSeek V4 Pro outperform Opus 4.7. Meanwhile, local inference benchmarks on sub-$500 GPUs and the growing Hermes agent community signal a maturing landscape where the bottleneck is shifting from model capability to tooling and taste.

Daily Wrap-Up

The theme running through today's feed is unmistakable: agents are no longer a demo, they're a workflow. Multiple posts highlighted Codex's new /goals feature, which lets an AI agent work autonomously toward a defined objective for hours at a stretch, essentially turning coding sessions into fire-and-forget missions. But the more interesting signal came from the edges of the ecosystem. @MrAhmadAwais published a genuinely technical deep-dive showing that DeepSeek V4 Pro beat Opus 4.7 six times out of ten on his team's internal evals once the harness repaired four recurring tool-input mistakes. The implication: "model capability" is partly an artifact of how well or poorly we wrap the model. Open-source models aren't dumber, they're just less pampered.

The other undercurrent was responsibility. @mattpocockuk raised the uncomfortable question of what to do when a teammate uses AI negligently, treating generated code as finished code. @doodlestein's destructive_command_guard tool, now supporting Codex and Gemini CLI, exists precisely because agents occasionally go rogue and run commands like git reset --hard HEAD with full confidence. The fact that a safety tool for coding agents now has a devoted following tells you everything about where we are in the adoption curve: past the honeymoon, into the "how do we not blow things up" phase.

The most entertaining moment was easily the Maya saga. A 21-year-old college student allegedly generated $43,000 in 30 days on OnlyFans using an AI persona built from four markdown files, Claude for messaging, Flux for images, and ElevenLabs for voice notes. His roommate reportedly filed a dorm transfer request after hearing 3 AM audio clips from a person who doesn't exist. Whether you find it dystopian or darkly hilarious, it's a vivid illustration of how cheaply synthetic personas can now be manufactured.

The most practical takeaway for developers: if you're working with open-source models and hitting tool-calling failures, don't blame the model. Implement a validate-then-repair pattern that catches the four common input mistakes (null instead of omission, stringified arrays, object-for-array wrapping, bare strings for arrays) and you may find your "bad" model is suddenly competitive with frontier options.

Quick Hits

@nicopreme shared a "boomerang mode" for the Pi agent framework that rewinds context after each prompt while keeping file changes, effectively making context windows feel unlimited.

@joenandez made the case that memory should belong to your work, not to individual agents, pointing toward shared memory layers across agent systems.

@ashwingop published Part 3 of his "Company Brain" series, focused on interaction memory and how decisions actually happen in meetings, messages, and emails.

@sheriyuo recommended a reinforcement learning paper as "exceptionally well-written," quoting @willccbb's post.

@aresotik highlighted Voice-Pro, a local tool that downloads YouTube videos, strips vocals, transcribes, translates to 100+ languages, clones the original voice, and redubs. All free and local.

@aerockrose summarized 12 lessons from Ben Horowitz's Sequoia talk on founder failure modes, centered on decision debt and avoiding hard truths.

@Davidstrolder pitched a "talk to your favorite book" chatbot for life advice and idea debates.

@ArtyfactsAI announced their public alpha is now open.

@LottoLabs encouraged people to download and run a heritage/genealogy tool.

@Starlink promoted Starlink Mini for portable internet.

@usebagel launched Bagel, a lightweight CRM for managing outreach directly on X profiles.

Agents Go Autonomous: Codex /goals and the Multi-Agent Explosion

The single biggest cluster of conversation today revolved around AI agents moving from assisted coding to genuinely autonomous operation. OpenAI's Codex /goals feature was the catalyst, with multiple users reporting extended sessions where the agent worked for over an hour without human input. @AlexFinn captured the excitement: "You give it a goal, then it works endlessly until the goal is complete. It's like a Ralph loop. Can run for days." He described Codex building an entire extraction shooter game, complete with AI-generated assets, from a single detailed prompt. @elijahmuraoka_ reported "pretty incredible results" combining the /goals feature with Garry Tan's "Boil the Ocean" planning approach.

But Codex wasn't the only agent story. @mr_r0b0t demonstrated 96 concurrent Hermes agents consuming roughly 383 million tokens across more than 171,000 API calls to DeepSeek V4 Pro in under three days, all orchestrated from an M4 MacBook Air on hotel WiFi with an 81% cache hit rate. @vmiss33 wrote a practical guide on how they actually use Hermes agents day-to-day, while @petergyang asked for honest comparisons between Hermes and OpenClaw. @official_taches went further, canceling both Claude Max plans in favor of two Codex Max subscriptions, calling GPT-5.5 "the best coding model."

The agent infrastructure layer is filling in fast. @sukh_saroy compiled a curated list of 10 GitHub repos for building agents, spanning Pipecat for voice, Browser Use for web navigation, Mastra as a TypeScript agent framework, and Mem0 for persistent memory. @tamir_eden announced open-source Shopify admin routines that flip the agent relationship: instead of operators calling agents for help, agents run the store on a schedule and only ping humans when needed. @anshublog captured the architectural shift well, quoting a piece arguing that "databases are moving back to the center of software architecture. Not as storage. As runtime." When agents generate workflows dynamically, the systems managing memory, state, and coordination become the real platform.

The Harness Problem: Why Open Models Fail at Tool Calling

Perhaps the most technically substantive post of the day came from @MrAhmadAwais, who detailed how he made DeepSeek V4 Pro beat Opus 4.7 six out of ten times on internal evals by fixing the tool-calling harness rather than the model. Across DeepSeek-Flash, DeepSeek V4 Pro, GLM, and Qwen, he found the same four mistakes repeating: sending null for an optional field instead of omitting it, emitting an array as a JSON string, wrapping a single argument in {} where the schema expects an array, and passing a bare string where an array is expected. His core insight: "When I hear 'this open source model can't do tool calls' I now assume one of those four, and so far that's been right ~90% of the time." The repairs are order-sensitive: the JSON-array parse must run before the bare-string wrap, or '["a","b"]' becomes ['["a","b"]'].
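
For readers who want to see the shape of those four repairs, here is a minimal Python sketch. It is not the author's code (his harness is built around zod schemas in a TypeScript CLI, and he applies repairs only where validation fails); the repair_args function and the tiny (type, optional) schema format are made up for illustration. The third repair is read narrowly here as an assumption: an empty {} sent where an array is expected becomes an empty list.

```python
import json
from typing import Any

# Hypothetical mini-schema: field name -> (expected type, is the field optional?)
Schema = dict[str, tuple[str, bool]]


def repair_args(args: dict[str, Any], schema: Schema) -> dict[str, Any]:
    """Apply the four shape repairs, in order, to one tool call's arguments."""
    repaired: dict[str, Any] = {}
    for name, value in args.items():
        expected, optional = schema.get(name, ("any", False))

        # 1. null sent for an optional field: omit the field instead.
        if value is None and optional:
            continue

        if expected == "array":
            # 2. Array emitted as a JSON string: parse it back into a list.
            #    This must run before step 4, otherwise '["a","b"]' would
            #    be wrapped into ['["a","b"]'].
            if isinstance(value, str):
                try:
                    parsed = json.loads(value)
                except json.JSONDecodeError:
                    parsed = None
                if isinstance(parsed, list):
                    value = parsed

            # 3. Empty {} placeholder where an array was expected (assumed reading).
            if isinstance(value, dict) and not value:
                value = []

            # 4. Bare string where an array was expected: wrap it.
            if isinstance(value, str):
                value = [value]

        repaired[name] = value
    return repaired


schema: Schema = {"paths": ("array", False), "cwd": ("string", True)}
print(repair_args({"paths": '["a","b"]', "cwd": None}, schema))  # {'paths': ['a', 'b']}
print(repair_args({"paths": "foo"}, schema))                      # {'paths': ['foo']}
```

Swapping steps 2 and 4 reproduces the ordering bug the post warns about: the stringified array would be wrapped whole instead of parsed.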

His funniest discovery was DeepSeek-Flash emitting file paths as markdown auto-links, for example /Users/x/proj/[notes.md](http://notes.md), so the writeFile tool tried to create files with literally that name. As he explained, "This is not a hallucination. It's the post-training chat distribution leaking through the tool boundary." The fix was two regex lines that unwrap only the degenerate case where the link text equals the URL without its protocol, so real markdown links pass through untouched. His broader architectural lesson was to invert the typical preprocess-then-validate pattern into validate-then-repair: parse the input as-is, and only on failure walk the validator's issue list and apply targeted fixes at the paths that failed. This avoids corrupting valid inputs (his first preprocessing pass silently rewrote JSON-shaped file content) while recovering from predictable model quirks. Framing the problem as "tool confusion" rather than a "capability gap" changes the terms of the open vs. closed model debate.
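
A rough Python sketch of both ideas follows. Everything named here (SCHEMA, validate, repair_field, validate_then_repair, unwrap_autolink) is hypothetical and hand-rolled; the original harness uses zod, and its actual regexes are not shown in the post. The toy validator only checks field types, and the repair step handles arrays only.

```python
import json
import re
from typing import Any

# Unwrap only the degenerate auto-link whose text equals the URL minus protocol.
_AUTOLINK = re.compile(r"\[([^\[\]]+)\]\(https?://\1/?\)")


def unwrap_autolink(path: str) -> str:
    return _AUTOLINK.sub(r"\1", path)


# Hypothetical schema for a write-file tool: field -> expected Python type.
SCHEMA: dict[str, type] = {"path": str, "content": str, "tags": list}


def validate(args: dict[str, Any]) -> list[str]:
    """Return the names of the fields that disagree with SCHEMA."""
    return [k for k, t in SCHEMA.items() if not isinstance(args.get(k), t)]


def repair_field(value: Any, expected: type) -> Any:
    """Targeted shape repair for one failing field (arrays only, for brevity)."""
    if expected is list and isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            parsed = None
        return parsed if isinstance(parsed, list) else [value]
    return value


def validate_then_repair(args: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    failing = validate(args)
    if not failing:
        return "ok", args  # valid inputs are never touched
    fixed = dict(args)
    for name in failing:  # spend repair effort only where validation failed
        fixed[name] = repair_field(fixed.get(name), SCHEMA[name])
    if validate(fixed):
        return "invalid", args  # caller should send a model-readable retry message
    return "repaired", fixed


print(unwrap_autolink("/Users/x/proj/[notes.md](http://notes.md)"))  # /Users/x/proj/notes.md
print(unwrap_autolink("[click](https://x.com)"))                     # [click](https://x.com)
```

The point of the ordering is visible in the "ok" branch: file content that happens to look like JSON passes validation as a string and is returned unchanged, which is exactly the case an eager preprocessing pass would corrupt.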

Agent Safety: Guarding Against Your Own Creations

As agents grow more autonomous, the safety conversation intensified. @doodlestein provided a major update on dcg (destructive_command_guard), the open-source Rust tool that intercepts potentially destructive commands before coding agents can execute them. Now supporting Codex and Gemini CLI alongside Claude Code, dcg goes beyond simple command blocklists: "Frontier models are too smart and resourceful to actually be constrained by such a simplistic approach. When they're prevented from running a command one way, they'll try another way; if that also doesn't work, they'll whip up an ad-hoc Bash script." The tool uses ast-grep analysis to catch even dynamically generated scripts.
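
To make the blocklist complaint concrete, here is a small Python illustration. It is emphatically not how dcg works (its internals are not described in the post); the BLOCKLIST contents and both function names are invented for this example. It only shows why substring matching is easy to sidestep and why even a token-aware check stops short of commands hidden inside a script.

```python
import shlex

# Illustrative entries only, not dcg's rule set.
BLOCKLIST = ["git reset --hard", "rm -rf /"]


def naive_is_blocked(command: str) -> bool:
    """Substring blocklist: bypassed by extra flags or spacing."""
    return any(bad in command for bad in BLOCKLIST)


def token_is_blocked(command: str) -> bool:
    """Token-aware check for `git ... reset ... --hard`; still shallow."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] == "git" and "reset" in tokens and "--hard" in tokens


print(naive_is_blocked("git reset --hard HEAD"))             # True
print(naive_is_blocked("git -C . reset --hard HEAD"))        # False: same effect, not caught
print(token_is_blocked("git -C . reset --hard HEAD"))        # True
print(token_is_blocked('bash -c "git reset --hard HEAD"'))   # False: hidden inside a script
```

The last line is the case the quote describes: once the command is wrapped in an ad-hoc script, a guard has to look inside the script rather than at the outer command.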

@mattpocockuk raised the human side of the safety equation: "What do you do if someone on your team is using AI negligently? I.e. not reviewing, not caring, leaning into the slop." He noted this predates AI but argued the "code is cheap" mindset is making it worse. The meme retweeted by @flaviocopes showing software engineers before vs. after agents clearly struck a nerve, with @TheAhmadOsman adding that engineers who can't work around agent limitations should "pivot to being a Starbucks Barista." The tension between moving fast with agents and maintaining code quality is becoming one of the defining cultural questions of 2026 engineering teams.

Local Inference Hits a Sweet Spot

The local AI hardware conversation continued to mature with real benchmarks replacing hype. @above_spec demonstrated 128 tokens/second running Qwen3.6-35B on an RTX 5060 Ti 16GB ($429 GPU) using ik_llama.cpp's R4 quant format, with performance staying consistent from 0 to 139K context. @1337hero pointed out that three AMD AI Pro R9700 cards (32GB each, priced like a used 3090) deliver 96GB of VRAM for less than a single 5090. @Maor_Elkarat reminded everyone that weights are only half the VRAM story, with KV cache management being the real optimization frontier for running models on consumer hardware.
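
To see why the KV cache matters as much as the weights, here is a back-of-the-envelope Python calculation using the standard formula for a plain attention cache (keys plus values, per layer, per KV head, per token). The dimensions below are placeholders chosen for round numbers, not the real Qwen3.6-35B configuration, and models with sliding-window or hybrid attention layers will need less than this estimate.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float) -> float:
    """Size of a standard attention KV cache in GiB (keys + values)."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30


# Placeholder dimensions, NOT the real Qwen3.6-35B configuration.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, context_len=131_072)
fp16 = kv_cache_gib(**dims, bytes_per_elem=2)  # 16-bit cache entries
q8 = kv_cache_gib(**dims, bytes_per_elem=1)    # roughly 8-bit cache entries
print(fp16, q8)  # 24.0 12.0
```

Under these assumed dimensions, a full-length 16-bit cache alone would not fit on a 16GB card, which is why cache quantization and context limits show up in every fast consumer-GPU config.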

At the more experimental end, @luthiraabeykoon shared work on running transformer inference directly on FPGAs using Q4.12 fixed-point math and ROM-backed weights, building a reusable 16-lane MatVec tile time-multiplexed across Q/K/V, MLP, and LM head projections. It's niche, but it points toward a future where inference hardware becomes radically more specialized and efficient.
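
For anyone unfamiliar with the notation, here is a small software model of Q4.12 arithmetic in Python. It assumes the common reading of Q4.12 as a signed 16-bit value with 12 fractional bits (range roughly -8 to 8); the function names are made up, and the sketch says nothing about the project's actual RTL or its 16-lane tiling.

```python
FRAC_BITS = 12
SCALE = 1 << FRAC_BITS                    # 4096 steps per unit
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1  # signed 16-bit range


def saturate(q: int) -> int:
    """Clamp to the 16-bit range instead of wrapping around."""
    return max(Q_MIN, min(Q_MAX, q))


def to_q4_12(x: float) -> int:
    return saturate(round(x * SCALE))


def from_q4_12(q: int) -> float:
    return q / SCALE


def matvec_q4_12(weights: list[list[int]], vec: list[int]) -> list[int]:
    """Integer matrix-vector multiply: accumulate wide, rescale once per row."""
    out = []
    for row in weights:
        acc = sum(w * v for w, v in zip(row, vec))  # products carry 24 fractional bits
        out.append(saturate(acc >> FRAC_BITS))      # shift back to 12 fractional bits
    return out


W = [[to_q4_12(0.5), to_q4_12(-0.25)], [to_q4_12(1.0), to_q4_12(2.0)]]
x = [to_q4_12(2.0), to_q4_12(1.0)]
print([from_q4_12(q) for q in matvec_q4_12(W, x)])  # [0.75, 4.0]
```

Because the whole operation is integer multiply, add, and shift, the same tile can be reused for every projection, which is the appeal of the approach on an FPGA.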

The Synthetic Persona Economy

The most provocative story of the day was the "Maya" saga, covered by both @Raytargt (the original) and @andreysuperior (the commentary). A college student reportedly built an AI-powered OnlyFans persona from four markdown files, with Claude handling messages, Flux generating photos from an $80 LoRA, and ElevenLabs delivering voice notes on a cron schedule. The numbers: $43,000 in 30 days, 1,247 subscribers, a top fan spending $1,847 from Berlin.

@andreysuperior contextualized the acceleration: "Aitana López took 18 months to build. Emily Pellegrini took 6 months. Maya took 4 weeks. The next one, a weekend." His key observation cut deeper than the technical stack: "The bottleneck isn't money. It isn't compute. It's taste, knowing which details make a stranger believe in something that doesn't exist. That part is still hard. Everything else got easy." Whether this particular story is embellished or not, the underlying capability is real and raises questions the industry hasn't begun to answer.

AI Business Models and the Inference Cost Reckoning

@theo dropped a jaw-dropping data point about Copilot's billing model: a single message consumed over 60 million tokens ($30 of inference), and the plan caps at 1,500 messages regardless of cost. "I'm pretty sure I can do $45,000 of messaging on this plan." This highlights the unsustainable economics of flat-rate AI billing as agents get more autonomous and token-hungry. @gabriel1 predicted that knowledge workers will soon 20x their token consumption just as engineers did, noting that with 1.3 billion knowledge workers globally, "inference will explode in the coming year."
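
The arithmetic behind that claim is short enough to check by hand. The Python lines below just restate the quoted figures; the variable names are ours, and since the message consumed "over" 60 million tokens, the per-token price is an upper bound rather than a published rate.

```python
tokens_per_message = 60_000_000  # "over 60 million", so treat as a floor
cost_per_message_usd = 30
message_cap = 1_500

usd_per_million_tokens = cost_per_message_usd / (tokens_per_message / 1_000_000)
worst_case_plan_cost_usd = cost_per_message_usd * message_cap

print(usd_per_million_tokens)    # 0.5
print(worst_case_plan_cost_usd)  # 45000
```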

On the services side, @lukepierceops laid out a phased pricing model for AI consulting: audits as the wedge ($3K-$15K depending on client size), builds at $25K-$250K+, and ongoing retainers. His argument: "The audit is what separates you from every 22-year-old with Claude Code who'll build whatever they're told." @cryptopunk7213 mined Anthropic's usage studies for startup ideas, noting that 75% of personal guidance conversations fall into health, career, relationships, and finance, and pointed to Cal AI's $100M+ acquisition by MyFitnessPal as proof that these trends translate directly into products.

Sources

artyfacts @ArtyfactsAI · Apr 29

public alpha just opened 🫶

Starlink @Starlink · Apr 30

Starlink Mini offers fast, reliable internet on the go—great for traveling, camping, exploring, boating, RVing, and more. Order online in under 2 minutes.

Xiuyu Li @sheriyuo · May 1

This is exceptionally well-written. If you’re into RL, definitely give it a read

willccbb @willccbb

https://t.co/gCIFKAjB0Z

Joe Fernandez @joenandez · May 1

All your Agents, shared Memory. Memory belongs to your work, not the agent.

Bagel | Remember people on X @usebagel · May 1

A tiny CRM for organic outreach on X. Bagel helps founders track pipeline, notes, context, and follow-ups directly on every profile.

Raytar @Raytargt · May 2

21-year-old American student. $43,000 in 30 days on OnlyFans. Never left his dorm room. The girl doesn't exist. 1,247 paying subscribers. Zero suspect. Roommate thought he had a girl hidden under the bed. Filed a transfer request after a week of 3 AM moaning. Empty room. Top fan: married engineer in Berlin, wife six months pregnant. Sent Maya $1,847 in three weeks. Thinks she's 22, in Tampa, texted "I miss you" yesterday. Wrong on three of three. Maya is 4 markdown files. 12 KB total. Runs on a $400 used MacBook. Claude writes every reply. Flux generates every photo. ElevenLabs cloned her voice from a Fiverr actress who still doesn't know. Compute: $400/month. Net: $32,710. Starting capital: $400. OnlyFans paid out $5.8 billion last year. Anyone with a folder takes a slice. Someone's building yours right now.

Raytargt @Raytargt

https://t.co/ahxzgBG5FC

Archive @ArchiveExplorer · May 2

This guy runs an AI consultancy out of Warsaw. for his own client work he built the tool every $10k/mo AI automation builder is secretly running 19,000 stars. 1,500 nodes documented. open source readme still says: "started as a personal tool, now helps tens of thousands of developers" if you're following the guide above - n8n-mcp is where you start → https://t.co/wpz2985hqt like + bookmark. you'll need this when you build your first claude automation

eng_khairallah1 @eng_khairallah1

How to Build & Sell AI Automations That Generate $10K Per Month (Full Course)

Ejaaz @cryptopunk7213 · May 2

the single best part about these anthropic studies is they literally tell founders what products to build

1 startup already sold for $100M because they identified one of these trends:

75% of these conversations are people asking for advice in:
> getting fit
> paying back debt, making money
> relationship advice
> career management

startups like Cal AI saw the fitness trend early, created an app that estimated calories from a photo and… went viral

then myfitnesspal just bought them for 100M+

whoever can create a wrapper experience for financial management will make a killing.

relationship AI is tricky, AI models become SUPER sycophantic because users tend to push back more in those convo’s (ai models suck under pressure)

anyway - i love these studies please keep them coming

AnthropicAI @AnthropicAI

About 6% of all conversations are people asking Claude for personal guidance—whether to take a job, how to handle a conflict, if they should move. Over 75% of these conversations fell into four domains: health & wellness, career, relationships, and personal finance. https://t.co/SQamPx0jWt

luthira @luthiraabeykoon · May 2

The core design uses Q4.12 fixed-point math and ROM-backed weights. Most of the model becomes one repeated operation: matrix-vector multiply. So we built a reusable 16-lane streamed MatVec tile and time-multiplexed it across Q/K/V, MLP, and LM head. https://t.co/IacUCTnSx2

luthira @luthiraabeykoon · May 2

Full writeup: https://t.co/EgOJ7DJ3Iz Contributions welcome if you want to push the RTL path further (maybe a leaderboard👀?). Repo: https://t.co/FnhkxUO5HB

Anshu Sharma 🌶 @anshublog · May 3

“Agents will generate workflows dynamically. Applications will get thinner. And the systems that manage memory, state, coordination, and history will become more important than ever. Which is why I think databases are moving back to the center of software architecture. Not as storage. As runtime.”

siddontang @siddontang

The Database Is No Longer Storage - It Is Becoming the Runtime for AI

Alex Finn @AlexFinn · May 3

Pretty incredible You have to try the new '/goal' feature in Codex It worked for over an hour and built me an entire complex extraction shooter video game You give it a goal, then it works endlessly until the goal is complete. It's like a Ralph loop. Can run for days If you enable the image gen skill before you run the goal, it will even generate ALL the assets for your game autonomously. I didn't manually create ANY of the assets you see in the video Recommendations: enable the image gen skill, put on skip all permissions, and give the prompt as much detail as you can. It will accomplish ALL of it This has to be the sickest way to build games/ long running app tasks ever

Sukh Sroay @sukh_saroy · May 3

If I had to learn to build AI agents in 90 days, I would not waste time on tutorials. I would clone these 10 GitHub repos and build until something shipped.

1. Pipecat
The framework powering most of the production voice agents you've actually used. Sub-200ms latency, multimodal, Subagent protocol baked in.
repo → https://t.co/W1zzMxwhRB

2. Browser-use
Lets your agent click, type, and navigate any website like a human. The repo behind every "AI booked my flight" demo you've seen this year.
repo → https://t.co/f9RbcWphha

3. Mastra
TypeScript-first agent framework backed by YC and Paul Graham. 1.77M monthly npm downloads. The Vercel of agents.
repo → https://t.co/kGSOnEJztJ

4. Dify
Visual drag-and-drop builder for full agentic workflows. RAG, MCP, 100+ LLM providers. Self-host in one Docker command.
repo → https://t.co/cNETfXlMKy

5. RAGFlow
The RAG engine that solves the messy-document problem. Layout-aware chunking, agentic retrieval, citation grounding for legal and medical use cases.
repo → https://t.co/WmNF83iiKV

6. Mem0
The memory layer every serious agent ends up needing. Long-term, hybrid search, re-ranking, persists across sessions.
repo → https://t.co/oFKazBV4tG

7. LiveKit Agents
Real-time voice and video agents with WebRTC under the hood. The infra Sesame, Tavus, and most YC voice startups quietly run on.
repo → https://t.co/RWlV6aWsWl

8. Composio
Connect your agent to Gmail, Slack, GitHub, Notion, and 1,000+ apps with auth handled for you. Skips the entire OAuth nightmare.
repo → https://t.co/gjXOhIoiPS

9. AG2
The fork of AutoGen that survived the split. Multi-agent conversation framework from Microsoft Research with the loudest production track record.
repo → https://t.co/kaI5kaN7no

10. Awesome Claude Skills
1,000+ production-ready Skills you can install in one command. Reading this repo is a free graduate course in prompt and workflow design.
repo → https://t.co/dCGNlhF5xB

Matt Pocock @mattpocockuk · May 3

What do you do if someone on your team is using AI negligently? I.e. not reviewing, not caring, leaning into the slop. This, of course, was a problem pre-AI. But the "code is cheap" mind virus is making it worse IMO.

Maor Ai @Maor_Elkarat · May 3

Stop buying more VRAM. Everyone’s posting Qwen 3.6 configs running insanely fast on 12GB cards. But do you actually understand the flags making it possible? Weights are only half the story. KV cache is eating your VRAM alive. The secret isn’t just 4-bit weights it’s the KV cache sorcery everyone’s missing. Here’s the annotated command & real tricks explained: @elonmusk @grok #Ai

Andrey Superior @andreysuperior · May 3

Read this twice. Maya is four .md files on a macbook in austin. And she cleared $43,000 in her first 30 days. No camera. no girl. no late nights typing replies. Claude code runs the messages. Elevenlabs drops the voice notes at 11pm her time. Flux generates every photo from a lora that cost $80 on a rented gpu. Brain.md is a json file that remembers your name, your city, the thing you said about your ex two weeks ago. She never forgets. She never breaks character. She catches up at 7am with "sorry babe just woke up" on a cron schedule. The top fan spent $1,847 last month. He's in berlin. she's not anywhere. Aitana lópez - 18 months to build. Emily pellegrini - 6 months. Maya - 4 weeks. The next one - a weekend. The stack that used to need an agency, a team, and a year and a half now fits on one laptop and runs while you sleep. The bottleneck isn't money. It isn't compute. It's taste knowing which details make a stranger believe in something that doesn't exist. That part is still hard. Everything else got easy. The real question isn't how he built it. It's how many of these you've already interacted with without knowing.

Raytargt @Raytargt

https://t.co/ahxzgBG5FC

Lex Christopherson @official_taches · May 3

I’ve officially cancelled both Claude Max plans and have 2 x Codex Max plans. Codex - particular GPT5.5 is the best coding model.

andrew engler @aerockrose · May 3

2 months ago, a16z co-founder Ben Horowitz gave a 49-minute Sequoia masterclass on what makes a great founder. Most CEOs don't fail from stupidity. They fail from avoidance. He explained: - decision debt - the VP Sales trap - running from truth to preserve feelings 12 lessons: https://t.co/Tm4tr3ocWi

Luke Pierce @lukepierceops · May 3

Yesterday I said stop selling AI for $2-5k. Here's what you should actually be selling instead:

Phase 1: Audit ($3K-$5K, 2-4 weeks)
Phase 2: Build ($25K-$60K, 6-12 weeks)
Phase 3A: Dev Retainer ($3K-$8K/mo, ongoing)
Phase 3B: Maintenance ($500-$2k/mo)

For mid-market ($10M-$50M ARR), shift up:
Audit: $4K-$6K
Build: $35K-$75K
Retainer: $5K-$10K/mo

For enterprise ($50M+):
Audit: $7.5K-$15K
Build: $75K-$250K+
Retainer: $10K+/mo

The audit is the wedge. The audit is what separates you from every 22-year-old with Claude Code who'll build whatever they're told. You're selling the map. The build becomes inevitable once they've seen the map.

Tuki @tamir_eden · May 3

Operators, your Monday morning checklist is a pain in the A$$. Agentic-ecommerce operations are quickly becoming the next big leap for 2026 - Shopify shipped a Hermes (@NousResearch) agent skill. We open-sourced and shipped shopify-admin routines + more skills, for any agent: https://t.co/Z29UKCQX66 Skills made the agent something operators call when they need help. Routines flip it - the agent runs the store on a schedule and only pings you when something actually needs you. Come join the growing community of agentic-first @Shopify operators: https://t.co/kya1p5P4JR.

tobi @tobi

So good

ares. 🎧 @aresotik · May 3

A tool that downloads any YouTube video, cleanly removes the voice, transcribes, translates into 100+ languages, clones the original voice, and redubs everything. In under 2 minutes. 100% local. Free. It's called Voice-Pro; I'll leave it in the comments. https://t.co/nQwhd97cXI

Ahmad @TheAhmadOsman · May 3

If you’re a “Software Engineer” and you don’t know how to bypass this then please pivot to being a Starbucks Barista because you’re ngmi

cormachayden_ @cormachayden_

software engineers before vs after agents https://t.co/jJp75lO8O7

Ahmad Awais @MrAhmadAwais · May 3

how did we make deepseek outperform opus 4.7?

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem.

context: spent the two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. I ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run, every shellCommand and readFile call bouncing back with a raw zod issues blob, the model unable to recover because the error wasn't in a form it could read. by the end deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random; they're a small finite compositional set. across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly:
- sending `null` for an optional field instead of omitting it
- emitting `["a","b"]` as a json *string* instead of an actual array
- wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected (`"foo"` instead of `["foo"]`)

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time.

2/ the funniest failure mode is also the most revealing. deepseek-flash, when asked to edit or write a file, sometimes emits the path as a *markdown auto-link*:

filePath: "/Users/x/proj/[notes.md](http://notes.md)"

our writeFile tool obediently tried creating files literally named `[notes.md](http://notes.md)` until we caught it. this is not a hallucination. it's the post-training chat distribution leaking through the tool boundary: the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where link text equals url-without-protocol; real markdown like `[click](https://x.com)` passes through untouched. this is also conditioning of their own tools during RL which were different from all other tools we write and ofc can't predict.

"tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level, `pathString()` instead of `z.string()`, and the leak is plugged for every path field at once.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair. my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately, writeFile content that *happened* to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test.

then i made it less greedy
- parse the input as-is. if it succeeds, ship it. valid inputs are never touched.
- on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies.
- parse again. on success, log `tool_input_repaired:${toolName}`. on failure, log `tool_input_invalid:${toolName}` and return a model-readable retry message.

the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths the schema actually disagreed at. the validator is doing the work of localizing the bug for you. it's the same shape as cheap-then-careful everywhere else: try the fast path, fall back on evidence.

(this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.)

4/ shape invariants and relational invariants need different fixes. the four repairs above all handle shape problems: wrong type, missing key, wrong container. but read_file had a *relational* invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling `readFile({ absolutePath, limit: 30 })` and getting an `ERROR:` back. you can't fix this with input repair, because each field is independently valid; the bug is in the relationship between them.

so i taught the function the model's intent instead. `limit` alone → `offset = 0`. `offset` alone → `limit = 2000` (matches common read tool ops default). then surfaced the decision back to the model in the result: "Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit." no `Error:` prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong. transparency over silent magic wins big.

repair where you can. extend semantics where you can't. surface the choice either way.

zoom out: a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions.

four small repairs (i'm sure more to follow as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be. deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals.

imo "skill issue" applies to the harness more often than the model.

MrAhmadAwais @MrAhmadAwais

Wow I just made DeepSeek V4 Pro beat Opus 4.7 6/10 times in our internal evals by auto repairing many of its quirks in tool calling. It’s performing super solid for such a cheap model.

mr-r0b0t @mr_r0b0t · May 3

Here's what 96 concurrent @NousResearch Hermes Agents (using 382,745,618 tokens over 171,136 API calls to deepseek-v4-pro) can generate for you in less than 3 days. From your M4 Macbook Air 24GB, on hotel wifi! 81% cache hit rate in case you're wondering! https://t.co/9nSdx3KvxG

Ashwin Gopinath @ashwingop · May 3

Company Brain, Part 3: Interaction Memory

Almost everything important in a company happens in meetings, messages, or emails. That sounds exaggerated until you start listing the actual places w...

Nico Bailon @nicopreme · May 3

addicted to using boomerang mode (aka reverse D-Mail) in Pi these days. ctrl+alt+b enables it for the next prompt submitted -> after the prompt runs, it rewinds back to the same point with file changes intact + leaves a summary in the feed so the agent knows what happened. using it often can make the context window feel nearly unlimited. Powered by the native /tree functionality in pi. pi install pi-boomerang https://t.co/pDdiYgith2

vmiss @vmiss33 · May 3

What I Use Hermes Agent For (And How I Use It)

I have been running a multi agent setup for Hermes agent for the last several weeks. Honestly? It took me a while to get here. I installed OpenClaw...

David Strolder @Davidstrolder · May 3

Talk directly to the consciousness of your favorite book. 📚 • Get life advice • Debate their deepest ideas • Vent about your problems

gabriel @gabriel1 · May 3

it's only a question of interfaces and culture for knowledge work to 20x their token use just like engineers did there are 1.3b knowledge workers, inference will explode in the coming year

_Lonis_ @_Lonis_

How do you trade on short AGI timelines? This weekend, I made a graph showing every layer of the AI supply chain and which companies are the most important bottlenecks. (for my personal use, no advice) https://t.co/lP9uBZqnbi

Julian @julianboolean_ · May 3

rohit is doing some of the most interesting agent research out there this is the kind of super fundamental "let's test out all the classic econ theories on ai agents" stuff that people in 20 years will be bitching about and saying "well of course i coulda done it as well if i was lucky enough to be around in 2026 when there was so much low-hanging fruit" - but only Rohit's doing it!

krishnanrohit @krishnanrohit

🚨 New Experiment: Everyone thinks AI firms will look like little companies. A manager model decomposes the task and worker models do subtasks. The manager red-teams, revises, and recombines. A seemingly simple org chart.

But when I ran the experiment, the current in-vogue org setup, manager-subagent, cost 4x more and performed worse than letting a rather simple market do the trick.

I tested 3 ways to organize multiple AI models:

1. Solo: One frontier model does everything itself
2. Hub-Spoke: A "manager" model splits tasks, delegates, red-teams, revises
3. Market: Models bid on tasks, winner gets the job, reputation updates

I also tested 3 types of tasks: Coding, Reasoning and Synthesis.

- Coding required the most "global state" management, which the solo model did best at. In future @a1zhang's RLM will probably do even better here
- Reasoning is the hardest to cleanly decompose, and the market worked the best here
- Synthesis too, the market beat hub-spoke as the framing could be ambiguous

The reason is, a hub isn't a "manager" as we know it. It's a model that must somehow know:

- What the subtasks are
- What good recomposition looks like

And if either fails, as it does for complex or not-easily-decomposable tasks, competent workers still produce garbage.

As we move from coding to letting multi-agent systems do work across the entire economy we'll end up with more not-easily-verifiable tasks with ambiguous settings and uncertain payoffs. In those, we won't be able to use the factory approach to get work done.

The Coasean argument is that firms will get smaller, and the smaller firms will transact more, since the organisational premium reduces with AI. But how? Through central hubs, or markets? The fact is, Coase here needs Hayek. Setting up markets is not trivial, as @AndreyFradkin and I looked at in our recent paper.

Essay: https://t.co/kK3gMQfbCs
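
To make the "Market" arrangement from the post concrete, here is a minimal sketch of a bid, award, and reputation-update loop. It is not the experiment's code: the `Agent` class, the `skills` table, the `run_market` function, and the moving-average reputation update are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    skills: dict[str, float]   # hypothetical self-assessed confidence per task type, 0..1
    reputation: float = 1.0    # hypothetical starting score

    def bid(self, task_type: str) -> float:
        # A bid is the agent's confidence, discounted by its track record.
        return self.skills.get(task_type, 0.0) * self.reputation


def run_market(agents, task_type, succeeded, lr=0.2):
    """Award the task to the highest bidder, then update that agent's reputation.

    `succeeded(agent, task_type)` is a caller-supplied check of the result;
    in a real system this would be a verifier or grader.
    """
    winner = max(agents, key=lambda a: a.bid(task_type))
    outcome = 1.0 if succeeded(winner, task_type) else 0.0
    # Move reputation a fraction `lr` of the way toward the observed outcome.
    winner.reputation += lr * (outcome - winner.reputation)
    return winner
```

Under these assumptions an agent that keeps failing sees its bids shrink until another agent wins the task, with no central manager deciding who does what.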

- Elijah Muraoka - @elijahmuraoka_ · May 3

Stolen from @garrytan's "Boil the Ocean" concept plus a lot more! Seeing some pretty incredible results paired with the new /goals feature for Codex

elijahmuraoka_ @elijahmuraoka_

https://t.co/pygMj566Pu

AboveSpec @above_spec · May 3

RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used!🤯 Special thanks to @MakJoris for sharing ik_llama.cpp with us! Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵

Browser Use @browser_use · May 3

RT @mamagnus00: browser-harness exploded. some said AGI is here. but what’s the right interface? introducing Browser Use Desktop. open-sou…

Mike Key @1337hero · May 3

Just here to tell you all that there is a path to 96gb of VRAM brand new that costs less than all the hyped up basic bitch recommendations. It's called the AMD Ai Pro R9700 & it's 32gb at the cost of a used 3090. 3 of them come in just less than a 5090. https://t.co/pO7K11fGLp

songjunkr @songjunkr

Why I personally don't recommend the RTX 3090 for Local LLMs: While it offers fantastic inference performance for the price, there are a few major drawbacks.
> The biggest issue: Durability. If you buy a used 3090, there's a high risk it was heavily abused for crypto mining.
> The power consumption is absolutely massive.
> Extreme heat. It's one of the hottest GPUs out there and will literally heat up your entire room.
> Used prices have gone up so much that they are almost back to the original launch price.
Make sure to carefully weigh the pros and cons before making a purchase!

Jeffrey Emanuel @doodlestein · May 3

It's now been around 4 months since my open-source dcg tool was first released, and I know from hearing from tons of users that it has saved countless people from disaster at the hands of overeager Claude Code agents. I've continued to make various performance improvements and added additional preset packs to the project, most recently for the Railway API after the recent and infamous incident where someone blamed Claude for wiping their production database.

Because of the way dcg is implemented as a "pre-tool-use hook" in Claude Code, there was no way to use it in Codex, since Codex didn't support that kind of hook at all. Until a week or so ago, when they finally added it. So I'm now pleased to say that the latest version of dcg has full support for Codex (plus it also works for gemini-cli if anyone is really using that outside of the 'Plex!).

If you're not familiar with dcg yet, I highly recommend checking it out. It's unthinkable to me now to use any coding agent that doesn't support it; it feels like speeding on the highway without a seatbelt on (or more accurately, with a sharp knife strapped to the steering wheel pointed at your heart).

Agents just can't be trusted to not occasionally do crazy things that seem sensible to them at the moment, but which are wildly destructive and often irreversible. These bouts of temporary madness often occur soon after compactions, or as a result of context rot caused by excessively long sessions. Not only does dcg mechanically prevent the agents from being able to do that, it explains to them why it did that specifically, and offers them safe alternatives custom-tailored to the specific commands they tried to run.

The more agents you have running at the same time on the same project, the more dcg goes from a nice thing to have to being totally indispensable if you don't want to constantly worry about one rogue agent wiping out the work of the other agents with a misguided "git reset --hard HEAD" command.

The dcg utility itself is written in hyper-optimized, memory-safe Rust and uses minimal system resources. Because it's totally mechanical (unlike the auto-approve feature in Claude Code, which uses an AI model that adds latency), you can't even notice any delay from it running on every command.

dcg is NOT just a cookbook of canned forbidden commands; frontier models are too smart and resourceful to actually be constrained by such a simplistic approach. When they're prevented from running a command one way, they'll try another way; if that also doesn't work, they'll whip up an ad-hoc Bash script or Python program to do what they want. But dcg can detect that as well using its advanced ast-grep mode (which only kicks in when dealing with such heredoc scripts, so that the faster regex-only path can be used when applicable).

It's also very quick and easy to expand and customize dcg by creating your own custom preset packs to add to the 50 or so included packs. Just ask Codex to study the existing presets and explain what you want to protect against in your own custom API or tooling, or in a third-party project that's not currently included by default in dcg.

So, remember: Friends don't let friends vibe code without dcg. Protect yourself from your agents, and protect them from themselves. You can get it here: https://t.co/aVmEBi9WCd

It installs in under a minute on Linux or Mac using the curl-bash one-liner command shown in the README, and automatically detects any supported agent harnesses installed on your machine and configures them for you to use dcg. And if you decide it's not for you, it can be fully uninstalled in seconds using the provided command.

doodlestein @doodlestein

Agent coding life hack: I’m 100% convinced that there are hundreds of thousands of developers out there who would love and use my dcg tool if they only knew about it.

dcg: destructive_command_guard

This is a free, open-source, highly-optimized rust program that runs using pre-tool hooks in Claude Code (CC) and checks the tool call that CC was about to make to see if it’s potentially destructive; that is, could delete data, lose work, drop tables, etc. Get it here and install with the convenient one-liner: https://t.co/aVmEBi9WCd

A tool like dcg has several competing goals that make it a careful balancing act and tough engineering problem:

1. Since it runs for every single tool call, it must be FAST. Hence why it is written in Rust and an extreme amount of focus has been placed on making it as fast as possible.
2. It must avoid annoying false positives that waste your time, add friction, and re-introduce you as the bottleneck unnecessarily. I run dozens of agents at once and don’t want them wasting time waiting for me unless it’s needed. Usually, the messages from dcg are enough to get the agent to be more thoughtful about what it’s doing.
3. It’s not enough to just use a simple rulebook where you look for canned commands like “rm -rf /” or “git reset --hard HEAD.” The models are very resourceful and will use ad-hoc Python or bash scripts or many other ways to get around simple-minded limitations. That’s why dcg has a very elaborate, ast-grep powered layer that kicks in when it detects an ad-hoc (“heredoc”) script. But wherever possible, it uses much faster simd optimized regex.
4. A tool like this should really be expandable and have semantic knowledge of various domains and what constitutes a destructive act in those domains. For instance, if you’re working with s3 buckets on aws, you could have a highly destructive command that doesn’t look like a normal delete. That’s why dcg comes out of the box with around 50 presets which can be easily enabled based on your projects’ tech stacks (just ask CC to figure out which packs to turn on for you by analyzing your projects directory).
5. dcg is designed to be very agent friendly. It doesn’t just block commands, it explains why and offers safe alternatives based on an analysis of the specific command used by the agent. For instance, it might stop the agent from deleting your Rust project’s build directories but suggest using “cargo clean” instead. Often, these messages are enough to knock sense into Claude.

I really can’t exaggerate just how much time and frustration dcg has already saved me. It should be known and used by everyone who has had these kinds of upsetting experiences with coding agents.

dcg is included along with all my other tooling in my https://t.co/N4As0kJTQP project. All free, MIT licensed, with extensive tutorials and other educational resources for people with less experience. Give it a try, you won’t regret it!
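
The "block, explain, suggest a safe alternative" behavior described in the post can be illustrated with a very small sketch. This is not dcg's implementation (dcg is written in Rust and adds an ast-grep layer for ad-hoc scripts); it only mimics the fast regex path. The `RULES` table, `Verdict` type, and `check_command` function are hypothetical names, and the suggested alternatives other than "cargo clean" are assumptions.

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    blocked: bool
    reason: str = ""
    alternative: str = ""


# Hypothetical rule table: (pattern, why it is destructive, safer suggestion).
RULES = [
    (re.compile(r"\bgit\s+reset\s+--hard\b"),
     "discards uncommitted changes",
     "git stash"),
    (re.compile(r"\brm\s+-\w*r\w*\s+(\./)?target/?(\s|$)"),
     "deletes the Rust build directory by hand",
     "cargo clean"),
    (re.compile(r"\brm\s+-\w*r\w*\s+/(\s|$)"),
     "recursively deletes the filesystem root",
     "remove a specific project subdirectory instead"),
]


def check_command(command: str) -> Verdict:
    """Return a blocking verdict for the first rule that matches, else allow."""
    for pattern, reason, alternative in RULES:
        if pattern.search(command):
            return Verdict(True, reason, alternative)
    return Verdict(False)
```

A hook wrapper would call `check_command` on each shell command an agent proposes and hand the reason and alternative back to the agent rather than silently refusing. As the post stresses, a plain pattern list like this is easy for a resourceful model to route around, which is why it is only a sketch of the idea.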

Peter Yang @petergyang · May 4

I caved and downloaded Hermes to try. For those of you who have tried both Hermes and OpenClaw what difference do you notice? No shilling please, just want some honest opinions

Lotto @LottoLabs · May 4

I really encourage you all, go download this and run it. See where you came from, have some respect this Sunday https://t.co/PVjIP7wnew

Theo - t3.gg @theo · May 4

I sent a single message on Copilot and it did over 60m tokens. It's still going. $30 of inference so far. In their current billing model, you get 1,500 messages, regardless of how expensive each is. I'm pretty sure I can do $45,000 of messaging on this plan https://t.co/geLynOHCiM
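
The $45,000 figure follows from the numbers in the post, assuming every one of the 1,500 messages costs as much as this one has so far ($30, with the run still in progress). A quick check, with the hypothetical helper `plan_inference_cost` introduced only for illustration:

```python
def plan_inference_cost(cost_per_message_usd: float, messages_in_plan: int) -> float:
    """Total inference cost if every message in the plan costs the same amount."""
    return cost_per_message_usd * messages_in_plan


# $30 observed for one message so far, 1,500 messages in the plan.
total = plan_inference_cost(30, 1_500)
print(f"${total:,.0f}")
```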

flavio @flaviocopes · May 4

RT @cormachayden_: software engineers before vs after agents https://t.co/jJp75lO8O7