Dynamic Workflows Redefine Agent Architecture as Computer Use Agents Surpass Human Baseline at OSWorld

May 30, 2026 · 19 sources

The conversation around AI agents shifted from "do they work?" to "how do we orchestrate them?" with Claude's dynamic workflows, self-managing Codex threads, and multi-agent delegation patterns dominating the day. A new OSWorld record saw agents beat the human baseline, while the model ecosystem filled out from 4B edge coders to 199B MoE giants.

Daily Wrap-Up

Something shifted this week in how the AI community talks about agents. For months the debate was whether multi-agent systems even work, with skeptics pointing to fragile "agent soup" architectures that collapse under their own complexity. Today's posts suggest the conversation has moved past that binary. The question isn't whether to use agents. It's how to orchestrate them. Yi Ding captured this well, noting that Claude's new dynamic workflows aren't deterministic scaffolding but rather a prompt teaching the agent to write graph-like descriptions in JavaScript. The dream of flexible, non-brittle agent coordination that people imagined three years ago is becoming a working reality, even if it still requires careful tuning.

The model ecosystem also continued its march toward covering every conceivable hardware tier. From Qwopus's 4B coding model that runs at 270 tokens per second on consumer hardware, to Step 3.7's 199B parameter MoE with vision support, developers now have genuinely competitive options whether they have 4GB of VRAM or a rack of GPUs. Georgi Gerganov's llama.cpp launched a polished new website with a single-line cross-platform installer, signaling that local AI is thinking seriously about user experience for the first time. And in a milestone that would have seemed absurd a year ago, Neal Chopra's team hit state-of-the-art on OSWorld's computer use benchmark, beating the human baseline of 72.4% with both Claude Opus and Sonnet.

The most surprising detail of the day came from @rohit4verse, who revealed that Boris Cherny, the creator of Claude Code, doesn't write code or even talk to Claude directly. He runs one Claude that prompts the rest. It's a perfect distillation of where things stand: the people building these tools have already ascended to the meta-layer, orchestrating the orchestrators. The most practical takeaway for developers: stop treating your AI interactions as one-off chats and start designing them as persistent workflows with clear seams, handoffs, and hierarchical delegation. The developers getting outsized results are treating their AI tools less like assistants and more like a managed team.

Quick Hits

@ycombinator spotlighted @DraftedAI, which lets users draw shapes, define rooms, and set constraints to generate complete floor plans and 3D home designs in seconds. Over 120,000 people have already generated 325,000+ designs.
@garrytan flagged Moss, an open-source search layer for voice AI that runs retrieval at sub-10ms with no network hop to a vector DB. The 24-Hour Conversational AI Hackathon runs June 6-7 at the YC office.

Agent Architecture Breaks Through

The day's dominant theme was agent architecture in all its forms: how agents coordinate, how they manage themselves, and how they're evaluated. The spark was Claude Code's dynamic workflows, which @yi_ding described as deceptively simple. "It isn't 'deterministic.' It's literally just a prompt, albeit a fairly detailed one, teaching the agent to write a graph-like description in Javascript." This observation came in response to @dexhorthy's argument that the feature proves deterministic workflows orchestrating small agent loops beat "agent soup" every time. Ding's take was more nuanced: the feature requires manual tuning, similar to Deep Research at launch, but seeing the underlying dream become reality is impressive regardless.
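To make the idea of a "graph-like description" concrete, here is a minimal sketch in Python rather than JavaScript. It is not Claude Code's actual format or runtime: the step names, the `run_step` stand-in, and the `run_workflow` helper are all hypothetical, and the only real API used is the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical graph description of the kind an agent might emit:
# each step maps to the list of steps it depends on.
workflow = {
    "gather_context": [],
    "draft_plan": ["gather_context"],
    "implement": ["draft_plan"],
    "write_tests": ["draft_plan"],
    "review": ["implement", "write_tests"],
}


def run_step(name: str, inputs: dict[str, str]) -> str:
    """Stand-in for an agent call; a real harness would invoke a model here."""
    return f"{name}({', '.join(sorted(inputs))})"


def run_workflow(graph: dict[str, list[str]]) -> dict[str, str]:
    """Run every step after its dependencies and collect the results."""
    results: dict[str, str] = {}
    for step in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[step]}
        results[step] = run_step(step, inputs)
    return results


results = run_workflow(workflow)
```

The point of the sketch is that the graph is plain data the model can write and rewrite, while the runner stays small and generic.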

@huntlovell pushed in a related direction, exploring what happens when you package workflows like skills and let those workflows determine how the harness itself works. Drawing explicit parallels to Claude Code's approach, his team uses an interpreter runtime to instrument large numbers of subagents for complex tasks. The insight is that workflow orchestration for general agents mirrors the same problems the Claude team is solving, and the solution involves letting the model define its own graph structure rather than imposing one from outside.

The meta-pattern was captured perfectly by @rohit4verse, who noted that Claude Code's creator "doesn't write code. He doesn't even talk to Claude. He runs one Claude that prompts the rest." This delegation pattern, where one agent manages many others, is becoming the default for power users. @nickbaumann_ demonstrated this with Codex, describing a "chief of staff" pattern where a single persistent thread spins up new project threads, checks in on them during heartbeats, and routes relevant context from Slack. "Everything flows naturally to the top," he wrote.
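The "chief of staff" pattern can be sketched as a small data structure. This is an illustration only, not how Codex implements threads: `ProjectThread`, `ChiefOfStaff`, `spin_up`, and `heartbeat` are hypothetical names, and the Slack updates are modeled as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectThread:
    name: str
    context: list[str] = field(default_factory=list)
    status: str = "in progress"


@dataclass
class ChiefOfStaff:
    """Single persistent thread that owns every project thread (hypothetical)."""
    threads: dict[str, ProjectThread] = field(default_factory=dict)

    def spin_up(self, name: str, context: list[str]) -> ProjectThread:
        """Create a project thread seeded with context already gathered."""
        thread = ProjectThread(name, list(context))
        self.threads[name] = thread
        return thread

    def heartbeat(self, updates: dict[str, list[str]]) -> dict[str, str]:
        """Route new messages to matching threads, then report every status."""
        for name, messages in updates.items():
            if name in self.threads:
                self.threads[name].context.extend(messages)
        return {name: t.status for name, t in self.threads.items()}


cos = ChiefOfStaff()
cos.spin_up("billing-migration", ["kickoff notes"])
cos.spin_up("docs-refresh", [])
report = cos.heartbeat({"billing-migration": ["new deadline"], "unknown": ["ignored"]})
```

Updates for threads the coordinator does not own are dropped, so the top-level thread remains the single place where context is routed and status is collected.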

On the enterprise side, @McKinsey described agentic orchestration layers that allow AI agents, enterprise systems, and data connections to work together across functions, shrinking cognitive bottlenecks and enabling new operating models as coordination improves. And on the benchmark front, @nealchopra shared a new state of the art on OSWorld: 83.6% with Claude Opus 4.7 and 81.5% with Sonnet 4.6, both surpassing the human baseline of 72.4%. Their harness is open source and deliberately simple, reflecting a shift away from elaborate scaffolding. Perhaps most telling, Sonnet achieved near parity with Opus at less than half the cost, pointing toward a future where agent deployment means picking the right model at the intelligence-cost Pareto frontier, not defaulting to the biggest one available.
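Picking a model at the intelligence-cost Pareto frontier can be sketched in a few lines. The two scores below are the reported OSWorld numbers; the relative costs are assumptions (the source says only that Sonnet cost "less than half"), and the third candidate is entirely hypothetical, included to show a dominated option being discarded.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    score: float  # benchmark score, higher is better
    cost: float   # relative cost per task, lower is better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate beats on both score and cost."""
    def dominated(c: Candidate) -> bool:
        return any(
            o.score >= c.score and o.cost <= c.cost
            and (o.score > c.score or o.cost < c.cost)
            for o in candidates
        )
    return [c for c in candidates if not dominated(c)]


def cheapest_meeting(candidates: list[Candidate], min_score: float) -> Optional[Candidate]:
    """Pick the lowest-cost frontier model that clears the required score."""
    eligible = [c for c in pareto_frontier(candidates) if c.score >= min_score]
    return min(eligible, key=lambda c: c.cost) if eligible else None


candidates = [
    Candidate("opus", 83.6, 1.00),                # cost normalized to 1.0 (assumed)
    Candidate("sonnet", 81.5, 0.45),              # "less than half" -> 0.45 (assumed)
    Candidate("hypothetical-small", 60.0, 0.50),  # made-up, dominated by sonnet
]
frontier = pareto_frontier(candidates)
choice = cheapest_meeting(candidates, min_score=72.4)
```

Under these assumed costs, a task that only needs to clear the 72.4% human baseline selects the cheaper model, while raising the required score above 81.5 selects the larger one.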

The Craft of Yielding Agents

If agent architecture was the "what," today's posts were equally focused on the "how." @steipete reported that with GPT 5.5, a /goal command, autoreview, and a tool called crabbox, his prompts have moved from handling 30-60 minute tasks to 4-10 hour autonomous runs, with much higher confidence in the output. His conclusion was succinct: "Yielding agents is a skill." This was reinforced by @davidfowl retweeting @thdxr's post asserting that "using a coding agent is a deep skill," where the people who appear to use them effortlessly have simply put in the repetition.

@MrSanders shared a detailed multi-phase prompt architecture for driving agent-based architecture reviews, spanning eight phases from triggering the review through iterating on specifics to handing off to TDD implementation. The structure is notable for how it constrains the agent at each step. Phase 2 asks for analysis as text for line-by-line iteration rather than a file dump. Phase 4 requests targeted revisions instead of full rewrites. Phase 7 encodes cross-cutting conventions in AGENTS.md so they persist across sessions. It reads less like a prompt and more like a project management protocol adapted for non-human workers.

The economics matter too. @mattpocockuk pointed out that staying in the "smart zone" is an underrated way to save on tokens, because sending 600K tokens on every request in the "dumb zone" gets expensive fast, even with cache discounts. @ClaudeDevs highlighted a concrete improvement in Opus 4.8: system instructions can now be added mid-conversation without breaking the prompt cache, translating to more cache hits and lower cost and latency for API requests. Together these posts paint a picture of a maturing discipline where the best practitioners think about token budgets, cache strategies, and incremental refinement the way a senior engineer thinks about memory allocation.
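The arithmetic behind "cache hits are cheaper but still add up" can be sketched as follows. The price and discount are hypothetical placeholders, not any provider's actual rates, and `request_cost` is an illustrative helper rather than a real API.

```python
def request_cost(
    prompt_tokens: int,
    cached_fraction: float,
    price_per_mtok: float,
    cache_read_discount: float,
) -> float:
    """Input cost of one request when part of the prompt is served from cache."""
    cached = prompt_tokens * cached_fraction
    fresh = prompt_tokens - cached
    cached_price = price_per_mtok * (1 - cache_read_discount)
    return (fresh * price_per_mtok + cached * cached_price) / 1_000_000


PRICE = 5.0     # hypothetical dollars per million input tokens
DISCOUNT = 0.9  # hypothetical: cache reads billed at 10% of the normal rate

# Same cache hit rate, very different context sizes.
big = request_cost(600_000, 0.95, PRICE, DISCOUNT)
small = request_cost(100_000, 0.95, PRICE, DISCOUNT)
```

With identical cache hit rates the cost scales linearly with context size, so under these assumptions the 600K-token request costs six times as much per call as the 100K-token one even though most of it is cached.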

Models for Every VRAM Budget

The model ecosystem continued its relentless expansion into every hardware niche. @0xSero published a practical hardware guide that reads like a restaurant menu for models. At 4-8GB VRAM, minicpm5 offers agentic tool use on tiny machines and tops benchmarks in its weight class. At 8-16GB, LFM-2.5-8B is an 8B MoE with only 1.5B active parameters trained on a massive 38T tokens with 131K context. At 96-128GB, quantized variants of larger models deliver strong agent performance with high context lengths for modest VRAM. And at 196GB and above, Step-3.7-Flash brings 199B parameters with 11B active, vision support, 256K context, and 150 tokens per second on 6000-series GPUs.
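The guide's tiers amount to a lookup table, sketched below. The tier boundaries and model names come from the post; `suggest_model` is a hypothetical helper, and the choice to prefer the higher tier when a value sits on a shared boundary (8GB) is an assumption of this sketch. VRAM sizes the post does not cover return `None`.

```python
from typing import Optional

# (min_gb, max_gb, model) as listed in the guide; max_gb None means open-ended.
TIERS = [
    (4, 8, "minicpm5"),
    (8, 16, "LFM-2.5-8B"),
    (96, 128, "ds4flash"),
    (196, None, "Step-3.7-Flash"),
]


def suggest_model(vram_gb: float) -> Optional[str]:
    """Return the model for the highest tier whose range contains vram_gb."""
    match = None
    for low, high, model in TIERS:
        if vram_gb >= low and (high is None or vram_gb <= high):
            match = model  # later (larger) tiers overwrite earlier matches
    return match
```

For example, `suggest_model(12)` returns `"LFM-2.5-8B"`, while `suggest_model(24)` returns `None` because the post lists nothing between 16GB and 96GB.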

@KyleHessling1 introduced Qwopus 3.5-Coder 4B, a tiny coding model that scored 43.5% on the patches it actually submitted (73 of 168) from a SWE-bench mini slice, or 32.5% once empty outputs that missed the required format are counted. It runs at 270 tokens per second at Q8 with multi-token prediction on a 5090, and with parallel requests during SWE-bench runs, aggregate throughput exceeded 500 tokens per second. That makes it viable not just for lightweight coding tasks on older hardware but also for swarm data cleaning and large dataset processing where throughput matters more than single-request intelligence.

Meanwhile @Hikari_07_jp announced that Step 3.7 NVFP4 with multi-token prediction would be released tonight Japan time, bringing the frontier model to lower precision formats for broader accessibility. The trio of posts underscores a clear trend: the gap between what runs locally and what requires cloud infrastructure is narrowing fast, and developers can now make genuine architectural choices rather than settling for whatever their hardware can barely handle.

Local AI Gets Serious About UX

Local AI has long suffered from a developer experience gap. The tools work, but reaching for them requires comfort with build flags, quantization schemes, and model format conversions. That's starting to change. @ggerganov announced that llama.cpp now has an official website with a single-line cross-platform installer providing a unified llama entrypoint for running and serving models. Existing GGUF models stored in the machine's HuggingFace cache are automatically available without re-downloading. The roadmap includes seamless integration with local-friendly third-party agents like Pi, positioning llama.cpp as the runtime layer beneath the coming wave of local agent applications rather than just an inference library for enthusiasts.

@thatboybenagain launched Harbour, a local-first Mac journal with on-device AI reflection and no subscription, priced at $35 for a limited time. "I wanted a journal with AI reflection, but I did not want my private thoughts uploaded into another cloud app," he wrote. It's a direct challenge to the assumption that AI features require cloud infrastructure, and it targets a use case, personal journaling, where privacy concerns are a genuine blocker for cloud-dependent alternatives.

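@DhravyaShah open-sourced what he describes as a "universal company brain": a system that connects natively to any agent, adds Git-like versioning and RBAC permissioning, ships connectors to companies' sync tools, and can run on-prem.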

Sources

Rohit @rohit4verse · May 29

Borris Cherny the creator of Claude Code doesn't write code. He doesn't even talk to Claude. He runs one Claude that prompts the rest. He drops it like a throwaway, but that's the whole game now. You stop being the worker and become the one who runs them. ↓read this today to master multi-agent system

rohit4verse @rohit4verse

You're Not Slow. You're Single-Threaded: A Complete Guide on Commanding 300 Agents from One Prompt

Garry Tan @garrytan · May 29

Everyone's bottleneck in voice AI is the same: retrieval. The agent thinks, network round-trips to a vector DB, and the magic dies. Moss runs search at sub-10ms (no hop). Open source. This is the layer voice agents were missing. Build on it June 6-7 at the YC office.

koomen @koomen

Come build agents that can finally hold a fluid conversation at the 24-Hour Conversational AI Hackathon, hosted by @usemoss at the YC Office, June 6-7. First place wins an interview with a YC partner: https://t.co/T9md5yyoF4

Matt Pocock @mattpocockuk · May 29

One thing people underrate about the smart zone: It's a great way to save on tokens. When you're in the dumb zone, sending 600K tokens on every request gets expensive FAST Yes, you get charged for cache hits. Less, but it still adds up.

Y Combinator @ycombinator · May 29

It’s never been easier to design your dream house. Draw a shape. Define your rooms. Set your constraints. @DraftedAI generates complete floor plans, elevations, and 3D home designs in seconds. Over the last month, 120,000 people generated 325,000+ home designs with https://t.co/XqC0LP5n3y.

Georgi Gerganov @ggerganov · May 29

llama.cpp now has an official website: https://t.co/vztdUpdBWL Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer. The installation provides a single unified `llama` entrypoint which you can use to run/serve models and interface with 3rd-party agentic applications. While oriented towards simplified user experience, the new `llama` application also provides all the advanced functionality of the existing llama.cpp tooling with which experienced users are already familiar. Also note that all GGUF models that you might have already downloaded with llama.cpp in the past will be automatically available to use without downloading again (they are stored in the common HF cache on your machine). We have many improvements in the pipeline both at the UX and at the engine level and we plan to iteratively ship new things over the coming months. One of the main focuses will be seamless integration with local-friendly 3rd-party agents (such as Pi). In the meantime, we’ll continue to listen for feedback from the community and adjust accordingly, so keep letting us know what you think and need.

Hunter Lovell @huntlovell · May 29

What if you could package workflows in the same way you do skills? And give those workflows the ability to determine how the harness works? This is something we've been thinking a ton about lately. All using the interpreter runtime we launched earlier this month! In practice this is very similar to how the new dynamic workflows work in claude code- if you're interested in how it works under the hood, here's a writeup on how we're approaching a very similar problem workflow orchestration for agents in general, including how to instrument huge scales of subagents to take on more complex work. (and h/t to @a1zhang who's work on RLM's is what inspired this all)

huntlovell @huntlovell

Building workflows for agents with Skills + Interpreters

Neal Chopra @nealchopra · May 29

A lot of people have been asking about our harness / approach - some thoughts: 1/ it’s fully open source on github! 2/ it is quite simple - and we think this is where harness engineering is heading. you no longer need elaborate scaffolding to force the model to reason in a prescribed way 3/ we initially included a verifier to check the executor’s work. it ended up being *more* accurate than the benchmark’s grader, but omitted it (you can't score above the ceiling set by the grader). we have a lot more to say on this. 4/ we were most excited by the performance uplift in sonnet (lighter model). it reflects a shift toward picking the model at the intelligence/cost pareto max for a task, not just the largest one. sonnet achieved near parity with opus in performance, while costing less than half.

nealchopra @nealchopra

Today, we’re sharing a new state of the art for computer use. Our system holds the two highest verified scores on OSWorld, the standard benchmark for AI agents that operate a computer like a person: 83.6% using Claude Opus 4.7 and 81.5% using Claude Sonnet 4.6. The human baseline is 72.4%. 🧵 1/7

Kyle Hessling @KyleHessling1 · May 29

Hello again, everyone! Welcome, Qwopus 3.5-Coder 4B! Lots of awesome model drops are coming out, so we've got so many great new candidates for fine-tuning and dataset generation. We're so pumped and have a lot of great experiments running currently! We've put together this significantly smaller coder model, Qwopus Coder 4B, and it seems to be impressive for something that could run well on most smartphones, or really fast on older GPUs. It scored a 43.5% on a 225 slice of swe bench mini for completed patches, 32.5% for all patches, including empties due to missing the specific format required by swe, but on the ones that it output patches, it performed surprisingly well at 73/168 patches submitted for 43.5% Bear in mind, this is a tiny 4b model with additional coding training and COT improvements. I was able to make a neon snake game (HF space link in comments to try) in just a few turns of the model. It's laser fast running at 270tps at Q8 with MTP on my 5090, with tons of headroom for concurrent instances! I was able to get over 500tps aggregate with parallel requests running SWE bench with it! It also shows improvement in @stevibe's BenchLocal agent and coding benchmarks! Check out the full results in the model card! If you want to do some simple HTML game coding at lightning speeds on older hardware or less VRAM, I strongly recommend playing with it! Or if you want an intelligent model to do some serious swarm data cleaning or large dataset processing, this could be an excellent option! Blessed to be here; you all are so enjoyable to engage with! Please let us know your thoughts in the comment section, and let us know what use cases jump out to you for a small 4b model like this one! https://t.co/mykGsjmESv

Dhravya Shah @DhravyaShah · May 29

oh shit i forgot to mention it's open source Yes. https://t.co/h2iFC5r8nm open source

DhravyaShah @DhravyaShah

I made a universal company brain. - connect to ANY agent natively - Git-like versioning and RBAC permissioning - connectors to all sync tools of companies - Dreams about your company - run on-prem. Free to start. it's live today. we've been using it for 6 months @supermemory https://t.co/xsrenSBHbY

ClaudeDevs @ClaudeDevs · May 29

With Opus 4.8, you can add system instructions mid-conversation without breaking the prompt cache. More cache hits means lower cost and latency for your API requests. https://t.co/42C7wqnLhD

Yi Ding -- prod/acc @yi_ding · May 29

So the super impressive thing about dynamic workflows that people are sleeping on is that it isn't "deterministic." It's literally just a prompt, albeit a fairly detailed one, teaching the agent to write a graph-like description in Javascript. A lot of people thought this kind of thing would be possible 3 years ago, but were too early. The feature currently has a lot of manual tuning (in the same way Deep Research did when it was first released), but it's still super impressive to see the dream become a reality.

dexhorthy @dexhorthy

someone hit me up about the new "claude dynamic workflows" feature, claiming "see, multi-agent works" But really, the launch of this feature proves the exact point that I made back in June of 2025, along with @walden_yan, @tobi, @karpathy, and many others: Deterministic workflows orchestrating small agent loops beats non-deterministic multi-agent or "agent soup" systems every dang time everything is context engineering

Hikari∣LocalLLM⚡ @Hikari_07_jp · May 29

We plan to release Step 3.7 NVFP4 with MTP tonight, Japan time. Stay tuned!

McKinsey & Company @McKinsey · May 29

Agentic orchestration layers allow AI agents, enterprise systems, and data connections to work together across functions. As cognitive bottlenecks shrink, that allows decision-making to speed up, coordination to improve and new operating models to form. https://t.co/0to4HC26cu https://t.co/zjLdSSZL8T

0xSero @0xSero · May 29

Best models I’ve seen this week for your hardware: if you have 8-16gb you have a competitive model finally!
4gb - 8gb:
- minicpm5: this model was built for agentic tool use on tiny machines: https://t.co/LvNnIDSh7u
- tops benchmarks in weight class
- extremely small
- great for using in projects with AI
- blazing fast
8gb - 16gb
Most exciting model
- LFM-2.5-8B: https://t.co/5SYi6D56FR
Frontier for vram:
- 8b moe with
- 1.5B active
- trained on 38T tokens (MASSIVE)
- 131k context
96GB - 128GB
- ds4flash either q2 or reap + q4 https://t.co/BaphZfWrwG
- or https://t.co/EAZB4bDYjA
- very strong agent
- logical pleasant to talk to
- good in Hermes
- fast
- high contexts for little vram
196gb+
Step-3.7-Flash: https://t.co/oaVf5wMILx
- 199B with 11B active (FAST)
- vision support!
- its predecessor was topping benchmarks for 3 months
- 256k context
- 150 tok/s on 6000s

Nick @nickbaumann_ · May 29

This has fundamentally changed how I use Codex - everything runs out of a single persistent thread (my "chief of staff") - anytime I start a new project or workstream, I have that thread spin up a new thread (because it's already found the context from slack, etc) - the CoS thread checks in on the project threads during heartbeats, and occasionally sends relevant updates from slack to that thread everything flows naturally to the top

guinnesschen @guinnesschen

If you ever get tired of managing your Codex threads, just let Codex manage itself! Codex can now create threads, search them, organize them, pin the important ones, and spin up worktrees for parallel tasks. https://t.co/fO6OVu0ZcE

Luis Sanchez @MrSanders · May 29

Thanks for sharing of the session @dillon_mulroy i learnt a lot. This is the analisis done with an agent for the prompts used. Curious to me when you drop everything and ask to rebuild the thing entirely.
**P1 — Trigger an architecture review (anchored)** I'm not happy with the patterns/design of @. Review its integration/composition with @. Study the patterns in and tell me where the design's locality/cohesion breaks down.
**P2 — Plan as iterable text (no report/file)** Don't generate a report or a file: give me the analysis as text in the message so we can iterate on it line by line right here.
**P3 — Structural artifact (the core) ⭐** For each of the options, sketch in pseudocode, concretely and concisely (these are sketches, not final code):
1. Type API / public interface of the seam
2. Call stack / call graph from entrypoint down to the leaf
3. Seams (where behavior is injected)
4. Adapters (production implementation vs test/in-memory)
**P3‑bis — Prod‑vs‑test call graph (the viral artifact)** Show me the final call graph in two versions —Production and Tests — so the shape is identical and the only thing that changes is the injected layers/adapters (real vs in-memory).
**P4 — Targeted review loop (revise, don't rewrite)** On "": . Apply only this and re-emit the affected section — don't rewrite everything.
**P5 — Consolidate into a spec** Consolidate everything we agreed on into a single, complete tech spec.
**P6 — Refine the ubiquitous language** What would be a better name for ? → Use and re-emit the full spec with that rename applied throughout.
**P7 — Encode a cross-cutting convention (and persist it)** Rule: don't use in ; instead . Encode this in AGENTS.md if it isn't already. Then re-emit the spec with that change.
**P8 — Hand off to implementation** Implement the spec at @ using red-green-refactor TDD. Start by bootstrapping . Hard constraints: no , no , no . When you're done, tell me what's left in ≤5 todos.

dillon_mulroy @dillon_mulroy

my "plans" largely look like pseudo code composed of mostly types/interfaces, how they compose, and their boundaries ive recently started including call stacks - been very helpful for both me and agents when implementing https://t.co/SLrYX3ywqc

thatboyben @thatboybenagain · May 30

I wanted a journal with AI reflection, but I did not want my private thoughts uploaded into another cloud app. So I built Harbour: a local-first Mac journal with private AI, no subscription, for just $35 for a limited time.

David Fowler @davidfowl · May 30

RT @thdxr: i have seen enough proof now that using a coding agent is a deep skill it's confusing because the people you see heavily using…

Peter Steinberger 🦞 @steipete · May 30

With GPT 5.5, /goal, autoreview and crabbox my prompts moved from ~30-60min to often 4-10h tasks and my confidence that I can it’s ready is much much higher. Yielding agents is a skill.