Dynamic Workflows Redefine Agent Architecture as Computer Use Agents Surpass Human Baseline at OSWorld
The conversation around AI agents shifted from "do they work?" to "how do we orchestrate them?" with Claude's dynamic workflows, self-managing Codex threads, and multi-agent delegation patterns dominating the day. A new OSWorld record saw agents beat human baseline, while the model ecosystem filled out from 4B edge coders to 199B MoE giants.
Daily Wrap-Up
Something shifted this week in how the AI community talks about agents. For months the debate was whether multi-agent systems even work, with skeptics pointing to fragile "agent soup" architectures that collapse under their own complexity. Today's posts suggest the conversation has moved past that binary. The question isn't whether to use agents. It's how to orchestrate them. Yi Ding captured this well, noting that Claude's new dynamic workflows aren't deterministic scaffolding but rather a prompt teaching the agent to write graph-like descriptions in JavaScript. The dream of flexible, non-brittle agent coordination that people imagined three years ago is becoming a working reality, even if it still requires careful tuning.
The model ecosystem also continued its march toward covering every conceivable hardware tier. From Qwopus's 4B coding model that runs at 270 tokens per second on consumer hardware, to Step 3.7's 199B parameter MoE with vision support, developers now have genuinely competitive options whether they have 4GB of VRAM or a rack of GPUs. Georgi Gerganov's llama.cpp launched a polished new website with a single-line cross-platform installer, signaling that local AI is thinking seriously about user experience for the first time. And in a milestone that would have seemed absurd a year ago, Neal Chopra's team hit state-of-the-art on OSWorld's computer use benchmark, beating the human baseline of 72.4% with both Claude Opus and Sonnet.
The most surprising detail of the day came from @rohit4verse, who revealed that Borris Cherny, the creator of Claude Code, doesn't write code or even talk to Claude directly. He runs one Claude that prompts the rest. It's a perfect distillation of where things stand: the people building these tools have already ascended to the meta-layer, orchestrating the orchestrators. The most practical takeaway for developers: stop treating your AI interactions as one-off chats and start designing them as persistent workflows with clear seams, handoffs, and hierarchical delegation. The developers getting outsized results are treating their AI tools less like assistants and more like a managed team.
Quick Hits
- @ycombinator spotlighted @DraftedAI, which lets users draw shapes, define rooms, and set constraints to generate complete floor plans and 3D home designs in seconds. Over 120,000 people have already generated 325,000+ designs.
- @garrytan flagged Moss, an open-source search layer for voice AI that runs retrieval at sub-10ms with no network hop to a vector DB. The 24-Hour Conversational AI Hackathon runs June 6-7 at the YC office.
Agent Architecture Breaks Through
The day's dominant theme was agent architecture in all its forms: how agents coordinate, how they manage themselves, and how they're evaluated. The spark was Claude Code's dynamic workflows, which @yi_ding described as deceptively simple. "It isn't 'deterministic.' It's literally just a prompt, albeit a fairly detailed one, teaching the agent to write a graph-like description in Javascript." This observation came in response to @dexhorthy's argument that the feature proves deterministic workflows orchestrating small agent loops beat "agent soup" every time. Ding's take was more nuanced: the feature requires manual tuning, similar to Deep Research at launch, but seeing the underlying dream become reality is impressive regardless.
@huntlovell pushed in a related direction, exploring what happens when you package workflows like skills and let those workflows determine how the harness itself works. Drawing explicit parallels to Claude Code's approach, his team uses an interpreter runtime to instrument large numbers of subagents for complex tasks. The insight is that workflow orchestration for general agents mirrors the same problems the Claude team is solving, and the solution involves letting the model define its own graph structure rather than imposing one from outside.
The meta-pattern was captured perfectly by @rohit4verse, who noted that Claude Code's creator "doesn't write code. He doesn't even talk to Claude. He runs one Claude that prompts the rest." This delegation pattern, where one agent manages many others, is becoming the default for power users. @nickbaumann_ demonstrated this with Codex, describing a "chief of staff" pattern where a single persistent thread spins up new project threads, checks in on them during heartbeats, and routes relevant context from Slack. "Everything flows naturally to the top," he wrote.
On the enterprise side, @McKinsey described agentic orchestration layers that allow AI agents, enterprise systems, and data connections to work together across functions, shrinking cognitive bottlenecks and enabling new operating models as coordination improves. And on the benchmark front, @nealchopra shared a new state of the art on OSWorld: 83.6% with Claude Opus 4.7 and 81.5% with Sonnet 4.6, both surpassing the human baseline of 72.4%. Their harness is open source and deliberately simple, reflecting a shift away from elaborate scaffolding. Perhaps most telling, Sonnet achieved near parity with Opus at less than half the cost, pointing toward a future where agent deployment means picking the right model at the intelligence-cost Pareto frontier, not defaulting to the biggest one available.
The Craft of Yielding Agents
If agent architecture was the "what," today's posts were equally focused on the "how." @steipete reported that with GPT 5.5, a /goal command, autoreview, and a tool called crabbox, his prompts have moved from handling 30-60 minute tasks to 4-10 hour autonomous runs, with much higher confidence in the output. His conclusion was succinct: "Yielding agents is a skill." This was reinforced by @davidfowl retweeting a post asserting that "using a coding agent is a deep skill," where the people who appear to use them effortlessly have simply put in the repetition.
@MrSanders shared a detailed multi-phase prompt architecture for driving agent-based code reviews, spanning eight phases from triggering the review through iterating on specifics to handing off to TDD implementation. The structure is notable for how it constrains the agent at each step. Phase 2 asks for analysis as text for line-by-line iteration rather than a file dump. Phase 4 requests targeted revisions instead of full rewrites. Phase 7 encodes cross-cutting conventions in AGENTS.md so they persist across sessions. It reads less like a prompt and more like a project management protocol adapted for non-human workers.
The economics matter too. @mattpocockuk pointed out that staying in the "smart zone" is an underrated way to save on tokens, because sending 600K tokens on every request in the "dumb zone" gets expensive fast, even with cache discounts. @ClaudeDevs highlighted a concrete improvement in Opus 4.8: system instructions can now be added mid-conversation without breaking the prompt cache, translating to more cache hits and lower cost and latency for API requests. Together these posts paint a picture of a maturing discipline where the best practitioners think about token budgets, cache strategies, and incremental refinement the way a senior engineer thinks about memory allocation.
Models for Every VRAM Budget
The model ecosystem continued its relentless expansion into every hardware niche. @0xSero published a practical hardware guide that reads like a restaurant menu for models. At 4-8GB VRAM, minicpm5 offers agentic tool use on tiny machines and tops benchmarks in its weight class. At 8-16GB, LFM-2.5-8B is an 8B MoE with only 1.5B active parameters trained on a massive 38T tokens with 131K context. At 96-128GB, quantized variants of larger models deliver strong agent performance with high context lengths for modest VRAM. And at 196GB and above, Step-3.7-Flash brings 199B parameters with 11B active, vision support, 256K context, and 150 tokens per second on 6000-series GPUs.
@KyleHessling1 introduced Qwopus 3.5-Coder 4B, a tiny coding model that scored 43.5% on completed patches from a SWE-bench mini slice, running at 270 tokens per second at Q8 with multi-token prediction on a 5090. With parallel requests during SWE-bench runs, aggregate throughput exceeded 500 tokens per second. That makes it viable not just for lightweight coding tasks on older hardware but also for swarm data cleaning and large dataset processing where throughput matters more than single-request intelligence.
Meanwhile @Hikari_07_jp announced that Step 3.7 NVFP4 with multi-token prediction would be released tonight Japan time, bringing the frontier model to lower precision formats for broader accessibility. The trio of posts underscores a clear trend: the gap between what runs locally and what requires cloud infrastructure is narrowing fast, and developers can now make genuine architectural choices rather than settling for whatever their hardware can勉强 handle.
Local AI Gets Serious About UX
Local AI has long suffered from a developer experience gap. The tools work, but reaching for them requires comfort with build flags, quantization schemes, and model format conversions. That's starting to change. @ggerganov announced that llama.cpp now has an official website with a single-line cross-platform installer providing a unified llama entrypoint for running and serving models. Existing GGUF models stored in the machine's HuggingFace cache are automatically available without re-downloading. The roadmap includes seamless integration with local-friendly third-party agents like Pi, positioning llama.cpp as the runtime layer beneath the coming wave of local agent applications rather than just a inference library for enthusiasts.
@thatboybenagain launched Harbour, a local-first Mac journal with on-device AI reflection and no subscription, priced at $35 for a limited time. "I wanted a journal with AI reflection, but I did not want my private thoughts uploaded into another cloud app," he wrote. It's a direct challenge to the assumption that AI features require cloud infrastructure, and it targets a use case, personal journaling, where privacy concerns are a genuine blocker for cloud-dependent alternatives.
@DhravyaShah open-sourced what he describes as a "universal company brain": a system that connects natively to any agent,
Sources
You're Not Slow. You're Single-Threaded: A Complete Guide on Commanding 300 Agents from One Prompt
Come build agents that can finally hold a fluid conversation at the 24-Hour Conversational AI Hackathon, hosted by @usemoss at the YC Office, June 6-7. First place wins an interview with a YC partner: https://t.co/T9md5yyoF4
Building workflows for agents with Skills + Interpreters
Today, we’re sharing a new state of the art for computer use. Our system holds the two highest verified scores on OSWorld, the standard benchmark for AI agents that operate a computer like a person: 83.6% using Claude Opus 4.7 and 81.5% using Claude Sonnet 4.6. The human baseline is 72.4%. 🧵 1/7
I made a universal company brain. - connect to ANY agent natively - Git-like versioning and RBAC permissioning - connectors to all sync tools of companies - Dreams about your company - run on-prem. Free to start. it's live today. we've been using it for 6 months @supermemory https://t.co/xsrenSBHbY
someone hit me up about the new "claude dynamic workflows" feature, claiming "see, multi-agent works" But really, the launch of this feature proves the exact point that I made back in June of 2025, along with @walden_yan, @tobi, @karpathy, and many others: Deterministic workflows orchestrating small agent loops beats non-deterministic multi-agent or "agent soup" systems every dang time everything is context engineering
If you ever get tired of managing your Codex threads, just let Codex manage itself! Codex can now create threads, search them, organize them, pin the important ones, and spin up worktrees for parallel tasks. https://t.co/fO6OVu0ZcE
my "plans" largely look like pseudo code composed of mostly types/interfaces, how they compose, and their boundaries ive recently started including call stacks - been very helpful for both me and agents when implementing https://t.co/SLrYX3ywqc