Agents Go Headless as Box CEO Rethinks Software Pricing and Local Inference Hits 130 tok/s on Consumer GPUs
The AI agent ecosystem dominated today's discourse, with Box CEO Aaron Levie laying out a new pricing framework for headless software and multiple projects pushing agent orchestration forward. Meanwhile, local inference got a serious speed boost from Luce DFlash and DeepSeek-v4-flash, and Cursor shipped both a security review agent and a deep dive into its agent harness engineering.
Daily Wrap-Up
The throughline today is unmistakable: agents are no longer a novelty; they are becoming the primary consumers of software. @levie's lengthy post about headless software pricing wasn't idle speculation. It was the CEO of a major enterprise platform publicly working through the economics of a world where agents outnumber human users. When he writes that "every platform that goes headless will need to adopt a consumption model," he's describing a tectonic shift that makes traditional per-seat SaaS licensing look quaint. The fact that this post landed on the same day Cursor shipped always-on security review agents and @gkisokay outlined a full research agent workflow with Hermes tells you we've crossed from "agents are coming" into "agents are here, now figure out the business model."
On the infrastructure side, local inference continues its march toward parity with cloud. Luce DFlash hitting 130 tokens per second on a single consumer GPU with a 27B model is the kind of benchmark that makes you rethink your API budget. @quxiaoyin switching from Claude Code Max to DeepSeek and Hermes, and reporting just $5 of spend in a week, is anecdotal, but it maps to a broader pattern: the cost floor for capable AI keeps dropping, and developers are noticing. @0xSero ranking inference stacks (SGLang > vLLM > ExLlamaV3 > llama.cpp) suggests this space is maturing enough that people are developing real preferences and workflows.
The most entertaining moment was @journoverax breathlessly listing 69 open-source repos that supposedly replace every Anthropic product for $0/month, framed as insider knowledge from "a mid-level engineer at Anthropic." The engagement bait was strong, but the underlying tension is real: the open-source community is building fast alternatives to commercial AI tools, and the gap is narrowing. The most practical takeaway for developers: add exclude-newer to your uv configuration today. As @tdhopper explained, a 7-day dependency cooldown blocks most malicious package uploads before they can affect your projects, and it costs you nothing but a slight delay on bleeding-edge packages.
Quick Hits
- @StockSavvyShay shared the AWS CEO's pushback on the "AI kills dev jobs" narrative, noting Amazon is hiring as many developers as ever while AI agents are "exploding" across industries.
- @ashwingop dropped Part 2 of the "Company Brain" series, diving into factual memory as the foundation layer for enterprise AI systems.
- @journoverax compiled 69 open-source repos positioned as free alternatives to commercial AI products, from Ollama replacing APIs to OpenHands replacing Claude Code.
- @mr_r0b0t showed off Hermes agent's architecture-diagram skill from @NousResearch, generating visual system diagrams on demand.
- @TheAhmadOsman wrote about building enough cloud-like infrastructure around a homelab that it functionally became "the cloud," a fun read for self-hosting enthusiasts.
- @wavedash launched browser-based gaming with no downloads, installs, or accounts required.
- @alexcooldev highlighted an indie iOS developer hitting $102k/month across 19 apps, and shared his own path to $25k/month with 3 apps as a case study in mobile persistence.
Agents Take Center Stage: From Frameworks to Business Models
The agent conversation has matured significantly. We're no longer debating whether agents will matter; today's posts were about how to build them, price them, and keep them from producing garbage. @levie's post was the anchor, offering a surprisingly detailed framework for how enterprise software pricing evolves when agents become the dominant users. His key insight is that consumption-based pricing will win for headless agent usage, while human seats survive but must include embedded API usage so agents can act on behalf of users. "If you don't do this, you're DOA," he wrote, which is blunt for the CEO of a publicly traded enterprise software company.
On the builder side, @gkisokay outlined a practical research agent architecture using Hermes v0.12.0: pick a domain, feed it sources, define your signal, save evidence, deliver daily briefs, and iterate on feedback. "Once you have a research agent, everything gets easier," he argued, positioning research as the foundational agent that feeds all others. Meanwhile, @KexinHuang5 introduced agent-managed sandboxes for scientific workloads, where AI agents autonomously orchestrate fleets of sandboxes to handle terabyte-scale processing, a pattern that points toward agents managing their own infrastructure.
The framework question is still unsettled, though. @samuelcolvin asked TypeScript developers what agent framework they actually use, listing Vercel AI, Mastra, and "Langpain-js" (the typo doing a lot of work there). His real question cut deeper: "Do you even use a framework, or pretend you don't need one and let the coding agent one-shot a slop micro-framework each time?" @Vtrivedy10 offered a more structured perspective, highlighting convergent design patterns across Cursor, LangChain, and others: tuning models with bespoke tools, using offline and online evals, and treating the context window as "a sacred boundary where computation happens." @bbssppllvv tackled a different angle entirely, releasing 2,000 DESIGN.md files from top products to help agents stop producing ugly UIs. And @xdotli shared required reading for anyone building agentic systems, focused on reducing entropy, which remains the core engineering challenge as these systems scale.
Local Inference Breaks New Ground
The local AI inference space had a banner day. @fahdmirza's post about Luce DFlash grabbed attention with hard numbers: 130 tokens per second on a single GPU, a 27B model running in 24GB VRAM, 128K context on consumer hardware, all through speculative decoding with a tiny draft model. "Raw C++ binary, zero Python in the engine," he noted, which helps explain the speed. The reported 3.4x improvement over standard autoregressive decoding is the kind of leap that changes cost calculations for small teams.
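For readers new to the technique, here is a minimal sketch of greedy speculative decoding, the general idea behind the numbers above. It is not Luce DFlash's implementation: the toy "models" are plain Python functions, and every name here (speculative_decode, target_model, draft_model) is hypothetical. A cheap draft model proposes a few tokens, the expensive target model checks them, and the longest agreeing prefix is kept. The sketch also omits the bonus token that full implementations emit when every proposal is accepted.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    passes = 0
    while len(out) < goal:
        # 1. The draft model cheaply proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model verifies the proposal. A real engine scores all
        #    positions in one batched forward pass; this toy calls the function
        #    per position but counts the whole check as a single pass.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token at the first mismatch
                break
            accepted.append(tok)
        out.extend(accepted)
    return out, passes


def target_model(ctx: List[Token]) -> Token:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_model, draft_model, [0], max_new=12)
    print(tokens == greedy_decode(target_model, [0], 12), passes)
```

The output always matches plain greedy decoding, because the target model has the final say on every token. The saving comes from needing fewer target passes than generated tokens whenever the draft model guesses well.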
@0xSero offered a clear stack ranking for Nvidia hardware users: "SGLang > VLLM > ExLlamaV3 > Llama.cpp." In a separate post, he praised DeepSeek-v4-flash as "incredibly reliable and capable, logical," recommending that companies spending over $100k per year on AI should buy 8-10 RTX 6000s and have workers blind-test models. @quxiaoyin went further, claiming to have abandoned Claude Code Max entirely: "I can't believe I stopped using Claude Code max and entirely use DeepSeek and Hermes. It's so fast, 3x faster for the same task. I spent $5 last week." Whether that holds up for complex tasks is debatable, but the sentiment reflects real pressure on cloud AI pricing. @ClementDelangue added fuel by showing Qwen3.6 running on a Raspberry Pi via llama.cpp, demonstrating that the floor for "usable local AI" keeps dropping.
Dev Tools Get Serious About Security and Quality
Cursor landed a one-two punch. First, they launched Security Review for Teams and Enterprise plans, offering always-on agents that check every PR for vulnerabilities and run scheduled codebase scans with findings posted to Slack. Then they published a deep dive into their agent harness, explaining how they make models "faster, smarter, and more token-efficient" inside Cursor and how they test improvements, monitor degradations, and customize for different models.
Security was a broader theme. @zodchiii noted that, according to Anthropic's CISO, 90% of the company's code is written by Claude, then walked through how to protect secrets during AI-assisted development. "Your .env file is the weakest link in your entire AI workflow," @zodchiii warned. @teej_m amplified @tdhopper's advice on adding a 7-day dependency cooldown using uv's exclude-newer setting, calling it essential for every Python project: "Do this today. Do not wait. This change will save your ass."
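As a rough illustration of the cooldown idea, the sketch below computes a cutoff timestamp seven days in the past. It assumes your uv version accepts an RFC 3339 timestamp for exclude-newer, whether through the `--exclude-newer` flag, the `UV_EXCLUDE_NEWER` environment variable, or the `exclude-newer` key under `[tool.uv]`. The helper name cooldown_cutoff is hypothetical. @tdhopper's post describes a rolling window configured directly in uv, so treat this as a pinned-timestamp alternative and check the uv documentation for the values your version supports.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cooldown_cutoff(days: int = 7, now: Optional[datetime] = None) -> str:
    """Return an RFC 3339 UTC timestamp `days` before `now` (default: current time)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Print a line that can be exported before running uv, for example in CI.
    print(f"UV_EXCLUDE_NEWER={cooldown_cutoff()}")
```

Versions published after the cutoff are ignored during resolution, so a package uploaded today would not be picked up until the cutoff moves past its publish date.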
On code quality, @mattpocockuk addressed the growing problem of AI-accelerated software entropy, offering guidance on de-slopping codebases ruined by AI-generated code. And @steipete shipped Crabbox 0.1.0, a tool for running agent test suites on remote Linux boxes when your local machine can't keep up: "Too many agents, too many test suites, one very tired Mac."
API Infrastructure: WebSockets and Prompt Optimization
Two posts pointed at infrastructure-level improvements for AI workflows. @badlogicgames discussed OpenAI's WebSocket mode for the Responses API, noting that the real speed improvement comes not from the transport layer but from "caching context OpenAI side, and only sending deltas as the context grows." He referenced @OpenAIDevs' claim of 40% faster end-to-end workflows and noted that his own tool has supported WebSocket mode since March, though delta-based context sending remains unimplemented.
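To make the "only sending deltas" point concrete, here is a toy sketch of the pattern. It does not use the OpenAI Responses API or any real transport: CachingServer, DeltaClient, and full_resend_cost are hypothetical stand-ins that only count how many context items cross the wire.

```python
from typing import Dict, List


class CachingServer:
    """Stand-in for a provider that keeps each session's context warm."""

    def __init__(self) -> None:
        self.sessions: Dict[str, List[str]] = {}
        self.items_received = 0

    def send(self, session_id: str, new_items: List[str]) -> int:
        """Append new items to the cached context; return the cached length."""
        context = self.sessions.setdefault(session_id, [])
        context.extend(new_items)
        self.items_received += len(new_items)
        return len(context)


class DeltaClient:
    """Tracks how much history the server already holds and sends only the rest."""

    def __init__(self, server: CachingServer, session_id: str) -> None:
        self.server = server
        self.session_id = session_id
        self.history: List[str] = []
        self.synced = 0  # number of history items the server has already seen

    def step(self, item: str) -> int:
        self.history.append(item)
        delta = self.history[self.synced:]
        cached_len = self.server.send(self.session_id, delta)
        self.synced = len(self.history)
        return cached_len


def full_resend_cost(turns: int) -> int:
    """Items sent if the whole history is resent every turn: 1 + 2 + ... + turns."""
    return turns * (turns + 1) // 2


if __name__ == "__main__":
    server = CachingServer()
    client = DeltaClient(server, "demo")
    for i in range(5):
        client.step(f"tool-result-{i}")
    print(server.items_received, full_resend_cost(5))
```

With deltas, the number of items sent grows linearly with the number of turns. Resending the full history each turn grows quadratically, which is why server-side caching matters more as agent loops get longer.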
@GptMaestro shared an explanation of GEPA, a technique that optimizes prompts before inference rather than cramming more into the context window. This represents a shift in thinking about prompt engineering: instead of adding more context, reduce and refine what you send. Both posts point toward a future where the infrastructure around model calls matters as much as the models themselves.
Sources
projects.md
The .env Setup That Keeps Claude Code From Leaking Your Secrets (Full Config Included)
What to Learn, Build, and Skip in AI Agents (2026)
Agent-managed sandboxes for scientific workloads
Company Brain, Part 2: Factual Memory
In the first piece, I argued that a real Company Brain needs three kinds of memory: factual memory, interaction memory, and action memory. Factual mem...
Seeing $100k+/month in revenue for the first time 🚀 Crying from happiness. It took me 2.5 years to reach this. From zero. Had to go through hell: account deletion, lawsuits, losses, debt, frozen accounts, countless mistakes. Just believe in yourself and don’t give up. You can work in public. Don’t be afraid to make mistakes. That’s how you find what works. And the most unusual part, you can scroll through my entire X. I documented every step. What I shipped, what I tested. You can watch the video about me. @adamlyttleapps made a great episode, and nothing has changed since then. I just keep hitting the same point, slowly but consistently, and it’s starting to pay off.
turns out not killing the prefix cache all the time and notnhaving a humongous set of tools and a massive system prompt is good for local model use. who'd have thunk. https://t.co/lWjRoikJeM
Our agent harness makes models inside Cursor faster, smarter, and more token-efficient. Here's how we test improvements to the harness, monitor and repair degradations, and customize it for different models. https://t.co/YIXcEZW6ud
Benchmarks are marathons, here’s how we 5x’d our browser performance
Add a 7-day dependency cooldown. uv's `exclude-newer` refuses any version published inside a rolling window. With 7 days set, today's malicious uploads would not be considered for resolution at all. Most malicious uploads are caught within that window.
69 Best Open-Source AI Repositories in April 2026
Hermes Agent v0.12.0 - “The Curator Release” https://t.co/i6uAAkuD2z
Exploring GEPA
My Homelab Is Technically the Cloud Now
Someone asked me recently if I’d duplicated my local homelab into the cloud. Not really. The more accurate answer is that I’ve built enough cloud-like...
⚙️ We made agent loops faster with WebSockets in the Responses API As Codex got faster, the bottleneck moved from inference to inefficient API calls WebSockets keep response state warm across tool calls, helping workflows run up to 40% faster end to end https://t.co/nFeUEdRdKt