Coding Agents Hit 60% of Merged PRs at Ramp While Local Inference Gets Surprisingly Fast on Consumer Hardware
AI coding agents are moving from experiment to production: Ramp's internal agent now authors the majority of the company's merged pull requests, and multi-agent review loops are becoming standard practice. Meanwhile, local inference on consumer hardware is having a moment, with 35B-parameter models running smoothly on MacBooks and NVIDIA's GB10s finding fans for always-on edge deployment. The Claude Code skills ecosystem continues to expand as Obsidian officially embraces AI agents for vault management.
Daily Wrap-Up
The throughline across today's posts is unmistakable: AI agents aren't prototypes anymore; they're coworkers. Ramp's internal coding agent "Inspect" now authors over 60% of the company's merged PRs. Peter Steinberger has Codex running on every commit, with review agents triggering fix agents in loops capped at five rounds. Dan Rosenthal is running an entire agency where Claude Code is the primary interface and 79 skills power daily operations. We've crossed from "can agents do useful work?" to "how do we manage fleets of agents doing useful work?" and the organizational implications are only starting to land.
On the hardware side, the local inference crowd is quietly winning converts. Harrison Kinsley's praise of NVIDIA's GB10s for their low power and silent operation, combined with reports of Qwen 35B running fast on MacBooks via optimized quantization, suggests that the "cloud vs. local" debate is becoming less binary. The practical reality is that many developers want always-on, private, zero-API-cost inference for routine tasks, and the hardware is finally catching up. Meanwhile, a fascinating deep dive on HBM economics from a semiconductor architect explains why this GPU memory arms race is structurally different from past DRAM cycles, tying token throughput directly to HBM size times bandwidth as a first principle.
The most entertaining moment was Jeffrey Emanuel's marathon post about "Dueling Wizards," a system where multiple AI models independently generate ideas, then score and critique each other's work in adversarial rounds. It's essentially debate club for LLMs, and GPT-5.5's own meta-analysis of why the approach works reads like a PhD defense. The most practical takeaway for developers: if you're building with coding agents, study Ramp's approach of embedding agent workflows into existing project management tools like Linear rather than creating separate agent-specific infrastructure. Context lives where work lives, and agents that can access structured project context outperform those operating in isolation.
Quick Hits
- @badlogicgames RT'd that Ghostty is leaving GitHub, with creator @mitchellh departing after 18 years on the platform. A notable move in the ongoing conversation about developer infrastructure independence.
- @Scobleizer flagged a writeup covering 199 pitches from YC W26 Demo Day, calling it the best demo day summary he's ever seen.
- @ashtom (GitHub CEO Thomas Dohmke) RT'd a video where two Anthropic engineers explain why most people use less than 10% of Claude's actual capabilities.
- @Alfred_Lin shared "A New Token Rule For Engineering Leadership," arguing all companies need to close the gap between teams getting massive AI leverage and those barely touching the tools.
- @heyconstella announced a research canvas designed for neurodivergent researchers, featuring visual search across 475M+ papers.
- @DSPyOSS RT'd a one-word endorsement of DSPy. That's it. That's the post.
Agents Take Over the Codebase
The agent-in-production story has shifted from aspiration to operational reality, and the numbers are getting hard to ignore. @karrisaarinen highlighted that Ramp's internal coding agent "Inspect" now authors over 60% of their merged pull requests, with the key insight being integration with Linear where "the work and context already live." This isn't a toy demo. It's a production system at a major fintech company that has fundamentally changed how code gets written and shipped.
The automation doesn't stop at code generation. @steipete described a recursive agent pipeline where "codex now runs on each commit we land, reviews it, and if a booboo is found, a new codex spins up and makes a PR for the fix. Then a review agent spins up. If an issue is found, another agent will fix the issues (up to 5 loops)." What's notable here is the self-limiting design with a five-loop cap, an acknowledgment that infinite agent recursion needs guardrails.
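The loop described in that post can be sketched in a few lines. This is a minimal illustration, not Steinberger's actual pipeline: `review_fix_loop`, `toy_review`, and `toy_fix` are hypothetical names, and the "agents" are plain functions standing in for real Codex runs. Only the five-round cap comes from the post.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_LOOPS = 5  # the cap mentioned in the post


@dataclass
class LoopResult:
    clean: bool  # True if the final review found no issue
    rounds: int  # number of fix rounds that were run


def review_fix_loop(
    commit: str,
    review: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    max_loops: int = MAX_LOOPS,
) -> LoopResult:
    """Review a commit; while an issue is found, apply a fix and review again."""
    rounds = 0
    issue = review(commit)
    while issue is not None and rounds < max_loops:
        commit = fix(commit, issue)  # hypothetical fix agent returns a new revision
        rounds += 1
        issue = review(commit)  # hypothetical review agent re-checks the fix
    return LoopResult(clean=issue is None, rounds=rounds)


# Toy stand-ins for the agents: the "commit" is a string and each "bug" is an issue.
def toy_review(commit: str) -> Optional[str]:
    return "bug" if "bug" in commit else None


def toy_fix(commit: str, issue: str) -> str:
    return commit.replace(issue, "ok", 1)


result = review_fix_loop("bug bug feature", toy_review, toy_fix)
print(result)  # LoopResult(clean=True, rounds=2)
```

The cap turns a potentially unbounded recursion into a loop with a known worst case. When the cap is hit with an issue still open, `clean` stays `False`, which is the natural point to hand off to a human.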
@dan__rosenthal took the concept furthest, describing an entire agency built around Claude Code as the primary interface, with 79 skills, 26 agents, and a "Company OS" stored in git. His team of 20 runs client operations through a system where "AI does the legwork, but humans ship." The infrastructure includes safety hooks gating 94+ risky operations and a self-improvement loop where performance metrics feed back into the system. Meanwhile, @derekmeegan released /browser-trace, a skill that gives agents full observability into browser sessions, dumping network requests, DOM content, and screenshots into a searchable filesystem. The trend is clear: the tooling layer around agents is maturing fast, and the teams investing in structured agent infrastructure are pulling ahead of those still experimenting ad hoc.
Local Inference Finds Its Sweet Spot
The local AI inference story isn't about beating cloud performance anymore. It's about carving out a distinct niche where always-on, private, zero-API-cost inference makes more sense than an API call. @Sentdex made the case for NVIDIA's GB10 units, noting they offer "very low power usage, make no noise, generate very little heat, take up little space. Always on and available. 50 t/s with plenty of very solid MoEs." This was a direct rebuttal to criticism that the GB10-based DGX Spark is "too slow to do anything real with," and the counterargument is compelling for developers who want persistent local inference rather than peak throughput.
On the software optimization side, @DataChaz reported running Qwen 35B locally on a MacBook with impressive speed, crediting Google's TurboQuant quantization scheme. The ability to run a 35B model on consumer hardware with "zero setup required" and full offline privacy represents a meaningful threshold. @MatthewBerman pointed to "Nightshift," a tool for autonomous ML research on Apple Silicon using the MLX framework, describing it as "AutoResearch but for MLX." The pattern here is the Apple Silicon ecosystem developing its own inference stack that's increasingly competitive. For developers who do repetitive coding tasks, local models that are always warm and never rate-limited may be more valuable than cloud models that are occasionally faster but require network calls and API management.
The Claude Code Skills Ecosystem Expands
Obsidian's official embrace of AI agents marks a significant moment for the knowledge management space. @obsidianstudio9 reported that Obsidian CEO Kepano personally created and released obsidian-skills for Claude Code, covering Markdown syntax, database operations, JSON Canvas editing, CLI vault operations, and web content ingestion. The Japanese-language post captured the significance well: Obsidian is now officially designing "with the premise that AI agents will manage your vault."
This dovetails with the broader Claude Code skills culture visible in today's posts. @mauriciord praised the /grill-me skill for helping understand feature implementations and explore design reviews with teammates before creating plans. @mattpocockuk recommended the related /grill-with-docs variant. These aren't general-purpose prompts. They're structured workflows that encode specific development practices into repeatable agent behaviors. The skills ecosystem is becoming a new distribution channel for developer best practices, and tool makers like Obsidian officially joining signals it's moving beyond the early-adopter phase.
AI Memory Hits Its Architectural Limits
The question of how AI systems remember things surfaced in two distinct but related conversations. @kimmonismus highlighted Engramme's "Large Memory Models," a new architecture designed specifically around human memory patterns rather than RAG or vector search. The founding team closed their Harvard lab to build it, bringing 160+ publications from Nature and ICLR. Whether the approach delivers on its promise remains to be seen, but the bet is that current memory solutions are architectural dead ends.
@rohit4verse offered a more granular critique, responding to LangChain CEO Harrison Chase's walkthrough of four agent memory approaches. The core objection: "All four assume the model is still holding the right tokens. It isn't. At token 4,096 the cache ran a silent eviction nobody wrote." The provocative claim that "first founder to write the eviction policy ships a 100B agent that remembers a person" frames memory management not as a feature but as the fundamental unsolved problem in agent development. Both posts point to the same gap: as agents take on longer, more complex tasks, memory architecture becomes the binding constraint.
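To make the "eviction policy" idea concrete, here is a toy sketch of a context window whose eviction is explicit rather than silent. It is an illustration only: `ExplicitWindow` is a hypothetical class, items are plain strings rather than real tokens or KV-cache entries, and nothing here reflects how any particular model or framework manages its cache.

```python
from collections import deque
from typing import Deque, List


class ExplicitWindow:
    """Fixed-capacity context that keeps pinned items and evicts the oldest others."""

    def __init__(self, capacity: int, pinned: List[str]):
        if len(pinned) >= capacity:
            raise ValueError("pinned items must leave room for recent items")
        self.capacity = capacity
        self.pinned = list(pinned)  # facts that must never be evicted
        self.recent: Deque[str] = deque()  # everything else, oldest first
        self.evicted: List[str] = []  # audit trail, so eviction is never silent

    def append(self, item: str) -> None:
        self.recent.append(item)
        # Evict the oldest unpinned items until the window fits again.
        while len(self.pinned) + len(self.recent) > self.capacity:
            self.evicted.append(self.recent.popleft())

    def contents(self) -> List[str]:
        return self.pinned + list(self.recent)


window = ExplicitWindow(capacity=4, pinned=["user_name=Ada"])
for item in ["t1", "t2", "t3", "t4", "t5"]:
    window.append(item)

print(window.contents())  # ['user_name=Ada', 't3', 't4', 't5']
print(window.evicted)     # ['t1', 't2']
```

The design choice worth noting is the split between pinned and recent items: once the policy is written down, what survives a long session is a decision rather than an accident.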
AI-Powered Debugging and the Vibe Coding Movement
Two posts captured different ends of the AI coding spectrum. @moofeez shared results from post-training Qwen3-Coder to use an actual debugger, pushing solve rates from 70% to 89% while cutting median turns to fix from 46 to 19. The model "reasons from execution, inspects live variables and call stacks, sets breakpoints, steps, and evaluates expressions." This represents a meaningful advance: AI models that don't just read code but interact with runtime state the way human developers do.
@naval released a podcast covering the cultural shift around vibe coding, with chapter titles ranging from "Vibe Coding Is a Video Game with Real-World Rewards" to "The Beginning of the End of Apple's Dominance." The most provocative segment, "Pure Software Is Uninvestable," suggests that if AI collapses the cost of building software to near zero, the defensibility of software businesses fundamentally changes. Whether or not you agree, the framing captures where the discourse has moved: from "will AI write code?" to "what happens to the entire software economy when it does?"
The HBM Bottleneck Explained
@labubu_trader shared an English translation of a deep semiconductor architecture analysis from colleague @fi56622380, and it deserves attention. The core thesis derives a first principle of AI inference economics: token throughput equals HBM size times HBM bandwidth. The analysis uses an airport shuttle bus analogy: cabin size represents HBM capacity (how many requests' KV caches fit at once), while door width represents bandwidth (how fast each token is generated). It argues that this cycle is structurally different from past DRAM cycles because every GPU generation needs a larger HBM size times bandwidth product to raise its token throughput, so demand "has been physically locked into exponential growth" and the supply-side dynamics change fundamentally. For developers tracking infrastructure costs, understanding this hardware constraint explains why inference pricing follows the curves it does and why edge inference on devices like GB10s represents a genuinely different cost structure rather than just a slower version of cloud.
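The arithmetic behind that thesis is simple enough to check by hand. The sketch below reproduces the Rubin NVL72 figures quoted in the translated article (1,920 simultaneous requests at 100 tokens/s, roughly 120 kW) and treats "HBM size times bandwidth" as a relative ceiling between GPU generations. The function names are illustrative, and the 1.25x and 1.6x growth ratios are hypothetical numbers chosen only to show how a 2x ceiling could be split across the two dimensions.

```python
def token_throughput(batch_size: int, tokens_per_s_per_request: float) -> float:
    """System throughput as batch size times average per-request token speed."""
    return batch_size * tokens_per_s_per_request


def per_megawatt(tokens_per_s: float, power_kw: float) -> float:
    """Normalize throughput to one megawatt of power draw."""
    return tokens_per_s * 1000.0 / power_kw


def relative_ceiling(hbm_size_ratio: float, hbm_bw_ratio: float) -> float:
    """Generation-over-generation growth of the HBM size x bandwidth product."""
    return hbm_size_ratio * hbm_bw_ratio


# Figures quoted in the source article for Rubin NVL72.
rubin_tokens_per_s = token_throughput(batch_size=1920, tokens_per_s_per_request=100.0)
rubin_tokens_per_s_per_mw = per_megawatt(rubin_tokens_per_s, power_kw=120.0)
print(rubin_tokens_per_s)         # 192000.0
print(rubin_tokens_per_s_per_mw)  # 1600000.0

# Hypothetical split: 1.25x capacity and 1.6x bandwidth give roughly a 2x ceiling.
print(relative_ceiling(1.25, 1.6))
```

In the article's framing, batch size is limited by how many KV caches fit in HBM and per-request speed is limited by HBM bandwidth, which is why the two ratios multiply rather than add.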
Sources
Ramp’s internal coding agent now writes 60%+ of their merged PRs. With Linear as the underlying layer for structured product context, it can take on issues and work them to completion. Here’s how they got there: https://t.co/TA4AJyk8PI https://t.co/TRPXRJcTUI
A New Token Rule For Engineering Leadership
All companies need to be closing the gap between (a) their teams getting massive leverage from AI tools and (b) those barely touching them. -- @lorenc...
@AlexFinn Spark is too slow to do anything real with. Only good for taking pictures it of for twitter clout. I know, I have two. https://t.co/ybjyrTErNE
How AI Actually Remembers (Full Guide)
"My Favorite Prompts," by Jeffrey Emanuel Prompt 1: The Idea Wizard "Come up with your very best ideas for improving this project to make it more robust, reliable, performant, intuitive, user-friendly, ergonomic, useful, compelling, etc. while still being obviously accretive and pragmatic. Come up with 30 ideas and then really think through each idea carefully, how it would work, how users are likely to perceive it, how we would implement it, etc; then winnow that list down to your VERY best 5 ideas. Explain each of the 5 ideas in order from best to worst and give your full, detailed rationale and justification for how and why it would make the project obviously better and why you're confident of that assessment. Use ultrathink."
Persistent memory is the Achilles heel of AI. Engramme’s Large Memory Models (LMMs) empower every app with persistent memory. Google solved search. OpenAI solved language. Engramme solved memory. Join beta: https://t.co/iN6ZgEVSxH https://t.co/LqPxweldnh
What I Learnt From 199 Pitches at the YC W26 Demo Day
AI Semiconductor Endgame 2026 (Part 1) New Token Economics Computing Paradigm Shifts from GPU Compute to HBM
This article starts from the essence of GPU architectural evolution to address a question the market has long worried about: why must each GPU's HBM memory demand grow exponentially, and why won't this exponential growth in HBM demand stall? It then derives the first principle of token economics under the current architecture:
token throughput = HBM size × HBM BW (bandwidth)
It also discusses why the GPU ceiling is determined by HBM's two dimensions of progress.
The topic of HBM cyclicality has long been controversial. Optimists argue that AI-driven demand is much greater than before, but the market mainstream still believes that previous up-cycles also saw 20%+ annual demand growth, so what's different this time? AI doesn't change the fact that HBM, like traditional DRAM, has commodity attributes. Once capacity expansion at the demand peak meets a downturn, history will repeat itself.
We can take the perspective of compute-chip architecture, start from first principles, and unpack and reason through this question: why this time is genuinely different.
History: The Era of CPU Compute
For a very long time, we lived in the era of CPU-dominated compute. The CPU's top-level KPI was performance (running faster), and so each generation of CPUs deployed every method imaginable to push benchmark scores higher. First it was rising clock frequencies, then it was architectural evolution: superscalar designs, and so on.
During this period, why didn't DDR need to advance technologically at high speed? DDR3 to DDR5 took a full 15 years. Because in this era, DDR's role was purely auxiliary, and only weakly so. By industry experience, even doubling DDR speed would generally only raise CPU performance by less than 20%.
Why did improvements in DDR bandwidth and speed matter so little? Two reasons:
1. CPUs designed all kinds of architectural tricks to hide DDR latency (superscalar designs, wider issue widths, massive ROBs and register renaming to extract parallelism and hide latency, L1 caches, L2 caches), all of which weakened the demand for DDR bandwidth and speed.
2. CPU workloads don't have particularly demanding bandwidth requirements. For most everyday workloads (say, opening a webpage), DDR bandwidth is severely overprovisioned. Even cloud workloads often look the same.
In other words, in the CPU era, DDR bandwidth and speed didn't really matter. There was virtually no difference between DDR4 and DDR5 except in a handful of games, and even the JEDEC standard advanced slowly.
On top of that, only a small portion of any given app needs to permanently sit in DDR. Whatever is needed can be paged in from the hard drive on demand. App size grew slowly, and so DDR capacity demand grew slowly as well. That's why, over the past decade, the average PC went from 7–8GB of DDR to about 23GB: only 3× growth in ten years.
This slow upgrade pace directly affected revenue. Capacity-based pricing was the main way of making money; speed improvements were just a technological upgrade that raised the unit price of capacity. With both of these dimensions advancing slowly, growth could only come from increases in PC/phone unit volumes.
So along both dimensions (bandwidth/speed and capacity), DRAM was always a “nice-to-have” appendage to the chip industry. The marginal utility of DDR upgrades was very low, and almost completely disconnected from the CPU era's top-level KPI.
The Paradigm Shift: GenAI's Top-Level KPI
When we entered the era of GenAI large models, the computing paradigm shifted, and the top-level KPI changed fundamentally. By the time GPUs evolved into AI inference engines, the top-level KPI was no longer compute alone (TOPS/FLOPS), as it had been for CPUs. It became the cost of a token. Specifically: overall token throughput per unit cost / per unit power.
A close second is token throughput speed, because in the agent era many tasks have become serial, and token output speed has become a critical bottleneck for user experience.
This is exactly why Jensen invented the concept of the AI factory: to produce the most tokens at the lowest cost, while pushing token throughput speed as high as possible.
In the AI training era, Jensen's economics were TCO (Total Cost of Ownership): the more GPUs you buy, the more you save. In the inference era, Jensen's token economics flip the logic: AI inference has very healthy gross margins, so the logic now becomes: the NVIDIA GPU is the GPU that produces the cheapest token in the world, so the more you buy, the more you earn.
The top-level KPI has become a Pareto frontier: along the two dimensions of token throughput and token speed, optimize as far as possible. Each generation of NVIDIA's token factory is essentially pushing the entire Pareto frontier up and to the right. This is the most important KPI of the AI inference era.
From Token Throughput to HBM: The Core Logic Chain
Below is the most important logical chain of this article: how to start from the exponential growth of token throughput and derive that the ceiling bottleneck lies in the exponential growth of HBM size and HBM speed.
In the era of single-GPU inference with single-thread batch size = 1, token throughput had only one dimension: HBM bandwidth speed. Higher bandwidth = higher token throughput.
But once we entered the NVL72 era, inference is no longer single-GPU. It is a system-level token factory composed of 72 GPUs + 36 CPUs, designed to fully saturate HBM bandwidth and compute simultaneously, in pursuit of the ultimate token throughput.
Token throughput growth depends on two things: the number of requests batched simultaneously × the average token speed per request. That is: batch size × token speed.
Take Rubin NVL72 as an example. At an average token speed of 100 tokens/s, processing 1,920 simultaneous requests yields a token throughput of 192,000 tokens/s. A Rubin NVL72 draws roughly 120kW (0.12MW), so per MW it can handle 1.6M tokens/s.
So we need to find ways to push both parameters up: batch size and average token speed. Their product is our top-level KPI, token throughput.
Parameter 1: Batch growth (bottleneck is HBM size)
Every request in the batch carries its own KV cache, which has to live in HBM, with sizes ranging from a few GB to tens of GB. Because hot KV cache must be read at high frequency and high speed at any moment, it must reside in HBM. For a model with, say, 80 layers, every token generation step requires reading the KV cache 80 times from HBM.
As batch size grows, hot KV cache grows linearly. And because the hot KV cache for every request in the batch must sit in HBM, HBM size must grow linearly with batch size.
Like an airport shuttle bus: the gate wants to move passengers to the plane as fast as possible. If HBM size is small, the shuttle is small, so you have to make extra trips.
Conclusion: batch size growth bottlenecks on HBM size growth.
Parameter 2: Average token speed per request (bottleneck is HBM bandwidth)
The decode-phase speed of a large model bottlenecks on HBM bandwidth, because every token generated requires reading the activated weights and KV cache many times over. The emergence of LPUs has, in cases where batch size isn't very large, moved the activated weights portion onto SRAM, but every generated token still requires many reads of the KV cache from HBM.
The higher the HBM bandwidth, the faster each token is generated, in essentially linear correspondence.
Like the airport shuttle bus: HBM bandwidth is like the width of the door. Wider doors mean passengers board faster.
The rest of the GPU's configuration is essentially adapted to support batch growth and to keep token compute speed in step with HBM growth. In some cases the GPU even spends excess compute to recover effective bandwidth (e.g., bandwidth compression techniques).
To return to the shuttle bus analogy:
- Shuttle bus cabin size = HBM size (capacity): determines how many passengers can fit at once (i.e., how many requests' KV caches can sit in HBM simultaneously). Bigger cabin = more passengers (higher batch size) per trip. If the bus is too small, moving 100 people takes two trips, and total throughput suffers.
- Shuttle bus door width = HBM bandwidth: determines how fast passengers get on and off. A wide door, and everyone piles on at once (decode/token generation is fast). A narrow door, and even with a giant cabin, people queue up and most of the time is spent boarding.
- Passenger throughput = cabin size × door-width-determined boarding speed.
At this point, we've logically derived the first principle of token-economics hardware demand:
Token throughput = HBM size × HBM bandwidth
The top-level KPI of the AI inference era is highly dependent on progress along both HBM dimensions. If we want to maintain 2× token throughput growth per generation, that means each generation of single GPU must grow HBM size × HBM BW speed by 2×!
This is the first time in history that HBM memory size can influence the top-level KPI, token throughput.
To validate this thesis, we can put NVIDIA's token throughput from A100 to Rubin Ultra on the same chart as HBM size × HBM BW speed. What you find is that the two curves track each other startlingly closely on log axes.
HBM size × speed actually grows even faster than token throughput, which makes sense, because HBM defines the ceiling, and in practice utilization of that ceiling is very hard to push to 100%. Even if HBM size × HBM speed grew by 1,000×, with the supporting compute and architecture, it would be very hard to wring out the full 1,000× of headroom.
This curve isn't a coincidence. It's the necessary solution of system optimization: throughput = batch × speed. This is the unavoidable first principle of token factory economics.
What about software? Won't software optimization reduce bandwidth demand? Reduce HBM demand?
This is an independent dimension from hardware. It's like asking: if software on a CPU runs faster after optimization, does that mean the CPU doesn't need to advance for ten years? After all, software is faster now. If that were the case, would CPU vendors still make money?
For a CPU vendor to survive, there's only one path: in standardized benchmarks, ignoring software optimization, every new CPU generation must score higher. Otherwise it doesn't sell.
GPUs are exactly the same. How well software is optimized, and the requirement that the GPU's own token-throughput KPI must improve dramatically every year, are two separate things. As long as token demand keeps growing, the pursuit of higher token throughput will not stop, and so neither will the pursuit of higher HBM size × HBM speed.
If HBM size and HBM speed were to slow down, Jensen would personally fly to the Big Three and pressure them to accelerate, because that is his GPU ceiling. If the ceiling stops rising, can his GPU still sell?
Of course, NVIDIA also needs to wrack its brains to extract performance beyond the HBM ceiling through heterogeneous architectural angles. The LPU is a great example: it improved the Pareto frontier substantially from a different angle (the right-hand high-token-speed portion).
HBM memory has now bid farewell to that old era of drifting with the tide. On this one-way road paved by exponential demand, it has, in something close to a destined fashion, walked onto the central stage of the industry's epic.
When the inference paradigm's first principles evolve to this point, as long as Jensen still wants to sell GPUs, HBM must double, and it must double every generation. This is endogenous pressure from the supply side. It has nothing to do with AI demand, nothing to do with macro cycles, and nothing to do with the moods of the hyperscalers.
The only remaining question is this: when demand has been physically locked into exponential growth, will the three players on the supply side, as they have for the past thirty years, once again drag themselves back into the mire of the cycle by their own hands?