Claude Desktop Opens Third-Party Inference as Speculative Decoding Hits 154 tok/s on Consumer GPUs

April 24, 2026 · 23 sources

The Claude ecosystem dominated today's discourse with third-party inference support quietly appearing in Claude Desktop, new skills marketplaces, and creative prompt hacks. Meanwhile, speculative decoding benchmarks on a single 4090 showed a 6x speedup for Qwen 3.6 27B, and industry observers mapped out a diverging future where big labs go enterprise while open-source eats the consumer market.

Daily Wrap-Up

Today's feed painted a picture of an AI ecosystem splitting into distinct layers, each with its own economics and winners. The biggest undercurrent was the Claude Code skills explosion: from ad campaign management to code review automation to Vercel's official skills marketplace, the ecosystem around Claude's coding tools is maturing fast. But the real eyebrow-raiser was @PawelHuryn's report that Claude Desktop now quietly supports third-party inference providers, including OpenRouter and local models via LiteLLM, with no official acknowledgment from Anthropic more than 20 hours later. Whether an intentional soft launch or a premature release, it points to a world where the client application and the model are fully decoupled.

On the local inference front, the numbers are getting hard to ignore. @outsource_ reported a roughly 6x speedup on Qwen 3.6 27B using speculative decoding with a small draft model, pushing a single 4090 from 26 to 154 tokens per second. That's fast enough to make local models genuinely competitive for interactive coding workflows. Add @neural_avb's thesis that open-source labs are about to deliver Sonnet-class models at a tenth of the price, and a Red Hat paper on "harness engineering" reporting 5%+ reliability gains from better infrastructure, and the message is clear: the value is shifting from raw model capability to the surrounding tooling and deployment stack.

The most entertaining moment was @housecor's teammate who, denied access to Claude's premium "ultrareview" feature, simply told Claude to "just do what ultrareview does." Claude obligingly spawned five parallel review agents and produced a comprehensive report. It's the kind of prompt-engineering judo that makes you wonder how much of the premium tier is just better prompting. The most practical takeaway for developers: invest time in your harness, not just your model choice. Whether it's speculative decoding for local inference, skills files for Claude Code, or proper orchestration for agent workflows, the environment your AI works in now matters as much as the AI itself.

Quick Hits

@elonmusk announced Cybercab has started production, marking Tesla's entry into purpose-built autonomous vehicle manufacturing.

@neil_xbt highlighted a free 2-hour course from Andrej Karpathy that covers AI fundamentals from scratch, with no frameworks or libraries.

@chiefofautism inverted OpenAI's 1.5B privacy-masking model so that it extracts PII instead of hiding it, returning structured spans with character offsets in place of masks. A vivid reminder that safety features can be turned around.

@Steve_Yegge published "Welcome to Gas City," announcing the v1.0.0 launch of an MIT-licensed enterprise SDK for building orchestrators using the MEOW stack.

@jenzhuscott spotlighted Tencent's Chief AI Scientist Yao Shunyu, who emphasized co-designing models with diverse products rather than chasing open benchmarks.

@lukepierceops pushed back on the "sell 45-minute audits for $1,000" trend, detailing a real audit process that takes 1.5-2 weeks with 5-10 stakeholder calls and proper deliverables.

@itsolelehmann shared a free Claude skill based on the Minto Pyramid, a 1970s McKinsey content structuring framework adapted for AI-era writing.

Claude Code Skills & Ecosystem

The Claude Code ecosystem is undergoing a Cambrian explosion of skills, plugins, and workflow integrations. What started as custom instructions has matured into a genuine platform play, with Vercel launching an official Skills Marketplace and practitioners sharing battle-tested configurations for everything from ad campaigns to code review. The tooling layer around Claude is becoming as important as the model itself.

@housecor shared a revealing anecdote about emergent capability: "Teammate wanted to try Claude's new expensive ultrareview but didn't have access. So he told Claude 'just do what ultrareview does'... It spawned 5 parallel agents: Security, Correctness, Conventions, Test coverage, Architecture. Then it generated an impressive report." The fact that Claude can approximate its own premium features from a natural language description raises interesting questions about the value of packaged workflows versus raw model access.
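The fan-out pattern in that anecdote is easy to picture in code. The following is a minimal sketch, not a description of how ultrareview or Claude actually works: `run_reviewer` is a hypothetical stand-in for a single review agent (a real one would call a model), and only the five area names come from the post.

```python
from concurrent.futures import ThreadPoolExecutor

# The five review areas named in the post.
REVIEW_AREAS = ["Security", "Correctness", "Conventions", "Test coverage", "Architecture"]


def run_reviewer(area: str, diff: str) -> str:
    """Hypothetical stand-in for one review agent; a real one would call a model."""
    return f"{area}: reviewed {len(diff.splitlines())} changed lines"


def review(diff: str) -> str:
    """Run one reviewer per area concurrently and join the findings into a report."""
    with ThreadPoolExecutor(max_workers=len(REVIEW_AREAS)) as pool:
        # pool.map preserves the order of REVIEW_AREAS in its results.
        findings = list(pool.map(lambda area: run_reviewer(area, diff), REVIEW_AREAS))
    return "\n".join(findings)


report = review("+ added line\n- removed line\n")
print(report)
```

Each reviewer sees the same diff and reports independently, so the only coordination needed is collecting the results in a fixed order.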

On the commercial side, @MichLieben offered to share "the Claude Code skills we use to manage $300k/mo in ad spend at ColdIQ. 4X ROAS on $1M+ spent." These skills handle bulk edits across platforms, creative fatigue detection, and bid adjustments from the terminal. Meanwhile, @alvinsng's opinionated "no useEffect" skill landed on Vercel's Skills Marketplace through Factory's official skills pack, and @trevin reported saving nearly 400 million tokens in a week using RTK, a token optimization tool. The pattern is clear: the Claude Code ecosystem is transitioning from individual experimentation to shared, productized tooling. The developers who build reusable skills are creating a new category of infrastructure.

Local Inference & Model Optimization

The local AI movement hit a notable milestone today, with concrete benchmarks showing a 27B-parameter model running at interactive speeds on a single consumer GPU. The combination of speculative decoding, better quantization, and rapidly improving open-source models is narrowing the gap between cloud and edge inference.

@outsource_ posted numbers that turned heads: "My 4090 went from 26 to 154 tok/s on Qwen 3.6 27B. Same GPU. Same Q4_K_M. No FP8, no extra quant. The unlock: ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model. 85% acceptance rate." That's roughly a 6x improvement (154/26 ≈ 5.9) without any hardware changes, purely from software-level inference optimization. For context, 154 tokens per second on a 27B model is fast enough for real-time interactive coding on a single consumer GPU.
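A quick sanity check on those numbers, as a sketch under stated assumptions. It uses the standard simplified model from the speculative decoding literature, in which each drafted token is accepted independently with probability a, so a verification pass over k drafted tokens yields (1 - a^(k+1)) / (1 - a) tokens on average. The draft lengths below are hypothetical (the post does not state one), and the model ignores the cost of running the draft model, so it is an upper bound on speedup rather than a prediction.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes every drafted token is accepted independently with the same
    probability; each pass always yields at least one token.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)


baseline_tps = 26.0      # reported tok/s without speculative decoding
speculative_tps = 154.0  # reported tok/s with the draft model
observed_speedup = speculative_tps / baseline_tps

for draft_len in (4, 8, 16):  # hypothetical draft lengths, not from the post
    tokens = expected_tokens_per_step(0.85, draft_len)
    print(f"draft_len={draft_len}: {tokens:.2f} tokens per verification pass")

print(f"observed speedup: {observed_speedup:.2f}x")
print(f"ceiling at 85% acceptance: {1 / (1 - 0.85):.2f} tokens per pass")
```

Under this model an 85% acceptance rate caps out near 6.7 tokens per pass, so the reported 5.9x sits close to the ceiling and would only be reachable with fairly long drafts and a very cheap draft model.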

@TheAhmadOsman urged newcomers to read hardware guides before purchasing, reflecting a maturing community that's moved past the "just buy a 4090" phase into nuanced discussions about memory bandwidth, quantization tradeoffs, and model-hardware matching. @davis7 added a complementary insight from the API side, insisting developers should use models on low reasoning settings rather than medium or high, suggesting that many tasks are drastically over-served by default configurations. These threads point to a broader realization: optimization at the inference and configuration layer can deliver more practical value than waiting for the next model release.

Harness Engineering & Agent Orchestration

A new discipline is crystallizing around the idea that the scaffolding around AI models, what Red Hat is calling "harness engineering," deserves as much rigor as the models themselves. Today's discourse featured multiple angles on this thesis, from academic papers to production workflow architectures.

@JeremyCMorgan summarized Red Hat's position: "The environment an AI works in matters as much as the weights: integrating telemetry, repos, and testing constraints into a single deterministic orchestrator measurably moves code generation reliability by 5%+. You cannot prompt-engineer your way out of bad infrastructure." That last line could be the motto for this emerging field. @odysseus0z took it further with a provocative idea: "All the harness/skills etc should be compiled/generated by model hill climbing against eval/task itself... when new model comes out you just need to let the new model recompile/optimize it."
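To make the hill-climbing idea concrete, here is a toy sketch, not anything from the post or from a real tool. The harness is reduced to two hypothetical integer settings (`retries`, `examples`), `evaluate` is an invented stand-in for running an eval suite, and the fixed plus-or-minus-one `neighbors` step stands in for a model proposing edits to its own harness.

```python
def evaluate(harness: dict) -> float:
    """Hypothetical eval: a toy score that peaks at 3 retries and 2 examples."""
    return -abs(harness["retries"] - 3) - abs(harness["examples"] - 2)


def neighbors(harness: dict):
    """Yield harness variants that differ by one step in a single setting."""
    for key in harness:
        for delta in (-1, 1):
            candidate = dict(harness)
            candidate[key] = max(0, candidate[key] + delta)
            yield candidate


def hill_climb(harness: dict, evaluate):
    """Greedy hill climbing: keep any change that raises the eval score."""
    best, best_score = harness, evaluate(harness)
    while True:
        improved = False
        for candidate in neighbors(best):
            score = evaluate(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
        if not improved:
            return best, best_score


best, score = hill_climb({"retries": 0, "examples": 0}, evaluate)
print(best, score)
```

When a new model ships, the same loop could in principle be rerun with that model behind `evaluate`, which is the "recompile" step the post describes.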

The most concrete implementation came from @mattpocockuk, who described a complete agent-driven development pipeline: Slack commands trigger triage, planning agents create dependency DAGs across branches, implementer agents work in parallel with Slack thread updates, and periodic review sweeps catch architectural drift. Built on Sandcastle and Vercel's Chat SDK, it represents the kind of integrated orchestration that the harness engineering papers are calling for. @dexhorthy endorsed Matt Pocock's accompanying talk on software fundamentals, noting it "captures so much between-the-lines stuff WRT crafting good software that I have struggled to articulate." The message across all these posts is consistent: the next competitive advantage isn't a better model, it's a better system around the model.
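The planning step, a DAG of branches with dependencies that implementer agents work through in parallel, can be sketched with Python's standard-library `graphlib`. This is an illustration only: the branch names are hypothetical and nothing here reflects how Sandcastle actually schedules work.

```python
from graphlib import TopologicalSorter

# Hypothetical branches: each key depends on the branches in its set.
branch_deps = {
    "feat/api-client": set(),
    "feat/auth": set(),
    "feat/dashboard": {"feat/api-client", "feat/auth"},
    "docs/update": {"feat/dashboard"},
}


def parallel_waves(deps: dict) -> list:
    """Group branches into waves; every branch in a wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # branches whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock its dependents
    return waves


waves = parallel_waves(branch_deps)
for number, wave in enumerate(waves, 1):
    print(f"wave {number}: {wave}")
```

Re-running the planner with new issues amounts to rebuilding `branch_deps` and recomputing the waves, which matches the "whole DAG gets recomputed" behavior described in the post.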

AI Market Dynamics & the Open Source Squeeze

Several posts today mapped out the strategic landscape for AI labs through the rest of 2026, with a consensus forming around a fundamental market split between enterprise closed-source and increasingly capable open-source alternatives.

@neural_avb laid out a four-part thesis: "Big labs are gonna push expensive bigger closed-source models directly to big tech... Open source labs are making comparable coding models now... anybody got a clear shot here to deliver a Sonnet or a GPT5.2 at 1/10th pricing... Local models are gonna go crazy because people will figure out speculative decoding + kv cache quantization." This framework, posted as a quote-tweet of Sam Altman promoting Codex enterprise rollouts with NVIDIA, highlights the tension between OpenAI's enterprise push and the open-source models nipping at its heels.

The business opportunity in this shift is already being captured. @AndrewWarner profiled a consultant hitting $4.5M ARR helping SMBs implement AI, while @Zephyr_hg teased that "the actual money is one layer deeper, in the skill nobody's filming courses about yet." The implication across these posts is that as models commoditize, value accrues to those who can deploy, integrate, and orchestrate them effectively, whether at the enterprise consulting level or in the tooling infrastructure layer.

Claude Desktop's Quiet Third-Party Inference Support

Perhaps the most consequential development today flew under the radar: Claude Desktop apparently now supports third-party inference providers, and Anthropic hasn't said a word about it.

@PawelHuryn flagged the silence: "20+ hours later and @AnthropicAI + @claudeai + official accounts still haven't said a single word about this. This is huge. Full third-party inference support for Cowork & Code in Claude Desktop: OpenRouter, Local models via LiteLLM, Any compatible gateway. Is this a bug? Was it released prematurely?" If intentional, this represents a major strategic shift: decoupling the Claude Desktop experience from Anthropic's own inference would effectively turn the app into a model-agnostic working environment. If accidental, it still suggests the architecture was built with this direction in mind. Either way, it fits the broader trend of the day: the value is moving from models to the systems built around them.

Sources

Luke Pierce @lukepierceops · Apr 23

Everyone on X is saying to sell 45-minute audits for $1,000. You can't actually audit a business in 45 minutes. Here's how we actually run one: First, we identify every stakeholder we need to talk to. The founder, the head of ops, the sales lead, the person running fulfillment, the admin doing data entry, etc. Every role that touches the workflows we're auditing. Then we schedule calls with each of them. Not one 45 minute call. Multiple calls over the first week. 60-90 minutes each. The founder never knows where the real bottlenecks are. The people closer to the ground do. Once the calls are done, we map every workflow end to end. Intake, fulfillment, sales, ops, reporting. Every handoff, every approval, every place data gets re-entered. Then async follow-ups with whoever we need to clarify gaps. Loom walkthroughs of their actual tools. Screenshots of the spreadsheets in their shadow systems. From there we quantify the pain. Hours per week, dollars per month, throughput lost. Real numbers we can show them. Then we architect the solution. Database, workflows, AI agents, integrations. Priority-ranked by ROI. Finally we package the deliverable. Process maps, architecture diagrams, implementation plan, ROI calculations, timelines. Then a 60 minute walkthrough to present it. That's 1.5/2 weeks of work minimum. 5-10 calls. A real deliverable the client can hand to any developer.

Michel Lieben @MichLieben · Apr 23

I'm giving away the Claude Code skills we use to manage $300k/mo in ad spend at ColdIQ. 4X ROAS on $1M+ spent. Ivan, our head of growth, built them off 300+ hours running ad campaigns for our clients. They run Google, Meta, and LinkedIn ads from the terminal in plain English: → bulk edits across platforms → custom audiences from CRM lists → creative fatigue detection before CTR dips → bid adjustments at scale → performance audits across periods Reply "ads" and I'll send the full repo. Must be following.

Eric @Ex0byt · Apr 23

We now have amazing local models (and more will come). We need better harnesses. Stop listening to the blue-pill shills. Read the paper and audit your own harnesses. Paper: https://t.co/tNqR7KOmBB Code: https://t.co/y0iHaxvS77 (Credit: Stanford CS faculty/Iris Lab)

Andrew Warner @AndrewWarner · Apr 23

That guy with a big smile just hit $4.5M ARR. His business: helps SMB implement AI. I grilled @cheneypiano about his playbook: + He DM'd CEOs on LinkedIn + Within 72 hours, he got his first sale: $15k + His pitch: CEOs get ai strategy. Their teams learn AI + He built a .org site, which gives him credibility + @Replit does his SEO/GEO, which lands him clients + He shifted to subscriptions to get steady revenue Even at $15k/month, every 6 clients add $1M ARR. (He's increasing prices.) Next: He's setting clients up with his version of NeoClaw (NVIDIA's safer OpenClaw). Yes, I simplified his story. But he goes through it in detail in our interview. (YouTube link below.)

Matt Pocock @mattpocockuk · Apr 23

New flow:
1. /triage command in Slack (#sandcastle)
2. Creates a discussion thread with agents for each issue in the repo needing triage
3. Resolve discussions one-by-one until issues are marked ready for the agent
4. /implement in Slack
5. Planning agent creates a DAG of all branches based on PR's, with dependencies between PR's if needed
6. Implementer agent works on them all in parallel, each one gets a Slack thread for updates
7. I can re-run /implement any time with new issues for maximum concurrency, whole DAG gets recomputed
8. I review the PR's and merge while implementation goes on
9. Periodically run /review on the whole codebase to look for architectural improvements/doc drift

Driven by Sandcastle and Vercel's Chat SDK. Slack, Claude Code, and the sandboxes are all swappable dependencies. Crazy world we're living in

Paweł Huryn @PawelHuryn · Apr 23

20+ hours later and @AnthropicAI + @claudeai + official accounts still haven't said a single word about this. This is huge. Full third-party inference support for Cowork & Code in Claude Desktop: → OpenRouter → Local models via LiteLLM → Any compatible gateway Is this a bug? Was it released prematurely?

PawelHuryn @PawelHuryn

This is wild. You can now use Claude Cowork with any LLM - GPT, Grok, Gemma, MinMax. Point it directly at: https://t.co/ikBnGt4MAI https://t.co/82mXeiaOJB

Alvin Sng @alvinsng · Apr 23

My opinionated "no useEffect" skill is now @vercel official! Thanks to @rauchg for the nudge to share these

EnoReyes @EnoReyes

Factory is now an official provider on the @vercel Skills Marketplace. We have added a fresh revamp of our skills pack, including the legendary "no useEffect" skill. Also find dozens of other skills covering frontend design, security reviews, full-stack playbooks, and more. And we're not stopping. Expect a steady stream of new skills landing soon. Try us with one command: npx skills add factory-ai/factory-plugins

George @odysseus0z · Apr 23

Hot take: all the harness/skills etc should be compiled/generated by model hill climbing against eval/task itself (Lopopolo's agent gym idea). That way when new model comes out you just need to let the new model recompile/optimize it. of course it all goes back to @lateinteraction

_lopopolo @_lopopolo

Go forth and let it rip fam. And take some time to see how the new model does with your skills and local set up! I usually take a week or so to tweak and get a feel for this. The model is still my lil guy but it is a different lil guy. Every time.

Cory House @housecor · Apr 23

Teammate wanted to try Claude's new expensive ultrareview but didn't have access. So he told Claude "just do what ultrareview does"... Result? It spawned 5 parallel agents: 1. Security 2. Correctness 3. Conventions 4. Test coverage 5. Architecture Then, it generated an impressive report categorized by these 5 areas. Slick. And available to anyone using Claude.

AVB @neural_avb · Apr 23

Direction of AI mid through late-2026:
1. Big labs are gonna push expensive bigger closed-source models directly to big tech. The moat will shift away from consumer markets coz OSS models are getting too good to compete at current price point. Plus big labs got more money to make directly going B2B.
2. Open source labs are making comparable coding models now. They lack marketing exposure, but it will be impossible to keep serving (example) Opus at the ridiculous token price Anthropic is. Qwen, Kimi, Minimax, GLM, etc... anybody got a clear shot here to deliver a Sonnet or a GPT5.2 at 1/10th pricing, completely agentic-pilled with coding and tool calls.
3. Local models are gonna go crazy because people will figure out speculative decoding + kv cache quantization to make models run fast on-device. If Qwen 3.6 27B is any indication, local coding models will be a thing soon enough.
4. Devs will realize that there is a lot of money to be made by making AI first local apps that use private local edge LMs (~0.5-4B models).

You can literally see indications of all 4 things above if you followed last 2 weeks of AI news. Mythos, Kimi K2.6, Qwen3.6-27B, DFlash, TurboQuant, Gemma-4... live examples of all the above at play. I feel this is the next phase of evolution for LLMs.

sama @sama

We tried a new thing with NVIDIA to roll out Codex across a whole company and it was awesome to see it work. Let us know if you'd like to do it at your company! https://t.co/Xjn6ShrRuq

Ben Davis @davis7 · Apr 23

Yes actually low. I am going to be insufferable about this for the next week or two: USE THE MODEL ON LOW REASONING I am serious. Not medium. LOW.

thameraltoimi @thameraltoimi

@davis7 low? not even medium?

Jeremy Morgan @JeremyCMorgan · Apr 23

Red Hat's "harness engineering" paper argues the environment an AI works in matters as much as the weights: integrating telemetry, repos, and testing constraints into a single deterministic orchestrator measurably moves code generation reliability by 5%+. You cannot prompt-engineer your way out of bad infrastructure. https://t.co/CckuhhG2Ha

dex @dexhorthy · Apr 23

drop everything and watch this immediately Captures so much between the lines stuff WRT crafting good software that I have struggled to articulate Incredible lessons from 40 years of software engineering, distilled and applied masterfully to the craft of building with AI

mattpocockuk @mattpocockuk

A talk I gave a few weeks ago. Software fundamentals matter more than ever. Here's why: https://t.co/KG7gdqFo2j

Zephyr @Zephyr_hg · Apr 23

Everyone fighting over the commoditized AI skills. Meanwhile the actual money is one layer deeper, in the skill nobody's filming courses about yet. https://t.co/04HcGNPk1T

Zephyr_hg @Zephyr_hg

The AI Skill Paying $400/Hour in Q2 2026 (That Almost Nobody Is Teaching)

Trevin Chow @trevin · Apr 24

Do this right now. Not later. Now. Install RTK: https://t.co/drk9OjxagP In 7d I've saved almost 400M tokens. If you have Compound Engineering v3, you can run `/ce-setup` and it'll flag this as a recommended install :) https://t.co/IP8mIkexuO

Ahmad @TheAhmadOsman · Apr 24

I highly recommend all newcomers to local AI to read my last 2 article before they jump into local AI and purchase any hardware If you're interested in running LLMs and/or diffusion models locally, these will save you a lot of time, pain, and money https://t.co/7mc4RxWYHz

NeilXbt @neil_xbt · Apr 24

Andrej Karpathy could have charged $10,000 for this course. He put it on YouTube. The man who built Tesla Autopilot from scratch. Co-founded OpenAI. Understands AI at a level most engineers at Google and Meta never reach. Sat down. Recorded 2 hours. No frameworks. No libraries. No shortcuts. Then dropped it for free. The gap between people who watch it this week and those who save it for later is not 2 hours. It is everything those 2 hours quietly unlock for the rest of your career.

Eric ⚡️ Building... @outsource_ · Apr 24

My 4090 went from 26 -> 154 tok/s Qwen 3.6 27B🤯 Same GPU. Same Q4_K_M . No FP8, no extra quant. The unlock: ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model. 85% acceptance rate. Full config + benchmarks 👇🏻 https://t.co/NUZec2Y7J0

Elon Musk @elonmusk · Apr 24

Cybercab has started production https://t.co/MAeswanf96

Ole Lehmann @itsolelehmann · Apr 24

RT @itsolelehmann: i built a free claude skill on the most critical content principle in 2026: The Minto Pyramid (from a 1970s McKinsey bo…

Steve Yegge @Steve_Yegge · Apr 24

I just published Welcome to Gas City https://t.co/zktstOmxgq In a nutshell, some actually good engineers came along and rewrote Gas Town into an enterprise grade SDK for building your own orchestrators. It uses the original Gas Town MEOW stack, based on Beads and Dolt. MIT-licensed. It has been out for a few weeks, just launched to v1.0.0, and is ready for use. Check out Discord at https://t.co/UuutSdBH6r for more info.

chiefofautism @chiefofautism · Apr 24

openai built a model that HIDES personal data in text so nothing leaks i flipped it INSIDE OUT same 1.5B weights, same label taxonomy, but instead of masks you get structured spans, name, email, phone, bank account, address, secrets, char offsets and all point it at logs, dumps, stolen inboxes and it just... returns every private thing in the pile

Jen Zhu @jenzhuscott · Apr 24

Just discovered the young genius Chief AI Scientist of Tencent Yao Shunyu’s X handle 🧠

ShunyuYao12 @ShunyuYao12

Our goal is to build practical models with comprehensive capabilities beyond open benchmarks. And the only way to do it to co-design with diverse products while scaling solidly. Tencent has the best product ecosystem and a solid, low-ego culture, and we are just getting started!