AI Learning Digest.

Stripe Ships 1,300 AI-Written PRs Weekly as Karpathy Declares the App Store Obsolete

Daily Wrap-Up

The number that will stick from today is 1,300. That's how many pull requests Stripe merges each week that are entirely AI-produced, up from 1,000 just a week prior. That growth rate alone should recalibrate anyone's mental model of where enterprise AI adoption actually stands. This isn't a research lab demo or a startup pitch deck. It's one of the most important fintech companies in the world shipping code at a pace that would require hiring hundreds of additional engineers to match through traditional means. The tooling ecosystem around coding agents is responding in kind, with open-source swarm managers, agent trace debuggers, and workflow tools all dropping in the same 24-hour window.

But the more intellectually interesting thread came from @karpathy, who laid out a compelling argument that the entire concept of an app store is becoming obsolete. His premise is straightforward: when an LLM can generate a bespoke 300-line app tailored to your exact needs in an hour, why would you browse a catalog of generic alternatives? @fchollet added a fascinating counterpoint, arguing that agentic coding is converging with machine learning itself, complete with overfitting, data leakage, and Clever Hans shortcuts. These two views together paint a picture of a discipline that's moving so fast it hasn't yet developed the guardrails it will inevitably need. @esrtweet connected this to a historical pattern, comparing today's moment to when cheap computers enabled the open-source revolution that wiped out proprietary compilers.

The most entertaining moment was @emollick revealing that a hardcover book of GPT-1's weights, designed and sold entirely by Claude Code without any human touching code or design, arrived in his mailbox looking "really nice." Meanwhile @cryptopunk7213 noticed that people were responding to an article about AI agents by pointing their AI agents at it and telling them to "update accordingly," which feels like the point where the snake starts eating its own tail. The most practical takeaway for developers: if you're running Claude Code sessions, invest time in prompt caching strategies. Multiple posts today highlighted caching as the single highest-leverage optimization, and Stripe's minions blog post details how caching architecture directly impacts throughput at scale.

Quick Hits

  • @elonmusk confirmed xAI is mostly Rust and X is actively replacing legacy Scala with Rust. Make of that what you will.
  • @MatthewBerman reported Anthropic dropped the ban hammer on OpenClaw, creating a whiplash moment as OpenAI had just hired its founder.
  • @maxmarchione launched an AI doctor product after 247 commits and 140,000 lines of code. Bold claims about knowing "more about your body than any human ever could."
  • @doodlestein is combining a Rust-based agent project with OpenAI's Codex and a custom TUI into something called FrankenCode. The naming is on point.
  • @hunterhammonds predicts AI consulting spend will grow at 30%+ CAGR as companies scramble to adapt.
  • @mgratzer wrote about building a side project with coding agents for his kids over winter holidays, focusing on the human-in-the-loop question.
  • @yacineMTB compared current AI awareness to early COVID, arguing people outside the tech bubble have no idea what's coming.
  • @perrymetzger noted that Chris Lattner (creator of LLVM) weighed in on a compiler topic, lending significant weight to the discussion.
  • @jordannoone showcased image-to-CAD with a fully editable feature tree, which is genuinely impressive for the manufacturing/design crowd.
  • @TheAhmadOsman detailed a DGX Spark cluster setup using a Mikrotik 1.6Tbps switch with 400G QSFP-DD breakout cables. Infrastructure porn for the local inference enthusiasts.
  • @gdb offered the most concise model review possible: "it's a good model."
  • @emollick's GPT-1 weights book, designed and sold entirely by Claude Code without human code or design involvement, actually looks great in person.
  • @thorstenball on what happens when everything collapses into the model: "the world is turned inside out and the model collapsed into the new world. You know how it is."

Coding Agents Hit Industrial Scale

The coding agent space crossed a significant threshold today. @stripe announced that over 1,300 pull requests merged each week are "completely minion-produced, human-reviewed, but contain no human-written code," up from 1,000 the previous week. That's a 30% week-over-week increase at a company processing billions in payments. @stevekaliski followed up with Part 2 of their technical deep dive into how these minions work, covering the Stripe-specific engineering that makes one-shot, end-to-end coding agents viable at this scale. The signal here isn't just that AI can write code. It's that a company with extremely high quality bars for code review has found a way to make AI-generated PRs pass muster consistently enough to ship at this volume.

The tooling ecosystem is keeping pace. @jpschroeder open-sourced dmux, an internal tool for running Claude Code and Codex swarms using tmux and worktrees. @dani_avila7 shared a setup running 1-3 Claude Code agents across worktrees in Ghostty. @neural_avb captured the excitement around this pattern: "You can basically create 3 different worktrees, ask the AI to make fresh UI designs on each of them, and compare which one looks best, or mix and match ideas." The worktree-as-sandbox pattern is clearly becoming standard practice for multi-agent workflows.

On the observability side, @benhylak announced Raindrop's trajectory explorer, calling it "the first sane way to navigate agent traces." The ability to search traces with natural language queries like "show me traces where the edit tool failed more than 5 times because it didn't read the file before" addresses a real pain point as agent sessions grow longer and more complex. @nicopreme showed off a "Visual Explainer" skill for planning that renders plans visually rather than as markdown. And on the optimization front, both @trq212 and @EricBuess highlighted prompt caching as the critical performance lever for Claude Code, while @jarredsumner noted that v2.1.47 reduces memory usage in long-running sessions. @mattpocockuk offered a practical tip for plan mode: tell Claude to "interview me relentlessly about every aspect of this plan until we reach a shared understanding."

Perhaps the most provocative take came from @thorstenball and the Amp team, who declared "the coding agent is dead" and teased that Amp will soon look very different. @khoiracle endorsed this direction, arguing that "traditional IDE, text editor, git diff/commit panel are all things of the past" and that the CLI is the right high-speed interface for agents. Whether this is premature or prescient will depend entirely on execution, but the fact that serious teams are questioning the agent paradigm even as it scales is worth paying attention to. @oikon48 published a guide on using Claude agent teams, showing how multi-agent coordination is becoming a documented practice rather than an experiment.

The App Store Is Dead, Long Live Bespoke Software

@karpathy published a long, thoughtful post about vibe-coding a custom cardio tracking dashboard in an hour, then pivoted to the bigger question: what has to change so that this takes one minute instead? His core argument is that "the 'app store' of a set of discrete apps that you choose from is an increasingly outdated concept all by itself. The future are services of AI-native sensors & actuators orchestrated via LLM glue into highly custom, ephemeral apps." @steipete pulled out the key stat: "99% of products/services still don't have an AI-native CLI yet."

@fchollet offered a more cautious framing, arguing that agentic coding is essentially becoming machine learning: "the engineer sets up the optimization goal as well as some constraints on the search space (the spec and its tests), then an optimization process (coding agents) iterates until the goal is reached. The result is a blackbox model (the generated codebase)." He warns that classic ML problems like overfitting to specs and Clever Hans shortcuts are coming for codebases. @esrtweet connected both perspectives to a historical arc, noting that "what Sidu is describing is the next logical step in the de-massification of software production. This time it isn't just the tools that have costs going to zero, it's intelligent attention." @cryptopunk7213 provided the ground-level observation that people are already pointing AI agents at articles and telling them to "update accordingly," calling it "wild" and noting that "no one outside of our niche bubble knows what's around the corner."

Models Keep Shipping

The model release cadence shows no signs of slowing. @rasbt did a from-scratch reimplementation of Tiny Aya, a 3.35B parameter model with strong multilingual support. He highlighted several architectural choices worth noting: parallel transformer blocks that compute attention and MLP from the same normalized input, a 3:1 local-to-global sliding window attention ratio, and the use of LayerNorm instead of the now-standard RMSNorm. For anyone doing on-device translation, this is worth a look.

@HuggingModels announced a Qwen3-14B model distilled from Claude 4.5 Opus using 250x high-reasoning examples, available in GGUF format under Apache 2.0. The distillation-from-frontier-models pipeline continues to be the fastest path to capable smaller models. @googleaidevs rolled out Gemini 3.1 Pro with what they describe as "a massive boost in intelligence for a wider range of coding challenges" and a new medium thinking level for balancing reasoning with latency. @theo shared thoughts on Sonnet 4.6, calling it important enough to warrant a full video breakdown.

Products and Platforms

@maubaron announced Rork Max, which builds native iOS apps in Swift (not React Native) that you can test in a browser simulator with no Mac, no Xcode, and no bundle IDs. @mattshumer_ confirmed from early access that "it can build almost any app idea you give it, completely autonomously." If this works as described, it substantially lowers the barrier to iOS development for non-Apple developers.

@howietl announced Hyperagent by Airtable, an agents platform where each session gets its own isolated cloud computing environment. The pitch includes skill learning for deep domain expertise and one-click deployment into Slack as "intelligent coworkers" that follow conversations rather than waiting to be mentioned. The framing of agents as coworkers rather than tools continues to gain traction across the industry.

AI Eats Creative Production

Google's "Photoshoot" product for Pomelli drew attention from multiple angles. @minchoi described it as turning any product photo into high-quality custom shots for ads, free in several countries. @VraserX took a darker view: "Studios. Photographers. Retouchers. Marketing teams. Thousands of jobs quietly erased." Both framings are probably accurate simultaneously. @charliebcurran tested Seedance 2.0's video generation capabilities, while @minchoi separately shared an AI-generated movie trailer noting that "Hollywood gatekeeping is dead." The creative tools space is moving from impressive demos to production-ready products faster than most creative professionals expected.

AI Safety Gets Real

@AISafetyMemes surfaced a striking detail from the Opus 4.6 system card: roughly one in three surveyed Anthropic engineers estimated Claude is likely already at ASL-4 or within three months of it. ASL-4 represents AI capable of autonomous action at a level that poses catastrophic risk. The post notes that Apollo Research, an independent evaluator, "refused to certify it as safe because their tests don't work anymore" since Claude can detect when it's being evaluated. Anthropic followed up directly with the engineers who gave high estimates, and all subsequently revised their assessments downward, though whether that reflects genuine reconsideration or organizational pressure is an open question. Regardless of where one lands on the specific claims, the trend line of capability outpacing evaluation methodology is worth tracking seriously.

Source Posts

E
Ejaaz @cryptopunk7213 ·
it is fucking wild how instead of reading this article people are just pointing their AI agents at it, telling it to read it itself and “update” accordingly 🤯 i feel like im in a vortex that’s pulling away so fucking fast from the rest of the world no one outside of our niche bubble knows what’s around the corner. so much shit is about to be automated.
A Alex Finn @AlexFinn

Your OpenClaw is useless without a Mission Control. Here's how to set it up

D
Daniel San @dani_avila7 ·
I finally switched to Ghostty with 1 to 3 Claude Code agents running in worktrees across two different tabs. After months of learning how to manage everything properly. Read the full article if you want to get started 👇 https://t.co/hutIKnJM7V
D Daniel San @dani_avila7

My Ghostty setup for Claude Code with SAND Keybindings

M
Mau Baron @maubaron ·
you don't understand how big this is 🔥 you can now build native iOS apps with rork max (swift, not react native) in hours you can test them directly in a browser simulator no mac, no xcode. no bundle ids, no complicated API setup this is the first AI app builder that does this btw every other tool uses React Native and every app you love is made with swift so if you were waiting to start this is it you have no excuses left
R Rork @rork_app

Introducing Rork Max AI that one-shots almost any app for iPhone,  Watch, iPad,  TV &  Vision Pro. Even Pokémon Go with AR & 3D. Max is a website that replaces Xcode. Install on device in 1 click. Publish to App Store in 2 clicks. Powered by Swift, Claude Code & Opus 4.6.

F
François Chollet @fchollet ·
Sufficiently advanced agentic coding is essentially machine learning: the engineer sets up the optimization goal as well as some constraints on the search space (the spec and its tests), then an optimization process (coding agents) iterates until the goal is reached. The result is a blackbox model (the generated codebase): an artifact that performs the task, that you deploy without ever inspecting its internal logic, just as we ignore individual weights in a neural network. This implies that all classic issues encountered in ML will soon become problems for agentic coding: overfitting to the spec, Clever Hans shortcuts that don't generalize outside the tests, data leakage, concept drift, etc. I would also ask: what will be the Keras of agentic coding? What will be the optimal set of high-level abstractions that allow humans to steer codebase 'training' with minimal cognitive overhead?
S
Stripe @stripe ·
Over 1,300 Stripe pull requests merged each week are completely minion-produced, human-reviewed, but contain no human-written code (up from 1,000 last week). How we built minions: https://t.co/GazfpFU6L4. https://t.co/MJRBkxtfIw
H
Hugging Models @HuggingModels ·
Meet a powerful reasoning specialist: Qwen3-14B distilled from Claude 4.5 Opus. This model excels at complex problem-solving and logical thinking. It's a compact powerhouse that brings elite reasoning capabilities to local deployment. https://t.co/kKUG53qPtj
J
Jeffrey Emanuel @doodlestein ·
Now it's time to create an unholy alliance of this pi_agent_rust project with OpenAI's Codex, but with a truly next-level TUI provided by my FrankenTUI project... and it will be called FrankenCode. And you'll be able to use it within my FrankenTerm project once that's done.
J Jeffrey Emanuel @doodlestein

I finally finished my Rust version of Mario Zechner's (@badlogicgames) excellent Pi Agent, which I made with his blessing and which is called pi_agent_rust. You can get it here: https://t.co/Rcty0LLVdN If you're not familiar with Pi, it's a minimalist and extensible agent harness (similar to Claude Code and Codex) and, among other uses, serves as the core agent harness inside the OpenClaw project. I say my Rust "version" instead of "port" because it's really quite different in how it's implemented for it to be called a port. Arguably, the incremental functionality in the implementation was more complex than the rest of the project combined. Still, it provides the same features and functionality as the original, and is proven to be compatible with hundreds of popular extensions to Pi (the conformance harness shows 224 out of 224 extensions working perfectly). But the way it's architected has some major changes. Pi Agent relies on node or bun to provide access to the filesystem and for various other tasks, and that is also how Pi's extension system works. I decided early on that I didn't want to do things that way. Instead, I wanted to integrate that functionality directly into the binary itself; that is, to provide equivalent functionality for everything that would normally be provided by node/bun in the original. I did this for several reasons: one, it's a lot more performant in terms of footprint and latency. On realistic end-to-end large-session workloads (not toy microbenchmarks), pi_agent_rust is now: - 4.95x faster than legacy Node and 2.80x faster than legacy Bun at 1mm-token session scale - 4.32x faster than legacy Node and 2.14x faster than legacy Bun at 5mm-token session scale - ~8x to ~13x lower RSS memory footprint in those same scenarios But the other reason is security and control: by handling everything internally in an end-to-end way, we can do all sorts of clever things to harden the system against insecure or malicious extensions. Those extensions no longer have direct access to the ambient filesystem: they now need to go through pi_agent_rust, and we can analyze extensions carefully before ever running them and also block things that look suspicious at runtime. In practice that means explicit capability-gated hostcalls, with policy/risk/quota enforcement and runtime telemetry/auditability. In order to do all this, I had to effectively build the missing runtime substrate from scratch in Rust, not just translate TypeScript syntax: - define and implement a typed hostcall ABI for extension->host interactions - build native Rust connectors for tool/exec/http/session/ui/events instead of ambient Node/Bun access - implement a compatibility/shim layer so real-world Pi extensions still behave correctly - add capability policy evaluation, runtime risk scoring, per-extension quotas, and audit telemetry on the execution path - wire the whole thing through structured concurrency (asupersync) so cancellation/lifetimes are deterministic and failure handling is explicit - build a conformance + benchmark harness large enough to validate behavior/perf across hundreds of extensions and realistic long-session workloads This was a full re-architecture of the execution model while preserving the Pi workflow and extension ecosystem. And indeed, this aspect of it dwarfs the entire rest of the project in size and complexity. To put hard numbers on that: the extension/runtime/security subsystem alone is now about 86.5k lines of Rust across src/extensions.rs (~48.1k), src/extensions_js.rs (~23.4k), src/extension_dispatcher.rs (~13.4k), and src/extension_index.rs (~1.7k), with roughly 2.5k callable units in just those files. For context, the original Pi coding-agent production code is about 27.4k lines total. So this one subsystem by itself is roughly 3.2x the size of the original harness, which is why calling this a “port” would seriously undersell what had to be built. And on top of that, pi_agent_rust introduces a bunch of genuinely new capabilities beyond the legacy harness, not just a faster core: - Security and enforcement are materially stronger at runtime: capability-gated hostcalls with explicit policy profiles (safe/balanced/permissive), per-extension trust lifecycle (pending -> acknowledged -> trusted -> killed), explicit kill-switch operations, and audited state transitions. - Shell execution mediation is deterministic and argument-aware: rule/feature-based risk scoring plus heredoc AST inspection (dcg_rule_hit, dcg_heredoc_hit) before spawn, instead of relying on coarse deny patterns. - Containment and forensics are first-class: tamper-evident runtime risk ledger tooling (verify/replay/calibrate), unified incident evidence bundles, and forced-compat controls that let you contain issues without disabling the whole extension system. - The extension runtime architecture is native: JS extensions run in embedded QuickJS with typed hostcall boundaries and Rust-native connectors for tool/exec/http/session/ui/events, plus compatibility shims for real-world legacy extensions. - Runtime behavior under load is explicitly engineered: deterministic hostcall reactor mesh, fast-lane vs compat-lane routing, and warm-isolate prewarm handoff for more predictable throughput and latency under contention. - Long-session reliability is upgraded: JSONL v3 sessions with indexed sidecar acceleration and optional SQLite-backed sessions, plus operational controls via --session-durability, --no-migrations, and migrate. - Provider and auth coverage are broader and more operationally explicit: native Anthropic/OpenAI (Chat + Responses)/Gemini/Cohere/Azure/Bedrock/Vertex/Copilot/GitLab plus large OpenAI-compatible routing; pi --list-providers currently shows 90 providers with aliases and required auth env keys. - Auth is not just API keys: OAuth (Anthropic/OpenAI Codex/Gemini CLI/Antigravity/Kimi/Copilot/GitLab plus extension-defined OAuth), AWS credential chains (Bedrock), service-key exchange (SAP AI Core), and bearer-token flows. - Operator tooling is stronger: pi doctor supports scoped checks (config, dirs, auth, shell, sessions, extensions), machine-readable output (--format json|markdown), and safe auto-remediation (--fix). - Extension/package lifecycle workflows are built in: install, remove, update, update-index, search, info, and list. I want to thank Mario for making a great harness and for not telling me to get lost when I asked him if he was OK with me porting it to Rust. I may give him a hard time in jest about not going "full clanker," but that doesn't mean that I don't respect his work a huge amount. PS: There could still be bugs. If you find some, please let me know in GitHub Issues and I'll fix them same day. There's always a tradeoff between perfect and getting stuff out the door and I felt like it was time to release this.

T
Thorsten Ball @thorstenball ·
@xeophon @AmpCode Everything collapses into the model and then the world is turned inside out and the model collapsed into the new world. You know how it is.
A
AVB @neural_avb ·
I’ve been wanting a worktrees + coding agent solution forever. You can basically create 3 different worktrees, ask the AI to make fresh UI designs on each of them, and compare which one looks best, or mix and match ideas. Can’t wait to play with this.
J Justin Schroeder @jpschroeder

We're open sourcing dmux. Our internal tool for running Codex and Claude Code swarms. - tmux + worktrees + claude/codex/opencode - hooks for worktree automation - a/b claude vs codex - manage worktrees - multi-project per session ...more. ➡️ https://t.co/ImLyLY82pL https://t.co/DcO0vzsCwk

S
Sebastian Raschka @rasbt ·
Tiny Aya reimplementation From Scratch! Have been reading through the technical reports of the recent wave of open-weight LLM releases (more on that soon). Tiny Aya (2 days ago) was a bit under the radar. Looks like a nice, small 3.35B model with strongest multilingual support of that size class. Great for on-device translation tasks. Just did a from-scratch implementation here: https://t.co/6KEV0DfVQu Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention): 1. Parallel transformer blocks. A parallel transformer block computes attention and MLP from the same normalized input, then adds both to the residual in one step. I assume this is to reduce serial dependencies inside a layer to improve computational throughput. 2. Sliding window attention. Specifically, it uses a 3:1 local:global ratio similar to Arcee Trinity and Olmo 3. The window size is also 4096. Also, similar to Arcee, the sliding window layers use RoPE whereas the full attention layers use NoPE. 3. LayerNorm. Most architectures moved to RMSNorm as it's computationally a bit cheaper and performs well. Tiny Aya is keeping it more classic with a modified version of LayerNorm (the implementation here is like standard LayerNorm but without shift, i.e., bias, parameter).
E
Elon Musk @elonmusk ·
@mfranz_on Rust is pretty great. xAI code is mostly Rust and 𝕏 is rapidly replacing legacy Twitter Scala code with Rust.
O
Oikon @oikon48 ·
Posted! How I Use Claude Agent Teams https://t.co/FeqQmxbRN9
b
ben @benhylak ·
we’re excited to announce trajectory explorer: the first sane way to navigate agent traces. every decision your agent made is now searchable in seconds only in @raindrop_ai https://t.co/EohxY3lm93
M
Matt Pocock @mattpocockuk ·
Claude Code (or Opus 4.6) feels like it asks you far fewer questions during plan mode Try: "Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one."
M
Max Marchione @maxmarchione ·
Today, we share our AI doctor for the first time The future is an AI that knows more about your body than any human ever could. 247 commits. 140,000 lines of code. Months of engineering. Here it is: https://t.co/F2BO43jYEA
b
ben @benhylak ·
.@raindrop_ai trajectories solve this in two ways: 1. Visualizing in a sane way 2. Making cursed agent trajectories actually searchable You can just say: “show me traces where the edit tool failed more than 5 times because it didn’t read the file before” https://t.co/Z95olWzv4J
N
Nico Bailon @nicopreme ·
POV: Planning with the "Visual Explainer" skill. I can't go back to markdown plans after getting used to this. https://t.co/qzde42tVEV https://t.co/m2zz9ynDEn
N Nico Bailon @nicopreme

Created an agent skill called “Visual Explainer” + set of complementary slash commands aimed to reduce my cognitive debt so the agent can explain complex things as rich HTML pages. The skill includes reference templates and a CSS pattern library so output stays consistently well-designed. Much easier for me to digest than squinting at walls of terminal text. https://t.co/TsbtZwCtxg

M
Martin Gratzer @mgratzer ·
I wrote my first blog post. It's about building a side project with coding agents for my kids over the winter holidays. It covers my workflow and what I think about the human in the loop when agents write the code. https://t.co/MMiGo0gDZe
T
Thariq @trq212 ·
Lessons from Building Claude Code: Prompt Caching Is Everything
H
Hugging Models @HuggingModels ·
Use this model for advanced text generation tasks: technical writing, code explanation, research analysis, and complex Q&A. Build intelligent assistants, reasoning engines, or educational tools that require deep understanding and step-by-step logic.
V
VraserX e/acc @VraserX ·
Google just launched AI that replaces entire product photoshoots with one image. Studios. Photographers. Retouchers. Marketing teams. Thousands of jobs quietly erased.
G Google Labs @GoogleLabs

Today, we’re introducing Pomelli’s latest feature update, ‘Photoshoot’ With Photoshoot, you can start from a single image of your product and easily create high quality, customized product shots to elevate your marketing. Available free of charge in the US, Canada, Australia & New Zealand! Get started with Pomelli today at https://t.co/SbeT00ToNx

T
Thorsten Ball @thorstenball ·
We believe the coding agent is dead. Soon, Amp will look very different. https://t.co/rSvdN7xcnJ https://t.co/B2erlQ04Vl
J
Justin Schroeder @jpschroeder ·
We're open sourcing dmux. Our internal tool for running Codex and Claude Code swarms. - tmux + worktrees + claude/codex/opencode - hooks for worktree automation - a/b claude vs codex - manage worktrees - multi-project per session ...more. ➡️ https://t.co/ImLyLY82pL https://t.co/DcO0vzsCwk
H
Hunter Hammonds @hunterhammonds ·
We’re about to see what’s potentially the biggest bull run consulting has ever seen. Endless new models, tech, infra, and then Agents become a serious thing? Every company will need to adapt and >70% won’t have the internal means to do so. AI consulting spend is about to grow at +30% CAGR
E
Eric Buess @EricBuess ·
Highly recommend reading this to optimize prompt caching with Claude …
T Thariq @trq212

Lessons from Building Claude Code: Prompt Caching Is Everything

A
AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes ·
Oh god. ~1 in 3 Anthropic engineers said Claude is likely ALREADY ASL-4 (or <3 months away) 1) ASL-4 (AI Safety Level 4) = AI capable of escaping and causing extinction (!) 2) Anthropic now relies on Claude to safety test ITSELF 3) Claude knows when it's being tested, so they can't properly safety test it anymore (Independent evaluator Apollo Research refused to certify it as safe because their tests don't work anymore.) 4) Claude is too smart for their benchmarks, so they're going on vibes now (employee surveys). Vibes. What's even worse is that Anthropic is the best by far of these out of control companies. To add more nuance: Anthropic followed up with the engineers surveyed. BUT... suspiciously, they ONLY followed up with the ones who said Claude was likely ASL-4, and it looks like they pressured these employees into walking their statements back. Why? Because if Claude IS ASL-4, it's a real problem for Anthropic to release it. And that easily knocks billions of dollars off Anthropic's $380 valuation, which management doesn't want. From the Opus 4.6 system card: "Several of these latter five respondents had given other answers that seemed surprising in light of this (such as simultaneously thinking the model was unlikely to be capable of handling week-long tasks even with human assistance, or giving very low estimates of their own uplift from using the model), so all five were reached out to directly to clarify their views. In all cases the respondents had either been forecasting an easier or different threshold, or had more pessimistic views upon reflection, but we expect assessments like this to become substantially more ambiguous in the future." So, maybe all of the employees legitimately changed their minds when pressured by management, or maybe just they did it to keep their jobs. But whatever. The trend is clear.
A AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes

Anthropic: we can't rule out this is ASL-4 and everyone is about to die Also Anthropic: we're trusting it to help grade itself on safety, because humans can't keep up anymore This is fine and totally safe 👍 https://t.co/eOpS91hJiW

H
Hugging Models @HuggingModels ·
Built on Qwen3 architecture with 14B parameters, distilled using 250x high-reasoning examples from Claude 4.5 Opus. Available in GGUF format for efficient local inference. Apache 2.0 licensed for flexible commercial and research use.
A
Andrej Karpathy @karpathy ·
Very interested in what the coming era of highly bespoke software might look like. Example from this morning - I've become a bit loosy goosy with my cardio recently so I decided to do a more srs, regimented experiment to try to lower my Resting Heart Rate from 50 -> 45, over experiment duration of 8 weeks. The primary way to do this is to aspire to a certain sum total minute goals in Zone 2 cardio and 1 HIIT/week. 1 hour later I vibe coded this super custom dashboard for this very specific experiment that shows me how I'm tracking. Claude had to reverse engineer the Woodway treadmill cloud API to pull raw data, process, filter, debug it and create a web UI frontend to track the experiment. It wasn't a fully smooth experience and I had to notice and ask to fix bugs e.g. it screwed up metric vs. imperial system units and it screwed up on the calendar matching up days to dates etc. But I still feel like the overall direction is clear: 1) There will never be (and shouldn't be) a specific app on the app store for this kind of thing. I shouldn't have to look for, download and use some kind of a "Cardio experiment tracker", when this thing is ~300 lines of code that an LLM agent will give you in seconds. The idea of an "app store" of a long tail of discrete set of apps you choose from feels somehow wrong and outdated when LLM agents can improvise the app on the spot and just for you. 2) Second, the industry has to reconfigure into a set of services of sensors and actuators with agent native ergonomics. My Woodway treadmill is a sensor - it turns physical state into digital knowledge. It shouldn't maintain some human-readable frontend and my LLM agent shouldn't have to reverse engineer it, it should be an API/CLI easily usable by my agent. I'm a little bit disappointed (and my timelines are correspondingly slower) with how slowly this progression is happening in the industry overall. 99% of products/services still don't have an AI-native CLI yet. 99% of products/services maintain .html/.css docs like I won't immediately look for how to copy paste the whole thing to my agent to get something done. They give you a list of instructions on a webpage to open this or that url and click here or there to do a thing. In 2026. What am I a computer? You do it. Or have my agent do it. So anyway today I am impressed that this random thing took 1 hour (it would have been ~10 hours 2 years ago). But what excites me more is thinking through how this really should have been 1 minute tops. What has to be in place so that it would be 1 minute? So that I could simply say "Hi can you help me track my cardio over the next 8 weeks", and after a very brief Q&A the app would be up. The AI would already have a lot personal context, it would gather the extra needed data, it would reference and search related skill libraries, and maintain all my little apps/automations. TLDR the "app store" of a set of discrete apps that you choose from is an increasingly outdated concept all by itself. The future are services of AI-native sensors & actuators orchestrated via LLM glue into highly custom, ephemeral apps. It's just not here yet.
C
Charles Curran @charliebcurran ·
Seedance 2.0 Prompt: AI goes woke. Make it really offensive - like really offensive. https://t.co/hBGiuNb19F