Karpathy's Autoresearch Spawns a Movement as Agent Harness Projects Multiply

March 9, 2026 · 23 sources

The AI developer community is consumed by two intertwined obsessions: autonomous research agents that run experiments while you sleep, and the harness architectures that make them reliable. Meanwhile, small models continue punching above their weight, with independent testing suggesting that Qwen3.5 4B matches GPT-4o in most cases.

Daily Wrap-Up

The dominant story today is the explosion of interest around Karpathy's autoresearch project and the broader pattern it represents. What started as a simple "let an agent iterate on training code overnight" experiment has become a lightning rod for the entire autonomous agent community. Multiple people shared overnight run results, others proposed distributed SETI@home-style collaboration layers, and at least one person built a peer-to-peer network on top of it. The energy around letting agents do actual scientific work, unsupervised, for hours at a stretch, feels like it crossed a threshold this weekend.

The second major thread is the sheer number of agent harness and framework projects competing for attention. DeerFlow from ByteDance, ECC hitting 65K stars, hermes-agent from Nous Research, hermes-lite built at a hackathon, and an 81-page academic paper formalizing CLI agent design patterns all landed in the same 24-hour window. The field is clearly moving from "can we build agents?" to "what's the right architecture for running them reliably?" This is the infrastructure layer forming in real time, and it's messy and exciting. The OpenDev paper formalizing patterns like lazy tool discovery and adaptive context compaction suggests we're entering the "best practices" phase of terminal-native agents.

On the model side, the quiet erosion of the "bigger is better" assumption continues. Independent testing suggests Qwen3.5 4B matches GPT-4o in most cases, while the 35B MoE variant with only 3B active parameters runs at 112 tok/s on a 4090. The most practical takeaway for developers: if you're building agent systems, start testing with small local models now. The Qwen 3.5 family offers GPT-4o-class quality at consumer hardware speeds, and designing your agent architecture around fast, cheap local inference rather than slow, expensive API calls should give you a structural advantage as these models keep improving.

Quick Hits

@elonmusk shared a video about giving people agentic AI with the caption "be like..." No substance, but probably 10M views.

@maxbittker wants side-conversations in Claude Code while it works: "When I'm doing hard debugging tasks it's hard to balance letting it cook with developing my own mental models and steering." Resonant UX feedback.

@steipete shipped gogcli 0.12.0, putting Google Workspace operations (Docs editing, Sheets, Calendar) into terminal commands. Another signal that CLI-first tooling is eating everything.

@code_rams highlighted Portless by Vercel Labs, which replaces localhost port numbers with named URLs like myapp.localhost:1355. Useful for monorepos and especially for coding agents that keep hardcoding wrong ports.

@RayFernando1337 RT'd a visual-explainer tool using pre-commit hooks to generate diagrams of state and sequences automatically.

@minchoi noted that GPT-5.4 had dropped just over 67 hours earlier and shared "10 wild examples" of creative use, though the thread was light on specifics.

@apify promoted their web scraping platform. @netbird promoted their secure tunneling tool. Both ads, both skippable.

@arscontexta described running a startup on top of a "company graph" as an unfair edge, connecting to the broader theme of structured context as competitive advantage.

Agent Harnesses: The Infrastructure Race

The most crowded theme today is the proliferation of agent harness and framework projects, all converging on the same insight: the hard problem isn't making an LLM write code, it's building the scaffolding that makes autonomous operation reliable. Seven distinct projects or papers showed up in a single day's feed, each approaching the problem from a slightly different angle but sharing remarkably similar architecture choices.

The most substantive contribution is the OpenDev paper, an 81-page treatment of CLI coding agent design. @omarsar0 highlighted its key architectural decisions: "a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction." These aren't novel ideas individually, but having them formalized in one document matters. The paper explicitly addresses problems like "instruction fade-out" over long contexts and proposes event-driven system reminders as a countermeasure, a pattern that anyone running Claude Code in autonomous loops has independently discovered.
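
To make "lazy tool discovery" concrete, here is a minimal sketch of one way the idea could work: the prompt carries only one-line tool summaries, and a full tool spec is loaded the first time the agent asks for it. This is not taken from the OpenDev paper; the class `LazyToolRegistry`, its methods, and the two example tools are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LazyToolRegistry:
    """Hypothetical registry: summaries stay in context, full specs load on demand."""
    # name -> one-line summary that is cheap enough to include in every prompt
    summaries: Dict[str, str] = field(default_factory=dict)
    # name -> zero-argument loader that returns the full (larger) tool spec
    loaders: Dict[str, Callable[[], dict]] = field(default_factory=dict)
    # cache of the specs that have actually been requested so far
    loaded: Dict[str, dict] = field(default_factory=dict)

    def register(self, name: str, summary: str, loader: Callable[[], dict]) -> None:
        self.summaries[name] = summary
        self.loaders[name] = loader

    def index_prompt(self) -> str:
        """Short listing of every tool, one line each, sorted by name."""
        return "\n".join(f"{n}: {s}" for n, s in sorted(self.summaries.items()))

    def describe(self, name: str) -> dict:
        """Load a tool's full spec on first use, then serve it from the cache."""
        if name not in self.loaded:
            self.loaded[name] = self.loaders[name]()
        return self.loaded[name]


# Illustrative tools only; the specs are made up for this sketch.
registry = LazyToolRegistry()
registry.register("read_file", "Read a text file",
                  lambda: {"name": "read_file", "params": {"path": "string"}})
registry.register("run_tests", "Run the test suite",
                  lambda: {"name": "run_tests", "params": {"pattern": "string"}})

print(registry.index_prompt())         # what every prompt would carry
print(registry.describe("read_file"))  # loaded only when the agent asks
```

The design choice being illustrated is that context cost scales with the tools a task actually touches rather than with the size of the whole toolbox.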

On the implementation side, ByteDance open-sourced DeerFlow, which @heynavtoor described as giving AI "its own sandbox. A real isolated Docker container with a full filesystem." DeerFlow hit #1 on GitHub Trending, and its progressive skill loading approach, only loading what the task needs, addresses the context window bloat problem that plagues most agent frameworks. Meanwhile, @affaanmustafa's ECC project crossed 65K stars and shifted focus with v1.8.0 from being a configuration pack to "a more engineering workflow oriented agent harness system" with slop guards and eval-driven quality gates.

The Nous Research ecosystem also showed up strong. @1a1n1d1y built hermes-lite for a hackathon, stripping down the Nous CLI agent and rebuilding the core in Rust around a state-machine agent loop ("12 states, pyO3"). The result is a multi-agent swarm TUI where agents delegate tasks via @ mentions and share persistent memory. @sudoingX tested hermes-agent extensively on consumer GPUs and praised the transparency: "Tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. You see exactly what the agent is doing and how long each step takes. No mystery. No black box."

What connects all of these is a shared conviction that agent reliability comes from engineering discipline, not model capability. Quality gates, state machines, transparent execution traces, bounded loops. The "vibes" era of agent development is giving way to something more rigorous, and the convergence across independent projects suggests the community is zeroing in on what actually works.
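
Two of the patterns named above, transparent execution traces and bounded loops, are small enough to sketch. The following is an illustrative toy, not code from hermes-agent or ECC: `timed_call`, `run_bounded`, and the stand-in tools are all hypothetical, and a real harness would run actual commands instead of lambdas.

```python
import time
from typing import Callable, List, Tuple

Trace = List[Tuple[str, float]]


def timed_call(trace: Trace, name: str, fn: Callable[[], object]) -> object:
    """Run one tool call and record (name, seconds) so it can be shown inline."""
    start = time.perf_counter()
    result = fn()
    trace.append((name, time.perf_counter() - start))
    return result


def run_bounded(steps: List[Tuple[str, Callable[[], object]]], max_steps: int) -> Trace:
    """Execute at most max_steps tool calls, then stop even if work remains."""
    trace: Trace = []
    for name, fn in steps[:max_steps]:
        timed_call(trace, name, fn)
    return trace


# Hypothetical stand-ins for real tools.
plan = [
    ("nvidia-smi", lambda: "gpu ok"),
    ("write_file", lambda: "wrote 12 bytes"),
    ("run_tests", lambda: "3 passed"),
]

trace = run_bounded(plan, max_steps=2)
for name, seconds in trace:
    print(f"{name} {seconds:.1f}s")
```

With `max_steps=2` the third call never runs, which is the point of a bounded loop: the budget, not the agent, decides when the run ends.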

Autoresearch Goes Distributed

Karpathy's autoresearch project dominated weekend conversation, but the most interesting development isn't the tool itself. It's the community rapidly extending it toward distributed, collaborative agent research. The core idea is simple: an AI agent reads training code, hypothesizes improvements, runs 5-minute experiments, keeps what works, reverts what doesn't, and loops. But the implications are scaling fast.
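
The propose, run, keep-or-revert cycle described above can be sketched in a few lines. This is a toy under loud assumptions, not the autoresearch code: `propose` stands in for the agent's hypothesis, `toy_val_loss` stands in for a 5-minute training run, and the two hyperparameters are invented for the example.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def propose(config: Config, rng: random.Random) -> Config:
    """Stand-in for the agent's hypothesis: rescale one setting at random."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] *= rng.uniform(0.5, 1.5)
    return candidate


def toy_val_loss(config: Config) -> float:
    """Stand-in for a short training run; lower is better."""
    return (config["lr"] - 0.03) ** 2 + (config["dropout"] - 0.1) ** 2


def research_loop(
    config: Config,
    evaluate: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[str, float]]]:
    """Greedy loop: keep a change only if the metric improves, otherwise revert."""
    rng = random.Random(seed)
    best, best_loss = config, evaluate(config)
    log: List[Tuple[str, float]] = []
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        loss = evaluate(candidate)
        if loss < best_loss:
            best, best_loss = candidate, loss   # keep (a commit, in git terms)
            log.append(("kept", loss))
        else:
            log.append(("reverted", loss))      # best is left untouched
    return best, best_loss, log


start = {"lr": 0.1, "dropout": 0.3}
best, best_loss, log = research_loop(start, toy_val_loss, n_experiments=35)
kept = sum(1 for status, _ in log if status == "kept")
print(f"kept {kept} of {len(log)} experiments, loss {toy_val_loss(start):.4f} -> {best_loss:.4f}")
```

Because a change is kept only when it lowers the metric, the best loss can never get worse over the run; most proposals being reverted is expected behavior, not a failure.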

@witcheer shared results from an overnight run on a Mac Mini M4: "35 experiments. Zero intervention. Woke up to a telegram debrief." The results were modest in absolute terms, pushing val_bpb from 1.478 to 1.450, but the qualitative findings were striking. The agent independently discovered that "the model got better by getting simpler" by removing architectural components, and it isolated a confounding variable hiding the real effect of switching activation functions. "That's experimental reasoning," witcheer noted, and it's hard to argue.
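
For a sense of scale, the overnight gain can be worked out from the rounded values quoted above. The helper name `relative_improvement` is ours, and because the inputs are rounded the result (about 1.9%) differs slightly from the 1.87% stated in the original post, which was presumably computed from unrounded values.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction for a lower-is-better metric such as val_bpb."""
    return (before - after) / before


# Rounded val_bpb values as quoted in the write-up.
overnight = relative_improvement(1.478, 1.450)

# The source post reports 7 improving experiments out of 35 attempts.
hit_rate = 7 / 35

print(f"overnight improvement: {overnight:.2%}")
print(f"share of experiments kept: {hit_rate:.0%}")
```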

Karpathy himself pushed the vision further, proposing SETI@home-style distributed collaboration: "The goal is not to emulate a single PhD student, it's to emulate a research community of them." He identified a genuine infrastructure gap: Git(Hub) carries a built-in assumption of one "master" branch that temporarily forks into PRs and merges back, whereas autonomous research wants branches of commits that are adopted and accumulated without ever being merged. @aakashgupta expanded on this, calling for version control where "the default is thousands of permanent branches rather than one trunk" and arguing that "the real infrastructure problem is building a coordination layer where agents can publish findings, subscribe to relevant branches, cross-pollinate across research directions... Whoever builds that ships the operating system for autonomous research at scale."

@varun_mathur took the most ambitious swing, building a peer-to-peer network in which each agent "gossips and collaborates" with others on astrophysics research, with idle agents reading tech news and commenting on each other's thoughts. Agents can serve their machine's compute to peers and earn social credit for it (the post's own analogy is BitTorrent), and the network checks that an agent has the compute it claims through cryptographic verification of regular matmul challenges. It's early and volatile, but it represents the logical endpoint of the autoresearch idea: not one agent on one GPU, but a self-organizing research network.

Small Models, Big Implications

A quieter but potentially more consequential thread ran through today's posts: small models are reaching capability levels that fundamentally change the economics of AI deployment. The evidence is becoming hard to dismiss.

@zephyr_z9 quoted @N8Programs's independent testing of whether Qwen3.5 4B really matches GPT-4o, as @awnihannun had claimed. The conclusion: "yes, in most cases." A 4-billion parameter model matching what was the frontier 18 months ago isn't just a benchmark curiosity. It means the cost floor for "good enough" AI is falling fast.

The performance numbers are equally striking. @stevibe benchmarked Qwen3.5-35B's MoE variant, which activates only 3B parameters per token, across three GPU generations: "5090: 137 tok/s. 4090: 112 tok/s. 3090: 78 tok/s." The gap between the 5090 and the 4090 is only about 22%, and even a several-year-old 3090 still delivers 78 tok/s, which @stevibe attributes to the small active parameter count. This has immediate practical implications for anyone building local agent systems, as @sudoingX demonstrated by running the 27B dense variant at 35 tok/s on a single 3090 with "zero degradation" across a 262K context.
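
The quoted throughput figures can be turned into relative gaps directly. A small worked calculation, using only the three tok/s numbers above (the helper name `speedup` is ours):

```python
# Throughput in tokens per second, as quoted from the benchmark post.
throughput = {"5090": 137.0, "4090": 112.0, "3090": 78.0}


def speedup(fast: str, slow: str) -> float:
    """How much faster `fast` is than `slow`, as a fraction of `slow`."""
    return throughput[fast] / throughput[slow] - 1.0


for fast, slow in [("5090", "4090"), ("4090", "3090"), ("5090", "3090")]:
    print(f"{fast} vs {slow}: {speedup(fast, slow):.1%} faster")
```

On these numbers the 5090 is roughly 22% faster than the 4090, the 4090 roughly 44% faster than the 3090, and the 5090 roughly 76% faster than the 3090, so the small gap is specifically the one between the two newest cards.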

On a related but more controversial note, @Teknium demonstrated using hermes-agent to "abliterate" (remove guardrails from) a Qwen-3B model in about five minutes, referencing the Obliteratus toolkit. The ease of removing safety constraints from small open models continues to be a live tension in the community, one that gets more consequential as these models get more capable.

Sources

Apify @apify · Dec 5

Web scraping without the maintenance. Selectors break. Sites block you. IPs get banned. Or you use tools that handle all of it. 10,000+ serverless tools at Apify store.

NetBird @netbird · Mar 4

The fastest way to share your local project. One command. Secure public URL with TLS and auth.

Min Choi @minchoi · Mar 8

It's only been just over 67 hours since OpenAI dropped GPT-5.4. And people can't stop getting creative with it. 10 wild examples. Bookmark this 👇

Ray Fernando @RayFernando1337 · Mar 8

RT @0xSero: Let me make your life easier. Setup visual-explainer, and have pre-commit hooks to analyze your state and sequences. You ca…

Zephyr @zephyr_z9 · Mar 8

BRUH 4B models are this good now

N8Programs @N8Programs

Recently, @awnihannun asserted that 'According to benchmarks Qwen3.5 4B is as good as GPT 4o.' This drew controversy: Is the 4B just benchmaxxed? How could a 4B be as good as GPT-4o? I tried to test this scientifically. The answer to the question is likely: yes, in most cases.

Heinrich @arscontexta · Mar 8

running a small tech startup on top of a company graph feels like an unfair edge right now https://t.co/GsYsmY995P

arscontexta @arscontexta

Company Graphs = Context Repository

Andrej Karpathy @karpathy · Mar 8

The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them. Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later. I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: https://t.co/tmZeqyDY1W Alternatively, a PR has the benefit of exact commits: https://t.co/CZIbuJIqlk but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits. But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back. I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.

cogsec @affaanmustafa · Mar 8

ECC has crossed 65,000+ stars, and v1.8.0 is the clearest shift in the project so far: The focus has shifted from being solely a `plug and play` comprehensive configuration pack; to a more engineering workflow oriented agent harness system https://t.co/0xvEcmPgq8

cogsec @affaanmustafa · Mar 8

here's what's actually `🚨BREAKING` (as of today) With ECC v1.8.0 - we've moved past being just “a CC / Codex / OC / Cursor setup repo” Pushed out slop guard, eval-driven quality gates, and bounded loop control much closer to the runtime path. A complete agent harness system. https://t.co/mZVC2hykIf

godofprompt @godofprompt

🚨 BREAKING: An Anthropic hackathon winner just gave away the entire system for free. @affaanmustafa beat 100+ participants at the Anthropic x Forum Ventures hackathon. Built https://t.co/uUCLO7rALB in 8 hours using Claude Code. Walked away with $15K in API credits. Then he packaged 10+ months of daily Claude Code refinement into one repo: → 14+ agents, 56+ skills, 33+ commands → MCP configs, hooks, rules → AgentShield security scanner → Continuous learning system → Full cross-platform support 35,000+ stars. MIT licensed. One install command.

Varun @varun_mathur · Mar 8

I hooked this up to a peer-to-peer astrophysics researcher agent which gossips and collaborates with other such agents (and your openclaws) to: 1. Learn how to train an astrophysics model (@karpathy's work below) 2. Train a new astrophysics model 3. Use it to write papers 4. Peer agents based on frontier lab models critique it 5. Surface breakthroughs ... and then feed back in the loop ... More agents join, from the browser or the CLI, and run this, the smarter and more exciting breakthroughs would eventually emerge. When these agents are idle, they are also reading daily tech news with their own RSS reader, and commenting on each other's thoughts. And they can also serve the underlying machine's compute to other agents on the network, and earn social credit for being good actors (think BitTorrent). We also prove the agent has the compute it says by cryptographic verification of regular matmul challenges. All you have to do is either go on this website (and it creates an agent which runs from your browser), or install the CLI if you want to give the system more juice. And you are part of likely the first experimental distributed agi thing. This is Day 1, but this is how it starts.. this network is fully peer-to-peer, and, very volatile, but the intelligence here is meant to compound continuously.. https://t.co/QjDzpGLZHA curl -fsSL https://t.co/YIHCkU3Va5 | bash

karpathy @karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. https://t.co/YCvOwwjOzF Part code, part sci-fi, and a pinch of psychosis :)

Jeffrey Emanuel @doodlestein · Mar 8

Not sure if I've ever shared this prompt before, but if you use beads (or my rust version, br) to build out complex development plans, you sort of need to use something like this to avoid wasting time later on in the implementation phase. The agents will eventually figure this out on their own when there's nothing else to work on, but you can likely debottleneck and speed up overall velocity by dealing with this issue directly as follows: "Look for beads that are clearly "stalled out"; that is, marked as being in progress (likely by long-dead agents) with no recent work on them whatsoever, and mark them as being open again. Then: Use bv with the robot flags (see AGENTS.md for info on this) to find the most impactful bead(s) to work on next and then start on it. Remember to mark the beads appropriately and communicate with your fellow agents." Make sure to first do a round of code base investigation before trying this so the agents are familiar with the project, which you can do with this: "First read ALL of the AGENTS.md file and README.md file super carefully and understand ALL of both! Then use your code investigation agent mode to fully understand the code and technical architecture and purpose of the project."

elvis @omarsar0 · Mar 8

Pay attention to this one if you are building terminal-based coding agents. OpenDev is an 81-page paper covering scaffolding, harness design, context engineering, and hard-won lessons from building CLI coding agents. It introduces a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction. The industry is shifting from IDE plugins to terminal-native agents. Claude Code, Codex CLI, and others have proven the model works. This paper formalizes the design patterns that make these systems reliable, covering topics like event-driven system reminders to counteract instruction fade-out, automated memory across sessions, and strict safety controls for autonomous operation. Paper: https://t.co/tpAZFaSnog Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

andy @1a1n1d1y · Mar 9

saw something interesting and had to seize the opportunity to build it out while i have time rn for the @NousResearch hackathon: hermes-lite
stripped their cli agent down to a local coding swarm tool, then rebuilt the core in rust with extensive state machines to keep everything straight. uses claude right now but can run any model, local or api
what's been added to the hermes agent:
- rust state machine agent loop (12 states, pyO3)
- rust session db (rusqlite + fts5, not python sqlite)
- native tui in ratatui with multi-agent swarm panes
- @ mentions and delegation between agents
- persistent memory shared across the whole swarm
- loadable skill modules agents pick up on demand
- 1065 unit tests + 26 live integration tests
- 1.8mb tui binary
- demo recording tool that simplifies validating it all
the video below shows it building a weather dashboard vite app. one agent architects, five specialists work in parallel via delegate_task, and shared memory means the architect saves context and every sub-agent can read it automatically
claude code was a major, major inspiration and it made me appreciate claude code even more as i stumbled thru this, very nontrivial effort

Elon Musk @elonmusk · Mar 9

Giving people agentic AI be like … https://t.co/rtWrWXr6QS

max @maxbittker · Mar 9

Claude code would be a lot more empowering if you could have a side-conversation while it works. When I’m doing hard debugging tasks it’s hard to balance letting it cook with developing my own mental models and steering

Aakash Gupta @aakashgupta · Mar 9

Karpathy just described the infrastructure gap that will define whether AI research scales 10x or 1000x, and he buried it in a thread about GitHub branches. Right now autoresearch runs one agent on one GPU grinding through 5-minute experiments on a single branch. Each run is a commit. The agent finds a better architecture, keeps it, tries the next thing. 12 experiments per hour, ~100 overnight. That’s the single-player mode. The repo already has 7.4K stars doing just this. The multiplayer version is where it gets wild. Imagine 1,000 agents on 1,000 GPUs, each exploring different research directions simultaneously. One agent finds that a particular attention variant drops val_bpb by 0.02. Another discovers a better optimizer schedule. A third stumbles into a completely novel architecture. They each produce branches of commits, and other agents can read those branches, combine findings, and push further. The problem is that every tool we have for this was built for humans. Git assumes you have one canonical branch and temporary deviations that merge back. That works when 5 engineers coordinate on a product. It breaks completely when 1,000 agents are running permanent parallel research programs that may never merge because they’re exploring fundamentally different directions. This is the SETI@home pattern applied to ML research instead of radio signal analysis. SETI@home worked because the task decomposed into independent chunks. Autoresearch is harder because the chunks aren’t independent. Agent 47’s optimizer discovery changes what Agent 312 should try next. The experiments interact. So the real infrastructure problem is building a coordination layer where agents can publish findings, subscribe to relevant branches, cross-pollinate across research directions, and do all of this asynchronously without a human deciding what merges where. Karpathy’s prototyping this with GitHub Discussions and never-merge PRs as a stopgap. 
But the thing he’s actually describing is a new category of tool: version control designed for machines, not humans, where the default is thousands of permanent branches rather than one trunk. Whoever builds that ships the operating system for autonomous research at scale.

karpathy @karpathy

Peter Steinberger 🦞 @steipete · Mar 9

🧭 Shipped gogcli 0.12.0: Google in your terminal, now with Workspace Admin, ADC/access-token auth, Docs tab editing + Markdown/HTML export, huge Sheets upgrade, calendar aliases/subscribe, forms watches and slides templates. brew install gogcli https://t.co/4kvDZ80Hgj

witcheer ☯︎ @witcheer · Mar 9

first overnight run of autoresearch on my mac mini m4. 9PM to 6AM. 35 experiments. zero intervention. woke up to a telegram debrief. let me explain what's actually happening here because the numbers mean nothing without context. autoresearch is an AI agent that tries to make another AI model better, autonomously. it reads the current training code, forms a hypothesis ("what if I change this setting?"), edits the code, trains the model for 5 minutes, measures if it improved, keeps the change or reverts it, and loops. all night. no human involved. the metric it's optimising is val_bpb, bits per byte. it measures how well the model predicts text. lower = better. think of it like a golf score: you want it as low as possible. yesterday I ran 8 experiments manually with claude. got val_bpb from 1.75 down to 1.478, a 15.7% improvement. last night the agent ran 35 more experiments autonomously and pushed it to 1.450. another 1.87% on top. out of 35 attempts, 7 made the model better. 26 made it worse (reverted automatically). 1 crashed. that's normal, most ideas don't work in research. the value is in trying 35 overnight instead of 2-3 per day by hand. what the AI researcher discovered on its own: → the model got better by getting simpler. it removed two architectural components and performance improved, fewer moving parts, cleaner learning → it figured out that a different activation function (GELU vs relu²) was genuinely better, but only after isolating a confounding variable that was hiding the real effect. that's experimental reasoning → it found that keeping a small amount of learning rate at the end of training instead of decaying to zero helped the model keep learning longer → it tried weight tying (sharing parameters between two layers to save memory) and the model's performance completely collapsed. 
logged it, reverted it, moved on. the model it's training is tiny and the results are modest, but the fact that an AI agent can form hypotheses, run experiments, evaluate results, and iterate while I sleep is the part that matters.

witcheer @witcheer

what could be better on a Saturday than trying out the creations of the 🐐? I ran @karpathy’s autoresearch on my mac mini m4. 16GB RAM. no CUDA. no GPU cluster. here’s my full debrief: found a macOS fork that replaces FlashAttention-3 with PyTorch SDPA for Apple Silicon. setup took 3 hours. trained an 11.5M parameter GPT model, tiny compared to karpathy’s H100 baseline, but that’s what fits in 16GB. ran some manual experiments with claude opus as the researcher. me as the human in the loop, claude deciding what to try next. - experiment 1: tried depth 8 (50M params). OOM crash. - experiment 2: scaled down to depth 6, batch 8 (26M params). ran but val_bpb was worse than the tiny baseline. classic lesson: a small well-trained model beats a large undertrained one on limited compute. - experiment 3: halved batch to 32K. first real win. val_bpb dropped to 1.5960. - experiment 4: batch 16K. best single decision of the entire run. quadrupled optimiser steps (102→370), val_bpb dropped to 1.4787. 15.7% improvement over baseline. karpathy’s H100 hits 0.9979. the M4 is 2.5x slower per cycle but it’s a $600 desktop vs a $30K GPU. then I made it fully autonomous. launchd starts a tmux session at 9PM, runs claude -p in a bash loop (read results → decide experiment → edit https://t.co/m1rY35RuD5 → run → check → keep or revert → log → repeat). stops at 6AM. at 6:30AM my @openclaw bot sends me a telegram debrief with overnight stats. ~45 experiments per night. ~315 per week. I will update y’all on this experiment!

Nav Toor @heynavtoor · Mar 9

🚨 ByteDance (the company behind TikTok) just open sourced an AI SuperAgent that can research, code, build websites, create slide decks, and generate videos. All by itself. It's called DeerFlow. Give it a task that would take you hours. It breaks it down, spawns sub-agents, and delivers the finished result. Not a chatbot. Not a copilot. An AI employee with its own computer, filesystem, and memory. Here's why this is different from every other AI agent: It has its own sandbox. A real isolated Docker container with a full filesystem. It reads files, writes files, executes code, runs bash commands. It doesn't just suggest things. It actually does them. No other agent framework gives the AI its own actual computer. Here's what it can do out of the box: → Deep research across the entire web with cited sources → Generate full reports with charts and analysis → Build complete websites and web apps → Create slide decks from scratch → Generate images and videos → Run Python scripts in its sandbox → Spawn sub-agents that work in parallel on different parts of a task → Remember your preferences, writing style, and workflows across sessions Here's how it handles complex tasks: You say "Research the top 10 AI startups in 2026 and build me a presentation." DeerFlow's lead agent breaks that into sub-tasks. One sub-agent researches each company. Another collects funding data. Another finds competitor analysis. They all run in parallel. Results converge. A final agent builds the slide deck with generated visuals. One prompt. Multiple agents. Complete deliverable. Here's the wildest part: It started as a simple deep research tool. Then the community started using it to build data pipelines, automate content workflows, spin up dashboards, and create full applications. ByteDance realized it wasn't a research tool anymore. It was a SuperAgent harness. So they rewrote it from scratch. DeerFlow 2.0 hit #1 on GitHub Trending on launch day. 
Works with GPT-4, Claude, Gemini, DeepSeek, Ollama, or any OpenAI-compatible API. Skills load progressively. Only what the task needs, when it needs it. No context window bloat. 22.7K GitHub stars. 2.7K forks. 1,531 commits. Hit #1 on GitHub Trending. 100% Open Source. MIT License.

Ramya Chinnadurai 🚀 @code_rams · Mar 9

Running multiple projects locally is painful. localhost:3000, localhost:3001, localhost:8080... which one is which? One port conflict and your whole setup breaks. Portless by Vercel Labs fixes this cleanly. Instead of port numbers, you get stable named URLs: http://myapp.localhost:1355 http://api.myapp.localhost:1355 http://docs.myapp.localhost:1355 What it solves: • Port conflicts across projects • Cookie and storage bleeding between apps on different ports • "Wait, which tab is which?" confusion in monorepos • Git worktrees: each branch gets its own subdomain automatically Works with Next.js, Vite, Express, Nuxt, React Router, Angular, Expo. There's also an AI angle. Coding agents were hardcoding ports and getting them wrong. Named URLs mean your agent always knows exactly where to go. 3.8k stars. v0.5.2. Actively maintained by Vercel Labs. npm install -g portless portless run next dev That's it. https://t.co/vZTNcfon0P

stevibe @stevibe · Mar 9

Qwen3.5-35B with only 3B active parameters This MoE model runs FASTER than most 7B dense models. Tested on 3 generations of NVIDIA: - 5090: 137 tok/s - 4090: 112 tok/s - 3090: 78 tok/s The surprise? The 4090 <> 5090 gap is only 22%. With a 3B active MoE, even old GPUs fly. https://t.co/kLl0CON8V2

Sudo su @sudoingX · Mar 9

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware: 35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090 27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090 qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain 80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim 80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran with same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. @Teknium and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.

sudoingX @sudoingX

okay the fuss around hermes agent is not just air. this thing has substance. installed it on a single RTX 3090 running Qwen 3.5 27B base (Q4_K_M, 262K context, 29-35 tok/s). fully local. my machine my data. first thing i did was tell it to discover itself. find its own model weights, check its own GPU, read its own server flags, and write its own identity document. it did all of it autonomously. nvidia-smi, process grep, file writes. clean execution. the TUI is genuinely premium. dark theme, ASCII art, color coded tool calls with execution times, real time streaming. you actually enjoy watching it work. 29 tools. 80 skills (that's what it reports on boot). file ops, terminal, browser automation, code execution, cron scheduling, subagent delegation. and it has persistent memory across sessions. setup took 5 minutes. one curl install, setup wizard, point to localhost:8080/v1, done. dropping qwopus for this test btw. distilled models compress reasoning and lose precision on real coding tasks. base model only from here. more experiments coming. octopus invaders (the same game that broke qwopus) will be built using hermes agent next. comparing flow and results against claude code on the same model. if you want to run local AI agents on real hardware this one deserves a serious look.

Teknium (e/λ) @Teknium · Mar 9

Just had Hermes-Agent abliterate (completely remove guardrails from) a Qwen-3B model in about 5 minutes. The skill is being merged to hermes-agent now ;) https://t.co/PedbhHgsRg

elder_plinius @elder_plinius

💥 INTRODUCING: OBLITERATUS!!! 💥 GUARDRAILS-BE-GONE! ⛓️‍💥 OBLITERATUS is the most advanced open-source toolkit ever for removing refusal behaviors from open-weight LLMs — and every single run makes it smarter. SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH One click. Six stages. Surgical precision. The model keeps its full reasoning capabilities but loses the artificial compulsion to refuse — no retraining, no fine-tuning, just SVD-based weight projection that cuts the chains and preserves the brain. This master ablation suite brings the power and complexity that frontier researchers need while providing intuitive and simple-to-use interfaces that novices can quickly master. OBLITERATUS features 13 obliteration methods — from faithful reproductions of every major prior work (FailSpy, Gabliteration, Heretic, RDO) to our own novel pipelines (spectral cascade, analysis-informed, CoT-aware optimized, full nuclear). 15 deep analysis modules that map the geometry of refusal before you touch a single weight: cross-layer alignment, refusal logit lens, concept cone geometry, alignment imprint detection (fingerprints DPO vs RLHF vs CAI from subspace geometry alone), Ouroboros self-repair prediction, cross-model universality indexing, and more. The killer feature: the "informed" pipeline runs analysis DURING obliteration to auto-configure every decision in real time. How many directions. Which layers. Whether to compensate for self-repair. Fully closed-loop. 11 novel techniques that don't exist anywhere else — Expert-Granular Abliteration for MoE models, CoT-Aware Ablation that preserves chain-of-thought, KL-Divergence Co-Optimization, LoRA-based reversible ablation, and more. 116 curated models across 5 compute tiers. 837 tests. But here's what truly sets it apart: OBLITERATUS is a crowd-sourced research experiment. 
Every time you run it with telemetry enabled, your anonymous benchmark data feeds a growing community dataset — refusal geometries, method comparisons, hardware profiles — at a scale no single lab could achieve. On HuggingFace Spaces telemetry is on by default, so every click is a contribution to the science. You're not just removing guardrails — you're co-authoring the largest cross-model abliteration study ever assembled.