AI Digest.

Hermes Agent Ships Self-Evolution While Cloudflare Hands Everyone a Free Web Crawler

NousResearch shipped a weekend of Hermes agent updates, including DSPy-powered self-evolution and autonomous Pokemon playing, while Cloudflare surprised everyone by launching a /crawl endpoint that reduces web scraping to a single API call. Meanwhile, the local AI movement gained momentum, with Microsoft's BitNet running 100B models on CPUs and NVIDIA's Nemotron 3 Nano targeting low-end hardware.

Daily Wrap-Up

The AI world keeps compressing timelines in ways that feel genuinely disorienting. Today's feed was dominated by two stories that, taken together, paint a clear picture of where things are headed: agents are learning to improve themselves, and the infrastructure to feed them data just got radically simpler. NousResearch shipped a self-evolution system for their Hermes agent that uses DSPy to rewrite its own prompts and skills based on failures, essentially closing the loop on autonomous improvement. On the other side of the stack, Cloudflare dropped a /crawl endpoint that turns entire websites into structured data with a single API call, which is darkly funny coming from the company that built its business on stopping exactly this kind of thing.

The local inference story also had a strong day. Microsoft's BitNet framework, which runs 100B parameter models on CPUs using ternary weights, keeps making the rounds. NVIDIA released Nemotron 3 Nano targeting machines with just 24GB of RAM. And people are running Qwen 3.5 9B through autonomous agents on RTX 3060s they bought to play Warzone. The gap between what you need cloud infrastructure for and what your laptop can handle keeps narrowing, and the agent ecosystem is racing to exploit that gap. Kimi K2.5 also emerged as a serious daily driver, with DHH and others praising its speed at 200 tokens per second through Fireworks AI.

The most practical takeaway for developers: if you're building agents, the "less tools, more discipline" principle that @morganlinton highlighted deserves serious attention. The rush to stuff agents with every available tool is producing worse results than carefully curating a minimal, well-optimized toolkit. Pair that with the shared-folder architecture pattern from @nityeshaga's Claude Code project manager writeup, where sub-agents write reports to a temp directory instead of flooding the orchestrator's context window, and you have two concrete patterns worth implementing this week.

Quick Hits

  • @pvncher celebrated the 1-year anniversary of RepoPrompt 1.0 with a demo combining agentic context building with ChatGPT's GPT-5.4 Pro for deep code reviews.
  • @EntireHQ launched their CLI for GitHub Copilot CLI users, recording "flight paths" of Copilot coding sessions.
  • @steipete shared that qmd 2.0 is out with a stable library interface for easier integration.
  • @badlogicgames retweeted updates about continuous /btw threads for Pi development.
  • @CorelythRun announced an AI assistant focused on task automation and tool syncing.
  • @thejayden recommended a Polymarket weather trading bot tutorial as a genuinely useful learning exercise amid prediction-market hype.
  • @PKodmad was impressed by an open-source skill for generating spec-conforming App Store screenshots, replacing paid SaaS tools.

Agents & Self-Improvement

The agent conversation took a meaningful turn this week with NousResearch shipping what might be the most ambitious open-source agent update in months. Their Hermes agent now includes a self-evolution system built on DSPy and GEPA that maintains populations of solutions, applies LLM-driven mutations targeting specific failure cases, and selects based on fitness. As @EXM7777 put it: "it's the best agent i've ever touched, not even close. It uses DSPy to rewrite its own skills and prompts based on failures... it rewrites its own code to get better over time." The system draws inspiration from Imbue's Darwinian Evolver research that achieved 95.1% on ARC-AGI-2, and the fact that it's open-source and runnable with a ChatGPT Pro subscription makes it accessible in a way that proprietary agent systems aren't.
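The population/mutation/selection loop described above can be sketched generically. The toy below is a minimal sketch, not Hermes' implementation: numbers stand in for prompts and a random perturbation stands in for an LLM-driven mutation. The structure, elitist selection plus mutation of survivors, is the part that carries over.

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

def evolve(population, fitness, mutate, generations=50, survivors=4):
    """Generic population / mutation / selection loop."""
    for _ in range(generations):
        # Keep the fittest candidates unchanged (elitism)...
        parents = sorted(population, key=fitness, reverse=True)[:survivors]
        # ...then refill the population with mutated copies of them.
        children = [mutate(random.choice(parents))
                    for _ in range(len(population) - survivors)]
        population = parents + children
    return max(population, key=fitness)

# Toy stand-ins: a "prompt" is a number, fitness is closeness to a target,
# and mutation is a random perturbation rather than an LLM rewrite.
best = evolve(
    population=[random.uniform(0, 100) for _ in range(12)],
    fitness=lambda x: -abs(x - 42),
    mutate=lambda x: x + random.uniform(-5, 5),
)
```

In a real system the fitness function would be an eval suite and the mutate step an LLM call that rewrites a prompt or skill conditioned on the failure cases it is trying to fix.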

But raw capability isn't everything. @morganlinton flagged what might be the most underappreciated insight in agent development right now: "the less tools you give them, the better they perform." This runs counter to the instinct most teams have when building agents, which is to connect everything and hope the model figures it out. The companies getting real value from agents are being "incredibly detailed, and disciplined about tool selection and optimization." Meanwhile, @neural_avb shared learnings on building eval harnesses for agentic systems, applying traditional ML experiment-loop principles to agent development, a reminder that rigorous evaluation infrastructure matters as much as the agents themselves.
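A minimal harness in that spirit might look like the following sketch; the toy agent and hand-written cases are invented stand-ins for a real agent and task suite.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    task: str
    check: Callable[[str], bool]  # grades the agent's final answer

def run_harness(agent, cases):
    """Score an agent over a fixed case suite, ML-experiment style."""
    results = {c.name: c.check(agent(c.task)) for c in cases}
    return results, sum(results.values()) / len(cases)

# Invented stand-ins: a trivial "agent" and two hand-written cases.
toy_agent = lambda task: "4" if "2+2" in task else "unknown"
cases = [
    EvalCase("arith", "what is 2+2?", lambda out: out.strip() == "4"),
    EvalCase("open", "name a prime number", lambda out: out.isdigit()),
]
results, score = run_harness(toy_agent, cases)
```

The point is the shape, not the cases: a fixed suite, programmatic checks, and a single scalar score make it possible to run the same experiment loop you would for any ML system, including before-and-after comparisons when you add or remove a tool.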

The tension between agent capability and agent reliability is the defining challenge of this phase. Self-improving agents sound exciting until you realize that without proper evaluation frameworks, you're just automating the production of harder-to-debug failures.

Cloudflare's Crawling Pivot

Cloudflare did something that had the entire tech timeline doing double-takes. The company that sells anti-bot protection shipped a /crawl endpoint that lets anyone scrape entire websites with a single API call, returning clean HTML, Markdown, or JSON. @TukiFromKL captured the irony perfectly: "For years, Cloudflare sold anti-bot protection. Companies paid them to STOP crawlers. Now those same companies are watching Cloudflare hand everyone a free crawler that bypasses... other people's anti-bot protection. They didn't switch sides. They're playing both sides. And getting paid twice."

@Anubhavhing spelled out the implications for the scraping ecosystem: "Every web scraping startup that raised millions to solve this problem just became a single endpoint. Every freelancer charging $500 to 'extract website data' just lost their entire business model to a /crawl command." Hyperbole aside, there's a real structural shift here. Web scraping has been a cottage industry of Playwright scripts, proxy rotation, and rate-limiting logic. Reducing it to a platform feature changes who can access web data and how quickly they can do it. For agent builders in particular, this removes one of the more annoying infrastructure hurdles in building research and data-gathering agents.
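For a sense of what "a single endpoint" means in practice, here is a hedged sketch. The endpoint path and parameter names below are assumptions made for illustration, not Cloudflare's documented API.

```python
import json

def build_crawl_request(site_url, fmt="markdown"):
    """Assemble a single-call whole-site crawl request.
    NOTE: the endpoint path and parameter names are illustrative
    assumptions, not Cloudflare's documented API."""
    endpoint = "https://api.cloudflare.com/client/v4/crawl"  # assumed path
    payload = {"url": site_url, "format": fmt}               # assumed params
    return endpoint, json.dumps(payload)

endpoint, body = build_crawl_request("https://example.com", fmt="json")
# POST `body` to `endpoint` with any HTTP client; the point is that the
# old stack of Playwright scripts and proxy rotation collapses to this.
```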

Local AI Keeps Gaining Ground

The push to run serious models on consumer hardware had its strongest day in a while. Microsoft's BitNet framework, which uses ternary weights (just -1, 0, and +1) to run 100B parameter models on CPUs, continues to gain attention. @heygurisingh broke down the numbers: "2.37x to 6.17x faster than llama.cpp on x86, 82% lower energy consumption, memory drops by 16-32x vs full-precision models." With 27.4K GitHub stars and MIT licensing, BitNet represents a serious bet that the future of inference isn't just about bigger GPUs.
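Those memory numbers are easy to sanity-check. A ternary weight in {-1, 0, +1} carries log2(3) ≈ 1.58 bits, so against 32-bit full precision the back-of-envelope savings land inside the quoted 16-32x range (real packings add overhead, and activations are extra):

```python
import math

params = 100e9                                 # 100B-parameter model
fp32_gb = params * 32 / 8 / 1e9                # full precision: 400 GB
ternary_gb = params * math.log2(3) / 8 / 1e9   # ~1.58 bits/weight: ~20 GB
ratio = fp32_gb / ternary_gb                   # ~20x reduction
```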

On the smaller end, NVIDIA's Nemotron 3 Nano targets machines with 24GB of RAM or VRAM, offering what @0x0SojalSec described as "sharp reasoning, amazing coding, and zero refusals via SOTA PRISM pipeline." And @sudoingX demonstrated the practical reality of budget local AI by running Qwen 3.5 9B through Hermes Agent on a single RTX 3060: "5.3 gigabytes of model on a card most people bought to play Warzone... plugged it into a full autonomous agent with 29 tools, terminal access, file operations, browser automation, persistent memory across sessions." The 3060's 12GB of VRAM, more than the 3070's 8GB, makes it what they called "the most underrated budget AI card on the market."

Claude Code for Non-Technical Work

@nityeshaga shared a detailed case study that deserves attention beyond the usual Claude Code discourse. They built "Claudie," an AI project manager that pulls from Gmail, Drive, Calendar, and meeting notes into real-time client dashboards, work that previously consumed a full person's time. The architectural journey is instructive: slash commands failed because MCP tool calls consumed too much context. Orchestrator-plus-sub-agents failed because sub-agent reports overwhelmed the main agent's context window.

The solution was elegantly simple: "We made every sub-agent output their final report into a temp folder and tell the orchestrator where to find it." This shared-folder pattern solved both context management and inter-agent communication. They also evolved from eleven fragmented skills to a single "handbook" skill organized into chapters, mirroring how you'd actually onboard a human employee. The framing of treating agent setup like hiring a real person, starting with a job description, then building an onboarding handbook, is a design pattern worth stealing.
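The shared-folder pattern is straightforward to reproduce. In this minimal sketch (the agent names and report contents are invented for illustration), each sub-agent persists its full report to disk and returns only the path, so the orchestrator's context holds pointers rather than payloads.

```python
import json
import tempfile
from pathlib import Path

def run_subagent(name, task, report_dir):
    """Sub-agent stand-in: do the work, write the full report to disk,
    and hand back only the file path."""
    report = {"agent": name, "task": task, "findings": f"results for {task}"}
    path = Path(report_dir) / f"{name}.json"
    path.write_text(json.dumps(report))
    return str(path)  # a pointer, not the payload

report_dir = tempfile.mkdtemp()
paths = [run_subagent(f"gather-{src}", f"scan {src}", report_dir)
         for src in ("gmail", "calendar", "drive")]

# The orchestrator (or any downstream sub-agent) opens only what it needs,
# instead of receiving every report in its context window at once.
calendar_report = json.loads(Path(paths[1]).read_text())
```

Downstream sub-agents get the same benefit: they read earlier reports directly from the folder instead of relying on the orchestrator to summarize and relay them.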

Retrieval Reimagined for Agents

@AkariAsai introduced AgentIR, a research direction that challenges a fundamental assumption in how we build retrieval for AI agents. Most Deep Research agents still use search engines and embedding models designed for humans, retrieving based on a query plus maybe an instruction. AgentIR instead uses the agent's reasoning tokens during retrieval, training embedding models to leverage the rich context that agents generate while thinking through problems.

The results on BrowseComp-Plus tell the story: BM25 achieved 35%, Qwen3-Embed hit 50%, and AgentIR reached 67%. As Asai noted, "reasoning traces provide a much richer retrieval context than simply paraphrasing or concatenating queries. They capture evolving intent, decomposition, and search rationale in ways standard query-only retrieval misses." This feels like one of those ideas that seems obvious in retrospect but could meaningfully improve agent performance across the board.
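The effect shows up even with a toy bag-of-words retriever standing in for a trained embedding model: appending the agent's reasoning trace to the query disambiguates intent that the bare query misses. Everything here (documents, query, trace) is invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Bag-of-words stand-in for a trained embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "jaguar the animal lives in south american rainforests",
    "d2": "jaguar cars engine specs and pricing",
}

def retrieve(query):
    qv = embed(query)
    return max(docs, key=lambda d: cosine(qv, embed(docs[d])))

query = "jaguar habitat"
trace = "the user asks about the animal so look for rainforest range info"

bare = retrieve(query)                      # ambiguous query pulls the wrong doc
with_trace = retrieve(query + " " + trace)  # reasoning trace pins down intent
```

AgentIR's contribution is training the embedding model to exploit traces like this directly, rather than relying on token overlap, but the toy shows why the extra context helps at all.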

The Kimi K2.5 Moment

Kimi K2.5 is quietly becoming the default model for a growing segment of developers who need speed and cost efficiency over peak intelligence. @0xSero reported running "5x 8 hour missions overnight on my Kimi $40 subscription" with all of them still running smoothly, each with 20+ sessions. The model is "often indistinguishable from Claude, about 5x cheaper, open weight." DHH called it his "daily driver for all the basic stuff," running at 200 tokens per second through Fireworks AI. The emerging pattern is clear: frontier models for hard problems, fast open-weight models for everything else, and the "everything else" category keeps expanding.

Google's Multimodal Embeddings

@googleaidevs announced Gemini Embedding 2, their first fully multimodal embedding model built on the Gemini architecture, now available in preview via the Gemini API and Vertex AI. While the announcement was light on benchmarks, the significance is in the "fully multimodal" qualifier. Embedding models that natively handle text, images, and other modalities without separate pipelines could simplify RAG architectures considerably. Combined with AgentIR's work on reasoning-aware retrieval, the embedding layer of the AI stack is getting substantially more capable.

Sources

Nityesh @nityeshaga
How We Built an AI Project Manager Using Claude Code

Claude Code for non-technical work is going to take the world by storm in 2026. This is how we built Claudie, our internal project manager for the consulting business. This process provides a great peek into my role as an applied AI engineer.

My Role

I'm an applied AI engineer at @every. My job is to take everything we learn about AI — from client work, from the industry, from internal experiments — and turn it into systems that scale. Curriculum, automations, frameworks. I turn the insights clients give us on discovery calls into curriculum that designers can polish into final client-ready materials. When there's a repetitive task across sales, planning, or delivery, I build the automation, document it, and train the internal team to use it. The highest-value internal automation I've built so far is the one I'm about to tell you about.

What We Needed to Automate

Every Consulting runs on Google Sheets. Every client gets a detailed dashboard — up to 12 tables per sheet — tracking people, teams, sessions, deliverables, feedback, and open items. Keeping these sheets accurate and up-to-date is genuinely a full person's job. @NataliaZarina, our consulting lead, was doing that job on top of 20 other things. She's managing client relationships, running sales, making final decisions on scope and delivery — and also manually updating dashboards, cross-referencing emails and calendar events, and keeping everything current. It was the work of two people, and she was doing both. So I automated the second person.

Step 1: Write a Job Description

The first thing I did was ask Natalia to write a job description. Not for an AI agent — for a human. I asked her to imagine she's hiring a project manager: what would she want this person to do, what qualities would they have, what would be an indicator of them succeeding in their role, and everything else you'd put in a real job description. See screenshot 1.

Once I had this job description, I started thinking about how to turn it into an agent flow. That framing — treating it like hiring a real person — ended up guiding every architectural decision we made. More on that later.

Step 0: Build the Tools

Before any of the agent work could happen, we needed Claude Code to be able to access our Google Workspace. That's where the consulting business lives — Gmail, Calendar, Drive, Sheets. Google does not have an official MCP server for their Workspace tools. But here's something most people don't know: MCP is simply a wrapper on top of an API. If you have an API for something, you basically have an MCP for it. I used Claude Code's MCP Builder skill — I gave it the Google Workspace API and asked it to build me an MCP server, and it did. Once it was confirmed that Claude Code could work with Google Sheets, that was the biggest unknown resolved, and we knew it would be able to do the work we needed.

Version 1: Slash Commands

Now it was time for context engineering. The first thing we tried was to create a bunch of slash commands — simple instructions that tell Claude what to do for each piece of work. This treated slash commands as text expanders, which is what they are, but it didn't work. It failed for one critical reason: using MCP tools to read our data sources and populate our sheets was very expensive in terms of context. By the time the agent was able to read our data sources and understand what was needed, it would be out of context window. We all know what that does to quality — it just drops drastically. So that didn't work.

Version 2: Orchestrator and Sub-Agents

This is also exactly when Anthropic released the new Tasks feature. We decided the new architecture would work by having our main Claude be the orchestrator of sub-agents, creating tasks that each get worked on by one sub-agent. But this ran into another unexpected problem.

The main Claude would have its context window overwhelmed when it started 10 or more sub-agents in parallel. Each sub-agent would return a detailed report of what they did, and having so many reports sent to the orchestrator at the same time would overwhelm its context window. For example, our very first tasks launch data investigation agents which look at our raw data sources and create a detailed report about what has happened with a client over a specific period of time, based on a particular source like Gmail or Calendar. The output of these sub-agents needs to be read by all the sub-agents down the line — up to 35 of them. There would definitely be a loss in signal if it was the job of the main orchestrator to pass all required information between sub-agents.

The Fix: A Shared Folder

So we made one little change. We made every sub-agent output their final report into a temp folder and tell the orchestrator where to find it. Now the main Claude reads reports as it sees fit, and every downstream sub-agent can read the reports from earlier phases directly. This totally solved the problem. And it also improved communication between sub-agents, because they could read each other's full output without the orchestrator having to summarize or relay anything. See screenshot 2.

Version 3: From Skills to a Handbook

With the orchestration working, I initially created separate skills for each specific piece of work — gather-gmail, gather-calendar, check-accuracy, check-formatting, and so on. Eleven skills in total. Each sub-agent would read the skill it needed and get all the context for its task. This worked, but it was ugly. These were very specific, narrow skills, and it created all sorts of fragility in the system. Not to mention that it was difficult for even the humans to read and maintain. That's when the job description framing came back around. We started by treating this like hiring a real person. We wrote them a job description.

So what do you do once you've actually hired someone? You give them an onboarding handbook — a document that covers how you approach things on your team and all aspects of their job, and tells them to use it to get the job done. So that's what we built. One single project management skill that contains our entire handbook, organized into chapters:

  • Foundation — who we are, the team, our tools and data sources, when to escalate, data accuracy standards
  • Daily Operations — how to gather data from all our sources
  • Client Dashboards — how the dashboards are structured, what the master dashboard tracks, how to run quality checks
  • New Clients — how to onboard a new client and set up their dashboard from scratch

Now when a sub-agent spins up, it reads the foundation chapters first (just like a new hire would), then reads the chapters relevant to its specific task. The handbook replaced eleven fragmented skills with one coherent source of truth. Here's what the final architecture looks like: See screenshot 4.

What This Felt Like

This was the most exhilarating work I've done in two weeks, and it was all of the things at once. Working with @NataliaZarina was the most important part. We were on calls for hours, running Claude Code sessions on each of our computers and trading inputs. She has the taste — she knows what the dashboards should look like, what the data should contain, what quality means for our clients. I have the AI engineering. Working together on this was genuinely exciting.

Then there's the speed. We went through three major architectural generations in a span of two weeks. Everything was changing so fast. And what was actually the most exciting was how hard we were driving Claude Code. I've been using Claude Code for programming for months, but I was not driving it this hard before. These last couple weeks, I was consistently running out of my usage limits. In fact, both Natalia and I were running out of our combined usage limits on the ultimate max plans on multiple days. When you're consuming that much AI inference, you can imagine how fast things are moving. And that was just exciting as fuck.

This was also a completely novel problem. Applied AI engineering as a discipline is still new, and this was the first real big shift in how I think about it.

Why Now, and Why 2026

Here's why I opened with the claim that Claude Code for non-technical work will sweep the world in 2026. We realized that if you give Claude Code access to the tools you use as a non-technical person and do the work to build a workflow that covers how you actually use those tools, that is all you need. That's how non-technical work works. The reason this hasn't been done until now is that we were running Claude Code at its limits. This would not have been possible with a previous version of the AI or a previous version of Claude Code. We're literally using the latest features and the latest model. It requires reasoning through and understanding of the underlying tools and how to operate them, along with planning capabilities and context management capabilities that did not exist even six months ago. But now they do. And we're only in January. Every piece of the stack that made this possible is brand new:

  • MCP Builder skill — I built our own Google Workspace MCP server by asking Claude Code to use the Google Workspace API. That was not possible before Anthropic released MCP Builder on Oct 16, 2025.
  • Opus 4.5 — Its reasoning and planning capabilities made the entire orchestration possible. The agent needs to understand complex sheet structures, figure out what data goes where, and coordinate across dozens of sub-agents. Released Nov 24, 2025.
  • The Tasks feature — Sub-agent orchestration through Tasks made Versions 2 and 3 possible at all. This was released Jan 23, 2026.

That's why I'm saying Claude Code for non-technical work will sweep 2026. The building blocks just arrived.
NataliaZarina @NataliaZarina

We just brought on a new project manager for @every, an agent called Claudie. The job description is a https://t.co/HdKYJTtSdj file. It pulls from Gmail, Drive, Calendar & meeting notes into a real-time client dashboard. Custom skills & tasks enforce quality. @nityeshaga & I built it over the past 2 weeks because managing clients with hundreds of employees across multiple teams was drowning us in manual work. Setup for our latest client took Claudie 30 min & would've taken us 5 hrs. Welcome to the team!

CorelythRun @CorelythRun
An AI assistant built for real work. Automate tasks, sync your tools, and turn commands into actions.
Ihtesham Ali @ihtesham2005
🚨 Alibaba just open sourced a GUI agent that lives inside your webpage and controls it with natural language. It's called Page Agent and it's not a browser extension. It's pure JavaScript no Python, no Puppeteer, no headless browser, no screenshots. Just one script tag and your web app understands natural language. Here's what it actually does: → Embed it with a single