AI Digest.

Cursor Ships Its SDK as Agent Infrastructure Becomes the Defining Challenge of 2026

Cursor released its SDK to let developers embed coding agents anywhere, while the broader conversation centered on agent infrastructure challenges from harness engineering to observability. Ramp launched AI procurement agents, Aaron Levie announced a new "agent engineer" role at Box, and researchers introduced frameworks for self-improving agent harnesses.

Daily Wrap-Up

If you scrolled through AI Twitter today and felt a strange sense of convergence, you weren't imagining it. Nearly every major post touched the same nerve: we've moved past "can agents do things?" and landed squarely on "how do we make agents do things reliably, at scale, without losing our minds?" The Cursor SDK launch was the headline, but the real story is the ecosystem crystallizing around agent infrastructure. From Browserbase shipping observability for browser agents to a research paper on self-evolving harnesses to Aaron Levie literally creating a new job title for internal agent wiring, the message is clear: the plumbing era of AI agents has arrived.

What's striking is how fast the conversation has matured. Six months ago, the discourse was about which model was smartest. Today it's about sandboxing, checkpointing, context management, and CI optimization. @DevanshuXi's deep dive into the real infrastructure behind autonomous coding agents, @omarsar0 highlighting a paper where harnesses improve themselves through falsifiable predictions, and @jainarvind from Glean talking about their third-generation harness iteration all point to the same conclusion: the model is increasingly a commodity, and the harness is where the value lives. Meanwhile, @eglyman made the most quotable observation of the day, noting that everyone's fixated on AI replacing creative work while the quieter, bigger revolution is back-office agents running procurement at 2am for three cents.

The most practical takeaway for developers: if you're building with AI agents, stop optimizing prompts and start investing in your harness, the orchestration layer that manages context, tool calls, and recovery. Today's posts made clear that harness engineering, not model selection, is the bottleneck separating demo-grade agents from production-grade ones. Start with observability (know what your agent sees and does), add checkpointing (so you can recover from failures), and treat your harness configuration as code that can be versioned, tested, and evolved.
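For a feel of what "observability first, then checkpointing" can look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's harness: `run_with_checkpoints`, `step_fn`, and the JSON checkpoint file are all hypothetical names, and step results are assumed to be JSON-serializable.

```python
import json
import os
from typing import Callable, Dict, List


def run_with_checkpoints(
    steps: List[str],
    step_fn: Callable[[str], str],  # hypothetical: runs one named step, may raise
    checkpoint_path: str,
    trace: List[Dict[str, str]],
) -> dict:
    """Run steps in order, tracing every event and saving progress after each step."""
    state = {"done": [], "results": {}}
    if os.path.exists(checkpoint_path):
        # Recovery: resume from whatever the previous run managed to finish.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for name in steps:
        if name in state["done"]:
            trace.append({"step": name, "event": "skipped"})
            continue
        trace.append({"step": name, "event": "start"})
        result = step_fn(name)  # an exception here leaves the last checkpoint intact
        state["results"][name] = result
        state["done"].append(name)
        # Write to a temp file, then rename, so a crash never leaves a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
        trace.append({"step": name, "event": "done"})
    return state
```

If a step raises, rerunning the same call with the same checkpoint path skips the completed steps and retries only the one that failed, while the trace records what happened on both attempts.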

Quick Hits

  • @theblessnetwork introduced Memorybase, a universal AI memory layer so you never have to re-explain project context to a new AI session. Solves a real pain point for anyone juggling multiple tools.
  • @bhalligan (HubSpot co-founder) teased a writeup about the "most clever way" a founder is AI-training their entire 300-person team. No details yet, but the engagement bait worked.
  • @Hacubu shared his four-month journey coding with Deep Agents across six different models (Opus 4.6 through GPT-5.5), currently running 5.5 as his primary driver with Opus handling code reviews.
  • @justsisyphus RT'd a post about the next generation of agentic engineers, citing claw-code hitting 100k GitHub stars in 24 hours.
  • @_smontlouis gave a passionate endorsement of Matt Pocock's approach to AI-assisted architecture, claiming "ZERO SLOP" in his repos from following Pocock's methods.
  • @lateinteraction (Omar Khattab) signal-boosted HALO (Hierarchical Agent Loop Optimizer), an RLM-based technique for recursively optimizing agent behavior.

The Agent Infrastructure Stack

Today's feed read like a syllabus for a course on agent systems engineering. The sheer density of posts about what sits between the model and the real world suggests we've hit an inflection point where the hard problems are no longer about intelligence but about reliability, observability, and orchestration.

@jainarvind from Glean framed the core challenge well: "Models have a fixed attention span, and the harness decides how it gets filled. Agents are now taking on longer-running, more complex work. To do that reliably and to completion, the harness itself has to be built to scale context." This is Glean's third iteration of their agent harness, which tells you something about how much trial and error is involved.
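To make "the harness decides how it gets filled" concrete, here is a small Python sketch of one possible policy: greedy packing by priority under a fixed token budget. It is not Glean's approach; `ContextItem`, `fill_context`, and the word-count token estimate are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ContextItem:
    label: str
    text: str
    priority: int  # higher means more important to keep


def estimate_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count. A real harness would use the model's tokenizer.
    return len(text.split())


def fill_context(items: List[ContextItem], budget: int) -> Tuple[List[ContextItem], int]:
    """Greedily pack the highest-priority items that still fit in the budget."""
    chosen: List[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen, used
```

The design choice worth noticing is that something has to lose: with a fixed budget, a long low-priority log is dropped before the system prompt or the task description. Production harnesses are likely to summarize or retrieve rather than simply drop, which this sketch does not attempt.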

On the research side, @omarsar0 highlighted a paper on Agentic Harness Engineering that introduces a framework for making harness evolution observable and self-correcting: "Each edit becomes a contract you can verify or revert." The reported results are compelling: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, a 7.3-point gain that beats the human-designed Codex-CLI (71.9%) as well as self-evolving baselines such as ACE and TF-GRPO, and the evolved harness uses 12% fewer tokens than the seed on SWE-bench-verified. Meanwhile, @DevanshuXi went deep on what companies like Cursor, Cognition, and Anthropic actually need to run thousands of autonomous agents in production: "Not the polished demo. The real infrastructure underneath, the sandboxing, real-time sync, isolation, checkpointing, recovery, semantic indexing, and all the distributed systems chaos hiding behind a simple 'AI, fix this bug.'"
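The "verify or revert" idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's algorithm: `apply_edit_with_contract` and the `evaluate` callback (imagine it returning the number of benchmark tasks passed) are hypothetical, and the harness is reduced to a plain config dictionary.

```python
import copy
from typing import Callable, Dict, Tuple


def apply_edit_with_contract(
    config: Dict[str, object],
    edit: Dict[str, object],
    predicted_min_gain: float,
    evaluate: Callable[[Dict[str, object]], float],  # hypothetical scoring function
) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Try an edit, keep it only if the observed gain meets the stated prediction."""
    baseline = evaluate(config)
    candidate = copy.deepcopy(config)
    candidate.update(edit)
    gain = evaluate(candidate) - baseline
    kept = gain >= predicted_min_gain
    record = {
        "edit": edit,
        "predicted_min_gain": predicted_min_gain,
        "observed_gain": gain,
        "kept": kept,
    }
    # Revert is trivial here because the original config was never mutated.
    return (candidate if kept else config), record
```

The returned record is the point: every edit leaves behind a prediction and an observed outcome, so a change that failed its own claim is visible and undone rather than silently accumulated.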

The Chinese-language post from @AYi_AInotes about Browserbase's /browser-trace added another critical piece to the puzzle: agent observability. The tool records every CDP event, DOM snapshot, network request, and console log when an agent operates in a browser, then generates an interactive HTML report. As they put it (translated): "We've been building hands and eyes for agents, but nobody built them a black box." This is the OpenTelemetry moment for browser agents, transforming them from opaque executors into transparent, reproducible systems. Together, these posts paint a picture of an industry rapidly building out the boring-but-essential middleware that will determine which agent systems actually work.
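The "black box" idea is easy to sketch in Python, with the caveat that this is a generic event recorder and not Browserbase's /browser-trace or the CDP API. `TraceRecorder` and its event kinds are hypothetical names chosen for illustration.

```python
import html
import time
from collections import Counter
from typing import Dict, List


class TraceRecorder:
    """Append-only log of what an agent saw and did, renderable as an HTML table."""

    def __init__(self) -> None:
        self.events: List[Dict[str, object]] = []

    def record(self, kind: str, detail: str) -> None:
        # seq gives a stable ordering even when timestamps collide.
        self.events.append(
            {"seq": len(self.events), "ts": time.time(), "kind": kind, "detail": detail}
        )

    def summary(self) -> Counter:
        # Event counts per kind, e.g. how many network requests were seen.
        return Counter(str(e["kind"]) for e in self.events)

    def to_html(self) -> str:
        # Escape everything: page content and console output are untrusted text.
        rows = "\n".join(
            f"<tr><td>{e['seq']}</td><td>{html.escape(str(e['kind']))}</td>"
            f"<td>{html.escape(str(e['detail']))}</td></tr>"
            for e in self.events
        )
        return f"<table>\n<tr><th>#</th><th>kind</th><th>detail</th></tr>\n{rows}\n</table>"
```

A real tracer would also capture screenshots and DOM snapshots and link each event to them, as the post describes; the sketch covers only the ordered, queryable event log that makes a failed run reproducible.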

Cursor SDK Launch and the Embeddable Agent Era

Cursor dropped its SDK, and the developer community immediately started stress-testing what "embed agents everywhere" actually means. The SDK exposes the same runtime, harness, and models that power Cursor's editor, but packages them for CI/CD pipelines, automations, and third-party products.

@agrimsingh, who's been playing with the SDK since November, shared that one of his first builds was "a way to take a recorded interaction trace and use the SDK to generate the entire codebase that captured this." He noted the experience has only gotten better with Cursor's newer composer-2-fast models. @jack___driscoll went a different direction entirely, embedding a Cursor agent directly inside Gmail after just a few days with the SDK. These aren't toy demos; they represent the SDK's core thesis that coding agents shouldn't be trapped inside an editor.

The timing feels deliberate. With agent harness engineering becoming the central challenge, releasing an SDK that packages a battle-tested harness gives Cursor a platform play beyond their editor. If developers build on Cursor's runtime rather than rolling their own, Cursor becomes infrastructure rather than just a tool.

AI Agents in Business: Procurement and the Back Office Revolution

Ramp's launch of AI procurement agents generated significant buzz, with multiple posts circling the same insight: the most impactful AI applications might be the least glamorous ones. @geoffintech laid out the numbers: "customers saving 16% annually on vendor spend. 46 hours per month of manual purchasing work eliminated. Approved requests moving 3x faster." By his account, the average AI contract has ballooned from $39k to over half a million in two years, and that complexity has outgrown manual processes.

@eglyman captured the meta-narrative perfectly: "the loud AI story is models replacing creative work. the quiet one is the drudgery of the back office evaporating, agents running procurement, AP, and renewals at 2am for three cents. the second one is bigger." This framing resonated because it cuts through the noise. While Twitter debates whether AI will replace writers and artists, the actual revenue impact is happening in purchase orders and contract renewals.

@levie extended this to organizational design, announcing that Box is "starting to hire and retrain for new agent engineering roles for internal functions." His description of the role is telling: someone "extremely technical and capable of building secure, governed agents for internal workflows that connect to business systems." He even predicted a complementary role on the business side, something like "agent product management for internal processes." The key insight: "It's not about bringing automation to a job, but bringing automation to a process."

The Full-Stack Agent Developer

A cluster of posts focused on what it means to be an effective developer in the agent era. @dboskovic described building "Autobuild," an agent that oversees the entire software development lifecycle: planning features across dozens of PRs, babysitting code review, running QA with recorded videos, monitoring logs after staging releases, and even optimizing CI pipelines. They're onboarding companies through workshops that promise "12 weeks of roadmap in 2 days."

@Av1dlive paired a line attributed to Andrej Karpathy, that 10x engineers are normal while real agentic engineers are 100x, with @rohit4verse's playbook "What to Learn, Build, and Skip in AI Agents (2026)," listing its key skills as context engineering, tool design, orchestrator-subagent patterns, evals, and the harness mindset. This aligns with the broader theme that the bottleneck has shifted from writing code to designing systems that help agents write code well. The agentic engineer doesn't just use AI; they architect the entire feedback loop.

Developer Tools and Open Source Drops

Beyond the Cursor SDK, several developer-focused launches caught attention. @mattpocockuk open-sourced Sandcastle, his personal "software factory," and separately shared a skills changelog introducing /grill-with-docs and experimental /diagnose and /triage skills. @burakkarakann released dac, a dashboard-as-code tool that lets agents generate standardized dashboards from YAML or JSX: "Agents need regular files, but your dashboarding tool doesn't work that way." And @hazelcough from Stripe shared that, for Stripe Sessions, the team turned its well-worn API design principles into an app that reviews your API for $2 and is being made public for 30 days, a nice example of productizing institutional knowledge.

@RayFernando1337 called Fallow "fantastic and a massive pro tip to kill slop," quoting @stolinski, who recommended it for reining in the "dead code, duplication and drift" that come with AI-assisted coding. As agents generate more code, tools that detect and clean up the resulting entropy are becoming essential infrastructure rather than nice-to-haves.

Sources

Bless @theblessnetwork ·
You never have to re-explain projects to a new AI again. Introducing @memorybase, universal AI memory that works everywhere.
Ray Fernando @RayFernando1337 ·
Fallow is fantastic and a massive pro tip to kill slop.
stolinski @stolinski

Dead code, duplication and drift are huge problems with coding with AI. You can't prompt this away. Lately I've been really loving Fallow to rein this in. https://t.co/frQW5AxqDI

Arvind Jain @jainarvind ·
One of the questions I often get from leaders is: Can agents reliably get real work done, end to end? A lot of that answer depends on the harness around the model—the system that decides how to break a request into steps, which tools to call, what to remember, and when to stop. Models have a fixed attention span, and the harness decides how it gets filled. Agents are now taking on longer-running, more complex work. To do that reliably and to completion, the harness itself has to be built to scale context. We’ve been solving this problem at @Glean since day one. We wrote up some of what we've learned from our 3rd iteration of our harness here:
tonygentilcore @tonygentilcore

The harness as the context manager

阿绎 AYi @AYi_AInotes ·
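Today I saw something that I think is one of the most important advances in agent engineering so far in 2026! Honestly, I had grown a bit jaded with all the agent browser tools. They all amount to clicking buttons and filling in forms; on a complex page they click at random, and when something goes wrong you are left in the dark. But /browser-trace, which Browserbase just released, is different. They put together a demo: Claude automatically scrolling through videos on TikTok, getting through dozens in 50 seconds, while everything was quietly recorded in the background: 2077 CDP events, 43 DOM and screenshot snapshots, 320+ network requests, including failed, aborted, and media-preload requests, plus all console logs, JS exceptions, and page lifecycle events. At the end it automatically generated an HTML report with a dashboard, where clicking any event jumps to the corresponding screenshot and DOM. Damn, this is seriously awesome. When an agent used to click around a web page and get stuck, do you know what the most painful part was? I thought about it for a long time, and the answer is: you have no idea why it got stuck 🤣🤣🤣 What did it see? Where did it click? What did the network request return? Was there a JS error? Traditional Playwright debugging is painful enough, and agents are worse: they decide on their own, execute on their own, and fail on their own, and you can't even reproduce it. So sometimes you realize that we've been building hands and eyes for agents all along, but nobody ever built them a black box. I think that is the real significance of /browser-trace. It's not a better debugger; it's more like OpenTelemetry for browser agents. It turns the browser from the agent's black-box executor into a fully transparent, queryable, reproducible system. And it is exactly this kind of real observability that forms the starting point for agent reliability.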
elvis @omarsar0 ·
// Agentic Harness Engineering // Pay attention to this one, AI devs. (bookmark it) Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution. This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable. They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes. Each edit becomes a contract you can verify or revert. Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified. Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise. Paper: https://t.co/9fEgqwlTSf Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
Burak Karakan @burakkarakann ·
Open-sourcing our dashboard-as-code product: dac 🚀 Dac is an open-source product that allows you to build dashboards using YAML or JSX. You can throw your agents at it and get a beautiful, standardized dashboard as a result of it. One of the things I hate about the existing BI tools is that they lock you into a weird UI with no way to bring agents into it. Agents need regular files, but your dashboarding tool doesn't work that way. If you are like me, your first instinct would be to get your agents to build you dashboards from scratch. That's nice to get the first dopamine hit, but it doesn't work well within a team. Every dashboard is different, deploying the dashboards requires a lot of automation work, and there's no way to standardize, govern or review dashboards. Agents generate a ton of code, and it is very hard to review. Enter dac: it is an open standard, and an implementation of the standard as a standalone Go binary. It allows you to build and serve dashboards locally or in any server that can run a single binary. It has a built-in semantic layer, it comes with its own skill, and your agent can build your first dashboard in minutes. Dac supports all the major cloud databases: Postgres, MySQL, SQL Server, Snowflake, BigQuery, Databricks, and more. Another fancy feature of dac is: since it supports JSX, it means you can do load-time dynamism: display different charts per value, create tabs or charts dynamically, and have your dashboard live and breathe. You can customize themes, change the color palette, and make it yours. We are open-sourcing the spec and the implementation today. It has rough edges, but getting started with it should be as simple as "dac init". We are very excited to launch dac and we are looking forward to your feedback on this!
Matt Pocock @mattpocockuk ·
I built my own software factory, and I open-sourced it. It's called Sandcastle. Here's how to use it: https://t.co/SpX9E5u4k4
Jack Driscoll @jack___driscoll ·
I've been building with the cursor SDK for a few days now. It's awesome. 🧵 I embedded a cursor agent directly inside Gmail: https://t.co/LZ1amK1GF4
cursor_ai @cursor_ai

We’re introducing the Cursor SDK so you can build agents with the same runtime, harness, and models that power Cursor. Run agents from CI/CD pipelines, create automations for end-to-end workflows, or embed agents directly inside your products. https://t.co/bRcn9xjuVz

agrim singh @agrimsingh ·
i've been playing with the sdk since november(!!) last year when @ericzakariasson had us in the @cursor_ai NYC office to playtest and it's been absolutely cracked running cursor's harness inside your own apps. one of the first things i built was a way to take a recorded interaction trace and use the sdk to generate the entire codebase that captured this. this was running the old composer model under the hood - the demo video is from the night i first used the sdk. it's actually been even better building with the composer-2-fast models as the backbone for apps. big release!!
cursor_ai @cursor_ai

We’re introducing the Cursor SDK so you can build agents with the same runtime, harness, and models that power Cursor. Run agents from CI/CD pipelines, create automations for end-to-end workflows, or embed agents directly inside your products. https://t.co/bRcn9xjuVz

Eric Glyman @eglyman ·
the loud AI story is models replacing creative work. the quiet one is the drudgery of the back office evaporating — agents running procurement, AP, and renewals at 2am for three cents. the second one is bigger.
tryramp @tryramp

98% of companies don't have a procurement team. The ones that do are stretched thin. Today, they all get backup. Introducing a suite of AI agents to run your entire purchasing process, saving you 46 hours of manual work per month and 16% on yearly vendor spend. https://t.co/0a7vpbDqza

Devanshu @DevanshuXi ·
Watched founders and senior engineers on twitter talk about spinning up armies of AI agents like it's effortless. That naturally led to a bigger question: how do companies like @cursor_ai , @cognition , and @AnthropicAI @OpenAI actually make this work in production? Not the polished demo. The real infrastructure underneath, the sandboxing, real-time sync, isolation, checkpointing, recovery, semantic indexing, and all the distributed systems chaos hiding behind a simple "AI, fix this bug." So I went deep and designed the entire architecture from scratch. If you've ever wondered what it takes to safely run thousands of autonomous coding agents at scale, this one's for you. https://t.co/SC0pMdSBhP
Geoff Charles @geoffintech ·
Business purchasing is broken. Today we're fixing that. We're giving every Ramp customer access to AI agents that can source vendors, review contracts, run approvals, negotiate pricing, check compliance, and handle payments and renewals. This is more important than ever when AI adoption has crossed 50% of US businesses and the average AI contract has gone from $39k to over half a million in two years. The complexity outgrew the process. Early results: customers saving 16% annually on vendor spend. 46 hours per month of manual purchasing work eliminated. Approved requests moving 3x faster. This one has been a long time coming. Excited to say that the new era of Ramp Procurement is now available to over 50,000 teams More on what we built below.
michelle 👻 @hazelcough ·
Stripe obsesses over API design. For Stripe Sessions we turned our well-worn API design principles into an app that reviews your API for $2. We liked it so much we’re making it public for 30 days. https://t.co/LhD96K12O6 https://t.co/xjNxh1CPFT
Jacob Lee @Hacubu ·
So excited for this - in the past four months I've written all my code using Deep Agents with:
- Opus 4.6
- GPT-5.4
- GLM-5
- Kimi 2.6
- Opus 4.7
- GPT-5.5
Currently using 5.5 as my driver with Opus supporting with reviews. Pumped to make that work even better!
Vtrivedy10 @Vtrivedy10

Tuning Deep Agents to Work Well with Different Models

Brian Halligan @bhalligan ·
What's the smartest, fastest way you've seen a company force-multiply their people with AI? Just saw the most clever way a founder is AI-pilling their entire 300-person team. Writing it up to share, but I wonder if it can be topped...
Avid @Av1dlive ·
Andrej Karpathy : 10x engineers are normal. real agentic engineers are 100x this guy just shipped the playbook to become 100x context engineering. tool design. orchestrator-subagent. evals. the harness mindset. watch & bookmark it for this weekend https://t.co/7qQNW8KJSN
rohit4verse @rohit4verse

What to Learn, Build, and Skip in AI Agents (2026)

Stéphane - smo @_smontlouis ·
I recommend you devour Matt Pocock morning, noon, and night. I legit think he's the best AI architect right now. His skills are excellent, his videos are excellent, everything he does is EXCELLENT. ZERO SLOP in my projects, my repos are cleeaannnnnnn https://t.co/lSKILFhaX7
David Boskovic @dboskovic ·
we were spending too much time "carrying water" between agents
- here's a spec build it
- address the pr comments
- fix the failing CI
so we made an agent for overseeing the SDLC e2e (Autobuild)
we invited 10 startups to use it last week in SF (NYC next week!)
what it does:
- plans out entire feature builds across dozens of PRs
- oversees the coding agents as they work
- babysits PRs and addresses human and agentic review
- conducts security, performance, and architectural reviews
- QAs the work and records videos of the outcomes
- monitors logs for issues after staging release
- collects ux feedback from humans and address them
- indexes all the concepts in your codebase
- automatically writes updates to your team about what shipped
- knows the current rollout state of features
- maintains running sandboxes with a full dev env
- dogfoods features before reporting success
- engages with you in slack as it builds
- automatically fixes reported issues
- nags you for PR reviews when needed
- optimizes your CI so it's not shit (big bottleneck for velocity)
we're planning on making this the most insane building experience for established companies with a focus on quality/safety and human collaboration while accelerating velocity by 1-2 orders of magnitude
if you want to join us in NYC next week (Thur/Fri - May 7/8) or future workshops lmk - we're onboarding up to 50 companies at a time by helping you ship 12 weeks of roadmap in 2 days - a sort of reset on baseline velocity
no cost to attend beyond the inference you burn (you'll build a lot so not for the faint of heart)
Omar Khattab @lateinteraction ·
RT @samhogan: We’re introducing HALO 😇 Hierarchal Agent Loop Optimizer HALO is an RLM-based agent optimization technique capable of recur…
Sisyphus Labs @justsisyphus ·
RT @RayFernando1337: We aren't ready for this next generation of agentic engineers. 100k Github stars in 24 hours (claw-code), Yeachan Heo…
Aaron Levie @levie ·
Starting to hire and retrain for new agent engineering roles for *internal* functions to help get more powerful agents working well on critical business processes. I expect this type of role to be a very big deal over time at Box and other companies. It looks something like an internal FDE, whose job it is to wire up internal systems and get agents working with them effectively. The person will be extremely technical and capable of building secure, governed agents for internal workflows that connect to business systems (like Box, Salesforce, Workday, etc.), and codify workflows in skills. In some cases this person may understand the business process well enough to do it fully, but in most cases I expect them to work with the business directly in an embedded fashion. Ironically, that may introduce another new role on the business side that is more akin to agent product management for internal processes. The key is that you need technical + process people that can span multiple teams or functions in an organization. It’s not about bringing automation to a job, but bringing automation to a process. This is going to be a very big trend in most companies going forward. Fun to watch the early innings of what this will look like.
Matt Pocock @mattpocockuk ·
This week's skills changelog:
- /ubiquitous-language deprecated, use /grill-with-docs instead
- /grill-with-docs for codebases, /grill-me everywhere else
- Skills can now be used with any issue tracker
- Experimental /diagnose and /triage skills