Cursor vs. OpenAI & Anthropic
Why the Windsurf acquisition, Codex, and Claude Code pose threats to Cursor's moat
It’s obvious to me, based on how Claude 3.7 performs, that it was specifically post-trained on Claude Code. It performs amazingly in Claude Code, but poorly elsewhere. This could be trouble for Cursor.
Cursor is magic. For so much of last year, everyone in tech was asking “which agent will work first?” and the answer came in December with the release of Cursor’s Composer.
Cursor figured out that Claude 3.5 Sonnet was amazing at writing code, but needed to be prompted & scaffolded the right way to spit out workable diffs. They were the first company to crack this, and now Windsurf, Devin, and others have followed in their footsteps.
Cursor is extremely well positioned in some ways: if the models are commodities, Cursor is the “router.” Cursor gets to see all the models, learn what works best for each task, and route the user’s query to the best model for the job. In many ways this seems like a much stronger position than Anthropic being confined to a single family of models for its agentic coding products.
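To make the “router” idea concrete, here’s a minimal sketch of what task-based model routing could look like. The task labels and model names are my own illustrative assumptions, not Cursor’s actual implementation:

```python
# Illustrative sketch of task-based model routing -- not Cursor's actual code.
# Task labels and model names are assumptions for the example.

TASK_TO_MODEL = {
    "inline_edit": "claude-3-5-sonnet",    # small, precise diffs
    "agentic_build": "claude-3-7-sonnet",  # long-horizon, multi-file changes
    "quick_question": "gpt-4o-mini",       # cheap Q&A about the codebase
}

def route(task_type: str, default: str = "claude-3-5-sonnet") -> str:
    """Return the model that has historically performed best for this task type."""
    return TASK_TO_MODEL.get(task_type, default)

print(route("agentic_build"))  # -> claude-3-7-sonnet
```

The appeal of sitting at this layer is that the routing table can be updated the moment a new model wins on a given task, without retraining anything.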
However: I argue that the recent releases of Claude Code, Codex, and the acquisition of Windsurf may spell more trouble for Cursor than most realize — here’s why…
If you’ve used Claude 3.7 in Cursor, you know it to be overly ambitious, writing hundreds of lines of code it doesn’t need, completing more features than you asked for, etc. It does not work very well. 3.5 works better.
However, if you’ve used Claude Code, you’ve experienced the magic of Claude 3.7. It creates landing pages with complex animations. It one-shots entire features. It’s like Cursor on steroids for certain types of tasks.
It’s obvious to me, based on how Claude 3.7 performs, that it was specifically post-trained on Claude Code. Anthropic trained their bigger LLM, then specifically threw it into Claude Code & graded it using some reward function (presumably accepted diffs, passing unit tests, or similar). As a result, Claude 3.7 is amazing in Claude Code specifically, and not as good elsewhere. As a general rule, if you fine-tune a model on a specific task, it will often get worse on other tasks (you’re shifting the weights away from general next-token prediction toward the specific kinds of tokens the task rewards).
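To illustrate the kind of reward signal that might be involved (this is pure speculation on my part, not Anthropic’s published training setup), here’s a toy sketch where an agent run is scored on whether its diff applies cleanly and the repo’s tests still pass:

```python
# Toy sketch of a coding-agent reward signal -- speculative, not Anthropic's
# actual setup. Assumes a checked-out git repo and a pytest test suite.
import subprocess

def reward(repo_dir: str, diff_path: str) -> float:
    """Score one agent trajectory: +0.5 if the diff applies, +0.5 if tests pass."""
    applies = subprocess.run(
        ["git", "apply", "--check", diff_path], cwd=repo_dir
    ).returncode == 0
    if not applies:
        return 0.0
    subprocess.run(["git", "apply", diff_path], cwd=repo_dir, check=True)
    tests_pass = subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0
    return 0.5 + (0.5 if tests_pass else 0.0)
```

A reward like this reinforces trajectories that end in working code inside the harness, which is exactly why the gains would show up in Claude Code and not necessarily anywhere else.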
OpenAI
OpenAI’s Codex experience is similar. Parsing the OpenAI model lineup is a bit of a mess these days, but soon OpenAI itself will be a router circa GPT-4.5/5, and the coding-focused model it routes to will be (or already is) post-trained on Codex.
Now with the confirmed $3B Windsurf acquisition, OpenAI doesn’t just get a good AI IDE. More importantly, they get the existing data on how users interact with the Windsurf product & their code, allowing them to post-train future models to optimize Windsurf diffs.
In a world where every foundational lab is post-training their models on their native AI SWE tools, independent players should expect significantly worse performance. Cursor and others can try to remedy this by tweaking their scaffolding, custom models, and logic to optimize the performance of these new releases, but they’re effectively applying duct tape. It will never be the same as the factory-produced model + tool bundle the labs sell.
How could Cursor win the model game?
Cursor already has plenty of custom models for tab, model selection, etc. With $1B+ in funding, they will continue to train their own models, perhaps including flagship coding LLMs. It seems it will be challenging to compete with the capital advantages of OpenAI, Anthropic, & Google, but it’s possible Cursor starts to look like a foundational lab and raises $10B+ more for training.
It’s also possible we live in a future where open-source models like Llama & DeepSeek get good enough that Cursor can post-train these open-weights models on their data & create flagship coding LLMs with much less capital than the labs. I’d argue Cursor has demonstrated that it has the best technical product people as of now, and the most adoption. If open-source models are good enough, Cursor may be best positioned to win.
Other routes for Cursor to win AI SWE
Even if Cursor *never* builds their own flagship models & objectively is not best-in-class for LLM performance, it’s still possible that they:
1) continue to execute on product so well that users flock to them
Or 2) there are still big enough differences in which models perform each task best that being model-agnostic still gives Cursor the best positioning.
But the second point assumes OpenAI cuts off external models in Windsurf. It is possible that the Windsurf acquisition was a way for OpenAI to become more model-agnostic & use 3rd-party LLMs, in a way that’s explainable to competitors & the market. It’s easier to say “we don’t want to cut off access to external models for this existing userbase” than “we decided we’re going to put our competitors’ models in our product.”
Another headwind for Cursor is that OpenAI & Anthropic can afford to subsidize usage of their native IDEs. Windsurf may be free for most users, and OpenAI models may be 50% less expensive to use in Windsurf vs. Cursor. This is similar to what happened with Microsoft Teams & Slack: even though Slack objectively had the better product, free beats best (especially when free has heaps of existing distribution). My guess is this is why Cursor recently made its product free for students: building loyal users before OpenAI/Anthropic can undercut with a free AI coding offer.
Other open questions:
Will OpenAI & Anthropic even release their best models outside their own product?
Will Microsoft’s relationship with OpenAI further accelerate OpenAI’s coding capability?
Does Google end up building a top 3 AI SWE client?
Lastly I’ll say, I love the team at Cursor and I would like them to win. They’ve pioneered what the agent-human UX should look like TWICE (with tab & composer), and I think they have a strong shot at being a $100B company within 3-5 years. I wouldn’t bet against them.