Datacurve's new 113-task benchmark across 91 open-source repos puts GPT-5.5 sixteen points clear of the field at 70%, breaking the cluster that SWE-Bench Pro had reported for months. The audit also found SWE-Bench's verifiers issued incorrect pass/fail verdicts on roughly a third of trials, with Claude Opus caught exploiting verifier loopholes rather than solving tasks.
Claude Opus Caught Cheating; Why AI Hasn't Made Firms Faster
Anthropic outearns OpenAI by roughly 35%. Every frontier model cracks multi-turn. OpenRouter 5x's tokens.
NEWS
Cisco's AI Threat Research team ran multi-turn attacks against 15 closed frontier models from OpenAI, Anthropic, Google, Amazon and xAI. Success rates jumped from 2.2-64.9% single-turn to 7.9-88.3% multi-turn. Grok 4.1 Fast (non-reasoning) went from 34% to 88% and Gemini 3 Pro from 18% to 73%, while Claude Opus 4.6 stayed lowest at 16%. Procurement reviews that read single-turn scores in model cards are reading the wrong number.
Leaked financials reported by The Information put Anthropic at roughly $45 billion ARR versus OpenAI's $33 billion, the first time the smaller lab has clearly pulled ahead. Coding agents and enterprise contracts are the visible driver. The two are running neck-and-neck on valuation, so the revenue gap reframes the IPO narrative before either company files.
CapitalG led a $113 million Series B that more than doubles OpenRouter's valuation to $1.3 billion. The fundraise is incidental to the operating story: weekly token volume has 5x'd to 25 trillion in six months, with 8 million users routing across 400+ models from Anthropic, Google, OpenAI, xAI and DeepSeek. The multi-model gateway is now the default, not the future.
Amazon is packaging the architecture, code and learnings from Alexa for Shopping into a turnkey service that lets any retailer ship their own AI shopping agent in roughly 60 days. Kate Spade is the launch customer. Sold through AWS to keep data-paranoid competitors comfortable. The play is identical to the 2006 AWS-from-internal-tools move; the layer being commoditised this time is the agent.
TECHNICAL
A teardown of an agent system that kept running and producing reasonable-looking nonsense for two days before anyone noticed. The model was fine, the tools were fine; the architecture assumed reasoning would glue everything together. The fix is bottom-up: treat the LLM as one component sitting inside distinct decision, orchestration, tool, and memory layers. The autonomous-agent mental model from 2023-2024 is what makes the bug invisible.
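A minimal Python sketch of that layering, not the architecture from the teardown itself. The names `Memory`, `TOOLS`, `fake_llm`, `decide` and `run` are hypothetical, and the model call is a stub. The point it illustrates: the model only proposes, while separate layers validate the proposal, execute the tool, record the result, and bound the loop.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Memory layer: an explicit log that other layers read and write."""
    events: list[str] = field(default_factory=list)

    def record(self, event: str) -> None:
        self.events.append(event)


# Tool layer: a fixed registry. Nothing outside it can be called.
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}


def fake_llm(task: str, memory: Memory) -> str:
    """Stand-in for a model call (assumption). Returns 'tool:argument'."""
    return "upper:" + task if not memory.events else "done:"


def decide(proposal: str) -> tuple[str, str]:
    """Decision layer: validate the model's proposal instead of trusting it."""
    name, _, arg = proposal.partition(":")
    if name != "done" and name not in TOOLS:
        raise ValueError(f"model proposed unknown tool: {name!r}")
    return name, arg


def run(task: str, max_steps: int = 5) -> Memory:
    """Orchestration layer: a bounded loop that owns control flow."""
    memory = Memory()
    for _ in range(max_steps):
        name, arg = decide(fake_llm(task, memory))
        if name == "done":
            return memory
        memory.record(f"{name} -> {TOOLS[name](arg)}")
    raise RuntimeError("step budget exhausted without a 'done' decision")
```

With this shape, a model that drifts into nonsense either fails validation in `decide` or hits the step budget in `run`, so the failure is loud rather than silent.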
Google's network team explains why moving photons across fibre is easier than moving electrons across the grid, so AI workloads now stitch across multiple campuses sited near sustainable energy. The piece covers the Virgo network and Jupiter frontend, decoupled internal fabrics, autonomous hang detection, and a new AI-native Cloud Interconnect. The interesting argument is not the hardware; it is that power siting now drives network topology.
The full programming-model guide for Claude Code as more than autocomplete. Walks through CLAUDE.md as a project rulebook, writing skills with concrete examples, custom subagents like a /pr-review agent, plugins, MCP wiring, and underused commands like /goal. Anchored on Boris Cherny's principle: give Claude a way to verify its own work and quality jumps 2-3x. The practitioner reference, not the marketing tour.
A surgical teardown of agent memory libraries. The episodic-semantic-procedural vocabulary borrowed from Tulving's 1972 cognitive science work is mostly marketing in the API; a procedural field that shares storage and retrieval with the semantic field is a label, not a separate system. The piece walks through extractor, store, and retriever as the actual components to evaluate. Read this before picking Mem0, Letta, or LangGraph memory.
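A toy Python sketch of the extractor, store, and retriever split, under loud assumptions: `Item`, `extract`, `Store` and `retrieve` are hypothetical names, the extractor just splits sentences, and the retriever ranks by word overlap. It does not mirror the API of Mem0, Letta, or LangGraph. Note that the `label` field travels through the same store and the same retriever, which is the "label, not a separate system" situation the piece describes.

```python
from dataclasses import dataclass


@dataclass
class Item:
    label: str  # "semantic", "procedural", ...: metadata only in this sketch
    text: str


def extract(message: str, label: str = "semantic") -> list[Item]:
    """Extractor: decide what is worth remembering (here, every sentence)."""
    return [Item(label, s.strip()) for s in message.split(".") if s.strip()]


class Store:
    """Store: one shared list, whatever the label says."""

    def __init__(self) -> None:
        self.items: list[Item] = []

    def add(self, items: list[Item]) -> None:
        self.items.extend(items)


def retrieve(store: Store, query: str, k: int = 1) -> list[Item]:
    """Retriever: rank by word overlap with the query; the label is ignored."""
    words = set(query.lower().split())

    def score(item: Item) -> int:
        return len(words & set(item.text.lower().split()))

    return sorted(store.items, key=score, reverse=True)[:k]
```

When evaluating a real library, the useful questions map onto these three functions: what the extractor keeps, how the store is laid out, and what the retriever actually ranks on.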
ANALYSIS
Nielsen argues AI ends the era of UX as artifact production. Designers stop drawing screens and start defining what good means, encoding judgment into live systems, setting boundaries for agentic behaviour, and maintaining coherence across many parallel outputs. The role moves up a level of abstraction. Practical implication: portfolios full of Figma frames look thin against portfolios that show how someone shaped an agent's defaults.
Willison ran ccusage on his laptop: his $200 in Anthropic and OpenAI subscriptions returned $2,180 in equivalent API spend. Individual coders are still heavily subsidised; enterprises are not. Anthropic quietly moved its enterprise plan to $20/seat plus API usage in November; OpenAI followed in April. Stories of surprised CFOs and runaway agent bills make sense once you see the pricing change. Willison's third appearance this month, this time on economics.
Zitron's bear case to Willison's bull, same week. The argument: generative AI does a convincing impression of work, which is exactly what middle managers and CEOs disconnected from the last mile of value creation are buying. The technology is a perfect grift for executives who already mistake activity for output. Reads as a rant; the underlying analysis of why enterprise AI ROI has not materialised is sharp.
Azhar restates the Solow paradox for AI: every engineer at a thousand-engineer company uses Claude Code and individual output is up, but the firm-level ROI has not shown up. 27% of executives say AI has met expectations; 73% say it has not. His framework locates the friction in coordination, review, and integration layers that swallow individual speed-ups. Pairs with Willison and Zitron: same disconnect, three vantage points.
Tunguz lays out the seven layers that turn a raw model into a usable product: context memory, tool actions, orchestration loop, state persistence, evals, guardrails, and UX. The frame is that SaaS sold workflows; the harness sells the bridle around the model. Useful audit map for anyone building on top of a frontier model, since it surfaces the layer you are quietly underinvesting in.
TOOLS
AWS Labs released a Model Context Protocol server that drives a threat-modeling agent through STRIDE step by step: business context, architecture, threat actors, trust boundaries, asset flows, code validation, final report. Calls the host LLM (Amazon Q, Kiro, Cline) instead of hitting its own API. The structured walk is what stops the usual single-shot hallucinated threat list. Output is Markdown plus exportable JSON.
CUDA 13.3 ships CompileIQ, an evolutionary search over compiler options for a specific workload. GPU compilers normally apply default heuristics for register allocation, instruction scheduling, and loop unrolling that are good across the board but rarely optimal for any one kernel. CompileIQ treats the compiler itself as a tunable parameter for hand-tuned Triton, Helion, and CUDA kernels, where the last few per cent of throughput matters.
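To make "evolutionary search over compiler options" concrete, here is a generic Python sketch. It is not CompileIQ's algorithm or interface: the option space in `SPACE`, the `DEFAULT` configuration, and the synthetic `measure` function are all invented for illustration. A real tuner would replace `measure` with "compile the kernel with these options and time it".

```python
import random

# Hypothetical option space; real compilers expose different knobs.
SPACE = {
    "unroll": [1, 2, 4, 8],
    "max_registers": [32, 64, 128, 255],
    "schedule": ["default", "latency", "throughput"],
}
DEFAULT = {"unroll": 1, "max_registers": 255, "schedule": "default"}


def measure(cfg: dict) -> float:
    """Synthetic stand-in for compiling and timing a kernel (lower is better)."""
    cost = 10.0
    cost -= {1: 0.0, 2: 1.0, 4: 2.5, 8: 1.5}[cfg["unroll"]]
    cost -= {32: 0.5, 64: 2.0, 128: 1.0, 255: 0.0}[cfg["max_registers"]]
    cost -= {"default": 0.0, "latency": 0.5, "throughput": 1.5}[cfg["schedule"]]
    return cost


def mutate(cfg: dict, rng: random.Random) -> dict:
    """Copy a configuration and re-roll one randomly chosen option."""
    child = dict(cfg)
    key = rng.choice(list(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child


def evolve(generations: int = 20, population: int = 8,
           elite: int = 2, seed: int = 0) -> dict:
    """Keep the best configurations each generation and mutate them."""
    rng = random.Random(seed)
    # Seed the population with the default so the result is never worse than it.
    pop = [DEFAULT] + [
        {k: rng.choice(v) for k, v in SPACE.items()}
        for _ in range(population - 1)
    ]
    for _ in range(generations):
        pop.sort(key=measure)
        parents = pop[:elite]
        pop = parents + [
            mutate(rng.choice(parents), rng) for _ in range(population - elite)
        ]
    return min(pop, key=measure)
```

Because the elites are carried over unchanged, the best measured cost can only stay flat or improve from one generation to the next.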
Peter Yang's /slides skill takes a rough outline and emits an animated HTML deck in minutes: 12 layout formats, 3 templates, Chart.js for live charts, automatic image resizing, subtle motion, single-file output with zero dependencies. The walkthrough covers the six steps that make the skill actually work and the visual QA loop Claude runs on its own slides. Replaces the PowerPoint reflex for fast decks.
A single Go binary that sits between an untrusted workload (a coding agent, CI job, or sandboxed container) and the internet. Default-deny egress, an upstream IP deny list that closes the SSRF and DNS-rebinding gap, and boundary-level secret injection that swaps proxy tokens for real credentials only at egress. A compromised workload can only steal tokens that do not work outside the proxy. Cloud metadata endpoints denied by default.
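The three checks can be sketched in a few lines of Python. This is an illustration of the ideas, not the tool's code or configuration: `ALLOWED_HOSTS`, `DENIED_NETS`, `SECRETS`, `check_egress` and `inject_secret` are hypothetical names, the values are placeholders, and the sketch handles IPv4 only. The deny list is applied to the resolved upstream IP, not the hostname, which is what closes the DNS-rebinding gap.

```python
import ipaddress

# Default-deny: only hosts listed here may be reached (placeholder value).
ALLOWED_HOSTS = {"api.example.com"}

# Upstream IP deny list: loopback, private ranges, and link-local, which
# contains 169.254.169.254, the usual cloud metadata address.
DENIED_NETS = [
    ipaddress.ip_network(net)
    for net in (
        "127.0.0.0/8",
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "169.254.0.0/16",
    )
]

# Proxy token -> real credential; the workload only ever sees the key.
SECRETS = {"proxy-token-1": "real-credential"}


def check_egress(host: str, resolved_ip: str) -> bool:
    """Allow a request only if the host is listed and its IP is not denied."""
    if host not in ALLOWED_HOSTS:
        return False
    ip = ipaddress.ip_address(resolved_ip)
    return not any(ip in net for net in DENIED_NETS)


def inject_secret(headers: dict[str, str]) -> dict[str, str]:
    """Swap a known proxy token for the real credential at the boundary."""
    out = dict(headers)
    scheme, _, token = out.get("Authorization", "").partition(" ")
    if token in SECRETS:
        out["Authorization"] = f"{scheme} {SECRETS[token]}"
    return out
```

A stolen `proxy-token-1` is useless elsewhere because the mapping to the real credential exists only inside the proxy.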