Issue #112 · 40 min read · 20 stories

Claude Opus 4.8 Ships; You Can't Manage 20 Agents at Once

Anthropic passes OpenAI at $965B. 10,000 agents hunt Ramp's own bugs. Robinhood opens to agents.

Cognition raised its own billion at a $26B valuation with Devin at $492M run-rate, so the day's two biggest cheques both went to coding agents. Elsewhere OpenAI and Thrive showed how they built a tax agent that rewrites itself through Codex, ElevenLabs put out Music v2 with API prices halved, and CrowdStrike helped take down a botnet that spent two years hunting open-source developers. There's also a $9.99 Mac app that scans your AI chat logs for leaked keys.

NEWS

Cognition raised over $1 billion at a $26 billion valuation in a round led by Lux Capital, General Catalyst, and 8VC. Its AI software engineer Devin now runs at $492 million in run-rate revenue, with enterprise usage up more than tenfold since the start of the year. The company points to outcomes like Mercedes-Benz cutting an eight-month legacy modernisation project down to eight days, and to customers including Citi, Goldman Sachs, and the US Army.

Anthropic upgraded its flagship to Claude Opus 4.8, just 41 days after 4.7 and at the same price. Users on claude.ai can now control how much effort Claude spends on a task, Claude Code gains a Dynamic Workflows feature that coordinates hundreds of parallel subagents for codebase-scale migrations, and fast mode runs at 2.5x speed for a third of the previous cost. Early testers say it flags its own uncertainty more readily.

Anthropic raised $65 billion in Series H funding led by Altimeter, Dragoneer, Greenoaks, and Sequoia, valuing it at $965 billion post-money and pushing it past OpenAI as the most valuable AI startup. Run-rate revenue crossed $47 billion earlier this month, up sharply since February's Series G. The round folds in $15 billion of previously committed hyperscaler money and funds compute, safety research, and product expansion.

ElevenLabs released Music v2, its latest generative music model, with better vocals, instrumentation, and arrangement across genres plus improved multilingual support. It powers three products: ElevenMusic for listening and remixing, ElevenAPI for embedding music generation into your own product, and ElevenCreative for downloadable tracks for ads and video. The company also cut pricing by up to 50% on the API and up to 40% for Creative self-serve customers.

Robinhood launched Agentic Trading and an Agentic Credit Card, letting customers connect their own AI agents to trade stocks and spend through Robinhood's MCP servers. Agents get a dedicated trading account kept separate from the main portfolio, with built-in safety controls and a real-time activity feed. The equities beta works with agents built on tools like Claude or Cursor.

CrowdStrike, working with Google and Shadowserver, disrupted Glassworm, a botnet that spent two years pushing malware and stealing passwords from open-source software developers. The operation went after attackers who compromise individual developer machines to reach the wider supply chain. As CrowdStrike put it, adversaries are no longer just targeting products; they are targeting the developers who build them, because one compromised workstation can cascade into thousands of downstream organisations.

YouTube is moving past creator self-disclosure and will start automatically applying a label when its systems detect significant photorealistic AI use in a video. It is also making the existing AI-content labels more prominent for viewers. The shift matters for anyone publishing AI-generated or heavily edited footage, since the platform will now flag it whether or not the uploader discloses the AI use themselves.

Meta still makes almost all its money from advertising, yet the stock is down on the year while capital expenditure climbs toward $125 to $145 billion. Sherwood reports the company is now pushing into subscriptions and enterprise, and even floating a cloud business, to find revenue that justifies the spend. It is a real shift for a firm whose asset-light software model long funded itself on ads alone.

TECHNICAL

Calling torch.compile can make a PyTorch model run up to 10x faster. Without compilation the GPU launches a separate kernel for every operation, paying launch overhead and writing intermediate results to memory each time. The Inductor compiler fuses dependent operations into single Triton kernels, keeping data in fast registers and cutting memory traffic and launch costs. The post illustrates this with a concrete vertical-fusion example.
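
To make the fusion idea concrete, here is a conceptual sketch in plain Python rather than PyTorch. It is not what Inductor generates, and the three operations (multiply, add, ReLU) are an assumed example chain. List comprehensions stand in for GPU kernels, and each full intermediate list stands in for a round trip to memory.

```python
# Conceptual illustration only: plain Python standing in for GPU kernels.
# The op chain (x * 2 + 1, then ReLU) is an assumed example, not from the post.

def unfused(xs):
    """Three separate 'kernels', each writing a full buffer to memory."""
    buffers_written = 0
    scaled = [x * 2.0 for x in xs]         # kernel 1: multiply
    buffers_written += 1
    shifted = [s + 1.0 for s in scaled]    # kernel 2: add
    buffers_written += 1
    out = [max(v, 0.0) for v in shifted]   # kernel 3: ReLU
    buffers_written += 1
    return out, buffers_written


def fused(xs):
    """One 'kernel': each element passes through all three ops before it is stored."""
    out = [max(x * 2.0 + 1.0, 0.0) for x in xs]
    return out, 1


data = [-2.0, -0.5, 0.0, 1.5]
unfused_out, unfused_writes = unfused(data)
fused_out, fused_writes = fused(data)
print(fused_out, unfused_writes, fused_writes)
```

Both versions compute the same result; the fused one writes one buffer instead of three, which is the saving vertical fusion aims for on real hardware.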

Databricks serves more than 125 trillion tokens a month, and this engineering post argues that at that scale LLM inference is a reliability problem before it is a throughput one. It walks through the failure modes of spiky agent traffic on unreliable GPU fleets, including the blast radius when a single node goes down in disaggregated prefill and decode setups, and how the team holds p95 latency under load.

OpenAI forward-deployed engineers and Thrive Holdings built Tax AI for Crete's network of 30-plus accounting firms, processing 7,000 returns this season at up to 97% accuracy and saving practitioners about a third of their prep time. It also measurably improves itself: the share of returns reaching 75% field completion climbed from 25% to 86% in six weeks, via a loop that turns practitioner corrections into production traces, then targeted evals, then scoped Codex tasks.

Ramp pointed roughly 10,000 coding agents at its backend over an eight-hour run, each told to find one vulnerability, then used thousands more agents to deduplicate and reproduce the results. Seven confirmed high-severity bugs survived, all novel and missed by prior pen tests, bug bounties, and AI scans. The pipeline is model-agnostic: cheaper open-weight models like Kimi K2.6 and DeepSeek V4 Pro still surfaced real high-severity issues.

ANALYSIS

Raindrop's founder argues agent evaluation has gotten needlessly complicated, with vendors selling strategies that diverge from what actually works in production. Drawing on building agents for companies like Framer, Clay, and Vercel, the guide favours raising the reliability floor in critical workflows over chasing benchmark scores: do error analysis, teach agents to say "I don't know", define golden cases, inspect trajectories locally, and lean on code-aware offline evals.

Lenz.io ran 1,000 real-world factual claims past a panel of five frontier models and found that on 67% of them, at least one model dissents from the majority verdict. The takeaway for anyone building evals or using an LLM as a judge: a single model is not ground truth. Treating agreement as a signal and disagreement as a flag for human review beats trusting one confident answer.
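
A minimal sketch of that agreement-as-signal idea, assuming each model returns a simple label per claim. The function name, the labels, and the rule of flagging any dissent are illustrative assumptions, not Lenz.io's method.

```python
from collections import Counter


def panel_verdict(verdicts):
    """Return (majority_label, needs_review) for one claim.

    majority_label is None when no label has a strict majority.
    needs_review is True whenever the panel is not unanimous.
    """
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    has_majority = top_count > len(verdicts) / 2
    unanimous = top_count == len(verdicts)
    return (top_label if has_majority else None), not unanimous


# Hypothetical verdicts from a five-model panel.
print(panel_verdict(["true", "true", "true", "true", "true"]))
print(panel_verdict(["true", "true", "false", "true", "true"]))
print(panel_verdict(["true", "true", "false", "false", "unverifiable"]))
```

Only the unanimous claim skips review; the single-dissent case still yields a majority label but is flagged, and the split panel yields no label at all.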

With 42% of committed code now AI-assisted and roughly 29% of it merged without manual review, InfoWorld describes a treadmill: generate fast, then bolt on ever more checks to catch the slop. Its fix is an AI assembly model that shrinks the surface area needing review in the first place, rather than scaling guardrails linearly with every new AI-built feature.

Drawing on the NeurIPS Artificial Hivemind paper, this piece shows how more than 70 language models collapse toward identical phrasing and metaphors, with DeepSeek-V3 and GPT-4o independently producing the same "Elevate your iPhone with our" line. The argument for marketers and builders: letting LLMs draft brand copy unedited flattens every brand into one default voice, so the editing pass is where any distinctiveness now has to come from.

Addy Osmani argues that spinning up AI agents is cheap, but closing the loop is not: every agent's output still routes through one serial reviewer, you. Borrowing the GIL and Amdahl's Law, he shows that adding agents past your review rate just deepens a queue of unmerged work and lowers your standards. The fix is to treat attention as the scarce resource and scale your fleet to what you can actually review.
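
The arithmetic behind that argument can be sketched in a few lines. The rates below (one pull request per agent per hour, four reviews per hour, an eight-hour day, 80% parallelisable work) are hypothetical numbers chosen for illustration, not figures from Osmani's piece.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def review_backlog(agents, prs_per_agent_per_hour, reviews_per_hour, hours):
    """Unreviewed PRs left after `hours` when one person reviews serially."""
    produced = agents * prs_per_agent_per_hour * hours
    reviewed = min(produced, reviews_per_hour * hours)
    return produced - reviewed


# Hypothetical day: 8 hours, reviewer clears 4 PRs per hour.
print(review_backlog(agents=4, prs_per_agent_per_hour=1, reviews_per_hour=4, hours=8))
print(review_backlog(agents=20, prs_per_agent_per_hour=1, reviews_per_hour=4, hours=8))

# If 20% of the work (the review) stays serial, 20 agents give about a 4.2x speedup,
# and no number of agents can exceed 5x.
print(round(amdahl_speedup(0.8, 20), 2))
```

With four agents the reviewer keeps pace and the backlog is zero; with twenty, 128 pull requests are still waiting at the end of the day.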

TOOLS

Sieve is a $9.99 Mac app that scans the chat histories of Claude Code, Cursor, Copilot, Windsurf, and Codex, plus your .env files, for secrets you pasted or that turned up in autocomplete. It can redact found keys straight from VS Code's SQLite databases, stores rotated values in the macOS Keychain behind Touch ID, and ships a local MCP server so Claude Code can check for exposures without seeing raw values.
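
For a sense of what this kind of scan involves, here is a minimal sketch of pattern-based detection and redaction. It is not Sieve's implementation: the single pattern (the widely used shape of an AWS access key ID, "AKIA" plus 16 uppercase alphanumerics), the function name, and the sample line are all assumptions, and a real scanner would cover many more key formats.

```python
import re

# Assumed pattern: AWS access key IDs are commonly matched as
# "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact_secrets(text):
    """Return (redacted_text, number_of_matches)."""
    return AWS_KEY_ID.subn("[REDACTED]", text)


fake_key = "AKIA" + "EXAMPLE0" * 2  # synthetic value, not a real credential
log_line = f"export AWS_ACCESS_KEY_ID={fake_key} # pasted into chat"
clean, found = redact_secrets(log_line)
print(found, clean)
```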

Cate is an open-source desktop IDE built around an infinite canvas instead of stacked tabs. You arrange code, terminals, browsers, documents, and AI agents spatially in a per-project workspace, and the layout, including working directories, comes back exactly as you left it across sessions. Instead of cycling a dozen scattered terminals and agent windows, each project gets one persistent canvas you navigate by position.

bumblebee is an open-source tool from Perplexity that inventories the packages, extensions, and developer-tool metadata on macOS and Linux dev machines. When a supply-chain advisory names a compromised package and version, you can instantly check which of your laptops actually have a match across npm, pnpm, PyPI, Go, and MCP configs. It ships as a single static Go binary with no dependencies, built for fast incident response.
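
The advisory check described above boils down to matching an (ecosystem, package, version) triple against each machine's inventory. The sketch below shows that matching step only; the data layout, host names, and package name are hypothetical and do not reflect bumblebee's actual output format or CLI.

```python
def find_exposed_hosts(inventory, advisory):
    """Return sorted host names that have the advised package at a bad version."""
    key = (advisory["ecosystem"], advisory["package"])
    bad_versions = set(advisory["versions"])
    hits = []
    for host, packages in inventory.items():
        if any((eco, name) == key and version in bad_versions
               for eco, name, version in packages):
            hits.append(host)
    return sorted(hits)


# Hypothetical inventory: host -> list of (ecosystem, package, version).
inventory = {
    "laptop-a": [("npm", "example-pkg", "1.3.0"), ("pypi", "other-pkg", "2.0.0")],
    "laptop-b": [("npm", "example-pkg", "1.2.0")],
    "laptop-c": [("pypi", "example-pkg", "1.3.0")],
}
advisory = {"ecosystem": "npm", "package": "example-pkg", "versions": ["1.3.0"]}

exposed = find_exposed_hosts(inventory, advisory)
print(exposed)
```

Matching on the ecosystem as well as the name matters here: laptop-c has a same-named package at the same version, but from PyPI rather than npm, so it is not flagged.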