GPT-5.5 Launches. DeepSeek Drops V4 Open Weights.
Anthropic hits $1T. Pentagon wants $54B for AI drones. Tesla buries a $2B deal.
OpenAI released GPT-5.5, a model that handles messy, multi-part tasks with less hand-holding. It scores 82.7% on Terminal-Bench 2.0, narrowly beating Claude Mythos Preview, and matches GPT-5.4's per-token latency despite being substantially more capable. The model uses fewer tokens to complete the same Codex tasks, making it both smarter and cheaper to run. Available now in ChatGPT and Codex for Plus, Pro, Business, and Enterprise users.
DeepSeek open-sourced V4 in two sizes: Pro (1.6T total, 49B active) and Flash (284B total, 13B active). Both ship with 1M context as standard, using a novel sparse attention mechanism that cuts long-context compute costs compared to V3. Pro rivals top closed-source models on reasoning and agentic coding. Flash matches Pro on simpler agent tasks at far lower cost, with API pricing starting at $0.028 per million input tokens.
Anthropic has surpassed OpenAI in secondary market valuation, reaching $1 trillion amid frantic investor demand. Buyers are paying more than sixteen times the company's last primary round valuation of $61 billion from March. The premium reflects growing enterprise adoption of Claude across federal agencies and Fortune 500 companies. Just two months ago, Anthropic was fielding $800 billion offers.
A one-sentence disclosure in Tesla's Q1 10-Q filing reveals a deal to acquire an unnamed AI hardware company for up to $2 billion in stock and equity awards. Only $200 million is guaranteed, with the remaining $1.8 billion tied to deployment milestones. The filing names no target and provides no description. Tesla never mentioned the deal during its earnings call, despite it being 20 times larger than any previous acquisition.
The Pentagon's 2027 budget requests over $54 billion for the Defense Autonomous Warfare Group, a 24,000% increase over the prior year. The funding covers autonomous systems across air, land, and sea, including the "Drone Dominance" programme. Former CIA director David Petraeus called it the largest single commitment to autonomous warfare in history. AI safety researchers warn that every frontier system tested by the UK AI Security Institute had exploitable safeguard failures.
Chip design startup Verkor claims its agentic AI system, Design Conductor, autonomously produced a complete RISC-V CPU core from a 219-word requirements document in 12 hours. The resulting VerCore processor met timing at 1.48 GHz on a 7nm process design kit. During optimisation, the system independently selected a Booth-Wallace multiplier after testing multiple variants. The chip exists only in simulation, but commercial chip design typically takes 18 to 36 months.
CNBC reports that Meta's Model Capability Initiative captures employee keystrokes and mouse clicks across hundreds of websites and apps, including Google, LinkedIn, GitHub, and Slack. The surveillance tool monitors how employees use their work computers to generate training data for AI agents that can perform office and coding tasks. Meta confirmed the project but declined to comment on the tracked sites. The list of tracked sites originally included ChatGPT and Claude and remains in flux.
DigitalOcean's practitioner guide breaks down why scaling LLM inference is nothing like scaling web services. Push throughput up and latency rises. Clamp latency down and GPU costs inflate. Optimise for cost and you compromise on both. The article unpacks four distinct cost dimensions (capital, operational, opportunity, and amortised per-token) and walks through hardware selection, batching strategies, and benchmarking approaches that reveal the real cost surface.
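A minimal sketch of how the capital and operational dimensions blend into an amortised per-token figure. All numbers and the function itself are illustrative assumptions, not figures from the DigitalOcean guide:

```python
# Hypothetical sketch: blend capital and operational spend into a
# per-token cost, the "amortised per-token" dimension from the article.

def amortised_cost_per_token(
    gpu_capital_usd: float,       # up-front hardware spend
    amortisation_months: int,     # depreciation window
    monthly_opex_usd: float,      # power, cooling, hosting
    utilisation: float,           # fraction of capacity actually serving
    peak_tokens_per_month: float, # tokens/month at 100% utilisation
) -> float:
    monthly_capital = gpu_capital_usd / amortisation_months
    served_tokens = peak_tokens_per_month * utilisation
    return (monthly_capital + monthly_opex_usd) / served_tokens

# Assumed figures: a ~$30k card amortised over 36 months, $500/month
# opex, 40% utilisation, 2.6B tokens/month at peak capacity.
cost = amortised_cost_per_token(30_000, 36, 500, 0.40, 2.6e9)
print(f"${cost * 1e6:.2f} per million tokens")  # ≈ $1.28
```

Note how utilisation sits in the denominator: this is the opportunity-cost lever the guide describes, where idle GPUs silently inflate the real per-token price.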
Augment Code studied AGENTS.md files across their monorepo and found the gap between best and worst was equivalent to upgrading from Haiku to Opus. Files between 100 and 150 lines with focused reference documents performed best, delivering 10-15% improvements across all metrics. Longer files reversed the gains. The most effective pattern was numbered procedural workflows for multi-step tasks, which moved agents from unable-to-complete to correct on the first try.
Anthropic traced reports of degraded Claude responses to three separate changes. On March 4, they defaulted reasoning effort from high to medium (reverted April 7). A March 26 bug kept clearing thinking context every turn instead of once (fixed April 10). An April 16 verbosity reduction hurt coding quality (reverted April 20). The API was never affected. All fixes shipped in Claude Code v2.1.116, and Anthropic is resetting subscriber usage limits.
Called in to assess whether a retailer needed 96 H100s for holiday traffic, the author discovered a bimodal workload hiding behind averaged metrics. Prompt processing ran at 92% GPU utilisation for 200ms per request. Token generation dropped to 30% for 3-9 seconds. Splitting prefill and decode onto separate hardware, following UC San Diego's DistServe paper, doubled effective throughput without buying a single new card.
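A toy trace showing how time-weighted averaging buries a bimodal workload like the one in the story. The durations and utilisation figures are invented for illustration:

```python
# Hypothetical trace: short prefill bursts at high GPU utilisation,
# long decode phases at low utilisation, as in the retailer's workload.
prefill = [(0.2, 0.92)] * 100  # (duration_s, utilisation) per request
decode = [(6.0, 0.30)] * 100

phases = prefill + decode
total_time = sum(d for d, _ in phases)
# Time-weighted average: what a coarse dashboard metric reports.
avg_util = sum(d * u for d, u in phases) / total_time
print(f"averaged utilisation: {avg_util:.0%}")  # 32%: matches neither phase
```

Decode dominates wall-clock time, so the average lands near 30% and hides the 92% prefill spikes entirely, which is exactly why disaggregating the two phases onto separate hardware pays off.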
A fraud-detection model trained on certified synthetic data began failing completely on entire transaction classes three months after deployment. All quality metrics still passed. The problem: standard fidelity metrics evaluate individual feature distributions in isolation, never testing how features interact. The article identifies three specific failure modes in the fidelity-utility-privacy framework and proposes measuring conditional distributions and feature interactions rather than marginals alone.
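A small numpy sketch of the failure mode the article describes: a synthetic dataset whose marginals are indistinguishable from the real data but whose feature interactions are destroyed. The correlated-feature setup and variable names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# "Real" data: two strongly correlated features (e.g. amount and fee).
real_a = rng.normal(size=n)
real_b = real_a + 0.1 * rng.normal(size=n)

# "Synthetic" data with the SAME marginals but the joint structure gone,
# mimicking a generator that models each feature independently.
synth_a = real_a.copy()
synth_b = real_b.copy()
rng.shuffle(synth_b)  # marginal of b is unchanged by shuffling

# A marginal fidelity check passes: the distributions match exactly.
assert abs(real_b.mean() - synth_b.mean()) < 1e-9
# An interaction check fails: correlation collapses from ~1.0 to ~0.
print(np.corrcoef(real_a, real_b)[0, 1])    # ≈ 0.995
print(np.corrcoef(synth_a, synth_b)[0, 1])  # ≈ 0.0
```

Any downstream model relying on the amount-fee relationship would silently fail on this synthetic data while every per-feature quality metric still passes.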
Mollick tested GPT-5.5 Pro against models from o3 to Kimi K2.6 on a coding challenge: build a procedurally generated 3D harbour town simulation evolving from 3000 BCE to 3000 CE. Only GPT-5.5 Pro actually modelled an evolving town rather than swapping in replacement buildings. He argues the release proves AI improvement has not plateaued, while noting the frontier of capability remains jagged across domains. GPT-5.5 Pro also completed the task in 20 minutes versus 33 for GPT-5.4 Pro.
The Verge's editor-in-chief coined "software brain" to describe the tech industry's tendency to reduce every problem to algorithms and databases. Polling shows AI has worse favourability than ICE, Gen Z anger toward the technology jumped from 22% to 31% in a year, and politicians backing data centres are getting voted out. Patel argues the backlash stems from an industry that has not earned social permission for the resources it consumes.
The debate between Python and Markdown for agent specification has become one of AI's most consequential architectural questions. Code-maximalism strips out the reasoning that makes agents useful. Markdown-maximalism produces systems you cannot debug or improve. The piece argues both positions are the same mistake: neither is agent-native. Real agent architecture requires the flexibility to pick the right approach for each subtask, not a single spec language for everything.
HuggingFace released ml-intern, an open-source agent that reads research papers, trains models, and ships them to the Hub. The project hit 2,990 stars with 530 added in 24 hours. Built in Python, it automates the full ML research workflow from literature review through model training and deployment. The agent handles paper comprehension, experiment design, hyperparameter tuning, and model packaging without human intervention.
Context Mode sandboxes tool output for AI coding agents, reducing context window consumption by 98%. The TypeScript project works across 12 platforms and has accumulated 9,362 stars with 302 added in a single day. For teams hitting context limits on long coding sessions, this addresses one of the most common friction points with agentic workflows. The approach intercepts and compresses tool responses before they reach the model.
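A hypothetical sketch of the intercept-and-compress idea, not Context Mode's actual API: clamp oversized tool output to a character budget before it enters the model's context, keeping the head and tail where errors and summaries usually live:

```python
# Hypothetical interceptor (function name and budget are assumptions):
# bound tool output before it reaches the model's context window.
def compress_tool_output(output: str, budget_chars: int = 2000) -> str:
    """Keep the head and tail of oversized output, elide the middle."""
    if len(output) <= budget_chars:
        return output
    keep = budget_chars // 2
    omitted = len(output) - 2 * keep
    return output[:keep] + f"\n…[{omitted} chars elided]…\n" + output[-keep:]

log = "line\n" * 50_000  # a ~250KB build log from a single tool call
print(len(compress_tool_output(log)))  # bounded near the 2,000-char budget
```

Real implementations typically go further than truncation (summarising or storing the full payload for on-demand retrieval), but even this naive clamp shows where the 98% reduction comes from on verbose tool calls.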
Honker is a SQLite extension that brings Postgres-style pub/sub, task queues, and event streams without a separate broker like Redis. Queue operations commit atomically with business writes in the same transaction, so rollback drops both. Built in Rust with bindings for Python, Node, Bun, Ruby, Go, Elixir, and C++, it uses WAL file notifications for push semantics with single-digit millisecond delivery. Eliminates the dual-write problem for SQLite-first projects.
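The transactional-queue pattern Honker provides natively can be sketched with plain sqlite3; table and column names here are invented, and this is the generic pattern rather than Honker's own API:

```python
import sqlite3

# One transaction covers both the business write and the queue message,
# so a rollback drops both: no dual-write window.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE queue  (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT);
""")

try:
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT INTO orders (total) VALUES (?)", (42.0,))
        db.execute("INSERT INTO queue (topic, payload) VALUES (?, ?)",
                   ("order.created", '{"total": 42.0}'))
        raise RuntimeError("payment failed")  # simulate a mid-transaction error
except RuntimeError:
    pass

# The rollback dropped both the order and its queue message together.
print(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0
print(db.execute("SELECT COUNT(*) FROM queue").fetchone()[0])   # 0
```

With a separate broker like Redis, the order insert and the message publish are two independent writes, and a crash between them leaves the system inconsistent; putting the queue inside the same SQLite transaction makes that window impossible by construction.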
little-coder is a coding agent built for small local models running on laptop GPUs. Using 16 extensions, 30 skill files, and adaptations like per-turn tool injection and output repair, a 9.7B Qwen model scored 45.56% on Aider Polyglot where vanilla Aider managed 19.11%. A later run with Qwen3.6-35B hit 78.67%. All benchmarks ran on consumer hardware with zero cloud inference. The project ships as extensions for the pi coding agent framework.