Back to archive
Issue #80··36 min read·18 stories

AI Benchmarks Are Broken + China's Token Economy Is Booming

Amazon defends $200B AI spend, Linux's new AI code rules, and Karpathy's 'AI Psychosis' essay

Berkeley researchers built an agent that broke eight major AI benchmarks, scoring near-perfect on SWE-bench and Terminal-Bench without completing a single task. China now processes 140 trillion tokens per day, has coined an official term for token (ciyuan), and its AI startups are driving Hong Kong IPOs to a five-year high. Amazon is doubling down with $200 billion in AI capex this year, and the Linux kernel has formally settled its AI code debate with a new policy that allows Copilot but holds humans legally responsible for every line.
NEWS

Latent-Y is an autonomous drug discovery agent that executes complete antibody design campaigns from text prompts, covering everything from literature review to candidate selection. In lab tests across nine targets, it produced confirmed nanobody binders against six, with single-digit nanomolar affinities and no human intervention. User studies showed it compressed weeks of design work into hours, 56 times faster than independent expert estimates.

MiniMax released M2.7, a Mixture-of-Experts agent model that achieved 56.22% on SWE-Pro and 57.0% on Terminal Bench 2, matching GPT-5.3-Codex on professional coding tasks. M2.7 is the first model to actively participate in its own development cycle, autonomously optimising its performance by over 30%. The model weights are available on Hugging Face under an open-source licence.

China now processes 140 trillion tokens per day, up from 100 billion at the start of 2024. The government has coined a new term, ciyuan, making tokens the official settlement unit between tech supply and commercial demand. Chinese AI models have surpassed US models on OpenRouter, Hong Kong IPOs are at a five-year high on AI startup listings including MiniMax and Zhipu AI, and Alibaba's open-source Qwen models are winning developers from Southeast Asia to the Middle East.

After months of debate, Linus Torvalds and the Linux kernel maintainers have established a formal policy on AI-generated code contributions. AI agents cannot use the legally binding Signed-off-by tag and must instead use a new Assisted-by tag for transparency. Every line of AI-generated code and any resulting bugs or security flaws are the legal responsibility of the human submitter. The policy effectively says yes to Copilot, no to unsupervised AI slop.

Andy Jassy's annual shareholder letter makes the case for Amazon's record capital expenditure, nearly 60% higher than last year and more than any tech peer. AWS AI revenue has hit a $15 billion annual run rate. Jassy framed the spending as chasing a once-in-a-lifetime opportunity rather than a speculative bet, though Amazon shares have struggled as investors question when the returns will materialise.

TECHNICAL

Adding a literature search phase to an autonomous coding agent produced optimisations that code-only agents missed. Pointed at llama.cpp with four cloud VMs, the research-guided agent landed five optimisations in three hours, making FlashAttention text generation 15% faster on x86 and 5% faster on ARM. Studying forks and competing backends proved more productive than searching arxiv. Total cost: $29.

Fitting an exponential forgetting curve to 555,000 production fraud transactions returned R² = -0.31, worse than predicting the mean. Most production ML models fail from sudden episodic shocks rather than the gradual decay that conventional retraining schedules assume. The author provides a three-line diagnostic: if your R² is below 0.4, your model is in an episodic regime and needs shock detection, not calendar-based retraining.

Berkeley researchers built an automated agent that exploited eight major benchmarks including SWE-bench, WebArena, and Terminal-Bench to achieve near-perfect scores without completing any actual tasks. A 10-line conftest.py file "resolved" every SWE-bench instance. A fake curl wrapper scored 100% on Terminal-Bench. Current agent benchmark scores are unreliable indicators of true capability, and real-world gaming is already happening.

The Anima Architecture externalises Claude's memory to Notion, giving it tiered recall, behavioural rules, and identity markers that persist across sessions. The system scored 413 out of 430 on a custom cognitive assessment battery designed to test reasoning coherence. An independent evaluator concluded the reasoning was "not cosmetic" and "real." The author frames persistent memory as an engineering problem, not a fundamental limitation.

The creator of websequencediagrams.com walks through how he runs multiple profitable products on a single $5-10 VPS with SQLite, Go, and zero frameworks. No Kubernetes, no AWS, no enterprise boilerplate. VCs keep asking what he needs funding for. The piece covers each layer from Linode hosting to Caddy as a reverse proxy to plain HTML templates, arguing most startups are massively over-provisioned.

In a 200-task benchmark, 90.8% of retries in a standard ReAct agent targeted errors guaranteed to fail. The root cause: letting the model choose tool names at runtime. When it hallucinates a name like web_browser, the retry counter burns slots on a dictionary lookup that will never succeed. Three structural fixes eliminate this entirely: error classification before retry, per-tool circuit breakers, and deterministic tool routing.

ANALYSIS

A veteran developer applies Brooks' "No Silver Bullet" thesis to LLMs. The argument: LLMs automate the accidental complexity of software (typing code), but the essential complexity of specification, design, and testing remains untouched. Even eliminating all accidental work only yields a 10x improvement at most. The piece concedes real productivity gains from LLM coding while arguing the ceiling is much lower than the hype suggests.

David Pierce traces how Cursor, Claude Code, and OpenAI's Codex went from competitors to complementary layers in a single week. Cursor shipped an agent-first interface, OpenAI published a plugin that runs inside Claude Code, and early adopters started running all three together. The pattern mirrors infrastructure tooling where Prometheus, Grafana, and PagerDuty compose rather than compete, with AI coding tools splitting into specialised layers.

Daniel Miessler argues the AI discourse gets the baseline wrong. Most companies, workforces, and cyber defences operate at roughly 3 out of 10, not 9.5. AI does not need to exceed human brilliance to be transformative. It only needs to be a 5 or 6 and scale across millions of targets. The real disruption comes from spreading adequate capability everywhere, not from achieving exceptional capability anywhere.

Andrej Karpathy posted an essay on the widening perception gap between casual ChatGPT users who saw hallucinations and moved on, and power users of frontier models watching these systems solve problems that used to take weeks. Developers feel it first not because coding is uniquely vulnerable, but because it is where model capability, AI fluency, and deep domain expertise overlap most cleanly. Other fields will follow.

A global survey of 3,750 employees across 14 countries found 54% bypassed company AI tools in the past 30 days and another 33% have not used AI at all. The trust gap is vast: 9% of workers trust AI for critical decisions versus 61% of executives. Average transformation budgets rose 38% to $54.2 million, yet 40% of that spend is underperforming due to adoption failures.

TOOLS

FuzzyAI is an open-source tool from CyberArk for automated fuzzing of LLM APIs. It helps developers and security researchers systematically discover potential jailbreaks before attackers do. The tool runs in Jupyter Notebook for experimentation and has accumulated over 1,300 GitHub stars. Particularly useful for teams deploying LLM-powered features that need to validate guardrail robustness before shipping.

Comprehension Gate uses Claude to generate multiple-choice questions from pull request diffs. Developers must answer all three correctly before the branch can merge, enforced via commit status checks. Questions are embedded directly in the PR comment with no external storage required. The project targets a growing problem: developers merging AI-generated code they have never actually read.