Researchers from Sakana AI, UBC, Vector Institute, and Oxford published the architecture of their AI Scientist system in Nature. The agent autonomously generates research ideas, designs experiments, runs them, and writes complete papers. A prior version passed human peer review at a top-tier AI conference, outscoring 55% of human-authored submissions. The Nature publication includes new scaling results and discusses what fully automated scientific research would actually require.
Inside Claude's Brain + Apple Reboots Siri for iOS 27
ARC-AGI-3 drops. Fox runs LLMs 2x faster than Ollama. Open source needs to start charging.
François Chollet and Mike Knoop's ARC Prize Foundation released ARC-AGI-3, the first interactive reasoning benchmark designed for AI agents. Instead of static puzzles, it drops agents into unfamiliar environments where they must explore, form strategies, and adapt without any instructions. Every scenario is human-solvable, and scoring measures how efficiently agents learn compared to humans. The benchmark targets the gap between memorisation and genuine reasoning that current models struggle with.
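To make "interactive benchmark" concrete, here's a toy sketch of what an episode might look like: the agent gets observations but no instructions, and scoring compares the steps it spends against a human baseline. Everything here (the environment, the agent, the scoring) is my own illustration, not ARC-AGI-3's actual API.

```python
# Hypothetical shape of an interactive-benchmark episode: no instructions,
# only observations, and the score is steps spent relative to a human baseline.

def run_episode(env, agent, max_steps=100):
    """Run the agent until the environment is solved or the budget runs out."""
    obs, steps = env.reset(), 0
    while not env.done and steps < max_steps:
        obs = env.step(agent.act(obs))
        steps += 1
    return steps

class ToyEnv:
    """The hidden rule: action 3 solves the task. The agent must discover this."""
    def __init__(self):
        self.done = False
    def reset(self):
        return "unknown"
    def step(self, action):
        if action == 3:
            self.done = True
        return "unknown"

class SweepAgent:
    """With no instructions, systematically try actions 0, 1, 2, ..."""
    def __init__(self):
        self.next = 0
    def act(self, obs):
        action, self.next = self.next, self.next + 1
        return action

steps = run_episode(ToyEnv(), SweepAgent())   # sweeps 0,1,2,3 -> 4 steps
human_baseline = 2                            # pretend a human needs 2 tries
efficiency = human_baseline / steps           # lower means less efficient
```

The point of the interactive setup is visible even in the toy: the agent's score depends on how quickly it converts exploration into a working strategy, not on recalling a memorised answer.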
Bloomberg's Mark Gurman reports Apple is testing a standalone Siri app alongside an 'Ask Siri' feature embedded across iOS 27 and macOS 27. The revamped assistant will offer chatbot-style conversation, control features within third-party apps, tap into personal data, and access the open web. Apple plans to unveil it at WWDC on June 8. The overhaul is Apple's clearest signal yet that it sees conversational AI as a core platform capability, not a sidecar feature.
This interactive guide from ngrok walks through LLM quantisation from first principles. It starts with how parameters are stored as floating-point numbers, then builds up to symmetric and asymmetric quantisation methods that can shrink models by 4x and double inference speed with only 5-10% accuracy loss. The post includes working visualisations you can manipulate, covers measuring quality degradation with perplexity and KL divergence, and explains why quantised models can run capable LLMs on a laptop.
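For a flavour of the symmetric case the guide builds up to, here's a minimal per-tensor int8 sketch in plain Python; the function names and the per-tensor scaling choice are mine, not the guide's code.

```python
# Minimal sketch of symmetric quantisation: one scale factor maps floats
# to signed integers, and dequantisation multiplies back.

def quantize_symmetric(weights, bits=8):
    """Map floats to signed ints using a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers."""
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.98, -0.04]
q, scale = quantize_symmetric(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
# max_err is bounded by scale / 2: the rounding error of one integer step
```

Asymmetric quantisation extends this with a zero-point offset, so distributions that aren't centred on zero don't waste half the integer range.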
Prithvi Rajasekaran from Anthropic Labs describes a generator-evaluator pattern that broke through the performance ceiling on long-running autonomous coding. Inspired by GANs, the system pairs a generator with a skeptical evaluator that grades outputs against concrete criteria, preventing agents from becoming self-congratulatory over extended sessions. With a planner added as a third agent, the architecture produces full-stack applications over multi-hour runs by decomposing work into tractable chunks and carrying context through structured artifacts.
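The loop itself is simple to sketch. Here stub functions stand in for model calls, since the real prompts and criteria aren't public; the shape to notice is that the evaluator returns concrete failures, not a verdict, and those failures feed the next generation.

```python
# Hedged sketch of a generator-evaluator loop. The stubs below simulate a
# generator that only adds tests after being criticised for lacking them.

def generate(task, feedback):
    """Stand-in for a generator model call; reacts to evaluator feedback."""
    return {"code": f"solution for {task}", "has_tests": bool(feedback)}

def evaluate(draft, criteria):
    """Skeptical evaluator: return the list of criteria the draft fails."""
    return [name for name, check in criteria.items() if not check(draft)]

def run(task, criteria, max_rounds=3):
    feedback = []
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        feedback = evaluate(draft, criteria)
        if not feedback:               # nothing left to criticise
            return draft
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")

criteria = {"has_tests": lambda d: d["has_tests"]}
result = run("todo app", criteria)     # passes on the second round
```

Grading against named criteria is what keeps the evaluator skeptical: "looks good" can't short-circuit the loop, because the only way out is an empty failure list.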
Figma's growth pushed its Redis caching layer to breaking point. Connection volumes approached hard limits, rapid service scale-ups triggered thundering herds, and a single cluster outage could cascade across the platform. The team built FigCache, a stateless RESP-protocol proxy that decouples connection scaling from fleet capacity. It centralises traffic routing, enforces consistent security policies, and provides end-to-end observability. Since rolling out for Figma's main API, the caching layer has held six nines of uptime.
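RESP, the wire protocol FigCache proxies, is worth seeing once: it's length-prefixed, which is what lets a proxy route commands without understanding their semantics. Below is a tiny command encoder for illustration; it shows the format, not FigCache's code.

```python
# Minimal RESP (REdis Serialization Protocol) encoder: a command is an
# array (*N) of bulk strings, each prefixed with its byte length ($L).

def encode_command(*parts: str) -> bytes:
    out = [f"*{len(parts)}\r\n".encode()]      # array header: element count
    for p in parts:
        data = p.encode()
        out.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(out)

wire = encode_command("SET", "user:42", "alice")
```

Because every element announces its length up front, a stateless proxy can parse frame boundaries, route, and forward traffic without buffering whole responses or knowing what `SET` means.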
Anthropic built interpretability tools that trace Claude's internal computations step by step. The findings were surprising: ask Claude to add 36 and 59, and it claims to carry the ones. Inside, two strategies actually run in parallel: one estimating the rough answer and another computing the last digit. No carrying happens. The research also found Claude's default state is refusal, with hallucinations occurring when its recognition system misfires on partial pattern matches.
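A toy version of that two-path addition makes the idea tangible. The combination step below is my own simplification for illustration, not a reconstruction of Anthropic's circuit.

```python
# Toy sketch of the two parallel strategies: a coarse magnitude path and an
# exact last-digit path, combined at the end with no carrying anywhere.

def rough_estimate(a, b):
    """Magnitude path: add coarsely rounded operands."""
    return round(a, -1) + round(b, -1)        # 36 + 59 -> 40 + 60 = 100

def last_digit(a, b):
    """Digit path: compute only the final digit."""
    return (a + b) % 10                        # (6 + 9) % 10 = 5

def combine(a, b):
    """Pick the one number near the estimate that ends in the right digit."""
    est, digit = rough_estimate(a, b), last_digit(a, b)
    # a width-10 window around the estimate contains exactly one such number
    return next(n for n in range(est - 5, est + 5) if n % 10 == digit)

answer = combine(36, 59)                       # 95, with no carry performed
```

Neither path alone produces 95, and neither carries a one; the precise answer emerges from intersecting a rough magnitude with an exact final digit.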
After twelve months of running coding agents in production, Mario Zechner sees a clear pattern: software is becoming brittle, 98% uptime is the new norm, and user interfaces ship with bugs that any QA team would catch. His argument is that agents compound errors without learning from them, and the speed they offer masks the maintenance debt building underneath. His prescription: deliberate architecture, more code review, and human oversight at decision points.
a16z argues real-world robot deployments remain rare despite rapid technical progress, and the bottleneck is people, not research. Most robotics startups inherit lab culture that optimises for novelty over reliability. The field needs operators, customer-obsessed engineers, and outsiders willing to deploy imperfect robots with teleoperation. Getting hardware into real environments creates feedback loops that pure lab work can't replicate, and the current moment is the best window for that shift.
Steven Vaughan-Nichols makes a blunt case: companies with a combined $7.7 trillion market cap have donated $12.5 million in grants to the Linux Foundation, OpenSSF, and Alpha-Omega. That ratio is the problem. Drawing on his experience sitting on nonprofit boards, he argues maintainers should stop begging for goodwill and start charging enterprises for access. Dependency on charity has never sustained critical infrastructure.
AI is reversing the decades-long trend toward specialisation, according to FormAssembly's Cedric Savarese. An Anthropic study found 27% of AI-assisted work involves tasks people previously lacked the expertise to attempt. Engineers are becoming more full-stack, and non-technical staff handle work that once required specialists. But AI outputs still need a human trust layer to catch hallucinations and overconfidence, and generalists with broad pattern recognition are better positioned for that than narrow experts.
Swizec Teller's hiring policy is simple: your name is on the pager, so you own what your AI wrote. He encourages candidates to use AI in interviews, then probes whether they understand what it produced. His screening question, "tell me about a past project, then we'll dig in," tests for fractal knowledge: the deeper you go, the more detail someone who actually built the thing can provide. LeetCode tests memorisation. Production ownership tests engineering.
RepoClip creates professional demo videos from any public GitHub repository URL. It analyses your code structure with Gemini 2.5 Flash, writes a script, renders visuals with Nano Banana 2, produces video clips with Kling 3.0 Pro, and adds AI narration via OpenAI TTS. The whole pipeline finishes in under a minute. A free tier includes one image and one short video per month with no credit card required.
Intel's Open Edge Platform released anomalib, a Python library packaging state-of-the-art anomaly detection algorithms with built-in experiment management, hyperparameter optimisation, and edge inference support. The library sits at 5,500 GitHub stars and provides ready-to-use implementations across manufacturing defect detection, medical imaging, and surveillance use cases. Models can be exported for edge deployment, making it practical for teams that need anomaly detection running outside the data centre.
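The pattern most of anomalib's models share reduces to a simple idea: fit on normal samples only, then score deviations. Here's that idea in pure Python with a z-score standing in for the learned feature distances anomalib's models actually use; none of this is anomalib's API.

```python
# Fit-on-normal anomaly detection, reduced to a toy: the model never sees
# defects during training, and anything far from "normal" scores high.
import statistics

def fit(normal_samples):
    """Summarise the normal distribution (anomalib models learn features here)."""
    return statistics.fmean(normal_samples), statistics.stdev(normal_samples)

def score(x, model):
    """Anomaly score: distance from normal in standard deviations."""
    mean, std = model
    return abs(x - mean) / std

normal = [9.8, 10.1, 10.0, 9.9, 10.2]          # e.g. sensor readings, all good
model = fit(normal)
is_anomaly = score(14.0, model) > 3.0          # 3-sigma threshold
```

This one-class setup is why the approach suits manufacturing and surveillance: defects are rare and varied, so training only needs the abundant normal data.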
Fox is a new open-source LLM inference engine that benchmarks at 2x the throughput of Ollama on an RTX 4060 running Llama 3.2. The speed comes from context caching: Fox remembers previously processed system prompts and messages, so conversations get faster over time instead of re-reading everything. It's a drop-in replacement with an OpenAI-compatible API, supports CUDA, Metal, and Vulkan backends, and is MIT licensed with a commitment to stay free.
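The caching idea behind that speedup can be sketched in a few lines: keep the computed state for prompt prefixes, and on each request reuse the longest cached prefix so only the new suffix gets processed. The cache structure and names below are my guesses at the general technique, not Fox's code.

```python
# Sketch of context (prefix) caching for LLM prefill. `process` stands in
# for the expensive forward pass and reports how many tokens it touched.

cache = {}   # token-prefix tuple -> processed state

def process(tokens, state=None):
    work = len(tokens)
    return {"seen": (state["seen"] if state else 0) + work}, work

def prefill(tokens):
    """Reuse the longest cached prefix; process only the remainder."""
    tokens = tuple(tokens)
    for cut in range(len(tokens), 0, -1):       # longest prefix first
        if tokens[:cut] in cache:
            state, work = process(tokens[cut:], cache[tokens[:cut]])
            cache[tokens] = state
            return work
    state, work = process(tokens)               # cold start: no prefix cached
    cache[tokens] = state
    return work

first = prefill(["sys", "hello"])               # cold: processes 2 tokens
second = prefill(["sys", "hello", "again"])     # warm: processes only 1
```

Since chat turns always extend the same system prompt and history, the cached prefix keeps growing and each turn pays only for its new tokens, which is why conversations speed up rather than slow down.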