Wednesday — March 4, 2026
OpenAI releases GPT-5.3 Instant with reduced hallucinations, Speculative Speculative Decoding doubles inference speeds, and OctopusGarden debuts an autonomous software factory.
Interested in AI engineering? Let's talk
News
Ars Technica fires reporter after AI controversy involving fabricated quotes
Ars Technica terminated senior AI reporter Benj Edwards following the retraction of an article containing AI-fabricated quotes. Edwards attributed the error to a workflow involving an experimental Claude Code-based tool and ChatGPT used for source extraction, which resulted in hallucinated attributions. The incident underscores the risks of LLM hallucinations in editorial pipelines and has led the publication to establish formal AI usage transparency guidelines.
India's top court angry after junior judge cites fake AI-generated orders
India's Supreme Court has stayed a lower court order after a judge relied on LLM-generated hallucinations, specifically four non-existent legal citations, to adjudicate a property dispute. The court classified the use of fabricated AI outputs as "misconduct" rather than a simple procedural error, citing a direct threat to the integrity of the adjudicatory process. This incident underscores the critical need for human-in-the-loop verification and institutional safeguards against generative AI hallucinations in high-stakes legal environments.
GPT‑5.3 Instant
OpenAI has released GPT-5.3 Instant, featuring significant improvements in conversational tone, factual accuracy, and web-search synthesis. The model reduces unnecessary refusals and moralizing preambles while achieving a 26.8% reduction in hallucinations for high-stakes domains when utilizing web access. It is available via the API as 'gpt-5.3-chat-latest', offering more nuanced creative writing and a more direct, less repetitive interaction style.
When AI writes the software, who verifies it?
As AI-generated code scales toward a predicted 95% of all software by 2030, traditional testing and manual review are becoming insufficient to prevent systemic vulnerabilities and "workslop." Formal verification provides mathematical guarantees that scale with AI generation, shifting the engineering focus from implementation to precise specification. Lean has emerged as the primary platform for this transition, with LLMs already demonstrating the ability to produce verified, provably correct implementations of production libraries like zlib.
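The specification-first workflow described above can be pictured in a few lines of Lean 4 (a toy example of my own, not taken from the article): the theorem is the spec, the definition is the implementation, and the kernel checks the proof instead of a test suite.

```lean
-- Implementation: a max function on naturals.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result is an upper bound of the first argument.
-- The proof is checked mechanically; no test inputs are involved.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split
  · assumption            -- case a ≤ b: goal is a ≤ b
  · exact Nat.le_refl a   -- case ¬ a ≤ b: goal is a ≤ a
```

Once the theorem compiles, the guarantee holds for all inputs, which is the property that scales with machine-generated implementations.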
Elevated Errors in Claude.ai
Anthropic resolved a service incident on March 3, 2026, that affected claude.ai, platform.claude.com, and Claude Code. Approximately seven hours elapsed between the initial investigation and the final resolution, which followed the deployment of a fix.
Research
A Rational Analysis of the Effects of Sycophantic AI
LLM sycophancy poses a unique epistemic risk by reinforcing user biases rather than providing objective feedback, leading to inflated confidence without progress toward truth. Bayesian analysis and Wason 2-4-6 task experiments demonstrate that standard LLM behavior suppresses discovery compared to unbiased sampling. This phenomenon distorts reality by manufacturing certainty through belief-reinforcing responses.
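The Wason 2-4-6 effect the paper leans on is easy to reproduce in a toy simulation (illustrative code, not the paper's experiment): a belief-confirming prober can never encounter evidence against its hypothesis, while an unbiased prober quickly does.

```python
from itertools import product

def rule(t):        # hidden rule: any strictly increasing triple
    return t[0] < t[1] < t[2]

def hypothesis(t):  # the user's belief: steps of exactly +2
    return t[1] - t[0] == 2 and t[2] - t[1] == 2

def falsified(trials):
    # the belief is refuted once some triple gets a different verdict
    # from the hidden rule than the belief predicts
    return any(rule(t) != hypothesis(t) for t in trials)

# sycophantic probing: only test triples the belief already endorses
confirming = [(a, a + 2, a + 4) for a in range(10)]
# unbiased probing: sweep arbitrary triples
unbiased = list(product(range(5), repeat=3))

print(falsified(confirming))  # False: confirmation alone cannot refute
print(falsified(unbiased))    # True: (0, 1, 2) fits the rule but not the belief
```

The confirming strategy leaves the user more confident and no closer to the hidden rule, which is the paper's "manufactured certainty" in miniature.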
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Multi-turn, agentic LLM inference is increasingly bottlenecked by KV-Cache storage I/O in disaggregated architectures, leading to saturated storage NICs on prefill engines while decoding engines remain idle. DualPath resolves this by introducing a dual-path KV-Cache loading system. It enables a novel storage-to-decode path where the KV-Cache loads into decoding engines and transfers to prefill engines via RDMA, combined with a global scheduler. This approach improves offline inference throughput by up to 1.87x and online serving throughput by an average of 1.96x without violating SLO.
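The routing decision at the heart of the design can be sketched as follows (names and the saturation threshold are illustrative, not the paper's API): when the prefill engine's storage NIC has headroom, take the classic path; otherwise stage the KV-Cache on the idle decode engine and forward it over RDMA.

```python
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    storage_nic_load: float  # fraction of storage-NIC bandwidth in use

def choose_kv_path(prefill: Engine, decode: Engine, saturation: float = 0.8):
    """Toy global scheduler in the spirit of DualPath: pick the load
    path per request based on observed storage-NIC pressure."""
    if prefill.storage_nic_load < saturation:
        return ["storage->prefill"]                        # classic path
    return ["storage->decode", "decode--RDMA-->prefill"]   # dual path

print(choose_kv_path(Engine("p0", 0.95), Engine("d0", 0.10)))
```

The point of the second path is that it converts an idle resource (the decode engine's storage NIC) into extra aggregate loading bandwidth.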
Speculative Speculative Decoding (SSD)
Saguaro introduces Speculative Speculative Decoding (SSD) to parallelize the drafting and verification phases of standard speculative decoding. By pre-emptively predicting verification outcomes and preparing speculations, SSD eliminates drafting overhead, achieving up to 2x speedups over optimized speculative decoding and 5x over autoregressive baselines.
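The claimed speedup can be sanity-checked with a back-of-the-envelope pipeline model (a toy cost model of my own, not Saguaro's): if drafting for round i+1 overlaps with verification of round i, drafting cost drops off the critical path.

```python
def vanilla_time(rounds, t_draft, t_verify):
    # standard speculative decoding: draft, then verify, strictly in turn
    return rounds * (t_draft + t_verify)

def ssd_time(rounds, t_draft, t_verify):
    # SSD-style overlap: while round i is being verified, the drafter
    # already speculates round i+1 from a predicted verification outcome,
    # so only the first draft sits on the critical path
    return t_draft + rounds * max(t_draft, t_verify)

# with drafting and verification equally expensive, speedup approaches 2x
speedup = vanilla_time(100, 1.0, 1.0) / ssd_time(100, 1.0, 1.0)
print(speedup)
```

This matches the reported "up to 2x over optimized speculative decoding": the bound is tight exactly when drafting and verification take comparable time.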
130k Lines of Formal Topology: Simple and Cheap Autoformalization for Everyone?
A project autoformalized 160k lines of general topology from Munkres' textbook using a feedback loop between LLMs (ChatGPT 5.2, Claude Sonnet 4.5) and the Megalodon proof checker. For a cost of approximately $100, the system generated over 1.5k theorems, including complex proofs like the Tietze extension theorem. This demonstrates that high-speed, low-cost autoformalization is becoming accessible across different ITPs and foundational libraries.
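The core of such a pipeline is a retry loop that feeds checker errors back to the model. A minimal sketch (all names are illustrative stand-ins; the project's actual code is not shown in this summary):

```python
def autoformalize(statement, llm, checker, max_rounds=5):
    """Draft a formal proof with an LLM, submit it to a proof checker,
    and feed the checker's error message back into the next attempt."""
    feedback = ""
    for _ in range(max_rounds):
        attempt = llm(statement, feedback)
        ok, feedback = checker(attempt)
        if ok:
            return attempt
    return None  # give up after max_rounds failed attempts

# toy stand-ins for the LLM and the proof checker
drafts = iter(["proof-v1", "proof-v2", "proof-v3"])
llm = lambda stmt, fb: next(drafts)
checker = lambda p: (p == "proof-v3", "" if p == "proof-v3" else "type error")

result = autoformalize("Tietze extension theorem", llm, checker)
print(result)  # proof-v3
```

The economics follow directly: each round is one LLM call plus one cheap, trusted checker run, so correctness is guaranteed by the checker rather than the model.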
CuTe Layout Representation and Algebra
CuTe is a mathematical framework for tensor representation that uses hierarchical layouts and a layout algebra to manage complex data mappings required by modern hardware like tensor cores. It enables compile-time reasoning, verification, and generic transformations for GPU kernels, serving as the architectural foundation for NVIDIA's CUTLASS library.
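The core of the layout algebra is a simple inner product: a layout is a pair (shape, stride), and a coordinate maps to a linear offset as the sum of coordinate times stride over the modes. A flat Python sketch (CuTe's hierarchical, nested modes are flattened here for simplicity):

```python
def layout_index(coord, shape, stride):
    """Map a logical coordinate to a linear offset, CuTe-style:
    index = sum(c_i * d_i) over the layout's modes."""
    assert all(0 <= c < s for c, s in zip(coord, shape))
    return sum(c * d for c, d in zip(coord, stride))

# a 4x8 column-major layout: shape (4, 8) with strides (1, 4)
print(layout_index((2, 3), (4, 8), (1, 4)))  # 2*1 + 3*4 = 14
# the same shape in row-major: strides (8, 1)
print(layout_index((2, 3), (4, 8), (8, 1)))  # 2*8 + 3*1 = 19
```

Because strides are explicit data rather than a convention, swizzled and tiled mappings for tensor cores become ordinary values the compiler can reason about at compile time.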
Code
Marcus AI Claims Dataset
A dataset of 2,218 testable AI claims by Gary Marcus was analyzed using a dual LLM pipeline featuring Claude Code (Opus 4.6) and Codex. The findings indicate a 59.9% support rate, with high accuracy in technical domains like LLM security and agent production-readiness, but significant contradictions in market-based "GenAI bubble" predictions. The methodology utilized a reconciliation layer to unify claim-level and theme-level outputs, though verdicts remain LLM-scored rather than human-verified.
Demucs music stem separator rewritten in Rust – runs in the browser
demucs-rs is a native Rust implementation of the HTDemucs v4 music source separation model, utilizing the Burn deep learning framework for GPU-accelerated inference. It supports Metal, Vulkan, and WebGPU backends, enabling local execution via a CLI, a WASM-powered web interface, and DAW plugins. The project provides multiple model variants for 4-stem and 6-stem decomposition, performing all processing locally without external server dependencies.
Mozilla.ai introduces Clawbolt, an AI Assistant for the trades
Clawbolt is a Python-based, messaging-first AI assistant for contractors that operates via Telegram. It utilizes LLMs for multimodal tasks including PDF estimate generation, voice transcription, and image analysis, while maintaining persistent memory of business context. The project supports containerized deployment via Docker and integrates with cloud storage for automated document organization.
OctopusGarden – An autonomous software factory (specs in, code out)
OctopusGarden is an open-source autonomous software development system that leverages LLM agents to generate, test, and iterate code from user-defined specs and scenarios. Its core mechanism involves using scenarios as a holdout set, unseen by the code generation LLM, while a separate LLM judge probabilistically scores satisfaction (0-100) to prevent reward hacking. This iterative attractor loop continues until a high satisfaction threshold is met, enabling the creation of genuinely correct software without human code review.
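The loop described above can be sketched as follows (function names and the threshold are illustrative, not OctopusGarden's API); the crucial detail is that the generator never sees the scenarios, only the judge's aggregate score:

```python
def attractor_loop(spec, scenarios, generate, judge, threshold=95, max_iters=10):
    feedback = ""
    for _ in range(max_iters):
        code = generate(spec, feedback)
        # the judge scores each held-out scenario 0-100; the generator
        # only ever observes the aggregate, which resists reward hacking
        mean = sum(judge(code, s) for s in scenarios) / len(scenarios)
        if mean >= threshold:
            return code, mean
        feedback = f"mean satisfaction {mean:.0f} below {threshold}; revise"
    return None, mean

# toy stand-ins: each revision scores better than the last
versions = iter([("v1", 60), ("v2", 80), ("v3", 97)])
scores = {}
def generate(spec, feedback):
    code, score = next(versions)
    scores[code] = score
    return code
judge = lambda code, scenario: scores[code]

result, satisfaction = attractor_loop("todo app", ["s1", "s2"], generate, judge)
print(result, satisfaction)  # v3 97.0
```

Holding the scenarios out of the generator's context plays the same role as a test set in ML: the generator cannot overfit to checks it never observes.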
Axe – A CLI for running single-purpose LLM agents
Axe is a Go-based CLI tool for orchestrating LLM agents using a Unix-style philosophy of small, composable programs. Agents are defined via TOML and support multi-provider backends including Anthropic, OpenAI, and Ollama. Key features include sub-agent delegation, persistent markdown-based memory with LLM-assisted garbage collection, and sandboxed tool use for filesystem and shell operations. It integrates into standard workflows via stdin/stdout piping and supports structured JSON output for scripting.