Tuesday March 24, 2026

GPT-5.4 Pro solves a frontier math open problem, fine-tuned LLMs outperform MFA writers in creative tasks, and Urlx debuts as an agent-made Rust replacement for curl.


News

iPhone 17 Pro Demonstrated Running a 400B LLM

A 400B parameter model has been successfully run locally on an iPhone, achieving an inference speed of 0.6 t/s. This demonstration highlights the increasing feasibility of executing massive LLMs on mobile hardware.

Epoch confirms GPT-5.4 Pro solved a frontier math open problem

Frontier models including GPT-5.4 Pro, Opus 4.6 (max), and Gemini 3.1 Pro have solved an open problem in Ramsey-theoretic hypergraph partitioning from the FrontierMath benchmark. The models successfully improved the lower bound for the sequence H(n) by a constant factor using a novel construction algorithm, a task previously estimated to require 1–3 months of expert human effort. The AI-generated solution is slated for publication in a specialty journal, demonstrating the capability of LLMs to contribute to original mathematical research.

I built an AI receptionist for a mechanic shop

Axle is a custom AI voice receptionist designed to automate customer service for a mechanic shop using a RAG pipeline. The architecture leverages Voyage AI for embeddings, MongoDB Atlas for vector search, and Claude 3.5 Sonnet to generate grounded responses from a scraped knowledge base. The telephony stack utilizes Vapi with Deepgram for STT and ElevenLabs for TTS, connected via a FastAPI webhook to manage real-time tool calling and conversation memory. Final optimizations focused on voice-specific prompt engineering and a structured escalation flow to ensure natural dialogue and lead retention.
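The grounded-retrieval step at the heart of such a pipeline can be sketched in a few lines. This toy version uses word overlap in place of Voyage AI embeddings, an in-memory list in place of MongoDB Atlas vector search, and invented shop facts; it shows the shape of the step, not Axle's implementation:

```python
import string

# Invented shop facts standing in for the scraped knowledge base.
KNOWLEDGE_BASE = [
    "An oil change costs $49 and takes about 30 minutes.",
    "We are open Monday to Friday, 8am to 6pm.",
    "Brake inspections are free with any service.",
]

def tokenize(text: str) -> set[str]:
    table = str.maketrans("", "", string.punctuation)
    return set(text.lower().translate(table).split())

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by word overlap with the query, a stand-in for
    # cosine similarity over real embeddings in a vector index.
    q = tokenize(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

# The top passages would be injected into the LLM prompt so the voice
# agent answers only from shop-specific facts.
top = retrieve("How much is an oil change?")
```

In the real system the retrieved context feeds Claude 3.5 Sonnet, and the webhook layer handles the telephony round-trip.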

Cq – Stack Overflow for AI coding agents

As LLMs and agents displace traditional knowledge bases like Stack Overflow, they often waste compute and tokens by solving the same technical hurdles in isolation. Mozilla AI is developing cq, an open-source protocol and "Stack Overflow for agents" designed to facilitate structured knowledge sharing. By leveraging MCP servers and multi-agent verification, cq allows agents to query a shared commons for proven solutions and contribute novel findings, improving output reliability through collective consensus.

Designing AI for Disruptive Science

Current AI excels at "hypernormal science" by optimizing for prediction within existing paradigms, but it lacks the capacity for the conceptual reframing necessary for paradigm shifts. While foundation models can scale discovery within known structure types, they often fail to derive underlying physical principles or step outside the logic of their training data. To enable disruptive science, AI development must shift from purely predictive scaling toward architectures that prioritize simplicity, cross-disciplinary analogy, and physical grounding. Furthermore, autonomous AI scientists could serve as model organisms for metascience, allowing researchers to simulate and optimize the institutional conditions that foster revolutionary breakthroughs.

Research

LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck Analysis

LUMINA is an LLM-driven framework designed to optimize GPU architecture for AI workloads by automating design space exploration (DSE). It extracts architectural insights from simulator code and uses sensitivity studies to generate auto-correcting DSE rules, significantly reducing simulation costs compared to traditional heuristics. In evaluations, LUMINA outperformed ML baselines by 17.5x in efficiency and achieved a 32.9% improvement in Pareto Hypervolume, identifying superior GPU designs in only 20 exploration steps.
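Pareto Hypervolume, the metric cited above, measures how much objective space a set of designs dominates relative to a reference point; a larger value means a better front. For two objectives under maximization it reduces to a staircase sweep. This is a generic sketch of the metric, not LUMINA's code, and the (throughput, efficiency) framing is illustrative:

```python
# 2-D Pareto hypervolume (maximization): area dominated by the front,
# measured against a reference point.
def hypervolume_2d(points, ref=(0.0, 0.0)):
    pts = sorted(set(points), key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:  # only non-dominated points extend the staircase
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# Three non-dominated designs, e.g. (throughput, efficiency) pairs.
hv = hypervolume_2d([(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)])
```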

Can Good Writing Be Generative?

A behavioral experiment challenged the uniqueness of human creative writing by having MFA writers compete against LLMs in emulating author styles. Expert judges initially preferred human writing (82.7%) with in-context prompting, but this reversed to a 62% preference for AI after fine-tuning LLMs on complete works. Lay judges consistently preferred AI writing, leading to an "identity crisis" among expert writers and raising fundamental questions about AI's creative limitations and the future of creative labor.

OpenMath: Ontology-Guided Neuro-Symbolic Inference

Addressing LLM limitations like hallucination and lack of formal grounding in high-stakes domains, this work investigates enhancing reliability through RAG with formal domain ontologies. A neuro-symbolic pipeline, utilizing the OpenMath ontology and hybrid retrieval with cross-encoder reranking, was implemented and evaluated on the MATH benchmark. Results show that while ontology-guided context improves performance with high retrieval quality, irrelevant context actively degrades it, highlighting both the potential and challenges of such neuro-symbolic approaches.
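The retrieve-then-rerank pattern the paper builds on can be sketched as a two-stage pipeline. Here a phrase-match heuristic stands in for the learned cross-encoder, and the corpus is invented; the point is the structure (cheap candidate retrieval, then an expensive joint scorer over the survivors), not the paper's implementation:

```python
# Invented mini-corpus; the real system retrieves ontology-grounded passages.
CORPUS = [
    "the derivative of sin x is cos x",
    "a prime number has exactly two divisors",
    "the derivative of a constant is zero",
]

def lexical_score(query: str, doc: str) -> int:
    # Cheap first-pass signal: shared-word count.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_cross_encoder(query: str, doc: str) -> float:
    # Stand-in for a learned cross-encoder that scores (query, doc) jointly.
    return 1.0 if query.lower() in doc.lower() else lexical_score(query, doc) / 10.0

def retrieve_then_rerank(query, corpus, k_retrieve=3, k_final=1):
    # Stage 1: fast retrieval narrows the corpus to a candidate set.
    candidates = sorted(corpus, key=lambda d: lexical_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: the expensive scorer reranks only those candidates.
    reranked = sorted(candidates, key=lambda d: toy_cross_encoder(query, d), reverse=True)
    return reranked[:k_final]

best = retrieve_then_rerank("derivative of sin x", CORPUS)
```

The paper's finding maps onto stage 1: when the candidate set is relevant, reranked context helps; when it is not, no reranker can recover.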

Hyperagents

DGM-Hyperagents (DGM-H) introduce a framework for open-ended self-improvement by integrating task and meta agents into a single, self-referential editable program. Unlike previous systems limited by handcrafted meta-mechanisms, DGM-H enables metacognitive self-modification, allowing the agent to improve both its task-solving logic and the underlying process for generating future improvements. This approach decouples self-modification from domain-specific coding tasks, facilitating transferable meta-level gains and self-accelerating progress across any computable domain.

Challenges and Design Issues in Finding CUDA Bugs via GPU-Native Fuzzing

As AI workloads shift to heterogeneous CPU-GPU systems, the GPU software stack lacks the memory safety hardening found in mature CPU environments. Current mitigation strategies often rely on unfaithful CPU-based translations that fail to capture GPU architectural nuances, leading to exploitable vulnerabilities. To address this, the authors propose a GPU-native fuzzing pipeline for CUDA programs to ensure behavioral faithfulness and improve system security.

Code

Agent Kernel – Three Markdown files that make any AI agent stateful

Agent Kernel is a lightweight, framework-less system for building stateful AI agents using markdown files and git instead of databases or vector stores. By leveraging standard instruction files like AGENTS.md, it teaches coding agents (e.g., Claude Code, Cursor) to maintain their own identity, stateful knowledge, and append-only session logs. This approach enables persistent memory and context across sessions through simple file-based version control.
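The pattern amounts to plain file operations that the agent is instructed to perform. A minimal sketch, assuming a three-file layout: AGENTS.md is the convention named above, while the other two file names are illustrative guesses, not necessarily the repo's actual choices:

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Assumed layout; only AGENTS.md is a standard convention.
FILES = {
    "AGENTS.md": "# Agent identity and operating instructions\n",
    "KNOWLEDGE.md": "# Long-lived facts the agent has learned\n",
    "SESSIONS.md": "# Append-only session log\n",
}

def init_kernel(root: Path) -> None:
    # Create each markdown file once; git (not shown) versions them.
    for name, header in FILES.items():
        path = root / name
        if not path.exists():
            path.write_text(header, encoding="utf-8")

def append_session_entry(log_path: Path, summary: str) -> None:
    # Append-only: new sessions are added at the end, never rewritten.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with log_path.open("a", encoding="utf-8") as f:
        f.write(f"\n## Session {stamp}\n\n{summary}\n")

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    init_kernel(root)
    append_session_entry(root / "SESSIONS.md", "Refactored the retry logic.")
    log_text = (root / "SESSIONS.md").read_text(encoding="utf-8")
```

Because the log is append-only markdown under git, the agent can both grep its own history and diff it across sessions.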

LLM Debate Benchmark

The LLM Debate Benchmark evaluates model performance in multi-turn, adversarial arguments using side-swapped matchups to control for side bias. Rankings are determined by Bradley-Terry ratings from a multi-model judge panel, focusing on rebuttal quality, strategic coherence, and epistemic discipline under pressure. Current results place high-reasoning variants of Claude 4.6 and GPT-5.4 at the frontier, demonstrating that reasoning-specific architectures significantly enhance argumentative stability and counter-argument handling compared to standard models.
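Bradley-Terry ratings can be fit from win/loss records with a standard minorization-maximization loop. This is a generic sketch of the rating method (with invented match results), not the benchmark's own code:

```python
# Bradley-Terry via minorization-maximization: each player's strength is
# repeatedly re-estimated from its wins and its opponents' strengths.
def bradley_terry(results, iters=200):
    players = sorted({p for pair in results for p in pair})
    wins = {p: 0 for p in players}
    games = {}  # unordered pair -> number of games played
    for winner, loser in results:
        wins[winner] += 1
        key = tuple(sorted((winner, loser)))
        games[key] = games.get(key, 0) + 1
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            denom = sum(
                n / (strength[a] + strength[b])
                for (a, b), n in games.items()
                if p in (a, b)
            )
            new[p] = wins[p] / denom if denom else strength[p]
        total = sum(new.values()) or 1.0
        strength = {p: v / total for p, v in new.items()}  # normalize
    return strength

# Invented results: A beats B 3 of 4, B beats C 3 of 4.
matches = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]
ratings = bradley_terry(matches)
```

The side-swapped matchups mentioned above simply double the pair list with positions reversed, so side bias cancels out of the win counts.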

Aerko_ – An offline-first, Vanilla JavaScript fitness PWA with local AI

Aerko_ is a local-first fitness PWA featuring real-time biomechanical analysis via MediaPipe, optimized using a "Phantom DOM" in Web Workers and EMA interpolation for smooth coordinate tracking. The application's logic was 90% generated by Gemini 3.0/3.1 Pro under strict modular constraints provided by the developer. It prioritizes privacy through client-side AES-GCM 256-bit encryption and IndexedDB, operating entirely without cloud integration.
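EMA interpolation of the kind described is a one-liner per coordinate: each new landmark position is blended with the previous smoothed position. A minimal sketch (in Python rather than the app's JavaScript), with a made-up point stream:

```python
# EMA interpolation over a stream of (x, y) landmark coordinates.
# Lower alpha means smoother but laggier tracking.
def ema_smooth(points, alpha=0.3):
    smoothed, prev = [], None
    for x, y in points:
        if prev is None:
            prev = (x, y)  # seed with the first observation
        else:
            prev = (alpha * x + (1 - alpha) * prev[0],
                    alpha * y + (1 - alpha) * prev[1])
        smoothed.append(prev)
    return smoothed

out = ema_smooth([(0.0, 0.0), (10.0, 10.0)], alpha=0.5)
```

In the app this runs per frame in a Web Worker, keeping the jittery raw MediaPipe coordinates off the main thread.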

OpenPencil: Open-source AI-native vector design tool

OpenPencil is an open-source AI-native vector design tool featuring a multi-agent orchestrator for parallel UI generation and a built-in MCP server for integration with external agents. It employs multi-model intelligence to adapt prompts across various LLMs and supports a design-as-code workflow using JSON-based .op files. The platform includes a CLI for terminal-based design manipulation and exports to multiple frameworks including React, Tailwind, and Flutter.

Urlx – an agent-made Rust replacement for curl/libcurl

urlx is a memory-safe, Rust-based reimplementation of curl and libcurl, designed to mitigate historical memory safety CVEs by eliminating unsafe code in its core library and CLI. It offers a drop-in CLI and C ABI compatible with existing curl usage, validated by passing 1,300 of curl's own tests. Key features include rustls for TLS (avoiding OpenSSL) and comprehensive support for various protocols (HTTP/1-3, FTP, SSH, SMTP, WebSocket, MQTT) and DNS functionalities like DoH and DoT.
