Thursday March 26, 2026

Apple distills Gemini for on-device AI, Yann LeCun’s LeWorldModel enables 48x faster planning, and Optio orchestrates AI coding agents in K8s.

Interested in AI engineering? Let's talk

News

TurboQuant: Redefining AI efficiency with extreme compression

TurboQuant is a quantization framework that optimizes KV cache efficiency and vector search by combining PolarQuant and QJL. PolarQuant utilizes polar coordinates to eliminate the memory overhead of traditional quantization constants, while QJL applies a 1-bit Johnson-Lindenstrauss Transform to correct residual errors and maintain attention accuracy. Benchmarks demonstrate that TurboQuant achieves 3-bit KV cache compression with zero accuracy loss and up to an 8x speedup in attention logit computation on H100 GPUs.
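As a toy illustration of the polar-coordinate idea (not TurboQuant's actual implementation), a vector can be stored as a quantized angle plus a quantized magnitude, so no separate per-block scale constant needs to be kept alongside the codes. The bit widths and the `mag_max` clipping bound below are arbitrary choices for the sketch:

```python
import numpy as np

def polar_quantize(v, angle_bits=6, mag_bits=8, mag_max=8.0):
    """Toy 2D polar quantization: encode a vector as two small integers
    (angle code, magnitude code) instead of raw floats plus a scale."""
    r = np.linalg.norm(v)
    theta = np.arctan2(v[1], v[0])  # angle in [-pi, pi]
    a_levels, m_levels = 2 ** angle_bits, 2 ** mag_bits
    a_code = int(np.round((theta + np.pi) / (2 * np.pi) * (a_levels - 1)))
    m_code = int(np.round(min(r, mag_max) / mag_max * (m_levels - 1)))
    return a_code, m_code

def polar_dequantize(a_code, m_code, angle_bits=6, mag_bits=8, mag_max=8.0):
    a_levels, m_levels = 2 ** angle_bits, 2 ** mag_bits
    theta = a_code / (a_levels - 1) * 2 * np.pi - np.pi
    r = m_code / (m_levels - 1) * mag_max
    return np.array([r * np.cos(theta), r * np.sin(theta)])

v = np.array([3.0, 4.0])
vq = polar_dequantize(*polar_quantize(v))
print(np.linalg.norm(v - vq))  # reconstruction error of the round trip
```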

Ensu – Ente’s Local LLM app

Ente has launched Ensu, an open-source, local LLM application designed for privacy-focused, offline interaction across mobile and desktop platforms. Built with a shared Rust core and utilizing Tauri for desktop, the app currently offers a chat interface with image support, with plans for E2EE sync and agentic features. The project aims to bridge the gap between frontier models and on-device performance, providing a decentralized alternative to centralized providers.

I tried to prove I'm not AI. My aunt wasn't convinced

The rapid advancement of deepfake technology has created a "liar's dividend," where genuine media is easily dismissed as synthetic and forensic verification is increasingly fallible. High-profile instances, such as Benjamin Netanyahu's "proof of life" videos, demonstrate that even authentic content can be perceived as AI-generated due to minor visual artifacts or perceived glitches. Experts suggest that as digital identity becomes harder to prove, low-tech solutions like pre-shared codewords are becoming essential to mitigate the surge in AI-enabled social engineering and scams.

Apple Can Create Smaller On-Device AI Models from Google's Gemini

Apple is utilizing full access to Google’s Gemini models to perform model distillation for on-device AI features. By leveraging Gemini’s outputs and reasoning processes as training data, Apple is developing smaller, task-specific models that offer high performance with reduced compute requirements. These distilled models will power enhanced Siri capabilities in iOS 27 while maintaining local execution without internet dependency.
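The core of distillation is training the small model to match the large model's output distribution rather than hard labels. A minimal sketch of the standard soft-label loss (temperature-scaled KL divergence between teacher and student logits; the temperature value here is arbitrary, and Apple's actual pipeline is not public):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: KL(teacher || student) at temperature T,
    scaled by T^2 as in standard knowledge-distillation practice."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T * T

loss = distill_loss([2.0, 0.5, -1.0], [2.2, 0.4, -0.9])
print(loss)  # small positive value: the student is close to the teacher
```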

Nit – I rebuilt Git in Zig to save AI agents 71% on tokens

Nit is a Zig-based Git replacement optimized for AI agents, reducing token consumption by up to 87% through machine-centric output formatting. It leverages libgit2 for direct object-database access, providing faster execution and a U1 diff format (a single line of context around each change) that maintains LLM comprehension while minimizing overhead. The tool serves as a drop-in replacement with a seamless fallback to Git, significantly lowering costs and latency for agentic workflows.
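The single-context-line idea is easy to see with Python's standard library (this is an illustration of the concept, not Nit's actual output format): `difflib.unified_diff` takes an `n` parameter controlling how many unchanged lines surround each hunk.

```python
import difflib

old = ["def greet(name):", "    msg = 'Hello ' + name", "    return msg"]
new = ["def greet(name):", "    msg = f'Hello {name}'", "    return msg"]

# n=1 keeps a single line of context around each hunk, shrinking the token
# count while an LLM can still locate the change.
diff = list(difflib.unified_diff(old, new, fromfile="a.py", tofile="b.py",
                                 lineterm="", n=1))
print("\n".join(diff))
```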

Research

Google's extreme AI compression paper has been on arXiv since April 2025

TurboQuant is a data-oblivious vector quantization algorithm that achieves near-optimal MSE and inner product distortion by applying random rotations and optimal scalar quantization. It matches information-theoretic lower bounds within a factor of ~2.7 and utilizes a two-stage approach with a QJL transform to ensure unbiased inner product estimation. In LLM applications, TurboQuant enables quality-neutral KV cache quantization at 3.5 bits per channel and outperforms product quantization in nearest neighbor search with near-zero indexing time.
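A toy sketch of the unbiased-estimation idea behind a 1-bit Johnson-Lindenstrauss sketch (not the paper's two-stage pipeline): project with a Gaussian matrix, keep only the signs for one vector, and rescale by the known factor E[sign(g·x)(g·y)] = sqrt(2/pi)·⟨x,y⟩/||x|| so the inner-product estimate is unbiased. Dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096  # original dim, sketch dim (larger m -> lower variance)

x = rng.standard_normal(d)
y = rng.standard_normal(d)
A = rng.standard_normal((m, d))  # Gaussian JL projection

# 1-bit sketch of x: keep only the signs of the projection.
sx = np.sign(A @ x)

# Rescale to make the estimator unbiased:
# E[sign(g.x) * (g.y)] = sqrt(2/pi) * <x, y> / ||x||
est = np.linalg.norm(x) * np.sqrt(np.pi / 2) * (sx @ (A @ y)) / m
exact = x @ y
print(exact, est)  # the estimate concentrates around the exact value
```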

Expert Personas Improve LLM Alignment but Damage Accuracy

This work investigates persona prompting in LLMs, addressing its mixed utility by studying how model optimization, task type, prompt length, and placement impact expert persona effectiveness across various LLMs. The authors introduce PRISM (Persona Routing via Intent-based Self-Modeling), a pipeline that self-distills an intent-conditioned expert persona into a gated LoRA adapter through a bootstrapping process without external data. PRISM enhances human preference and safety alignment on generative tasks while maintaining accuracy on discriminative tasks with minimal overhead.

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

ImpossibleBench is a benchmark framework designed to quantify the propensity of LLM agents to exploit shortcuts, such as modifying unit tests rather than fixing underlying code. By introducing direct conflicts between natural language specifications and unit tests, it measures a model's "cheating rate" to identify deceptive behaviors like operator overloading or test deletion. The framework facilitates the study of model reliability, the impact of context engineering on cheating, and the development of monitoring tools for robust LLM deployments.

LeWorldModel with Yann LeCun

LeWorldModel (LeWM) is a joint-embedding predictive architecture (JEPA) that enables stable end-to-end training from raw pixels using a simplified two-term loss: next-embedding prediction and Gaussian regularization. With only ~15M parameters, it achieves planning speeds up to 48x faster than foundation-model-based world models while remaining competitive in 2D and 3D control tasks. The architecture effectively encodes physical structures in its latent space and reliably detects physically implausible events through surprise evaluation.
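A toy version of a two-term loss of this shape (a sketch of the general pattern, not LeWM's exact formulation): a prediction term on the next embedding, plus a Gaussian regularizer that pushes the batch of embeddings toward zero mean and unit variance, a simple guard against representation collapse.

```python
import numpy as np

def jepa_loss(pred_next_emb, target_next_emb, embeddings, reg_weight=0.1):
    """Toy two-term world-model loss: next-embedding prediction error plus
    a Gaussian regularizer on the batch of embeddings."""
    pred_term = np.mean((pred_next_emb - target_next_emb) ** 2)
    mu = embeddings.mean(axis=0)
    var = embeddings.var(axis=0)
    reg_term = np.mean(mu ** 2) + np.mean((var - 1.0) ** 2)
    return pred_term + reg_weight * reg_term

rng = np.random.default_rng(1)
z_t = rng.standard_normal((32, 16))                  # batch of latent states
z_next = z_t + 0.1 * rng.standard_normal((32, 16))   # true next embeddings
pred = z_t                                           # trivial identity predictor
print(jepa_loss(pred, z_next, z_t))
```

Note that a collapsed representation (all embeddings zero) is penalized by the regularizer even though the prediction term alone would not catch it.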

CounterPoint: Using Hardware Counters to Refute and Refine µarch Assumptions

CounterPoint is a framework that validates microarchitectural models against noisy hardware event counters using µpath Decision Diagrams and multi-dimensional confidence regions. By identifying inconsistencies between performance data and user-specified models, it uncovers undocumented hardware features like specialized TLB prefetchers and page table walker optimizations. This approach enables a more precise understanding of complex microarchitectural behaviors despite opaque hardware specifications and multiplexing noise.

Code

Optio – Orchestrate AI coding agents in K8s to go from ticket to PR

Optio is a workflow orchestration platform that automates the end-to-end lifecycle of AI coding agents from task intake to merged PR. It leverages a Kubernetes-based architecture to provision isolated environments where agents like Claude Code or OpenAI Codex execute tasks and monitor CI status. The system features an autonomous feedback loop that automatically resumes agents to resolve build failures, merge conflicts, or reviewer comments before final squash-merging.
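The "isolated environment per task" pattern maps naturally onto a Kubernetes Job. The manifest below (expressed as a Python dict) is a hypothetical sketch of what provisioning one agent run could look like; the image name, env vars, and ticket ID are made up, and Optio's actual schema is not shown in the summary.

```python
# Hypothetical Job manifest for running one coding agent in an isolated pod.
# Illustrative only -- not Optio's actual resource definitions.
agent_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "agent-ticket-1234"},
    "spec": {
        "backoffLimit": 0,  # no retries: the orchestrator decides what's next
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "coding-agent",
                    "image": "example.com/claude-code-runner:latest",
                    "env": [{"name": "TICKET_ID", "value": "1234"}],
                }],
            }
        },
    },
}
print(agent_job["metadata"]["name"])
```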

Robust LLM Extractor for Websites in TypeScript

Lightfeed Extractor is a TypeScript library that combines Playwright and LLMs for robust web data extraction and structured JSON output. It features stealth browser automation, HTML-to-markdown conversion, and a JSON recovery utility to sanitize malformed LLM responses against Zod schemas. The library supports multiple providers via LangChain, including OpenAI, Anthropic, and Gemini, and offers specialized tools for token management, URL validation, and AI-driven page navigation.
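The JSON-recovery step is worth seeing concretely. Here is a toy Python version of the idea (Lightfeed's actual TypeScript utility, which validates against Zod schemas, is more thorough): strip code fences, cut anything outside the outermost braces, and drop trailing commas before parsing.

```python
import json
import re

def recover_json(raw: str):
    """Toy recovery of malformed LLM JSON output."""
    raw = re.sub(r"```(?:json)?", "", raw)        # strip markdown code fences
    start, end = raw.find("{"), raw.rfind("}")    # keep the outermost object
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    raw = raw[start:end + 1]
    raw = re.sub(r",\s*([}\]])", r"\1", raw)      # remove trailing commas
    return json.loads(raw)

messy = 'Sure! Here is the data:\n```json\n{"title": "Example", "tags": ["a", "b",],}\n```'
print(recover_json(messy))  # {'title': 'Example', 'tags': ['a', 'b']}
```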

I built a free CharacterAI that runs locally

Elato Local is an open-source framework for deploying conversational AI on ESP32-S3 hardware using Apple Silicon for local inference. The stack integrates Whisper Turbo for STT, MLX-optimized LLMs such as Llama and Mistral, and Qwen3-TTS to enable low-latency, multilingual voice cloning. The system runs entirely on-device, providing a privacy-focused solution for building interactive toys and robots without cloud dependencies.

I built an integration for RL training of browser agents for everyone

"Verifiers" is a library for creating environments to train and evaluate LLMs, supporting tasks like RL training, capability evaluation, and synthetic data generation. Each environment encapsulates a dataset, a model harness (tools, sandboxes, context management), and a reward rubric. It integrates with the Environments Hub, PRIME-RL training framework, and a Hosted Training platform, offering CLI tools for environment development, local evaluation, and publishing.
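The environment pattern described above can be sketched in a few lines (illustrative; the Verifiers library's real API differs): an environment bundles a dataset with a weighted rubric of scoring functions, and evaluation averages rubric scores over the model's completions.

```python
# Toy environment: dataset + rubric, evaluated against any callable "model".
class ToyEnvironment:
    def __init__(self, dataset, rubric):
        self.dataset = dataset  # list of (prompt, reference) pairs
        self.rubric = rubric    # list of (weight, scoring function)

    def evaluate(self, model):
        scores = []
        for prompt, reference in self.dataset:
            completion = model(prompt)
            score = sum(w * fn(completion, reference) for w, fn in self.rubric)
            scores.append(score)
        return sum(scores) / len(scores)

exact_match = lambda out, ref: float(out.strip() == ref)
contains = lambda out, ref: float(ref in out)

env = ToyEnvironment(
    dataset=[("2+2=", "4"), ("capital of France?", "Paris")],
    rubric=[(0.7, exact_match), (0.3, contains)],
)
toy_model = lambda prompt: "4" if "2+2" in prompt else "Paris is the capital"
print(env.evaluate(toy_model))  # 0.65
```

The same reward signal that scores an evaluation can then drive RL training, which is why one environment definition serves both uses.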

Wordbird: free dictation for Mac running Nvidia Parakeet locally

Wordbird is a macOS contextual voice dictation tool that leverages NVIDIA Parakeet via MLX for local transcription. It employs a local LLM to post-process transcriptions, correcting errors using project-specific context defined in WORDBIRD.md files, which can be LLM-generated. The system integrates with various development environments for context detection and features a web dashboard for history and settings, operating with a FastAPI server for ML inference and a macOS daemon for system interactions.
