Monday April 6, 2026

Gemma 4 26B-A4B MoE delivers larger-model performance at the inference cost of a 4B model locally via LM Studio, Meta-Harness automates LLM harness design, and Gemma Gem enables on-device Gemma 4 inference in a Chrome extension.

Interested in AI engineering? Let's talk

News

Eight years of wanting, three months of building with AI

The author developed syntaqlite, a comprehensive SQLite developer toolset, by leveraging AI coding agents to overcome years of technical inertia. An initial "vibe-coding" phase resulted in unmaintainable spaghetti code, necessitating a rewrite where the author reclaimed architectural control while using AI for implementation, refactoring, and rapid domain research. The project demonstrates that while AI is a massive force multiplier for code generation and "last-mile" features like editor extensions, it remains a poor substitute for high-level design, API ergonomics, and maintaining a coherent mental model of the codebase.

Gemma 4 on iPhone

Google AI Edge Gallery is an open-source mobile sandbox for running high-performance LLMs, including the Gemma 4 family, with 100% on-device inference. Key features include "Thinking Mode" for visualizing reasoning processes, "Agent Skills" for tool-augmented capabilities like Wikipedia grounding, and a Prompt Lab for testing models with granular parameter control. The app also supports multimodal tasks via Ask Image and Audio Scribe, alongside benchmarking tools to evaluate model performance on local hardware.

Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

Gemma 4 26B-A4B uses a MoE architecture with 8 active experts to deliver performance comparable to larger dense models at the inference cost of a 4B model. Using LM Studio 0.4.0’s headless CLI and Anthropic-compatible API, developers can run tools like Claude Code locally on Apple Silicon with no network round-trips and no data egress. The setup supports vision, tool calling, and up to 256K of context, making it a viable private alternative to cloud-based LLM APIs.

In Japan, the robot isn't coming for your job; it's filling the one nobody wants

Japan is pivoting toward Physical AI to address severe labor shortages, leveraging its 70% global market share in industrial robotics hardware. The focus is shifting from high-precision components to full-stack integration, utilizing vision-language models, digital twins, and orchestration software for autonomous real-time control. Backed by $6.3 billion in government funding, the ecosystem is evolving into a hybrid model where startups provide software innovation and incumbents provide manufacturing scale for deployments in logistics, manufacturing, and defense.

Writing Lisp is AI resistant and I'm sad

LLMs exhibit significant performance degradation and higher token costs when writing Lisp compared to data-rich languages like Python or Go. This "AI resistance" stems from a lack of training data and a fundamental mismatch between the high-latency, batch-oriented nature of LLM APIs and the iterative REPL-driven development workflow. Consequently, the economic advantages of agentic AI are currently tethered to language popularity, reinforcing a "Worse is Better" outcome that marginalizes niche functional languages.

Research

Harnessing Hype to Teach Empirical Thinking with AI

A one-semester seminar leveraged the hype surrounding AI coding assistants to teach empirical methods and hypothesis-driven inquiry to software engineering students. By combining hands-on development with student-led empirical studies, the course fostered critical evaluation of AI limitations and research skills. The results demonstrate that integrating popular LLM-based tools can lower barriers to learning abstract research methodologies.

Signals – finding the most informative agent traces without LLM judges

Agentic LLM applications struggle with post-deployment optimization due to the voluminous and non-deterministic nature of interaction trajectories, making review slow and costly. This work introduces a lightweight, signal-based framework that computes cheap, broadly applicable signals from live interactions to triage trajectories. These signals, categorized across interaction, execution, and environment, identify informative interactions without affecting online agent behavior. Evaluation on τ-bench demonstrates that signal-based sampling achieves an 82% informativeness rate and a 1.52x efficiency gain per informative trajectory, offering a practical solution for post-deployment optimization and preference data construction.
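The triage idea can be sketched with a few cheap, trace-level signals. The signal names and weights below are illustrative assumptions, not the paper's actual feature set:

```python
# Illustrative sketch of signal-based trajectory triage: score each trace
# from cheap signals and surface the likely-informative ones for review.

def triage(trajectories, top_k=2):
    """Rank trajectories by cheap signals; return the top_k most informative."""
    def score(t):
        s = 0.0
        s += 1.0 * t.get("tool_errors", 0)       # execution signal: failed tool calls
        s += 0.5 * t.get("user_retries", 0)      # interaction signal: user re-asks
        s += 0.2 * (t.get("num_turns", 0) > 10)  # long interactions skew informative
        return s
    return sorted(trajectories, key=score, reverse=True)[:top_k]

trajs = [
    {"id": "a", "tool_errors": 0, "user_retries": 0, "num_turns": 3},
    {"id": "b", "tool_errors": 2, "user_retries": 1, "num_turns": 14},
    {"id": "c", "tool_errors": 1, "user_retries": 0, "num_turns": 5},
]
picked = triage(trajs)
print([t["id"] for t in picked])  # most informative first
```

Because every signal is computed from data the system already logs, the triage pass adds no latency or behavior change to the live agent.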

TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks (2023)

To address the lack of standardized benchmarking for LLMs on complex, ill-defined tasks, this paper proposes a general taxonomy for prompt design. This framework mitigates performance variance caused by prompt styling, enabling consistent reporting and meaningful cross-study comparisons of LLM capabilities.

Fair coins tend to land on the same side they started

An empirical analysis of 350,757 coin flips validates the DHM physics model, identifying a same-side bias of 0.508 (95% CI [0.506, 0.509]). The study highlights significant inter-subject variability and a practice effect that diminishes bias over time, while confirming a 0.500 probability for heads-tails outcomes when the initial state is randomized.
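As a sanity check on the reported precision (not a replication of the paper's own interval, which may be computed differently), a normal-approximation confidence interval at this sample size lands very close to the published one and puts 0.508 roughly nine standard errors above 0.5:

```python
import math

n = 350_757    # total flips in the study
p_hat = 0.508  # reported same-side proportion

# Wald 95% CI from the normal approximation to the binomial.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
z = (p_hat - 0.5) / se  # distance from the fair-coin null, in standard errors
print(f"SE = {se:.5f}, 95% CI = [{lo:.3f}, {hi:.3f}], z = {z:.1f}")
```

The point is that with n in the hundreds of thousands, even a 0.8-percentage-point bias is unambiguous.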

Meta-Harness: End-to-End Optimization of Model Harnesses

Meta-Harness is an outer-loop optimization system that automates the design of LLM harnesses by searching over source code using an agentic proposer. By leveraging execution traces and historical performance data, it significantly outperforms hand-engineered baselines in text classification, RAG-based math reasoning, and agentic coding while improving token efficiency. The system demonstrates that providing richer access to prior experience enables effective automated harness engineering across diverse applications.

Code

I built a tiny LLM to demystify how language models work

GuppyLM is an 8.7M-parameter vanilla transformer designed to demonstrate end-to-end LLM development, including synthetic data generation, tokenization, and training. The architecture features 6 layers, 6 heads, and a 4,096-token BPE vocabulary with a 128-token context window. It is trained on 60K synthetic single-turn conversations and prioritizes architectural simplicity, using standard LayerNorm and ReLU rather than modern components such as RoPE or SwiGLU.
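The summary does not state GuppyLM's hidden size, but a back-of-the-envelope count shows how roughly 8.7M parameters can arise from the listed hyperparameters; d_model = 320, a 4x FFN expansion, and tied embeddings are assumptions here, not documented facts:

```python
# Rough parameter count for a vanilla decoder-only transformer LM.
# Biases, LayerNorm gains, and positional embeddings are omitted as
# negligible at this scale.

def transformer_params(d_model, n_layers, vocab, d_ff=None):
    d_ff = d_ff or 4 * d_model
    emb = vocab * d_model         # token embedding (output head tied)
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff      # two FFN weight matrices
    return emb + n_layers * (attn + ffn)

print(transformer_params(d_model=320, n_layers=6, vocab=4096))  # ~8.7M
```

Note the context window does not enter the count: it bounds activation memory during training, not weight count.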

Gemma Gem – AI model embedded in a browser – no API keys, no cloud

Gemma Gem is a Chrome extension that enables on-device inference of Google’s Gemma 4 models (E2B/E4B) using WebGPU and @huggingface/transformers. It operates as an autonomous agent capable of DOM manipulation, JavaScript execution, and screenshot capture through a decoupled architecture of offscreen documents and service workers. The tool utilizes q4f16 quantized ONNX models with a 128K context window to provide private, local LLM capabilities directly within the browser.

Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Parlor is an open-source framework for on-device, real-time multimodal AI, utilizing Gemma 4 E2B for speech and vision understanding and Kokoro for TTS. The architecture employs a FastAPI backend with WebSockets, Silero VAD for hands-free interaction, and supports barge-in via sentence-level streaming. It is optimized for Apple Silicon (MLX) and Linux (ONNX), delivering end-to-end latencies of 2.5-3.0 seconds on M3 Pro hardware.

TermHub – Open-source terminal control gateway built for AI Agents

termhub is an AI-native terminal control tool enabling AI agents to programmatically manage and interact with terminal sessions. It allows AI to inspect open sessions, open new windows or tabs, send commands (text or key events), and capture output, including a specialized delta output feature to retrieve only new content produced after a command. Available via CLI and SDK, it supports macOS (iTerm2, Terminal) and Windows (Windows Terminal, CMD) backends.
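The delta-output feature can be illustrated generically as a read offset into a session's scrollback, so each poll returns only text produced since the last read. This is a concept sketch, not termhub's actual API:

```python
# Concept sketch of "delta output": the agent tracks how much of the
# session buffer it has already consumed and reads only the remainder.

class SessionBuffer:
    def __init__(self):
        self._buf = ""
        self._offset = 0

    def append(self, text):
        """Terminal emits new output into the scrollback."""
        self._buf += text

    def read_delta(self):
        """Return only content produced since the previous read."""
        delta = self._buf[self._offset:]
        self._offset = len(self._buf)
        return delta

s = SessionBuffer()
s.append("$ ls\nfile.txt\n")
print(s.read_delta())  # everything so far
s.append("$ pwd\n/home\n")
print(s.read_delta())  # only the new lines
```

For an agent driving long-running commands, this avoids re-parsing the entire scrollback after every step.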

Zero-infra AI agent memory using Markdown and SQLite

memweave is a zero-infrastructure Python library that provides AI agents with persistent, searchable memory stored as Markdown and indexed via SQLite. It features a hybrid search pipeline combining BM25 keyword ranking with sqlite-vec semantic search, enhanced by temporal decay and MMR re-ranking. The library supports LLM-based fact extraction through a flush mechanism and offers a pluggable architecture for custom search strategies and embedding providers.
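The re-ranking stage described above can be sketched as temporal decay on a relevance score followed by MMR to balance relevance against redundancy. Function names, weights, and similarity values are illustrative assumptions, not memweave's API:

```python
# Sketch of decay + MMR re-ranking for agent memory retrieval.

def decayed(score, age_days, half_life=30.0):
    """Halve a memory's relevance every `half_life` days."""
    return score * 0.5 ** (age_days / half_life)

def mmr(candidates, sim, k=2, lam=0.5):
    """candidates: list of (id, relevance); sim(a, b): similarity in [0, 1]."""
    selected, pool = [], dict(candidates)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: lam * pool[c]
            - (1 - lam) * max((sim(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        del pool[best]
    return selected

pair_sim = {frozenset("ab"): 0.9, frozenset("ac"): 0.1, frozenset("bc"): 0.1}
sim = lambda x, y: pair_sim[frozenset((x, y))]

# "a" and "b" are near-duplicates; "c" is older but distinct, so MMR
# picks it second despite its decayed relevance.
ranked = mmr(
    [("a", decayed(1.0, 0)), ("b", decayed(0.95, 0)), ("c", decayed(0.9, 60))],
    sim,
)
print(ranked)
```

In a real pipeline the relevance input would come from the fused BM25 and vector scores rather than hand-set constants.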
