Monday May 4, 2026

Kimi K2.6 beats frontier LLMs in a coding challenge, research reveals LLM coding agents are 1000x more expensive than chat, and Semble cuts code search tokens by 98%.

Interested in AI engineering? Let's talk

News

OpenAI's o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors

A Harvard study published in Science found that OpenAI’s o1 reasoning model outperformed physicians in emergency triage, achieving 67% diagnostic accuracy compared to 50-55% for doctors in low-information scenarios. The LLM also significantly surpassed human benchmarks in complex treatment planning, scoring 89% against 34% for clinicians using standard resources. While currently limited to text-based inputs from electronic health records, the results indicate that LLMs are eclipsing established clinical reasoning benchmarks and moving toward a triadic care model in which clinician, patient, and AI share the diagnostic loop.

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

Kimi K2.6 won the Word Gem Puzzle coding challenge by implementing a greedy sliding algorithm that outperformed Western frontier models like GPT-5.5 and Claude 4.7 on large, scrambled 30x30 grids. While many models relied on static scanning of initial board states, Kimi’s ability to execute active tile movement proved decisive. The contest results suggest a narrowing performance gap between open-weights models and proprietary LLMs in real-time decision-making and protocol-adherent code generation.

Specsmaxxing – On overcoming AI psychosis, and why I write specs in YAML

Acai.sh is an open-source toolkit designed to improve LLM agent reliability through spec-driven development. It utilizes a structured feature.yaml format and Acceptance Criteria IDs (ACIDs) to map requirements directly to code and tests, enabling precise tracking of "acceptance coverage." This framework helps mitigate context window limitations and agent drift by providing a deterministic source of truth for functional behavior and constraints.
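The core idea is simple to sketch: map each acceptance criterion to the tests that claim to cover it, then report the gap. The snippet below is a hypothetical illustration; the actual feature.yaml schema and ACID conventions in Acai.sh may differ.

```python
# Hypothetical sketch of ACID-style acceptance coverage tracking.
# Field names and IDs are illustrative, not the real acai.sh schema.

spec = {
    "feature": "user-login",
    "acceptance_criteria": {
        "ACID-001": "rejects empty passwords",
        "ACID-002": "locks account after 5 failed attempts",
        "ACID-003": "issues a session token on success",
    },
}

# Test-suite metadata: which ACIDs each test claims to exercise.
tests = {
    "test_empty_password": ["ACID-001"],
    "test_lockout": ["ACID-002"],
}

def acceptance_coverage(spec, tests):
    """Fraction of acceptance criteria exercised by at least one test."""
    required = set(spec["acceptance_criteria"])
    covered = {acid for acids in tests.values() for acid in acids}
    return len(required & covered) / len(required)

print(f"{acceptance_coverage(spec, tests):.0%} acceptance coverage")
```

Because each criterion has a stable ID, an agent can be told exactly which ACIDs its change must satisfy, giving it a deterministic target instead of a prose description.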

For thirty years I programmed with Phish on, every day

The shift from traditional programming to managing agentic workflows has fundamentally disrupted the deep flow state required for complex systems engineering. While supervising LLM-based agents provides higher leverage, the resulting staccato rhythm of constant context switching is incompatible with the continuous attention spans once used for manual coding. This transition highlights a growing tension between engineering efficiency and the creative fulfillment found in deep, focused work.

AI, Intimacy, and the Data You Never Meant to Share

AI-driven intimate devices utilize bio-feedback sensors and adaptive algorithms to optimize user experiences based on real-time biometric telemetry. These systems capture highly sensitive data points, such as response patterns and intensity, which are often transmitted to opaque remote servers for processing. This creates a significant privacy risk, as intimate biometric profiles may be commodified by data brokers, extending the reach of personal data harvesting into previously private domains.

Research

Learning Pseudorandom Numbers with Transformers

Transformers can perform in-context prediction on Permuted Congruential Generators (PCGs), surpassing classical attacks even with single-bit output truncation. The study establishes a scaling law where the required context length grows as $\sqrt{m}$, and highlights the necessity of curriculum learning for moduli $m \geq 2^{20}$. Mechanistic analysis shows that embedding layers develop bitwise rotationally-invariant clusters, enabling representation transfer across different moduli.
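For readers unfamiliar with the target, a PCG combines a linear congruential state update with a permuted output function; the paper's hardest setting truncates each output to a single bit. Below is a minimal PCG-XSH-RR 64/32 generator using the reference constants (the study itself varies the modulus $m$), showing how such training sequences can be produced:

```python
# Minimal PCG-XSH-RR 64/32 sketch with the reference constants.
# The paper studies varying moduli m; this fixes m = 2^64 for illustration.
MULT = 6364136223846793005
INC = 1442695040888963407  # increment must be odd

def pcg32(state):
    """One step: LCG state update, then xorshift + random rotation output."""
    state = (state * MULT + INC) & 0xFFFFFFFFFFFFFFFF
    xorshifted = ((state >> 18) ^ state) >> 27 & 0xFFFFFFFF
    rot = state >> 59
    out = ((xorshifted >> rot) | (xorshifted << (32 - rot))) & 0xFFFFFFFF
    return state, out

def bit_sequence(seed, n):
    """Single-bit truncated outputs, the hardest setting for in-context prediction."""
    state, bits = seed, []
    for _ in range(n):
        state, out = pcg32(state)
        bits.append(out & 1)
    return bits
```

The output permutation is what makes the sequence resistant to classical lattice attacks once bits are truncated, which is why learned in-context prediction is notable here.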

New research on analyzing and predicting token consumption of coding agents

A systematic study of LLM token consumption in agentic coding tasks reveals they are 1000x more expensive than code reasoning or chat, driven primarily by input tokens. Token usage is highly variable and stochastic, with higher consumption not correlating with improved accuracy. Frontier LLMs show substantial differences in token efficiency, and models consistently fail to accurately predict their own token costs, often underestimating them. Furthermore, human-perceived task difficulty poorly aligns with actual token expenditure.

Learning Randomized Reductions

Bitween automates the discovery of Randomly Self-Reducible (RSR) properties for function self-correction, outperforming symbolic regression and MILP methods via a linear regression-based framework. Its neuro-symbolic extension, Agentic Bitween, leverages LLMs to dynamically generate novel query functions, enabling the discovery of new RSR properties across a benchmark of 80 scientific and machine learning functions.
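The classic example of a randomized self-reduction, which this line of work generalizes, is the linear function f(x) = a·x mod p, where f(x) = f(x + r) − f(r) holds for any random r. A sketch of how that identity self-corrects a faulty implementation (the fault pattern and constants here are invented for illustration):

```python
import random

P = 2_147_483_647  # prime modulus, chosen for this sketch
A = 123456789

def faulty_f(x):
    """Target function a*x mod P, deliberately wrong on ~10% of inputs."""
    if x % 10 == 0:  # injected fault
        return 0
    return (A * x) % P

def self_correct(f, x, trials=9):
    """Apply the RSR identity f(x) = f(x+r) - f(r) (mod P) at random r,
    taking a majority vote over independent trials."""
    votes = {}
    for _ in range(trials):
        r = random.randrange(P)
        y = (f((x + r) % P) - f(r)) % P
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get)

print(self_correct(faulty_f, 20))  # almost surely recovers A*20 % P despite the fault
```

Bitween's contribution is discovering identities like this automatically for functions where no textbook reduction is known.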

Code

Semble – Code search for agents that uses 98% fewer tokens than grep

Semble is a high-performance code search library for agents that reduces token consumption by 98% compared to traditional grep-based methods. It achieves transformer-level retrieval quality by fusing Model2Vec embeddings and BM25 lexical search via RRF, enabling sub-second indexing and millisecond query times on local CPU hardware. The tool integrates with agents like Claude Code and Cursor via an MCP server and utilizes code-aware reranking signals to surface precise snippets without requiring GPUs or external APIs.
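Reciprocal Rank Fusion, the RRF step Semble uses to merge its dense and lexical result lists, is compact enough to show in full. This is a generic sketch of the standard algorithm, not Semble's internal code; the document names and the k = 60 constant are illustrative:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes
    1 / (k + rank) to every document it returns."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers being fused:
bm25 = ["utils.py:parse", "io.py:read", "cli.py:main"]
dense = ["io.py:read", "net.py:fetch", "utils.py:parse"]
print(rrf_fuse([bm25, dense]))
```

Because RRF only needs ranks, not comparable scores, it fuses BM25 and embedding similarity without any score calibration, which keeps the pipeline fast on CPU.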

I'm running parallel Pi agents on a local sandbox

SmolVM provides secure, hardware-isolated microVMs designed for AI agents to execute code, browse the web, and manage files in disposable environments. It features sub-second boot times, network egress controls, and state persistence via snapshots, enabling safe execution of untrusted code. The platform integrates with major agent frameworks like LangChain and PydanticAI, supporting both CLI and Python-based workflows.

My OSINT dashboard with 60+ feeds now has pseudonymous P2P comms

ShadowBroker is a decentralized GEOINT platform that aggregates over 60 real-time OSINT telemetry feeds into a unified map interface built with FastAPI and Next.js. It features a bidirectional agentic AI command channel supporting HMAC-signed integrations with LLM-driven agents like OpenClaw for automated multi-domain analysis and real-time map control. Key technical capabilities include SAR ground-change detection, a decentralized governance economy via Sovereign Shell, and an obfuscated communication mesh for secure intelligence exchange.
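An HMAC-signed command channel like the one described is straightforward to sketch with the standard library. The payload shape, action names, and shared-secret handling below are assumptions for illustration, not ShadowBroker's actual protocol:

```python
import hashlib
import hmac
import json

SECRET = b"shared-agent-key"  # placeholder; a real deployment would use a managed secret

def sign_command(payload: dict) -> dict:
    """Attach an HMAC-SHA256 tag so the map UI can verify the command
    really came from the authorized agent channel."""
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return {"body": payload, "sig": sig}

def verify_command(msg: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    body = json.dumps(msg["body"], sort_keys=True).encode()
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["sig"])

msg = sign_command({"action": "pan_map", "lat": 48.85, "lon": 2.35})
print(verify_command(msg))
```

Serializing with sort_keys=True makes the signature independent of dict ordering, and compare_digest avoids timing side channels when checking it.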

BoxLite – the all-terrain micro-VM

BoxLite is a lightweight, stateful compute substrate designed for AI agents to run OCI containers within persistent micro-VMs. It provides hardware-level isolation via KVM or Hypervisor.framework and can be embedded directly into applications as a library across multiple languages, including Python, Rust, and Node.js. This local-first architecture allows for secure, concurrent, and persistent execution environments without the need for a background daemon or root access.

H4ckf0r0day/obscura: The headless browser for AI agents and web scraping

Obscura is a lightweight, Rust-based headless browser designed for AI agents and high-scale web scraping, offering a high-performance alternative to headless Chrome with significantly lower memory overhead. It supports the Chrome DevTools Protocol (CDP) for seamless integration with Puppeteer and Playwright, featuring built-in stealth mode and native DOM-to-Markdown conversion to optimize data ingestion for LLMs.
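Obscura itself is written in Rust, but the DOM-to-Markdown idea is easy to illustrate: walk the parse tree and emit Markdown for the handful of tags an LLM actually needs. A toy Python sketch handling only headings, paragraphs, and links (Obscura's real converter will cover far more):

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Toy DOM-to-Markdown pass: headings, links, and paragraphs only."""
    def __init__(self):
        super().__init__()
        self.out, self._href = [], None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")
        elif tag == "p":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            self.out.append(f"]({self._href})")
        elif tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n")

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only text nodes
            self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "".join(parser.out).strip()

print(to_markdown('<h1>Docs</h1><p>See <a href="/api">the API</a>.</p>'))
```

Dropping markup this way is where the token savings come from: the model ingests a few lines of Markdown instead of kilobytes of nested HTML.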
