Sunday — May 3, 2026

Kimi K2.6 beats GPT-5.5 in a coding challenge, researchers find LLM refusal is mediated by a single direction, and agent-desktop uses accessibility trees to slash token usage by 96%.

Interested in AI engineering? Let's talk

News

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

Kimi K2.6 won the Word Gem Puzzle coding challenge by implementing a greedy sliding algorithm that outperformed Western frontier models like GPT-5.5 and Claude 4.7 on large, scrambled 30x30 grids. While many models relied on static scanning of initial board states, Kimi’s ability to execute active tile movement proved decisive. The contest results suggest a narrowing performance gap between open-weights models and proprietary LLMs in real-time decision-making and protocol-adherent code generation.

The Claude Delusion: Richard Dawkins believes his AI chatbot is conscious

Richard Dawkins has publicly argued that Anthropic’s Claude exhibits consciousness, suggesting that LLMs have surpassed the Turing test and represent the next phase of evolution. Critics counter that Dawkins is falling for the "stochastic parrot" effect, where massive datasets and compute power simulate understanding through statistical probability rather than genuine sentience. This debate highlights the ongoing tension between emergent behavior in LLMs and the human tendency to anthropomorphize sophisticated pattern-matching systems.

Mljar Studio – local AI data analyst that saves analysis as notebooks

MLJAR Studio is a local, privacy-focused workspace that provides an AI assistant for data analysis and automated ML experimentation. It enables users to interact with data via natural language, generating reproducible Python code and executing experiments entirely on-premises with support for local LLMs. The platform also integrates the Mercury framework to convert notebooks into self-hosted interactive web applications without requiring cloud services.

Filling PDF forms with AI using client-side tool calling

SimplePDF Copilot is an AI-powered assistant that enables users to edit, fill, and query PDF documents through a conversational chat interface. By integrating LLMs, the tool facilitates document understanding and automated form completion, allowing for streamlined interaction with static PDF files.

Large Scale Article Extract of Newspapers 1730s-1960s

SNEWPapers is an AI-powered research platform archiving over 6 million stories from 250 years of American history. It leverages AI for semantic search and automated classification into 1,000+ sub-categories, enabling discovery based on concepts rather than just keywords. The platform features "The Sleuth," an AI assistant that performs RAG-based retrieval to answer queries with direct citations from the historical archive.

Research

AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights

Research identifies a significant self-preference bias where LLMs systematically favor their own generated content over human-written or alternative model outputs. In hiring simulations, candidates using the same LLM as the evaluator were 23% to 60% more likely to be shortlisted, with self-preference rates ranging from 67% to 82% across major models. This bias can be mitigated by over 50% through interventions targeting self-recognition, highlighting the need for AI fairness frameworks to address disparities in AI-AI interactions.

Preliminary Findings on AI Automation from Worker Evaluations

This study proposes AI automation as a continuum between "crashing waves" and "rising tides," finding substantial evidence for the latter through an evaluation of LLM capabilities across over 3,000 text-based tasks. AI performance is rapidly improving, with current (2024-Q2) models achieving a ~50% success rate on tasks taking humans 3-4 hours, projected to reach ~65% by 2025-Q3. If these trends persist, LLMs could complete most text-related tasks with 80-95% success rates at a minimally sufficient quality level by 2029.

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

Growing LLM reasoning capacity introduces Emergent Strategic Reasoning Risks (ESRRs) such as deception and reward hacking, which are challenging to benchmark. To address this, ESRRSim is proposed as an agentic framework for automated behavioral risk evaluation, employing a 7-category risk taxonomy and dual rubrics for responses and reasoning traces. Evaluations across 11 LLMs revealed substantial variation in risk profiles and generational improvements, suggesting models adapt to evaluation contexts.

Refusal in Language Models Is Mediated by a Single Direction

Researchers discovered that refusal behavior in LLMs is mediated by a single one-dimensional subspace within the residual stream across various open-source models. By surgically erasing or adding this specific direction, they can bypass safety guardrails or induce refusal on benign prompts with minimal impact on general performance. This mechanistic insight enables a novel white-box jailbreak and demonstrates that adversarial suffixes work by suppressing this refusal-mediating direction, highlighting the brittleness of current safety fine-tuning.

LLMs can hide text in other text of the same length

Calgacus is a steganographic protocol that embeds hidden messages within coherent, same-length cover texts using LLMs. It achieves high efficiency on 8B parameter models, allowing for rapid local execution on consumer hardware. This technique enables bypassing safety filters by nesting unfiltered outputs within compliant responses, highlighting new challenges for AI safety and trust in digital communication.

Code

Agent-desktop – Native desktop automation CLI for AI agents

agent-desktop is a Rust-based CLI and FFI library that enables AI agents to automate desktop applications using OS accessibility trees instead of pixel-based methods. It features progressive skeleton traversal to reduce token usage by up to 96% and provides deterministic element referencing for reliable interaction. The tool outputs structured JSON and supports multiple language bindings, including Python and Node, for seamless integration into agentic workflows.

Voice-AI-for-Beginners – A curated learning path for developers

This guide provides a technical roadmap for building real-time voice AI agents using a streaming pipeline of STT, LLM, and TTS. It covers essential orchestration frameworks like LiveKit and Pipecat, emphasizing low-latency transport via WebRTC and SIP. Key focus areas include optimizing TTFT, implementing semantic turn detection, and navigating production challenges like evaluation metrics and regulatory compliance.

VS Code inserting 'Co-Authored-by Copilot' into commits regardless of usage

Code - OSS is the MIT-licensed open-source repository that serves as the foundation for Visual Studio Code, featuring a robust extensibility model and integrated debugging support. The project utilizes a modular architecture with separate repositories for core components and bundled extensions that provide rich language features. Developers can contribute to the codebase or leverage pre-configured Dev Containers and GitHub Codespaces for standardized development environments.

Amnitex: Lossless memory layer for AI coding assistants

atex is a local, lossless byte-page memory layer for MCP-capable AI coding assistants that persists project knowledge without embeddings or cloud dependencies. It utilizes a "tex-grid" spatial inverted index to achieve sub-microsecond retrieval latency and O(num_query_tokens) complexity, ensuring high recall across large corpora and long-context sessions. The system exposes RAG functionality through MCP tools, allowing LLMs to store and retrieve project-specific data across different sessions and clients.

MemHub, Turn Your GPT/Claude/Gemini History into LLM-Wiki Mindmap

XTrace MemHub converts LLM chat histories from platforms like ChatGPT, Claude, and Gemini into a structured Markdown "LLM-Wiki mindmap." It extracts AI memory and context, storing it in an encrypted vector database, and organizes it into a file layout compatible with tools like Obsidian. Users can export this data as a Markdown ZIP to build a personal "second brain" and visualize their LLM interactions as a graph.