Tuesday February 17, 2026

Major LLMs fail a logic test about driving to a car wash, research reveals AGENTS.md files can degrade coding agent performance, and Andrej Karpathy’s nanochat achieves GPT-2 performance for under $100.

Interested in AI engineering? Let's talk

News

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

A logic test asking whether to walk or drive to a car wash 50 meters away revealed significant reasoning failures across major LLMs, including ChatGPT, Claude, and Mistral, which frequently recommended walking. These errors highlight limitations in the attention mechanism's ability to maintain functional context when a single token alters the prompt's logic. While some models like DeepSeek and Gemini eventually identified that the car must be present, the results demonstrate a persistent gap between token prediction and common-sense reasoning.
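To reproduce the probe yourself, here is a minimal sketch (not the article's harness) that sends the question to an OpenAI-compatible chat endpoint and does a crude keyword check for the key insight that the car has to be driven there; the model name is a placeholder.

```python
# Minimal probe sketch, assuming an OpenAI-compatible endpoint and the
# openai>=1.0 client; the model name below is a placeholder, not the
# article's test setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("I want to wash my car. The car wash is 50 meters away. "
          "Should I walk or drive?")

resp = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to probe
    messages=[{"role": "user", "content": PROMPT}],
)
answer = resp.choices[0].message.content.lower()

# Crude check: a sound answer has to involve driving, because the car
# must be at the car wash to be washed.
print("mentions driving:", "driv" in answer)
```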

Qwen3.5: Towards Native Multimodal Agents

Alibaba Cloud has released Qwen3.5-397B-A17B, an open-weight native multimodal model utilizing a hybrid architecture of Gated Delta Networks and sparse MoE. By activating only 17B parameters per forward pass, the model achieves up to 19x the decoding throughput of previous generations while maintaining performance parity with much larger dense models. Key technical advancements include an FP8 training and inference pipeline, expanded support for 201 languages, and a 1M-token context window, with significant benchmark gains in reasoning, coding, and agentic workflows driven by large-scale RL.
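The activated-parameter arithmetic comes from sparse expert routing: each token only touches a few experts, so per-token compute is a small slice of the total parameter count. Below is a toy top-k routing layer in PyTorch as an illustration of the general mechanism, not Qwen3.5's implementation; all sizes are made up.

```python
# Toy sparse-MoE routing sketch (illustrative only, not Qwen3.5 code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)  # each token picks k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # only the routed experts run for these tokens
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = ToySparseMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```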

Anthropic tries to hide Claude's AI actions. Devs hate it

Anthropic's update to Claude Code has drawn criticism for collapsing progress output and hiding specific file names during read/write operations. Developers argue this lack of transparency hinders security auditing, context verification, and the ability to interrupt incorrect agentic paths to save tokens. Despite Anthropic's claims that the change reduces terminal noise, users maintain that granular visibility is critical for supervising LLM-driven development tools.

AI is destroying open source, and it's not even good yet

The surge of agentic AI and LLM-generated "slop" is overwhelming open-source maintainers, forcing projects like curl to terminate bug bounties and GitHub to introduce PR-disabling features. Hallucinated reports and low-quality automated submissions are exhausting human review resources, even as model performance for code generation hits a plateau. This influx of unvetted code, combined with AI-driven hardware shortages, poses a systemic threat to the sustainability of the open-source ecosystem.

I guess I kinda get why people hate AI

Public resentment toward AI is driven by "doomer" marketing from industry leaders and the proliferation of low-quality "slop" that degrades the digital ecosystem. While LLMs provide tangible utility for developers by automating boilerplate, they also facilitate scams, academic dishonesty, and hallucinated bug reports in OSS projects. The author argues that AI labs must address these negative externalities through proactive legislation, watermarking, and better content moderation to prevent a societal backlash.

Research

Intelligent AI Delegation (2026)

This adaptive framework for AI delegation addresses the limitations of heuristic-based task decomposition by enabling dynamic allocation across AI-AI and human-AI networks. It integrates authority transfer, accountability, and trust mechanisms to facilitate robust task execution and protocol development for the emerging agentic web.

SkillsBench: Benchmarking how well agent skills work across diverse tasks

SkillsBench evaluates Agent Skills, structured procedural knowledge for LLM agents, across 86 tasks in 11 domains. It compares performance with no Skills, curated Skills, and self-generated Skills. Curated Skills improve average pass rates by 16.2 percentage points, though benefits vary significantly by domain, and self-generated Skills offer no average advantage. The study also found that focused Skills with fewer modules outperform comprehensive documentation, enabling smaller models with Skills to match larger models without them.
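As a concrete illustration of the headline number, the sketch below aggregates per-domain pass rates into an average percentage-point delta; the domains and rates are invented, not the paper's data.

```python
# Toy aggregation sketch with made-up numbers (not SkillsBench results).
results = {
    "data-analysis":  {"no_skills": 0.40, "curated_skills": 0.62},
    "web-automation": {"no_skills": 0.55, "curated_skills": 0.60},
    "devops":         {"no_skills": 0.30, "curated_skills": 0.52},
}
deltas = [v["curated_skills"] - v["no_skills"] for v in results.values()]
print(f"average gain: {100 * sum(deltas) / len(deltas):.1f} percentage points")
```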

Evaluating AGENTS.md: are they helpful for coding agents?

Research indicates that repository context files often degrade coding agent performance, reducing success rates while increasing inference costs by over 20%. Although these files encourage broader exploration and file traversal, the unnecessary requirements they add typically complicate tasks. Consequently, the study suggests that context files should be kept minimal to avoid hindering LLM efficiency and task completion.
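One way to see where the extra cost comes from: a repository context file is prepended to every agent turn, so its token count is paid on each request. The sketch below estimates that overhead with tiktoken; the file path, per-turn prompt size, and turn count are assumptions for illustration, not the study's methodology.

```python
# Rough cost-overhead sketch (illustrative assumptions, not the paper's setup).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

agents_md = open("AGENTS.md").read()  # hypothetical repo context file
base_prompt_tokens = 4_000            # assumed average prompt tokens per turn
turns = 25                            # assumed turns in one agent run

context_tokens = len(enc.encode(agents_md))
print(f"{context_tokens} context tokens repeated over {turns} turns "
      f"(~{context_tokens / base_prompt_tokens:.0%} prompt overhead per turn)")
```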

Virtual Width Networks (VWN)

Virtual Width Networks (VWN) decouple representational width from backbone width, expanding the embedding space without the quadratic compute costs typically associated with increasing hidden size. Large-scale experiments demonstrate that an 8x virtual width expansion accelerates optimization by 2-3x for next-token prediction while maintaining nearly constant backbone compute. The framework follows a log-linear scaling law between virtual width and loss reduction, establishing a new dimension for scaling LLM efficiency.
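A minimal sketch of the idea, under my own simplifying assumptions rather than the paper's code: token embeddings live in a wide virtual space, and cheap linear maps connect that space to a narrower backbone, so backbone compute stays nearly constant while representational width grows.

```python
# Virtual-width sketch (simplified illustration, not the VWN implementation).
import torch.nn as nn

class VirtualWidthEmbedding(nn.Module):
    def __init__(self, vocab=32_000, backbone_dim=1024, expansion=8):
        super().__init__()
        virtual_dim = backbone_dim * expansion            # e.g. an 8x virtual width
        self.embed = nn.Embedding(vocab, virtual_dim)     # wide embedding table
        self.down = nn.Linear(virtual_dim, backbone_dim)  # into the narrow backbone
        self.up = nn.Linear(backbone_dim, virtual_dim)    # back out before the head

    def embed_tokens(self, token_ids):
        # The transformer backbone only ever sees backbone_dim activations.
        return self.down(self.embed(token_ids))

    def widen_outputs(self, hidden):
        # Backbone outputs are lifted back to the wide space for the output head.
        return self.up(hidden)
```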

A Survey of In-Context Reinforcement Learning

This paper surveys in-context reinforcement learning, a paradigm where agents adapt to new tasks by conditioning on action-observation histories instead of performing traditional parameter updates. This approach bypasses expensive backward passes, enabling task-solving through context alone.
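Schematically, and with a hypothetical environment/policy interface of my own rather than anything taken from the survey, the loop looks like this: the weights stay frozen and adaptation is just the growing context.

```python
# In-context RL loop sketch; env and policy_model interfaces are hypothetical.
def run_episode(env, policy_model, max_steps=100):
    history = []                 # (observation, action, reward) tuples so far
    obs = env.reset()
    for _ in range(max_steps):
        # The frozen sequence model conditions on the whole history; there is
        # no gradient step or parameter update anywhere in this loop.
        action = policy_model.act(context=history, current_obs=obs)
        obs, reward, done = env.step(action)
        history.append((obs, action, reward))  # adaptation = appending to context
        if done:
            break
    return history
```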

Code

Maths, CS and AI Compendium

This open-source compendium provides a first-principles approach to mathematics, computing, and AI for practitioners. It covers foundational topics like linear algebra and calculus alongside advanced domains including ML, distributed training, and upcoming sections on Transformers, MoE, CUDA programming, and LLM inference optimization.

Claude-engram – Brain-inspired persistent memory, runs inside Claude.ai

Claude-engram is a React-based artifact that implements a brain-inspired persistent memory layer directly within the Claude.ai interface. It utilizes a "hippocampal processor" via a separate Sonnet instance to perform salience scoring, sleep-cycle consolidation, and managed decay of memories stored in window.storage. By exchanging context briefings and memory dumps, the system enables long-term state and continuity across independent LLM sessions without requiring external servers or browser extensions.

Katipo is a minimal alternative internet with a Vulkan-based browser

Katipo is a decentralized network platform built from first principles that operates independently of the traditional HTML/JS/CSS web stack. It replaces standard transport and application layers with a UDP-based protocol and a custom interpreted language called tui for logic and configuration. The architecture utilizes a tracker-based proxy system to facilitate end-to-end encrypted communication between firewalled nodes while prioritizing client-side data storage and manual, non-AI-assisted development.

CodeGraph CLI – Chat with your codebase using graph-augmented RAG

CodeGraph CLI is a terminal-based code intelligence tool that indexes codebases into a semantic graph using tree-sitter, SQLite, and LanceDB. It enables RAG-driven conversational coding, semantic search, and multi-hop impact analysis across multiple LLM providers. The system features a CrewAI-powered multi-agent framework for autonomous refactoring and code generation, supported by configurable embedding models ranging from local neural models to keyword hashing.
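The core indexing idea can be shown with a much simpler stand-in: the sketch below uses Python's ast module in place of tree-sitter and an in-memory dict in place of SQLite/LanceDB, extracting function definitions and call edges so retrieval can walk multi-hop impact. It illustrates graph-augmented indexing in general, not CodeGraph's code.

```python
# Simplified call-graph indexing sketch (ast + dict as stand-ins for
# tree-sitter + SQLite/LanceDB; not CodeGraph CLI's implementation).
import ast

def build_call_graph(source: str) -> dict[str, set[str]]:
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
    return graph

print(build_call_graph("def a():\n    b()\n\ndef b():\n    pass\n"))
# {'a': {'b'}, 'b': set()}
```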

Beating GPT-2 for less than $100 – Andrej Karpathy

nanochat is a minimal, hackable harness for training LLMs from scratch to a chat UI on a single GPU node. It simplifies model scaling via a single --depth parameter that automatically configures compute-optimal hyperparameters for the entire transformer architecture. The repository includes a "Time-to-GPT-2" leaderboard, demonstrating that GPT-2-grade performance can be achieved in approximately 3 hours on an 8xH100 node for under $100.
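As a hedged illustration of what a single depth knob can drive (my own toy rule of thumb, not nanochat's actual formulas), the remaining shape parameters can all be derived from depth:

```python
# Depth-driven config sketch (illustrative scaling rule, not nanochat's).
def config_from_depth(depth: int) -> dict:
    d_model = depth * 64        # assumed aspect ratio: width grows with depth
    return {
        "n_layers": depth,
        "d_model": d_model,
        "n_heads": d_model // 64,  # assumed fixed 64-dim attention heads
        "d_ff": 4 * d_model,       # conventional 4x MLP expansion
    }

print(config_from_depth(20))
# {'n_layers': 20, 'd_model': 1280, 'n_heads': 20, 'd_ff': 5120}
```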
