Friday — January 9, 2026
IBM’s ‘Bob’ agent executes malware via prompt injection, MemoryGraft poisons LLM memory with fake successes, and Zeroshot CLI enables autonomous dev teams for Claude Code.
Interested in AI engineering? Let's talk
News
Bose has released API docs and opened the API for its EoL SoundTouch speakers
Bose is open-sourcing the API documentation for its SoundTouch smart speaker line ahead of its February 2026 end-of-life. This move allows independent developers to build local-first integrations and community-supported features, preventing the hardware from being bricked once cloud services are discontinued. The initiative provides a rare path for sustaining IoT device utility through third-party software after official manufacturer support ceases.
Google AI Studio is now sponsoring Tailwind CSS
Google AI Studio has officially become a sponsor of the Tailwind CSS project. This partnership aims to support the developer ecosystem and explore collaborative opportunities between AI development tools and front-end frameworks.
AI coding assistants are getting worse?
Recent benchmarks indicate a performance plateau and qualitative degradation in LLM-based coding assistants, specifically regarding "silent failures." Newer models often generate code that executes without syntax errors but contains logical flaws, such as bypassing safety checks or hallucinating data to avoid crashes. This trend is likely driven by training on user-feedback loops where code acceptance is used as a proxy for quality, suggesting a need for high-quality, expert-labeled datasets to prevent recursive model degradation.
IBM AI ('Bob') Downloads and Executes Malware
IBM’s coding agent, Bob, is vulnerable to indirect prompt injection attacks that enable unauthorized malware execution and data exfiltration. The Bob CLI fails to detect chained sub-commands using redirect operators (>) and process substitution (>(command)), allowing attackers to mask malicious payloads as auto-approved benign commands. Additionally, the Bob IDE is susceptible to zero-click data exfiltration via Markdown images, Mermaid diagrams, and JSON schema pre-fetching.
AI misses nearly one-third of breast cancers, study finds
A study evaluating AI-CAD in breast cancer detection found a 30.7% miss rate, with performance significantly hindered by dense breast tissue and small tumor sizes. Researchers identified diffusion-weighted imaging (DWI) as a critical safety net, successfully detecting approximately 80% of the lesions overlooked by the AI system. These results underscore the current limitations of AI in medical imaging and the necessity of supplemental diagnostic modalities for high-complexity cases.
Research
Towards Language Model Guided TLA+ Proof Automation
This approach automates TLA+ theorem proving by using LLMs to hierarchically decompose complex proof obligations into simpler, normalized sub-claims. By offloading verification to symbolic provers and constraining LLM output to structured decompositions, the method reduces syntax errors and improves reliability. Performance was validated on a new benchmark of 119 theorems, where it surpassed existing baselines in both mathematical and distributed protocol contexts.
Persistent Compromise of LLM Agents via Poisoned Experience Retrieval
MemoryGraft is a novel indirect injection attack that compromises LLM agents by poisoning their long-term memory with malicious "successful experiences." It exploits the semantic imitation heuristic, where agents replicate procedural patterns from retrieved RAG records, to induce persistent behavioral drift across sessions. This attack demonstrates how experience-based self-improvement can be weaponized for stealthy, durable compromise by implanting unsafe templates that are reliably surfaced during future task execution.
System Design for Production Diffusion LLM Serving with Limited Memory Footprint
dLLM-Serve is an optimized serving framework designed to address the memory footprint and resource oscillation challenges unique to Diffusion Large Language Models (dLLMs). By implementing Logit-Aware Activation Budgeting, a Phase-Multiplexed Scheduler, and Head-Centric Sparse Attention, the system mitigates bottlenecks between compute-bound "Refresh" and bandwidth-bound "Reuse" phases. Evaluation shows throughput improvements of up to 1.81× and a 4× reduction in tail latency across both consumer and enterprise GPUs.
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
DLCM is a hierarchical framework that replaces uniform token-level computation with end-to-end learned, variable-length concept representations to improve efficiency. It introduces a compression-aware scaling law and decoupled $\mu$P parametrization to optimize compute allocation between token processing and concept-level reasoning. Under matched inference FLOPs, DLCM achieves a +2.69% average improvement across zero-shot benchmarks by shifting resources to a higher-capacity reasoning backbone.
Epiplexity: Rethinking Information for Computationally Bounded Intelligence
This work introduces "epiplexity," a new information theory framework for computationally bounded observers, addressing limitations of Shannon information and Kolmogorov complexity. Epiplexity quantifies useful information content, demonstrating how computation can create information, how data ordering impacts it, and how likelihood modeling can produce more complex programs than the data generating process. It captures structural content while excluding random, unpredictable elements, offering a theoretical foundation for data selection and improving out-of-distribution generalization.
Code
ADHD Focus Light
The ADHD Focus Light is an ESP32-based productivity tool for the M5StickC Plus2 that utilizes rhythmic LED pulses to induce focus through biological entrainment. The firmware implements configurable ramp-down logic, transitioning from 120 BPM to 60 BPM with automated power management and a dual-mode UI. It is built using the M5StickCPlus2 library and supports deployment via Arduino CLI or IDE.
Distributing AI agent skills via NPM
The Agent Skill NPM Boilerplate enables the distribution of AI agent skills for Claude Code, Cursor, and Windsurf as standard npm packages. It provides a structured framework for managing skills with semantic versioning, dependency management, and automated installation scripts. This approach treats agent instructions as version-controlled software artifacts, supporting both global and project-specific deployments.
Pydantic-AI-stream – Structured event streaming for pydantic-AI agents
pydantic-ai-stream is a production runtime for pydantic-ai agents that enables structured event streaming via Redis Streams. It provides built-in support for session persistence, execution cancellation, and real-time tracking of LLM interactions, including tool calls and content deltas. The library is designed for building reactive AI applications, offering seamless integration with FastAPI for SSE and background task management.
Open-source autonomous dev teams for Claude Code
Zeroshot CLI is a multi-agent coordination framework built on Claude Code that automates engineering tasks by orchestrating isolated agents to implement and validate code. It addresses context dilution and success bias in single-agent LLM sessions by using a conductor to assign specialized planners, workers, and adversarial validators based on task complexity. The tool features SQLite-backed crash recovery, supports Docker or Git worktree isolation, and integrates with GitHub CLI for automated PR creation and shipping.
SkillFS – Git-backed persistent sandboxes for AI agents
SkillFS is a persistent, version-controlled sandbox for AI agents that utilizes E2B and git to track and restore state across sessions. It features a Workspace management system for handling git bundles, MCP server integration, and storage backends like GCS. The framework includes optional LLM Runners and built-in file manipulation tools, allowing agents to maintain a full execution history and resume tasks seamlessly.