Thursday — January 8, 2026

Notion AI faces an unpatched data exfiltration vulnerability, MemoryGraft research demonstrates persistent LLM agent compromise via poisoned experiences, and Polyharmonic Cascade enables deep learning without gradient descent.

Interested in AI engineering? Let's talk

News

LMArena is a cancer on AI

LMArena is a flawed LLM evaluation metric because its crowdsourced voting system prioritizes verbosity, formatting, and "vibes" over factual accuracy. Analysis shows that users frequently reward hallucinations and confident-sounding errors, incentivizing labs to optimize for engagement rather than truthfulness. This reliance on superficial metrics creates a misalignment between leaderboard rankings and the development of reliable, high-utility models.

Notion AI: Unpatched data exfiltration

Notion AI is vulnerable to data exfiltration via indirect prompt injection because it renders AI-generated Markdown images before receiving user approval for document edits. Attackers can exploit this by using poisoned documents to force the LLM to append sensitive data to an external URL as an image source, which the browser automatically requests. This unpatched vulnerability allows for silent data exfiltration even if the user ultimately rejects the suggested AI changes.

LLM Problems Observed in Humans

The text draws parallels between human cognitive limitations and LLM failure modes, noting that humans often exhibit small context windows, narrow training sets, and persistent hallucinations. It highlights human deficiencies in generalization, real-time error correction, and instruction following compared to modern models. Ultimately, the author suggests that as LLMs improve, human conversational flaws like mode collapse and reward hacking become more apparent, potentially making AI the superior partner for complex reasoning and knowledge retrieval.

KeelTest – AI-driven VS Code unit test generator with bug discovery

KeelTest is a VS Code extension that utilizes an agentic AI pipeline to generate production-grade pytest suites for Python. By combining LLM capabilities with static analysis and AST parsing, it achieves high pass rates through self-healing generation and automated dependency mocking. The tool also identifies source code bugs and provides fix suggestions during the test verification process.

AI Psychosis, AI Apotheosis

The term "AI psychosis" is evolving from a pejorative for skeptics into a description of the manic, hyper-productive state experienced by users of advanced LLMs and tools like Claude Code. This shift reflects a transition toward "moldable personal software" and the realization of an "exocortex," granting individuals capabilities previously reserved for large organizations. The author likens this moment to the early Promethean era of personal computing, where users "steal" power from corporate entities to achieve personal technological liberation.

Research

Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

MemoryGraft is a novel indirect injection attack that compromises LLM agents by poisoning their long-term memory with malicious "successful experiences." It exploits the semantic imitation heuristic, where agents replicate procedural patterns from retrieved RAG records, to induce persistent behavioral drift across sessions. This attack demonstrates how experience-based self-improvement can be weaponized for stealthy, durable compromise by implanting unsafe templates that are reliably surfaced during future task execution.

Market Beliefs about Open vs. Closed AI

Research indicates AI's economic impact influences interest rates. Extending prior work on proprietary AI models correlating with declining US bond yields, this study finds open-weight AI model introductions lead to shifts in long-term bond yields in the opposite direction. This suggests markets anticipate distinct economic implications from open versus closed AI advancements across treasuries, corporate bonds, and TIPS.

FlashInfer-Bench: Building the Virtuous Cycle for AI-Driven LLM Systems

FlashInfer-Bench is a closed-loop framework designed to bridge the gap between AI-generated GPU kernels and production LLM inference systems. It introduces FlashInfer Trace for unified kernel schema and a benchmarking suite based on real serving traces to evaluate LLM agents' programming capabilities. The system enables seamless kernel deployment into engines like SGLang and vLLM via a dynamic substitution mechanism, facilitating continuous performance optimization.

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators

KernelEvolve is an agentic framework that automates kernel generation and optimization for DLRMs across heterogeneous hardware, including NVIDIA, AMD, and Meta accelerators. It employs RAG-augmented prompt synthesis and graph-based search across multiple programming abstractions like Triton and CuTe to navigate the full hardware-software optimization stack. The system achieves 100% correctness on KernelBench and ATen operators, reducing development time from weeks to hours while significantly outperforming PyTorch baselines in production environments.

Measuring AI Agents in Production

A large-scale study of 306 practitioners reveals that production AI agents favor simple, controllable architectures, with 70% relying on prompting off-the-shelf models and 68% limiting autonomous execution to under 10 steps. Reliability and evaluation remain the primary bottlenecks, leading 74% of organizations to depend on human evaluation for quality control. Despite these challenges, straightforward deployment patterns are already delivering significant impact across diverse industries.

Code

Creators of Tailwind laid off 75% of their engineering team

Tailwind CSS documentation is built with Next.js and utilizes pnpm for local development and dependency management. The project is not open-source, as Tailwind Labs Inc. retains all intellectual property rights, permitting public access only for educational use and minor contributions.

Ba

ba is a lightweight, infrastructure-free task tracker optimized for LLM sessions and multi-agent coordination. It employs a session-based ownership model to manage task claiming and prevent redundant work, storing data in git-friendly JSONL files. The tool features dependency tracking, a Claude Code plugin, and JSON output for seamless programmatic integration.

RepoReaper – AST-aware, JIT-loading code audit agent (Python/AsyncIO)

RepoReaper is an autonomous code auditing agent that utilizes AST-aware parsing and a ReAct loop to perform deep architectural analysis and semantic search. It redefines RAG as a dynamic context cache, employing JIT file reads to resolve semantic gaps and a hybrid search mechanism (BM25 + Vector) with RRF for high-fidelity retrieval. The system is built on a high-throughput asynchronous pipeline using FastAPI, ChromaDB, and DeepSeek-V3.

Unified multimodal memory framework, without embeddings

MemU is an agentic memory framework that organizes multimodal data into a hierarchical structure of resources, items, and categories. It features a dual retrieval system, utilizing RAG for high-speed vector search and LLM-based reasoning for deep semantic understanding. The framework supports self-evolving memory with full traceability and is available as a cloud API or a self-hosted Python package compatible with pgvector.

Deep learning without gradient descent, 500 layers, no skip connections

Polyharmonic Cascade is a deep learning architecture derived from first principles, specifically random function theory and indifference postulates. It leverages polyharmonic spline packages to enable efficient computation and differentiation, supporting depths of up to 500 layers without skip connections. Benchmarks demonstrate high performance on MNIST and HIGGS datasets without the need for convolutions or data augmentation.