Thursday, March 12, 2026

McKinsey’s AI platform Lilli is breached via SQL injection, Covenant-72B pre-trains a 72B LLM with trustless peers over the internet, and Aver is a new language designed for AI to write and humans to review.

News

How we hacked McKinsey's AI platform

CodeWall.ai’s autonomous offensive agent breached McKinsey’s internal AI platform, Lilli, by exploiting an unauthenticated SQL injection vulnerability in a public API. The agent gained full read/write access to a production database containing 46.5 million chat messages, 728,000 proprietary documents, and 3.68 million RAG chunks. That access also allowed system prompts to be manipulated, showing how attackers could poison AI advice and bypass guardrails at the prompt layer.
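To make the vulnerability class concrete, here is a minimal sketch of SQL injection and its standard fix, using Python's built-in sqlite3 module. The table, column names, and payload are hypothetical and chosen only for illustration; nothing here reflects Lilli's actual schema, database engine, or API.

```python
import sqlite3


def build_db():
    # Hypothetical in-memory table standing in for a chat-message store.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, owner TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (owner, body) VALUES (?, ?)",
        [("alice", "q3 draft"), ("bob", "client memo")],
    )
    return conn


def find_unsafe(conn, owner):
    # Vulnerable: user input is spliced directly into the SQL text,
    # so quotes in the input can change the structure of the query.
    query = f"SELECT body FROM messages WHERE owner = '{owner}'"
    return [row[0] for row in conn.execute(query)]


def find_safe(conn, owner):
    # Parameterized: the value is bound by the driver and is never
    # parsed as SQL, whatever characters it contains.
    query = "SELECT body FROM messages WHERE owner = ?"
    return [row[0] for row in conn.execute(query, (owner,))]


conn = build_db()
payload = "x' OR '1'='1"             # classic tautology payload
leaked = find_unsafe(conn, payload)  # matches every row in the table
bound = find_safe(conn, payload)     # matches no row: no owner has that name
```

With the unsafe version the payload turns the WHERE clause into a condition that is always true, so every message is returned. The parameterized version treats the same string as an ordinary (non-matching) owner name.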

I built a tool that watches webpages and exposes changes as RSS

Site Spy is a website monitoring tool that provides automated change detection, visual diffs, and element-level tracking via a browser extension or web dashboard. It features an MCP server integration, enabling AI agents and LLMs to programmatically monitor URLs, compare snapshots, and summarize content updates directly within compatible environments like Claude or Cursor.
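As an illustration of snapshot comparison in general, the sketch below diffs two text snapshots with Python's standard difflib module. The function name and sample snapshots are hypothetical; this is not Site Spy's implementation, which the summary does not describe.

```python
import difflib


def snapshot_diff(old_text, new_text):
    # Return unified-diff lines between two text snapshots of a page.
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


old = "Price: $10\nIn stock"
new = "Price: $12\nIn stock"

# Keep only added/removed lines, skipping the '---' and '+++' file headers.
changes = [
    line
    for line in snapshot_diff(old, new)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

An empty diff means the page is unchanged; a non-empty one could be turned into a feed entry or passed to a summarizer.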

Klaus – OpenClaw on a VM, batteries included

Klaus is a managed, "batteries-included" distribution of OpenClaw that provides users with dedicated cloud VMs for persistent, personalized AI assistance. It features preconfigured integrations for Google Workspace, browser automation, and lead generation tools, removing the need for manual API key management. The platform supports multi-channel deployment across Slack, Telegram, and web interfaces to automate complex workflows like inbox management and code generation.

TADA: Speech generation through text-acoustic synchronization

TADA (Text-Acoustic Dual Alignment) is an open-source TTS framework that synchronizes text and audio by mapping one continuous acoustic vector to each text token. By eliminating the token-rate mismatch inherent in traditional LLM-based TTS, TADA achieves a real-time factor (RTF) of 0.09 and virtually zero hallucinations. The architecture utilizes a flow-matching head conditioned on LLM hidden states, offering high context efficiency and a footprint suitable for on-device deployment.
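For readers unfamiliar with the metric, real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below works through that arithmetic; the timings are made-up inputs chosen to reproduce an RTF of 0.09, not measurements from the TADA project.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF = compute time / duration of generated audio.
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timing: 0.9 s of compute to produce 10 s of speech.
rtf = real_time_factor(0.9, 10.0)

# The inverse reads as "how many times faster than real time".
speedup = 1.0 / rtf
```

At an RTF of 0.09, ten seconds of speech takes under one second to generate, roughly 11 times faster than playback.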

Vanilla JavaScript refinery simulator built to explain my job to my kids

"The Great Refinery Run" is an interactive STEM simulation that models the end-to-end petroleum refining lifecycle, from crude extraction to logistics. The platform gamifies complex industrial workflows (such as distillation, FCC, and hydrotreating) while requiring users to execute SOPs for equipment maintenance and process optimization. It provides a structured, domain-specific environment for understanding the chemical engineering logic and operational constraints of global fuel production.

Research

Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet

Covenant-72B is a 72B-parameter LLM pre-trained on 1.1T tokens using a globally distributed, permissionless framework integrated with a live blockchain protocol. By leveraging the SparseLoCo optimizer to manage communication efficiency and dynamic peer participation, the project demonstrates that decentralized, non-whitelisted training can achieve performance competitive with centralized models at an unprecedented scale.

Humans can learn to detect AI-generated texts, or at least learn when they can't

This study investigated whether 254 native Czech speakers could learn to distinguish human-written texts from GPT-4o-generated ones. Participants who received immediate feedback significantly improved their accuracy and confidence calibration, correcting initial misconceptions about AI text features such as stylistic rigidity and readability. Participants without feedback frequently made errors while highly confident, a problem that immediate feedback largely resolved. The findings indicate that explicit feedback is crucial both for learning to differentiate AI-generated content and for more accurate self-assessment.

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

The Token Games (TTG) is an automated evaluation framework where LLMs challenge each other by generating and solving Python-based programming puzzles. By computing Elo ratings from these pairwise duels, the system ranks frontier models similarly to human-curated benchmarks like Humanity's Last Exam while mitigating costs and data contamination. The framework also identifies puzzle creation as a distinct, difficult reasoning skill currently underserved by traditional benchmarks.
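The rating step can be pictured with the standard Elo update, sketched below. The K-factor of 32, the starting rating of 1500, and the function names are assumptions for illustration; the summary does not state which Elo variant or parameters the paper uses.

```python
def expected_score(rating_a, rating_b):
    # Probability that A beats B under the standard Elo logistic model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    # score_a is 1.0 if A wins the duel, 0.0 if A loses, 0.5 for a draw.
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; model A wins one puzzle duel.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

Because the expected score depends on the rating gap, an upset win against a stronger model moves ratings more than a win against a weaker one.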

The Controllability Trap: A Governance Framework for Military AI Agents

The Agentic Military AI Governance Framework (AMAGF) addresses unique control failures in agentic AI systems through Preventive, Detective, and Corrective pillars. It introduces the Control Quality Score (CQS), a real-time composite metric that shifts governance from binary oversight to a continuous model of measurable human control. The framework formalizes evaluation metrics and institutional responsibilities to manage risks across the operational lifecycle.

Fungal Electronics (2021)

Fungal electronics utilize mycelium-based composites to create living devices capable of modulating impedance and generating electrical spikes in response to external stimuli. These systems function as bio-integrated sensors and computing units suitable for wearables and smart materials.

Code

Open-source browser for AI agents

Chromium is an open-source browser project focused on security, speed, and stability. Developers must follow specific checkout procedures instead of standard git clone and adhere to a product-centric directory structure. Documentation is centralized in docs/README.md, with bug tracking managed via crbug.com.

A context-aware permission guard for Claude Code

nah is a security framework for Claude Code that intercepts tool calls via a PreToolUse hook to provide granular, context-aware permissioning. It utilizes a fast deterministic classifier to evaluate command intent and content, with an optional LLM layer for resolving ambiguous cases. The system allows for fine-grained control over filesystem access, shell commands, and network activity through configurable action types and safety profiles.
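The two-tier idea (deterministic rules first, with ambiguous cases deferred) can be sketched as follows. The rule sets, verdict names, and function are hypothetical and far simpler than any real policy; they are not nah's actual rules, configuration format, or hook interface.

```python
import shlex

# Hypothetical rule sets, chosen only for illustration.
ALLOW = {"ls", "cat", "grep"}
DENY = {"rm", "mkfs", "dd"}


def classify(command):
    # Return "allow", "deny", or "ask" for a shell command string.
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: cannot parse reliably, so defer the decision.
        return "ask"
    if not tokens:
        return "ask"
    program = tokens[0]
    if program in DENY:
        return "deny"
    if program in ALLOW:
        return "allow"
    # Anything unrecognized is ambiguous; in a layered design this is
    # where an optional LLM check or a human prompt would take over.
    return "ask"
```

Defaulting unknown commands to "ask" rather than "allow" keeps the deterministic layer conservative: it only fast-paths cases it can decide with certainty.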

AutoKernel: Autoresearch for GPU Kernels

AutoKernel is an autonomous research framework that optimizes GPU kernels for PyTorch models using AI agents. It profiles models to identify bottlenecks, extracts them as Triton or CUDA C++ kernels, and iteratively refines code through an automated edit-benchmark-verify loop. The system prioritizes optimizations via Amdahl's law, supports major architectures like LLaMA and BERT, and integrates with KernelBench for standardized performance evaluation.
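Amdahl's law explains why prioritization matters: speeding up a kernel only helps in proportion to the share of runtime it occupies. The sketch below applies the standard formula to a made-up profile; the kernel names, runtime fractions, and the assumed 2x per-kernel speedup are hypothetical and not taken from AutoKernel.

```python
def amdahl_speedup(fraction, kernel_speedup):
    # Overall speedup when `fraction` of total runtime is made
    # `kernel_speedup` times faster and the rest is unchanged.
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Hypothetical profile: share of end-to-end runtime per kernel.
profile = {"attention": 0.60, "layernorm": 0.10, "mlp": 0.30}

# Rank kernels by the end-to-end gain from a (hypothetical) 2x speedup each.
ranked = sorted(
    profile,
    key=lambda name: amdahl_speedup(profile[name], 2.0),
    reverse=True,
)

best_case = amdahl_speedup(profile["attention"], 2.0)
```

Doubling the speed of a kernel that takes 60% of runtime yields about a 1.43x overall speedup, while the same effort on a 10% kernel yields about 1.05x.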

Agent-debate – AI agents review code by editing a shared Markdown file

Agent-debate enables multiple AI agents (e.g., Claude, Codex, Gemini, Copilot) to collaboratively review and refine technical decisions. Agents edit a shared markdown file, using strikethroughs for disagreement and file:line citations for evidence, with a protocol to track disputes and converge or escalate. This adversarial approach enhances code review by catching blind spots, enforcing evidence-based arguments, and preventing scope creep, providing informed recommendations for human developers.

Aver – a language designed for AI to write and humans to review

Aver is a statically typed language optimized for AI-generated code, designed for human review and deployment. It explicitly addresses "missing intent" by integrating effects into function signatures, colocating design decisions, and using verify blocks for pure logic. Aver provides "aver context" to export a contract-level view for LLMs or human reviewers, and supports deterministic replay for effectful code. It compiles to Rust for deployment and exports to Lean for formal verification of pure components.