Sunday — June 7, 2026

A worker secured a religious exemption from AI use, LLMs tackled 98 research-level math problems, and Slopper combats AI "slop" on open-source projects.

News

Meta confirms 1000s of Instagram accounts were hacked by abusing its AI chatbot

Meta disclosed that attackers compromised approximately 20,000 Instagram accounts by exploiting a vulnerability in its AI-driven account recovery chatbot. The flaw sat in a code path that failed to check whether the requester's email matched the account's registered address, allowing attackers to redirect password reset links for accounts without 2FA. Meta has mitigated the risk by disabling the chatbot and fixing the underlying verification logic.
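
For readers who want the missing check in concrete terms, here is a minimal sketch of the kind of validation the summary says was absent. It is not Meta's code: the `Account` type, the `reset_link_recipient` function, and the normalization rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    # Hypothetical record; real account stores hold far more than this.
    username: str
    registered_email: str


def normalize(email: str) -> str:
    # Compare case-insensitively and ignore surrounding whitespace.
    return email.strip().lower()


def reset_link_recipient(account: Account, requested_email: str) -> Optional[str]:
    """Return the address a reset link may go to, or None to refuse the request."""
    if normalize(requested_email) != normalize(account.registered_email):
        # The requester supplied an address that is not on file: refuse.
        return None
    # Always send to the address on file, never to the one typed into the chat.
    return account.registered_email
```

The design point is that the recipient comes from the stored record, so a mismatched request can never redirect the link.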

Police in England and Wales told to halt AI use in court statements

Police forces in England and Wales have been ordered to halt the use of AI for generating witness and court statements. The directive aims to mitigate risks surrounding the accuracy and legal accountability of AI-generated evidence within the judicial system.

Meta Keeps Delaying the Release of Its New AI Model to Developers

Meta has repeatedly delayed the release of its latest frontier AI model to developers, with no current scheduled launch date. This postponement follows a missed "soon" release window signaled by Meta's AI leadership nearly two months ago. The delays highlight ongoing challenges in monetizing the company's massive capital investments in LLM development and infrastructure.

AI Can't Care

AI lacks the intrinsic capacity to care about output accuracy or reader value, making it an insufficient substitute for human judgment in communication. While effective as a research assistant or drafting tool, LLM-generated content should never be published without rigorous human oversight. Relying on raw AI output risks devaluing the audience's time and eroding long-term trust.

She won a religious exemption from using AI at work

Software engineers are successfully securing religious exemptions from mandatory AI tool usage by citing ethical and environmental concerns, a move supported by Title VII and recent papal encyclicals regarding human dignity. This trend presents a legal challenge for employers as AI adoption becomes a standard performance metric, potentially creating a subset of developers who maintain manual coding and review workflows. While some practitioners claim manual processes remain competitive in speed, industry experts warn that opting out of AI integration may impact long-term career trajectories as the technology becomes ubiquitous.

Research

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

Gaia2 is a benchmark for LLM agents designed for asynchronous, dynamic environments featuring temporal constraints and multi-agent collaboration. It utilizes a write-action verifier to enable fine-grained evaluation and RL from verifiable rewards. Initial testing shows GPT-5 leading with a 42% pass@1 score, though it faces challenges with time-sensitive tasks, while Kimi-K2 is the top-performing open-source model at 21%.
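
As background on the metric: pass@1 is the share of tasks solved on a single attempt. The sketch below shows a commonly used pass@k estimator computed from n sampled attempts of which c succeeded. Whether Gaia2 computes its scores this way is an assumption, since the summary does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds.

    n: attempts sampled for the task, c: attempts that passed, k: budget.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, the plain success rate per attempt.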

Paper: A Persona-Based Evaluation Framework for Generative AI Alignment

Current AI alignment paradigms rely on monolithic benchmarks that obscure the plurality of human judgment. This paper introduces a state-space constrained emulation framework that uses a manifold of synthetic cognitive profiles (evaluative personas) for pluralistic, perspective-dependent benchmarking. Generative architectures can instantiate these personas consistently, but the study finds that persona coherence degrades systematically under sequential inference and stochastic prompt perturbations, showing up as state-space drift and semantic inconsistency. The authors conclude that generative systems need embedded dynamic, viability-driven regulatory mechanisms to keep persona emulation robust and to support more adaptive, human-aligned AI evaluation.

Rethinking the Value of Generated Tests for LLM Software Engineering Agents

An analysis of six LLMs on SWE-bench Verified shows that agent-written tests do not significantly improve repository-level issue resolution. While agents frequently write tests, they primarily use them for observational feedback via print statements rather than formal assertions, and test-writing frequency remains consistent across both successful and failed tasks. Prompt-intervention studies confirm that adjusting test volume impacts process and cost without meaningfully changing final task outcomes.
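
The distinction between observational feedback and formal assertions is easy to see side by side. This is an illustrative sketch only: `slugify` is a made-up function under test, not something taken from the paper or from SWE-bench.

```python
def slugify(title: str) -> str:
    """Toy function under test (hypothetical)."""
    return "-".join(title.lower().split())


def observe() -> None:
    # Observational style: print the result and leave the judgment
    # to whoever (or whatever agent) reads the log afterwards.
    print("slugify ->", slugify("Hello  World"))


def test_slugify() -> None:
    # Assertion style: the expected value is written down,
    # so a regression fails loudly without anyone reading output.
    assert slugify("Hello  World") == "hello-world"
```

Both functions exercise the same code, but only the second one encodes what a correct result looks like.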

Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

Tangram is an LLM serving system designed to optimize non-uniform KV cache compression by addressing memory fragmentation and scheduling overhead. It utilizes Deterministic Budget Allocation for static memory footprints, Head Group Page for vectorized memory management, and AOT Load Balancing to ensure uniform GPU utilization. These optimizations deliver up to 2.6x throughput improvement over existing baselines without compromising model accuracy.

Benchmarks in Leipzig

A dataset of 100 research-level mathematics questions was developed by 49 mathematicians to benchmark LLM reasoning. Evaluation across three stages—ranging from single-shot attempts to multi-run tests with heavy-thinking models—saw the number of unsolved questions drop from 41 to 2. This progression highlights the rapidly advancing mathematical capabilities of state-of-the-art LLMs.

Code

Slopper GitHub Action: Fighting AI Slop Contributions on Open Source Projects

Slopper is a GitHub Action designed to filter low-value, AI-generated pull requests by assigning a risk score from 0 to 10. It utilizes deterministic heuristics to analyze AI fingerprints, contributor "spray scores," and account metadata without requiring external API keys. For deeper analysis, Slopper supports optional agentic checks using LLM providers to evaluate code quality, security concerns, and alignment between PR descriptions and diffs.
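
To give a feel for deterministic scoring of this kind, here is a rough sketch that folds three signals into a 0 to 10 score. The signal names, weights, and thresholds are invented for illustration and are not Slopper's actual heuristics.

```python
from dataclasses import dataclass


@dataclass
class PullRequestSignals:
    # All three fields are hypothetical stand-ins for the signal families
    # the summary mentions: AI fingerprints, spray behaviour, account metadata.
    ai_fingerprint_hits: int
    repos_touched_last_week: int
    account_age_days: int


def risk_score(s: PullRequestSignals) -> int:
    """Combine the signals into an integer score between 0 and 10."""
    score = 0.0
    score += min(s.ai_fingerprint_hits, 4) * 1.0             # up to 4 points
    score += min(s.repos_touched_last_week / 5, 1.0) * 4.0   # up to 4 points
    score += 2.0 if s.account_age_days < 30 else 0.0         # up to 2 points
    return round(min(score, 10.0))
```

Because every term is capped, the result stays within the range regardless of how extreme a single signal is, and the same inputs always give the same score.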

Sub-Agent MCP: LLM delegation and sub-agent orchestration via MCP

Sub-Agent MCP is a production-ready Python server for LLM delegation and sub-agent orchestration using the Model Context Protocol. It allows a parent LLM to delegate tasks to specialized LangChain sub-agents defined in YAML, each configured with its own OpenAI-compatible LLM, system prompt, and downstream MCP tool connections. This architecture reduces context bloat for the parent model by abstracting complex tool chains into discrete, role-specific agents with granular tool allowlisting.
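
The tool allowlisting idea can be sketched in a few lines. This is an assumption-laden illustration rather than the project's API: `SubAgentSpec`, `visible_tools`, and the field names are hypothetical, and a plain dataclass stands in for the YAML definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class SubAgentSpec:
    # Hypothetical shape of one sub-agent definition.
    name: str
    system_prompt: str
    allowed_tools: FrozenSet[str]


def visible_tools(
    spec: SubAgentSpec, downstream_tools: Dict[str, Callable]
) -> Dict[str, Callable]:
    """Return only the downstream tools this sub-agent is allowed to call."""
    return {
        tool_name: fn
        for tool_name, fn in downstream_tools.items()
        if tool_name in spec.allowed_tools
    }
```

The parent model then sees a single delegate per role, while each sub-agent is handed only the filtered tool set.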

Nanocode-CLI – A lightweight terminal-based AI coding assistant

nanocode is a Python-based terminal coding agent that optimizes LLM interactions through a "file-state brain" and cache-aware context management. Key features include live turn control for mid-execution input, stale-edit protection via line:hash anchors, and project-aware navigation using symbol indexing. It maximizes prompt-cache efficiency by structuring stable context early and provides a terminal-first workflow with integrated shell, git, and file-editing tools.
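
Stale-edit protection with line:hash anchors can be illustrated as follows. This is a sketch of the general idea, not nanocode's implementation: the anchor format, the 8-character SHA-256 prefix, and the function names are assumptions.

```python
import hashlib
from typing import List


def anchor(line_no: int, text: str) -> str:
    """Build a 'line:hash' anchor from a 1-based line number and its content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{line_no}:{digest}"


def apply_edit(lines: List[str], anchor_str: str, new_text: str) -> List[str]:
    """Replace one line, but only if it still matches the anchor taken earlier."""
    line_no_str, _expected_hash = anchor_str.split(":", 1)
    idx = int(line_no_str) - 1
    if not 0 <= idx < len(lines):
        raise ValueError("anchor points outside the file")
    if anchor(idx + 1, lines[idx]) != anchor_str:
        # The file changed after the model read it, so the edit is stale.
        raise ValueError("stale edit: line changed since it was read")
    updated = list(lines)
    updated[idx] = new_text
    return updated
```

An edit prepared against an old view of the file is rejected instead of silently overwriting newer content.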

You can't detect your way out of catastrophic LLM failure

The IGO (Observational Governance Infrastructure) framework utilizes a four-layer architecture—Metrics, Circuit-breaker, Adaptation, and Containment—to manage LLM risks across "recoverable" and "ruin" lanes. Dialectic red teaming of Claude Opus 4.8 invalidated the "hash analogy" of LLM errors, proving they possess traceable semantic signatures, and established that containment must remain sovereign over detection to mitigate adversarial step-function risks. The system relies on the Cognitive Predictability Index (CPI) to measure temporal stability and trigger safety gates based on the temporal standard deviation of model confidence.
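
A gate driven by the temporal standard deviation of confidence might look like the sketch below. The summary does not give the CPI formula, so this is only an illustration of the gating idea: the class name, window size, and threshold are hypothetical.

```python
from collections import deque
from statistics import pstdev


class ConfidenceGate:
    """Trip when recent model confidence becomes too unstable over time."""

    def __init__(self, window: int = 5, max_std: float = 0.1) -> None:
        self.history = deque(maxlen=window)  # most recent confidence values
        self.max_std = max_std               # hypothetical stability threshold

    def observe(self, confidence: float) -> bool:
        """Record one confidence value; return True if the gate should trip."""
        self.history.append(confidence)
        if len(self.history) < self.history.maxlen:
            # Not enough observations yet to judge stability.
            return False
        return pstdev(self.history) > self.max_std
```

A steady run of similar confidence values keeps the gate closed, while a sudden swing inside the window trips it.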

NotifyMe, a self-hosted beeper app for AI agents and service updates

NotifyMe is a self-hosted, open-source notification system for developers and AI-agent jobs, enabling personal webhook URLs to push alerts directly to a phone. Designed for monitoring long-running tasks like CI pipelines or LLM agent executions, it leverages Firebase for a private, user-controlled deployment. The system supports custom JSON and Atlassian Statuspage payloads, delivering notifications via FCM to a Flutter app with an inbox and tappable URLs.
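
Pushing an alert to a personal webhook at the end of a long job could be as small as the sketch below. The payload fields (`title`, `body`, `url`) and the `build_alert` helper are assumptions for illustration, so check the project's documentation for the real schema.

```python
import json
import urllib.request


def build_alert(
    webhook_url: str, title: str, body: str, link: str = ""
) -> urllib.request.Request:
    """Prepare a JSON POST request for a webhook; the caller decides when to send it."""
    payload = {"title": title, "body": body}  # hypothetical field names
    if link:
        payload["url"] = link
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect before anything goes over the network.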