Wednesday — January 7, 2026
Claude Opus 4.5 enables AI agents to build full-stack applications, ARTEMIS agents outperform 90% of human professionals in pen testing, and Symbolic Circuit Distillation extracts human-readable algorithms from transformer circuits.
Interested in AI engineering? Let's talk
News
Opus 4.5 is not the normal AI agent experience that I have had thus far
Claude Opus 4.5 enables AI agents to build complex, full-stack applications by autonomously operating CLIs, managing backends like Firebase, and debugging their own code. This shift moves software engineering toward an "AI-first" paradigm where code is optimized for LLM reasoning and regenerability rather than human readability. By prioritizing linear control flow and minimal abstractions, developers can leverage agents to one-shot utilities and integrate sophisticated services like auth and routing in hours.
Vietnam bans unskippable ads
Vietnam's Decree No. 342, effective February 2026, mandates that all online video ads include a skip button within five seconds and that static ads be immediately dismissible. The regulation prohibits deceptive UI patterns and requires platforms to implement standardized reporting mechanisms for illegal content. Additionally, the decree imposes stricter oversight on advertising for sensitive sectors, including healthcare and pharmaceuticals.
Locating a Photo of a Vehicle in 30 Seconds with GeoSpy
Graylark’s GeoSpy suite employs a two-step AI workflow for precise photo geolocation: geoestimation for regional predictions (1-25km) and geomatching for meter-level accuracy. While geoestimation models global visual features to infer locations, the Superbolt geomatching system utilizes dense, geotagged reference databases and optimized image processing to scale to billions of images. This architecture allows investigators to transition from broad visual inferences to actionable coordinates, even in challenging conditions like low light or minimal landmarks.
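The coarse-then-fine workflow can be sketched as a two-stage pipeline: a regional estimate first narrows the candidate set, then matching against a geotagged reference database refines the fix. This is a minimal illustration, not GeoSpy's implementation; all function and field names here are assumptions.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def geolocate(query_embedding, coarse_model, reference_db, radius_km=25):
    """Two-stage lookup: a coarse regional estimate narrows the search,
    then matching against geotagged references refines it."""
    region_center = coarse_model(query_embedding)   # stage 1: ~1-25 km estimate
    candidates = [r for r in reference_db
                  if haversine_km(region_center, r["coords"]) <= radius_km]
    if not candidates:
        return region_center                        # fall back to the coarse fix
    # Stage 2: pick the closest reference in embedding space (dot product).
    best = max(candidates,
               key=lambda r: sum(q * v for q, v in zip(query_embedding, r["embedding"])))
    return best["coords"]
```

The key design point the article describes is that stage 2 only has to search a small geographic neighborhood, which is what lets a dense reference database scale to billions of images.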
enclose.horse
Enclose.horse is a grid-based optimization game where players maximize an enclosed area using a limited wall budget. Originally inspired by Leetcode and Advent of Code pathfinding challenges, the platform features a level editor with an integrated solver and daily puzzles. The game serves as a spatial optimization problem involving movement constraints and area maximization.
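The core scoring problem, maximizing enclosed area under a wall budget, reduces to a classic flood-fill check: a cell counts as enclosed only if a fill started from outside the grid can never reach it. A minimal sketch (not the site's actual solver):

```python
from collections import deque

def enclosed_area(width, height, walls):
    """Count cells fully enclosed by walls: non-wall cells that a
    flood fill seeded at the grid border can never reach.
    `walls` is a set of (x, y) wall cells."""
    outside = set()
    queue = deque()
    # Seed the fill with every non-wall border cell.
    for x in range(width):
        for y in range(height):
            if (x in (0, width - 1) or y in (0, height - 1)) and (x, y) not in walls:
                outside.add((x, y))
                queue.append((x, y))
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < width and 0 <= ny < height
                    and (nx, ny) not in walls and (nx, ny) not in outside):
                outside.add((nx, ny))
                queue.append((nx, ny))
    return width * height - len(walls) - len(outside)
```

For example, an eight-wall ring on a 5×5 grid encloses exactly the one cell at its center; the solver's job is to find wall placements that maximize this count for a given budget.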
The skill of the future is not 'AI', but 'Focus' (2025)
LLMs risk atrophying engineering problem-solving skills by prioritizing immediate exploitation over the exploration inherent in traditional search and manual debugging. While effective for repetitive tasks, LLMs struggle with novel problems, requiring engineers to maintain "focus" and deep technical mastery to validate outputs and understand underlying reasoning. The shift from solving problems to merely obtaining solutions threatens the foundational skills necessary to tackle complex, non-standard challenges.
Research
Comparing AI agents to cybersecurity professionals in real-world pen testing
ARTEMIS is a multi-agent scaffold featuring dynamic prompt generation and automatic vulnerability triaging that outperformed 90% of human professionals in a live enterprise network evaluation. While demonstrating superior systematic enumeration and parallel exploitation at a lower cost, the agent exhibited higher false-positive rates and limitations with GUI-based tasks. The study positions ARTEMIS as a significant advancement over existing scaffolds like Codex and CyAgent in real-world cybersecurity applications.
AI Propaganda factories with language models
AI-powered influence operations are now feasible end-to-end on commodity hardware using small language models (SLMs), generating coherent, persona-driven political messaging that can be automatically evaluated. Research indicates persona design is more influential than model identity, and engagement with counter-arguments strengthens ideological adherence and increases extreme content. This accessibility necessitates a defense shift from restricting model access to conversation-centric detection and disruption of campaigns, leveraging the operations' inherent consistency as a detection signature.
Beyond Full Builds: GPU Optimized LLM Framework with Minimal Executable Programs
This LLM framework optimizes GPU hotspot kernels by extracting them into Minimal Executable Programs (MEPs), enabling iterative performance feedback without the overhead of full application builds. It utilizes Automatic Error Repair and Performance Pattern Inheritance to refine tiling, memory, and synchronization strategies across NVIDIA and AMD/DCU architectures. The approach achieves significant speedups, outperforming direct LLM optimization while ensuring cross-platform portability and reduced search costs.
Operationalizing Machine Learning: An Interview Study
Based on interviews with 18 machine learning engineers (MLEs) across diverse sectors, this study identifies Velocity, Validation, and Versioning as the three critical variables for successful MLOps. It analyzes the end-to-end lifecycle—including data labeling, experimentation, and production monitoring—to highlight common anti-patterns and provide design implications for future ML engineering tools.
The Invisible Hand of AI Libraries Shaping Open Source Projects and Communities
This research addresses the underexplored adoption and impact of AI in Open Source Software (OSS) projects. It aims to assess AI library integration in Python and Java OSS projects and analyze its influence on development, including technical ecosystems and community engagement. The study will conduct a large-scale analysis of 157.7k OSS repositories, comparing projects that adopt AI libraries against those that do not using various software metrics, expecting to reveal differences in development activity, community engagement, and code complexity.
Code
Mantic.sh – A structural code search engine for AI agents
Mantic is a structural code search engine for AI agents, designed to provide sub-500ms file ranking across massive codebases without relying on embeddings, vector databases, or external dependencies. It infers intent from file structure and metadata, enabling efficient context retrieval, reduced token usage, and local-first privacy. Mantic integrates as an MCP server for LLM-powered tools like Claude Desktop and Cursor, allowing agents to quickly find relevant code.
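The embedding-free approach can be sketched as a cheap in-memory scan that scores files purely on structural signals such as path segments and nesting depth. The heuristics below are illustrative assumptions, not Mantic's actual scoring:

```python
import os

def rank_files(query_terms, file_paths, limit=5):
    """Rank files for a query using only structural signals
    (filename, directory names, path depth), with no embeddings
    or external index."""
    terms = [t.lower() for t in query_terms]
    scored = []
    for path in file_paths:
        name = os.path.basename(path).lower()
        segments = path.lower().split("/")
        score = 0.0
        for t in terms:
            if t in name:
                score += 3.0                                        # filename hits weigh most
            score += sum(1.0 for s in segments[:-1] if t in s)      # directory-name hits
        score -= 0.1 * len(segments)                                # prefer shallow files on ties
        if score > 0:
            scored.append((score, path))
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [path for _, path in scored[:limit]]
```

Because the whole pass is string matching over paths, latency scales with the file list rather than with an index build, which is the property that makes sub-500ms ranking on large codebases plausible.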
Symbolic Circuit Distillation: prove program to LLM circuit equivalence
Symbolic Circuit Distillation automates the extraction of human-readable algorithms from small, mechanistic transformer circuits. The method treats a pruned circuit as a teacher, training a ReLU surrogate network to precisely match its behavior on a bounded token domain. Candidate high-level programs are then synthesized from a template-guided DSL, and SMT-based bounded equivalence checking formally verifies if a program matches the circuit, providing counterexamples if not. This approach recovers known algorithmic motifs, matches circuit behavior with high fidelity, and exposes subtle failure modes, addressing a core bottleneck in mechanistic interpretability.
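The verification step can be illustrated with a toy bounded equivalence check. The paper discharges this with an SMT solver; over a small enough token domain the same guarantee can be shown by exhaustive enumeration, returning a counterexample on mismatch just as the solver would:

```python
from itertools import product

def bounded_equivalence(program, circuit, vocab, seq_len):
    """Check that a candidate high-level program matches the circuit
    on every sequence over a bounded token domain. Returns None if
    they agree everywhere, else a counterexample sequence.
    Illustration only: the real system uses SMT, not enumeration."""
    for tokens in product(vocab, repeat=seq_len):
        if program(tokens) != circuit(tokens):
            return tokens          # counterexample, as in CEGIS-style loops
    return None

# Toy example: does "output the max token" explain this 'circuit'?
circuit = lambda ts: sorted(ts)[-1]
assert bounded_equivalence(lambda ts: max(ts), circuit, range(4), 3) is None
cex = bounded_equivalence(lambda ts: ts[0], circuit, range(4), 3)
assert cex is not None
```

The counterexample is what makes the loop practical: a failed candidate program is not just rejected, it yields a concrete input that guides the synthesis of the next candidate.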
A file-based agent memory framework that works like skills
MemU is an agentic memory framework for LLM and AI agent backends, processing multimodal inputs (conversations, documents, images, audio, video) into a structured, hierarchical memory system. It features a three-layer architecture (Resource → Item → Category) with full traceability and supports dual retrieval methods: fast RAG (embedding-based) and deep semantic LLM-based querying. The memory system is self-evolving, adapting its structure based on usage patterns, and achieves 92.09% accuracy on the Locomo benchmark.
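The three-layer hierarchy with traceability can be sketched as plain data types: every extracted item keeps a pointer back to its source resource, and categories hold only item references. Names here are assumptions for illustration, not MemU's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    """Raw multimodal input: a conversation, document, image, audio, or video."""
    resource_id: str
    modality: str
    content: str

@dataclass
class Item:
    """An extracted memory unit, traceable to the resource it came from."""
    item_id: str
    text: str
    source_resource_id: str

@dataclass
class Category:
    """A grouping of items; the self-evolving layer that reorganizes over time."""
    name: str
    item_ids: list = field(default_factory=list)

def trace(item: Item, resources: dict) -> Resource:
    """Follow an item back to the raw resource it was extracted from."""
    return resources[item.source_resource_id]
```

Keeping the back-pointer at the Item layer is what gives "full traceability": any answer retrieved from a category can be walked back to the original conversation or document that produced it.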
Plano – Edge and service proxy with orchestration for AI agents
Plano is an AI-native proxy server and data plane that simplifies the production deployment of agentic applications. It centralizes agent orchestration, smart LLM routing, guardrail filters, and provides automatic end-to-end OTEL tracing and agentic signals. By abstracting away common plumbing and framework complexities, Plano enables developers to build and ship robust multi-agent systems faster, utilizing lightweight LLMs for efficient routing and enhanced model agility.
Ba
ba is a simple, zero-infrastructure task tracker designed for multi-agent LLM sessions. It uses plain text JSONL files for git-friendly storage and provides an ownership-based workflow where LLM agents can claim issues with a session ID, ensuring clear ownership and preventing concurrent work. Agents can then finish or release issues, facilitating coordinated development.
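The ownership workflow over append-only JSONL can be sketched in a few lines: each event is one line, and an issue's owner is whoever last claimed it without a subsequent finish or release. Field names below are illustrative, not ba's actual schema:

```python
import json
import os

def append_event(path, issue_id, action, session_id):
    """Append one event line to the git-friendly JSONL log."""
    with open(path, "a") as f:
        f.write(json.dumps({"issue": issue_id, "action": action,
                            "session": session_id}) + "\n")

def current_owner(path, issue_id):
    """Replay the log: the last claim wins unless a finish/release follows."""
    owner = None
    if not os.path.exists(path):
        return None
    with open(path) as f:
        for line in f:
            ev = json.loads(line)
            if ev["issue"] != issue_id:
                continue
            if ev["action"] == "claim":
                owner = ev["session"]
            elif ev["action"] in ("finish", "release"):
                owner = None
    return owner

def try_claim(path, issue_id, session_id):
    """Claim an issue only if unowned, so two agents never work it at once."""
    if current_owner(path, issue_id) is not None:
        return False
    append_event(path, issue_id, "claim", session_id)
    return True
```

Because state is derived by replaying plain text lines, the log merges cleanly in git and needs no server, which is the "zero-infrastructure" property the project advertises.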