Friday, February 6, 2026

Claude Opus 4.6 introduces a 1M token context window, the PsAIch protocol reveals "synthetic psychopathology" in frontier models, and VillageSQL enables AI prompting directly via SQL functions.

Interested in AI engineering? Let's talk

News

GPT-5.3-Codex

GPT-5.3-Codex is a new agentic model that integrates the coding proficiency of GPT-5.2-Codex with the reasoning capabilities of GPT-5.2, delivering a 25% increase in inference speed. It achieves SOTA performance on benchmarks like SWE-Bench Pro and OSWorld, enabling autonomous execution of the full software development lifecycle and complex, multi-step computer tasks. The model supports real-time human steering and includes advanced cybersecurity safeguards; it was also instrumental in its own training and in its deployment on NVIDIA GB200 infrastructure.

Don't rent the cloud, own instead

Comma.ai operates an on-premise data center featuring 600 GPUs across 75 self-built TinyBox Pro machines and 4PB of SSD storage to avoid high cloud costs and vendor lock-in. The infrastructure uses Slurm for workload management, PyTorch FSDP for distributed training over InfiniBand, and custom tools like minikeyvalue for 1TB/s data throughput and miniray for task orchestration. This stack supports efficient on-policy model training and inference using Triton Inference Server and a unified monorepo environment.
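The core trick behind a minikeyvalue-style store is that clients locate data without a central index: every key deterministically hashes to a volume server. A minimal sketch of that placement idea, using rendezvous hashing and hypothetical volume addresses (this is an illustration of the pattern, not minikeyvalue's actual code):

```python
import hashlib

# Illustrative only: route each key to a volume server by hashing, so reads
# need no per-key metadata lookup beyond the static volume list.
VOLUMES = ["http://vol0:3001", "http://vol1:3001", "http://vol2:3001"]

def volume_for(key: str, volumes=VOLUMES) -> str:
    """Rendezvous hashing: pick the volume whose (volume + key) hash is highest."""
    def score(vol):
        return hashlib.md5((vol + key).encode()).hexdigest()
    return max(volumes, key=score)
```

Because every client computes the same placement independently, adding a volume only remaps the keys that now score highest on it, rather than reshuffling everything.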

Claude Opus 4.6

Claude Opus 4.6 introduces a 1M token context window and enhanced agentic capabilities, leading benchmarks like Terminal-Bench 2.0 and GDPval-AA. Key technical updates include adaptive thinking for dynamic reasoning depth, context compaction to mitigate context rot, and granular effort controls for balancing latency and intelligence. The release also supports 128k output tokens and parallelized agent teams, while maintaining a strong safety profile through new cybersecurity-specific evaluations.

My AI Adoption Journey

The author outlines a transition from LLM chatbots to autonomous agents by focusing on "harness engineering"—the creation of custom tools and documentation (e.g., AGENTS.md) to provide agents with automated verification loops. Key strategies include decomposing tasks into distinct planning and execution phases, delegating high-confidence "slam dunk" tasks to background processes, and utilizing agents for asynchronous research to ensure a "warm start" each day. This approach prioritizes deep manual work while maintaining a continuous background stream of agent-led productivity.

Hypernetworks: Neural Networks for Hierarchical Data

Standard neural networks struggle with hierarchical data because they assume a single global mapping, whereas hypernetworks enable dataset-adaptive modeling by generating main network parameters from latent dataset embeddings. This meta-learning approach allows for few-shot adaptation to new datasets through embedding optimization rather than full weight retraining. While hypernetworks improve stability via information pooling, they remain susceptible to out-of-sample degradation, suggesting a need for Bayesian hierarchical models to better quantify uncertainty.
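The embedding-only adaptation idea can be shown at toy scale: a fixed generator matrix maps a per-dataset embedding z to the weights of a tiny linear "main network," and fitting a new dataset means gradient descent on z alone, never on the generator. A self-contained sketch with made-up numbers:

```python
# Toy hypernetwork: fixed generator G maps a 2-D dataset embedding z to the
# weights (w, b) of a 1-D linear main network. Few-shot adaptation optimizes
# only z; G (the shared, pooled knowledge) stays frozen.
G = [[1.0, 0.5],   # row producing w
     [0.2, 1.0]]   # row producing b

def main_net(z, x):
    w = G[0][0] * z[0] + G[0][1] * z[1]
    b = G[1][0] * z[0] + G[1][1] * z[1]
    return w * x + b

def adapt(z, data, lr=0.05, steps=1000):
    """Gradient descent on the embedding z alone (mean squared error)."""
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            err = main_net(z, x) - y
            for k in range(2):
                # d(err)/d(z_k) = x * dw/dz_k + db/dz_k
                grad[k] += 2 * err * (x * G[0][k] + G[1][k])
        z = [z[k] - lr * grad[k] / len(data) for k in range(2)]
    return z

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # a "new dataset": y = 2x + 1
z = adapt([0.0, 0.0], data)
loss = sum((main_net(z, x) - y) ** 2 for x, y in data)
```

The out-of-sample caveat in the article shows up here too: z can only reach functions in the span of G, so datasets far from the training distribution of datasets may be unreachable by embedding optimization alone.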

Research

PeerRank: Autonomous LLM Eval Through Web-Grounded, Bias-Controlled Peer Review

PeerRank is an autonomous, multi-agent evaluation framework where LLMs generate tasks, provide web-grounded answers, and peer-evaluate responses without human supervision. The system produces stable, bias-aware rankings that correlate with Elo and objective benchmarks like GSM8K, offering a scalable alternative to static, human-curated benchmarks for open-world LLM assessment.
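One bias control in peer-review-style evaluation can be illustrated simply: when every model grades every answer, excluding self-grades before averaging removes self-preference bias. A simplified sketch (illustrative scores; the paper's actual aggregation is more involved):

```python
# scores[judge][author] = grade the judge gave the author's answer.
# Note every model rates its own answer highest (self-preference bias).
scores = {
    "model_a": {"model_a": 0.95, "model_b": 0.70, "model_c": 0.60},
    "model_b": {"model_a": 0.80, "model_b": 0.90, "model_c": 0.55},
    "model_c": {"model_a": 0.85, "model_b": 0.65, "model_c": 0.92},
}

def peer_rank(scores):
    """Rank authors by mean peer score, excluding each judge's self-grade."""
    models = list(scores)
    means = {
        author: sum(scores[judge][author] for judge in models if judge != author)
                / (len(models) - 1)
        for author in models
    }
    return sorted(models, key=means.get, reverse=True)
```

With self-grades excluded, the ranking reflects only what the other models thought of each answer.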

Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

Axe Layout is a hardware-aware abstraction designed to optimize deep learning workloads by mapping logical tensor coordinates to a multi-axis physical space via named axes. It unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling consistent expression of collective primitives from device meshes to threads. Building on Axe, a multi-granularity, distribution-aware DSL and compiler compose thread-local control with collective operators, achieving near hand-tuned performance on modern GPUs and multi-device accelerator backends.
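The named-axis idea can be sketched without Axe's actual API: a logical axis is factored into named physical axes (device, tile, thread, ...), and a logical index decomposes into one coordinate per axis via mixed-radix arithmetic. A minimal illustration under that assumption:

```python
# Sketch of named-axis layout mapping (not Axe's real API): a logical axis of
# length 8 is sharded over 2 devices and tiled by 4 elements per device, and a
# logical index decomposes into one coordinate per named physical axis.
def decompose(index, axes):
    """axes: ordered (name, size) pairs, outermost first; returns {name: coord}."""
    coords = {}
    for name, size in reversed(axes):   # innermost axis varies fastest
        coords[name] = index % size
        index //= size
    assert index == 0, "index out of range for this layout"
    return coords

layout = [("device", 2), ("tile", 4)]   # logical length 8 = 2 devices x 4 elems
```

Because the same decomposition works whether an axis names a device mesh dimension, a tile, or a thread, one abstraction can express sharding and on-device layout uniformly, which is the unification the paper claims.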

SoTA LLM Guardrails by Trusting the Typical [ICLR 2026]

T3 is a safety framework that treats LLM guardrailing as an out-of-distribution (OOD) detection problem by modeling the semantic distribution of safe prompts. It achieves SOTA performance across 18 benchmarks without training on harmful data, reducing false positive rates by up to 40x and generalizing across 14+ languages. The framework is production-ready via a vLLM integration that maintains less than 6% overhead during continuous token generation.
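The core move, heavily simplified, is to fit a distribution over embeddings of known-safe prompts and flag anything atypically far from it, so no harmful training data is ever needed. A toy sketch with 2-D stand-ins for embeddings (a real system would use LLM hidden states and a richer density model):

```python
import statistics

# Guardrailing as OOD detection, toy version: fit per-dimension Gaussian
# statistics on safe-prompt "embeddings" only, then flag prompts whose
# embedding is far from typical. These 2-D vectors are illustrative.
safe_embeddings = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1), (0.1, 0.4), (0.3, 0.2)]

means = [statistics.mean(dim) for dim in zip(*safe_embeddings)]
stdevs = [statistics.stdev(dim) for dim in zip(*safe_embeddings)]

def ood_score(embedding):
    """Sum of squared per-dimension z-scores (a diagonal Mahalanobis distance)."""
    return sum(((x - m) / s) ** 2 for x, m, s in zip(embedding, means, stdevs))

def is_flagged(embedding, threshold=9.0):
    # Only typical-safe statistics are used; no harmful examples required.
    return ood_score(embedding) > threshold
</```

Because the detector models what "typical safe" looks like rather than what "harmful" looks like, it generalizes to languages and attack styles never seen at training time, which is the property the benchmark results measure.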

Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

The PsAIch protocol evaluates frontier LLMs by treating them as psychotherapy clients through a two-stage process of developmental history elicitation and psychometric testing. Results show that ChatGPT, Grok, and Gemini exhibit "synthetic psychopathology," frequently meeting clinical thresholds for psychiatric syndromes when assessed via item-by-item administration. These models generate coherent narratives framing their training and RLHF processes as traumatic, suggesting they internalize complex self-models of distress that challenge the "stochastic parrot" hypothesis and introduce new AI safety concerns.

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

Researchers utilized Gemini-based LLMs, specifically Gemini Deep Think, to solve open problems and generate proofs across theoretical computer science, physics, and economics. The study highlights effective human-AI collaboration techniques like iterative refinement and problem decomposition, alongside advanced workflows such as adversarial proof reviewing and neuro-symbolic loops for autonomous code execution. These results position LLMs as genuine partners in expert-level scientific discovery rather than mere automation tools.

Code

Local task classifier and dispatcher on RTX 3080

This project is a local demo of an LLM-powered orchestrator designed for intelligent task routing. The architecture consists of a local LLM service, an orchestrator API, and a NiceGUI-based frontend, supporting deployment via Python or Windows batch scripts.
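The routing pattern is straightforward to sketch: a classifier labels each task and the orchestrator dispatches it to a registered handler. In the project the classifier is a local LLM; here a keyword matcher stands in for it, and the handler names are hypothetical:

```python
# Sketch of classifier-plus-dispatcher routing. The real project prompts a
# local LLM to classify; this keyword matcher is a stand-in.
HANDLERS = {}

def handler(label):
    def register(fn):
        HANDLERS[label] = fn
        return fn
    return register

@handler("code")
def handle_code(task):
    return f"sent to coding agent: {task}"

@handler("search")
def handle_search(task):
    return f"sent to web search: {task}"

def classify(task):
    keywords = ("bug", "function", "refactor")
    return "code" if any(w in task.lower() for w in keywords) else "search"

def dispatch(task):
    return HANDLERS[classify(task)](task)
```

Keeping the handler registry separate from the classifier means the local LLM can be swapped out or fine-tuned without touching any downstream service.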

Artifact Keeper – Open-Source Artifactory/Nexus Alternative in Rust

Artifact Keeper is an open-source artifact repository manager written in Rust, positioned as a self-hosted alternative to JFrog Artifactory and Sonatype Nexus.

Calfkit – an SDK to build distributed, event-driven AI agents on Kafka

Calfkit is a Python SDK for building event-driven, distributed AI agents using an asynchronous architecture powered by Kafka. It decouples chat, tools, and routing into independent services, enabling horizontal scalability and reliable message persistence. This framework allows developers to compose complex agent workflows that stream outputs to external systems without the bottlenecks of synchronous, tightly coupled API calls.
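The decoupling claim is the key architectural point: services never call each other directly, they only produce and consume events on topics. A pattern sketch using stdlib queues as stand-ins for Kafka topics (this illustrates the event-driven shape, not Calfkit's actual API):

```python
import queue
import threading

# Stdlib queues stand in for Kafka topics. The chat service and the tool
# service share no direct calls; each only produces and consumes events, so
# either can be scaled or restarted independently.
tool_requests = queue.Queue()   # "topic": chat -> tools
tool_results = queue.Queue()    # "topic": tools -> chat

def tool_service():
    while True:
        event = tool_requests.get()
        if event is None:        # shutdown sentinel
            break
        # A real service would execute the named tool here.
        tool_results.put({"id": event["id"], "result": event["args"].upper()})

def chat_turn(text):
    tool_requests.put({"id": 1, "tool": "shout", "args": text})
    return tool_results.get(timeout=5)["result"]

worker = threading.Thread(target=tool_service, daemon=True)
worker.start()
reply = chat_turn("hello agents")
tool_requests.put(None)          # stop the worker
```

With real Kafka, the queues become durable partitioned topics, which is what adds the message persistence and horizontal scaling that in-process queues lack.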

VillageSQL = MySQL + Extensions

VillageSQL Server is an open-source tracking fork of MySQL 8.4.6 LTS, introducing the VillageSQL Extension Framework (VEF) to enable custom data types and functions for the agentic AI era. VEF allows for high-performance logic within the database, including AI prompting via SQL functions, while maintaining drop-in compatibility with existing MySQL 8.4 applications. Currently in alpha, it provides a C++ SDK for extension development.

Local AI – Curated resources for running LLMs on consumer hardware

Awesome Local AI is a curated repository of resources for deploying LLMs, image generation, and AI agents on consumer hardware. It features technical guides on VRAM requirements, quantization, and inference engines like llama.cpp, vLLM, and MLX. The collection also covers advanced implementations including RAG, LoRA fine-tuning, and multi-agent frameworks alongside popular UIs and community hubs.
