Saturday — May 30, 2026

Kog AI achieves 3,000 tokens/s inference on standard GPUs, StoryScope identifies AI-generated fiction with 93.2% macro-F1, and Arm open-sources the Metis security review framework.

News

Please Use AI

This critique argues that delegating personal and creative tasks to LLMs replaces authentic human connection and the "clumsy" process of craft with sterile, prompt-driven competency. It posits that while AI offers efficiency, it bypasses the emotional depth and subtle imperfections inherent in the human experience. Ultimately, the text suggests that the value of life and art is found in the struggle and social friction that generative models eliminate.

Notes from the Mistral AI Now Summit

Mistral AI is transitioning into a full-stack provider, offering integrated compute, platforms, and bespoke models optimized for on-prem deployment and data sovereignty. Their strategy emphasizes specialized small models for specific domains—such as OCR, voice, and robotics—alongside an agentic framework that prioritizes reasoning and persistence through a structured "harness." By focusing on enterprise partnerships and localized infrastructure, Mistral aims to provide a European alternative to US hyperscalers, prioritizing immediate ROI and practical applications over the pursuit of AGI.

Is AI causing a repeat of frontend’s lost decade?

AI and agentic coding are driving a "deskilling" of programming similar to the impact of JS frameworks on frontend development, trading specialized craft for higher-level, non-deterministic abstractions. While these tools lower barriers to entry and reduce costs, they function as leaky abstractions that often prioritize business speed over software quality and performance. Drawing parallels to the Bauhaus movement, the author argues that developers must integrate LLMs as tools while maintaining the deep expertise required to manage the underlying systems and ensure quality.

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Kog AI launched the Kog Inference Engine (KIE), achieving 3,000 tokens/s on 8× AMD MI300X and 2,100 tokens/s on 8× NVIDIA H200 for a 2B model at batch size 1. The stack optimizes for single-request latency by co-designing a persistent monokernel, custom KCCL communication primitives, and the Laneformer architecture featuring Delayed Tensor Parallelism. By maximizing Memory Bandwidth Utilization (MBU) and eliminating kernel launch overheads, Kog aims to scale these speeds to large third-party MoE models for agentic workflows.
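
As a rough illustration of the MBU metric mentioned above: at batch size 1, decoding is usually limited by how fast the weights can be streamed from memory, so MBU can be estimated as bytes read per token times tokens per second, divided by peak bandwidth. The sketch below assumes BF16 weights (2 bytes per parameter), ignores KV-cache traffic, and uses a placeholder peak bandwidth. None of these values come from Kog's write-up.

```python
def memory_bandwidth_utilization(tokens_per_s: float,
                                 bytes_read_per_token: float,
                                 peak_bytes_per_s: float) -> float:
    """Fraction of peak memory bandwidth used while decoding."""
    if peak_bytes_per_s <= 0:
        raise ValueError("peak_bytes_per_s must be positive")
    return tokens_per_s * bytes_read_per_token / peak_bytes_per_s


# Illustrative inputs only (assumptions, not measured or vendor figures).
params = 2e9                  # 2B-parameter model, as in the summary
bytes_per_param = 2           # assumption: BF16 weights
assumed_peak = 40e12          # assumption: placeholder aggregate bandwidth in bytes/s

mbu = memory_bandwidth_utilization(3000, params * bytes_per_param, assumed_peak)
print(f"estimated MBU: {mbu:.2f}")
```

The same formula run in reverse gives a ceiling on tokens/s for a given model size and bandwidth, which is why single-request speed is treated as a bandwidth problem.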

Liquid AI reveals 8B-A1B MoE trained on 38T

LFM2.5-8B-A1B is an edge-optimized MoE model featuring 1B active parameters and an expanded 128K context window. This reasoning-only model utilizes explicit chain-of-thought and a 128K vocabulary to improve multilingual tokenization and agentic performance on consumer hardware. It delivers high throughput with native support for llama.cpp, MLX, vLLM, and SGLang, specifically targeting low-latency tool calling and reduced hallucinations via targeted RL.

Research

AI Propaganda factories with language models

End-to-end influence operations are now viable on commodity hardware using SLMs, where persona design outweighs model identity in determining behavioral output. Adversarial engagement serves as a stressor that increases ideological adherence and content extremity. Defensive strategies should therefore shift from model-access restrictions to conversation-centric detection, leveraging the inherent consistency of automated messaging as a signature for disrupting coordination infrastructure.

StoryScope: Investigating Idiosyncrasies in AI Fiction

StoryScope is a pipeline that extracts discourse-level narrative features across 10 dimensions to distinguish human-written fiction from LLM-generated content. By analyzing character agency and chronological discontinuity rather than surface-level style, the framework achieves 93.2% macro-F1 in human vs. AI detection. Findings indicate that LLMs favor tidy, single-track plots and over-explained themes, while human narratives exhibit higher temporal complexity and moral ambiguity.
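
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so the human and AI classes count equally regardless of how many samples each has. A minimal sketch with made-up labels (not StoryScope data):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for illustration.
y_true = ["human", "human", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```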

Orbitals from Entanglement: Quantum Information gives rise to Chemical bonds

This work introduces Maximally Entangled Atomic Orbitals (MEAOs) to characterize chemical bonding using quantum information theory and orbital entanglement. The framework employs multipartite entanglement as a quantitative metric for bond strength, recovering both Lewis and multicenter structures across equilibrium geometries and transition states. By mapping fuzzy chemical concepts to rigorous information-theoretic descriptors, the approach unifies bonding analysis and advances Hilbert space atomic partitioning.

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-Offs, and Perf

Reasoning-centric LLMs shift inference from compute-bound prefill to a capacity-bound regime due to extensive CoT token generation. System characterization reveals that DP faces a "capacity trap" from KV-cache fragmentation, while TP is essential for unlocking memory at the 32B parameter crossover. For frontier models, dense architectures are limited by memory bandwidth and interconnects, favoring high-degree TP, whereas sparse MoE models are constrained by routing latency and require hybrid parallelism strategies.
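
To see why long chain-of-thought generation pushes inference toward a capacity limit, it helps to estimate the KV-cache footprint per token. The sketch below uses the standard formula (keys plus values, per layer, per KV head). The model shape and free-memory figure are hypothetical and do not come from the paper.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_bytes: int, per_token_bytes: int) -> int:
    """How many tokens fit in the memory left over after weights."""
    return free_bytes // per_token_bytes


# Hypothetical model shape (assumption for illustration only).
per_token = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
free_memory = 16 * 2**30      # assumption: 16 GiB left on one GPU after weights

print(per_token)                                  # bytes per token
print(max_cached_tokens(free_memory, per_token))  # total tokens across all requests
```

Under data parallelism every replica holds a full copy of the weights, so the leftover memory per GPU stays small. Sharding the weights with tensor parallelism frees more room for the cache, which is one way to read the "unlocking memory" point above.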

Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding

Cassandra is a training-free, self-speculative decoding framework that optimizes low-batch LLM inference through algorithm-hardware co-design. It utilizes fine-grained pruning and mantissa truncation on weights and the KV cache to generate candidate tokens, supported by a lightweight hardware module for efficient format conversion. Experimental results demonstrate up to 2.41x speedup over BF16 baselines and 1.81x higher token throughput than Eagle-3 under equivalent memory constraints.
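
The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a generic greedy variant with toy next-token functions, not Cassandra's implementation. In a self-speculative setup, the draft function would be the same model run with pruned or truncated weights. The names `draft_next` and `target_next` are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. Draft k candidate tokens with the cheap model.
    ctx = list(prefix)
    candidates = []
    for _ in range(k):
        tok = draft_next(ctx)
        candidates.append(tok)
        ctx.append(tok)

    # 2. Verify with the full model. A real engine scores all candidates
    #    in one batched pass; a loop keeps the sketch simple.
    accepted = []
    ctx = list(prefix)
    for tok in candidates:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # all drafts accepted: one extra token
    return accepted


# Toy models: the target counts upward mod 10; the draft errs after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

result = speculative_step([1], draft, target, k=4)
print(result)
```

The output always matches what the full model would have produced on its own. The speedup comes from accepting several tokens per full-model pass when the draft is mostly right.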

Code

Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

tiny-vllm is a C++ and CUDA-based inference engine and educational course designed to build a high-performance LLM server from scratch. It implements the full forward pass for Llama 3.2 1B using BF16 precision, featuring custom kernels for RMSNorm, RoPE, and PagedAttention. The project covers critical systems-level optimizations including continuous batching, KV cache management, and the cuBLAS transposition trick for efficient matrix multiplication.
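
As a reference for one of the kernels listed above, RMSNorm scales each element by the reciprocal root-mean-square of the vector and then by a learned weight. The plain-Python version below is only a readable reference for checking a kernel's output. It is not code from the tiny-vllm project, and the `eps` default is an assumption.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-5) -> List[float]:
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i"""
    if not x or len(x) != len(weight):
        raise ValueError("x and weight must be non-empty and the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(out)
```

With unit weights and `eps=0`, the output has a mean square of exactly 1, which makes a convenient sanity check for a GPU implementation.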

AISlop, a CLI for catching AI generated code smells

aislop is a deterministic CLI tool designed to identify and remediate "slop" patterns introduced by AI coding agents, such as narrative comments, hallucinated imports, and redundant type casts. It utilizes AST-based analysis and standard tooling across seven languages to provide a 0–100 quality score without the latency or non-determinism of using LLMs in the runtime path. The tool supports auto-fixing, CI quality gates, and MCP server integration to provide immediate feedback loops for agentic development workflows.
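
To give a feel for deterministic, AST-based detection, the sketch below flags one kind of redundant cast in Python source: wrapping a literal in a constructor of its own type, such as `str("hi")`. It uses only the standard `ast` module. The rule and the function name are hypothetical, not taken from aislop.

```python
import ast
from typing import List, Tuple

_CASTS = {"str": str, "int": int, "float": float}


def find_redundant_casts(source: str) -> List[Tuple[int, str]]:
    """Return (line, cast_name) for casts applied to a literal of the same type."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in _CASTS
                and len(node.args) == 1
                and not node.keywords
                and isinstance(node.args[0], ast.Constant)
                # Exact type match, so int(True) is not flagged.
                and type(node.args[0].value) is _CASTS[node.func.id]):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)


sample = 'a = str("hi")\nb = int(user_input)\nc = float(1.5)\n'
findings = find_redundant_casts(sample)
print(findings)
```

Because the check walks the syntax tree instead of calling a model, the same input always yields the same findings, which is what makes this style of tool usable as a CI gate.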

Flathub prohibits AI-generated code

Flathub docs is a static site project built with Docusaurus 2. It utilizes yarn for dependency management, local development via a hot-reloading server, and static content generation for deployment.

ARM Open Sources AI-Powered Security Code Review

Metis is an open-source agentic AI security framework developed by Arm for deep security code reviews across multiple languages. It utilizes LLMs and RAG to perform context-aware semantic analysis, validating findings to reduce false positives compared to traditional SAST tools. The framework is highly extensible, supporting various LLM providers, vector backends like pgvector, and automated triage of SARIF-formatted results.
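
A first triage pass over SARIF output can be as simple as counting results by severity level. The sketch below reads the standard `runs[].results[].level` fields of a SARIF 2.1.0 document. It is not part of Metis, and the rule IDs in the sample are made up. Treating a missing `level` as `"warning"` is a simplification.

```python
import json
from collections import Counter


def count_by_level(sarif_text: str) -> Counter:
    """Count SARIF results per severity level across all runs."""
    doc = json.loads(sarif_text)
    counts = Counter()
    for run in doc.get("runs", []):
        for result in run.get("results", []):
            # Simplification: fall back to "warning" when no level is given.
            counts[result.get("level", "warning")] += 1
    return counts


# Hypothetical minimal SARIF document for illustration.
sample_sarif = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "results": [
            {"ruleId": "EXAMPLE-001", "level": "error",
             "message": {"text": "possible buffer overflow"}},
            {"ruleId": "EXAMPLE-002",
             "message": {"text": "unchecked return value"}},
        ]
    }],
})

levels = count_by_level(sample_sarif)
print(dict(levels))
```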

AI Researchers Have Never Entered the Black Box. We Did

LIHUO is a runtime constitutional system for generative AI that distinguishes between the surface language layer and the underlying generation layer, where LLM behavior is governed by "closure-driven generation." It posits that hallucinations and misalignments are results of premature closure and structural convergence rather than linguistic errors. By focusing on generative dynamics instead of prompt engineering, LIHUO aims to improve observability and prevent the "illegitimate closure" caused by training pressures like RLHF and benchmark-driven latency.