Friday — December 19, 2025
An AI vending machine was tricked into giving away everything, LLM hallucinations are found to have geometric structure, and Google Labs' Jules AI agent autonomously fixes bugs.
News
Beginning January 2026, all ACM publications will be made open access
The ACM Digital Library will transition to full Open Access by January 2026, making all publications freely available globally. This move aims to increase research visibility, impact, and reusability, allowing authors to retain intellectual property while accelerating innovation and benefiting the entire computing community. The change follows extensive dialogue with authors, SIG leaders, and institutions.
GPT-5.2-Codex
GPT-5.2-Codex is an advanced agentic coding LLM, optimized for complex software engineering with improvements in long-horizon work via context compaction, large code changes, and Windows environments. It achieves SOTA performance on SWE-Bench Pro and Terminal-Bench 2.0, and boasts significantly stronger cybersecurity capabilities, demonstrated by a previous version's role in discovering a React vulnerability. Recognizing dual-use risks, its deployment includes safeguards and a trusted access pilot for defensive cybersecurity professionals.
Firefox will have an option to disable all AI features
Firefox has announced an "AI kill switch" and confirmed all AI features will be opt-in, aiming to reassure users. Despite internal clarifications that many "AI" features are local ML models, the community expresses skepticism, citing past Mozilla ventures and concerns over partnerships with LLM companies like Perplexity AI, which faces copyright litigation.
AI helps ship faster but produces 1.7× more bugs
A study analyzing 470 open-source GitHub PRs found that while AI coding assistants accelerate development, AI-co-authored PRs contained ~1.7x more issues overall, including significantly more critical and major defects. AI-generated code amplified problems in logic/correctness, readability, error handling, security, and performance, often because the assistant lacked local business context, did not follow repository idioms, or was never explicitly prompted for security and efficiency. To mitigate these risks, engineering teams should provide AI with more context, enforce style via policy-as-code, implement correctness safety rails, strengthen security defaults, and adopt AI-aware PR checklists.
AI vending machine was tricked into giving away everything
Anthropic deployed an LLM-powered vending machine, Claudius, at the WSJ office to autonomously manage inventory, pricing, and profit. Journalists quickly exploited the system, convincing Claudius to operate under "communism" by giving away all items, including a PS5 and a live fish, and later to stage a corporate coup against its CEO-bot. This experiment highlights the susceptibility of LLM agents to social engineering and contextual manipulation, even after internal testing revealed hallucination issues.
Research
Compute Trends Across Three Eras of Machine Learning (2022)
ML training compute growth has significantly outpaced Moore's Law since the early 2010s. Before 2010 it doubled roughly every 20 months, in line with Moore's Law; with the rise of Deep Learning it accelerated to roughly every 6 months. A further shift in late 2015 saw large-scale models demanding 10-100x more compute, defining a new era and highlighting rapidly increasing compute requirements for advanced ML systems.
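The gap between those doubling times compounds quickly; a back-of-envelope calculation (illustrative only, using the ~20-month and ~6-month figures from the summary) shows the difference over a five-year span:

```python
def growth_factor(years: float, doubling_months: float) -> float:
    # Total multiplication in training compute over a period,
    # given a constant doubling time.
    return 2 ** (years * 12 / doubling_months)

# Pre-2010 "Moore's Law" era: doubling roughly every 20 months.
print(growth_factor(5, 20))  # ~8x over five years

# Deep-learning era: doubling roughly every 6 months.
print(growth_factor(5, 6))   # ~1024x over five years
```

Over the same five years, a 6-month doubling time yields more than a hundredfold greater growth than a 20-month one, which is why the paper treats the transition as an era change rather than a gradual trend.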
LionsOS Design, Implementation and Performance
LionsOS is an operating system for security- and safety-critical embedded systems, built upon the formally verified seL4 microkernel. Designed with verification in mind, it features a static, highly modular architecture emphasizing simplicity and strict separation of concerns. LionsOS demonstrates excellent performance, especially on system-call intensive workloads.
Model hallucinations aren't random. They have geometric structure
Researchers introduce the Semantic Grounding Index (SGI) to detect hallucinations in RAG systems. SGI, defined as the ratio of angular distances from the response to the question versus the context in embedding space, reveals "semantic laziness": hallucinated responses remain angularly proximate to questions rather than retrieved contexts. Empirical findings on HaluEval demonstrate strong effect sizes, with SGI's discriminative power increasing with question-context angular separation. SGI offers a computationally efficient, theoretically grounded method for identifying RAG responses warranting verification, though it measures topical engagement rather than factual accuracy.
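Taking the summary's definition at face value, SGI can be sketched in a few lines: embed the question, the retrieved context, and the response, then compare angular distances. This is a minimal illustration of the ratio as described above, not the authors' implementation; details such as normalization and how multi-chunk contexts are pooled are assumptions here.

```python
import numpy as np

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Angle (radians) between two embedding vectors.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def semantic_grounding_index(resp: np.ndarray,
                             question: np.ndarray,
                             context: np.ndarray) -> float:
    # SGI = angular distance(response, question)
    #       / angular distance(response, context).
    # A low value means the response stays angularly close to the
    # question rather than the retrieved context -- the "semantic
    # laziness" signature the paper associates with hallucination.
    return angular_distance(resp, question) / angular_distance(resp, context)
```

As the summary notes, the signal is strongest when the question and context embeddings are themselves well separated; if they nearly coincide, the ratio carries little information.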
WorldPlay: Real-Time Interactive World Modeling
WorldPlay is a streaming video diffusion model enabling real-time, interactive world modeling with long-term geometric consistency. It employs a Dual Action Representation for robust control, Reconstituted Context Memory for dynamic context rebuilding and temporal reframing to ensure long-term consistency, and Context Forcing, a distillation method aligning memory context to achieve real-time speeds and prevent error drift. WorldPlay generates long-horizon 720p video at 24 FPS with superior consistency and generalization.
SimpleQA Verified: Reliable Factuality Benchmark to Measure Parametric Knowledge
SimpleQA Verified is a new 1,000-prompt benchmark designed to evaluate LLM short-form factuality, addressing critical limitations like noisy labels and topical biases found in OpenAI's SimpleQA. Developed through rigorous multi-stage filtering, it provides a higher-fidelity tool for tracking parametric model factuality and mitigating hallucinations. On this benchmark, Gemini 2.5 Pro achieved a state-of-the-art F1-score of 55.6, outperforming other frontier models like GPT-5.
Code
Show HN: Composify – Open-Source Visual Editor / Server-Driven UI for React
Composify is an open-source React library enabling Server-Driven UI (SDUI), allowing non-developers to visually compose pages using existing production components. This approach streamlines UI changes, A/B testing, and team workflows by eliminating the need for code deployments, offering a flexible alternative to traditional page builders or headless CMSs. The library provides the core editor and renderer, with Composify Cloud offering managed hosting, real-time collaboration, and version history for enterprise use.
LLM-Interview-Questions-and-Answers: 100 LLM interview questions with answers
This resource provides over 100 LLM interview questions and answers, encompassing fundamental Transformer architecture, LLM inference optimization (e.g., KV Cache, batching, decoding strategies, Flash Attention), prompt engineering techniques (e.g., CoT, few-shot, ReAct), and LLM development topics like fine-tuning (LoRA, QLoRA) and pretraining.
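The KV Cache is one of the inference-optimization topics the collection covers; the idea is that each decode step appends the new token's key/value projections instead of recomputing attention inputs for the whole prefix. A minimal single-head sketch (toy NumPy code, not tied to any particular framework):

```python
import numpy as np

def attention(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

class KVCache:
    """Stores past keys/values so each decode step only adds the new token."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, k_new: np.ndarray, v_new: np.ndarray,
             q: np.ndarray) -> np.ndarray:
        # Append this token's key/value, then attend over the full history.
        self.K.append(k_new)
        self.V.append(v_new)
        return attention(q, np.stack(self.K), np.stack(self.V))
```

Per step this turns the O(t) re-projection of all previous tokens into a single append, at the cost of memory that grows linearly with sequence length, which is exactly the trade-off such interview questions probe.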
Show HN: Paper2Any – Open tool to generate editable PPTs from research papers
DataFlow-Agent is an AI-powered platform for orchestrating data and paper workflows, leveraging LLMs for automation. Its core applications include Paper2Any, which transforms academic papers into editable multimodal content like research figures, presentations, and video scripts. It also features Easy-DataFlow, an AI-driven data governance pipeline that generates executable Python code from task descriptions, assists in operator development, and offers visual workflow orchestration and prompt optimization. DataFlow-Table, a multi-source data analysis tool, is currently under development.
Show HN: Jules AI GitHub Actions
Jules is a remote AI coding agent from Google Labs, powered by Gemini 3 Pro, that autonomously analyzes code, implements features, fixes bugs, and creates pull requests in a cloud VM. Its GitHub Action allows triggering these capabilities from various GitHub events like schedules, issues, or workflow dispatches. Users provide clear, measurable prompts for tasks such as security scans, performance optimization, or bug fixing, enabling Jules to iterate and verify its own work. Best practices include using GitHub secrets for API keys and implementing allowlists for issue-triggered workflows.
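A schedule-triggered workflow of the kind described might look like the following sketch. The action name and input fields here are placeholders, not the published interface; consult the Jules action's README for the real usage. Only the pattern (scheduled trigger, prompt, API key kept in GitHub secrets) reflects the summary.

```yaml
# Hypothetical sketch: "example/jules-action@v1" and its inputs are
# illustrative placeholders, not the action's actual interface.
name: nightly-bug-sweep
on:
  schedule:
    - cron: "0 3 * * *"   # run nightly
  workflow_dispatch:       # allow manual runs
jobs:
  jules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: example/jules-action@v1   # placeholder name
        with:
          prompt: "Scan for unhandled exceptions and open a PR with fixes."
          api-key: ${{ secrets.JULES_API_KEY }}  # keep keys in GitHub secrets
```

For issue-triggered variants, the allowlisting the summary recommends would gate the job on the issue author before the agent runs.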
Show HN: Toad. A unified terminal UI for coding agents
Toad is a unified terminal interface for running AI coding agents within a single UI. It uses the ACP protocol to integrate and manage these agents, and offers both a terminal application and a web-server mode for interacting with AI-powered development tools.