Wednesday — November 12, 2025

An AI agent provides conversational documentation for any GitHub repo; a new tool uses LLMs to prevent architectural drift; and a review of 445 LLM benchmarks finds many lack construct validity.

News

AI documentation you can talk to, for every repo

DeepWiki provides a conversational AI interface for understanding GitHub repositories, functioning as "Deep Research for GitHub." It uses the AI agent Devin to index codebases, allowing developers to query any repo and receive up-to-date, interactive documentation.

We ran over 600 image generations to compare AI image models

A comparative analysis of OpenAI's gpt-image-1, Gemini's gemini-2.5-flash-image, and seedream-4-0-250828 across 600+ image generations reveals distinct specializations. OpenAI excels at creative style transfers but often introduces unwanted hallucinations. Gemini is superior for photorealistic edits because it preserves details, yet it is overly conservative with artistic prompts, particularly on human subjects. Seedream emerges as a fast, cost-effective "jack of all trades" and a strong middle-ground alternative. These differences lead the authors to consider a prompt classifier that routes each task to the best-suited model.

The 'Toy Story' You Remember

The original release of Toy Story used a hybrid process in which digital renders were transferred to 35mm analog film for theatrical distribution. This analog step was a critical part of the creative pipeline, with artists calibrating digital colors and textures specifically for the film stock's characteristics. Modern direct-to-digital versions of the movie bypass this process, resulting in a fundamentally different look that lacks the softness, grain, and color profile of the intended theatrical experience.

AI adoption in US adds ~900k tons of CO₂ annually, study finds

A study in Environmental Research Letters projects that AI adoption in the US will add approximately 900,000 tons of CO₂ annually. This represents a modest 0.02% of total national emissions, with energy use in some industries increasing by up to 12 petajoules, comparable to the electricity consumption of 300,000 homes. Researchers conclude that while the impact is relatively small, it underscores the need to integrate energy efficiency and sustainability into AI development as adoption scales.
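
As a rough sanity check on the comparison above, the following sketch converts 12 petajoules into kilowatt-hours and divides by 300,000 homes. It assumes the 12 PJ figure is an annual total expressed as electricity-equivalent energy, which the summary does not state explicitly; the function names are illustrative.

```python
# Unit conversions: 1 PJ = 1e15 J, 1 kWh = 3.6e6 J.
JOULES_PER_PJ = 1e15
JOULES_PER_KWH = 3.6e6


def petajoules_to_kwh(pj: float) -> float:
    """Convert petajoules to kilowatt-hours."""
    return pj * JOULES_PER_PJ / JOULES_PER_KWH


def implied_kwh_per_home(pj: float, homes: int) -> float:
    """Energy per home implied by spreading `pj` petajoules over `homes` homes."""
    return petajoules_to_kwh(pj) / homes


total_kwh = petajoules_to_kwh(12)
per_home = implied_kwh_per_home(12, 300_000)
print(f"{total_kwh:.3e} kWh total, about {per_home:,.0f} kWh per home")
```

Under that assumption the implied figure is roughly 11,000 kWh per home per year, which is the same order of magnitude as commonly cited US household averages, so the two numbers in the summary are at least mutually consistent.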

Show HN: Gametje – A casual online gaming platform

Canvas Clash is a multiplayer game for 3-16 players that features an integrated AI agent. Players can invite the AI to join their game session and compete directly against it.

Research

Discovering archetypes of French detective novels using NLP

This research applies character-level embeddings and a supervised model to analyze the detective archetype in 150 years of French fiction. The model successfully captures the archetype's unity over time while also tracing its evolution from a central "reasoning machine" to a more complex figure navigating moral ambiguity, influenced by the hardboiled tradition.

Show HN: CellARC Measuring Intelligence with Cellular Automata

CellARC is a new synthetic benchmark for abstraction and reasoning based on 1D cellular automata, designed for rapid iteration with small models. It provides a highly controllable task space to test few-shot rule inference and generalization decoupled from anthropomorphic priors. Reported results show that a 10M-parameter transformer is a strong small-model baseline, a large LLM extrapolates better, and a neuro-symbolic ensemble reaches the highest interpolation accuracy, which suggests the approaches are complementary.
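
To make "few-shot rule inference" on a 1D cellular automaton concrete, here is a minimal sketch using elementary (binary, radius-1) automata with wraparound. This is an assumption for illustration only: the summary does not describe CellARC's actual task format, alphabet size, or neighborhood radius, and the function names are hypothetical.

```python
from itertools import product


def rule_table(number):
    """Build the lookup table for a Wolfram elementary rule number (0-255)."""
    return {
        (a, b, c): (number >> (a * 4 + b * 2 + c)) & 1
        for a, b, c in product((0, 1), repeat=3)
    }


def step(cells, rule):
    """Apply one update of a radius-1 binary automaton with wraparound edges."""
    n = len(cells)
    return [
        rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
        for i in range(n)
    ]


def infer_rule(examples):
    """Recover the (possibly partial) lookup table from (before, after) pairs."""
    table = {}
    for before, after in examples:
        n = len(before)
        for i in range(n):
            key = (before[(i - 1) % n], before[i], before[(i + 1) % n])
            if table.setdefault(key, after[i]) != after[i]:
                raise ValueError("examples are inconsistent with a radius-1 rule")
    return table


# The "hidden" rule the solver must infer from a single support example.
hidden = rule_table(110)
support = [0, 0, 0, 1, 0, 1, 1, 1]  # contains all 8 neighborhoods once
learned = infer_rule([(support, step(support, hidden))])

# Apply the inferred table to an unseen query row.
query = [0, 1, 1, 0, 1, 0, 0, 1]
prediction = step(query, learned)
```

Here the support row happens to cover every neighborhood, so the rule is fully determined. A harder task would leave some neighborhoods unseen, which is where generalization beyond table lookup would be needed.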

Measuring What Matters: Construct Validity in Large Language Model Benchmarks

A systematic review of 445 LLM benchmarks found that many lack construct validity, with common patterns in measured phenomena, tasks, and scoring metrics undermining their claims. The study identifies these shortcomings and provides eight key recommendations to guide researchers in developing more robust and valid benchmarks.

Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph

This work introduces EGG-SR, a framework that uses equality graphs (e-graphs) to address the inefficiency in symbolic regression caused by exploring syntactically different but semantically equivalent expressions. By compactly representing equivalence classes, EGG-SR enhances search methods based on Monte Carlo tree search (MCTS), deep reinforcement learning (DRL), and LLMs by pruning redundant search, aggregating rewards, and enriching feedback prompts. This approach is shown to tighten theoretical bounds and achieve state-of-the-art results on challenging benchmarks.
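
The idea of treating syntactically different but equivalent expressions as one candidate can be illustrated with a much simpler stand-in. The sketch below is not an e-graph and not EGG-SR's method: it only canonicalizes expressions by sorting the operands of commutative operators, then averages rewards per equivalence class. A real e-graph would represent many rewrite rules at once; the tuple encoding and function names here are assumptions.

```python
from collections import defaultdict

# Operators whose operand order does not change the value.
COMMUTATIVE = {"+", "*"}


def canonical(expr):
    """Return a canonical form by recursively sorting commutative operands.

    Expressions are nested tuples like ("+", "x", ("*", 2, "y")).
    """
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)


def aggregate_rewards(scored):
    """Group (expression, reward) pairs by canonical form and average rewards."""
    buckets = defaultdict(list)
    for expr, reward in scored:
        buckets[canonical(expr)].append(reward)
    return {form: sum(r) / len(r) for form, r in buckets.items()}


a = ("+", "x", ("*", "y", 2))
b = ("+", ("*", 2, "y"), "x")   # same expression as `a`, reordered
c = ("-", "x", "y")             # subtraction is not commutative

classes = aggregate_rewards([(a, 0.5), (b, 1.0), (c, 0.25)])
```

With this grouping, `a` and `b` share one entry whose reward is the mean of their two scores, so a search procedure would not need to explore both separately.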

Orbital Characterization of a Newly Discovered Small Satellite Around Quaoar

A new satellite around Quaoar was discovered via a single stellar occultation, but its orbital parameters were quickly lost. Attempts to recover the object in JWST data failed, a non-detection attributed to its extreme faintness and limitations in the accuracy of current PSF models. The satellite's existence supports a collisional disk origin hypothesis for Quaoar's rings, but confirmation will require next-generation 30-meter-class telescopes and improved hydrodynamical modeling.

Code

Adk-go: code-first Go toolkit for building, evaluating, and deploying AI agents

The Agent Development Kit (ADK) for Go is an open-source, code-first framework for building, deploying, and orchestrating AI agents. It allows developers to define agent logic, tools, and multi-agent systems directly in Go, leveraging the language's performance and concurrency for cloud-native applications. The toolkit is model-agnostic and designed to provide flexibility and control over agent workflows.

Show HN: Tusk Drift – Open-source tool for automating API tests

Tusk Drift is a Node.js SDK for deterministic API testing using a record-and-replay mechanism. It captures real-world API traffic, including interactions with databases and other services, and replays these traces as tests. During replay, all outbound requests are mocked with the captured data to ensure fast, consistent, and side-effect-free regression testing.
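
The record-and-replay pattern described above can be sketched in a few lines. This is a generic illustration in Python, not Tusk Drift's SDK or API (which targets Node.js); `Recorder`, `Replayer`, `handler`, and `fake_database` are hypothetical names.

```python
import json


class Recorder:
    """Wraps an outbound call and stores each request/response pair."""

    def __init__(self, real_call):
        self.real_call = real_call
        self.trace = []

    def __call__(self, request):
        response = self.real_call(request)
        self.trace.append({"request": request, "response": response})
        return response


class Replayer:
    """Serves recorded responses instead of making outbound calls."""

    def __init__(self, trace):
        self._responses = {
            json.dumps(t["request"], sort_keys=True): t["response"]
            for t in trace
        }

    def __call__(self, request):
        key = json.dumps(request, sort_keys=True)
        if key not in self._responses:
            raise LookupError(f"no recorded response for {request!r}")
        return self._responses[key]


def handler(user_id, db_call):
    """Hypothetical API handler under test with one outbound dependency."""
    row = db_call({"query": "get_user", "id": user_id})
    return {"greeting": f"Hello, {row['name']}!"}


def fake_database(request):
    """Stand-in for a real database used during the recording phase."""
    return {"name": "Ada", "id": request["id"]}


# Record phase: run against the real dependency and capture traffic.
recorder = Recorder(fake_database)
recorded_output = handler(7, recorder)

# Replay phase: the same handler runs with the dependency mocked from the trace.
replayer = Replayer(recorder.trace)
replayed_output = handler(7, replayer)
```

Because the replayer never touches the real dependency, the replayed run is deterministic and free of side effects, and any difference between `recorded_output` and `replayed_output` signals a regression in the handler.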

Show HN: Linnix – eBPF observability that predicts failures before they happen

Linnix is an eBPF-powered observability tool for Linux that monitors process lifecycle events and system telemetry at the kernel level with low overhead. It features a built-in rules engine for detecting common issues like fork storms and offers an optional, experimental LLM integration for natural language incident analysis. The AI component can connect to any OpenAI-compatible API, including local models via Ollama or vLLM, to provide insights beyond simple threshold-based alerts.
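
A threshold rule for something like a fork storm can be sketched as a sliding-window counter. This is only an illustration of the general technique: Linnix's actual rule format, thresholds, and event schema are not described in the summary, and `ForkStormRule` and its parameters are hypothetical. A real rule would likely also group events per process or per parent.

```python
from collections import deque


class ForkStormRule:
    """Fires when more than `max_forks` fork events occur within `window_seconds`."""

    def __init__(self, max_forks, window_seconds):
        self.max_forks = max_forks
        self.window = window_seconds
        self.events = deque()

    def observe(self, timestamp):
        """Record a fork at `timestamp` (seconds); return True if the rule fires."""
        self.events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.max_forks


rule = ForkStormRule(max_forks=3, window_seconds=1.0)
timestamps = [0.0, 0.1, 0.2, 0.3, 2.0]
alerts = [rule.observe(t) for t in timestamps]
print(alerts)  # [False, False, False, True, False]
```

The fourth event pushes the count inside the one-second window above the threshold; by the fifth event the earlier ones have expired, so the rule no longer fires.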

Show HN: Gerbil – an open source desktop app for running LLMs locally

Gerbil is a cross-platform desktop app that provides a GUI wrapper for running LLMs locally using KoboldCpp. It automates the management of the KoboldCpp binary, including updates and process control, to optimize performance and resource usage. The app supports both text and image generation and includes integrations for popular frontends like SillyTavern and OpenWebUI. A CLI mode is also available to proxy commands directly to the underlying KoboldCpp engine for advanced use cases.

Show HN: SpecMind – AI architecture tool for vibe coding

SpecMind is an open-source tool that uses LLMs to prevent architectural drift in AI-assisted development. It integrates with coding assistants through a spec-driven workflow: /analyze uses tree-sitter to parse a codebase and generate architecture diagrams, /design plans a feature's architectural impact, and /implement uses the resulting spec as context for generating aligned code. This keeps architecture documentation, stored as version-controllable Markdown and Mermaid files, synchronized with the implementation from the first commit.