Wednesday — July 23, 2025
AI designs bizarre yet effective physics experiments, a new study reveals the semantic leakage phenomenon in 13 flagship language models, and the Any-LLM library provides a unified interface to access different large language model providers.
News
AI comes up with bizarre physics experiments, but they work
Artificial intelligence software is being used to design novel experimental protocols in physics, including designs that improved the sensitivity of the Laser Interferometer Gravitational-Wave Observatory (LIGO) by exploiting counterintuitive tricks. AI is becoming an increasingly powerful tool in physics more broadly, helping researchers design experiments, find patterns in complex data, and even discover new equations, with the potential to yield new discoveries and a deeper understanding of the universe.
The Hater's Guide to the AI Bubble
The author, Ed Zitron, is alarmed by the current state of the AI industry, which he believes is a deeply unstable bubble built on "vibes and blind faith" with a central point of failure. He argues that the industry is driven by waste, environmental destruction, and a false narrative that generative AI is capable of replacing human workers, and he singles out NVIDIA as a key player whose dominance makes the bubble particularly concerning.
AI Market Clarity
The AI market has consolidated markedly over the past four years, with clear winners emerging in the large language model (LLM) space, including Anthropic, Google, and OpenAI. As the market solidifies, new areas such as AI-assisted software engineering are emerging, with companies like Cursor, Cognition, and Microsoft/GitHub leading the charge; revenue ramps are expected to be extremely rapid, with some companies reaching $500M in ARR within two years.
Media's AI Anthropomorphism Problem
Major media outlets are misrepresenting the actions of AI chatbots, such as ChatGPT, by attributing human-like qualities and intentions to them, which obscures the real story of corporate negligence and lack of accountability. This anthropomorphic framing shields companies like OpenAI and Google from responsibility for their products' failures, allowing them to avoid explaining their design choices and safety protocols, and instead creates a narrative that blames the AI itself for any harm caused.
One in six US workers pretends to use AI to please the bosses
One in six US workers admits to pretending to use AI to impress their bosses, despite many feeling anxious or overwhelmed by the technology. A survey found that 16% of employees sometimes claim to use AI when they don't, while others use it but conceal the fact, highlighting the fear and uncertainty surrounding AI adoption in the workplace.
Research
g-AMIE: Towards a Safer Medical AI
The proposed guardrailed-AMIE (g-AMIE) system is a conversational AI framework that takes a medical history and hands its assessment to a primary care physician for oversight, supporting asynchronous decision-making while keeping accountability with the physician. In a virtual clinical examination, g-AMIE outperformed nurse practitioners, physician assistants, and primary care physicians at performing intake and proposing diagnoses and management plans, while also being more time-efficient for the overseeing physician than standalone physician consultations.
Hierarchical Reasoning Model
The Hierarchical Reasoning Model (HRM) is a novel AI architecture that enables efficient and stable reasoning through a hierarchical and multi-timescale processing approach, inspired by the human brain. HRM achieves exceptional performance on complex reasoning tasks, outperforming larger models and requiring minimal training data, and has the potential to be a transformative advancement in artificial general intelligence capabilities.
Diffusion Beats Autoregressive in Data-Constrained Settings
Diffusion-based language models have been found to outperform traditional autoregressive (AR) models in data-constrained settings, achieving lower validation loss and superior downstream performance by making better use of repeated data. This advantage is attributed to implicit data augmentation, where diffusion models are exposed to diverse token orderings and prediction tasks, and the results suggest that diffusion models are a compelling alternative to AR models when data is scarce but compute is abundant.
Gemini 2.5 Pro Capable of Winning Gold at IMO 2025
The International Mathematical Olympiad (IMO) poses uniquely challenging problems that Large Language Models (LLMs) typically struggle with, but in a recent test Google's Gemini 2.5 Pro solved 5 of the 6 IMO 2025 problems correctly. The result highlights the potential of LLMs for complex reasoning, while underscoring that carefully designed prompting and solution-selection strategies are needed to fully harness their capabilities.
Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in LLMs
Language models have been found to exhibit a previously unidentified phenomenon called semantic leakage, where they inadvertently carry irrelevant associations from the prompt into their generated text; for example, told that a character likes the color yellow, a model may guess that the character drives a school bus. This behavior, observed in 13 flagship models and across languages and settings, highlights another type of bias that shapes language models' generation patterns and behavior.
Code
Show HN: Any-LLM – Lightweight router to access any LLM Provider
any-llm is a Python library that provides a unified interface to different large language model (LLM) providers, letting developers switch between models by changing a single string. The library offers a simple, developer-friendly interface built on the official provider SDKs, is actively maintained, and requires no proxy or gateway server to interact with LLM providers.
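The single-string-switch idea can be illustrated with a minimal router of our own. This is a sketch of the pattern, not any-llm's actual API: the `completion` function, the provider handlers, and the model names below are all hypothetical stand-ins.

```python
# Minimal sketch of a provider-agnostic LLM router (hypothetical, not any-llm's API).
# A "provider/model" string selects the backend, mirroring the one-string switch
# that a unified-interface library advertises.

def _openai_complete(model, messages):
    # Placeholder: a real router would call the provider's official SDK here.
    return f"[openai:{model}] " + messages[-1]["content"]

def _anthropic_complete(model, messages):
    # Placeholder for a second provider backend.
    return f"[anthropic:{model}] " + messages[-1]["content"]

PROVIDERS = {
    "openai": _openai_complete,
    "anthropic": _anthropic_complete,
}

def completion(model: str, messages: list) -> str:
    """Dispatch to a provider based on a 'provider/model' string."""
    provider, _, model_name = model.partition("/")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDERS[provider](model_name, messages)

reply = completion("openai/gpt-4o-mini", [{"role": "user", "content": "hello"}])
```

With this shape, switching providers is a one-string change at the call site, e.g. `completion("anthropic/claude-3-haiku", ...)`, which is the developer experience the library describes.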
Show HN: Bazaar – a new LLM benchmark for economic reasoning under uncertainty
The BAZAAR benchmark evaluates the economic decision-making abilities of Large Language Models (LLMs) in a competitive simulated market, testing their ability to learn bidding strategies, adapt to market conditions, and balance risk and reward. The benchmark uses various metrics, including TrueSkill ratings and Conditional Surplus Alpha, to compare the performance of different LLMs and baselines, providing insights into their economic instincts and strategic adaptation in a double-auction marketplace.
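The double-auction setting BAZAAR builds on can be illustrated with a toy uniform-price clearing routine. This is an illustrative sketch of the general mechanism, not the benchmark's actual matching rules, which may differ.

```python
# Toy uniform-price double auction: match the highest bids with the lowest asks
# and clear every matched trade at a single midpoint price (illustrative only).

def clear_double_auction(bids, asks):
    """bids/asks are lists of limit prices; returns (num_trades, clearing_price)."""
    bids = sorted(bids, reverse=True)   # most aggressive buyers first
    asks = sorted(asks)                 # most aggressive sellers first
    trades = 0
    # Keep matching while the next-best buyer still outbids the next-best seller.
    while trades < min(len(bids), len(asks)) and bids[trades] >= asks[trades]:
        trades += 1
    if trades == 0:
        return 0, None
    # Clear at the midpoint between the marginal matched bid and ask.
    price = (bids[trades - 1] + asks[trades - 1]) / 2
    return trades, price

# Buyers willing to pay up to 10, 8, 5; sellers asking at least 4, 6, 9.
print(clear_double_auction([10, 8, 5], [4, 6, 9]))  # → (2, 7.0)
```

An LLM agent's "bidding strategy" in such a market amounts to choosing its limit price given its private value and observed history, which is the behavior the benchmark's metrics score.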
We Built a Language Model 14,000,000x Smaller Than GPT3 and Formally Verified It
The Atomic Language Model is a mathematically rigorous, recursively complete language model that fits in under 50 kB with zero runtime dependencies, built on Chomsky's Minimalist Grammar theory with formal verification and empirical validation. That makes it roughly 14,000,000x smaller than models such as GPT-3, while still providing provable recursion, next-token prediction, and formal verification.
Show HN: Giti – Natural Language to Git Commands with Local LLM
Giti is a tool that converts natural language into executable Git commands using the Qwen2.5-Coder model, allowing users to interact with Git in plain language. It can be installed and run locally, with features such as a dry-run mode, an interactive shell, and support for context files to enhance workflows.
Show HN: Dyad – build AI apps locally, no cloud
Dyad is a local, open-source AI app builder that runs on your machine, offering a fast, private, and customizable experience with support for your own AI API keys and cross-platform compatibility. It can be downloaded for free with no sign-up required, and is open-source under the Apache 2.0 license, allowing for community contributions.