Saturday — August 23, 2025
Google cuts the energy cost of AI queries 33-fold, expert programmers put LLM "vibe coding" to work in their workflows, and researchers find LLMs exhibit human biases when generating random sequences.
News
Being “Confidently Wrong” is holding AI back
The problem with current AI systems is that they can be "confidently wrong," delivering inaccurate information with high confidence, which erodes trust and blocks adoption in real-world use cases. Rather than striving for perfect accuracy, AI systems need to signal uncertainty and learn from corrections, creating an "accuracy flywheel" that improves over time.
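As a loose sketch of what "signaling uncertainty" might look like in practice (the threshold, helper names, and correction log below are illustrative assumptions, not anything from the article):

```python
# Illustrative sketch: abstain below a confidence threshold and log corrections,
# the feedback loop behind the "accuracy flywheel" idea. All names are hypothetical.

CONFIDENCE_THRESHOLD = 0.8
correction_log = []  # answers flagged or fixed by users, fed back into the system

def respond(question, answer, confidence):
    """answer/confidence would come from the underlying model."""
    if confidence < CONFIDENCE_THRESHOLD:
        return f"I'm not confident about this ({confidence:.0%}); please verify."
    return answer

def record_correction(question, wrong_answer, corrected_answer):
    """Corrections become data the system can learn from over time."""
    correction_log.append(
        {"question": question, "was": wrong_answer, "now": corrected_answer}
    )
```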
A guide to Gen AI / LLM vibecoding for expert programmers
The author, an experienced developer, argues that even senior engineers and domain experts can benefit from "vibe coding" with Large Language Model (LLM) agents, which can be treated as dedicated interns or sophomore-level students who assist with well-scoped tasks. By integrating LLM agents into their workflow, experts free up time for higher-level work, essentially becoming team leads who guide and direct the agents to produce high-quality results.
Measuring the environmental impact of AI inference
Google claims to have reduced the energy cost of AI queries by a factor of 33 in just one year, with a single text query now using the energy equivalent of about nine seconds of TV watching. The improvement comes as Google and others in the industry face growing concern about the environmental impact of rapidly expanding AI use and the data centers that support it.
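For scale, a quick back-of-envelope check of the "nine seconds of TV" comparison, assuming a television draws roughly 100 W (the wattage is an assumption, not a figure from the article):

```python
# Back-of-envelope energy math, assuming a ~100 W television (assumption, not from the article).
tv_watts = 100
seconds_of_tv = 9

joules = tv_watts * seconds_of_tv   # 900 J per text query
watt_hours = joules / 3600          # 0.25 Wh per text query
print(f"{joules} J = {watt_hours:.2f} Wh per query")

# A 33x reduction implies roughly 33 * 0.25 = ~8 Wh per query a year earlier.
```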
The warning signs the AI bubble is about to burst
A recent report from MIT researchers has sparked panic on Wall Street, warning that 95% of businesses are getting zero return on their AI investments, despite $30-40bn being poured into the technology. The report's findings have led to a shock sell-off in tech stocks, with shares in Nvidia and Palantir falling, and has raised fears that the AI bubble may be about to burst.
Qoder Quest Mode: Task Delegation to AI Agents
Quest Mode is a new AI-assisted coding workflow built around natural-language task descriptions: developers describe what they want and let the AI explore solutions autonomously. The approach, also called Spec-Driven Development, defines software logic and requirements clearly up front so the AI can deliver accurate, high-quality results, and it is expected to bring significant productivity gains to software development.
Research
How Random Is Random? Evaluating Randomness and Humanness of LLM Coin Flips (2024)
Large language models (LLMs) exhibit human biases when asked to generate random sequences, with some models such as GPT-4 and Llama 3 amplifying these biases while others, such as GPT-3.5, behave more randomly. The split raises a basic design question: depending on the application, either genuinely random or human-like behavior may be the more useful property.
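A minimal sketch of the kind of measurement involved: given a sequence of flips produced by a model (the collection step is omitted here), compare its head rate and longest run against a genuinely fair coin; humans, and reportedly some LLMs, over-alternate and avoid long runs.

```python
import random
from itertools import groupby

def flip_stats(flips):
    """Head rate and longest run of identical outcomes for a list of 'H'/'T' flips."""
    head_rate = flips.count("H") / len(flips)
    longest_run = max(len(list(g)) for _, g in groupby(flips))
    return head_rate, longest_run

# Reference: a genuinely fair coin.
fair_flips = [random.choice("HT") for _ in range(1000)]

# Placeholder for model-generated flips; a human-biased generator over-alternates,
# so its longest runs come out shorter than a fair coin's.
model_flips = list("HTHTHTHHTHTTHTHTHTHH")

print("fair :", flip_stats(fair_flips))
print("model:", flip_stats(model_flips))
```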
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Reinforcement learning for large language models (LLMs) has made significant progress, but its development is hindered by a lack of standardized guidelines and inconsistent experimental settings, leading to confusion among practitioners. This paper addresses these challenges by systematically reviewing and evaluating various reinforcement learning techniques, providing clear guidelines for selecting techniques and revealing a simple yet effective combination that improves performance in LLM reasoning.
Is GPT-OSS Good? A Comprehensive Evaluation
OpenAI's newly released GPT-OSS models, with 120B and 20B parameters, were evaluated against other large language models and showed mid-tier performance, with the smaller 20B model surprisingly outperforming the larger 120B model in some areas. The results suggest that increasing model size may not always lead to proportional performance gains, particularly in sparse architectures, and highlight the need for further research into optimization strategies for more efficient model selection.
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
DuPO is a dual learning-based framework that generates annotation-free feedback to optimize task performance, addressing traditional dual learning's restriction to strictly invertible task pairs by broadening applicability to non-invertible tasks. DuPO achieves substantial gains across diverse tasks, including translation, mathematical reasoning, and inference-time reranking, positioning it as a scalable and general paradigm for optimizing large language models.
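The summary doesn't spell out the mechanism, but the underlying dual-learning idea can be sketched as follows: run the primary task, reconstruct the input with a dual task, and score the reconstruction as an annotation-free reward. The function names and the round-trip example are illustrative, not DuPO's actual implementation.

```python
# Illustrative dual-learning reward (e.g. a translation round-trip), not DuPO's actual code.
def dual_reward(source_text, primary_task, dual_task, similarity):
    output = primary_task(source_text)         # e.g. English -> French translation
    recovered = dual_task(output)              # dual task tries to reconstruct the input
    return similarity(source_text, recovered)  # annotation-free feedback signal

# Toy usage with stand-in functions:
reward = dual_reward(
    "the cat sat",
    primary_task=lambda s: s.upper(),        # stand-in for the forward model
    dual_task=lambda s: s.lower(),           # stand-in for the inverse model
    similarity=lambda a, b: float(a == b),   # stand-in for a similarity metric
)
print(reward)  # 1.0 -- the dual task perfectly recovered the input
```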
Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Avengers-Pro is a test-time routing framework that ensembles large language models of varying capacities and efficiencies, routing queries to the most suitable model based on a performance-efficiency score. The framework achieves state-of-the-art results, surpassing the strongest single model by 7% in average accuracy and matching its performance at significantly lower costs, while consistently yielding the highest accuracy for any given cost.
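The routing rule is only described at a high level; a hedged sketch of what a performance-efficiency score could look like is a weighted trade-off between each model's estimated accuracy for the query and its normalized cost, with the query sent to the top-scoring model. The weighting scheme and the model table below are assumptions for illustration.

```python
# Hypothetical performance-efficiency routing: score = alpha*accuracy - (1-alpha)*cost,
# with cost normalized to [0, 1]. All numbers are made up for illustration.

MODELS = {
    "small-cheap":    {"accuracy": 0.72, "cost": 0.05},
    "medium":         {"accuracy": 0.81, "cost": 0.30},
    "large-frontier": {"accuracy": 0.90, "cost": 1.00},
}

def route(alpha):
    """alpha near 1 favors accuracy; alpha near 0 favors cheapness."""
    def score(item):
        _, m = item
        return alpha * m["accuracy"] - (1 - alpha) * m["cost"]
    return max(MODELS.items(), key=score)[0]

print(route(alpha=0.9))  # accuracy-heavy -> picks the frontier model
print(route(alpha=0.3))  # cost-heavy -> picks the small model
```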
Code
Show HN: BrowserOS -- browser agents with GPT-OSS, local llms
BrowserOS is an open-source, AI-powered browser that runs locally on your computer, prioritizing privacy and security by keeping your data on your device. It offers a familiar interface similar to Google Chrome, with features like AI agents, a built-in ad blocker, and support for local models, making it a privacy-first alternative to other browsers and services like Perplexity Comet.
Show HN: Lacquer – GitHub Actions for AI workflows in a single Go binary
Lacquer is a lightweight AI workflow engine that lets users codify repeatable engineering processes into reliable YAML workflows, providing a GitOps-native, local-first development approach with zero dependencies. It offers a range of features, including support for multiple agents, local tools, script and container support, complex control flow, and built-in state management, making it suitable for building production-ready AI-powered internal tools.
Allie: Human-Like AI Chess Bot on Lichess
Allie is a GPT-2-like chess model that learns from human gameplay and is deployed on Lichess, with its data and code released for the paper "Human-Aligned Chess With a Bit of Search". Training from scratch requires Python 3.11, 8 GPUs, and approximately 60GB of storage, or the model can be evaluated using pre-trained weights downloaded from Hugging Face.
Show HN: Any-LLM chat demo – switch between ChatGPT, Claude, Ollama, in one chat
Any-llm is a unified API that allows users to access different large language model (LLM) providers through a single interface, providing a simple and developer-friendly way to switch between models and providers. The project aims to address the fragmented ecosystem of LLM provider interfaces by offering a consistent interface that leverages official provider SDKs and stays framework-agnostic, making it easy to use across different projects and use cases.
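The following is a hedged illustration of the provider-routing pattern the project describes, not the actual any-llm API; the function names, model-id format, and backend stubs are assumptions.

```python
# Hypothetical provider-agnostic wrapper -- an illustration of the pattern,
# not the any-llm interface itself. Each stub stands in for the official
# provider SDK call it would make.

def _openai_chat(model, messages):
    return f"[openai:{model}] would call the OpenAI SDK here"

def _anthropic_chat(model, messages):
    return f"[anthropic:{model}] would call the Anthropic SDK here"

def _ollama_chat(model, messages):
    return f"[ollama:{model}] would call a local Ollama server here"

PROVIDERS = {"openai": _openai_chat, "anthropic": _anthropic_chat, "ollama": _ollama_chat}

def chat(model_id, messages):
    """Accept ids like 'openai/gpt-4o' or 'ollama/llama3' and route by provider."""
    provider, model = model_id.split("/", 1)
    return PROVIDERS[provider](model, messages)

print(chat("ollama/llama3", [{"role": "user", "content": "hello"}]))
```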
Show HN: AICF – a tiny "what changed" feed for AI/RAG (v0.1 minimal core)
AICF is a web-native protocol that allows publishers to expose content changes to AI systems through a simple feed, enabling efficient updates without full re-crawls. The protocol features section-level updates, a simple NDJSON format, and standard discovery, making it easy for AI systems to stay up-to-date with the latest changes on a website.
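As a concrete, hypothetical illustration of a section-level NDJSON change feed and how a consumer might use it (the field names are assumptions, not the AICF specification):

```python
import json

# Hypothetical AICF-style feed: one JSON object per line (NDJSON), each describing
# a changed section. Field names are illustrative, not the spec.
feed = """\
{"url": "https://example.com/docs/install", "section": "requirements", "changed": "2025-08-22T10:00:00Z", "op": "update"}
{"url": "https://example.com/docs/api", "section": "auth", "changed": "2025-08-23T08:30:00Z", "op": "add"}
"""

last_sync = "2025-08-23T00:00:00Z"

# Re-fetch only sections changed since the last sync instead of re-crawling the site.
for line in feed.splitlines():
    entry = json.loads(line)
    if entry["changed"] > last_sync:  # ISO-8601 UTC strings compare chronologically
        print("refresh:", entry["url"], "#", entry["section"])
```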