Sunday September 28, 2025

Unsloth's GPT-OSS Reinforcement Learning achieves 3x faster inference and 50% less VRAM usage, researchers introduce Metacognitive Reuse to improve LLM reasoning, and developers launch Pluely, a privacy-first AI assistant with real-time conversation assistance.

News

GPT-OSS Reinforcement Learning

Unsloth now offers the fastest inference, lowest VRAM usage, and longest context for training OpenAI's gpt-oss with reinforcement learning (RL) via GRPO, with no accuracy degradation. This is achieved through innovations such as custom algorithms, weight sharing, and Flex Attention, allowing for 3x faster inference and 50% less VRAM usage compared to other implementations, and enabling training on lower-end GPUs, including free training on Colab.
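
The core of GRPO is that it needs no learned value model: each prompt's sampled completions are scored, and each completion's advantage is its reward normalized against its own group. A minimal sketch of that group-relative advantage step (not Unsloth's implementation, just the standard GRPO formula):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each completion's
    reward by the mean and std of the group sampled for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Four completions sampled for one prompt, scored by a reward function.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions above the group mean get positive advantages and are reinforced; those below get negative ones, so no separate critic network is required.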

The (economic) AI apocalypse is nigh

The author argues that an economic apocalypse is imminent because the AI bubble is inflated by monopolists claiming AI can replace human workers, despite the absence of a profitable business model. When the bubble bursts it will cause widespread economic damage; the only mitigation, in the author's view, is to puncture the bubble as soon as possible by challenging the notion that AI can do human jobs, though it may already be too late to prevent the coming crash.

LLM Observability in the Wild – Why OpenTelemetry Should Be the Standard

Building and debugging AI agents in production can be challenging due to the lack of standardization in LLM observability, with competing standards like OpenTelemetry and OpenInference causing fragmentation and compatibility issues. To navigate these challenges, developers are advised to pick a single telemetry backbone, such as OpenTelemetry, and stick to it, even for LLMs, to ensure consistency and avoid siloed observability.
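
The "single backbone" advice boils down to: every LLM call becomes a span with attributes, exported to one place. A stdlib-only stand-in for that pattern (real code would use opentelemetry-sdk; the `gen_ai.*` attribute names here follow the style of OTel's GenAI semantic conventions, and the model name is made up):

```python
import time
from contextlib import contextmanager

# In-memory stand-in for a telemetry backend/exporter.
SPANS = []

@contextmanager
def span(name, **attributes):
    """Record a timed span with attributes, like an OTel tracer would."""
    record = {"name": name, "attributes": dict(attributes), "start": time.time()}
    try:
        yield record
    finally:
        record["end"] = time.time()
        SPANS.append(record)

def call_llm(prompt):
    # Attribute keys mimic OTel GenAI semantic-convention naming.
    with span("llm.chat",
              **{"gen_ai.request.model": "example-model",
                 "gen_ai.prompt.length": len(prompt)}) as s:
        reply = "stubbed reply"  # stand-in for the real provider call
        s["attributes"]["gen_ai.response.length"] = len(reply)
        return reply

call_llm("Summarize OpenTelemetry for LLMs")
```

The payoff of one backbone is exactly this uniformity: prompts, tool calls, and ordinary HTTP requests all land in the same trace store instead of separate silos.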

AI Investment Is Starting to Look Like a Slush Fund

Nvidia and OpenAI, two key players in the AI industry, have announced a strategic partnership where Nvidia will invest up to $100 billion in OpenAI, which will use the funds to purchase Nvidia's systems and deploy its next-generation AI infrastructure. This arrangement has raised concerns about circular financing, where companies are essentially handing money back and forth, with Nvidia investing in OpenAI, which then uses the funds to buy Nvidia's products, creating a potentially unsustainable and opaque financial relationship.

Cost of AGI Delusion: Chasing Superintelligence, US Falling Behind in Real AI Race

The United States is prioritizing the development of artificial general intelligence (AGI), or superintelligence, over other artificial intelligence (AI) applications, which may cause it to fall behind in the real AI race. By chasing AGI, the US is diverting resources away from more practical and achievable AI applications that could provide significant benefits in areas such as industry, healthcare, and education.

Research

Metacognitive Reuse: Turning Recurring LLM Reasoning into Concise Behaviors

Large language models can improve their reasoning abilities by converting recurring intermediate steps into concise, reusable "behaviors" that are stored and reused, reducing token usage and latency. This approach, which involves providing the model with relevant behaviors during inference or fine-tuning, achieves improved test-time reasoning in various settings, including behavior-conditioned inference, self-improvement, and supervised fine-tuning.
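
A toy sketch of the "behavior handbook" idea (the names, triggers, and retrieval scheme below are illustrative inventions, not the paper's API): recurring reasoning steps are distilled into short named behaviors, and the relevant ones are prepended to the prompt so the model can reuse them instead of re-deriving them:

```python
# Hypothetical handbook of distilled behaviors.
HANDBOOK = {
    "check_units": "Before finishing, verify that units on both sides match.",
    "case_split": "If an absolute value appears, split into cases by sign.",
    "small_cases": "Try n = 1, 2, 3 first to spot a pattern.",
}

def retrieve_behaviors(problem, handbook):
    """Naive keyword retrieval; the paper has the LLM pick behaviors itself."""
    triggers = {"check_units": "units",
                "case_split": "|",
                "small_cases": "pattern"}
    return [handbook[k] for k, kw in triggers.items() if kw in problem.lower()]

def behavior_conditioned_prompt(problem, handbook):
    hints = retrieve_behaviors(problem, handbook)
    header = "".join(f"- {h}\n" for h in hints)
    return f"Useful behaviors:\n{header}\nProblem: {problem}"

prompt = behavior_conditioned_prompt(
    "Solve |x - 3| = 7 and check the pattern for similar equations.", HANDBOOK)
```

Because a behavior is a one-line hint rather than a full worked derivation, conditioning on it costs far fewer tokens than regenerating the reasoning chain each time.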

Teaching LLMs to Plan

A novel instruction tuning framework, PDDL-Instruct, enhances large language models' symbolic planning capabilities by teaching them to rigorously reason about action applicability and plan validity through logical inference steps. The framework has shown significant promise, with instruction-tuned models achieving planning accuracy of up to 94% on standard benchmarks, a 66% improvement over baseline models.
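
The notion of "rigorously reasoning about action applicability and plan validity" can be made concrete with a miniature validator (the blocks-style domain and action names below are made up, not from the paper): a plan is valid only if every action's preconditions hold in the state where it is applied, with add/delete effects applied in order:

```python
# Hypothetical one-block domain with STRIPS-style actions.
ACTIONS = {
    "pick": {"pre": {"hand_empty", "on_table"}, "add": {"holding"},
             "del": {"hand_empty", "on_table"}},
    "drop": {"pre": {"holding"}, "add": {"hand_empty", "on_table"},
             "del": {"holding"}},
}

def validate_plan(init_state, plan, actions=ACTIONS):
    """Return (is_valid, failing_step_index_or_None)."""
    state = set(init_state)
    for i, name in enumerate(plan):
        act = actions[name]
        if not act["pre"] <= state:              # applicability check
            return False, i
        state = (state - act["del"]) | act["add"]  # apply effects
    return True, None

ok, _ = validate_plan({"hand_empty", "on_table"}, ["pick", "drop"])
bad, step = validate_plan({"hand_empty", "on_table"}, ["drop"])
```

PDDL-Instruct's contribution is teaching the model to carry out exactly this kind of step-by-step precondition/effect reasoning itself, rather than relying on an external checker.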

The Transparent Earth: A Multimodal Foundation Model for the Earth's Subsurface

The Transparent Earth is a transformer-based architecture that reconstructs subsurface properties from diverse datasets with varying sparsity, resolution, and modality, and can scale to incorporate new modalities. The model achieves improved performance, including reducing errors in predicting stress angle by more than a factor of three, and aims to become a foundational model for predicting subsurface properties anywhere on Earth.

Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Autoregressive language models are limited by their serial nature, while Diffusion Language Models, although parallelizable, require numerous model evaluations to achieve high quality. The introduced FS-DFM model addresses this by training a discrete flow-matching model to be consistent across varying step budgets, yielding fast, accurate sampling up to 128 times faster than baseline diffusion models.

Context-Aware Membership Inference Attacks Against Pre-Trained LLMs

Membership Inference Attacks (MIAs) on Large Language Models (LLMs) aim to determine if a data point was used in the model's training set. A new approach has been developed that adapts MIAs to the generative nature of LLMs, significantly outperforming previous methods and revealing context-dependent memorization patterns in pre-trained LLMs.
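
To see what an MIA measures, here is the classic loss-thresholding baseline that such work builds on (a deliberately simplified sketch, not the paper's context-aware method): texts the model assigns unusually low per-token loss are guessed to be training members:

```python
import math

def avg_negative_log_likelihood(token_probs):
    """Per-token NLL the model assigns to a candidate text."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def guess_member(token_probs, threshold=1.0):
    """Loss-based MIA baseline: low loss -> guess 'seen in training'."""
    return avg_negative_log_likelihood(token_probs) < threshold

member_guess = guess_member([0.9, 0.8, 0.95])     # confidently predicted text
non_member_guess = guess_member([0.2, 0.1, 0.3])  # poorly predicted text
```

The paper's context-dependent finding refines this picture: how strongly a model "remembers" a passage depends on the surrounding context it is conditioned on, which a single global loss threshold cannot capture.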

Code

Pluely – a privacy-first AI assistant

Pluely is an open-source, privacy-first AI assistant that provides real-time assistance during meetings, interviews, and conversations, offering a lightweight and customizable alternative to Cluely. It features various tools, including keyboard shortcuts, system audio capture, microphone audio capture, interactive input, and screenshot capture, allowing users to transform audio and text into intelligent AI assistance with instant responses and contextual help.

Show HN: macOS Local AI Dictation Software

WhisperMac is a local, extensible, and privacy-friendly dictation app for macOS that supports various transcription services, including WhisperCPP, Vosk, and cloud services. The app is currently in heavy beta, offering features like real-time transcription, plugin support, and configurable actions, but requires manual installation by cloning the repo and running build commands.

Show HN: Llumen – Lightweight LLM chat app that runs in <1s with OpenRouter

Llumen is a lightweight, self-hostable LLM chat application that requires only a single OpenRouter API key to use its features, offering a simple and out-of-the-box experience for users. It provides various features, including markdown rendering, multiple chat modes, and deep-research modes, and can be easily set up using a Windows executable, Docker image, or Linux binary.

Show HN: Open-Source Semantic AI Chat Search – 100% Locally

The Index AI Chat Search extension is a Chrome extension that allows users to search their AI chats with context and meaning, using semantic search and local processing to ensure privacy and security. The extension supports multiple AI platforms, including ChatGPT, Claude, and Perplexity, and stores conversation data and embeddings locally in IndexedDB for fast retrieval.
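
The retrieval loop behind "search with context and meaning" is cosine similarity over stored embedding vectors. A sketch under stated assumptions (the extension stores real embeddings in IndexedDB and runs in the browser; the three-dimensional vectors and chat texts here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, store, top_k=2):
    """Rank stored chat chunks by similarity to the query embedding."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in ranked[:top_k]]

store = [
    {"text": "chat about Rust lifetimes", "vec": [0.9, 0.1, 0.0]},
    {"text": "chat about sourdough",      "vec": [0.0, 0.2, 0.9]},
    {"text": "chat about borrow checker", "vec": [0.8, 0.3, 0.1]},
]
results = search([1.0, 0.2, 0.0], store)
```

Because both embedding and ranking happen on-device, no conversation text ever needs to leave the browser, which is the privacy claim in a nutshell.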

Automated Repair of Ambiguous Problem Descriptions for LLM-Based Code Generation

SpecFix is a novel approach that automatically repairs ambiguity in programming problem descriptions to improve the accuracy of large language model (LLM)-based code generation. It works by minimally modifying the requirements to reduce code generation uncertainty and better align natural language with input-output examples, resulting in significant increases in Pass@1 of the modified requirements across multiple models and benchmarks.
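
The alignment signal SpecFix relies on can be illustrated with a tiny example (function names and the toy spec below are hypothetical): an ambiguous requirement admits multiple readings, and the problem's input-output examples are what disambiguate them:

```python
def passes_examples(candidate_fn, examples):
    """examples: list of (args_tuple, expected_output) pairs."""
    return all(candidate_fn(*args) == expected for args, expected in examples)

# Ambiguous spec: "return the middle of the list" -- middle by position,
# or the median by value? Two plausible implementations:
def middle_by_index(xs):
    return xs[len(xs) // 2]

def middle_by_value(xs):
    return sorted(xs)[len(xs) // 2]

# The problem's I/O examples settle it: only one reading survives.
examples = [(([3, 1, 2],), 2), (([5, 9, 1],), 5)]
index_ok = passes_examples(middle_by_index, examples)
value_ok = passes_examples(middle_by_value, examples)
```

SpecFix's repair step then minimally rewrites the requirement (e.g., "return the median value of the list") so that natural language and examples point at the same behavior, shrinking the model's code-generation uncertainty.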
