Thursday — July 31, 2025

Researchers discover a major AI training data set contains millions of examples of personal data, a new coding tool called Crush integrates large language models into terminals, and a study reveals that knowledge work occupations have the highest "AI applicability score" indicating a strong potential for AI impact.

News

Fast

Fast software changes behavior: developers ship code more often and users become more productive. Speed also signals simplicity, discipline, and focus, which makes software feel magical and respectful of users' time. As technology advances, the priority will shift from adding capabilities to optimizing for performance, low latency, and reliability, unlocking new use cases and changing the way we live.

A major AI training data set contains millions of examples of personal data

Researchers have found that DataComp CommonPool, a large AI training data set, likely contains hundreds of millions of images with personally identifiable information, including passports, credit cards, and birth certificates. The data set, which has been downloaded more than 2 million times, was scraped from the web. Its curators took some measures to preserve privacy but were unable to filter out sensitive information effectively, which poses significant privacy risks.

Critical vulnerability in AI coding platform Base44 allowing unauthorized access

Wiz Research discovered a critical vulnerability in the Base44 vibe coding platform, which allowed unauthorized access to private applications built by its users, effectively bypassing authentication controls, including Single Sign-On (SSO). The vulnerability, which was quickly fixed by the platform's owner Wix after responsible disclosure, highlights the shared-risk model of vibe coding platforms, where a single flaw in the platform's core can jeopardize every application built upon it.

Show HN: An AI agent that learns your product and guides your users

Frigade is an AI-powered platform that helps users navigate and adopt products effortlessly, increasing retention and customer success by providing personalized support and guidance throughout the user journey. The platform offers a range of features, including onboarding, customer support, feature adoption, and product marketing, all of which can be customized to match the product and provide valuable insights into user behavior.

Why do some AI chatbot subscriptions cost more than $200?

Some AI chatbot subscriptions, like OpenAI's ChatGPT Pro, cost over $200 per month, not because they are immediately profitable, but because of a "vibe-based pricing" trend set by industry leaders. This pricing strategy has been followed by competitors, such as Anthropic and Google, who have also released high-priced subscription plans for their AI tools, despite the fact that these services are often resource-intensive and costly to run.

Research

Working with AI: Measuring the Occupational Implications of Generative AI

Researchers analyzed 200,000 conversations between users and a generative AI system to understand how AI is being used in work activities, finding that tasks like gathering information, writing, and providing assistance are most common. The study computed an "AI applicability score" for each occupation, revealing that knowledge work occupations, such as computer and mathematical, office support, and sales, have the highest scores, indicating a strong potential for AI impact.

Combolutional Neural Networks [pdf]

The combolutional layer, a learned-delay IIR comb filter and envelope detector, is proposed as a means to extract harmonic features in audio signals, offering an effective replacement for convolutional layers in tasks requiring precise harmonic analysis. This layer provides several benefits, including low parameter count, efficient CPU inference, and improved interpretability, making it a suitable choice for audio tasks such as piano transcription, speaker classification, and key detection.
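To make the comb-filter idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the delay is fixed rather than learned, the envelope detector is assumed to be a simple mean absolute value, and the function names (`feedback_comb`, `envelope`, `sine`) are hypothetical. It only illustrates why a feedback comb filter followed by an envelope estimate responds strongly to tones whose period matches the delay.

```python
import math

def feedback_comb(x, delay, gain):
    """IIR feedback comb filter: y[n] = x[n] + gain * y[n - delay]."""
    y = [0.0] * len(x)
    for n, sample in enumerate(x):
        feedback = y[n - delay] if n >= delay else 0.0
        y[n] = sample + gain * feedback
    return y

def envelope(y):
    """Crude envelope estimate (assumption): mean absolute value."""
    return sum(abs(v) for v in y) / len(y)

def sine(freq_hz, sample_rate, num_samples):
    """Unit-amplitude test tone."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

sample_rate = 8000
delay = 80  # resonant peaks at multiples of 8000 / 80 = 100 Hz

# A 100 Hz tone lines up with the delay and is reinforced;
# a 150 Hz tone arrives half a cycle out of phase and is attenuated.
matched = envelope(feedback_comb(sine(100, sample_rate, 4000), delay, 0.9))
mismatched = envelope(feedback_comb(sine(150, sample_rate, 4000), delay, 0.9))
print(matched > mismatched)
```

In a trainable layer the delay would be a parameter adjusted during training, which this fixed-delay sketch does not attempt.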

K^4: Online Log Anomaly Detection via Unsupervised Typicality Learning

The K^4 framework is a high-performance, unsupervised method for detecting log anomalies that overcomes the limitations of existing methods by being parser-independent and highly efficient. K^4 achieves state-of-the-art results, outperforming baselines by large margins, while being significantly faster with training times under 4 seconds and inference times as low as 4 microseconds.

Code

Crush: Glamourous AI coding agent for your favourite terminal

Crush is a terminal-based coding tool that integrates with large language models (LLMs) to enhance productivity, allowing users to choose from various LLMs, switch between them, and maintain multiple work sessions. It can be installed using package managers, downloaded as binaries, or installed with Go, and supports customization through configuration files and environment variables.

Show HN: Sourcebot, the self-hosted Perplexity for your codebase

Sourcebot is a self-hosted tool that helps users understand their codebase by allowing them to ask questions and receive detailed answers, search and navigate across all their repositories and branches, and explore files with syntax highlighting and code navigation. It can be deployed using an official Docker image and configured to index custom repositories and connect to language models. The documentation has more details, and a public demo is available for testing.

Show HN: Subtle Failure Modes I Keep Seeing in Production‑Grade AI Systems

The WFGY Engine is a semantic reasoning engine designed to solve core AI problems such as hallucination, context drift, and logic collapse, providing a full-stack solution for building a new semantic layer. It offers various modules, including TXT OS, Blah Blah Blah, Blur Blur Blur, and others, which run natively as .txt apps and provide features like semantic Q&A, image generation, and reasoning games.

Show HN: AgentGuard – Auto-kill AI agents before they burn through your budget

AgentGuard is a tool that prevents AI agents from making excessive API calls and incurring unexpected costs by automatically stopping the process when a set budget limit is reached. It provides real-time monitoring and cost tracking, allowing developers to set a budget limit and receive notifications when the limit is exceeded, thereby preventing financial losses.
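The budget-limit idea can be sketched in a few lines of Python. This is not AgentGuard's API: `BudgetGuard`, `BudgetExceeded`, `run_agent`, and the per-call cost figures are all hypothetical, and the sketch assumes the cost of each call is known before it is made. It shows one way a running total can stop an agent loop once a limit would be exceeded.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spending past the limit (hypothetical)."""

class BudgetGuard:
    """Tracks cumulative spend against a fixed limit (illustrative only)."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        # Refuse the call if it would take total spend over the limit.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"limit {self.limit_usd} reached at {self.spent_usd}"
            )
        self.spent_usd += cost_usd

def run_agent(guard, call_costs):
    """Simulate an agent loop; return how many calls completed."""
    completed = 0
    try:
        for cost in call_costs:
            guard.charge(cost)   # check the budget before each call
            completed += 1       # the real API call would happen here
    except BudgetExceeded:
        pass                     # stop the loop instead of overspending
    return completed

guard = BudgetGuard(limit_usd=1.0)
completed = run_agent(guard, [0.25] * 10)  # ten calls at an assumed $0.25 each
print(completed, guard.spent_usd)
```

Checking before each call keeps total spend at or below the limit; a tool that only learns costs after the fact would instead stop on the first call that crosses it.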

Show HN: We create visual codebase maps that scale (static analysis and LLMs)

CodeBoarding is an open-source tool that generates high-level diagram representations of codebases using static analysis and LLM agents, supporting onboarding, documentation, and comprehension for large, complex systems. It can be integrated with various tools and platforms, including VS Code, GitHub, and MCP servers, to provide a unified representation of codebases for both humans and AI agents.