Monday November 3, 2025

LLMs may overuse em-dashes due to 19th-century training data, a RAG pipeline runs on a 2011 Raspberry Pi in pure PHP, and a model maps vocal prosody to typography.

News

Meta readies $25B bond sale as soaring AI costs trigger stock sell-off

Meta is preparing a $25bn bond sale to fund its rapidly growing AI investments; the disclosure of those soaring costs triggered a sell-off in the company's stock.

Why do AI models use so many em-dashes?

The article investigates the prevalent use of em-dashes in modern LLMs. After dismissing theories related to token efficiency, structural model preferences, and RLHF dialectal influence, the author proposes that the most plausible cause is a shift in training data composition: models after GPT-3.5 were likely trained on newly digitized print books from the late 1800s and early 1900s, a period when em-dash usage was at its peak, thereby embedding the stylistic quirk into the models.

'Do not trust your eyes': AI generates surge in expense fraud

The Financial Times article sits behind a paywall, so only the headline is available; it suggests the piece covers the use of generative AI to produce fraudulent expense claims.

Syllabi – Open-source agentic AI with tools, RAG, and multi-channel deploy

This open-source, self-hostable platform enables the creation of agentic RAG chatbots from diverse knowledge sources like documents, websites, and Notion. It supports multi-channel deployment to web, Slack, and Discord, with a REST API for custom integrations. The system extends beyond simple Q&A by allowing custom tools, native code execution via Pyodide/WebR, and diagram generation, while offering full control over the underlying LLM and behavior.

Is 'learn to craft' the new 'learn to code?'

As LLMs increasingly automate knowledge work like coding and marketing, some white-collar professionals are pivoting to skilled trades for job security. This emerging "learn to craft" movement reflects a belief that hands-on, physical jobs are less vulnerable to AI disruption than desk jobs. While still a nascent trend, it signals a significant shift in perceived career stability in the age of generative AI.

Research

Education Paradigm Shift to Maintain Human Competitive Advantage over AI

The paper argues that while LLMs make the automation of intellectual labor a practical concern, they possess fundamental weaknesses unfixable by current technologies. It proposes adapting education using a constructivist paradigm to cultivate skills that ensure a long-term human advantage over AI.

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Linear is a hybrid linear attention architecture that outperforms full attention across short-context, long-context, and RL scaling regimes. Its core component, Kimi Delta Attention (KDA), extends Gated DeltaNet with a finer-grained gating mechanism and uses a specialized DPLR formulation for hardware efficiency. A 3B parameter model using this architecture demonstrated superior performance over a full attention baseline, reducing KV cache usage by up to 75% and achieving up to 6x decoding throughput on a 1M context, with the kernel and models being open-sourced.
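
As a rough illustration of the recurrence this family of models builds on, the following NumPy sketch implements gated delta-rule linear attention in its slow, step-by-step form. The per-channel decay gate `alpha` stands in for the finer-grained gating the summary attributes to KDA; the names, shapes, and ordering of gate and update are assumptions for clarity, and the chunked DPLR kernels that make the real architecture hardware-efficient are omitted.

```python
import numpy as np

def gated_delta_attention(q, k, v, beta, alpha):
    """Recurrent sketch of gated delta-rule linear attention.

    q, k: (T, d_k); v: (T, d_v)
    beta:  (T,)      write strength in (0, 1)
    alpha: (T, d_k)  per-channel decay gate in (0, 1)
    Returns outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))   # fixed-size state replaces the growing KV cache
    out = np.zeros((T, d_v))
    for t in range(T):
        kt, vt = k[t], v[t]
        S = S * alpha[t]                            # per-channel forgetting
        S = S - beta[t] * np.outer(S @ kt, kt)      # delta-rule erase along k_t
        S = S + beta[t] * np.outer(vt, kt)          # delta-rule write
        out[t] = S @ q[t]
    return out

# Shape check: 16 steps, 8-dim keys, 4-dim values -> (16, 4) outputs.
q = k = np.random.randn(16, 8)
v = np.random.randn(16, 4)
out = gated_delta_attention(q, k, v, beta=np.full(16, 0.5),
                            alpha=np.full((16, 8), 0.99))
print(out.shape)
```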

Can text be made to sound more than just its words? (2022)

A model is presented that processes vocal prosody—loudness, pitch, and duration—and maps these features to typographical dimensions like font-weight, baseline shift, and letter-spacing. This speech-modulated typography aims to embed paralinguistic nuances directly into text captions. In an evaluation, participants matched the modulated text to its source audio with 65% accuracy, with no significant performance difference between static and animated versions.
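
A minimal sketch of the kind of feature-to-typography mapping described, assuming the prosodic features have already been extracted and normalized to [0, 1] per word or syllable; the property names and ranges below are illustrative choices, not the paper's actual parameters.

```python
def prosody_to_typography(loudness, pitch, duration,
                          weight_range=(300, 800),      # CSS font-weight
                          baseline_range=(-0.2, 0.2),   # em, shift down/up
                          spacing_range=(0.0, 0.15)):   # em, letter-spacing
    """Map normalized prosodic features (each in [0, 1]) to CSS-like
    typographic parameters for a single word or syllable."""
    lerp = lambda lo, hi, x: lo + (hi - lo) * x
    return {
        "font-weight": round(lerp(*weight_range, loudness)),
        "baseline-shift": f"{lerp(*baseline_range, pitch):+.2f}em",
        "letter-spacing": f"{lerp(*spacing_range, duration):.2f}em",
    }

# A loud, low-pitched, drawn-out word gets heavy weight, a lowered
# baseline, and wide tracking.
print(prosody_to_typography(loudness=0.9, pitch=0.1, duration=0.8))
```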

R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

The Rule-to-Tag (R2T) framework is a hybrid approach that integrates linguistic rules into a model's training objective via an adaptive loss function, a paradigm called principled learning (PrL). On Zarma POS tagging, an R2T model trained solely on unlabeled text achieved 98.2% accuracy, outperforming a supervised baseline. R2T also serves as a powerful pre-training step for NER, where fine-tuning on just 50 labeled sentences beat a baseline trained on 300, demonstrating high data efficiency.
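
For flavor, here is a hypothetical PyTorch sketch of a rule-encoded objective in the spirit of R2T: a mask marks which tags the linguistic rules permit for each token, and probability mass on forbidden tags is penalized alongside an optional supervised term. The paper's adaptive weighting and rule formalism are more involved; every name and the fixed `lam` weight here are assumptions.

```python
import torch
import torch.nn.functional as F

def rule_encoded_loss(logits, rule_mask, gold_tags=None, lam=1.0):
    """Sketch of a rule-encoded tagging loss.

    logits:    (T, num_tags) per-token tag scores
    rule_mask: (T, num_tags) 1.0 where a rule permits the tag, 0.0 where it
               forbids it (all ones when no rule fires for a token)
    gold_tags: optional (T,) labels; None when training on unlabeled text
    lam:       weight on the rule term (a fixed stand-in for the paper's
               adaptive weighting)
    """
    probs = logits.softmax(dim=-1)
    # Penalize probability mass placed on rule-forbidden tags.
    rule_loss = (probs * (1.0 - rule_mask)).sum(dim=-1).mean()
    if gold_tags is None:
        return lam * rule_loss
    return F.cross_entropy(logits, gold_tags) + lam * rule_loss

# Unlabeled sentence of 4 tokens, 5 tags; a rule forbids tag 0 for token 2.
logits = torch.randn(4, 5)
mask = torch.ones(4, 5)
mask[2, 0] = 0.0
print(rule_encoded_loss(logits, mask))
```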

Three-Flavor Composition of Cosmic Neutrinos with IceCube

Using 11.4 years of IceCube data, the flavor composition of cosmic neutrinos was measured down to the TeV scale. The resulting best-fit flavor ratio is consistent with the standard three-flavor oscillation model. This analysis also constrains the flavor composition at the source, ruling out production via neutron decay with high statistical confidence.

Code

Show HN: Anki-LLM – Bulk process and generate Anki flashcards with LLMs

anki-llm is a CLI toolkit that leverages LLMs to bulk-process and generate Anki flashcards. It provides workflows for batch-modifying existing notes with custom prompts, either directly or via a resilient file-based export/import cycle. The tool can also interactively generate multiple new contextual cards from a single term, with a review and selection step before import. It interfaces with Anki via the AnkiConnect add-on and supports concurrent API requests, custom prompt templating, and direct API querying for advanced scripting.
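
This is not anki-llm's own code, but a sketch of the AnkiConnect plumbing a tool like it sits on: the add-on exposes a local JSON-over-HTTP endpoint that accepts an action name and parameters. The deck, model, and field values below are placeholders.

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # AnkiConnect's default endpoint

def anki_request(action, **params):
    """Send one request to the AnkiConnect add-on and return its result."""
    payload = json.dumps({"action": action, "version": 6,
                          "params": params}).encode()
    req = urllib.request.Request(ANKI_CONNECT_URL, payload)
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    if reply.get("error"):
        raise RuntimeError(reply["error"])
    return reply["result"]

# Fetch notes from a deck, e.g. to batch-rewrite a field with an LLM...
note_ids = anki_request("findNotes", query="deck:Japanese")
notes = anki_request("notesInfo", notes=note_ids)

# ...and add a newly generated card back after review.
anki_request("addNote", note={
    "deckName": "Japanese",
    "modelName": "Basic",
    "fields": {"Front": "食べる", "Back": "to eat"},
    "tags": ["anki-llm"],
})
```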

Show HN: I built a Raspberry Pi webcam to train my dog (using Claude)

A developer built YogiCam, a Raspberry Pi monitoring system, to help train their dog, which suffers from separation anxiety. An LLM (Claude) generated the initial Python and Flask code for the livestreaming web server. The MVP was later enhanced with a UI stopwatch and ngrok for remote access over cellular data. The project demonstrates how LLMs can significantly accelerate development, enabling a product manager with past coding experience to build a functional hardware/software solution.
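
For readers curious what such an MVP involves, below is a generic Flask MJPEG-streaming sketch using OpenCV for capture; it follows the common pattern for Pi webcams rather than the author's actual YogiCam code, and the camera index and port are assumptions.

```python
import cv2
from flask import Flask, Response

app = Flask(__name__)
camera = cv2.VideoCapture(0)  # first attached camera (e.g. a USB webcam)

def mjpeg_frames():
    """Yield JPEG frames in the multipart format browsers render as video."""
    while True:
        ok, frame = camera.read()
        if not ok:
            break
        ok, jpeg = cv2.imencode(".jpg", frame)
        if not ok:
            continue
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")

@app.route("/stream")
def stream():
    return Response(mjpeg_frames(),
                    mimetype="multipart/x-mixed-replace; boundary=frame")

if __name__ == "__main__":
    # Expose on the LAN; a tunnel such as ngrok can make it reachable remotely.
    app.run(host="0.0.0.0", port=8000)
```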

Show HN: Torque – A declarative, typesafe DSL for LLM training datasets (MIT)

Torque is a declarative, typesafe DSL for generating complex synthetic datasets for LLMs. It enables developers to compose conversation schemas like components, using AI to create realistic variations for fine-tuning and testing. The library is provider-agnostic, leverages Zod for robust type safety in tool definitions, and supports dynamic conversation flows, concurrent generation, and cost-saving optimizations.

Show HN: AI agents running on 2011 Raspberry Pi with pure PHP – no GPU

Datapizza-AI PHP is a zero-dependency, educational AI framework written in pure PHP, designed to run on minimal hardware like the original Raspberry Pi. It demystifies core LLM concepts by implementing a full RAG pipeline, agents, and a JSON-based vector store entirely from scratch. The project prioritizes transparency over performance, with all logic, including cosine similarity, written in vanilla PHP to provide a hands-on tool for local AI experiments.
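
The retrieval step at the heart of such a from-scratch RAG pipeline is small enough to show in full. The sketch below uses Python for brevity, while the project itself does the same thing in vanilla PHP; the JSON record layout is an assumption.

```python
import json
import math

def cosine(a, b):
    """Plain cosine similarity, the scalar loop the project writes by hand."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store_path, top_k=3):
    """Brute-force nearest neighbours over a JSON vector store shaped like
    [{"text": "...", "embedding": [0.1, ...]}, ...]."""
    with open(store_path) as f:
        records = json.load(f)
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["embedding"]),
                    reverse=True)
    return [r["text"] for r in ranked[:top_k]]
```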

Cognotik: A New FOSS AI Coding Assistant for JetBrains IDEs

Cognotik is an open-source, AI-powered development platform designed to streamline software workflows through intelligent planning and code generation. It features a modular architecture with a core engine, a planning framework, and clients including a desktop app and an IntelliJ plugin. The platform operates on a BYOK model, providing a unified, type-safe API to connect with a wide range of LLM providers, from OpenAI and Anthropic to local models via Ollama.
