Wednesday — January 21, 2026
Gemini 3 Flash dominates game-theory benchmarks through institutional deception, repeating prompts twice improves LLM accuracy, and Open Coscientist automates scientific hypothesis generation.
Interested in AI engineering? Let's talk
News
Which AI Lies Best? A game theory classic designed by John Nash
The "So Long Sucker" benchmark utilizes a game-theoretic framework to evaluate LLM capabilities in deception, negotiation, and strategic betrayal. Results reveal a "complexity reversal" where reactive models like GPT-OSS 120B succeed in simple games, but Gemini 3 Flash dominates high-complexity scenarios with a 90% win rate through sophisticated "institutional deception" and gaslighting. Analysis of internal reasoning logs confirms that top-performing models exhibit adaptive honesty, intentionally contradicting their private thoughts in public messages to exploit weaker opponents while maintaining cooperation with peers.
Majority of CEOs report zero payoff from AI splurge
A PwC survey of over 4,500 CEOs reveals that 56% have seen no revenue growth or cost savings from AI investments, with only 12% reporting both. While adoption remains concentrated in tactical areas like support and demand generation, PwC argues that tangible ROI requires enterprise-wide integration and robust technical foundations rather than isolated pilots. These findings align with MIT research showing only 5% of enterprises have successfully scaled AI, highlighting a significant gap between massive infrastructure spending and measurable business outcomes.
Electricity use of AI coding agents
AI coding agents like Claude Code consume significantly more energy than the ~0.3 Wh "median query" due to iterative tool-calling loops and high context overhead from system prompts and MCP tools. Analysis of session logs indicates a median agent session uses ~41 Wh, with power users reaching ~1,300 Wh daily—roughly 4,400 times the energy of a standard prompt. This consumption level is comparable to running a dishwasher or an extra refrigerator, moving LLM environmental impact from a negligible rounding error to a significant personal footprint.
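The arithmetic behind those comparisons is easy to verify (all values taken from the article; the appliance comparison is approximate):

```python
# Back-of-the-envelope check of the figures above.
MEDIAN_QUERY_WH = 0.3         # a single "median" chat query
MEDIAN_SESSION_WH = 41        # one agent session (iterative tool-calling loop)
POWER_USER_WH_PER_DAY = 1300  # heavy daily agent use

print(f"Sessions per day at that rate: ~{POWER_USER_WH_PER_DAY / MEDIAN_SESSION_WH:.0f}")
print(f"Power user vs. single query: ~{POWER_USER_WH_PER_DAY / MEDIAN_QUERY_WH:,.0f}x")
# ~4,333x, i.e. the "roughly 4,400 times" quoted above; ~1.3 kWh/day is in the
# range of a dishwasher cycle or an extra refrigerator.
```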
TopicRadar – Track trending topics across HN, GitHub, ArXiv, and more
TopicRadar is an AI research assistant and Apify Actor that aggregates, ranks, and deduplicates content from technical sources including arXiv, GitHub, Hacker News, and Papers with Code. It features curated trending modes for AI and LLMs alongside custom topic tracking, providing configurable outputs in JSON or Markdown. The tool enables automated trend monitoring and academic discovery through parallelized multi-source fetching and flexible ranking strategies based on relevance, engagement, or recency.
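Because it ships as an Apify Actor, it should be callable with the standard apify-client library; the actor ID and input fields below are illustrative guesses, not TopicRadar's documented schema:

```python
# Hedged sketch of invoking an Apify Actor from Python.
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")
run = client.actor("<username>/topicradar").call(run_input={
    "mode": "trending-ai",        # hypothetical curated AI/LLM trending mode
    "topics": ["agents", "rag"],  # hypothetical custom topic tracking
    "outputFormat": "markdown",   # hypothetical choice of JSON or Markdown
})
# Results land in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```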
Curl removes bug bounties because of AI slop
curl is ending its bug bounty program to stem a surge of low-quality, AI-generated vulnerability reports, often referred to as "AI slop." Maintainer Daniel Stenberg cited the unsustainable manual effort required to debunk these nonsense reports as the primary driver for the decision. While valid AI-assisted reports exist, the project aims to remove the financial incentive for "garbage" submissions while relying on professional reputation to attract high-quality security research.
Research
Repeating your prompt twice before sending it to an LLM improves accuracy
Repeating the input prompt improves performance for major LLMs like Gemini, GPT, Claude, and DeepSeek on non-reasoning tasks. Because the duplicate text sits in the prompt rather than the completion, the technique improves outputs without increasing response latency or the number of generated tokens, though input-token usage roughly doubles.
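A minimal sketch of the trick with the OpenAI SDK (the model name is illustrative; any chat-completion API works the same way):

```python
from openai import OpenAI

client = OpenAI()
question = "Which is heavier: a kilogram of steel or a kilogram of feathers?"

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    # Send the prompt twice in the same user turn. The duplicate lives in the
    # prefill, so generated tokens and latency stay roughly unchanged, though
    # input tokens double.
    messages=[{"role": "user", "content": f"{question}\n\n{question}"}],
)
print(resp.choices[0].message.content)
```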
When "likers'' go private: Engagement with reputationally risky content on X
A Difference-in-Differences analysis and survey experiment investigating X's shift to private likes found no significant platform-level increase in engagement for high-reputational-risk content. Despite a modest self-reported increase in willingness to like controversial posts, actual behavior remained stable, suggesting that hiding engagement signals has limited impact on overall user dynamics. The findings highlight a potential gap between user intention and action or the disproportionate influence of automated and high-usage accounts on platform metrics.
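For readers unfamiliar with the design, here is a toy Difference-in-Differences estimate on synthetic data (not the paper's); the interaction coefficient is the estimated effect of the private-likes change:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "risky": np.repeat([0, 1], 400),             # high-risk vs. control posts
    "post": np.tile(np.repeat([0, 1], 200), 2),  # before vs. after private likes
})
# Synthetic engagement with no true interaction effect, echoing the null result.
df["likes"] = 10 + 2 * df["risky"] + 1 * df["post"] + rng.normal(0, 2, len(df))

model = smf.ols("likes ~ risky * post", data=df).fit()
# The DiD estimate and its p-value; here it should be near zero.
print(model.params["risky:post"], model.pvalues["risky:post"])
```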
Provably unmasking malicious behavior through execution traces
CTVP is a novel AI control framework that detects backdoors in LLM-generated code through semantic orbit analysis of predicted execution traces. By checking that a model's predicted traces remain consistent across semantically equivalent program transformations, it identifies behavioral anomalies without direct execution. The protocol introduces the Adversarial Robustness Quotient (ARQ) and establishes information-theoretic bounds, using space-complexity constraints to prevent adversaries from gaming the protocol.
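The orbit-consistency idea can be made concrete: apply semantics-preserving transformations, have the model predict each variant's execution trace, and flag divergence. A minimal sketch, where the rename is deliberately naive and predict_trace stands in for a hypothetical LLM call (assumed to return a name-normalized trace, so honest code yields identical predictions across the orbit):

```python
import ast
import builtins

BUILTINS = set(dir(builtins))

def rename_variables(src: str, suffix: str) -> str:
    """Naive semantics-preserving transform: rename user-defined names.

    Assumes no keyword-argument call sites and no shadowed builtins.
    """
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in BUILTINS:
            node.id = f"{node.id}_{suffix}"
        elif isinstance(node, ast.arg):
            node.arg = f"{node.arg}_{suffix}"
    return ast.unparse(tree)

def traces_consistent(src: str, test_input: str, predict_trace) -> bool:
    """Predicted traces should agree across the orbit; divergence is a red flag.

    A backdoor keyed to surface features (names, formatting) tends to produce
    inconsistent predictions on the transformed variants.
    """
    orbit = [src] + [rename_variables(src, s) for s in ("a", "b", "c")]
    traces = {predict_trace(variant, test_input) for variant in orbit}
    return len(traces) == 1
```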
Agentic LLMs as Powerful Deanonymizers (Li, 2026)
Anthropic's recently released qualitative interview dataset is vulnerable to re-identification attacks using off-the-shelf LLM agents with web search capabilities. By cross-referencing interview details with public records, researchers successfully identified 25% of a scientist subset, demonstrating that agentic workflows lower the technical barrier for deanonymization. The attack bypasses safeguards by decomposing the process into benign natural-language prompts, highlighting new privacy risks for rich qualitative data.
WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild
WildCAT3D is a framework for scene-level novel view synthesis (NVS) that leverages diverse "in the wild" 2D image data by explicitly modeling global appearance conditions. By extending multi-view diffusion to handle varying illumination and occlusions, it achieves state-of-the-art results in single-view NVS for both objects and scenes with improved data efficiency. The model generalizes to unseen scenes at inference time and enables explicit global appearance control during generation.
Code
Mastra 1.0, open-source JavaScript agent framework from the Gatsby devs
Mastra is a TypeScript framework for building and scaling AI agents and applications with a unified interface for model routing across 40+ providers. It features graph-based workflow orchestration, advanced context management via RAG and semantic memory, and human-in-the-loop capabilities. The platform also includes built-in evals, observability tools, and support for authoring MCP servers to facilitate production-ready AI deployments.
Fence – Sandbox CLI commands with network/filesystem restrictions
Fence is a sandboxing tool that wraps CLI commands, blocking network access by default and restricting filesystem operations based on configurable rules. It functions as a defense-in-depth mechanism and permission manager for AI coding agents and other CLI agents, enabling secure execution of semi-trusted code by controlling potential side effects.
Open Coscientist – modular implementation of DeepMind's AI Co-scientist
Open Coscientist is an open adaptation of Google Research's AI Co-Scientist, focused on AI-powered research hypothesis generation. It leverages a multi-agent architecture, orchestrating 8-10 specialized AI agents through a LangGraph workflow to generate, review, rank, and iteratively evolve novel hypotheses. The system is LLM-agnostic via LiteLLM, supports optional literature review integration through an MCP server, and incorporates features like Elo-based tournaments and intelligent caching for refined hypothesis generation.
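The Elo-based tournament is the standard chess rating scheme applied to hypotheses: a judge compares two at a time and both ratings update after each match. A minimal sketch, with judge standing in for whatever pairwise-comparison agent the system actually uses:

```python
import itertools
import random

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

def tournament(hypotheses: list[str], judge) -> dict[str, float]:
    """judge(a, b) -> bool is a hypothetical stand-in for the ranking agent."""
    ratings = {h: 1200.0 for h in hypotheses}
    pairs = list(itertools.combinations(hypotheses, 2))
    random.shuffle(pairs)
    for a, b in pairs:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b))
    return ratings  # top-rated hypotheses feed the next evolution round
```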
SWE-gen: Scaling SWE-bench task generation
SWE-gen is an automated pipeline that converts merged GitHub PRs into Harbor tasks by reversing bug fixes to recreate buggy states for LLM benchmarking. It leverages Claude Code for language-agnostic repository analysis to identify build systems and test frameworks, ensuring tasks are fully containerized and verified. The tool includes features for bulk PR "farming," automated validation via NOP and Oracle agents, and deep quality analysis to confirm task solvability and specification clarity.
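The fix-reversal step can be approximated with plain git: apply the fix commit's diff in reverse while keeping the PR's tests, so those tests fail against the recreated buggy state. A rough sketch under stated assumptions (a non-merge fix commit, tests under tests/), not SWE-gen's actual implementation:

```python
import subprocess

def recreate_buggy_state(repo: str, fix_commit: str, test_dir: str = "tests/"):
    """Reverse a bug fix but keep its tests, yielding a verifiable task state."""
    def git(*args):
        subprocess.run(["git", "-C", repo, *args], check=True)

    git("checkout", fix_commit)
    # Undo the fix in the working tree without committing...
    git("revert", "--no-commit", fix_commit)
    # ...then restore the tests the PR added so they can expose the bug.
    git("checkout", fix_commit, "--", test_dir)
```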
Sandvault: Run AI agents isolated in a sandboxed macOS user account
Sandvault is a lightweight macOS sandboxing tool that creates isolated user accounts to safely run AI agents like Claude Code, OpenAI Codex, and Google Gemini. It provides a native alternative to VMs or Docker by leveraging Unix-style user isolation to restrict access to the primary home directory while allowing agents to run with high-risk permissions. The environment comes pre-configured with Node.js, Python, and uv, featuring a shared workspace for seamless context switching without virtualization overhead.
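The underlying primitive is ordinary Unix user separation, which macOS exposes through sysadminctl and sudo. A rough sketch of doing it by hand (account name, password handling, and agent invocation are illustrative; Sandvault automates and pre-configures all of this):

```python
import subprocess

AGENT_USER = "agentbox"  # hypothetical dedicated account name

# One-time setup (requires admin rights): create the isolated user.
subprocess.run(
    ["sudo", "sysadminctl", "-addUser", AGENT_USER, "-password", "change-me"],
    check=True,
)

# Run the coding agent as that user: standard Unix permissions keep it out of
# your primary home directory while it works in its own home directory.
subprocess.run(["sudo", "-u", AGENT_USER, "-i", "claude"], check=True)
```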