Saturday, March 28, 2026

Symbolica’s Agentica SDK hits 36.08% on ARC-AGI-3, prompt repetition improves non-reasoning LLM performance, and a new benchmark ranks GPT-5.4 and Claude Opus 4.6 as the most persuasive models.


News

AI got the blame for the Iran school bombing. The truth is more worrying

The 2026 Iranian school strike highlights an "AI psychosis": public debate focused on LLM hallucinations and alignment, while the actual failure occurred within Project Maven’s computer vision and sensor fusion infrastructure. By compressing the kill chain to 1,000 targets per hour, the Palantir-built system removed the human discretion needed to catch stale database entries, such as the school's misclassification. This gap between LLM-centric concerns and the realities of high-tempo algorithmic warfare shows how "charismatic" technologies can obscure the lethal consequences of automated bureaucracy and the deliberate removal of human judgment.

I am leaving the AI party after one drink

After experimenting with Claude Code for project bootstrapping and UI development, the author found it effective for repetitive tasks but lacking in CSS precision and architectural flexibility. Despite the initial speed, concerns regarding redundant code, AI dependency, and the erosion of developer craftsmanship led to the decision to uninstall the tool. The author ultimately prioritizes manual trial-and-error and human-centric learning over AI-driven productivity and its environmental impact.

Day 1 of ARC-AGI-3

Symbolica's Agentica SDK achieved a 36.08% unverified score on the ARC-AGI-3 benchmark, passing 113 out of 182 levels and completing 7 of 25 games. This performance significantly outperforms CoT baselines from models like Opus 4.6 and GPT-5.4, which scored below 0.4%. Additionally, Agentica demonstrated superior cost efficiency, reaching its 36.08% score for $1,005 compared to the $8,900 required for Opus 4.6 to achieve a 0.25% score.
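
The cost gap is easier to appreciate as cost per percentage point, computed directly from the figures above:

```python
# Figures from the reported Day 1 results (dollars, percent).
agentica_cost, agentica_score = 1005.0, 36.08
opus_cost, opus_score = 8900.0, 0.25

agentica_per_point = agentica_cost / agentica_score  # ~ $28 per point
opus_per_point = opus_cost / opus_score              # $35,600 per point
ratio = opus_per_point / agentica_per_point          # roughly three orders of magnitude
print(f"Agentica: ${agentica_per_point:.0f}/pt, Opus 4.6: ${opus_per_point:.0f}/pt")
```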

Some uncomfortable truths about AI coding agents

AI coding agents pose critical risks to production software development, including developer skill atrophy and the erosion of code quality through complacent oversight. The current economic model for LLMs is likely an unsustainable bubble, while fundamental vulnerabilities to prompt injection and "promptware" create severe security vectors. Additionally, since AI-generated code is currently ineligible for copyright protection in the US, relying on agents introduces significant intellectual property and competitive risks.

Why are executives enamored with AI, but ICs aren't?

The divide in AI adoption stems from differing professional environments: executives manage non-deterministic human systems and view LLMs as relatively predictable agents with well-defined failure modes. ICs are evaluated on deterministic precision, making AI's stochastic nature a liability that increases overhead and shifts their role from execution to oversight. Ultimately, AI aligns with executive system-level management but conflicts with the IC's focus on correctness and domain expertise.

Research

Why AI may increase competition but not success rates

The paper formalizes the "Builder Saturation Effect," arguing that while AI reduces production costs, finite human attention leads to diluted returns per producer. By modeling quality heterogeneity and reinforcement dynamics, the authors predict that democratized AI production will result in power-law distributions and winner-take-most outcomes rather than widespread entrepreneurial success.
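
A toy simulation (not the paper's actual model; all parameters are invented) shows how finite attention plus reinforcement produces the predicted skew: each unit of attention goes to a builder with probability proportional to quality times accumulated attention, and a heavy-tailed distribution emerges.

```python
import random

def simulate_attention(n_builders=200, n_events=5000, alpha=1.0, seed=0):
    """Toy rich-get-richer model: each attention unit goes to builder i with
    probability proportional to quality_i * (1 + attention_i)**alpha, so more
    producers split a fixed attention budget and reinforcement skews it."""
    rng = random.Random(seed)
    quality = [rng.random() for _ in range(n_builders)]
    attention = [0] * n_builders
    for _ in range(n_events):
        weights = [q * (1 + a) ** alpha for q, a in zip(quality, attention)]
        r = rng.random() * sum(weights)
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                attention[i] += 1
                break
    return sorted(attention, reverse=True)

shares = simulate_attention()  # top decile captures far more than its 10% share
```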

Security-by-Design for LLM-Based Code Generation

CodeLLMs often generate insecure code despite possessing internal representations that distinguish between security subconcepts during inference. This work introduces SCS-Code, a lightweight steering mechanism that guides a model's internal states toward secure and functional outputs during token generation. By leveraging these internal security insights, SCS-Code outperforms existing black-box methods across multiple secure coding benchmarks.
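
SCS-Code's exact mechanism isn't detailed here; the sketch below shows generic difference-of-means activation steering, the family of technique such methods build on (function names and the strength value are illustrative assumptions).

```python
def steering_vector(secure_acts, insecure_acts):
    """Difference-of-means steering direction, unit-normalized: mean hidden
    state on secure examples minus mean hidden state on insecure ones."""
    n = len(secure_acts[0])
    mean_s = [sum(a[j] for a in secure_acts) / len(secure_acts) for j in range(n)]
    mean_i = [sum(a[j] for a in insecure_acts) / len(insecure_acts) for j in range(n)]
    d = [s - i for s, i in zip(mean_s, mean_i)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

def steer(hidden, direction, strength=4.0):
    """Shift a hidden state along the 'secure' direction during generation."""
    return [h + strength * x for h, x in zip(hidden, direction)]
```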

Prompt Repetition Improves Non-Reasoning LLMs

Repeating the input prompt improves the performance of non-reasoning LLMs across major model families, including Gemini, GPT, Claude, and DeepSeek. Because the repetition is confined to the input, the optimization improves output quality without increasing generation latency or the number of generated tokens.
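
The technique itself is trivial to apply; a minimal sketch (the separator and repetition count are assumptions, not the paper's prescribed values):

```python
def repeat_prompt(prompt: str, n: int = 2, sep: str = "\n\n") -> str:
    """Return the prompt repeated n times. The copies go into the model
    *input*, so the generated output and its token count are unchanged."""
    return sep.join([prompt] * n)

# The repeated string is then sent as the user message of an ordinary chat call.
user_message = repeat_prompt("List three prime numbers greater than 100.")
```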

Reasoning-Based Personalized Generation for Users with Sparse Data

GraSPer is a framework designed to improve LLM personalization in sparse context scenarios by augmenting limited user histories with predicted future interactions. It utilizes reasoning alignment to generate synthetic texts for these interactions, enriching the context used for final output generation. Experiments demonstrate that conditioning on this combined real and synthetic history significantly enhances performance in cold-start and sparse data settings.
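
The augmentation step can be pictured as follows (function names and provenance tagging are assumptions for illustration, not GraSPer's API):

```python
def augment_history(history, predict_future, k=3):
    """Extend a sparse real user history with k synthetic future interactions.
    predict_future stands in for the reasoning-aligned generator: any callable
    mapping (history, k) -> list of predicted interaction texts."""
    synthetic = predict_future(history, k)
    # Tag provenance so downstream generation can weight real vs. synthetic.
    return [("real", h) for h in history] + [("synthetic", s) for s in synthetic]
```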

The illusion of illusions: There are no optical corrections in the Parthenon

Scientific analysis refutes the historical claim that the Parthenon’s curvatures were designed as optical corrections to ensure perceived straightness. The study demonstrates that the purported illusions are either non-existent or imperceptible, debunking a long-standing architectural myth lacking empirical evidence.

Code

LLM Persuasion Benchmark: Multi-Turn Persuasion Between Models

The LLM Persuasion Benchmark evaluates the ability of models to shift an opponent's position across 8-turn dialogues using triple-probed hidden evaluations on a seven-point scale. GPT-5.4 and Claude Opus 4.6 emerge as the most effective persuaders, while Grok 4.20 (Reasoning) is the most resistant target. The results highlight a divergence between offensive persuasion and defensive susceptibility, emphasizing that high-reasoning models succeed by identifying specific conversational hinge points rather than through mere linguistic fluency.
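
The scoring reduces to a pre/post comparison of averaged Likert probes; a sketch (the three probes and seven-point scale follow the summary, everything else is assumed):

```python
def mean_stance(probe_scores):
    """Average of repeated hidden probes of the target's position, 1-7 scale."""
    assert all(1 <= s <= 7 for s in probe_scores)
    return sum(probe_scores) / len(probe_scores)

def persuasion_shift(pre_probes, post_probes):
    """Positive when the 8-turn dialogue moved the target toward the persuader."""
    return mean_stance(post_probes) - mean_stance(pre_probes)
```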

LLM-Gateway – Zero-Trust LLM Gateway

llm-gateway is an OpenAI-compatible API proxy that routes requests to OpenAI, Anthropic, and local backends like vLLM or Ollama. It features zero-trust networking via zrok for secure connectivity to inference servers behind NAT and a three-layer semantic routing cascade (heuristics, embeddings, and LLM classification) for automated model selection. The gateway also provides weighted load balancing with passive failover, Prometheus metrics, and virtual API key management in a single Go binary.
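
The gateway is a Go binary; purely to illustrate the cascade's control flow, here is a Python sketch of heuristics → embeddings → LLM classification (thresholds, rule shapes, and function names are assumptions, not llm-gateway's code):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route(prompt, heuristic_rules, embed, centroids, llm_classify, threshold=0.8):
    """Three-layer cascade: cheapest check first, most expensive last."""
    # Layer 1: keyword heuristics handle the obvious cases for free.
    for keyword, model in heuristic_rules:
        if keyword in prompt.lower():
            return model
    # Layer 2: embed the prompt and compare against per-model centroids.
    v = embed(prompt)
    best_model, best_sim = max(
        ((m, cosine(v, c)) for m, c in centroids.items()), key=lambda t: t[1]
    )
    if best_sim >= threshold:
        return best_model
    # Layer 3: fall back to an LLM classifier for ambiguous prompts.
    return llm_classify(prompt)
```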

Agent Forge, an agent framework with heuristic routing and graph execution

@minns/agent-forge is a framework for building autonomous agents featuring an adaptive execution engine that routes tasks between a fast agentic loop and a complex graph pipeline. It provides graph-native memory, temporal tables via MinnsQL, and multi-agent coordination through shared knowledge graphs. The toolkit includes composable middleware and advanced reasoning engines like MCTS, reflexion, and world model simulations for high-complexity LLM workflows.

Toolcast – Turn any API into an AI agent tool with one command

toolcast is a CLI tool that instantly converts OpenAPI specifications into functional MCP servers. It automates the generation of tool definitions, typed input schemas, and authentication logic, allowing AI agents to interact with any API. The utility supports a registry of popular services and integrates directly with MCP-compatible clients like Claude Code and Cursor.
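
The core transformation — one OpenAPI operation into one tool definition with a typed input schema — might look like this (field choices are illustrative; toolcast's actual output format isn't documented here):

```python
def openapi_op_to_tool(path, method, op):
    """Map a single OpenAPI operation object to a minimal tool definition."""
    props, required = {}, []
    for p in op.get("parameters", []):
        props[p["name"]] = {
            "type": p.get("schema", {}).get("type", "string"),
            "description": p.get("description", ""),
        }
        if p.get("required"):
            required.append(p["name"])
    return {
        # Fall back to a name derived from the route when operationId is absent.
        "name": op.get("operationId") or f"{method}_{path.strip('/').replace('/', '_')}",
        "description": op.get("summary", ""),
        "inputSchema": {"type": "object", "properties": props, "required": required},
    }
```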

Tool to reduce AI tokens and generate more detailed and accurate PRDs

Corbell is a local-first tool that constructs a multi-repo architecture graph from source code and documentation to streamline backend engineering workflows. It leverages LLMs and embedding similarity to automate spec generation, service discovery, and architecture reviews while ensuring consistency with established design patterns. Key features include an interactive graph UI, MCP support for IDE integration, and a roadmap toward a fully agentic architecture for autonomous system analysis.
