Tuesday — June 18, 2024
Amazon's AI cameras gauge train passengers' emotions, DeepSeek-Coder-V2 outshines GPT-4-Turbo in coding tasks, and a new framework lifts LLM confidence estimation by over 10% AUROC.
News
Getting 50% (SoTA) on ARC-AGI with GPT-4o
For those who aren't in the loop, the ARC Prize is a $1,000,000+ public competition to beat and open source a solution to the ARC-AGI benchmark. Recently, however, it's become possible to achieve 50% accuracy on the ARC-AGI public test set with GPT-4o using only basic feature engineering and prompting… The previous SOTA was 34%. Check out the article for some great illustrations poking fun at neuro-symbolic AI.
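The core trick is program synthesis by prompting: show GPT-4o the training grids, ask it to write a Python transformation, and keep candidates that reproduce every training pair. Below is a heavily simplified sketch of that idea; the prompt format, the `transform` function name, and the single-sample call are illustrative assumptions, and the real pipeline samples many candidate programs per puzzle and filters them against the examples.

```python
# Heavily simplified, hypothetical sketch of prompting GPT-4o to synthesize a
# transformation program for an ARC-style puzzle. The real pipeline samples
# many candidate programs and keeps only those that solve every training pair.
import json
from openai import OpenAI  # assumes the official openai client and an API key

client = OpenAI()

train_pairs = [  # toy stand-in for an ARC task's demonstration pairs
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
]
test_input = [[1, 1], [0, 0]]

prompt = (
    "Here are input/output grid pairs from a puzzle:\n"
    f"{json.dumps(train_pairs)}\n"
    "Reply with only a Python function `transform(grid)` that maps inputs to outputs."
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
code = resp.choices[0].message.content.strip()
code = code.removeprefix("```python").removesuffix("```")  # strip fences if present

namespace = {}
exec(code, namespace)  # sketch only: execute untrusted code in a sandbox in practice
print(namespace["transform"](test_input))
```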
What policy makers need to know about AI
The SB 1047 bill in California aims to regulate AI by requiring that large models be confirmed safe before release, while intending to (somehow) simultaneously support open-source development. However, because it fails to differentiate between the "deployment" and "release" of AI models, it could inadvertently block all large open-source model advancements. Deployment covers the systems built on top of models, such as APIs and applications, where potential harm is actually realized; release is simply the publication of model weights and code.
Amazon-powered AI cameras used to detect emotions of unwitting train passengers
Network Rail (in the UK) has been testing AI systems for trespass detection and passenger flow management at multiple sites, including Leeds station with its 350 CCTV cameras; trials also used Amazon-powered image analysis to gauge passengers' emotions and demographics. Gregory Butler from Purple Transform highlighted AI's role in swiftly identifying trespassing incidents and aiding human operators. Privacy experts, however, point to a lack of public consultation and the risk of increased surveillance impinging on personal freedoms. Similar AI surveillance implementations are planned for the Paris Olympic Games.
Research
Large Language Model Confidence Estimation via Black-Box Access
This paper addresses the challenge of estimating confidence in LLMs using only black-box access, a fairly common scenario nowadays. The proposed framework engineers features from the model's prompts and responses and feeds them to a logistic regression that predicts whether an answer is correct. It shows significant improvements, outperforming current methods by over 10% in AUROC on datasets like TriviaQA, SQuAD, CoQA, and Natural Questions. Because the estimator is interpretable, the authors can also identify which features matter, and the learned model transfers zero-shot to other LLMs on the same dataset.
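To make the recipe concrete, here is a minimal sketch of the black-box setup: hand-engineered features fed to scikit-learn's logistic regression. The agreement/length/hedging features and the toy data below are illustrative placeholders, not the paper's feature set.

```python
# Sketch: black-box confidence estimation via engineered features + logistic
# regression, in the spirit of the paper (feature set here is illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression


def features_for(sample_answers, answer):
    """Cheap black-box features: agreement among resampled answers,
    answer length, and a crude hedging flag (all illustrative)."""
    agreement = float(np.mean([a.strip().lower() == answer.strip().lower()
                               for a in sample_answers]))
    hedging = float(any(w in answer.lower() for w in ("maybe", "possibly", "not sure")))
    return [agreement, len(answer.split()), hedging]


# Toy training data: (resampled answers, final answer, was the answer correct?)
records = [
    (["Paris", "Paris", "Paris"], "Paris", 1),
    (["Lyon", "Paris", "Marseille"], "Maybe Lyon", 0),
    (["1912", "1912", "1910"], "1912", 1),
    (["42", "7", "13"], "possibly 7", 0),
]

X = np.array([features_for(s, a) for s, a, _ in records])
y = np.array([label for _, _, label in records])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # estimated confidence that each answer is correct
```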
A Robot Walks into a Bar: Can LLMs Serve as Creativity Support Tools for Comedy?
At the Edinburgh Festival Fringe and online, twenty professional comedians took part in three-hour "AI x Comedy" workshops using LLMs for comedy writing. The study assessed AI as a creativity support tool via the Creativity Support Index questionnaire and focus groups exploring motivations for using AI and ethical concerns like bias, censorship, and copyright. Comedians noted that AI moderation strategies reinforced mainstream views by erasing minority perspectives. Not good. They also criticized LLMs for generating uncreative and biased comedy, or "cruise ship" material.
An Image Is Worth 32 Tokens for Reconstruction and Generation
TiTok (not TikTok), a new Transformer-based 1D tokenizer, optimizes image tokenization by converting images into compact 1D latent sequences. By significantly reducing token count compared to traditional 2D methods like VQGAN, TiTok achieves efficient and effective representations: it compresses a 256x256x3 image to just 32 tokens and reaches a gFID of 1.97 on ImageNet 256x256, outperforming the MaskGIT baseline. At higher resolution (ImageNet 512x512), TiTok beats DiT-XL/2 with a gFID of 2.74 while using 64x fewer tokens and generating images 410x faster; its best variant scores gFID 2.13 while still generating quality samples 74x faster than DiT-XL/2.
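For intuition, here is a rough PyTorch sketch of the 1D-tokenization idea: a transformer encoder sees image patches plus a small set of learnable latent tokens, and only those latent tokens are kept and vector-quantized into discrete codes. Layer counts, dimensions, and the codebook size are placeholders; this is not the paper's implementation.

```python
# Rough, illustrative sketch of a TiTok-style 1D tokenizer (not the paper's code).
import torch
import torch.nn as nn


class OneDTokenizer(nn.Module):
    def __init__(self, image_size=256, patch=16, dim=256, n_latent=32, codebook=1024):
        super().__init__()
        n_patches = (image_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.latent = nn.Parameter(torch.randn(1, n_latent, dim) * 0.02)
        self.pos = nn.Parameter(torch.randn(1, n_patches + n_latent, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.codebook = nn.Embedding(codebook, dim)
        self.n_latent = n_latent

    def forward(self, images):  # images: (B, 3, H, W)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)     # (B, N, D)
        latent = self.latent.expand(images.size(0), -1, -1)
        x = torch.cat([patches, latent], dim=1) + self.pos
        x = self.encoder(x)[:, -self.n_latent:]                           # keep latent tokens only
        # Nearest-neighbour vector quantization against the codebook.
        x_flat = x.reshape(-1, x.size(-1))                                # (B*32, D)
        dists = (x_flat.pow(2).sum(1, keepdim=True)
                 - 2 * x_flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))                    # (B*32, K)
        return dists.argmin(dim=-1).view(x.size(0), self.n_latent)        # (B, 32) token ids


tokens = OneDTokenizer()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 32]): each image becomes 32 discrete tokens
```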
Code
Token price calculator for 400+ LLMs
TokenCost is a tool for estimating the USD cost of using LLM APIs by calculating the cost of prompts and completions client-side. It offers accurate token counting for LLM interactions, maintains a current list of prices from major LLM providers, and integrates easily into workflows with simple function calls.
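A minimal usage sketch, assuming the `calculate_prompt_cost` and `calculate_completion_cost` helpers described in the project's README (check the repo for the current API and supported model names):

```python
# Sketch of estimating USD cost client-side with tokencost; function names
# follow the project's README and may change between versions.
from tokencost import calculate_prompt_cost, calculate_completion_cost

model = "gpt-4o"
prompt = [{"role": "user", "content": "Summarize attention in one sentence."}]
completion = "Attention lets each token weight every other token when building its representation."

prompt_cost = calculate_prompt_cost(prompt, model)
completion_cost = calculate_completion_cost(completion, model)
print(f"Estimated cost: ${prompt_cost + completion_cost}")
```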
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code
DeepSeek-Coder-V2 is an advanced MoE code language model designed for code-specific tasks. It achieves high performance in coding and mathematical reasoning, outperforming many closed-source models like GPT-4-Turbo and Claude 3 Opus. It supports 338 programming languages and extends context length up to 128K tokens. The model is available with 16B and 236B parameters and offers functionalities like code generation, code completion, code fixing, and chat completion.
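If you want to try the open weights, a quick-start with Hugging Face transformers might look like the sketch below; the repo id, dtype, and generation settings are assumptions to verify against the model card.

```python
# Hypothetical quick-start for the 16B "Lite" instruct variant via transformers;
# the repo id and chat-template usage are assumptions based on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"  # assumed HF repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```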
LLM instruction finetuning from-scratch tutorial
This repository provides code for building, pretraining, and finetuning a GPT-like LLM from scratch. It covers implementing attention mechanisms, pretraining on unlabeled data, and performing finetuning for tasks like text classification and instruction following. Bonus materials include details on efficient multi-head attention implementations, pretraining optimizations, and experimenting with different models. This is a fantastic resource to get a first-principles understanding of finetuning.
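As a taste of the building blocks covered, here is a compact causal self-attention module in PyTorch. It is a generic sketch of the standard mechanism, not the repository's exact code.

```python
# Generic causal self-attention in PyTorch, the kind of block the tutorial
# builds from scratch; a sketch, not the repo's exact implementation.
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_len: int = 1024):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Upper-triangular mask: True where a position would attend to the future.
        mask = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):                                # x: (batch, seq, d_model)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5
        scores = scores.masked_fill(self.mask[:T, :T], float("-inf"))  # no peeking ahead
        out = torch.softmax(scores, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, T, D))


attn = CausalSelfAttention(d_model=64, n_heads=4)
print(attn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```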