Connect Bench
Project Overview
Connect Bench is a benchmarking tool designed to evaluate Large Language Models (LLMs) on the New York Times Connections game. The system tests how well different models can group 16 words into 4 categories of 4 based on shared themes, mimicking expert play.
The pipeline loads daily puzzles, prompts each model through the OpenRouter API, parses the responses, and scores performance against the ground-truth groupings. The results reveal each model's capabilities in semantic understanding, pattern recognition, and reasoning under uncertainty.
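The parse-and-score step of the pipeline could be sketched as below. The one-group-per-line response format and the exact-match scoring rule are assumptions for illustration; the README does not specify how Connect Bench formats prompts or awards points.

```python
def parse_groups(response: str) -> list[set[str]]:
    """Parse a model response into groups of words.

    Assumes the model was asked to answer with one comma-separated
    group per line, e.g. "apple, pear, plum, fig" -- this format is
    an assumption, not necessarily the convention Connect Bench uses.
    """
    groups = []
    for line in response.strip().splitlines():
        words = {w.strip().upper() for w in line.split(",") if w.strip()}
        if words:
            groups.append(words)
    return groups


def score(predicted: list[set[str]], truth: list[set[str]]) -> int:
    """Count predicted groups that exactly match a ground-truth group.

    Group order is ignored; each ground-truth group can be matched
    at most once. Returns 0-4 for a standard Connections puzzle.
    """
    remaining = [set(g) for g in truth]
    correct = 0
    for group in predicted:
        if group in remaining:
            remaining.remove(group)
            correct += 1
    return correct
```

For example, a response of `"apple, pear, plum, fig\nred, blue, green, yellow"` scored against a ground truth containing `{"APPLE", "PEAR", "PLUM", "FIG"}` would earn one point for that group.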
Performance Visualization
Technologies
Future Development
Next steps include expanding the dataset, adding richer metrics (e.g., error types, near-miss partial credit), and shipping a web dashboard for interactive drill-downs.