Connect Bench

AI Benchmarking Tool - Summer 2025

Project Overview

Connect Bench is a benchmarking tool designed to evaluate Large Language Models (LLMs) on the New York Times Connections game. The system tests how well different models can group 16 words into 4 categories based on shared themes, simulating expert gameplay.
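To make the task concrete, here is a minimal sketch of how a daily puzzle and the 16-word board a model sees could be represented. The dataclass fields and the sample groups are illustrative assumptions, not the project's actual schema or real NYT data.

# A minimal sketch of a puzzle representation; field names and the sample
# groups are illustrative only, not the project's actual schema.
from dataclasses import dataclass
import random

@dataclass
class Puzzle:
    date: str                     # puzzle date, e.g. "2025-06-01"
    groups: dict[str, list[str]]  # category theme -> its four member words

example = Puzzle(
    date="2025-06-01",
    groups={
        "___BOARD": ["KEY", "SURF", "CARD", "SPRING"],
        "DANCES": ["TAP", "SWING", "LINE", "BREAK"],
        "BODIES OF WATER": ["BAY", "SOUND", "GULF", "STRAIT"],
        "POKER ACTIONS": ["CALL", "RAISE", "FOLD", "CHECK"],
    },
)

# The 16 shuffled words are what a model is actually shown.
board = [word for members in example.groups.values() for word in members]
random.shuffle(board)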

The pipeline loads daily puzzles, prompts each model through the OpenRouter API, parses the responses, and scores performance against the ground truth groupings. This reveals capabilities in semantic understanding, pattern recognition, and reasoning under uncertainty.
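The sketch below shows roughly what one pipeline step could look like: prompting a model through OpenRouter's chat-completions endpoint with the requests library, then scoring an answer against the ground truth. The prompt wording, the expected JSON answer format, and the exact-match scoring rule are assumptions for illustration, not the project's actual code.

# A hedged sketch of one pipeline step: prompt a model via OpenRouter and
# score its grouping. Prompt text, answer format, and scoring are assumed.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask_model(model: str, words: list[str]) -> str:
    prompt = (
        "Group these 16 words into 4 themed groups of 4. "
        "Answer as JSON: a list of 4 lists of words.\n" + ", ".join(words)
    )
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def score(guess: list[list[str]], truth: list[list[str]]) -> int:
    # One point per guessed group that exactly matches a true group,
    # ignoring word order within the group.
    truth_sets = [set(g) for g in truth]
    return sum(set(g) in truth_sets for g in guess)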

Performance Visualization

Results to date cover 13 models tested on 256 puzzles.

Technologies

Python
Requests
JSON
OpenRouter API
Benchmarking
LLMs
NLP

Future Development

Next steps include expanding the dataset, adding richer metrics (e.g., error types, near-miss partial credit), and shipping a web dashboard for interactive drill-downs.
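As one possible shape for near-miss partial credit, a metric could award each true group its best overlap with any guessed group. The definition below is a sketch of one reasonable formulation, not a committed design.

# A possible near-miss partial-credit metric (illustrative, not final):
# each true group earns its best overlap with any guessed group,
# so a 3-of-4 guess still contributes 0.75 instead of 0.
def partial_credit(guess: list[list[str]], truth: list[list[str]]) -> float:
    guess_sets = [set(g) for g in guess]
    total = 0.0
    for true_group in truth:
        best = max(len(set(true_group) & g) for g in guess_sets)
        total += best / len(true_group)
    return total / len(truth)  # 1.0 means a perfect solve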