Connect Bench
Project Overview
Connect Bench is a benchmarking tool designed to evaluate Large Language Models (LLMs) on the New York Times Connections game. The system tests how well different models can group 16 words into 4 categories of 4 based on shared themes, mimicking expert play.
The pipeline loads daily puzzles, prompts each model through the OpenRouter API, parses the responses, and scores performance against the ground-truth groupings. The results reveal each model's capabilities in semantic understanding, pattern recognition, and reasoning under uncertainty.
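The parse-and-score step of the pipeline could be sketched as below. The one-group-per-line response format and the exact-match scoring rule are assumptions for illustration; the README does not specify how Connect Bench formats prompts or awards points.

```python
def parse_groups(response: str) -> list[set[str]]:
    """Parse a model response into groups of words.

    Assumes the model was asked to answer with one comma-separated
    group per line, e.g. "apple, pear, plum, fig" -- this format is
    an assumption, not necessarily the convention Connect Bench uses.
    """
    groups = []
    for line in response.strip().splitlines():
        words = {w.strip().upper() for w in line.split(",") if w.strip()}
        if words:
            groups.append(words)
    return groups


def score(predicted: list[set[str]], truth: list[set[str]]) -> int:
    """Count predicted groups that exactly match a ground-truth group.

    Group order is ignored; each ground-truth group can be matched
    at most once. Returns 0-4 for a standard Connections puzzle.
    """
    remaining = [set(g) for g in truth]
    correct = 0
    for group in predicted:
        if group in remaining:
            remaining.remove(group)
            correct += 1
    return correct
```

For example, a response of `"apple, pear, plum, fig\nred, blue, green, yellow"` scored against a ground truth containing `{"APPLE", "PEAR", "PLUM", "FIG"}` would earn one point for that group.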
Performance Visualization
Technologies
Future Development
Next steps include expanding the dataset, adding richer metrics (e.g., error types, near-miss partial credit), and shipping a web dashboard for interactive drill-downs.