A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis
Researchers introduce a benchmark that assesses open-ended LLM text generation using n-gram statistics and rules, with no reliance on human or LLM-based judges. Its scores correlate closely with GPT-4o-based evaluations while remaining computationally efficient.
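
To make the idea of judge-free, n-gram-based scoring concrete, here is a minimal sketch in Python. It is not the benchmark's actual metric: the function names (`ngrams`, `reference_counts`, `ngram_coverage`), the whitespace tokenization, and the coverage formula are all assumptions chosen for illustration. It measures the fraction of a candidate's n-grams that also occur in a small reference corpus, and applies one simple rule (an output too short to contain any n-gram scores zero).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def reference_counts(reference_texts, n):
    """Count n-grams over a reference corpus (hypothetical helper).

    Assumption: lowercased whitespace tokenization, for illustration only.
    """
    counts = Counter()
    for text in reference_texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts


def ngram_coverage(candidate, ref_counts, n):
    """Fraction of the candidate's n-grams that appear in the reference counts.

    Rule: a candidate too short to contain any n-gram scores 0.0.
    """
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    hits = sum(1 for gram in cand if gram in ref_counts)
    return hits / len(cand)


# Toy reference corpus and candidates (illustrative data, not from the benchmark).
refs = ["the cat sat on the mat", "the dog sat on the rug"]
bigram_counts = reference_counts(refs, 2)

full = ngram_coverage("the cat sat on the rug", bigram_counts, 2)
half = ngram_coverage("the cat flew", bigram_counts, 2)
none = ngram_coverage("a bird flew away", bigram_counts, 2)
empty = ngram_coverage("", bigram_counts, 2)
```

A score of this kind needs only counting and dictionary lookups, with no model inference at evaluation time, which is one way such a benchmark could stay cheap to run compared with an LLM judge.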