Part 7/8:
While the prospects are promising, the paper acknowledges several limitations. PaperBench's dataset currently contains only 20 papers, which restricts its breadth and generalizability. Contamination from existing codebases and the extensive labor involved in rubric development also pose significant challenges.
The paper also notes that although LLM-based judges can automate assessment, they still lag behind human experts in accuracy. Cost is a further constraint: running extensive evaluations and adequately replicating experimental results requires significant resources.