A benchmark for evaluating LLM agents' capabilities in manipulating complex real-world spreadsheets.
SpreadsheetBench is a challenging spreadsheet manipulation benchmark that (1) contains 912 questions derived exclusively from real-world scenarios, (2) includes spreadsheet files with tabular data in various formats, and (3) features a more reliable evaluation metric akin to those of online judge platforms.
Overall (Pass@1): The overall accuracy across all 912 questions and 2,729 spreadsheets.
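The overall score averages pass/fail outcomes over every test-case spreadsheet, in the style of an online judge that runs each submission against multiple test cases. The sketch below is a minimal illustration of that idea; the `results` shape and function name are hypothetical, not the benchmark's actual grading code.

```python
def overall_pass_at_1(results):
    """Average pass/fail over all test-case spreadsheets.

    results: {question_id: [bool per test-case spreadsheet]} -- a
    hypothetical shape; each question may carry several spreadsheets,
    mirroring OJ-style multi-test grading.
    """
    flat = [ok for cases in results.values() for ok in cases]
    return sum(flat) / len(flat)

# Toy data: q1 passes 2 of 3 spreadsheets, q2 passes its only one.
score = overall_pass_at_1({"q1": [True, True, False], "q2": [True]})
```

Here `score` is 0.75: three of the four spreadsheets pass, regardless of how they are distributed across questions.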
SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets. The benchmark includes 912 real questions gathered from online Excel forums, covering a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements.
Visit the Overview page for benchmark details, research paper, code repository, and dataset downloads.
To submit your results to SpreadsheetBench, please contact zeyaoma@gmail.com and provide an API capable of generating your agent's results.
Compared to previous benchmarks, SpreadsheetBench incorporates complex questions based on real-world scenarios and diverse types of tables in its spreadsheet files. We present four data examples from SpreadsheetBench to illustrate the attributes of real-world problems and the challenging nature of our benchmark.
Example 1: A cell-level manipulation question that involves a non-standard relational table (missing headers on columns D and E).
This example shows a cell-level manipulation task that aims to extract a specific part of a text string from a column of cells. The instruction contains both the user's demand and an example manipulation action, which rarely occur in the synthetic instructions of previous benchmarks. Furthermore, the table within the spreadsheet file is a non-standard relational table that lacks a complete header. The final result must be filled into cells B3 to B14, which ensures the uniqueness of the answer.