1Renmin University of China, 2Tsinghua University, 3Zhipu.AI
Evaluating LLM Agents' Capabilities in Manipulating Complex, Real-World Spreadsheets
SpreadsheetBench is a challenging spreadsheet manipulation benchmark that (1) contains 912 questions derived exclusively from real-world scenarios, (2) includes spreadsheet files with tabular data in various formats, and (3) features a more reliable evaluation metric akin to those of online judge platforms.
Overall (Pass@1): The overall accuracy across all 912 questions and 2,729 spreadsheets.
Compared with previous benchmarks, SpreadsheetBench incorporates complex questions based on real-world scenarios and diverse types of tables in spreadsheet files. We present four data examples from SpreadsheetBench to illustrate the attributes of real-world problems and the challenging nature of our benchmark.
Example 1: A cell-level manipulation question that involves a non-standard relational table (columns D and E have no header).
This example shows a cell-level manipulation task that aims to extract a specific part of a text string from a column of cells. The instruction contains both the user's request and an example of the desired manipulation, a combination that rarely occurs in the synthetic instructions of previous benchmarks. Furthermore, the table in the spreadsheet file is a non-standard relational table that lacks a complete table header. The final result must be filled into cells B3 to B14, which ensures that the answer is unique.
1Renmin University of China, 2Aptura.AI, 3AfterQuery, 4Shortcut.AI
Evaluating LLM Agents' Capabilities in Challenging, Expert-Curated, End-to-End Spreadsheet Tasks
SpreadsheetBench 2 is a benchmark for evaluating agents on end-to-end business spreadsheet workflows. Unlike existing benchmarks that focus on isolated manipulations, SpreadsheetBench 2 requires agents to (1) complete workflow-level goals through coordinated multi-step operations, (2) perform cross-sheet reasoning within complex multi-sheet workbooks, and (3) produce deliverable-level outcomes, including structured models, repaired spreadsheets, and accurate visualizations.
Tasks are organized into three primary categories that mirror the lifecycle of professional spreadsheet usage:
Financial Modeling & Template & Debugging: A task is considered correct only when all required cell modifications are accurate and all unchanged cells remain unmodified. The final answer must exactly match the ground truth answer.
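The exact-match rule above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each workbook has already been flattened into a dictionary mapping a cell reference such as "Sheet1!B3" to its value; the function name check_task and this representation are hypothetical and are not taken from the benchmark's code.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical flattened workbook: "Sheet!Cell" -> value (missing key = empty cell).
Cells = Dict[str, Hashable]


def check_task(original: Cells, output: Cells, truth: Cells) -> Tuple[bool, List[str], List[str]]:
    """Return (is_correct, wrong_edits, side_effects) for one task."""
    all_refs = set(original) | set(output) | set(truth)

    # Cells the task requires to change: ground truth differs from the input file.
    required = {ref for ref in all_refs if original.get(ref) != truth.get(ref)}

    # Required cells whose produced value does not match the ground truth.
    wrong_edits = sorted(ref for ref in required if output.get(ref) != truth.get(ref))

    # Cells that should have stayed untouched but were modified.
    side_effects = sorted(
        ref for ref in all_refs - required if output.get(ref) != original.get(ref)
    )

    return (not wrong_edits and not side_effects), wrong_edits, side_effects
```

Under this sketch, an output that fills every required cell correctly but also overwrites one unrelated cell is still marked incorrect, which matches the all-or-nothing wording of the rule.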
Visualization: Each task has a reference answer with a checklist of assertions. A Vision-Language Model (VLM) evaluates the generated chart image against each assertion, producing a PASS/FAIL judgment for each. The chart evaluation score is computed as: Acc = Passed Assertions / Total Assertions. A chart whose score is above 70% is considered correct.
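As a rough illustration of the chart score, the sketch below aggregates a list of PASS/FAIL judgments. It assumes the threshold of 70 is a percentage and that "above" means strictly greater than; the function names and the string labels are hypothetical, and the VLM call that would produce the judgments is not shown.

```python
from typing import Sequence


def chart_score(judgments: Sequence[str]) -> float:
    """Percentage of checklist assertions judged PASS (Acc expressed on a 0-100 scale)."""
    if not judgments:
        raise ValueError("a task needs at least one assertion")
    passed = sum(1 for verdict in judgments if verdict == "PASS")
    return 100.0 * passed / len(judgments)


def chart_is_correct(judgments: Sequence[str], threshold: float = 70.0) -> bool:
    """Assumed rule: correct only when the score is strictly above the threshold."""
    return chart_score(judgments) > threshold
```

With ten assertions, for example, eight PASS judgments would clear the threshold under these assumptions, while seven would sit exactly at 70 and would not.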
Instruction: Complete the financial model based on the provided assumptions. Ensure the existing structure, layout, and formatting of the model are preserved throughout the process. In the Financials sheet, calculate CAGR for both historical and forecast periods for Revenue, EBITDA, EBIT, PAT, and Total Assets. In the Assumptions sheet, calculate Average Days Inventory for 2014F–2018F based on prior two years' average, reducing by 0.5 days from 2015F–2018F, then derive forecasted inventories. In the Ratio Analysis sheet, calculate Total Asset Turnover, then calculate EV/EBITDA. In the Valuation sheet, calculate Terminal Value.
Example 1: Financial Modeling.
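To make the CAGR step of this instruction concrete, here is a small sketch using the standard compound annual growth rate definition, (end / start)^(1 / periods) - 1. The function name and the sample figures are illustrative only; the actual period boundaries and cell layout come from the workbook, which is not reproduced here.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate over `periods` years."""
    if periods <= 0:
        raise ValueError("periods must be a positive number of years")
    if start_value <= 0 or end_value < 0:
        raise ValueError("CAGR is undefined for a non-positive starting value")
    # Spreadsheet equivalent: =(end/start)^(1/periods)-1
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Illustrative figures only: revenue growing from 100 to 121 over two years.
growth = cagr(100.0, 121.0, 2)
print(f"{growth:.2%}")
```

In the task itself the same calculation would be applied separately to the historical and forecast periods for each line item named in the instruction.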
SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets (Version 1) and business spreadsheet workflows (Version 2). Visit the V1 Overview or V2 Overview page for benchmark details, research paper, code repository, and dataset downloads.
To submit your results, please contact spreadsheetbench@gmail.com or spreadsheetbench@aptura.ai, and provide an API capable of generating your agent's results.
We are releasing SpreadsheetBench V2, a new benchmark for evaluating agents on end-to-end business spreadsheet workflows, covering financial modeling, debugging, and visualization in professional scenarios with complex multi-sheet workbooks.