SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Zeyao Ma1, Bohan Zhang1, Jing Zhang1, Jifan Yu2, Xiaokang Zhang1, Xiaohan Zhang3, Sijia Luo1, Xi Wang1, Jie Tang2

1Renmin University of China, 2Tsinghua University, 3Zhipu.AI

Evaluating LLM Agents' Capabilities in Manipulating Complex, Real-World Spreadsheets

SpreadsheetBench is a challenging spreadsheet manipulation benchmark that (1) contains 912 questions derived exclusively from real-world scenarios, (2) includes spreadsheet files with tabular data in various formats, and (3) features a more reliable, online-judge-style evaluation metric.

Figure 1: The benchmark construction pipeline and OJ-style evaluation of SpreadsheetBench.

Key Features

    1. Real-World Questions: Built from 912 authentic user questions sourced from online Excel forums, reflecting genuine and complex spreadsheet manipulation needs.
    2. Diverse Spreadsheets: Includes real spreadsheets with intricate structures such as multiple tables, non-standard relational layouts, and abundant non-textual elements, mirroring actual user environments.
    3. Robust Evaluation: Introduces a reliable, online-judge-style evaluation metric using multiple test-case spreadsheets per instruction to assess the generalization and robustness of model-generated solutions.

Evaluation Metrics

Overall (Pass@1): The overall accuracy across all 912 questions and 2,729 spreadsheets.
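The OJ-style scoring can be sketched as follows. This is a hedged illustration, not the benchmark's actual harness: each spreadsheet is modeled as a plain dict mapping cell addresses to values, and an instruction is accepted only if the solution matches the expected answer cells on every test-case spreadsheet.

```python
# Hypothetical OJ-style scorer (the dict-based spreadsheet model is an
# assumption made for brevity; the real benchmark compares .xlsx files).

def accepts(solutions, expected_cases):
    """solutions / expected_cases: parallel lists of {address: value} dicts,
    one entry per test-case spreadsheet for a single instruction."""
    return all(
        all(sol.get(addr) == val for addr, val in exp.items())
        for sol, exp in zip(solutions, expected_cases)
    )

def overall_accuracy(instruction_results):
    """Pass@1: fraction of instructions whose solutions passed all test cases."""
    return sum(instruction_results) / len(instruction_results)
```

A solution that passes some but not all test-case spreadsheets for an instruction scores zero for that instruction, which is what distinguishes this metric from single-file comparisons.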

Benchmark Statistics

Top Score (Overall): 70.48%
Total Tasks: 912
Total Spreadsheets: 2,729

Compared to previous benchmarks, SpreadsheetBench incorporates complex questions based on real-world scenarios and diverse types of tables within spreadsheet files. We present four data examples from SpreadsheetBench to illustrate the attributes of real-world problems and the challenging nature of our benchmark.


Example 1: A cell-level manipulation question involving a non-standard relational table (missing headers on columns D and E).


This example shows a cell-level manipulation task that aims to extract a specific part of a text string from a column of cells. The instruction contains both the user's requirement and an example manipulation, which rarely occur in the synthetic instructions of previous benchmarks. Furthermore, the table within the spreadsheet file is a non-standard relational table that lacks a complete header. The final result must be filled into cells B3 to B14, which ensures the uniqueness of the answer.
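A task of this shape can be sketched in a few lines. The concrete extraction rule below (text before the first "-") and the A-column source layout are assumptions for illustration, not the benchmark's actual data; the sheet is modeled as a plain {address: value} dict, whereas a real agent would edit the .xlsx file directly.

```python
# Hypothetical cell-level manipulation: extract a substring from each
# source cell in column A and write the result into B3:B14, the exact
# range the example question requires.

def fill_answers(sheet, first_row=3, last_row=14):
    """Write the part of A{row} before the first '-' into B{row}."""
    for row in range(first_row, last_row + 1):
        sheet[f"B{row}"] = sheet[f"A{row}"].split("-", 1)[0]
    return sheet
```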

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

1Renmin University of China, 2Aptura.AI, 3AfterQuery, 4Shortcut.AI

Evaluating LLM Agents' Capabilities in Challenging, Expert-Curated, End-to-End Spreadsheet Tasks

SpreadsheetBench 2 is a benchmark for evaluating agents on end-to-end business spreadsheet workflows. Unlike existing benchmarks that focus on isolated manipulations, SpreadsheetBench 2 requires agents to (1) complete workflow-level goals through multi-step coordinated operations, (2) perform cross-sheet reasoning within complex multi-sheet workbooks, (3) produce deliverable-level outcomes including structured models, repaired spreadsheets, and accurate visualizations.

Figure 1: Overview of SpreadsheetBench 2 task categories, including Debugging, Financial Modeling, Template, and Visualization.

Key Features

    1. End-to-End Workflows: Tasks are designed as self-contained, multi-stage objectives requiring sequences of coordinated spreadsheet operations, rather than atomic formula generation or local edits.
    2. Professional Financial Domain: Covers real business scenarios including financial modeling with multi-statement integration, systematic error debugging across 10 error types, and data visualization.
    3. Complex Workbook Structures: Tasks involve workbooks with numerous sheets per question and require extensive cell modifications—debugging tasks require hundreds of cell edits on average, and financial modeling tasks require over a thousand.

Task Categories

Tasks are organized into three primary categories that mirror the lifecycle of professional spreadsheet usage:

  • Financial Modeling/Template: Constructing structured spreadsheet artifacts including template completion and multi-step modeling scenarios.
  • Debugging: Identifying and correcting logical, structural, or reference errors in existing spreadsheets.
  • Visualization: Transforming tabular data into charts and pivot-based summaries for analysis and presentation.

Evaluation Metrics

Financial Modeling, Template & Debugging: A task is considered correct only when all required cell modifications are accurate and all unchanged cells remain unmodified. The final answer must exactly match the ground truth.
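This all-or-nothing rule can be sketched as a strict workbook comparison. The {sheet_name: {address: value}} representation is an assumption made for brevity; the point is that stray edits to supposedly unchanged cells also fail the task.

```python
# Hedged sketch of the strict exact-match check: the submission counts
# as correct only if every cell in every sheet equals the ground truth.

def exact_match(submitted, ground_truth):
    """Workbooks modeled as {sheet_name: {address: value}} dicts."""
    if set(submitted) != set(ground_truth):
        return False  # a sheet was added or removed
    return all(submitted[name] == ground_truth[name] for name in ground_truth)
```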
Visualization: Each task has a reference answer with a checklist of assertions. A Vision-Language Model (VLM) evaluates the generated chart image against each assertion, producing PASS/FAIL judgments. The chart evaluation score is computed as Acc = 100 × (Passed Assertions / Total Assertions); a score above 70 is considered correct.
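The chart-scoring rule reduces to a small aggregation step; a minimal sketch follows, where the PASS/FAIL strings stand in for the VLM judge's per-assertion output.

```python
# Aggregate per-assertion VLM judgments into the 0-100 chart score and
# apply the correctness threshold described above.

def chart_score(judgments):
    """Percentage of assertions judged PASS."""
    return 100.0 * judgments.count("PASS") / len(judgments)

def chart_is_correct(judgments, threshold=70.0):
    return chart_score(judgments) > threshold
```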

Benchmark Statistics

Top Score (Overall): 34.89%
Average Sheets: 16.4
Average Modified Cells: 822.5

Contributors

  • Renmin University of China: Jian Zhu, Yuzheng Zhang, Zeyao Ma, Bohan Zhang, Jing Zhang
  • Aptura.AI: Armin Schoepf, Daniel Woloch
  • AfterQuery: Sam Jacob, Siddharth Nagisetty, Abhiram Chundru, Jean Lin
  • Shortcut.AI: Robert Yang, Nico Christie, Peter Wang, Richard Pham

Instruction: Complete the financial model based on the provided assumptions. Ensure the existing structure, layout, and formatting of the model are preserved throughout the process. In the Financials sheet, calculate CAGR for both historical and forecast periods for Revenue, EBITDA, EBIT, PAT, and Total Assets. In the Assumptions sheet, calculate Average Days Inventory for 2014F–2018F based on prior two years' average, reducing by 0.5 days from 2015F–2018F, then derive forecasted inventories. In the Ratio Analysis sheet, calculate Total Asset Turnover, then calculate EV/EBITDA. In the Valuation sheet, calculate Terminal Value.


Example 1: Financial Modeling.
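The CAGR steps in the instruction above follow the standard compound-growth formula; a minimal sketch (the sample numbers are illustrative, not taken from the workbook):

```python
# Compound annual growth rate over n periods:
#   CAGR = (ending / beginning) ** (1 / n) - 1
# In the benchmark task this is expressed as an in-sheet formula; here
# it is shown as a plain function for clarity.

def cagr(beginning, ending, periods):
    if beginning <= 0 or periods <= 0:
        raise ValueError("beginning value and periods must be positive")
    return (ending / beginning) ** (1 / periods) - 1

# e.g. revenue growing from 100 to 121 over two years gives a 10% CAGR
```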


SpreadsheetBench Leaderboard

SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets (Version 1) and business spreadsheet workflows (Version 2). Visit the V1 Overview or V2 Overview page for benchmark details, research paper, code repository, and dataset downloads.

To submit your results, please contact spreadsheetbench@gmail.com or spreadsheetbench@aptura.ai, and provide an API capable of generating your agent's results.

We are releasing SpreadsheetBench V2, a new benchmark for evaluating agents on end-to-end business spreadsheet workflows, covering financial modeling, debugging, and visualization in professional scenarios with complex multi-sheet workbooks.

RANK | MODEL             | SCAFFOLD   | STATUS   | SCORE  | TEMPLATE | FIN.MODEL | DEBUG  | VISUALIZATION | DATE         | ORG
1    | Claude Opus 4.6   | Bash Agent | Verified | 34.89% | 52.58%   | 34.00%    | 12.00% | 62.50%        | Mar 26, 2026 | Anthropic
2    | GPT-5.2           | Bash Agent | Verified | 26.79% | 35.05%   | 33.00%    | 8.00%  | 45.83%        | Mar 26, 2026 | OpenAI
3    | Gemini 3.1 Pro    | Bash Agent | Verified | 23.68% | 28.87%   | 31.00%    | 7.00%  | 41.67%        | Mar 26, 2026 | Google
4    | GLM-5.0           | Bash Agent | Verified | 17.14% | 17.53%   | 22.00%    | 7.00%  | 37.50%        | Mar 26, 2026 | Zhipu.AI
5    | Deepseek-V3.2     | Bash Agent | Verified | 15.58% | 25.77%   | 7.00%     | 10.00% | 33.33%        | Mar 26, 2026 | DeepSeek
6    | Kimi K2.5         | Bash Agent | Verified | 14.64% | 18.56%   | 15.00%    | 4.00%  | 41.67%        | Mar 26, 2026 | Moonshot
7    | Qwen3.5-397B-A17B | Bash Agent | Verified | 11.22% | 17.53%   | 10.00%    | 3.00%  | 25.00%        | Mar 26, 2026 | Alibaba
8    | MiniMax M2.5      | Bash Agent | Verified | 7.17%  | 7.22%    | 8.00%     | 4.00%  | 16.67%        | Mar 26, 2026 | MiniMax

*Verified results are evaluated internally through APIs provided by the respective organizations. Unverified results are evaluated by external third parties, such as OpenAI and Microsoft.