SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Zeyao Ma1, Bohan Zhang1, Jing Zhang1, Jifan Yu2, Xiaokang Zhang1, Xiaohan Zhang3, Sijia Luo1, Xi Wang1, Jie Tang2

1Renmin University of China, 2Tsinghua University, 3Zhipu.AI

Evaluating LLM Agents' Capabilities in Manipulating Complex, Real-World Spreadsheets

SpreadsheetBench is a challenging spreadsheet manipulation benchmark that (1) contains 912 questions derived exclusively from real-world scenarios, (2) includes spreadsheet files with tabular data in various formats, and (3) features a more reliable, online-judge-style evaluation metric.

Figure 1: The benchmark construction pipeline and OJ-style evaluation of SpreadsheetBench.

Key Features

    1. Real-World Questions: Built from 912 authentic user questions sourced from online Excel forums, reflecting genuine and complex spreadsheet manipulation needs.
    2. Diverse Spreadsheets: Includes real spreadsheets with intricate structures such as multiple tables, non-standard relational layouts, and abundant non-textual elements, mirroring actual user environments.
    3. Robust Evaluation: Introduces a reliable, online-judge-style evaluation metric using multiple test-case spreadsheets per instruction to assess the generalization and robustness of model-generated solutions.

Evaluation Metrics

Overall (Pass@1): The overall accuracy across all 912 questions and 2,729 spreadsheets.
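The OJ-style scoring can be sketched as follows. This is a hedged illustration, not the benchmark's actual harness: each spreadsheet is modeled as a plain dict mapping cell addresses to values, and an instruction is accepted only if the solution matches the expected answer cells on every test-case spreadsheet.

```python
# Hypothetical OJ-style scorer (the dict-based spreadsheet model is an
# assumption made for brevity; the real benchmark compares .xlsx files).

def accepts(solutions, expected_cases):
    """solutions / expected_cases: parallel lists of {address: value} dicts,
    one entry per test-case spreadsheet for a single instruction."""
    return all(
        all(sol.get(addr) == val for addr, val in exp.items())
        for sol, exp in zip(solutions, expected_cases)
    )

def overall_accuracy(instruction_results):
    """Pass@1: fraction of instructions whose solutions passed all test cases."""
    return sum(instruction_results) / len(instruction_results)
```

A solution that passes some but not all test-case spreadsheets for an instruction scores zero for that instruction, which is what distinguishes this metric from single-file comparisons.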

Benchmark Statistics

Top Score (Overall): 70.48%
Total Tasks: 912
Total Spreadsheets: 2,729

Compared to previous benchmarks, SpreadsheetBench incorporates complex questions based on real-world scenarios and diverse types of tables within spreadsheet files. We present four data examples from SpreadsheetBench to illustrate the attributes of real-world problems and the challenging nature of our benchmark.


Example 1: A cell-level manipulation question involving a non-standard relational table (missing headers on columns D and E).


This example shows a cell-level manipulation task that aims to extract a specific part of a text string from a column of cells. The instruction contains both the user's requirement and an example manipulation, which rarely occur in the synthetic instructions of previous benchmarks. Furthermore, the table within the spreadsheet file is a non-standard relational table that lacks a complete header. The final result must be filled into cells B3 to B14, which ensures the uniqueness of the answer.
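A task of this shape can be sketched in a few lines. The concrete extraction rule below (text before the first "-") and the A-column source layout are assumptions for illustration, not the benchmark's actual data; the sheet is modeled as a plain {address: value} dict, whereas a real agent would edit the .xlsx file directly.

```python
# Hypothetical cell-level manipulation: extract a substring from each
# source cell in column A and write the result into B3:B14, the exact
# range the example question requires.

def fill_answers(sheet, first_row=3, last_row=14):
    """Write the part of A{row} before the first '-' into B{row}."""
    for row in range(first_row, last_row + 1):
        sheet[f"B{row}"] = sheet[f"A{row}"].split("-", 1)[0]
    return sheet
```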

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

1Renmin University of China, 2Aptura.AI, 3AfterQuery, 4Shortcut.AI

Evaluating LLM Agents' Capabilities in Challenging, Expert-Curated, End-to-End Spreadsheet Tasks

SpreadsheetBench 2 is a benchmark for evaluating agents on end-to-end business spreadsheet workflows. Unlike existing benchmarks that focus on isolated manipulations, SpreadsheetBench 2 requires agents to (1) complete workflow-level goals through multi-step coordinated operations, (2) perform cross-sheet reasoning within complex multi-sheet workbooks, (3) produce deliverable-level outcomes including structured models, repaired spreadsheets, and accurate visualizations.

Figure 1: Overview of SpreadsheetBench 2 task categories, including Debugging, Financial Modeling, Template, and Visualization.

Key Features

    1. End-to-End Workflows: Tasks are designed as self-contained, multi-stage objectives requiring sequences of coordinated spreadsheet operations, rather than atomic formula generation or local edits.
    2. Professional Financial Domain: Covers real business scenarios including financial modeling with multi-statement integration, systematic error debugging across 10 error types, and data visualization.
    3. Complex Workbook Structures: Tasks involve workbooks with numerous sheets per question and require extensive cell modifications—debugging tasks require hundreds of cell edits on average, and financial modeling tasks require over a thousand.

Task Categories

Tasks are organized into three primary categories that mirror the lifecycle of professional spreadsheet usage:

  • Financial Modeling/Template: Constructing structured spreadsheet artifacts including template completion and multi-step modeling scenarios.
  • Debugging: Identifying and correcting logical, structural, or reference errors in existing spreadsheets.
  • Visualization: Transforming tabular data into charts and pivot-based summaries for analysis and presentation.

Evaluation Metrics

Financial Modeling, Template & Debugging: A task is considered correct only when all required cell modifications are accurate and all unchanged cells remain unmodified. The final answer must exactly match the ground truth.
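This all-or-nothing rule can be sketched as a strict workbook comparison. The {sheet_name: {address: value}} representation is an assumption made for brevity; the point is that stray edits to supposedly unchanged cells also fail the task.

```python
# Hedged sketch of the strict exact-match check: the submission counts
# as correct only if every cell in every sheet equals the ground truth.

def exact_match(submitted, ground_truth):
    """Workbooks modeled as {sheet_name: {address: value}} dicts."""
    if set(submitted) != set(ground_truth):
        return False  # a sheet was added or removed
    return all(submitted[name] == ground_truth[name] for name in ground_truth)
```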
Visualization: Each task has a reference answer with a checklist of assertions. A Vision-Language Model (VLM) evaluates the generated chart image against each assertion, producing PASS/FAIL judgments. The chart evaluation score is computed as Acc = 100 × (Passed Assertions / Total Assertions); a score above 70 is considered correct.
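The chart-scoring rule reduces to a small aggregation step; a minimal sketch follows, where the PASS/FAIL strings stand in for the VLM judge's per-assertion output.

```python
# Aggregate per-assertion VLM judgments into the 0-100 chart score and
# apply the correctness threshold described above.

def chart_score(judgments):
    """Percentage of assertions judged PASS."""
    return 100.0 * judgments.count("PASS") / len(judgments)

def chart_is_correct(judgments, threshold=70.0):
    return chart_score(judgments) > threshold
```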

Benchmark Statistics

Top Score (Overall): 34.89%
Average Sheets: 16.4
Average Modified Cells: 822.5

Contributors

  • Renmin University of China: Jian Zhu, Yuzheng Zhang, Zeyao Ma, Bohan Zhang, Jing Zhang
  • Aptura.AI: Armin Schoepf, Daniel Woloch
  • AfterQuery: Sam Jacob, Siddharth Nagisetty, Abhiram Chundru, Jean Lin
  • Shortcut.AI: Robert Yang, Nico Christie, Peter Wang, Richard Pham

Instruction: Complete the financial model based on the provided assumptions. Ensure the existing structure, layout, and formatting of the model are preserved throughout the process. In the Financials sheet, calculate CAGR for both historical and forecast periods for Revenue, EBITDA, EBIT, PAT, and Total Assets. In the Assumptions sheet, calculate Average Days Inventory for 2014F–2018F based on prior two years' average, reducing by 0.5 days from 2015F–2018F, then derive forecasted inventories. In the Ratio Analysis sheet, calculate Total Asset Turnover, then calculate EV/EBITDA. In the Valuation sheet, calculate Terminal Value.


Example 1: Financial Modeling.
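The CAGR steps in the instruction above follow the standard compound-growth formula; a minimal sketch (the sample numbers are illustrative, not taken from the workbook):

```python
# Compound annual growth rate over n periods:
#   CAGR = (ending / beginning) ** (1 / n) - 1
# In the benchmark task this is expressed as an in-sheet formula; here
# it is shown as a plain function for clarity.

def cagr(beginning, ending, periods):
    if beginning <= 0 or periods <= 0:
        raise ValueError("beginning value and periods must be positive")
    return (ending / beginning) ** (1 / periods) - 1

# e.g. revenue growing from 100 to 121 over two years gives a 10% CAGR
```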


SpreadsheetBench Leaderboard

SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets (Version 1) and business spreadsheet workflows (Version 2). Visit the V1 Overview or V2 Overview page for benchmark details, research paper, code repository, and dataset downloads.

To submit your results, please contact spreadsheetbench@gmail.com or spreadsheetbench@aptura.ai, and provide an API capable of generating your agent's results.

We are releasing SpreadsheetBench V2, a new benchmark for evaluating agents on end-to-end business spreadsheet workflows, covering financial modeling, debugging, and visualization in professional scenarios with complex multi-sheet workbooks.

RANK | MODEL             | SCAFFOLD   | STATUS   | SCORE  | TEMPLATE | FIN.MODEL | DEBUG  | VISUALIZATION | DATE         | ORG
1    | Claude Opus 4.6   | Bash Agent | Verified | 34.89% | 52.58%   | 34.00%    | 12.00% | 62.50%        | Mar 26, 2026 | Anthropic
2    | GPT-5.2           | Bash Agent | Verified | 26.79% | 35.05%   | 33.00%    | 8.00%  | 45.83%        | Mar 26, 2026 | OpenAI
3    | Gemini 3.1 Pro    | Bash Agent | Verified | 23.68% | 28.87%   | 31.00%    | 7.00%  | 41.67%        | Mar 26, 2026 | Google
4    | GLM-5.0           | Bash Agent | Verified | 17.14% | 17.53%   | 22.00%    | 7.00%  | 37.50%        | Mar 26, 2026 | Zhipu.AI
5    | Deepseek-V3.2     | Bash Agent | Verified | 15.58% | 25.77%   | 7.00%     | 10.00% | 33.33%        | Mar 26, 2026 | DeepSeek
6    | Kimi K2.5         | Bash Agent | Verified | 14.64% | 18.56%   | 15.00%    | 4.00%  | 41.67%        | Mar 26, 2026 | Moonshot
7    | Qwen3.5-397B-A17B | Bash Agent | Verified | 11.22% | 17.53%   | 10.00%    | 3.00%  | 25.00%        | Mar 26, 2026 | Alibaba
8    | MiniMax M2.5      | Bash Agent | Verified | 7.17%  | 7.22%    | 8.00%     | 4.00%  | 16.67%        | Mar 26, 2026 | MiniMax

*Verified results are evaluated internally through APIs provided by the respective organizations. Unverified results are evaluated by external third parties, such as OpenAI and Microsoft.