SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, Jie Tang

A benchmark for evaluating LLM agents' capabilities in manipulating complex real-world spreadsheets.

SpreadsheetBench is a challenging spreadsheet manipulation benchmark that (1) contains 912 questions exclusively derived from real-world scenarios, (2) includes spreadsheet files with tabular data in various formats, and (3) features a more reliable evaluation metric akin to those used by online judge (OJ) platforms.

Figure 1: The benchmark construction pipeline and the OJ-style evaluation of our benchmark.

Key Features

    1. Real-World Questions: Built from 912 authentic user questions sourced from online Excel forums, reflecting genuine and complex spreadsheet manipulation needs.
    2. Diverse Spreadsheets: Includes real spreadsheets with intricate structures such as multiple tables, non-standard relational layouts, and abundant non-textual elements, mirroring actual user environments.
    3. Robust Evaluation: Introduces a reliable, online-judge-style evaluation metric using multiple test-case spreadsheets per instruction to assess the generalization and robustness of model-generated solutions.

Evaluation Metrics

Overall (Pass@1): The overall accuracy across all 912 questions and 2,729 spreadsheets.
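
To make the OJ-style scoring concrete, here is a minimal sketch of such a checker in Python with openpyxl. It assumes each question ships with several (produced, expected) spreadsheet pairs, a known answer range, and an all-test-cases-must-pass rule; the helper names (`cells_match`, `pass_at_1`), the `answer_range` field, and the file layout are illustrative assumptions, not the official evaluation harness.

```python
from openpyxl import load_workbook


def cells_match(produced_path: str, expected_path: str, answer_range: str) -> bool:
    """Compare the answer cells of two workbooks value-by-value."""
    produced = load_workbook(produced_path, data_only=True).active
    expected = load_workbook(expected_path, data_only=True).active
    for prod_row, exp_row in zip(produced[answer_range], expected[answer_range]):
        for prod_cell, exp_cell in zip(prod_row, exp_row):
            if prod_cell.value != exp_cell.value:
                return False
    return True


def pass_at_1(questions: list[dict]) -> float:
    """OJ-style accuracy: a question counts as solved only if every
    one of its test-case spreadsheets matches the expected output."""
    solved = 0
    for q in questions:  # q["cases"]: list of (produced, expected) file pairs
        if all(cells_match(p, e, q["answer_range"]) for p, e in q["cases"]):
            solved += 1
    return solved / len(questions)
```

The all-or-nothing rule per question is what distinguishes this from single-test-case metrics: a solution that hard-codes values for one spreadsheet fails the remaining test cases and earns no credit.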

Benchmark Statistics

Top Score (Overall): 59.25%
Total Tasks: 912
Total Spreadsheets: 2,729

SpreadsheetBench Leaderboard

SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets. The benchmark includes 912 real questions gathered from online Excel forums, covering diverse tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements.

Visit the Overview page for benchmark details, research paper, code repository, and dataset downloads.

To submit your results to SpreadsheetBench, please contact zeyaoma@gmail.com and provide an API capable of generating your agent's results.

| RANK | MODEL                         | STATUS     | SCORE  | DATE         | ORG         |
|------|-------------------------------|------------|--------|--------------|-------------|
| 1    | Shortcut.ai                   | Verified   | 59.25% | Oct 16, 2025 | Shortcut.ai |
| 2    | Copilot in Excel (Agent Mode) | Unverified | 57.2%  | Sep 29, 2025 | Microsoft   |
| 3    | ChatGPT Agent w/ .xlsx        | Unverified | 45.5%  | Jul 17, 2025 | OpenAI      |
| 4    | Claude Files Opus 4.1         | Unverified | 42.9%  | Sep 29, 2025 | Anthropic   |
| 5    | ChatGPT Agent                 | Unverified | 35.3%  | Jul 17, 2025 | OpenAI      |
| 6    | OpenAI o3                     | Unverified | 23.3%  | Jul 17, 2025 | OpenAI      |
| 7    | Copilot in Excel              | Verified   | 20.0%  | Oct 17, 2024 | Microsoft   |
| 8    | GPT-4o (OSX)                  | Unverified | 16.8%  | Jul 17, 2025 | OpenAI      |

*Verified results are evaluated by the SpreadsheetBench team through APIs provided by the respective organizations. Unverified results are evaluated externally by the submitting organizations themselves, such as OpenAI and Microsoft.

Task Examples

Compared to previous benchmarks, SpreadsheetBench incorporates complex questions based on real-world scenarios and diverse table types in spreadsheet files. We present four data examples from SpreadsheetBench to illustrate the attributes of real-world problems and the challenging nature of our benchmark.

Example 1: A cell-level manipulation question that involves a non-standard relational table (missing headers in columns D and E).

This example shows a cell-level manipulation task whose goal is to extract a specific part of a text string from a column of cells. The instruction contains both the user's demand and an example manipulation, which rarely occur in the synthetic instructions of previous benchmarks. Furthermore, the table in the spreadsheet file is a non-standard relational table that lacks a complete header. The final result must be filled into cells B3 through B14, which ensures the uniqueness of the answer.
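
To make the task concrete, below is a hypothetical openpyxl solution. The source column (A), the extraction rule (text before the first hyphen), and the file names are assumptions for illustration only; the actual rule comes from the forum question's instruction and worked example.

```python
import re

from openpyxl import load_workbook

# Hypothetical file names; the benchmark supplies the real input workbook.
wb = load_workbook("1_input.xlsx")
ws = wb.active

for row in range(3, 15):  # answer cells B3:B14
    text = str(ws.cell(row=row, column=1).value or "")
    # Assumed rule: keep the part of the string before the first hyphen.
    match = re.match(r"([^-]+)", text)
    ws.cell(row=row, column=2).value = match.group(1).strip() if match else ""

wb.save("1_output.xlsx")
```

Because the evaluation compares the values in B3:B14 against multiple test-case spreadsheets, a correct solution must encode the extraction rule itself rather than hard-code the answers from the single example shown in the instruction.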