Generative AI seems well-suited to pulling insights from Excel files. However, large language models (LLMs) struggle tremendously with spreadsheets: the columns and rows found in Microsoft Excel or Google Sheets are difficult to convert into an AI prompt. Microsoft researchers want to change that, and their work hints at a major addition to Excel in the long run.
The fact that GenAI has problems with spreadsheets may not be all that widely known. After all, Microsoft has had the AI assistant Copilot integrated into Excel for some time, and Google also offers AI functionality within Sheets. These features let users automatically create charts or kick off a project with an AI-generated template. What they don't yet deliver, however, is what any ambitious organization actually wants: the ability to turn spreadsheet data into useful insights at the click of a button.
Organizations worldwide use Excel spreadsheets en masse for their business operations, but that doesn't mean they should simply feed their internal files to AI models. Part of that has to do with data security and privacy; after all, no one should be sharing payrolls or inventory data with ChatGPT. The real bottleneck lies elsewhere, however: converting a spreadsheet of any meaningful size into an AI prompt is extremely costly because it requires a huge number of tokens. It also often produces more data than fits in an LLM's context window, effectively the model's short-term memory for taking in prompts. In other words, feeding larger raw Excel files to an LLM is either too expensive or outright impossible.
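To make the token problem concrete, here is a minimal sketch (not from the paper) of what naive cell-by-cell serialization does to a prompt budget. The serialization format, the example sheet size and the rough four-characters-per-token heuristic are assumptions for illustration only.

```python
# Minimal sketch: why naive spreadsheet serialization explodes a prompt's
# token budget. We dump every cell as "address,value" and estimate tokens
# with the rough "~4 characters per token" rule of thumb (not a real tokenizer).

def serialize_naively(rows: int, cols: int, get_value) -> str:
    """Turn a rows x cols sheet into one long text blob, cell by cell."""
    lines = []
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            col_letter = chr(ord("A") + (c - 1) % 26)  # good enough for a demo
            lines.append(f"{col_letter}{r},{get_value(r, c)}")
    return "\n".join(lines)

def rough_token_estimate(text: str) -> int:
    return len(text) // 4  # common heuristic, not an exact count

# A hypothetical 10,000-row, 20-column sheet of order lines:
blob = serialize_naively(10_000, 20, lambda r, c: f"value_{r}_{c}")
print(f"{rough_token_estimate(blob):,} estimated tokens")
# Even this modest sheet yields an estimate far beyond typical LLM context windows.
```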
SpreadsheetLLM: not an LLM, but a framework
A team of Microsoft researchers recently presented SpreadsheetLLM, which proposes a new framework for linking LLMs to spreadsheets. As mentioned, the challenge above all is to feed a large amount of Excel or Sheets data to an LLM without breaking the bank or the model. In the study, a conventional approach based on straightforward data serialization proved ineffective: the limited token budget was a stumbling block, and heftier spreadsheets that exceed an LLM's maximum token count “degrade accuracy performance as the size increases.”
The solution to this was SheetCompressor, another new framework from the research team. SheetCompressor contains three modules, each of which further compresses Excel or Sheets data. First, it recognizes homogeneous rows and columns, i.e. repetitive data that provides little insight. What remains is a compacted “skeleton”, defined largely by the detected separations between tables.
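As a simplified illustration of that first step (and not the paper's actual structural-anchor method), the sketch below drops rows that merely repeat the row above them, since they cost tokens without adding information:

```python
# Simplified illustration of the first SheetCompressor idea, not the paper's
# actual algorithm: repetitive rows cost tokens but add little information,
# so keep only rows that differ from the row directly above them.

def compress_repetitive_rows(rows: list[list[str]]) -> list[tuple[int, list[str]]]:
    kept = []
    for i, row in enumerate(rows):
        if i == 0 or row != rows[i - 1]:
            kept.append((i, row))  # keep the row and remember where it was
    return kept

sheet = [
    ["Region", "Sales"],   # header
    ["EMEA", "100"],
    ["EMEA", "100"],       # identical to the row above -> dropped
    ["EMEA", "100"],       # idem
    ["", ""],              # blank separator between tables
    ["Product", "Stock"],  # next table's header
]
print(compress_repetitive_rows(sheet))
# [(0, ['Region', 'Sales']), (1, ['EMEA', '100']), (4, ['', '']), (5, ['Product', 'Stock'])]
```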
Next, the researchers use “lossless inverted-index translation” in a JSON format. Specifically, this involves merging identical values found in different cells into a single entry, without losing data integrity. The third module groups cells that share the same data format. The bottom line: thanks to SheetCompressor, encoding spreadsheets is 96 (!) percent more economical with AI tokens. Since a 96 percent reduction leaves only one twenty-fifth of the original tokens, organizations could face a 25 times lower fee for their AI spreadsheet ambitions using the methods described in the research paper.
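A hedged sketch of the inverted-index idea: instead of listing every cell with its value, map each distinct value to the cells that hold it. The example data and function name below are made up for illustration; they simplify what the paper describes.

```python
# Sketch of an inverted index over spreadsheet cells: identical values collapse
# into one JSON entry and empty cells disappear entirely, yet the original grid
# can still be reconstructed from the addresses, so no information is lost.

import json
from collections import defaultdict

def invert_cells(cells: dict[str, str]) -> str:
    """cells maps addresses like 'A1' to values; returns a compact JSON index."""
    index = defaultdict(list)
    for address, value in cells.items():
        if value != "":                 # skip empty cells
            index[value].append(address)
    return json.dumps(index)

cells = {"A1": "EMEA", "A2": "EMEA", "A3": "EMEA", "B1": "100", "B2": "", "B3": "100"}
print(invert_cells(cells))
# {"EMEA": ["A1", "A2", "A3"], "100": ["B1", "B3"]}
```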
Impressive performance
The Microsoft team experimented with the closed-source OpenAI models GPT-3.5 and GPT-4 and the open-source offerings Llama 2 and Llama 3 from Meta, Phi-3 from Microsoft and Mistral-v2. GPT-4 detected tables 27 percent better thanks to the new methodology, beating TableSense-CNN, previously considered the state of the art, by a 13 percent margin. Larger spreadsheets in particular are “understood” significantly better thanks to SpreadsheetLLM and SheetCompressor.
A next step, called Chain of Spreadsheet (CoS), refines the methodology further. It involves two stages: 1) identifying the relevant tables so that the right data ends up in the AI prompt, and 2) generating a response by feeding the chosen table section back to the LLM. This division of labor keeps the intended task manageable for AI models, as the sketch below illustrates.
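The following is a rough, non-authoritative sketch of that two-stage flow; `call_llm`, `extract_range` and the prompt wording are hypothetical placeholders rather than anything prescribed by the paper.

```python
from typing import Callable

def chain_of_spreadsheet(
    compressed_sheet: str,
    question: str,
    call_llm: Callable[[str], str],  # stand-in for whichever LLM endpoint you use
) -> str:
    # Stage 1: ask the model which table range is relevant to the question.
    table_range = call_llm(
        "Here is a compressed spreadsheet:\n"
        f"{compressed_sheet}\n"
        f"Which table range is relevant to this question: {question}? "
        "Reply with the range only."
    )
    # Stage 2: feed only that section back to the model to generate the answer.
    relevant_section = extract_range(compressed_sheet, table_range)
    return call_llm(
        f"Using only this table section:\n{relevant_section}\n"
        f"Answer the question: {question}"
    )

def extract_range(compressed_sheet: str, table_range: str) -> str:
    """Hypothetical helper: pull the identified section out of the compressed
    representation; a real implementation would parse the returned range."""
    return compressed_sheet  # placeholder that simply passes everything through
```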
Not yet ready for use
The researchers acknowledge that there is still work to be done. For example, the current framework does not include things like background colors and borders, simply because that would require too many tokens. There is also more context to be gained from the information within spreadsheet cells: “For example, categorizing terms like ‘China,’ ‘America,’ and ‘France’ under a unified label such as ‘Country’ could not only increase the compression ratio but also deepen the semantic understanding of the data by LLMs. Exploring these advanced semantic compression techniques will be a key focus of our ongoing efforts to enhance the capabilities of SpreadsheetLLM.”
SpreadsheetLLM won’t be directly integrated into Microsoft Excel anytime soon. Still, the research shows that considerable work is being done on what could become a meaningful feature extension for Excel. The Microsoft researchers also mention Google Sheets, but it will be up to the research team’s employer in Redmond to eventually bring these findings to market. That would go a long way toward making spreadsheets deliver answers faster, something that not only saves time but can also surface new data-driven insights.
Also read: Meta’s MobileLLM optimized for smartphones