Spreadsheets, Databases, and Beyond: Creating a Universal AI Query Layer

In the world of AI agents, the ability to understand and analyze structured data is a game-changer. While large language models excel at understanding natural language, they struggle with quantitative analysis when data is presented as unstructured text. This is where Query Tables comes in – a powerful feature that enables agents to execute SQL queries on structured data, providing precise answers to analytical questions.

At Dust, we've been on a journey to build and evolve our Query Tables agent tool from a simple CSV file parser into a sophisticated system that can connect to enterprise data warehouses. This blog post details that journey – the technical challenges we faced, the architectural decisions we made, and how we've maintained a unified abstraction layer that makes it easy for our users to work with structured data regardless of its source.

The problem: why semantic search falls short for quantitative analysis.

Before diving into our solution, it's important to understand the problem we were trying to solve. Semantic search is great for retrieving relevant chunks of information from unstructured text, but it has significant limitations when it comes to quantitative analysis:

  1. Incomplete data access: Semantic search only retrieves chunks of data, not the entire dataset, making it impossible to perform comprehensive analysis.
  2. No computation capabilities: Even if all the data were retrieved, LLMs have no built-in ability to perform calculations or aggregations.
  3. Relevance vs. completeness: Semantic search optimizes for relevance, not completeness, which is problematic for analytical queries that require full datasets.

We saw this limitation firsthand when users tried to ask quantitative questions about their CSV files, Notion databases, or Google Sheets. The answers were often incomplete or inaccurate because the agent couldn't "see" the entire dataset or perform the necessary calculations.

The origin story: CSV files and SQLite.

Early Conceptualization

Our journey began when users started asking about importing CSV files into Dust. Initially, these files were loaded directly into the context window as plain text, which created two major limitations: we could only handle small files due to context window constraints, and even when the data fit, the LLM struggled to perform accurate calculations or analysis on the raw tabular text. The results were often inconsistent and unreliable, especially for anything beyond the most basic arithmetic operations.

We started exploring a more sophisticated approach: using SQLite as an in-memory database to execute SQL queries against structured data. This decision was driven by several factors:

  1. Simplicity: SQLite is lightweight and requires no additional infrastructure
  2. Security: In-memory databases provide strong isolation for user data
  3. Performance: Early tests showed it was fast enough for our use cases
  4. Familiarity: SQL is a well-established language for data analysis

Technical Implementation

The core of our implementation was a system that could:

  1. Parse CSV files and infer their schema
  2. Create an in-memory SQLite database
  3. Load the data into the database
  4. Allow the agent to generate and execute SQL queries
  5. Return the results to the user
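
To make the pipeline concrete, here is a simplified TypeScript sketch of what steps 1 through 5 can look like, using the better-sqlite3 and csv-parse packages. This is illustrative only; our production code handles schema inference, escaping, and error cases far more carefully:

import Database from "better-sqlite3";
import { parse } from "csv-parse/sync";

// Infer a coarse SQLite type for a column from its values (simplified heuristic).
function inferType(values: string[]): "INTEGER" | "REAL" | "TEXT" {
  if (values.every((v) => /^-?\d+$/.test(v))) return "INTEGER";
  if (values.every((v) => /^-?\d+(\.\d+)?$/.test(v))) return "REAL";
  return "TEXT";
}

function loadCsvIntoSqlite(csvText: string, tableName: string): Database.Database {
  // 1. Parse the CSV and infer a schema from its values.
  const records: Record<string, string>[] = parse(csvText, { columns: true });
  if (records.length === 0) throw new Error("Empty CSV");
  const columns = Object.keys(records[0]);

  // 2. Create an in-memory SQLite database with that schema.
  const db = new Database(":memory:");
  const colDefs = columns
    .map((c) => `"${c}" ${inferType(records.map((r) => r[c]))}`)
    .join(", ");
  db.exec(`CREATE TABLE "${tableName}" (${colDefs})`);

  // 3. Load the rows inside a single transaction.
  const insert = db.prepare(
    `INSERT INTO "${tableName}" VALUES (${columns.map(() => "?").join(", ")})`
  );
  const insertAll = db.transaction((rows: Record<string, string>[]) => {
    for (const row of rows) insert.run(...columns.map((c) => row[c]));
  });
  insertAll(records);

  return db;
}

// 4-5. The agent-generated SQL is then executed against this database, e.g.:
// const rows = loadCsvIntoSqlite(csv, "sales").prepare("SELECT SUM(amount) FROM sales").all();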

Performance was a concern, but our initial tests were promising. For a typical CSV file of ~2MB, the entire process from file reading to query execution took less than a second:

  • File reading: 14 ms
  • Schema inference: 162 ms
  • Table creation and data insertion: ~600 ms
  • Querying: 15 ms

An important advantage of this approach is that we load the data concurrently while the AI is generating the SQL query, making these operations effectively invisible in terms of user-perceived latency: they happen during the time the model spends generating text anyway. For most use cases, this was fast enough, and the in-memory approach meant we didn't need to maintain persistent database instances for each table.
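
Conceptually, the overlap looks something like this (a minimal sketch; the function signatures are illustrative placeholders, not our actual API):

import Database from "better-sqlite3";

// Start data loading and model-side SQL generation at the same time,
// and only wait for whichever finishes last.
async function answerTableQuestion(
  loadTable: () => Promise<Database.Database>, // e.g. an async wrapper around the CSV loader
  generateSql: () => Promise<string>           // the LLM call that writes the query
): Promise<unknown[]> {
  const [db, sql] = await Promise.all([loadTable(), generateSql()]);
  return db.prepare(sql).all();
}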

To optimize resource usage while maintaining responsiveness, we implemented a caching mechanism for our in-memory databases. After the initial query completes, the database instance remains alive for several minutes, allowing follow-up questions to be answered without the overhead of recreating the database.

Each active database is tracked with a lightweight heartbeat system, and instances that haven't been queried for a configurable period (typically 5 minutes) are automatically shut down and their resources reclaimed. This approach strikes a balance between performance for multi-turn conversations and efficient resource management, which is particularly important as the number of concurrent users grows.
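A simplified version of that lifecycle could look like the following (identifiers and timings are illustrative, not our exact implementation):

import Database from "better-sqlite3";

const TTL_MS = 5 * 60 * 1000; // evict databases idle for ~5 minutes

interface CacheEntry {
  db: Database.Database;
  lastUsed: number;
}

const cache = new Map<string, CacheEntry>();

// Return the cached in-memory database for a table, refreshing its heartbeat,
// or build it if it is not cached yet.
function getDatabase(tableId: string, build: () => Database.Database): Database.Database {
  const entry = cache.get(tableId);
  if (entry) {
    entry.lastUsed = Date.now();
    return entry.db;
  }
  const db = build();
  cache.set(tableId, { db, lastUsed: Date.now() });
  return db;
}

// Periodically reclaim databases that have not been queried recently.
setInterval(() => {
  const now = Date.now();
  for (const [id, entry] of cache) {
    if (now - entry.lastUsed > TTL_MS) {
      entry.db.close();
      cache.delete(id);
    }
  }
}, 60 * 1000);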

Query Generation and Execution

With the database set up, the next step is generating and executing the SQL query. This involves:

  1. Providing the agent with the schema information
  2. Letting the agent generate a SQL query based on the user's question
  3. Validating the query to ensure it's safe and well-formed
  4. Executing the query against the in-memory database
  5. Formatting the results for presentation to the user
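
Validation mostly means rejecting anything that is not a plain read. A minimal guard along these lines (our real checks are stricter) would be:

// Reject anything that is not a single read-only SELECT (or CTE) statement.
function validateQuery(sql: string): void {
  const trimmed = sql.trim().replace(/;\s*$/, "");
  if (trimmed.includes(";")) {
    throw new Error("Multiple statements are not allowed");
  }
  if (!/^(select|with)\b/i.test(trimmed)) {
    throw new Error("Only SELECT queries are allowed");
  }
  if (/\b(insert|update|delete|drop|alter|attach|pragma)\b/i.test(trimmed)) {
    throw new Error("Query contains a forbidden keyword");
  }
}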

Expanding to Connected Data Sources

The Connected Data Challenge

We had a solid foundation for handling CSV files, but our users wanted more. They needed to query data that lived in connected systems like Notion databases, Google Sheets, and Office 365 spreadsheets – data that was constantly changing and couldn't be effectively managed through manual uploads.

This presented new challenges:

  1. Data synchronization: How to keep the data up-to-date without manual intervention
  2. Schema discovery: How to automatically detect and map schemas from different sources
  3. Hierarchical structures: How to handle nested structures like sheets within spreadsheets
  4. Performance at scale: How to maintain performance with larger, more complex datasets

Notion Databases

Our first connected data source was Notion. Notion databases are structured collections of pages with consistent properties, making them a natural fit for our Query Tables agent tool.

One challenge was handling Notion's flexible property types, which include text, numbers, dates, selects, multi-selects, and more. We had to develop robust type conversion logic to map these properties to SQL data types.
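
The conversion itself boils down to a mapping from Notion property types to SQL column types. A condensed (and deliberately non-exhaustive) sketch of that idea:

// Map a Notion property type to a SQLite column type (illustrative, not exhaustive).
function notionTypeToSqlType(propertyType: string): "TEXT" | "REAL" | "INTEGER" {
  switch (propertyType) {
    case "number":
      return "REAL";
    case "checkbox":
      return "INTEGER"; // stored as 0/1
    case "date":
    case "created_time":
    case "last_edited_time":
      return "TEXT"; // ISO 8601 strings
    case "select":
    case "multi_select": // multi-selects are flattened to a delimited string
    case "title":
    case "rich_text":
    default:
      return "TEXT";
  }
}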

Google Sheets and Office 365

Next came Google Sheets and Office 365 spreadsheets. These presented their own challenges:

  1. Sheet hierarchy: A single spreadsheet file could contain multiple sheets
  2. Data quality and structure: Many spreadsheets contain loosely structured or completely unstructured data – merged cells, multi-level headers, embedded charts, formatting used to carry meaning, inconsistent value representations, and free-form notes mixed with tabular data. This "human-friendly but machine-unfriendly" formatting makes many real-world spreadsheets impossible to parse reliably as structured tables without cleanup or preprocessing.
  3. Header detection: Determining which row contained the headers
  4. Type inference: Inferring data types from cell values
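
Header detection in particular comes down to heuristics. One simple approach (a sketch, not the full logic we ended up with) is to pick the first row whose cells are all filled in and mostly non-numeric:

// Heuristic header detection: the header is assumed to be the first row whose
// cells are all non-empty and mostly non-numeric.
function detectHeaderRow(rows: string[][]): number {
  for (let i = 0; i < rows.length; i++) {
    const cells = rows[i];
    if (cells.length === 0) continue;
    const allFilled = cells.every((c) => c.trim().length > 0);
    const numericCount = cells.filter((c) => /^-?\d+([.,]\d+)?$/.test(c.trim())).length;
    if (allFilled && numericCount / cells.length < 0.5) {
      return i;
    }
  }
  return 0; // fall back to the first row
}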

Unified Data Source Abstraction

A key architectural decision was maintaining a unified abstraction layer for all data sources. Regardless of whether data came from a CSV file, a Notion database, or a Google Sheet, it would be represented as a table with:

  • A unique ID
  • A name and description
  • A schema (columns and their types)
  • A source URL (for tracking back to the original data)

This abstraction made it possible for agents to work with any data source using the same SQL interface without needing to know the underlying details.
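
In code, the abstraction boils down to something like the following interface (field names simplified for illustration):

// Unified representation of a queryable table, regardless of where the data lives.
interface QueryableTable {
  id: string;               // unique ID
  name: string;             // table name exposed to the agent
  description: string;      // human- and agent-readable description
  schema: ColumnSchema[];   // columns and their types
  sourceUrl: string | null; // link back to the original data, when available
}

interface ColumnSchema {
  name: string;
  type: "text" | "number" | "boolean" | "datetime";
}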

What makes this approach powerful is that it allows users to join tables from completely different sources – combining data from a CSV file with a Notion database and a Google Sheet in a single SQL query, something that would be difficult to accomplish manually.

Extending to data warehouses: Snowflake and BigQuery.

The Enterprise Data Challenge

We were then ready for the next evolution: connecting to enterprise data warehouses like Snowflake and BigQuery. This was driven by strong customer demand, particularly from companies with significant investments in these platforms.

Unlike our previous data sources, enterprise data warehouses:

  1. Can contain terabytes or petabytes of data
  2. Have complex permission models
  3. Use specialized SQL dialects
  4. Often sit behind firewalls with strict access controls

Importing this data into SQLite wasn't feasible – we needed a different approach.

Remote Database Architecture

For data warehouses, we adopted a "remote database" pattern:

  1. Metadata Synchronization: Instead of syncing the data itself, we sync metadata about tables, schemas, and relationships
  2. Remote Query Execution: Queries are executed directly on the remote database, not in SQLite
  3. Result Limiting: We limit the amount of data returned to prevent overwhelming the system
  4. Permission Enforcement: Before executing any query, we run an EXPLAIN command that analyzes the query plan without executing it. This reveals every table the query will access, allowing us to verify that they match the permissions defined in the agent's configuration. If the query attempts to access unauthorized tables, it is rejected immediately, even if the underlying database role has broader permissions. This creates a crucial security layer that keeps agents within the access scope they were granted, independent of the database's own permissions.

This approach allows us to leverage the performance and scale of the data warehouse while maintaining our unified abstraction layer.
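
Schematically, the EXPLAIN-based permission check described above reduces to something like this (the explain-parsing helper is a hypothetical stand-in for the warehouse-specific logic):

// Verify that a query only touches tables the agent is allowed to read.
// `getTablesFromExplainPlan` runs EXPLAIN on the warehouse and extracts the
// accessed table names; its implementation differs for Snowflake and BigQuery.
async function checkQueryPermissions(
  sql: string,
  allowedTables: Set<string>, // fully qualified, lowercased table names from the agent's configuration
  getTablesFromExplainPlan: (sql: string) => Promise<string[]>
): Promise<void> {
  const accessedTables = await getTablesFromExplainPlan(sql);
  const unauthorized = accessedTables.filter((t) => !allowedTables.has(t.toLowerCase()));
  if (unauthorized.length > 0) {
    throw new Error(`Query rejected: unauthorized tables: ${unauthorized.join(", ")}`);
  }
}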

Just-in-time tables: query results are also tables.

Our JIT tables feature treats query results as first-class tables, enabling powerful multi-step data analysis.

When an agent executes a query against Snowflake, Google Sheets, or any tabular data source, the results automatically become a queryable table within the conversation. This allows users to join these results with data from other sources – like a Datadog CSV they just uploaded, or metrics from another database – in subsequent queries.

By making every query result a potential input for the next operation, users can build complex analytical workflows without leaving the conversation, seamlessly bridging data across previously isolated sources.
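
One way to picture the mechanism: the rows returned by a query are materialized as a new table in the conversation's database, so that later queries can reference and join against it. A simplified sketch of that idea (not our exact implementation):

import Database from "better-sqlite3";

// Register the result rows of a previous query as a new table in the
// conversation's in-memory database, so follow-up queries can join against it.
function registerResultAsTable(
  db: Database.Database,
  tableName: string,
  rows: Record<string, unknown>[]
): void {
  if (rows.length === 0) return;
  const columns = Object.keys(rows[0]);
  // Untyped columns: SQLite's flexible typing is good enough for a result set.
  db.exec(`CREATE TABLE "${tableName}" (${columns.map((c) => `"${c}"`).join(", ")})`);
  const insert = db.prepare(
    `INSERT INTO "${tableName}" VALUES (${columns.map(() => "?").join(", ")})`
  );
  for (const row of rows) {
    insert.run(
      ...columns.map((c) => {
        const v = row[c];
        // better-sqlite3 only binds numbers, strings, bigints, buffers, and null.
        return v === null || typeof v === "number" || typeof v === "string" ? v : String(v);
      })
    );
  }
}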

The future: Salesforce and beyond.

Salesforce Integration

Our next frontier is Salesforce, which is currently in development. Salesforce presents unique challenges:

  1. Object-Oriented Data Model: Salesforce uses objects instead of tables, with relationships between objects
  2. SOQL Dialect: Salesforce uses its own query language (SOQL) that differs from standard SQL
  3. Mixed Data Types: Salesforce contains both structured data and long-form text fields

Our approach combines:

  1. Remote database querying for structured data through SOQL
  2. Semantic search for long-form text fields

Enforcing Permissions for Salesforce Queries

One of the most interesting challenges with Salesforce integration is "sandboxing" the agent’s access to the Salesforce database to respect our permission model. In our space-based permission system, users can grant access to specific Salesforce objects but not others. However, Salesforce's query language (SOQL) allows for complex relationship traversal that could potentially access unauthorized objects.

Unlike SQL databases where we can use EXPLAIN to analyze which tables a query will access, Salesforce's query plans only reveal the main object being queried, not the related objects accessed through relationship notation (like Comment.Post.content).

After exploring several approaches, including regex-based parsing and custom SOQL parsers, we settled on a JSON-based query format that gives us precise control over which objects are accessed. Instead of having the AI agent generate SOQL directly, it generates a structured JSON representation of the query:

{
  "object": "Account",
  "fields": ["Id", "Name", "Industry"],
  "where": {
    "condition": "AND",
    "filters": [
      { "field": "Industry", "operator": "=", "value": "Technology" },
      { "field": "AnnualRevenue", "operator": ">", "value": 1000000 }
    ]
  },
  "orderBy": [
    { "field": "Name", "direction": "ASC" }
  ],
  "limit": 50
}

This approach offers several advantages:

  1. Complete control: We can validate exactly which objects and fields are being accessed
  2. Easy to parse: JSON is straightforward to validate compared to parsing SOQL
  3. Future-proof: The format can be extended to support additional features

On the backend, we convert this JSON representation to SOQL before sending it to Salesforce. This approach maintains our unified abstraction layer while respecting the space-based permission model that's core to our platform.
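
A condensed version of that conversion step looks roughly like this (simplified relative to the real translator, which handles more operators, nested conditions, and relationship fields):

interface StructuredQuery {
  object: string;
  fields: string[];
  where?: {
    condition: "AND" | "OR";
    filters: { field: string; operator: string; value: string | number }[];
  };
  orderBy?: { field: string; direction: "ASC" | "DESC" }[];
  limit?: number;
}

// Convert the validated JSON query into a SOQL string.
function toSoql(q: StructuredQuery): string {
  let soql = `SELECT ${q.fields.join(", ")} FROM ${q.object}`;
  if (q.where && q.where.filters.length > 0) {
    const clauses = q.where.filters.map((f) => {
      // String values are quoted and escaped; numbers are passed through as-is.
      const value = typeof f.value === "string" ? `'${f.value.replace(/'/g, "\\'")}'` : f.value;
      return `${f.field} ${f.operator} ${value}`;
    });
    soql += ` WHERE ${clauses.join(` ${q.where.condition} `)}`;
  }
  if (q.orderBy && q.orderBy.length > 0) {
    soql += ` ORDER BY ${q.orderBy.map((o) => `${o.field} ${o.direction}`).join(", ")}`;
  }
  if (q.limit) {
    soql += ` LIMIT ${q.limit}`;
  }
  return soql;
}

// For the JSON example above, this produces:
// SELECT Id, Name, Industry FROM Account WHERE Industry = 'Technology'
//   AND AnnualRevenue > 1000000 ORDER BY Name ASC LIMIT 50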

Conclusion: the power of a unified abstraction

What makes our approach to the Query Tables agent tool powerful is the unified abstraction layer we've maintained throughout this evolution. Whether data lives in a CSV file, a Notion database, a Google Sheet, or a Snowflake data warehouse, agents interact with it through the same Query Tables tool.

This abstraction shields users from the complexity of the underlying systems and allows us to add new data sources without changing the user experience.

As we continue to evolve our Query Tables agent tool, we remain committed to this principle: making structured data accessible to AI agents, regardless of where it lives or how it's formatted. By doing so, we're enabling a new generation of agents that can not only understand natural language but also perform sophisticated data analysis, bringing the power of data to everyone.