Extract structured data

Pull the same fields from every document into a single spreadsheet. Define a schema, point at a label, get an XLSX.

Data Extractor turns a pile of similar documents into one spreadsheet. You define a schema (the fields you want), and Data Extractor reads each document, fills in its row, and produces an XLSX with one row per document and one column per field.

What you get

A single XLSX. One row per document. One column per field you specified. A "source" column linking each row back to the original document (and to citations for the extracted values).

Run it

  1. Build the dataset (label, multi-select, or Smart Search).
  2. Open the workflow runner and pick Data Extractor.
  3. Define the schema (see "Define the schema" below).
  4. Click Run. Per-document extractions run in parallel; the XLSX assembles when the last one finishes.
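
The run model in the last step (fan out one extraction per document, assemble the sheet once every extraction has finished) can be pictured with a short Python sketch. It illustrates the pattern only and is not Data Extractor's implementation: `extract_fields` is a hypothetical stand-in for the per-document extraction, and the document contents are made up.

```python
from concurrent.futures import ThreadPoolExecutor

FIELDS = ["counterparty", "term_months"]


def extract_fields(doc_name: str, text: str) -> dict:
    """Hypothetical stand-in for one per-document extraction."""
    row = {"source": doc_name}
    for field in FIELDS:
        # Toy rule: look for "field: value" lines; leave the cell blank otherwise.
        row[field] = None
        for line in text.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() == field:
                row[field] = value.strip()
    return row


def run_extraction(documents: dict[str, str]) -> list[dict]:
    """Extract every document in parallel, then assemble rows in input order."""
    names = list(documents)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order; list() blocks until all are done.
        rows = list(pool.map(lambda n: extract_fields(n, documents[n]), names))
    return rows


documents = {
    "acme.txt": "counterparty: Acme Corp\nterm_months: 24",
    "globex.txt": "counterparty: Globex",
}
rows = run_extraction(documents)
```

Here `rows` holds one dictionary per document, with `None` where a field was not found, which mirrors the one-row-per-document shape of the finished XLSX.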

Define the schema

Two ways:

Plain-English description. Write what you want, one field per line:

counterparty name
effective date
term length in months
auto-renewal clause: yes or no
governing law
total contract value in USD

Data Extractor reads the description, decides the column types, and fills in the spreadsheet.

JSON schema. For tighter control over types, names, and validation:

{
  "counterparty": "string",
  "effective_date": "date",
  "term_months": "integer",
  "auto_renewal": "boolean",
  "governing_law": "string",
  "value_usd": "number"
}

JSON schema is useful when extracted values feed downstream automation that expects specific types.
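
When extracted values feed typed downstream code, it can help to check each row against the schema before using it. The sketch below is one possible approach, not part of Data Extractor. It assumes the rows have already been loaded as dictionaries of raw cell strings (how you read the XLSX is up to you), that dates arrive in ISO format, and that booleans arrive as yes/no or true/false. None of these cell formats are documented here, so treat them as assumptions. `coerce_row` and `parse_bool` are hypothetical helpers.

```python
import json
from datetime import date

# The type map from the JSON schema example above.
SCHEMA = json.loads("""
{
  "counterparty": "string",
  "effective_date": "date",
  "term_months": "integer",
  "auto_renewal": "boolean",
  "governing_law": "string",
  "value_usd": "number"
}
""")


def parse_bool(raw: str) -> bool:
    text = raw.strip().lower()
    if text in ("yes", "true"):
        return True
    if text in ("no", "false"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")


CONVERTERS = {
    "string": str,
    "date": date.fromisoformat,  # assumes ISO dates such as 2024-03-01
    "integer": int,
    "boolean": parse_bool,
    "number": float,
}


def coerce_row(row: dict, schema: dict = SCHEMA) -> tuple[dict, list[str]]:
    """Return (typed_row, errors). Blank cells stay None rather than being guessed."""
    typed, errors = {}, []
    for field, type_name in schema.items():
        raw = row.get(field)
        if raw is None or str(raw).strip() == "":
            typed[field] = None
            continue
        try:
            typed[field] = CONVERTERS[type_name](str(raw).strip())
        except ValueError as exc:
            typed[field] = None
            errors.append(f"{field}: {exc}")
    return typed, errors


typed, errors = coerce_row({
    "counterparty": "Acme Corp",
    "effective_date": "2024-03-01",
    "term_months": "24",
    "auto_renewal": "yes",
    "governing_law": "",
    "value_usd": "not stated",
})
```

In this example the blank `governing_law` cell stays `None`, and the unparseable `value_usd` cell is reported in `errors` instead of being silently converted.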

Tips for good extractions

  • Field names matter. "Date" is ambiguous; "effective date" or "signing date" is not.
  • Specify units. "Term length in months" extracts cleanly; "term length" might come back in years.
  • Add an "unknown handling" hint. Use custom instructions like "leave fields blank if the document doesn't specify; never guess."
  • Test with a few documents first. Run Data Extractor on three sources before scaling to thirty. Adjust the schema based on what came back.
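
One way to act on the last tip is to measure how often each field came back blank in the pilot run before scaling up. The sketch assumes the pilot rows are already loaded as dictionaries, with `None` or an empty string for blank cells; `blank_rates`, the 50% threshold, and the sample rows are hypothetical.

```python
def blank_rates(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of rows in which each field is blank (None or empty string)."""
    if not rows:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        blanks = sum(1 for row in rows if row.get(field) in (None, ""))
        rates[field] = blanks / len(rows)
    return rates


pilot_rows = [
    {"source": "a.pdf", "effective_date": "2024-03-01", "term_months": 24},
    {"source": "b.pdf", "effective_date": "2023-11-15", "term_months": None},
    {"source": "c.pdf", "effective_date": "", "term_months": None},
]
rates = blank_rates(pilot_rows, ["effective_date", "term_months"])

# Fields blank in more than half of the pilot documents are candidates for rewording.
needs_attention = [field for field, rate in rates.items() if rate > 0.5]
```

A field with a high blank rate is not necessarily badly specified (the documents may simply not contain it), so check a few source documents before changing the schema.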

What comes back

Each extracted value carries citation metadata pointing back to its location in the source. You can verify any cell by clicking through to the citation (in supported viewers) or by opening the document linked in that row's "source" column.

If a field can't be extracted confidently from a document, the cell is left blank (or marked, depending on your instructions). Data Extractor doesn't guess.

Common patterns

  • Contract intake: counterparty, term, renewal, termination, governing law, value.
  • Research synthesis: authors, year, dataset size, methodology, main finding.
  • Resume screening: name, last role, years of experience, key skills, education.
  • Compliance audit: control owner, last review date, evidence linked, status.
