How docAnalyzer reads your documents

The mental model behind references and dataset answers: enough to phrase better questions.

You don't have to know how docAnalyzer reads your documents to use it. But a few minutes of mental model makes it much easier to phrase questions that get specific, well-cited answers.

Files become pivots

When you upload a file, docAnalyzer keeps the original and also converts it to a pivot format that's easier to search and quote. The pivot depends on the document's category:

Category    Examples                          Pivot
Paginated   PDF, DOCX, PPTX, EPUB             Page-aware text
Flowable    Markdown, HTML, plain text, RTF   Structured prose with headings
Schematic   CSV, JSON, XML, XLSX              Tabular data with sheet/path addresses

You don't see the pivot directly. You see the original in the viewer. The pivot is what the chat reads.
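
To make the category-to-pivot mapping concrete, here is a minimal sketch that looks up a pivot from a filename. It assumes classification by file extension, which is a guess for illustration only; docAnalyzer may well inspect file contents instead. The names `CATEGORY_BY_EXTENSION`, `PIVOT_BY_CATEGORY`, and `pivot_for` are hypothetical and not part of any docAnalyzer API.

```python
from pathlib import Path

# Hypothetical lookup tables derived from the category table above.
# Assumption: the category follows from the file extension alone.
CATEGORY_BY_EXTENSION = {
    ".pdf": "paginated", ".docx": "paginated",
    ".pptx": "paginated", ".epub": "paginated",
    ".md": "flowable", ".html": "flowable",
    ".txt": "flowable", ".rtf": "flowable",
    ".csv": "schematic", ".json": "schematic",
    ".xml": "schematic", ".xlsx": "schematic",
}

PIVOT_BY_CATEGORY = {
    "paginated": "page-aware text",
    "flowable": "structured prose with headings",
    "schematic": "tabular data with sheet/path addresses",
}


def pivot_for(filename: str) -> str:
    """Return the pivot description for a filename, or raise if unknown."""
    ext = Path(filename).suffix.lower()
    category = CATEGORY_BY_EXTENSION.get(ext)
    if category is None:
        raise ValueError(f"no known category for {filename!r}")
    return PIVOT_BY_CATEGORY[category]
```

For example, `pivot_for("Lease.PDF")` resolves to the paginated pivot, while an unlisted extension raises an error rather than guessing.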

Pivots become chunks

The pivot gets split into chunks: overlapping segments of content, each sized for retrieval. Each chunk carries an address back to its origin: a page range for paginated documents, a heading path for flowable ones, a sheet/row/column locator for schematic ones.
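
The idea of overlapping chunks that remember where they came from can be sketched for the paginated case as follows. This is an illustration under stated assumptions, not docAnalyzer's actual chunker: the word-based window, the `size` and `overlap` defaults, and the `Chunk` type are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    first_page: int  # address back to the origin: first page covered
    last_page: int   # last page covered (a chunk may span a page break)


def chunk_pages(pages: list[str], size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split page texts into overlapping word windows, tracking page ranges."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    # Flatten to (word, page_number) pairs so every window knows its origin.
    words = [(w, n) for n, page in enumerate(pages, start=1) for w in page.split()]
    chunks: list[Chunk] = []
    step = size - overlap  # consecutive windows share `overlap` words
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(Chunk(
            text=" ".join(w for w, _ in window),
            first_page=window[0][1],
            last_page=window[-1][1],
        ))
        if start + size >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

Because windows overlap, a sentence that straddles a chunk boundary still appears whole in at least one chunk, and the stored page range is what lets a citation resolve back to a page.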

This is why citations land where they do. Click a page-12 citation in the chat and the viewer scrolls to page 12, because the chunk that grounded the answer came from page 12. (API consumers see the underlying citation as a PAGE!12 token in the response body.)
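
If you consume answers through the API, you may want to pull the cited pages out of the response text. The sketch below assumes page citations appear inline exactly in the `PAGE!12` form shown above; the token formats for flowable and schematic documents are not covered here, and `cited_pages` is a hypothetical helper, not part of the docAnalyzer API.

```python
import re

# Assumption: a page citation is the literal text "PAGE!" followed by digits.
PAGE_TOKEN = re.compile(r"PAGE!(\d+)")


def cited_pages(answer: str) -> list[int]:
    """Return the distinct cited page numbers in order of first appearance."""
    seen: list[int] = []
    for match in PAGE_TOKEN.finditer(answer):
        page = int(match.group(1))
        if page not in seen:
            seen.append(page)
    return seen
```

An empty result for an answer that makes factual claims is a useful signal to ask for sources.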

Chunks become a search index

The chunks are indexed for two kinds of retrieval:

  • Semantic search: meaning-based. A query like "renewal terms" finds chunks that talk about expiration, renewal periods, auto-rollover, even if the word "renewal" doesn't appear.
  • Text search: exact-match. A query like "§12.3(b)(ii)" finds the literal string.

When a Focus answers a question, the chat engine decides which kind of search fits, runs it, reads the relevant chunks, and composes an answer grounded in what came back. Citations point to the chunks the answer drew from.
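
How the chat engine chooses between the two kinds of search is not documented here. As a rough intuition only, the toy heuristic below routes queries that look like literal strings (section markers, quoted phrases, identifiers mixing digits and punctuation) to exact-match search and everything else to semantic search. The pattern and the function names are invented for this example.

```python
import re

# Hypothetical signal that the user wants a literal match: a section sign,
# a double-quoted phrase, or an identifier with digits and punctuation.
LITERAL_HINT = re.compile(r'§|"[^"]+"|\b\w*\d+(?:[.()\-]\w*)+')


def choose_search(query: str) -> str:
    """Return "text" for literal-looking queries, otherwise "semantic"."""
    return "text" if LITERAL_HINT.search(query) else "semantic"


def text_search(query: str, chunks: list[str]) -> list[int]:
    """Exact-match search: indexes of chunks containing the literal query."""
    return [i for i, chunk in enumerate(chunks) if query in chunk]
```

The practical takeaway is the same as in the bullets above: quote the exact string when you want a literal hit, and describe the idea in plain words when you want meaning-based matches.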

Your library can be bigger than the model

Modern chat models have a context window: the amount of text they can hold in mind during a single turn. State-of-the-art models top out around 150,000 words.

Your library is allowed to be many times bigger than that. docAnalyzer builds a compact map of every document (title, table of contents, category, language, length) and uses that map as the starting point for retrieval. Search runs in rounds: the chat pulls passages, reads them, decides whether the question is fully answered, and runs more searches with sharper queries if it isn't. The model never sees the whole library; it sees a working set that grows and refines as the answer takes shape.
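
The round-based loop described above can be sketched in a few lines. This is a simplified illustration, not the actual engine: `search`, `is_answered`, and `refine` are hypothetical callables standing in for the retrieval index and the model's own judgment, the document map is left out, and the `max_rounds` limit is an assumption.

```python
from typing import Callable


def retrieve_in_rounds(
    question: str,
    search: Callable[[str], list[str]],
    is_answered: Callable[[str, list[str]], bool],
    refine: Callable[[str, list[str]], str],
    max_rounds: int = 4,
) -> list[str]:
    """Grow a working set of passages until the question looks answered."""
    working_set: list[str] = []
    query = question
    for _ in range(max_rounds):
        # Pull passages for the current query, skipping ones already held.
        for passage in search(query):
            if passage not in working_set:
                working_set.append(passage)
        # Stop as soon as the gathered passages are judged sufficient.
        if is_answered(question, working_set):
            break
        # Otherwise sharpen the query using what has been read so far.
        query = refine(question, working_set)
    return working_set
```

The point of the design is that the working set stays small regardless of library size: only the passages returned by each round are ever read.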

A research library of hundreds of long PDFs is fine. A few thousand contracts is fine. The same retrieval loop runs at every size, and citations still land on the exact page or section.

What this means for your questions

A few practical consequences:

  • The model doesn't memorize your documents. It searches them at answer time. So large datasets work well: the question pulls in the relevant chunks, not the whole library.
  • Specific phrasing helps retrieval. "What's the auto-renewal clause?" pulls in better chunks than "tell me about the contract."
  • Citations are not optional. Every claim grounded in your sources should carry a citation token. If an answer doesn't cite, treat it with caution and ask for sources.
  • OCR matters for scans. A scanned page that isn't OCR'd well produces low-quality chunks. See Automatic OCR and Enhanced OCR.
