A document is a file that has been indexed by the document search engine. Every time a file is uploaded, updated or deleted in the Files API, it will also be scheduled for processing by the document search engine. After some processing, it will be possible to search for the file in the document search API.
The document search engine is able to extract content from a variety of document
types, and perform classification, contextualization and other operations on the
file. This extracted and derived information is made available in the form of a
Document object.
The document structure consists of a selection of derived fields, such as the
title, author and language of the document, plus some of the original fields
from the raw file. The fields from the raw file can be found in the sourceFile
structure. The derived fields are described in more detail below.
Some document types (such as PDFs) contain additional metadata fields. If the document contains its title as part of this metadata, the title field will be populated with that title.
Note that we do not currently extract the title from the document content itself. If there is a need for this, we may consider adding such functionality in the future.
Similar to the title field, the author field can often be extracted from the
document's metadata.
The producer field is also taken from the document metadata. It contains information
about the software or the system that was used to create the document.
The createdTime we assign to the document is not necessarily the same as the one found
in the Files API. We first try to extract the created time from the document metadata.
If the document does not contain such a timestamp, we fall back to the time set in
the Files API.
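The fallback order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual implementation: the function name resolve_created_time and both parameter names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_created_time(
    metadata_created: Optional[datetime],
    files_api_created: datetime,
) -> datetime:
    """Prefer the timestamp embedded in the document metadata.

    Falls back to the created time recorded in the Files API when the
    document itself carries no such timestamp. All names are hypothetical.
    """
    if metadata_created is not None:
        return metadata_created
    return files_api_created


files_api_time = datetime(2021, 3, 1, tzinfo=timezone.utc)
embedded_time = datetime(2019, 6, 15, tzinfo=timezone.utc)

# A document with an embedded timestamp keeps that timestamp.
with_metadata = resolve_created_time(embedded_time, files_api_time)
# A document without one inherits the Files API time.
without_metadata = resolve_created_time(None, files_api_time)
```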
If there is a mime type set on the file in the Files API, this field will be set to the same mime type. If there is no mime type set on the file, we will try to auto-detect it.
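The text does not say how the auto-detection works. As a rough sketch of the same "explicit value first, detection second" pattern, the example below guesses the mime type from the file name using Python's standard mimetypes module. That guessing strategy, the function name resolve_mime_type and its parameters are all assumptions made for illustration.

```python
import mimetypes
from typing import Optional


def resolve_mime_type(explicit_mime_type: Optional[str], file_name: str) -> Optional[str]:
    """Return the explicitly set mime type, or a guess when none is set.

    Guessing from the file name is an assumed stand-in for the engine's
    auto-detection, which the documentation does not describe.
    """
    if explicit_mime_type:
        return explicit_mime_type
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed  # None when the name gives no usable hint


explicit = resolve_mime_type("application/pdf", "report.bin")
detected = resolve_mime_type(None, "report.pdf")
unknown = resolve_mime_type(None, "no_extension")
```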
This field contains the extension of the file, derived from the file name. For
instance, if the file name is My Document.docx, the extension field will contain
docx.
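Deriving the extension from a file name can be sketched with Python's pathlib. The helper name file_extension is hypothetical, and details such as case handling or multi-part extensions are not specified in the text, so this sketch simply returns the final suffix unchanged.

```python
from pathlib import PurePath


def file_extension(file_name: str) -> str:
    """Return the final extension without the leading dot, or "" if none."""
    return PurePath(file_name).suffix.lstrip(".")


docx_extension = file_extension("My Document.docx")
no_extension = file_extension("README")
```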
Contains the number of pages in the document, if possible to determine.
The type field contains a high level file type, derived from the mime type. Mime
types are not that pleasant to look at, and not always easy to understand. That is
why we map the mime types into more user-friendly types. Below is the list of types
currently returned, but be aware that this list may be extended in the future.
Document: Document files from Microsoft Word or similar word processing software.
PDF: PDF files.
Spreadsheet: Files from Microsoft Excel or similar spreadsheet software.
Presentation: Slides from Microsoft PowerPoint or similar.
Image: Any kind of image such as PNG or JPG files.
Video: Any kind of video such as MOV or MP4 files.
Tabular data: CSV, TSV and other kinds of tabular data files.
Plain text: Plain text files.
Compressed: ZIP files and other kinds of compressed archive files.
Script: Program code such as Python or MATLAB.
Other: Anything that doesn't fit in any of the above types.

The assetIds field contains a combination of asset ids that are directly assigned
to the file in the Files API, and additional assets whose names are mentioned in the
file itself.
If you want only the asset ids that are explicitly assigned to the file, you can use
the sourceFile.assetIds property instead.
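The relationship between the two properties can be pictured as a set union. The sketch below is illustrative only: the function name combined_asset_ids, the integer id type and the sorted output order are assumptions, and it does not show how the engine detects asset names in the file content.

```python
from typing import Iterable, List


def combined_asset_ids(
    explicit_ids: Iterable[int],
    mentioned_ids: Iterable[int],
) -> List[int]:
    """Merge explicitly assigned and detected asset ids without duplicates.

    explicit_ids plays the role of sourceFile.assetIds; mentioned_ids stands
    for assets whose names were found in the file. Sorting is only for a
    stable, readable result.
    """
    return sorted(set(explicit_ids) | set(mentioned_ids))


document_asset_ids = combined_asset_ids([3, 1], [1, 7])
```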
If there are labels assigned to the file in the Files API, this field will contain the same set of labels.
For files without explicitly assigned labels, it is possible to train an AI classifier to automatically assign labels. This can be used for instance to classify documents into different categories.
If there is a geolocation set on the file in the Files API, then this field will contain the same geolocation. If there is no explicitly assigned geolocation, the document processing system will try to detect a location using two different techniques;
We create a document for each uploaded file, but only derive data from certain files.
The following file types are eligible for further data extraction & enrichment: