Documents

A document is a file that has been indexed by the document search engine. Whenever a file is uploaded, updated or deleted in the Files API, it is also scheduled for processing by the document search engine. Once processing has finished, the file can be found through the document search API.

The document search engine is able to extract content from a variety of document types, and perform classification, contextualization and other operations on the file. This extracted and derived information is made available in the form of a Document object.

The document structure consists of a selection of derived fields, such as the title, author and language of the document, plus some of the original fields from the raw file. The fields from the raw file can be found in the sourceFile structure. The derived fields are described in more detail below.

Derived fields

title

Some document types (such as PDFs) contain additional metadata fields. If the document contains its title as part of this metadata, this field will be populated with that title.

Note that we do not currently extract the title from the document content itself. If there is a need for this, we may consider adding such functionality in the future.

author

Like the title, the author can often be extracted from the document's metadata.

producer

The producer field also exists in the document metadata. It contains information about the software or the system that was used to create the document.

createdTime

The createdTime we assign to the document is not exactly the same as the one found in the Files API. We first try to extract the created time from the document metadata. If the document does not contain such a timestamp, we fall back to the time set in the Files API.
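
The fallback described above can be sketched as follows. This is only an illustration of the rule, not the service's implementation; the function name, the parameter names and the use of integer timestamps are assumptions made for the example.

```python
from typing import Optional


def resolve_created_time(metadata_created: Optional[int], files_api_created: int) -> int:
    """Pick the createdTime for a document (hypothetical helper).

    metadata_created: timestamp found in the document metadata, or None if absent.
    files_api_created: timestamp set on the file in the Files API.
    """
    # Prefer the timestamp embedded in the document itself.
    if metadata_created is not None:
        return metadata_created
    # Otherwise fall back to the Files API value.
    return files_api_created
```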

mimeType

If there is a mime type set on the file in the Files API, this field will be set to the same mime type. If there is no mime type set on the file, we will try to auto-detect it.
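
A client-side approximation of this rule is sketched below. How the service auto-detects the mime type is not described here, so guessing from the file name with Python's standard mimetypes module is a stand-in, and the function name is hypothetical.

```python
import mimetypes
from typing import Optional


def resolve_mime_type(file_name: str, declared: Optional[str]) -> Optional[str]:
    """Return the declared mime type, or a guess when none is set (hypothetical helper)."""
    # An explicitly set mime type always wins.
    if declared:
        return declared
    # Stand-in for auto-detection: guess from the file name only.
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed
```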

extension

This field contains the extension of the file, derived from the file name. For instance, if the file name is My Document.docx, the extension field will contain docx.
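
As a rough sketch, the same derivation can be reproduced with the standard library. The service's exact rules (for example how it treats case or file names with several dots) are not stated here, so this version simply takes the text after the last dot; the function name is an assumption.

```python
from pathlib import PurePosixPath


def file_extension(file_name: str) -> str:
    """Return the part of the file name after the last dot, without the dot."""
    # suffix is ".docx" for "My Document.docx" and "" when there is no extension.
    return PurePosixPath(file_name).suffix.lstrip(".")
```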

pageCount

The number of pages in the document, if it can be determined.

type

The type field contains a high-level file type, derived from the mime type. Mime types are verbose and not always easy to interpret, so we map them to more user-friendly types. The types currently returned are listed below, but be aware that this list may be extended in the future.

  • Document: Document files from Microsoft Word or similar word processing software.
  • PDF: PDF files.
  • Spreadsheet: Files from Microsoft Excel or similar spreadsheet software.
  • Presentation: Slides from Microsoft PowerPoint or similar.
  • Image: Any kind of image such as PNG or JPG files.
  • Video: Any kind of video such as MOV or MP4 files.
  • Tabular data: CSV, TSV and other kinds of tabular data files.
  • Plain text: Plain text files.
  • Compressed: ZIP files and other kinds of compressed archive files.
  • Script: Program code such as Python or MATLAB.
  • Other: Anything that doesn't fit in any of the above types.
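
To illustrate the idea of mapping mime types to these labels, here is a minimal sketch. The actual mapping table used by the service is not given here, so the entries below are assumptions covering a few common mime types, and the function name is hypothetical. Because the list of types may grow, client code should not assume the set is closed.

```python
# Illustrative entries only; the service's real table is not documented here.
EXACT_TYPES = {
    "application/pdf": "PDF",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Spreadsheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "Presentation",
    "text/csv": "Tabular data",
    "text/plain": "Plain text",
    "application/zip": "Compressed",
    "text/x-python": "Script",
}

# Whole mime type families that share one label.
PREFIX_TYPES = (("image/", "Image"), ("video/", "Video"))


def document_type(mime_type: str) -> str:
    """Map a mime type to a user-friendly type label (hypothetical helper)."""
    mime_type = mime_type.lower()
    if mime_type in EXACT_TYPES:
        return EXACT_TYPES[mime_type]
    for prefix, label in PREFIX_TYPES:
        if mime_type.startswith(prefix):
            return label
    # Anything unrecognized falls into the catch-all type.
    return "Other"
```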

assetIds

The assetIds field contains both the asset ids that are directly assigned to the file in the Files API and the ids of additional assets whose names are mentioned in the file itself.

If you want only the asset ids that are explicitly assigned to the file, you can use the sourceFile.assetIds property instead.
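
The difference between the two fields can be used to tell explicit assignments apart from detected ones. The sketch below assumes the Document object has been parsed into a Python dict with the keys assetIds and sourceFile.assetIds; the function name is hypothetical.

```python
from typing import List, Tuple


def split_asset_ids(document: dict) -> Tuple[List[int], List[int]]:
    """Return (explicitly assigned ids, ids found only by detection)."""
    explicit = set(document.get("sourceFile", {}).get("assetIds", []))
    combined = set(document.get("assetIds", []))
    # Ids in the combined field that were not assigned on the source file
    # must have come from mentions in the file content.
    return sorted(explicit), sorted(combined - explicit)
```

For example, a document with assetIds [1, 2, 3] and sourceFile.assetIds [2] yields [2] as the explicit ids and [1, 3] as the detected ones.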

labels

If there are labels assigned to the file in the Files API, this field will contain the same set of labels.

For files without explicitly assigned labels, it is possible to train an AI classifier to assign labels automatically. This can be used, for instance, to classify documents into different categories.

geoLocation

If there is a geolocation set on the file in the Files API, this field will contain the same geolocation. If there is no explicitly assigned geolocation, the document processing system will try to detect a location using two different techniques:

  1. We will extract locations from files that contain embedded GPS locations. Photos and videos often have this kind of metadata.
  2. We will look at related assets that have locations, and assign the same location(s) to the document.

File type support

We create a document for each uploaded file, but only derive data from certain files.

The following file types are eligible for further data extraction and enrichment:

  • PDF files
  • Spreadsheets, documents, and presentations from the Microsoft, LibreOffice and macOS office suites
  • Plain text files
  • Images