Retrieve parsed document content

Returns parsed document content for the file.

Each document that is uploaded to CDF is OCR'ed and run through layout analysis, which detects titles, paragraphs and tables in the document. The textual content of the document is returned as a hierarchy of layout elements.

The high-level layout elements are in turn divided into lines, words and characters. Each character has a bounding box that identifies the page it appears on and its position on that page.

Bounding boxes for paragraphs, lines or words are not provided, but they can be calculated from the bounding boxes of the individual characters: take the smallest rectangle that encloses every character box on the same page.
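As an illustration of that calculation, here is a minimal sketch that merges character boxes into one enclosing box per page. The `BoundingBox` class and its field names (`page`, `x_min`, `y_min`, `x_max`, `y_max`) are assumptions made for this example; the actual field names in the response may differ, so adapt the mapping to what the endpoint returns.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical box shape; the real response fields may be named differently."""
    page: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def enclosing_boxes(char_boxes: Iterable[BoundingBox]) -> Dict[int, BoundingBox]:
    """Return, for each page, the smallest box containing all character boxes on it."""
    result: Dict[int, BoundingBox] = {}
    for box in char_boxes:
        current = result.get(box.page)
        if current is None:
            # First character seen on this page: its box is the starting box.
            result[box.page] = box
        else:
            # Grow the running box so it also covers this character.
            result[box.page] = BoundingBox(
                page=box.page,
                x_min=min(current.x_min, box.x_min),
                y_min=min(current.y_min, box.y_min),
                x_max=max(current.x_max, box.x_max),
                y_max=max(current.y_max, box.y_max),
            )
    return result


# Usage: two adjacent characters of one word on page 1 (made-up coordinates).
word_chars = [
    BoundingBox(1, 0.10, 0.20, 0.12, 0.23),
    BoundingBox(1, 0.12, 0.19, 0.14, 0.23),
]
print(enclosing_boxes(word_chars)[1])
```

The result is keyed by page because a paragraph may continue across a page break, in which case a single rectangle would not be meaningful.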

Note that this endpoint only supports PDF files for now. Support for more file types will be added in the future.

This endpoint is in alpha and is available only when a cdf-version: YYYYMMDD-alpha header is provided.

Security: oidc-token or oauth2-client-credentials or oauth2-open-industrial-data or oauth2-auth-code
Request
path Parameters
id
required
integer <int64> (CogniteInternalId) [ 1 .. 9007199254740991 ]

A server-generated ID for the object.

header Parameters
cdf-version
string

CDF version header. Use this to specify the requested CDF release.

Example: alpha
Responses
200

Parsed textual content for the document

400

The response for a failed request.

GET /documents/{id}/elements
Request samples
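A minimal Python sketch of building this request with the standard library. The base URL, the bearer-token `Authorization` header, and the function name `build_elements_request` are assumptions made for illustration and are not taken from this page; only the path, the `cdf-version` header, and the allowed `id` range come from the reference above.

```python
import urllib.request

MAX_ID = 9007199254740991  # upper bound of the documented id range


def build_elements_request(base_url, document_id, token, cdf_version):
    """Build (but do not send) a GET request for /documents/{id}/elements."""
    if not isinstance(document_id, int) or not 1 <= document_id <= MAX_ID:
        raise ValueError(f"id must be an integer in [1, {MAX_ID}]")
    url = f"{base_url.rstrip('/')}/documents/{document_id}/elements"
    headers = {
        # Assumption: the token is passed as a bearer token.
        "Authorization": f"Bearer {token}",
        # Required while the endpoint is in alpha.
        "cdf-version": cdf_version,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Usage with placeholder values; the host and project path are hypothetical.
request = build_elements_request(
    "https://example.invalid/api/v1/projects/my-project",
    document_id=1234,
    token="<access token>",
    cdf_version="alpha",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then parse the body with json.load.
```

Validating the `id` before sending avoids a round trip that would otherwise end in a 400 response.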
Response samples
application/json
{
  "elements": [
    { }
  ]
}