Retrieve parsed document content

Returns parsed document content for the file.

Each document uploaded to CDF goes through OCR and layout analysis, which detects titles, paragraphs, and tables in the document. The textual content of the document is returned as a hierarchy of layout elements.

The high-level layout elements are further divided into lines and words. Each word has a bounding box that identifies the page it appears on and its position on that page.

Bounding boxes for paragraphs and lines are not provided, but they can be calculated as the smallest rectangle that encloses the bounding boxes of the individual words.
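As a minimal sketch of that calculation, the snippet below merges word boxes into one enclosing box for a line or paragraph. The response schema is not shown in this section, so the `BoundingBox` type and its field names (`page`, `x_min`, `y_min`, `x_max`, `y_max`) are hypothetical and would need to be mapped onto the fields the endpoint actually returns.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical shape: the real field names are not given in this section.
    page: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def enclosing_box(word_boxes: Iterable[BoundingBox]) -> BoundingBox:
    """Return the smallest box that contains every word box on one page."""
    boxes = list(word_boxes)
    if not boxes:
        raise ValueError("at least one word box is required")
    if len({box.page for box in boxes}) != 1:
        # A paragraph that continues on the next page needs one box per page.
        raise ValueError("word boxes span several pages")
    return BoundingBox(
        page=boxes[0].page,
        x_min=min(box.x_min for box in boxes),
        y_min=min(box.y_min for box in boxes),
        x_max=max(box.x_max for box in boxes),
        y_max=max(box.y_max for box in boxes),
    )
```

Applying `enclosing_box` to the words of a line gives the line box; applying it to the words of all lines in a paragraph gives the paragraph box.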

Note that this endpoint only supports PDF files for now. Support for more file types will be added in the future.

This endpoint is in alpha and is available only when a cdf-version: YYYYMMDD-alpha header is provided.

Security: oidc-token or oauth2-client-credentials or oauth2-open-industrial-data or oauth2-auth-code
Request
header Parameters
cdf-version
string

CDF version header. Use this to specify the requested CDF release.

Example: alpha
Request Body schema: application/json
required

Fields to be set for the content request.

One of:
id
required
integer <int64> (CogniteInternalId) [ 1 .. 9007199254740991 ]

A server-generated ID for the object.

granularity
string
Default: "WORDS"

Adjust the level of detail in the response.

Enum: "WORDS" "ELEMENTS" "LINES"
Responses
200

Parsed textual content for the document

400

The response for a failed request.

POST /documents/elements
Request samples
application/json
{
  "id": 1,
  "granularity": "WORDS"
}
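The request above could be assembled in Python roughly as follows. This is a sketch under stated assumptions, not an official client: `build_elements_request` is a hypothetical helper, the base URL is a placeholder supplied by the caller, and the bearer-token `Authorization` header is an assumption based on the OIDC/OAuth2 security schemes listed for this endpoint. The function only builds the request and does not send it.

```python
import json
import urllib.request

MAX_INTERNAL_ID = 9007199254740991  # upper bound documented for CogniteInternalId
GRANULARITIES = ("WORDS", "ELEMENTS", "LINES")


def build_elements_request(
    base_url: str,
    token: str,
    file_id: int,
    granularity: str = "WORDS",
    cdf_version: str = "alpha",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /documents/elements."""
    if not 1 <= file_id <= MAX_INTERNAL_ID:
        raise ValueError("id must be between 1 and 9007199254740991")
    if granularity not in GRANULARITIES:
        raise ValueError(f"granularity must be one of {GRANULARITIES}")
    body = json.dumps({"id": file_id, "granularity": granularity}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/documents/elements",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is passed as a bearer token.
            "Authorization": f"Bearer {token}",
            # Required while the endpoint is in alpha.
            "cdf-version": cdf_version,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request; validating `id` and `granularity` locally surfaces the most likely causes of a 400 response before any network call.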
Response samples
application/json
{
  "elements": [
    { }
  ]
}