Cognite Data Fusion API

Retrieve parsed document content

Returns parsed document content for the file.

Each document uploaded to CDF is processed with OCR and layout analysis, which detects titles, paragraphs, and tables in the document. The textual content of the document is returned as a hierarchy of layout elements.

The high-level layout elements are further divided into lines, words, and characters. Each character has a bounding box that identifies the page it appears on and its position on that page.

Bounding boxes for paragraphs, lines or words are not provided, but can be easily calculated from the bounding boxes on the individual characters.
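The calculation described above can be sketched in Python. This is a minimal illustration, not the documented schema: the response sample on this page does not show what a character object looks like, so the field names `page`, `boundingBox`, `xMin`, `yMin`, `xMax`, and `yMax` are assumptions and should be checked against a real response. The helper names are hypothetical as well.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# ASSUMPTION: each character is shaped like
#   {"page": 1, "boundingBox": {"xMin": .., "yMin": .., "xMax": .., "yMax": ..}}
# These field names are hypothetical; verify them against a real response.
Box = Dict[str, float]


def union_box(boxes: List[Box]) -> Box:
    """Return the smallest box that encloses every box in a non-empty list."""
    if not boxes:
        raise ValueError("at least one bounding box is required")
    return {
        "xMin": min(b["xMin"] for b in boxes),
        "yMin": min(b["yMin"] for b in boxes),
        "xMax": max(b["xMax"] for b in boxes),
        "yMax": max(b["yMax"] for b in boxes),
    }


def boxes_per_page(characters: Iterable[Optional[dict]]) -> Dict[int, Box]:
    """Group character boxes by page and merge each group into one box."""
    by_page: Dict[int, List[Box]] = defaultdict(list)
    for ch in characters:
        if ch is None:  # skip empty placeholders
            continue
        by_page[ch["page"]].append(ch["boundingBox"])
    return {page: union_box(boxes) for page, boxes in by_page.items()}


def word_box(word: dict) -> Dict[int, Box]:
    """Bounding box of a word, keyed by page."""
    return boxes_per_page(word.get("characters") or [])


def line_box(line: dict) -> Dict[int, Box]:
    """Bounding box of a line, computed from all characters of all its words."""
    return boxes_per_page(
        ch for w in (line.get("words") or []) for ch in (w.get("characters") or [])
    )
```

Results are keyed by page so that a line or paragraph that continues onto a second page yields one box per page instead of a single box that mixes coordinates from both.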

Note that this endpoint only supports PDF files for now. Support for more file types will be added in the future.

This endpoint is in alpha and is available only when a cdf-version: YYYYMMDD-alpha header is provided.

Security: oidc-token or oauth2-client-credentials or oauth2-open-industrial-data or oauth2-auth-code

Request

path Parameters

id (required)

integer <int64> (CogniteInternalId) [ 1 .. 9007199254740991 ]

A server-generated ID for the object.

header Parameters

cdf-version

string

CDF version header. Use this to specify the requested CDF release.

Example: alpha

Responses

200

Parsed textual content for the document

400

The response for a failed request.

GET /documents/{id}/elements
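As an illustration of how the path and header parameters fit together, the following Python sketch builds the request without sending it. It uses only the standard library. The `base_url` argument, the bearer-token `Authorization` header, and the function name `build_elements_request` are assumptions made for this example; substitute the base URL and authentication scheme that apply to your project.

```python
import urllib.request

# Upper bound of CogniteInternalId as stated in the parameter description.
MAX_INTERNAL_ID = 9007199254740991


def build_elements_request(
    base_url: str,
    document_id: int,
    token: str,
    cdf_version: str = "alpha",
) -> urllib.request.Request:
    """Build (but do not send) a GET request for /documents/{id}/elements.

    ASSUMPTIONS: `base_url` is whatever prefix your project uses in front of
    /documents, and `token` is sent as a bearer token. Neither is specified
    in this section.
    """
    if not 1 <= document_id <= MAX_INTERNAL_ID:
        raise ValueError(f"id must be in [1 .. {MAX_INTERNAL_ID}]")
    url = f"{base_url.rstrip('/')}/documents/{document_id}/elements"
    headers = {
        "cdf-version": cdf_version,  # required while the endpoint is in alpha
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

The returned object can be passed to `urllib.request.urlopen` to perform the call. Validating the ID range locally avoids a round trip that would otherwise end in a 400 response.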

Request samples

curl

Response samples

application/json

{
  "elements": [
    {
      "type": "title",
      "level": 1,
      "lines": [
        {
          "words": [
            {
              "characters": [
                null
              ]
            }
          ]
        }
      ]
    }
  ]
}
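The nesting in the sample above (elements, then lines, then words, then characters) can be walked with a few nested loops. The Python sketch below relies only on the keys visible in the sample and counts the nodes at each level; the function name `count_layout` is made up for this illustration, and `null` character entries, as in the sample, are skipped.

```python
import json
from typing import Dict


def count_layout(response: dict) -> Dict[str, int]:
    """Count elements, lines, words, and non-null characters in a response."""
    counts = {"elements": 0, "lines": 0, "words": 0, "characters": 0}
    for element in response.get("elements") or []:
        counts["elements"] += 1
        for line in element.get("lines") or []:
            counts["lines"] += 1
            for word in line.get("words") or []:
                counts["words"] += 1
                # Entries may be null, as in the sample, so filter them out.
                counts["characters"] += sum(
                    1 for ch in (word.get("characters") or []) if ch is not None
                )
    return counts


# The response sample from this page, parsed as JSON.
sample = json.loads(
    '{"elements": [{"type": "title", "level": 1,'
    ' "lines": [{"words": [{"characters": [null]}]}]}]}'
)
summary = count_layout(sample)
```

The same loop structure is a reasonable starting point for other traversals, such as collecting all elements of a given `type`.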
