Cognite Data Fusion API

Retrieve parsed document content

Returns parsed document content for the file.

Each document that is uploaded to CDF is OCR'ed and run through layout analysis, which detects titles, paragraphs and tables in the document. The textual content of the document is returned as a hierarchy of layout elements.

The high-level layout elements are in turn divided into lines and words. Each word has a bounding box that identifies the page it appears on and its position on that page.

Bounding boxes for paragraphs or lines are not provided, but they can be calculated from the bounding boxes of the individual words.
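As a minimal sketch of that calculation, the Python snippet below merges word boxes into one enclosing box per page. It assumes each word is a dict with the page, left, right, top and bottom fields shown in the response sample, and that top is less than or equal to bottom (y grows downward); the coordinate convention is an assumption, not something this page states. The helper name union_boxes is hypothetical.

```python
from collections import defaultdict


def union_boxes(words):
    """Return one enclosing box per page for a list of word dicts.

    Assumes each word has "page", "left", "right", "top" and "bottom" keys,
    and that top <= bottom (an assumption about the coordinate system).
    """
    by_page = defaultdict(list)
    for word in words:
        by_page[word["page"]].append(word)

    boxes = []
    for page in sorted(by_page):
        group = by_page[page]
        boxes.append({
            "page": page,
            "left": min(w["left"] for w in group),
            "right": max(w["right"] for w in group),
            "top": min(w["top"] for w in group),
            "bottom": max(w["bottom"] for w in group),
        })
    return boxes


# Illustrative values only; real coordinates come from the API response.
words = [
    {"text": "Pump", "page": 1, "left": 0.10, "right": 0.18, "top": 0.20, "bottom": 0.23},
    {"text": "station", "page": 1, "left": 0.19, "right": 0.30, "top": 0.20, "bottom": 0.23},
]
print(union_boxes(words))
```

A paragraph that spans a page break yields two boxes, one per page, which is why the helper returns a list rather than a single box.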

Note that this endpoint only supports PDF files for now. Support for more file types will be added in the future.

This endpoint is in alpha and is available only when a cdf-version: YYYYMMDD-alpha header is provided.

Security: oidc-token or oauth2-client-credentials or oauth2-open-industrial-data or oauth2-auth-code

Request

header Parameters

cdf-version

string

CDF version header. Use this to specify the requested CDF release.

Example: alpha

Request Body schema: application/json
required

Fields to be set for the content request.

One of:

id (required)	integer <int64> (CogniteInternalId), range 1 .. 9007199254740991. A server-generated ID for the object.
granularity	string, default "words". Adjusts the level of detail in the response. Enum: "words", "elements", "lines".

Responses

200

Parsed textual content for the document

400

The response for a failed request.

POST /documents/elements

Request samples

Payload

application/json

{
  "id": 1,
  "granularity": "words"
}
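The payload above could be sent from Python roughly as follows. This is a sketch that only builds the request and does not send it. The base URL, the Bearer token scheme and the helper name build_elements_request are assumptions for illustration; only the path, the body fields, the id range, the granularity values and the cdf-version header come from this page.

```python
import json
import urllib.request

ALLOWED_GRANULARITY = ("words", "elements", "lines")
MAX_ID = 9007199254740991  # upper bound of the documented id range


def build_elements_request(base_url, token, file_id,
                           granularity="words", cdf_version="alpha"):
    """Build (but do not send) a POST request for /documents/elements.

    base_url and the Bearer token scheme are assumptions; adjust them
    to your project and authentication setup.
    """
    if not 1 <= file_id <= MAX_ID:
        raise ValueError("id must be in the range 1 .. 9007199254740991")
    if granularity not in ALLOWED_GRANULARITY:
        raise ValueError("granularity must be one of %r" % (ALLOWED_GRANULARITY,))

    body = json.dumps({"id": file_id, "granularity": granularity}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/documents/elements",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumed auth scheme
            "cdf-version": cdf_version,          # required while the endpoint is in alpha
        },
    )


# Placeholder base URL and token, for illustration only.
req = build_elements_request(
    "https://example.invalid/api/v1/projects/my-project", "TOKEN", 1
)
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would perform the call; that step is left out here so the sketch runs without network access or credentials.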

Response samples

application/json

{
  "elements": [
    {
      "text": "string",
      "page": 0,
      "left": 0,
      "right": 0,
      "top": 0,
      "bottom": 0
    }
  ]
}
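A response of this shape can be read into typed records and regrouped by page. The sketch below assumes the flat elements list shown in the sample and assumes that elements arrive in reading order, which this page does not guarantee. The Element class and both helper names are hypothetical.

```python
import json
from dataclasses import dataclass

FIELDS = ("text", "page", "left", "right", "top", "bottom")


@dataclass(frozen=True)
class Element:
    text: str
    page: int
    left: float
    right: float
    top: float
    bottom: float


def parse_elements(payload):
    """Turn a JSON response string into a list of Element records."""
    data = json.loads(payload)
    return [Element(**{k: item[k] for k in FIELDS})
            for item in data.get("elements", [])]


def text_by_page(elements):
    """Join element texts per page, keeping the order they were returned in."""
    pages = {}
    for element in elements:
        pages.setdefault(element.page, []).append(element.text)
    return {page: " ".join(texts) for page, texts in pages.items()}


# Illustrative payload; the values are made up.
sample = json.dumps({"elements": [
    {"text": "Pump", "page": 1, "left": 0.10, "right": 0.18, "top": 0.20, "bottom": 0.23},
    {"text": "station", "page": 1, "left": 0.19, "right": 0.30, "top": 0.20, "bottom": 0.23},
    {"text": "Overview", "page": 2, "left": 0.10, "right": 0.25, "top": 0.05, "bottom": 0.08},
]})
print(text_by_page(parse_elements(sample)))
```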
