Returns parsed document content for the file.
Each document uploaded to CDF is OCR'ed and run through layout analysis, which detects titles, paragraphs, and tables in the document. The textual content of the document is returned as a hierarchy of layout elements.
These high-level layout elements are in turn divided into lines, words, and characters. Each character has a bounding box that identifies which page it appears on and where on that page.
Bounding boxes for paragraphs, lines, and words are not provided, but can be calculated from the bounding boxes of the individual characters.
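As a minimal sketch of that calculation: the box for a word, line, or paragraph is the smallest rectangle that encloses all of its character boxes. The field names used here (page, xMin, yMin, xMax, yMax) are assumptions made for illustration and are not taken from the response schema, so adapt them to the fields the endpoint actually returns.

```python
from typing import Iterable, Optional, TypedDict


class BoundingBox(TypedDict):
    # Hypothetical field names; the real response schema may differ.
    page: int
    xMin: float
    yMin: float
    xMax: float
    yMax: float


def union_box(boxes: Iterable[BoundingBox]) -> Optional[BoundingBox]:
    """Return the smallest box enclosing all given boxes, or None if there are none."""
    boxes = list(boxes)
    if not boxes:
        return None
    # A single enclosing box only makes sense when everything is on one page.
    pages = {b["page"] for b in boxes}
    if len(pages) != 1:
        raise ValueError("boxes span more than one page")
    return {
        "page": boxes[0]["page"],
        "xMin": min(b["xMin"] for b in boxes),
        "yMin": min(b["yMin"] for b in boxes),
        "xMax": max(b["xMax"] for b in boxes),
        "yMax": max(b["yMax"] for b in boxes),
    }
```

To get the box for a word, pass the boxes of its characters. For a line or paragraph, pass the boxes of all characters it contains. An element that spans a page break would need one box per page, which this sketch rejects rather than guesses at.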
Note that this endpoint only supports PDF files for now. Support for more file types will be added in the future.
This endpoint is in alpha and is available only when a cdf-version: YYYYMMDD-alpha header is provided.
Parsed textual content for the document.
The response for a failed request.
{
  "elements": [
    {
      "type": "title",
      "level": 1,
      "lines": [
        {
          "words": [
            {
              "characters": [
                null
              ]
            }
          ]
        }
      ]
    }
  ]
}
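The sample response above can be walked with a few nested loops. The following is a rough sketch that uses only the keys visible in the sample (elements, type, level, lines, words, characters). It assumes the body has already been parsed into a Python dict, and that other element types, such as tables, may lack some of these keys. The function names are illustrative only.

```python
from typing import Any, Dict, Iterator, List


def iter_characters(element: Dict[str, Any]) -> Iterator[Any]:
    """Yield every non-null character entry under one layout element."""
    # `or []` tolerates keys that are missing or explicitly null.
    for line in element.get("lines") or []:
        for word in line.get("words") or []:
            for char in word.get("characters") or []:
                if char is not None:
                    yield char


def summarize(content: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one summary row per top-level layout element."""
    return [
        {
            "type": element.get("type"),
            "level": element.get("level"),
            "lines": len(element.get("lines") or []),
            "characters": sum(1 for _ in iter_characters(element)),
        }
        for element in content.get("elements") or []
    ]
```

Null character entries, as in the sample, are skipped rather than counted, so the sample yields a single title element with one line and zero characters.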