PDFlib TET (Text and Image Extraction Toolkit) is a developer toolkit for reliably extracting text, images, and metadata from PDF documents. TET extracts the text contents of a PDF as Unicode strings, together with detailed colour, glyph, and font information and the position of the text on the page. Raster images are extracted in common image formats.
TET can convert PDF documents to an XML-based format known as TETML, which contains text, metadata, and resource information. TET includes sophisticated content analysis algorithms for determining word boundaries, grouping text into columns, detecting table structures, and removing redundant text such as shadow text.
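To make the idea of content analysis more concrete, the following is a minimal, generic sketch of two of the tasks named above: dropping shadow text and finding word boundaries from glyph positions. It does not use the TET API and is not TET's actual algorithm. The `Glyph` type, the function names, and the `tolerance` and `gap_factor` thresholds are all hypothetical, and the sketch assumes a single line of left-to-right text.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its placement on the page."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position
    width: float  # advance width


def drop_shadow_duplicates(glyphs, tolerance=1.0):
    """Keep the first of any identical glyphs placed almost on top of each other.

    Shadow text is typically produced by drawing the same text twice with a
    small offset, so a repeated character within `tolerance` units of one
    already kept is treated as a shadow copy (an assumed heuristic).
    """
    kept = []
    for g in glyphs:
        is_shadow = any(
            k.char == g.char
            and abs(k.x - g.x) <= tolerance
            and abs(k.y - g.y) <= tolerance
            for k in kept
        )
        if not is_shadow:
            kept.append(g)
    return kept


def group_into_words(glyphs, gap_factor=0.3):
    """Split one line of glyphs into words wherever the horizontal gap is large.

    A gap wider than `gap_factor` times the previous glyph's width is taken
    as a word boundary (an assumed threshold).
    """
    words = []
    current = ""
    prev = None
    for g in sorted(glyphs, key=lambda item: item.x):
        if prev is not None and g.x - (prev.x + prev.width) > gap_factor * prev.width:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# "Hi yo" with a shadow copy of the "H" drawn at a small offset.
line = [
    Glyph("H", 0.0, 0.0, 6.0),
    Glyph("H", 0.5, -0.5, 6.0),  # shadow copy
    Glyph("i", 6.0, 0.0, 3.0),
    Glyph("y", 14.0, 0.0, 6.0),
    Glyph("o", 20.0, 0.0, 6.0),
]
words = group_into_words(drop_shadow_duplicates(line))
print(words)  # ['Hi', 'yo']
```

Explicit spaces are often missing from PDF page content, which is why a sketch like this infers word boundaries from geometry rather than from space characters.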
TET also includes the pCOS interface for querying details of a PDF document, such as XMP metadata, font lists, page sizes, and document information fields.