GroupDocs.Parser for Java

GroupDocs.Parser for Java is a Java API for extracting raw, formatted, or structured text, metadata, and images from documents, spreadsheets, presentations, emails, and archives. It can also parse secured and encrypted documents in formats including Word, Excel, PowerPoint, OneNote, PDF, and ZIP archives.

GroupDocs.Parser for Java Developer Small Business (License type: Developer, Java): £915.00
GroupDocs.Parser for Java Developer OEM (License type: Developer, Java): £2,735.00
GroupDocs.Parser for Java Site Small Business (License type: Developer, Java): £4,555.00
GroupDocs.Parser for Java Site OEM (License type: Developer, Java): £12,745.00

Overview

GroupDocs.Parser for Java supports the following document file formats:

Text Extraction

  • Text: DOC, DOCX, DOT, DOTM, DOTX, DOCM, RTF, ODT, OTT, TXT, MD, WordprocessingML (XML)
  • Spreadsheets: XLS, XLSX, CSV, XLSM, XLSB, ODS, SpreadsheetML (XML), XLT, XLTX, XLTM, OTS, XLA, XLAM, TSV
  • Presentations: PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP
  • OneNote: ONE
  • Email: MSG, EML, EMLX, PST, OST, MS Exchange Server, POP, IMAP
  • Electronic Publishing: EPUB, FB2
  • Portable Document: PDF, PDF Portfolio, Encrypted PDF
  • DOM-Based: XML, HTML, XHTML, MHTML
  • Compression & Packaging: ZIP, CHM
  • Database: ADOJava

Encoding Detection

  • BOM: UTF32 LE, UTF32 BE, UTF16 LE, UTF16 BE, UTF8, and UTF7
  • Content: UTF32 LE, UTF32 BE, UTF16 LE, UTF16 BE, UTF8, and ANSI
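BOM-based detection of the encodings listed above can be sketched in plain Java. This is only an illustration of the general technique using the standard byte order marks; it is not GroupDocs.Parser's implementation or API, and the class name `BomSniffer` and its `detect` method are hypothetical.

```java
/** Illustrative BOM sniffer (hypothetical helper, not part of GroupDocs.Parser). */
final class BomSniffer {

    /** Returns the encoding implied by a leading byte order mark, or null if none is recognized. */
    static String detect(byte[] data) {
        // Check 4-byte marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
        // (FF FE 00 00 could also be UTF-16 LE text starting with U+0000; that case is treated as UTF-32 LE here.)
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        // UTF-7 mark: "+/v" followed by one of '8', '9', '+', '/'.
        if (data.length >= 4 && startsWith(data, 0x2B, 0x2F, 0x76)) {
            int b = data[3] & 0xFF;
            if (b == 0x38 || b == 0x39 || b == 0x2B || b == 0x2F) return "UTF-7";
        }
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8)); // prints UTF-8
    }
}
```

When no mark is present, `detect` returns `null`; that is the case where content-based detection would have to take over.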

Metadata Extraction

  • Text: DOC, DOCX, DOT, DOTX, DOTM, OTT, ODT
  • Spreadsheets: XLS, XLSX, XLT, XLTX, XLTM, XLA, XLAM, OTS, ODS
  • Presentations: PPT, PPTX, POT, POTX, POTM, PPSM, PPTM, OTP, ODP
  • Email: MSG, EML, EMLX
  • Electronic Publishing: EPUB, FB2
  • Other: PDF

Text & Metadata Extraction

  • Template: DOTX, POTX
  • Macro-Enabled Template: DOTM, POTM, PPSM, PPTM
  • OpenDocument Template: OTT

Image Extraction

  • Text: DOC, DOCX, DOCM, RTF, DOT, DOTM, DOTX, ODT
  • Spreadsheets: XLS, XLSX, XLSM, XLSB, ODS, XLT, XLTM, XLTX
  • Presentations: PPT, PPTX, PPTM, ODP, POT, POTM, POTX, PPS, PPSX, PPSM
  • Portable Document: PDF
  • Ebook: CHM, EPUB, FB2
  • Markup: HTML

Features

Count Word Occurrences Across Single or Multiple Files
Extract Text and Metadata from Excel Worksheets and Presentation Templates
Extract Text Content from a File or Stream without Installing Document Reader
Get Formatted Text from a Document using Fast or Standard Text Extraction Mode
Detect the Media Type of Password-Protected XML Documents & Pull Text from Them
Programmatically Get Formatted Text from Within Emails & Attachments
Draw Out Text from Single or Multiple Pages of a OneNote Document
Extract Data from PDF, MS Word, Excel and Presentation Documents
Extract Data from PDF Forms & Take Out Text from a Simple PDF File or a PDF Portfolio Document
Get Formatted Text from a PowerPoint Presentation or Extract Text from a Specific Slide
Gather Raw or Formatted Text from Cells, Rows, and Columns of an Excel Spreadsheet
Extract Raw or HTML-Formatted Text from a Word Document
HTML Formatter Supports Formatting of Paragraph, Hyperlink, Font, Headings, Lists & Tables
Pull Out a Single Sentence or the Whole Text from EPUB, CHM, Markdown & FB2 Files
Extract the Table of Contents from Databases, PDF, EPUB, CHM & Word Processing Documents
Pull Out Text with its Content Structure Intact & Excerpt Highlighted Text from Documents
Obtain Text Area from Documents for Analysis & Draw out Metadata from Supported Document Formats
Obtain All or Selected Images from Supported Formats & Rotate Extracted Image(s)
Take Out Text from Files within ZIP Archives & OST Containers & Detect File Types of ZIP Container Items
Get Data from Email Containers (Exchange Web Server, POP3, IMAP)
Search Simple Text, Whole Word & Regular Expression within Documents
Prepare Document Template, Extract Data from Document and Analyze Data Fields & Tables
Search and Extract Highlighted Expressions in Documents
Get Text with Plain Text Formatter (Simple & ASCII) or with Markdown Formatter
Markdown Formatter Supports Formatting of Font, Hyperlinks, Headings, Lists & Tables
Perform Custom Formatting with Edges, Angles, and Intersections to Format Plain Text
Move Table Layout & Detect Tables in a Rectangular Area by Column Separators
Extract Text from Shapes, WordArt Objects & Text Boxes within Microsoft Office File Formats
Extract Images to Files – Save to JPG, PNG, GIF, BMP or WEBP Formats
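
Two of the features above, counting word occurrences and searching by simple text, whole word, or regular expression, can be illustrated on text that has already been extracted into a `String`. The sketch below uses only `java.util.regex` from the standard library; it does not use or represent the GroupDocs.Parser API, and `TextStats` and its methods are hypothetical names chosen for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative word statistics and search over extracted text (hypothetical helper). */
final class TextStats {

    // A "word" here is assumed to be a run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Counts case-insensitive word occurrences; keys are lower-cased and sorted. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts matches of a literal keyword, either anywhere or as a whole word only. */
    static int countMatches(String text, String keyword, boolean wholeWord) {
        String quoted = Pattern.quote(keyword); // treat the keyword literally, not as a regex
        String regex = wholeWord ? "\\b" + quoted + "\\b" : quoted;
        return countRegex(text, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    /** Counts non-overlapping matches of an arbitrary regular expression. */
    static int countRegex(String text, Pattern pattern) {
        int n = 0;
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The cat sat. The category is cats; cat!";
        System.out.println(countWords(text));
        System.out.println(countMatches(text, "cat", false));
        System.out.println(countMatches(text, "cat", true));
    }
}
```

To cover multiple files, the per-file maps returned by `countWords` could be merged with the same `Map.merge` call.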