EXTRACT TEXT FROM PDF IN JAVA
JPedal makes it easy for Java developers to extract the text content from a PDF file and make use of any structure it includes.
Why do Java Developers use JPedal for text extraction?
JPedal is a Java PDF library which can parse and decode even the most complex PDF files. It is able to extract the text content and also provide search functions on a PDF Document.
Support for PDF 2.0 specification
JPedal supports all the features in the latest PDF Specification including structure tags, complex fonts, and multiple languages.
Preserve text information
Text location and metrics are preserved when JPedal parses a PDF file. WordList mode returns a list of all the words on a page together with their outline rectangles.
Multiple language support
JPedal supports CID and non-CID fonts. OpenType and PostScript fonts are both fully supported.
Perform complex text search
JPedal can perform complex multi-line searches using regular expressions with wildcards.
JPedal Text Extraction Key Features
JPedal allows developers to extract the textual content inside a PDF Document.
Unicode Text Extraction
JPedal removes all the complexity of PDF content encoding. Text is converted into Unicode values.
Extract Structured Text
PDF files can be created with optional structure tags. If present, JPedal will extract and convert the content into XML.
Unstructured Text Extraction
In PDF files with no structure, JPedal extracts the text following the top-to-bottom flow of the page and writes it to a text file.
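To illustrate what top-to-bottom flow means in practice, here is a minimal sketch in plain Java. It does not use the JPedal API; the `Fragment` class, the `toPlainText` method and the `lineTolerance` parameter are hypothetical names introduced for this example. It assumes default PDF user space, where the origin is at the bottom-left of the page, so a larger y value is higher up.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not JPedal's implementation or API.
final class ReadingOrderSketch {

    /** A piece of text with the position of its baseline start (hypothetical type). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Orders fragments top to bottom, then left to right within each line. */
    static String toPlainText(List<Fragment> fragments, float lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Larger y is higher on the page, so sort by y descending.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        StringBuilder out = new StringBuilder();
        List<Fragment> line = new ArrayList<>();
        for (Fragment f : sorted) {
            // Start a new line once the fragment drops below the current line.
            if (!line.isEmpty() && line.get(0).y - f.y > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(f);
        }
        if (!line.isEmpty()) {
            appendLine(out, line);
        }
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Fragment> line) {
        line.sort((a, b) -> Float.compare(a.x, b.x));
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
        out.append('\n');
    }
}
```

Grouping by a tolerance rather than by exact y values keeps fragments with slightly different baselines on the same output line.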
Extract Word Positioning
JPedal is able to extract all the individual words on a page with the co-ordinates of their bounding boxes.
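As a rough sketch of how word positions might be consumed, the plain Java below turns a flat list of strings into word objects with bounding boxes. The layout assumed here (each word followed by four coordinate values x1, y1, x2, y2) and the `Word` and `parse` names are assumptions made for illustration; check the JPedal documentation for the actual return format.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the flat five-values-per-word layout is an assumption.
final class WordListSketch {

    /** A word and the two corners of its bounding box (hypothetical type). */
    static final class Word {
        final String text;
        final float x1, y1, x2, y2;

        Word(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }

        float width() {
            return Math.abs(x2 - x1);
        }

        float height() {
            return Math.abs(y1 - y2);
        }
    }

    /** Reads groups of five values: word, x1, y1, x2, y2. */
    static List<Word> parse(List<String> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected groups of 5 values, got " + flat.size());
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            words.add(new Word(
                    flat.get(i),
                    Float.parseFloat(flat.get(i + 1)),
                    Float.parseFloat(flat.get(i + 2)),
                    Float.parseFloat(flat.get(i + 3)),
                    Float.parseFloat(flat.get(i + 4))));
        }
        return words;
    }
}
```

With the words in typed objects, tasks such as highlighting a word or finding the words inside a region become simple comparisons on the coordinates.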
Search PDF files
JPedal can search PDF Documents for text and return the page and location of each match. Complex wildcard search is supported.
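The idea behind a multi-line wildcard search can be sketched with the standard `java.util.regex` package. This is not JPedal's search API; the class and method names are hypothetical, and the wildcard syntax (`*` for any run of characters, `?` for a single character) is an assumption chosen for the example. It works on text that has already been extracted from a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a wildcard search over already-extracted page text.
final class WildcardSearchSketch {

    /** Converts a wildcard query into a case-insensitive regex. */
    static Pattern toPattern(String wildcardQuery) {
        // Collapse runs of whitespace so each gap becomes one \s+ in the regex.
        String query = wildcardQuery.trim().replaceAll("\\s+", " ");
        StringBuilder regex = new StringBuilder();
        for (char c : query.toCharArray()) {
            if (c == '*') {
                regex.append(".*?");      // shortest possible run of characters
            } else if (c == '?') {
                regex.append('.');        // exactly one character
            } else if (c == ' ') {
                regex.append("\\s+");     // lets a phrase match across a line break
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // DOTALL lets '.' cross line breaks, so matches can span lines.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    /** Returns every matched substring in the order it appears in the text. */
    static List<String> findAll(String pageText, String wildcardQuery) {
        List<String> hits = new ArrayList<>();
        Matcher m = toPattern(wildcardQuery).matcher(pageText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

For example, `WildcardSearchSketch.findAll("Total amount\ndue: 42 EUR", "amount due")` finds the phrase even though it is split over two lines.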
Extract Document outlines
JPedal will convert the PDF Document outline (if present) into XML including page title, page number and zoom level.
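To give a feel for what an outline looks like once serialised, here is a minimal plain-Java sketch that writes nested outline entries as XML. The element and attribute names (`outline`, `entry`, `title`, `page`, `zoom`) and the `Entry` class are hypothetical and are not JPedal's actual XML schema.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: element and attribute names are invented for this sketch.
final class OutlineXmlSketch {

    /** One bookmark in the outline tree (hypothetical type). */
    static final class Entry {
        final String title;
        final int page;
        final String zoom;
        final List<Entry> children = new ArrayList<>();

        Entry(String title, int page, String zoom) {
            this.title = title;
            this.page = page;
            this.zoom = zoom;
        }
    }

    /** Serialises the outline tree into a single XML string. */
    static String toXml(List<Entry> roots) {
        StringBuilder sb = new StringBuilder("<outline>");
        for (Entry e : roots) {
            append(sb, e);
        }
        return sb.append("</outline>").toString();
    }

    private static void append(StringBuilder sb, Entry e) {
        sb.append("<entry title=\"").append(escape(e.title))
          .append("\" page=\"").append(e.page)
          .append("\" zoom=\"").append(escape(e.zoom)).append('"');
        if (e.children.isEmpty()) {
            sb.append("/>");
        } else {
            sb.append('>');
            for (Entry child : e.children) {
                append(sb, child);
            }
            sb.append("</entry>");
        }
    }

    /** Escapes the characters that are not allowed inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```

Nesting child entries inside their parent keeps the hierarchy of the outline visible in the XML.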