How to Extract Text from PDF files
The links below point to Javadocs with example source code for extracting text from PDF files.
- ExtractStructuredContent – View Javadoc for API to extract any structured content, if present (documents without structure will return no data).
- ExtractTextInRectangle – View Javadoc for API to extract text from any rectangular area of the PDF page.
- ExtractTextAsWordlist – View Javadoc for API to generate a list of words on the PDF page with their page coordinates.
- ExtractOutline – View Javadoc for API to extract the PDF outline tree from a PDF file (if present) as an XML structure.
PDF to Text Extraction Options
Because PDF was originally designed as a display format, it does not generally contain formatted text, just blocks of unconnected text in a possibly random order. JPedal can extract these text blocks and provides heuristics to merge them into useful content.
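To illustrate the kind of merging involved, here is a minimal sketch that groups positioned text fragments into lines. It is not JPedal's actual heuristic: the FragmentMerger and Fragment names, and the yTolerance parameter, are hypothetical and exist only for this example. It assumes each fragment carries the PDF-space position of its left baseline, with y increasing up the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of the JPedal API. */
final class FragmentMerger {

    /** A block of text with the PDF-space position of its left baseline. */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private FragmentMerger() {}

    /**
     * Groups fragments whose y values lie within yTolerance of the first
     * fragment on the line, then orders each line from left to right.
     */
    static List<String> mergeIntoLines(List<Fragment> fragments, float yTolerance) {
        // In PDF space a larger y is higher on the page, so sort descending.
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : byY) {
            // Start a new line once the vertical gap exceeds the tolerance.
            if (!current.isEmpty() && current.get(0).y - f.y > yTolerance) {
                lines.add(joinLeftToRight(current));
                current.clear();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    /** Sorts one line's fragments by x and joins them with single spaces. */
    private static String joinLeftToRight(List<Fragment> line) {
        List<Fragment> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

A real implementation would also need to handle columns, rotated text and varying font sizes, which this sketch ignores.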
If you are creating PDFs, we recommend creating PDFs that contain structured (marked) content. This option adds extra metadata to the PDF so that text can be extracted perfectly.
The three extraction methods all extract PDF text from a given rectangle. The rectangle's coordinates must be given as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is the bottom left, unlike Java, where the origin is the top left.
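Because the two origins differ, a rectangle selected in Java screen space needs its y values flipped before it is passed to an extraction call. The sketch below shows one way to do that. PdfRectangles and toPdfCoords are hypothetical names, not part of JPedal, and the code assumes an unrotated page whose visible area starts at 0,0 and is drawn at 100% scaling.

```java
import java.util.Arrays;

/** Illustrative only: a hypothetical helper, not part of the JPedal API. */
final class PdfRectangles {

    private PdfRectangles() {}

    /**
     * Converts a Java-style rectangle (origin top left, y increasing downwards)
     * into {x1, y1, x2, y2} in PDF page space (origin bottom left), where
     * (x1, y1) is the top left corner and (x2, y2) is the bottom right corner.
     */
    static int[] toPdfCoords(int x, int y, int width, int height, int pageHeight) {
        int x1 = x;
        int y1 = pageHeight - y;              // top edge, measured up from the page bottom
        int x2 = x + width;
        int y2 = pageHeight - (y + height);   // bottom edge, measured up from the page bottom
        return new int[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        // A 200 x 300 selection at (50, 100) on a page 842 units high.
        int[] coords = toPdfCoords(50, 100, 200, 300, 842);
        System.out.println(Arrays.toString(coords)); // prints [50, 742, 250, 442]
    }
}
```

Note that in PDF space y1 is larger than y2, since the top of the rectangle is further from the page origin than the bottom.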