How to Extract Text from PDF files

List of code examples

All these examples are included in the trial and full versions of JPedal. They give details on how to run each directly from the jar and any useful options. We also provide links to the source code so you can use the code in your own programs and tailor it to your exact requirements.

Automated PDF Text Extraction examples

PDF to Text Extraction Options

Because PDF was originally designed as a display format, it does not generally contain formatted text, only blocks of unconnected text in a possibly random order. JPedal provides code to extract these text blocks, plus heuristics to merge them into useful content. You can use JPedal to extract the raw text and apply your own merging logic, or use our algorithms. If you are creating PDFs, we recommend creating PDFs that contain structured (marked) content. This option adds extra metadata to the PDF so that the text can be extracted reliably.
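As an illustration of what "your own merging logic" might look like, here is a minimal sketch that groups raw text blocks into lines by their vertical position and then orders each line left to right. It is not JPedal's algorithm and does not call JPedal: the Fragment and LineMerger classes, the mergeIntoLines method and the tolerance parameter are all hypothetical names introduced for this example, and it assumes each block comes with a left edge and a baseline measured from the bottom of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for one raw text block; not a JPedal type. */
final class Fragment {
    final String text;
    final float x; // left edge of the block
    final float y; // baseline, measured from the bottom of the page

    Fragment(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

/** Illustrative merging logic only; JPedal's own heuristics are more involved. */
final class LineMerger {

    /** Groups fragments whose baselines lie within tolerance of each other into lines, top to bottom. */
    static List<String> mergeIntoLines(List<Fragment> fragments, float tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // A larger y is nearer the top of the page, so sort by descending y.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            // Start a new line once we drop further than tolerance below the line's first fragment.
            if (!current.isEmpty() && current.get(0).y - f.y > tolerance) {
                lines.add(joinLeftToRight(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    /** Orders the fragments of one line by x and joins them with single spaces. */
    private static String joinLeftToRight(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

A real document would also need handling for columns, rotated text and varying font sizes, which is where the supplied algorithms are likely to save effort.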

Methods in PdfGroupingAlgorithms

All the text extraction algorithms are located in the PdfGroupingAlgorithms class. The public methods in this class that can be used for extraction are extractTextAsTable, extractTextAsWordlist and extractTextInRectangle.

Coordinates

The three extraction methods all extract PDF text from a given rectangle. The rectangle is specified as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is at the bottom left, the opposite of Java, where the origin is at the top left, so y1 is normally greater than y2.
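If your rectangle comes from Java screen coordinates (for example a mouse selection), it needs flipping before it is passed to the extraction methods. The following is a minimal sketch of that conversion. The CoordinateFlip class and toPdfRectangle method are hypothetical helpers, not part of JPedal, and the sketch assumes the selection is already in PDF units, the page is unrotated and unscaled, and the page's visible area starts at 0,0.

```java
/** Hypothetical helper, not part of JPedal. */
final class CoordinateFlip {

    /**
     * Converts a Java-style rectangle (origin top left, y grows downward)
     * into {x1, y1, x2, y2} with the origin at the bottom left, where
     * x1, y1 is the top left corner and x2, y2 is the bottom right corner.
     */
    static int[] toPdfRectangle(int left, int top, int width, int height, int pageHeight) {
        int x1 = left;
        int y1 = pageHeight - top;            // top edge, measured from the bottom of the page
        int x2 = left + width;
        int y2 = pageHeight - (top + height); // bottom edge, measured from the bottom of the page
        return new int[] {x1, y1, x2, y2};
    }
}
```

For a page 842 units high, a selection at left 50, top 100 with width 200 and height 300 becomes x1=50, y1=742, x2=250, y2=442.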

Overview of PDF to Text Extraction Methods

Map extractTextAsTable(int x1, int y1, int x2, int y2, int pageNumber, boolean isCSV, boolean keepFontInfo, boolean keepWidthInfo, boolean keepAlignmentInfo, int borderWidth)

Algorithm to extract all the text in a given rectangle x1, y1, x2, y2 on page pageNumber, outputting the text as a series of columns and rows. The returned map contains the elements content, x1, y1, x2, y2.

The parameters after isCSV are only used when the text is extracted as XHTML (isCSV=false).

Set keepFontInfo to true to extract the font information with the extracted table.

Set keepWidthInfo to true to extract the width information with the extracted text.

Set keepAlignmentInfo to true to extract the alignment information with the extracted text.

The value of borderWidth sets the width of the table borders.

Vector extractTextAsWordlist(int x1, int y1, int x2, int y2, int page_number, boolean breakFragments, String punctuation)

Algorithm to extract all the PDF text in a given rectangle x1, y1, x2, y2 on page page_number, outputting the text as a list of words, each followed by its coordinates. The elements of the returned Vector are ordered as: the word as a String, then the x1, y1, x2 and y2 coordinates, each as a float.

The value of breakFragments determines whether the algorithm attempts to divide the text into columns, if there are any.

The value of punctuation determines which characters are treated as punctuation when finding the start and end of a word.
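Because the result is a flat list that repeats in groups of five (word, x1, y1, x2, y2), it is often convenient to regroup it into one object per word. The sketch below shows one way to do that without calling JPedal itself. The Word and WordListReader classes are hypothetical names, not JPedal types, and the coordinate parsing deliberately accepts either Float or String values, since the exact element type may depend on the JPedal version you are using.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value class for one word and its bounding box; not a JPedal type. */
final class Word {
    final String text;
    final float x1, y1, x2, y2;

    Word(String text, float x1, float y1, float x2, float y2) {
        this.text = text;
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }
}

/** Hypothetical helper that regroups the flat word list into Word objects. */
final class WordListReader {

    /** Reads the list five elements at a time: word, x1, y1, x2, y2. */
    static List<Word> toWords(List<?> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("Expected groups of 5 elements, got " + flat.size());
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            words.add(new Word(
                    String.valueOf(flat.get(i)),
                    coord(flat.get(i + 1)),
                    coord(flat.get(i + 2)),
                    coord(flat.get(i + 3)),
                    coord(flat.get(i + 4))));
        }
        return words;
    }

    /** Assumption: a coordinate may arrive as a Float or as a String, so parse its text form. */
    private static float coord(Object value) {
        return Float.parseFloat(String.valueOf(value));
    }
}
```

Since Vector implements List, the value returned by extractTextAsWordlist can be passed to toWords directly, after checking it is not null.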

String extractTextInRectangle(int x1, int y1, int x2, int y2, int page_number, boolean estimateParagraphs, boolean breakFragments)

Algorithm to extract all the PDF text in a given rectangle x1, y1, x2, y2 on page page_number, outputting the text as a single string that keeps the line structure.

The value of estimateParagraphs determines whether the algorithm attempts to predict the paragraphs in the text.

The value of breakFragments determines whether the algorithm attempts to divide the text into columns, if there are any.