How to Find Text in a PDF File

List of Java code examples

All these examples are included in the trial and full versions of JPedal. They give details on how to run each directly from the jar and any useful options. We also provide links to the Java source code so you can use the Java code in your own programs and tailor it to your exact requirements.

Automated PDF Text search example

 

A note on co-ordinates

Examples use PDF co-ordinates, which start at the bottom left of the page and run up the page. This is the opposite of Java, whose co-ordinates start at the top left and run down the page.
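As a rough illustration of the difference, the sketch below flips a rectangle from PDF co-ordinates to Java co-ordinates. The class and method names are hypothetical and not part of JPedal, and the sketch assumes an unrotated page whose visible area starts at (0, 0); pages with rotation or an offset crop box need extra handling.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JPedal.
final class PdfToJavaCoords {

    /** Flips a y value between PDF space (origin bottom left) and Java space (origin top left). */
    static float flipY(float y, float pageHeight) {
        return pageHeight - y;
    }

    /**
     * Converts a PDF rectangle, given by its lower-left corner (x, y), width and height,
     * into {x, y, width, height} for Java, where (x, y) is the top-left corner.
     */
    static float[] toJavaRect(float x, float y, float width, float height, float pageHeight) {
        float topEdgeInPdf = y + height; // the top edge is the larger y value in PDF space
        return new float[] {x, flipY(topEdgeInPdf, pageHeight), width, height};
    }

    public static void main(String[] args) {
        // 842 points is the height of an A4 page
        float[] javaRect = toJavaRect(72f, 700f, 200f, 20f, 842f);
        System.out.println(Arrays.toString(javaRect)); // prints [72.0, 122.0, 200.0, 20.0]
    }
}
```

Only the y value changes; x, width and height are the same in both systems under these assumptions.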

Interactive PDF search in the PDF Viewer

The built-in PDF Viewer offers powerful PDF search capabilities as standard.

The interactive PDF search function allows you to search either the current page or the entire PDF document for occurrences of a word or a phrase. It supports the 3 search GUI layouts which Adobe has offered in releases of Acrobat, selected using an Options setting. This functionality can also be accessed from your own code.

Performing a search from within the Viewer class is done via the method Object executeCommand(int commandID, Object[] args), using Commands.FIND as the int value. For details of how to perform a search from the Viewer class, please check the tutorial on how to access PDF Viewer functions.

The search results can be retrieved using the method getSearchResults() in the Viewer class. The results are returned as a SearchList. If the method is called halfway through a search, the results found up to that point are returned and the search continues. To check that the search has completed, call getStatus() on the SearchList and compare the result with the following values:

  • public final static int NO_RESULTS_FOUND = 1;
  • public final static int SEARCH_COMPLETE_SUCCESSFULLY = 2;
  • public final static int SEARCH_INCOMPLETE = 4;
  • public final static int SEARCH_PRODUCED_ERROR = 8;
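The comparison could look something like the sketch below. It does not call JPedal: the class is hypothetical, the constants are simply copied from the list above, and real code would take them from SearchList and pass in the value returned by getStatus(). The sketch assumes the status is a single value compared by equality, which is how the text describes it.

```java
// Hypothetical helper, not part of JPedal. Constants are copied from the list above.
final class SearchStatusSketch {
    static final int NO_RESULTS_FOUND = 1;
    static final int SEARCH_COMPLETE_SUCCESSFULLY = 2;
    static final int SEARCH_INCOMPLETE = 4;
    static final int SEARCH_PRODUCED_ERROR = 8;

    /** True once the search has stopped, whether or not it found anything. */
    static boolean isFinished(int status) {
        return status == NO_RESULTS_FOUND
                || status == SEARCH_COMPLETE_SUCCESSFULLY
                || status == SEARCH_PRODUCED_ERROR;
    }

    /** Human-readable summary of a status value. */
    static String describe(int status) {
        switch (status) {
            case NO_RESULTS_FOUND:
                return "finished, nothing found";
            case SEARCH_COMPLETE_SUCCESSFULLY:
                return "finished, results are complete";
            case SEARCH_INCOMPLETE:
                return "still running, results are partial";
            case SEARCH_PRODUCED_ERROR:
                return "stopped with an error";
            default:
                return "unknown status " + status;
        }
    }

    public static void main(String[] args) {
        int status = SEARCH_INCOMPLETE; // stand-in for the value returned by getStatus()
        System.out.println(describe(status) + " (finished: " + isFinished(status) + ")");
    }
}
```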

PDF Search Access from PdfGroupingAlgorithms

All the searching and text extraction algorithms are located in the PdfGroupingAlgorithms class. The public-facing methods in this class that can be used for searching are findText, findMultipleTermsInRectangle, and findMultipleTermsInRectangleWithMatchingTeasers.

Search Type

Each of these methods takes an integer value describing the type of search to conduct. This value is either a single value or a combination of values from the SearchType class. These values are:

  • public final static int DEFAULT = 0;
  • public final static int WHOLE_WORDS_ONLY = 1;
  • public final static int CASE_SENSITIVE = 2;
  • public final static int FIND_FIRST_OCCURANCE_ONLY = 4;
  • public final static int MUTLI_LINE_RESULTS = 8;
  • public final static int HIGHLIGHT_ALL_RESULTS = 16;
  • public final static int USE_REGULAR_EXPRESSIONS = 32;

These values can be combined using the bitwise OR operator (|). For example:

int searchType = SearchType.WHOLE_WORDS_ONLY | SearchType.CASE_SENSITIVE;

All of these PDF search methods can find results split across multiple lines when the SearchType.MUTLI_LINE_RESULTS value is set; findTextInRectangleAcrossLines will find results across lines regardless.
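To show how a combined value behaves, here is a small self-contained sketch. The class and the isSet helper are hypothetical; the constants mirror the SearchType values listed above (including their original spelling) so that the example runs without JPedal on the classpath.

```java
// Hypothetical stand-in for SearchType, with the values listed above.
final class SearchTypeFlagsSketch {
    static final int DEFAULT = 0;
    static final int WHOLE_WORDS_ONLY = 1;
    static final int CASE_SENSITIVE = 2;
    static final int FIND_FIRST_OCCURANCE_ONLY = 4;
    static final int MUTLI_LINE_RESULTS = 8;
    static final int HIGHLIGHT_ALL_RESULTS = 16;
    static final int USE_REGULAR_EXPRESSIONS = 32;

    /** True if the given option is part of the combined searchType value. */
    static boolean isSet(int searchType, int flag) {
        // DEFAULT is 0 and carries no bit, so it is never reported as "set"
        return flag != 0 && (searchType & flag) == flag;
    }

    public static void main(String[] args) {
        int searchType = WHOLE_WORDS_ONLY | CASE_SENSITIVE | MUTLI_LINE_RESULTS;
        System.out.println(searchType);                                 // prints 11
        System.out.println(isSet(searchType, MUTLI_LINE_RESULTS));      // prints true
        System.out.println(isSet(searchType, USE_REGULAR_EXPRESSIONS)); // prints false
    }
}
```

Because each option is a distinct power of two, any combination produces a unique integer and individual options can be tested with a bitwise AND.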

 

Overview of PDF Searching Methods

float[] findText(Rectangle searchArea,int page_number,String[] terms,int searchType)

Algorithm to find an array of terms, terms, in a given rectangle, searchArea, on page_number. The result is a float[] of coordinates for the found text: each match is described by the four coordinates used to define a rectangle, followed by a fifth value. If the fifth value is -101, the next rectangle is a continuation of this result.
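One way to consume such an array is sketched below. It assumes the float[] is a flat sequence of five-value groups (four coordinates plus the marker) and groups continuation rectangles with the result they belong to. The class and method names are hypothetical, and the four coordinates are passed through untouched because their exact meaning is not spelled out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JPedal.
final class FindTextResultSketch {
    /** Marker meaning "the next rectangle continues this result". */
    static final float CONTINUATION = -101f;

    /** Groups a flat array of 5-value entries into results, each a list of 4-value rectangles. */
    static List<List<float[]>> group(float[] coords) {
        List<List<float[]>> results = new ArrayList<>();
        List<float[]> current = new ArrayList<>();
        for (int i = 0; i + 4 < coords.length; i += 5) {
            current.add(new float[] {coords[i], coords[i + 1], coords[i + 2], coords[i + 3]});
            if (coords[i + 4] != CONTINUATION) {
                // This rectangle ends the result; start collecting a new one
                results.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            results.add(current); // array ended while a continuation was still pending
        }
        return results;
    }

    public static void main(String[] args) {
        float[] coords = {
            10, 700, 90, 712, -101, // first part of a match split across two lines
            10, 686, 40, 698, 0,    // second part of the same match
            50, 500, 120, 512, 0    // a separate, single-rectangle match
        };
        List<List<float[]>> results = group(coords);
        System.out.println(results.size() + " results, first has "
                + results.get(0).size() + " rectangles");
    }
}
```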

 

List findMultipleTermsInRectangle(int x1, int y1, int x2, int y2, final int rotation, int page_number, String[] terms, boolean orderResults, int searchType, SearchListener listener)

Algorithm to find multiple text terms in a given rectangle (x1, y1, x2, y2) on page_number. If orderResults is true, the returned list is ordered so that the resulting rectangles appear in a logical order descending down the page; if false, rectangles for multiple terms are grouped together. A listener (an implementation of SearchListener) is required so that searching can be cancelled. The method returns a List of objects that can contain a combination of Rectangle and Rectangle[] values describing the locations of found text.
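Because the returned List mixes single rectangles and arrays of rectangles, calling code has to check the type of each element. The sketch below flattens such a list into individual rectangles. It is a hypothetical helper that does not call JPedal, and it assumes the elements are java.awt.Rectangle instances.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JPedal.
final class MixedResultsSketch {

    /** Flattens a list holding Rectangle and Rectangle[] elements into single rectangles. */
    static List<Rectangle> flatten(List<?> results) {
        List<Rectangle> flat = new ArrayList<>();
        for (Object item : results) {
            if (item instanceof Rectangle) {
                flat.add((Rectangle) item);
            } else if (item instanceof Rectangle[]) {
                // A match made up of several rectangles, for example one per line
                for (Rectangle part : (Rectangle[]) item) {
                    flat.add(part);
                }
            }
            // Any other element type is ignored in this sketch
        }
        return flat;
    }

    public static void main(String[] args) {
        List<Object> results = new ArrayList<>();
        results.add(new Rectangle(10, 700, 80, 12));
        results.add(new Rectangle[] {
            new Rectangle(10, 600, 80, 12),
            new Rectangle(10, 586, 30, 12)
        });
        System.out.println(flatten(results).size() + " rectangles in total");
    }
}
```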

 

SortedMap findMultipleTermsInRectangleWithMatchingTeasers(int x1, int y1, int x2, int y2, final int rotation, int page_number, String[] terms, int searchType, SearchListener listener)

Algorithm to find multiple text terms in a given rectangle (x1, y1, x2, y2) on page_number, with matching teasers. A listener (an implementation of SearchListener) is required so that searching can be cancelled. The method returns a SortedMap whose keys are objects (a combination of Rectangle and Rectangle[] values) describing the location of found text, each mapped to a String containing the matching teaser.