Extract Text, Images and Metadata from PDF in Java

JPedal is a pure Java PDF library for extracting text, images, and metadata from PDF files.

Requirements

Requirement	Value
Minimum Java version	Java 17
Third-party dependencies	None

What JPedal Can Extract from a PDF

Capability	Output
Text (word list)	Words with bounding box coordinates (x1, y1, x2, y2)
Text (unstructured)	Text as seen on the page
Text (structured)	Logical hierarchy: headings, paragraphs, tables, lists
Text search	Page coordinates of all matches for a search string
Images	JPEG, JPEG2000, JBIG2, PNG and other embedded formats
Metadata	Page count, dimensions, author, date, XMP and DocInfo fields

Extract Text as a Word List

Word list extraction returns each word with its position on the page, making it suitable for indexing, search, and layout-aware text processing.

Extract text as a word list
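
To make the word-plus-coordinates output easier to work with, the values can be grouped into small typed objects. The sketch below is not JPedal code: it assumes, purely for illustration, that the word list arrives as a flat list of strings in which each word is followed by its four coordinates (text, x1, y1, x2, y2). The `WordListSketch` class, the `Word` record and that layout are all hypothetical, so check the JPedal documentation for the actual return type before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

class WordListSketch {

    /** One word plus its bounding box (x1, y1, x2, y2). */
    record Word(String text, float x1, float y1, float x2, float y2) {}

    /**
     * Groups a flat list into Word records.
     * Assumed (hypothetical) layout: text, x1, y1, x2, y2 repeated for each word.
     */
    static List<Word> parse(List<String> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException(
                    "Expected groups of 5 values, got " + flat.size());
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            words.add(new Word(
                    flat.get(i),
                    Float.parseFloat(flat.get(i + 1)),
                    Float.parseFloat(flat.get(i + 2)),
                    Float.parseFloat(flat.get(i + 3)),
                    Float.parseFloat(flat.get(i + 4))));
        }
        return words;
    }
}
```

Once the words are in this form, indexing or layout-aware processing can work with named fields rather than list offsets.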

Extract Text in Rectangle

JPedal can extract unstructured text from a specific rectangular area on a PDF page, useful for targeting content in known locations such as headers, footers or fixed-position fields.

Extract Unstructured text with a rectangle from PDF files
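
The idea of targeting a fixed region can be illustrated without the library itself. The following sketch keeps only the words whose bounding boxes lie entirely inside a given rectangle. It is not JPedal's implementation: the `Word` and `Region` records and the `textIn` helper are hypothetical names, and the sample coordinates assume a 612 x 792 point page.

```java
import java.util.List;
import java.util.stream.Collectors;

class RectangleFilterSketch {

    /** Hypothetical word with its bounding box. */
    record Word(String text, float x1, float y1, float x2, float y2) {}

    /** Axis-aligned region defined by any two opposite corners. */
    record Region(float xa, float ya, float xb, float yb) {

        /** True when the word's box lies entirely inside this region. */
        boolean contains(Word w) {
            float minX = Math.min(xa, xb), maxX = Math.max(xa, xb);
            float minY = Math.min(ya, yb), maxY = Math.max(ya, yb);
            return Math.min(w.x1(), w.x2()) >= minX
                    && Math.max(w.x1(), w.x2()) <= maxX
                    && Math.min(w.y1(), w.y2()) >= minY
                    && Math.max(w.y1(), w.y2()) <= maxY;
        }
    }

    /** Joins the text of all words inside the region, in input order. */
    static String textIn(Region region, List<Word> words) {
        return words.stream()
                .filter(region::contains)
                .map(Word::text)
                .collect(Collectors.joining(" "));
    }
}
```

Because both corners are normalised with min and max, the check does not depend on which corner is given first or on the direction of the y axis.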

Search for Text in a PDF

JPedal can locate text matches within a PDF and return their page coordinates, enabling highlight, redaction, and navigation workflows.

Search for text in a PDF
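
A highlight overlay usually needs match coordinates converted from PDF space, where the origin is at the bottom-left of the page, to a top-left origin used by most drawing surfaces. The sketch below shows one way to do that conversion. It assumes, as an illustration only, that matches arrive as a flat `float[]` in groups of four (x1, y1, x2, y2) in PDF points; the `HighlightSketch` class and `Box` record are hypothetical and not part of JPedal.

```java
import java.util.ArrayList;
import java.util.List;

class HighlightSketch {

    /** Rectangle with a top-left origin, suitable for drawing an overlay. */
    record Box(float x, float y, float width, float height) {}

    /**
     * Converts match coordinates from bottom-left-origin PDF space
     * to top-left-origin boxes.
     * Assumed (hypothetical) layout: x1, y1, x2, y2 repeated for each match.
     */
    static List<Box> toTopLeftBoxes(float[] coords, float pageHeight) {
        if (coords.length % 4 != 0) {
            throw new IllegalArgumentException(
                    "Expected groups of 4 values, got " + coords.length);
        }
        List<Box> boxes = new ArrayList<>();
        for (int i = 0; i < coords.length; i += 4) {
            float left = Math.min(coords[i], coords[i + 2]);
            float right = Math.max(coords[i], coords[i + 2]);
            float bottom = Math.min(coords[i + 1], coords[i + 3]);
            float top = Math.max(coords[i + 1], coords[i + 3]);
            // Flip the y axis: distance from the top edge of the page.
            boxes.add(new Box(left, pageHeight - top, right - left, top - bottom));
        }
        return boxes;
    }
}
```

The same boxes could feed a redaction step or a scroll-to-match feature; any scaling for zoom or DPI would be applied afterwards.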

Extract Structured and Marked Content

For PDF files with tagged content, JPedal can extract the document’s logical structure, including headings, paragraphs, tables and reading order, as well as the raw marked content tree.

Extract structured content from a PDF

Extract Images from a PDF

JPedal extracts images from a PDF in a format of your choice, including JPEG, JPEG2000, PNG and other raster formats.

Extract images from a PDF

Extract Metadata

JPedal provides access to document metadata including page count, dimensions, author, creation date, and all standard XMP and DocInfo fields such as title, subject, keywords and producer.

Extract PDF metadata

Extract Text from the Command Line or Another Language

java -jar jpedal.jar --extractText "inputFile.pdf" "outputDir"

The command line interface can be invoked from any language that supports child processes, including Python, Node.js, C#, and shell scripts.
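
As a minimal sketch of the child-process approach, the example below builds the command shown above as an argument list and launches it with `ProcessBuilder`. It is written in Java for consistency with the rest of this page; the same pattern applies to the process APIs of other languages. The `CommandLineSketch` class and its method names are hypothetical, and the jar path is assumed to be resolvable from the working directory.

```java
import java.io.IOException;
import java.util.List;

class CommandLineSketch {

    /** Builds the argument list for the text extraction command shown above. */
    static List<String> extractTextCommand(String jarPath, String inputPdf, String outputDir) {
        return List.of("java", "-jar", jarPath, "--extractText", inputPdf, outputDir);
    }

    /** Starts the command as a child process and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the child's stdout and stderr to this process
                .start();
        return process.waitFor();
    }
}
```

Passing each argument as a separate list element avoids shell quoting problems with file names that contain spaces.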

Frequently Asked Questions

Can JPedal extract text from any PDF file?

JPedal extracts text from PDFs that contain actual text content. Scanned PDFs consisting entirely of images require OCR, which is outside JPedal’s scope.

Does JPedal return text with position information?

Yes. Word list extraction returns each word with its bounding box coordinates, making it possible to reconstruct reading order, identify columns, or map words back to their visual location.
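
One simple way to reconstruct reading order from bounding boxes is to group words into lines by vertical position and then sort each line from left to right. The sketch below does this for a single-column page. It is an illustration rather than JPedal functionality: the `Word` record, the `toLines` helper and the tolerance value are hypothetical, and it assumes PDF-style coordinates in which y increases towards the top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical word with its bounding box. */
    record Word(String text, float x1, float y1, float x2, float y2) {}

    /**
     * Groups words into lines (top of page first) and sorts each line left to right.
     * Words whose top edges are within 'tolerance' of the line's first word share a line.
     */
    static List<List<Word>> toLines(List<Word> words, float tolerance) {
        List<Word> sorted = new ArrayList<>(words);
        // Larger y means higher on the page, so sort by top edge, descending.
        sorted.sort(Comparator
                .comparingDouble((Word w) -> Math.max(w.y1(), w.y2()))
                .reversed());

        List<List<Word>> lines = new ArrayList<>();
        float lineTop = 0f;
        for (Word w : sorted) {
            float top = Math.max(w.y1(), w.y2());
            if (lines.isEmpty() || lineTop - top > tolerance) {
                lines.add(new ArrayList<>()); // start a new line
                lineTop = top;
            }
            lines.get(lines.size() - 1).add(w);
        }
        for (List<Word> line : lines) {
            line.sort(Comparator.comparingDouble((Word w) -> Math.min(w.x1(), w.x2())));
        }
        return lines;
    }
}
```

Multi-column layouts need an extra step, such as splitting words by x position into columns before grouping them into lines.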

Can JPedal extract text from a specific page range?

Yes. Use setPageRange() with a PageRanges value before calling openPDFFile(). For example, utilities.setPageRange(new PageRanges("1-5,8")) extracts text from pages 1 through 5 and page 8.
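
To show what a range string such as "1-5,8" selects, the sketch below expands it into an explicit list of page numbers. This is not JPedal's `PageRanges` implementation, which may accept additional syntax; the `PageRangeSketch` class and `expand` method are hypothetical and handle only comma-separated single pages and ascending `start-end` ranges.

```java
import java.util.ArrayList;
import java.util.List;

class PageRangeSketch {

    /** Expands a string such as "1-5,8" into the page numbers it selects. */
    static List<Integer> expand(String ranges) {
        List<Integer> pages = new ArrayList<>();
        for (String part : ranges.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            if (dash < 0) {
                // Single page, for example "8".
                pages.add(Integer.parseInt(token));
            } else {
                // Inclusive range, for example "1-5".
                int start = Integer.parseInt(token.substring(0, dash).trim());
                int end = Integer.parseInt(token.substring(dash + 1).trim());
                if (start > end) {
                    throw new IllegalArgumentException("Descending range: " + token);
                }
                for (int page = start; page <= end; page++) {
                    pages.add(page);
                }
            }
        }
        return pages;
    }
}
```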

Can JPedal search for text and return its location on the page?

Yes. FindTextInRectangle returns the coordinates of all matches for a search string across all pages. The results can be used to drive highlight overlays, redaction workflows, or navigation.

How does JPedal handle fonts and encoding in text extraction?

JPedal resolves font encoding internally, including Type1, TrueType, CIDFont, and embedded subsets. Most encoding edge cases are handled transparently.

Can JPedal extract text from password-protected PDF files?

Yes. Call utilities.setPassword("yourPassword") before openPDFFile().

Does JPedal require any third-party libraries for PDF extraction?

No. JPedal is a 100% Java solution with no required third-party dependencies. It needs no native binaries or external tools.

Can I extract text from a PDF without writing Java code?

Yes. JPedal includes a command-line interface for text extraction. See the command line section above for syntax and language examples.

What Java version is required?

JPedal requires Java 17 as a minimum. For details, see "Which Java versions does JPedal support?"

What is the difference between word list extraction and unstructured text extraction in JPedal?

Word list extraction returns words with position data, suited to search, redaction and indexing. Unstructured text extraction returns decoded text without position data, useful for simple text harvesting and AI ingestion.

Can extracted text be used as input for AI or LLM workflows?

Yes. JPedal can output extracted text as plain text, Markdown, XML, JSON, and several other formats, which can be passed directly to AI processing pipelines. See "Extract structured content from a PDF" for output format options.

What image formats can JPedal extract from a PDF?

JPedal extracts images from a PDF in a format of your choice. Supported output formats include JPEG, JPEG2000, PNG and other raster formats.

What metadata can JPedal extract from a PDF?