Extract Text, Images and Metadata from PDF Files in Java
JPedal is a pure Java PDF library that lets Java developers extract text, images and metadata from PDF files, including multiple versions of images, word positions, font information, marked content, and raw character streams.
Extract Metadata
JPedal provides access to document-level metadata including page count, page dimensions, author, creation date, and all XMP and DocInfo fields.
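DocInfo dates in a PDF are stored in the PDF date-string form, for example D:20240131120000+01'00'. If a creation date reaches your code in that raw form (an assumption; JPedal may already hand back a parsed or formatted value), a minimal sketch for converting it could look like the following. The class PdfDateSketch and its methods are hypothetical helpers, not part of JPedal, and treating a missing time zone as UTC is a simplification.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical helper, not a JPedal class: parses a raw PDF date string.
final class PdfDateSketch {

    // Accepts strings such as D:20240131120000+01'00'. Only the year is
    // mandatory in the PDF date format; missing fields fall back to defaults.
    static OffsetDateTime parse(String raw) {
        String s = raw.startsWith("D:") ? raw.substring(2) : raw;

        // Split the leading run of digits from the optional time zone suffix.
        int i = 0;
        while (i < s.length() && Character.isDigit(s.charAt(i))) {
            i++;
        }
        String digits = s.substring(0, i);
        String zone = s.substring(i);
        if (digits.length() < 4) {
            throw new IllegalArgumentException("No year in PDF date: " + raw);
        }

        int year = Integer.parseInt(digits.substring(0, 4));
        int month = field(digits, 4, 1);
        int day = field(digits, 6, 1);
        int hour = field(digits, 8, 0);
        int minute = field(digits, 10, 0);
        int second = field(digits, 12, 0);
        return OffsetDateTime.of(year, month, day, hour, minute, second, 0, parseZone(zone));
    }

    // Reads a two-digit field at the given offset, or returns the fallback.
    private static int field(String digits, int start, int fallback) {
        return digits.length() >= start + 2
                ? Integer.parseInt(digits.substring(start, start + 2))
                : fallback;
    }

    // Turns "+01'00'", "-05'00'" or "Z" into an offset.
    // Assumption: an absent zone is treated as UTC.
    private static ZoneOffset parseZone(String zone) {
        String cleaned = zone.replace("'", "");
        if (cleaned.length() < 3 || cleaned.startsWith("Z")) {
            return ZoneOffset.UTC;
        }
        int hours = Integer.parseInt(cleaned.substring(1, 3));
        int minutes = cleaned.length() >= 5 ? Integer.parseInt(cleaned.substring(3, 5)) : 0;
        int total = hours * 3600 + minutes * 60;
        return ZoneOffset.ofTotalSeconds(cleaned.charAt(0) == '-' ? -total : total);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20240131120000+01'00'"));
    }
}
```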
Extract Images from a PDF
In addition to text, JPedal can extract the raw image data embedded within a PDF file.
Extract Text as a Word List
Word list extraction returns each word with its position on the page, making it suitable for indexing, search, and layout-aware text processing.
Search for Text in a PDF
JPedal can locate text matches within a PDF and return their page coordinates, enabling highlight, redaction, and navigation workflows.
Extract Structured and Marked Content
For PDF files with tagged content, JPedal can extract the document’s logical structure — headings, paragraphs, tables, and reading order — as well as the raw marked content stream.
Extract structured content from a PDF
Frequently Asked Questions
Can JPedal extract text from any PDF file?
JPedal extracts text from PDF files that contain actual text content. Scanned PDFs that consist entirely of rasterised images do not contain extractable text — those require OCR, which is outside JPedal’s scope. If a PDF was generated from a Word document, a layout application, or programmatically, the text will be extractable.
Does JPedal return text with position information?
Yes. Word list extraction returns each word alongside its bounding box coordinates on the page. This makes it possible to reconstruct reading order, identify columns, or map extracted words back to their visual location.
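As an illustration of what position data makes possible, the sketch below groups words into lines and orders them left to right. It does not call JPedal: the Word record is a hypothetical holder for one extracted word and its bounding box, and it assumes PDF-style coordinates where y increases up the page. It handles a single column only; multi-column layouts would need an extra pass that splits on horizontal gaps.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: rebuilds lines of text from words with bounding boxes.
final class ReadingOrderSketch {

    // Hypothetical holder for one word; coordinates assume the origin is at
    // the bottom-left of the page, so a larger 'bottom' is higher up.
    record Word(String text, float left, float top, float right, float bottom) {}

    // Groups words whose bottom edges lie within 'tolerance' of the first
    // word on the line, then orders each line left to right.
    static List<String> toLines(List<Word> words, float tolerance) {
        List<Word> sorted = new ArrayList<>(words);
        // Highest words first, so lines come out top to bottom.
        sorted.sort(Comparator.<Word>comparingDouble(Word::bottom).reversed());

        List<String> lines = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        for (Word word : sorted) {
            boolean startsNewLine = !current.isEmpty()
                    && Math.abs(current.get(0).bottom() - word.bottom()) > tolerance;
            if (startsNewLine) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(word);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    // Orders the words of one line by their left edge and joins them.
    private static String join(List<Word> line) {
        return line.stream()
                .sorted(Comparator.<Word>comparingDouble(Word::left))
                .map(Word::text)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
                new Word("world", 60f, 712f, 95f, 700f),
                new Word("Second", 10f, 692f, 50f, 680f),
                new Word("Hello", 10f, 712f, 50f, 700f),
                new Word("line", 55f, 692.5f, 80f, 680.5f));
        toLines(words, 2f).forEach(System.out::println);
    }
}
```

The tolerance absorbs small baseline differences between words on the same line; a sensible value depends on the font sizes in the document.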
Can JPedal extract text from a specific page range?
Yes. Use setPageRange() with a PageRanges value before calling openPDFFile(). For example, utilities.setPageRange(new PageRanges("1-5,8")) restricts extraction to pages 1 through 5 and page 8.
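To make the range syntax concrete, the sketch below expands a string such as "1-5,8" into the page numbers it selects. It is a stand-alone illustration of the semantics described above, not JPedal's PageRanges implementation, which may accept more forms than the two handled here. PageRangeSketch and expand are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how a range string maps to page numbers.
final class PageRangeSketch {

    // Expands comma-separated entries, each either a single page ("8")
    // or an inclusive span ("1-5"), into an ordered list of pages.
    static List<Integer> expand(String spec) {
        List<Integer> pages = new ArrayList<>();
        for (String part : spec.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            if (dash < 0) {
                pages.add(Integer.parseInt(token));
            } else {
                int first = Integer.parseInt(token.substring(0, dash).trim());
                int last = Integer.parseInt(token.substring(dash + 1).trim());
                for (int page = first; page <= last; page++) {
                    pages.add(page);
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(expand("1-5,8")); // prints [1, 2, 3, 4, 5, 8]
    }
}
```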
Can JPedal search for text and return its location on the page?
Yes. FindTextInRectangle returns the coordinates of all matches for a search string across all pages. The results can be used to drive highlight overlays, redaction workflows, or navigation.
How does JPedal handle fonts and encoding in text extraction?
JPedal resolves font encoding internally, including Type1, TrueType, CIDFont, and embedded subsets. Most encoding edge cases are handled transparently.
Can JPedal extract text from password-protected PDF files?
Yes. Call utilities.setPassword("yourPassword") before openPDFFile().
Does JPedal require any third-party libraries for extraction?
No. JPedal is a 100% Java solution with no required third-party dependencies, native binaries, or external tools.
Can I extract text without writing Java code?
Yes. JPedal includes a command-line interface for text extraction: java -jar jpedal.jar --wordlist "inputFile.pdf" "outputDir". This can be invoked from Python, Node.js, shell scripts, or any language that supports child processes.
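The same command can also be launched as a child process from another JVM application. The sketch below builds the argument list shown above and runs it with ProcessBuilder. The class and method names are hypothetical, and the paths are placeholders to replace with your own.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper that runs the word list command as a child process.
final class WordListCliSketch {

    // Builds: java -jar <jar> --wordlist <pdf> <outputDir>
    // Passing arguments as a list avoids shell quoting issues with spaces.
    static List<String> command(Path jar, Path pdf, Path outputDir) {
        return List.of("java", "-jar", jar.toString(),
                "--wordlist", pdf.toString(), outputDir.toString());
    }

    // Starts the process, forwards its output to this console, and
    // returns the exit code once it finishes.
    static int run(Path jar, Path pdf, Path outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(jar, pdf, outputDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: WordListCliSketch <jpedal.jar> <input.pdf> <outputDir>");
            return;
        }
        int exitCode = run(Path.of(args[0]), Path.of(args[1]), Path.of(args[2]));
        System.out.println("exit code: " + exitCode);
    }
}
```

The same argument list carries over to child-process APIs in other languages, since only the java executable and the jar are involved.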
What Java version is required?
JPedal requires Java 17 as a minimum. A separate Java 8 build is available for projects that cannot upgrade — see Which Java versions does JPedal support?
What is the difference between word list extraction and raw text extraction?
Word list extraction returns words with position data and is suited to layout-aware use cases such as search, redaction, and indexing. Raw text extraction returns the character stream in the order it appears in the PDF content stream, which may not match visual reading order but is useful for simple text harvesting or AI ingestion pipelines.
Can extracted text be used as input for AI or LLM workflows?
Yes. JPedal can output extracted text as plain text, XML, JSON, and several other formats, which can be passed directly to AI processing pipelines. See Extract structured content for output format options.