How to extract text from PDF files as JSON in Java (Tutorial)

Extract text from PDF files as JSON programmatically using JPedal

TL;DR

  • Goal: Extract text from PDF files as JSON programmatically in Java.
  • Tool: Use the JPedal library.
  • Process: Structured text extraction that preserves marked content and tags using specific methods like writeAllStructuredTextOutlinesToDir().
  • Outcome: JSON files representing any structured content found in the PDF.

Introduction

Some PDF files are "tagged", which means they contain information about the logical structure of the document. This structure is embedded as metadata within the PDF and is made up of a hierarchy of tags that label elements such as headings, paragraphs, lists, tables, and images.

This is very similar to HTML where text is contained within elements that have meaning, such as <p> for paragraph, or <table> for table.

If a PDF file does contain structured content (also known as marked content), then it can be processed and converted into other formats.
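
As a rough pre-check, the sketch below scans a file's raw bytes for the /StructTreeRoot key, which a tagged PDF's document catalog uses to reference its structure tree. This is only a heuristic and not part of JPedal: the class and method names are hypothetical, and the check can miss tagged files whose catalog is stored in a compressed object stream. Finding the key also does not guarantee that the tags are meaningful.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not a JPedal class.
final class TaggedPdfCheck {

    /** Returns true if the raw PDF bytes contain the /StructTreeRoot key. */
    static boolean looksTagged(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary content survives the conversion
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    /** Convenience overload that reads the whole file into memory first. */
    static boolean looksTagged(Path pdf) throws IOException {
        return looksTagged(Files.readAllBytes(pdf));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: java TaggedPdfCheck <file.pdf>");
            return;
        }
        Path pdf = Paths.get(args[0]);
        System.out.println(pdf + (looksTagged(pdf)
                ? ": structure tree key found, likely tagged"
                : ": no structure tree key found in the uncompressed bytes"));
    }
}
```

A negative result is inconclusive, so running the conversion itself remains the reliable test.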

What is JSON?

JSON (short for JavaScript Object Notation) is a lightweight, text-based format used for storing and exchanging structured data between systems.

It represents data as key-value pairs and arrays, making it easy for both machines and developers to read and write. JSON is commonly used in web applications to transmit data between a server and a client, and is supported in most programming languages, either natively or through widely used libraries.

Despite its origin in JavaScript, JSON is language-independent and has become a universal data format across APIs and software systems.
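
To make the key-value and array structure concrete, here is a minimal sketch that serializes nested Java maps and lists to JSON text using only the standard library. The field names ("tag", "text", "children") are invented for illustration and are not JPedal's output schema; in a real project a JSON library would normally handle this.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny serializer showing how objects and arrays nest in JSON.
final class JsonSketch {

    /** Serializes maps, lists, strings, numbers, booleans and null (NaN and infinities are not handled). */
    static String toJson(Object value) {
        if (value == null) return "null";
        if (value instanceof String) return quote((String) value);
        if (value instanceof Number || value instanceof Boolean) return value.toString();
        if (value instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (!first) sb.append(',');
                // JSON object keys are always strings
                sb.append(quote(String.valueOf(e.getKey()))).append(':').append(toJson(e.getValue()));
                first = false;
            }
            return sb.append('}').toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (Object item : (List<?>) value) {
                if (!first) sb.append(',');
                sb.append(toJson(item));
                first = false;
            }
            return sb.append(']').toString();
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    /** Wraps a string in double quotes and escapes characters JSON does not allow raw. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> heading = new LinkedHashMap<>();
        heading.put("tag", "H1");                      // hypothetical field names
        heading.put("text", "Quarterly report");

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "example");
        doc.put("pages", 2);
        doc.put("children", Arrays.asList(heading));   // an array of nested objects

        System.out.println(toJson(doc));
        // prints {"title":"example","pages":2,"children":[{"tag":"H1","text":"Quarterly report"}]}
    }
}
```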

PDF vs JSON

The two formats serve distinct purposes. PDF preserves the fixed visual layout of a document, making it ideal for sharing print-ready content like reports, contracts and official documents. It is widely used when consistent appearance and layout across devices are critical, and is one of the most universally supported document formats.

In contrast, JSON is designed for data exchange and storage, not presentation. JSON is commonly used in web development, APIs, and software applications to send structured data like user profiles, settings, or real-time content between servers and clients.

JSON files can be opened and edited directly in text editors such as VS Code without disturbing their structure, which is not the case for PDFs. Its explicit structure also makes JSON well suited to processing by AI technologies such as LLMs.

Recently, we added PDF to JSON support to JPedal. If your PDF file contains structured content (how do I know?), then JPedal will be able to convert it to JSON.

Getting Started

  1. Add JPedal to your class or module path (download the trial jar)
  2. Create an ExtractStructuredTextProperties object and set the output format to OutputModes.JSON
  3. Call one of the methods from ExtractStructuredText

How to extract text from a PDF as JSON

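// Use null for unencrypted files; otherwise supply the document password
String password = null;
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.JSON);
// Assumption: the final argument takes the properties object (the original snippet passed null and left it unused); confirm against the JPedal Javadoc
ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFile.pdf", password, "outputFolder", null, properties);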
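
Once the conversion has run, a quick way to confirm that something was written is to list the JSON files in the output folder. The sketch below uses only the Java standard library. It assumes the generated files carry a .json extension and sit somewhere under "outputFolder"; the exact file naming is not specified here, and the class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not a JPedal class.
final class JsonOutputLister {

    /** Returns every regular file under outputDir whose name ends in .json, sorted by path. */
    static List<Path> findJsonFiles(Path outputDir) throws IOException {
        // Files.walk holds directory handles open, so close the stream when done
        try (Stream<Path> paths = Files.walk(outputDir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".json"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: same folder name as passed to the conversion call
        for (Path json : findJsonFiles(Paths.get("outputFolder"))) {
            System.out.println(json + " (" + Files.size(json) + " bytes)");
        }
    }
}
```

An empty list after a successful run is a hint that the input PDF may not contain structured content.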

Summary

This guide demonstrated how to convert structured PDF files into JSON format using just a few lines of Java code. It also highlighted the key differences between PDF and JSON to help you determine which format best suits your needs.

For more in-depth insights into PDFs, feel free to explore our other articles. We've been working with the format for over a decade!

Frequently Asked Questions

Can all PDF files be converted to JSON?

No. Only PDFs that contain structured or tagged content can be accurately converted into structured JSON. Untagged PDFs may still allow plain text extraction, but the semantic structure of the document is often lost.

What does "tagged PDF" mean?

A tagged PDF includes metadata describing the logical structure of the document, such as headings, paragraphs, tables, lists, and images. This makes the document more accessible and easier to process programmatically.

Why convert PDF to JSON?

Converting PDF content to JSON makes it easier to integrate document data into applications, APIs, databases, search systems, and AI workflows. JSON is lightweight, structured, and widely supported across programming languages.

Can JSON preserve the visual layout of a PDF?

Not completely. JSON focuses on representing structured data rather than visual appearance. Structural relationships can be preserved, but the exact page layout and formatting of the original PDF are not captured.

Does JPedal support extracting structured text in Java?

Yes. JPedal provides Java APIs for extracting structured PDF content and exporting it as JSON using the ExtractStructuredText functionality.

Is JSON suitable for AI and LLM processing?

Yes. JSON is particularly well suited for AI pipelines and large language models because the structured format makes document content easier to parse, chunk, index, and analyse programmatically.