Extract Structured Content as YAML

JPedal supports extracting content as YAML from structured PDF files. This is useful for content pipelines, accessibility tooling, document archiving, and integrating PDF content into other systems.

Structured and Unstructured PDF Files

A PDF file may be tagged, contain structured text, or contain marked content. These terms are often used interchangeably, but in practice they refer to PDF files which contain information about the page structure and its elements. In this article we will refer to these as ‘structured’ PDF files.

Structured PDF files contain metadata tags (similar to HTML tags) that preserve the structure of the textual content in a PDF. Whether a PDF is structured is decided when it is created, so it is generally not feasible to convert an unstructured PDF file into a structured one.

The key advantage of structured PDF files is that their content can be extracted and transformed into other formats with high fidelity. JPedal currently supports outputting this content as EPUB, HTML, JSON, Markdown, XML, or YAML.

Structured content may also include images, which are called figures. If a figure has alt text or ActualText, that text will be present in the output as well. If you are using the Java API methods, you may provide a scaling value for the images.

When No Structured Content Is Present

JPedal can extract all structured text present in a PDF file. If no structured content is found, the output file will contain a brief message explaining that no content was available.

Extract Structured Content as YAML from a PDF in Java

// Options
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.YAML); // output format

// Extract structured text as YAML
ExtractStructuredText.writeAllStructuredTextOutlinesToDir(
        "inputFileOrFolder", // a single PDF or a folder containing PDFs
        "password",          // the password or null if not required
        "outputFolder",      // the output folder for the YAML
        null,                // error callback
        properties           // our settings object
);

// Extract structured text and images as YAML
ExtractStructuredText.writeAllStructuredTextOutlinesAndFiguresToDir(
        "inputFileOrFolder", // a single PDF or a folder containing PDFs
        "password",          // the password or null if not required
        "outputFolder",      // the output folder for the YAML
        null,                // error callback
        properties,          // our settings object
        "figuresFolder",     // the output folder for the images
        "imageFormat",       // the format for the images
        1.0f                 // the scaling for the images
);

Extract Structured Content as YAML from a PDF from the Command Line or Another Language

To extract structured content from the command line or from a language other than Java, use the ExtractStructuredText class directly:

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "yaml"

We recommend using the module path, but you can still use the classpath if you prefer.
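
When another program needs to launch this command, it can help to build the argument list in code rather than concatenate a shell string. The following is a minimal sketch using only the Java standard library. The class name StructuredYamlCommand and its build method are hypothetical helpers for illustration and are not part of JPedal. The sketch also assumes the JPedal jar is available in the given module path.

```java
import java.util.List;

// Hypothetical helper (not part of JPedal): assembles the command line shown above.
final class StructuredYamlCommand {

    // Returns the arguments for the ExtractStructuredText command-line example.
    // Assumption: modulePath is a directory that contains the JPedal jar.
    static List<String> build(String modulePath, String input, String outputDir) {
        return List.of(
                "java",
                "--module-path", modulePath,
                "--add-modules", "com.idrsolutions.jpedal",
                "org/jpedal/examples/text/ExtractStructuredText",
                input,      // a single PDF or a folder containing PDFs
                outputDir,  // the output folder for the YAML
                "yaml");    // the output format
    }

    public static void main(String[] args) {
        List<String> command = build(".", "inputFileOrFolder", "outputFolder");
        System.out.println(String.join(" ", command));
        // To actually launch it (requires the JPedal jar in the module path):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Passing each argument as a separate list element avoids quoting problems when file or folder names contain spaces. The same approach carries over to the process-launching facilities of other languages.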

Javadocs

This example uses the JPedal ExtractStructuredText class.

We also have a demo project on our GitHub page.

Frequently Asked Questions

What is a structured PDF?

A structured PDF is a PDF file that contains metadata tags describing the document’s logical structure, such as headings, paragraphs, lists, tables, and figures. These tags are similar in purpose to HTML tags and allow the content to be read and extracted programmatically. Such a file may also be described as a tagged PDF or as containing marked content.

How do I know if my PDF is structured?

Open the PDF in a viewer that supports tag inspection (such as Adobe Acrobat or JPedal) and check the Tags panel. If it shows a tag tree, the PDF is structured. Alternatively, run JPedal’s extraction: if the output file contains only the “no content available” message, the PDF is likely unstructured.

Can I convert an unstructured PDF to a structured one?

Generally no. Structure must be embedded when the PDF is created. Some tools can add basic tags to an existing PDF, but the results are rarely accurate enough for reliable content extraction.

Which output format should I use?

  • Use HTML or EPUB for content that will be read by humans or assistive technologies.
  • Use JSON, XML, or YAML for content that will be consumed by an application or pipeline.
  • Use Markdown for content that will be fed into an LLM or published to a documentation platform or wiki.

Why JPedal?

  • Actively developed commercial library with full support and no third-party dependencies.
  • Process PDF files up to 3x faster than alternative Java PDF libraries.
  • Simple licensing options and source code access for OEM users.
