Application Launcher

Date Published: February 25, 2022

Issue

While indexing files, you receive this error:

Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content

Resolution

You can check what Solr sees by running the Tika extractor manually:

Install Java.
Download https://archive.apache.org/dist/tika/tika-app-0.10.jar
Download the PDF in question.
Run the following command:
- java -jar tika-app-0.10.jar {filename-to-test}
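
The steps above can be strung together in a small shell sketch. This is only an illustration under a few assumptions: curl is installed, the jar is saved in the current directory, and the helper name run_tika and its return codes are hypothetical rather than part of Tika or Solr.

```shell
#!/bin/sh
# Sketch: run the Tika extractor by hand against one PDF.
# TIKA_JAR and TIKA_URL mirror the download step above.
# run_tika and its return codes are hypothetical, chosen for illustration.
TIKA_JAR="tika-app-0.10.jar"
TIKA_URL="https://archive.apache.org/dist/tika/$TIKA_JAR"

run_tika() {
    pdf="$1"

    # Fail early if the file to test is missing or unreadable.
    if [ ! -r "$pdf" ]; then
        echo "Cannot read file: $pdf" >&2
        return 2
    fi

    # Java must be installed and on the PATH.
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found; install Java first" >&2
        return 3
    fi

    # Download the Tika jar once (assumes curl is available).
    if [ ! -f "$TIKA_JAR" ]; then
        curl -fL -o "$TIKA_JAR" "$TIKA_URL" || return 4
    fi

    # Same command as in the steps above.
    java -jar "$TIKA_JAR" "$pdf"
}
```

After sourcing the file, call run_tika with the path to the PDF in question, for example run_tika problem.pdf. If the file is the culprit, the same TikaException should appear in the output, outside of Solr.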

Cause

Tika can give this error for many reasons. Here are a few common ones:

The PDF could be password-protected.
The PDF could be too large.
The PDF could use an incompatible format or PDF version.
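
A quick way to narrow down these causes is to inspect the file before handing it to Tika. The sketch below is a rough heuristic, not a definitive test: it prints the file size, the PDF header (which carries the version), and whether the raw bytes contain an /Encrypt entry, which encrypted PDFs normally have. The helper name inspect_pdf and its output labels are hypothetical, and head -c and grep -a are assumed to be available (they are in GNU and BSD userlands).

```shell
#!/bin/sh
# Sketch: quick checks for the possible causes listed above.
# inspect_pdf and its output labels are hypothetical, for illustration only.
inspect_pdf() {
    pdf="$1"

    if [ ! -r "$pdf" ]; then
        echo "Cannot read file: $pdf" >&2
        return 2
    fi

    # File size in bytes; what counts as "too big" depends on your setup.
    size=$(wc -c < "$pdf" | tr -d ' ')
    echo "size_bytes=$size"

    # A PDF starts with a header such as "%PDF-1.7" (first 8 bytes).
    header=$(head -c 8 "$pdf" | LC_ALL=C tr -cd '%A-Za-z0-9.-')
    echo "header=$header"

    # Heuristic: encrypted PDFs normally contain an /Encrypt entry.
    if LC_ALL=C grep -a -q '/Encrypt' "$pdf"; then
        echo "encrypt_entry=found"
    else
        echo "encrypt_entry=not_found"
    fi
}
```

For example, inspect_pdf input.pdf prints three lines. A header that does not start with %PDF- suggests the file is not a valid PDF at all, and encrypt_entry=found suggests checking for password protection first.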

To rule out a version incompatibility, convert the PDF that triggers the error to an earlier PDF version. A sample Ghostscript (https://www.ghostscript.com/) command such as the following does this, writing output.pdf at PDF compatibility level 1.5:

$ gs                        \
   -sDEVICE=pdfwrite        \
   -dCompatibilityLevel=1.5 \
   -o output.pdf            \
   input.pdf

Troubleshooting "Unable to extract PDF content"
