Date Published: February 25, 2022

Troubleshooting "Unable to extract PDF content"

Issue

While indexing files, you receive this error:

Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content

Resolution

You can check what Solr sees by running the Tika extractor manually on the file:

  1. Install Java.
  2. Download https://archive.apache.org/dist/tika/tika-app-0.10.jar
  3. Download the PDF in question.
  4. Run the following command:
    • java -jar tika-app-0.10.jar {filename-to-test}
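
When several files need checking, the manual run above can be wrapped in a small shell function that keeps the extracted text and the error output in separate files. This is only a sketch: the function name check_with_tika and the output file names are hypothetical, and it assumes the failure shows up as a TikaException line on standard error or as a non-zero exit status from java, which is worth confirming against your own run.

```shell
#!/bin/sh
# Sketch: run the Tika app on one file and report whether extraction failed.
# Usage: check_with_tika TIKA_JAR PDF_FILE
check_with_tika() {
    jar=$1
    pdf=$2
    out="${pdf}.tika-out.txt"   # hypothetical name for the extracted content
    log="${pdf}.tika-err.log"   # hypothetical name for the error output

    # Same command as in step 4; stdout and stderr go to separate files.
    java -jar "$jar" "$pdf" > "$out" 2> "$log"
    status=$?

    # Assumption: a failed extraction logs a TikaException on stderr.
    if grep -q 'TikaException' "$log"; then
        echo "FAILED: Tika reported an exception, see $log"
        return 1
    elif [ "$status" -ne 0 ]; then
        echo "FAILED: java exited with status $status, see $log"
        return 1
    fi
    echo "OK: extracted content written to $out"
}
```

For example, check_with_tika tika-app-0.10.jar {filename-to-test} prints one status line and leaves the full stack trace, if any, in the log file next to the PDF.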

Cause

Tika can give this error for many reasons. Here are a few common ones:

  • The PDF could be password-protected.
  • The file could be too large.
  • The PDF could be in a version or format that Tika does not support.
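
Before converting anything, a quick look at the file itself can hint at which of these causes applies. The sketch below prints the file size, the PDF version from the file header, and whether an /Encrypt entry appears in the file. The function names are hypothetical and the checks are heuristics rather than a real PDF parser: in particular, an /Encrypt entry usually indicates an encrypted PDF but does not tell you whether a password is needed to open it.

```shell
#!/bin/sh
# Sketch: quick pre-checks for a PDF that Tika fails on (heuristics only).

# Print the file size in bytes.
pdf_size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Print the PDF version from the file header, e.g. "1.7".
# A PDF file starts with a line of the form "%PDF-<major>.<minor>".
pdf_version() {
    head -c 16 "$1" | LC_ALL=C sed -n '1s/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Succeed if the file contains an /Encrypt entry (heuristic for encryption).
pdf_looks_encrypted() {
    LC_ALL=C grep -q '/Encrypt' "$1"
}

# Print a short report covering the three causes listed above.
pdf_precheck() {
    echo "size (bytes): $(pdf_size_bytes "$1")"
    echo "header version: $(pdf_version "$1")"
    if pdf_looks_encrypted "$1"; then
        echo "encryption: /Encrypt entry found"
    else
        echo "encryption: no /Encrypt entry found"
    fi
}
```

Running pdf_precheck {filename-to-test} after loading these functions gives a three-line report. An empty header version suggests the file does not start with a standard PDF header at all.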

To rule out a version incompatibility, you can convert the PDF file that is generating the error to an earlier PDF version, then test the converted file again. You can use something like this sample Ghostscript (https://www.ghostscript.com/) command to achieve this:

$ gs                        \
   -sDEVICE=pdfwrite        \
   -dCompatibilityLevel=1.5 \
   -o output.pdf            \
   input.pdf
