Date Published: August 24, 2023

Issues with large attachments and Solr search

Acquia Search is a complex platform for hosting Solr indexes, and is not infinitely scalable. There are challenges both on the Drupal side and the server side that must be identified and considered when indexing large files.

Note: In general, indexing large files is not recommended.

Background: How the Indexing process works with files

When indexing attachments with Solr (including Acquia Search), the following three steps occur:

  1. Drupal uses either the Apache Solr Attachments module or the Search API attachments module to send the file for text extraction. The destination depends on where the website is hosted:
    • Website on Acquia Cloud - the Solr+Tika backend, a version of Solr that has been compiled to include Tika. You configure your modules to send a POST request to Solr at a special URL with the original document's binary data.
    • Website not hosted by Acquia - the Tika command, a stand-alone Tika executable that requires Java.

    Either of these backends returns extracted text. See the issues and limitations.

  2. Drupal receives the extracted text and stores it locally in the database until it's needed at indexing time. This data also serves as a cache if other Drupal nodes reference the same file. See the issues and limitations.
  3. During Drupal indexing (cron runs or specific drush calls), the extracted text data from the database is then added by the ApacheSolr or Search API Solr Search modules to the data object to be sent to Solr. See the issues and limitations.
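As a rough sketch of step 1 for a site on Acquia Cloud: Drupal's modules POST the raw document bytes to the Solr+Tika backend and read back the extracted text. The base URL and the `/update/extract` handler path below are illustrative assumptions modeled on stock Solr's ExtractingRequestHandler, not Acquia Search's actual endpoint details:

```python
# Sketch: POSTing a document's raw bytes to a Solr+Tika extraction endpoint.
# The URL and handler path are assumptions based on stock Solr; your actual
# Acquia Search connection details will differ.
from urllib import parse, request

def build_extract_url(solr_base):
    # extractOnly=true asks Solr to return the extracted text rather than
    # indexing the document immediately.
    params = parse.urlencode({"extractOnly": "true", "wt": "json"})
    return solr_base.rstrip("/") + "/update/extract?" + params

def extract_text(solr_base, file_path, timeout=30):
    # A large file can fail here in three ways: the request body exceeds a
    # size limit, the HTTP request times out, or Tika runs out of memory.
    with open(file_path, "rb") as fh:
        req = request.Request(
            build_extract_url(solr_base),
            data=fh.read(),
            headers={"Content-Type": "application/octet-stream"},
        )
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

The three failure modes noted in the comment are exactly the ones described under "External Solr instances" below.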

Issues and limitations

This section describes some of the limitations to the indexing process when dealing with large amounts of data.

External Solr instances

If your Solr instance is external to your website, like Acquia Search is, and you communicate with it using HTTP, you can encounter the following problems when the file is being sent to the Tika service:

  • Hitting a limit when sending too-large files
  • Hitting a timeout limit during your HTTP request
  • Running out of memory on the Tika backend

Websites not hosted by Acquia can work around these issues by running a locally hosted Java Tika application. Acquia Cloud does not run Java.
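For a site not hosted by Acquia, the stand-alone Tika workaround might be wired up along these lines. The jar path is an assumption for illustration; point it at your downloaded tika-app.jar:

```python
# Sketch: running a locally hosted, stand-alone Tika executable instead of
# sending files over HTTP to a remote Solr+Tika backend.
import shutil
import subprocess

def tika_command(jar_path, file_path):
    # `--text` tells the Tika CLI to print plain extracted text to stdout.
    return ["java", "-jar", jar_path, "--text", file_path]

def extract_locally(jar_path, file_path):
    # The stand-alone Tika executable requires Java on the host.
    if shutil.which("java") is None:
        raise RuntimeError("Java is required to run the Tika executable.")
    result = subprocess.run(
        tika_command(jar_path, file_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Because the file never crosses an HTTP boundary, the POST size and request-timeout limits above do not apply, though Tika can still exhaust local memory on very large files.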

Database limitations

Your database may not be able to handle the text extracted from large files. The returned text must be stored in the database, which can greatly increase table size and risks errors or truncated data if the table's field type is not large enough.
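One way to spot the truncation risk ahead of time is a size pre-check like this sketch. The MEDIUMTEXT cap shown is a MySQL example; the actual column type your site's schema uses is an assumption to verify:

```python
# Sketch: checking whether extracted text fits in the database column that
# caches it. MEDIUMTEXT's 16,777,215-byte cap is a MySQL example; verify the
# actual column type in your own schema.
MEDIUMTEXT_MAX_BYTES = 16 * 1024 * 1024 - 1

def fits_in_column(extracted_text, column_max=MEDIUMTEXT_MAX_BYTES):
    # MySQL measures text column capacity in bytes, not characters, so
    # multibyte UTF-8 text uses up the budget faster than its length suggests.
    return len(extracted_text.encode("utf-8")) <= column_max
```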

PHP and Memory limitations

When you're attempting to index a large file, your website can have additional problems:

  • Drupal's Apache Solr Attachments and Search API attachments modules must hold all of the Tika-extracted text retrieved from the database in memory. This can cause PHP to run out of memory.
  • The data for a single document must be kept in PHP memory, which can cause out-of-memory errors if that data is too large.
  • Indexing operations normally work on batches of nodes or entities. Each entity in a batch can contribute a large amount of extracted text, increasing the possibility of out-of-memory errors even if each individual document is small. Workarounds include:
    • Running indexing for single items. This is time-consuming and reduces, but does not eliminate, the out-of-memory problem.
    • Rewriting the modules' architecture to detect when memory is running low.
  • The HTTP POST request that sends the data to Solr may time out.
  • The Solr backend may reject the request as too large.
  • Even if all of the preceding steps work, Solr itself has a hard limit on the number of tokens per field, defined in solrconfig.xml. For example:
    <maxFieldLength>20000</maxFieldLength>

    If the extracted text contains more words than maxFieldLength, Solr truncates the indexed data to that number of tokens. Even a relatively small PDF file (such as a 5MB file) can contain far more than 20,000 words.
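To estimate ahead of time whether a document will hit the maxFieldLength cutoff, a rough token count can help. The whitespace tokenizer below is a simplification; Solr's own analyzers may split text differently:

```python
# Sketch: estimating how many tokens of extracted text would fall past Solr's
# maxFieldLength and therefore never be indexed. Whitespace splitting is an
# approximation of Solr's analyzers.
def tokens_beyond_limit(extracted_text, max_field_length=20000):
    tokens = extracted_text.split()
    return max(0, len(tokens) - max_field_length)
```

A nonzero result means the tail of the document would be dropped from the index even though extraction and indexing appear to succeed.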

Limits when using Acquia Search for Tika extraction and searching

Acquia Search has several non-configurable limits:

  • The Acquia Search infrastructure's Tika extractor does not allow extracting text from files larger than 20MB.
  • The maxFieldLength in solrconfig.xml is set to a maximum of 20,000 tokens in our platform. A dedicated search farm is required to go above this limit. Contact your account manager if you require a dedicated farm.
  • Note that indexing large amounts of text (for example, many PDF documents each spanning hundreds of pages) will consume your available index space quickly.

Large-file indexing options

If you absolutely need to index large files, you have several options, including the following:

  • Large attachments can be ignored (their text extraction will not be attempted) by setting a file size limit in the Drupal admin UI.

    The Search API attachments module has this feature. To use it:

    1. Edit the search index you are using to index your documents.
    2. In the Filters tab, select the File attachments checkbox, and then scroll down. Change the Maximum file size.
    3. Click Save.

    A similar option exists for the D7 apachesolr module.
  • You can also configure the Search API attachments module to index only a limited amount of the text extracted from each file. This matters because large amounts of text, while increasing recall, can add too much data to Solr, increasing indexing time, Solr disk usage, and query time. If your documents include useful data (such as tables of contents or summaries) near their beginning, this approach still populates your index with the valuable keywords your site visitors search for.
    1. Edit the search index you are using to index your documents.
    2. In the Filters tab, select the File attachments checkbox, and then scroll down. Change the value of Limit size of the extracted string before indexing. (We recommend a value between 5 and 20 kB).
    3. Click Save.
  • If your website is externally hosted and you can run Java and compile Tika properly, you can bypass some of the limits described in the External Solr instances section above.
  • If you're comfortable properly configuring servers to bypass most of the limits mentioned, you can install and maintain your own Solr servers outside of Acquia's hosting.
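The extracted-string limit from the Search API attachments option above can be approximated in custom code as well. This is a hand-rolled sketch, not the module's actual implementation:

```python
# Sketch: keeping only the first few kilobytes of extracted text before
# indexing, mirroring the "limit size of the extracted string" idea. This is
# an illustration, not the Search API attachments module's own code.
def limit_extracted_string(text, max_kb=10):
    # max_kb=10 sits in the recommended 5-20 kB range. The cut lands on a
    # word boundary so no token is split in half.
    budget = max_kb * 1024
    encoded = text.encode("utf-8")
    if len(encoded) <= budget:
        return text
    clipped = encoded[:budget].decode("utf-8", errors="ignore")
    return clipped.rsplit(" ", 1)[0]
```

Because tables of contents and summaries tend to appear early in a document, even an aggressively limited string usually retains the most search-relevant keywords.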
