Acquia Search is a complex platform for hosting Solr indexes, and is not infinitely scalable. There are challenges both on the Drupal side and the server side that must be identified and considered when indexing large files.
Note: In general, indexing large files is not recommended.
Background: How the indexing process works with files
To index attachments with Solr (including Acquia Search), Drupal sends each file to Apache Tika for text extraction, using one of two methods:
Website on Acquia Cloud - Solr+Tika backend, which is a version of Solr that has been compiled to include Tika. You configure your modules to send a POST request to Solr at a special URL with the original document binary data.
Website not hosted by Acquia - Tika command, which is a stand-alone Tika executable that requires Java.
Drupal then receives the extracted text and stores it locally in the database until it's needed at indexing time. This data also serves as a cache when other Drupal nodes reference the same file. See the issues and limitations below.
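The Solr+Tika POST described above can be sketched as follows. This is an illustrative Python sketch, not the Drupal modules' actual code; the base URL and file are placeholders, and nothing is sent over the network. It uses Solr's ExtractingRequestHandler (`/update/extract`) with `extractOnly=true`, which returns the extracted text instead of indexing the raw document, mirroring how Drupal obtains the text it caches locally.

```python
import urllib.parse
import urllib.request

def build_extract_request(solr_base_url, file_bytes, filename):
    """Build (but do not send) an HTTP POST request to Solr's
    ExtractingRequestHandler, the Solr+Tika endpoint.
    extractOnly=true asks Solr to return the extracted text rather
    than indexing the document directly."""
    params = urllib.parse.urlencode({
        "extractOnly": "true",
        "wt": "json",              # return the response as JSON
        "resource.name": filename, # helps Tika guess the file type
    })
    url = f"{solr_base_url}/update/extract?{params}"
    return urllib.request.Request(
        url,
        data=file_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

# Placeholder URL and document; a real request would carry the file's bytes.
req = build_extract_request(
    "https://example.test/solr/core1", b"%PDF-1.4 ...", "report.pdf")
```

The same endpoint can also index the document in one step (without `extractOnly`), but returning the text lets Drupal reuse it when several nodes reference the same file.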
This section describes some of the limitations to the indexing process when dealing with large amounts of data.
External Solr instances
If your Solr instance is external to your website (as Acquia Search is) and you communicate with it over HTTP, you can encounter the following problems when sending a file to the Tika service:
Hitting a limit when sending too-large files
Hitting a timeout limit during your HTTP request
Running out of memory on the Tika backend
Websites not hosted by Acquia can work around these issues by running a locally hosted Java Tika application. Acquia Cloud does not run Java.
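A defensive client can at least fail gracefully on all three problems above. The following is a minimal Python sketch (not Drupal code); the 20 MB cap matches the Acquia Search extractor limit described later, and the timeout value is an assumption.

```python
import socket
import urllib.error
import urllib.request

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # e.g. Acquia Search's 20 MB extractor cap

def safe_extract(request, file_size, timeout=30):
    """Guard against the three failure modes above: skip files the
    backend would reject as too large, bound the HTTP request with a
    timeout, and treat backend errors (for example, Tika running out
    of memory) as a soft failure instead of crashing the indexing run."""
    if file_size > MAX_UPLOAD_BYTES:
        return None  # too large: do not even attempt extraction
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, socket.timeout):
        return None  # timeout, connection failure, or HTTP error from Tika
```

Returning `None` lets the caller index the node without the attachment's text rather than aborting the whole batch.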
Database limitations
Your database may not be able to handle large files. The returned text needs to be stored in the database. This can greatly increase database table size and risks errors or data truncation if the table's field size is not large enough.
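As a concrete illustration of the truncation risk, the check below compares extracted text against common MySQL column capacities (a `TEXT` column holds at most 65,535 bytes; `MEDIUMTEXT` holds about 16 MB). This is a Python sketch, not part of any Drupal module; note that sizes are measured in UTF-8 bytes, not characters.

```python
# Common MySQL column capacities, in bytes. Storing extracted text in a
# column that is too small risks errors or silent truncation.
TEXT_MAX = 65_535
MEDIUMTEXT_MAX = 16_777_215

def fits_column(extracted_text, column_max=TEXT_MAX):
    """Check whether extracted text fits a given column, measured in
    UTF-8 bytes (multi-byte characters count more than once)."""
    return len(extracted_text.encode("utf-8")) <= column_max
```

A single large PDF can easily exceed the default `TEXT` capacity, which is why the size of the extracted string matters as much as the size of the original file.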
PHP and memory limitations
When you're attempting to index a large file, your website can have additional problems:
The data for a single document needs to be kept in PHP memory, which can possibly cause out of memory errors if that data is too large.
Indexing operations normally work on batches of nodes or entities. Each of these entities could potentially have large amounts of extracted text, increasing the possibility of out of memory errors (even if each document was small). Workarounds include:
Run indexing for single items. This is time-consuming and reduces, but does not eliminate, the out-of-memory problem.
Rewrite the module's architecture to detect when it is about to run out of memory.
The HTTP POST request that sends the data to Solr may time out.
The Solr backend may reject the request as too large.
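The batching trade-off above can be sketched as a planning function. This is an illustrative Python sketch under an assumed per-batch memory budget, not how the Drupal modules actually batch; it shows why per-item indexing reduces, but cannot eliminate, out-of-memory risk.

```python
def plan_batches(doc_sizes, memory_budget):
    """Greedy batching: group documents so each batch's combined
    extracted-text size stays under a memory budget. A single document
    larger than the budget still forms its own (over-budget) batch,
    which is why out-of-memory errors can occur even at batch size 1."""
    batches, current, used = [], [], 0
    for size in doc_sizes:
        if current and used + size > memory_budget:
            batches.append(current)   # flush the batch before it overflows
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        batches.append(current)
    return batches
```

With a budget of 10, documents of size [5, 5, 5] split into two batches, while a single size-15 document still becomes one over-budget batch.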
Even if all of the preceding steps work, Solr itself has a hard limit on the number of tokens per field, defined in solrconfig.xml. For example:
<maxFieldLength>20000</maxFieldLength>
Even if everything else works, if the extracted text has more words than the maxFieldLength, Solr will truncate the indexed data to this amount. Smaller PDF files (such as a 5MB PDF file) can contain far more than 20,000 words.
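A rough check for whether a document will hit the token cap can be sketched as follows. This Python sketch uses simple whitespace tokenization; Solr's analyzers tokenize differently, so treat it only as an approximation.

```python
MAX_FIELD_LENGTH = 20_000  # token cap from solrconfig.xml

def will_truncate(extracted_text, max_tokens=MAX_FIELD_LENGTH):
    """Approximate whether Solr would truncate this text: split on
    whitespace and compare the token count against the field cap.
    Real Solr analyzers differ, so this is only an estimate."""
    return len(extracted_text.split()) > max_tokens
```

At roughly 300 to 500 words per page, a document passes 20,000 tokens after only a few dozen pages, which is why even a modest PDF exceeds the cap.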
Limits when using Acquia Search for Tika extraction and searching
Acquia Search has several non-configurable limits:
The Acquia Search infrastructure's Tika extractor does not allow extracting text from files larger than 20MB.
The maxFieldLength in solrconfig.xml is set to a maximum of 20,000 tokens in our platform. A dedicated search farm is required to go above this limit. Contact your account manager if you require a dedicated farm.
Note that indexing large amounts of text (for example, many PDF documents each spanning hundreds of pages) will consume your available index space quickly.
Large-file indexing options
If you absolutely need to index large files, there are several options available for your use, including the following:
Large attachments can be ignored (their text extraction will not be attempted) by setting a file size limit in the Drupal admin UI.
A similar option exists for the Drupal 7 apachesolr module.
Edit the search index you are using to index your documents.
In the Processors tab, select the File attachments checkbox, and then scroll down. Change the Maximum file size.
Click Save.
You can also configure the Search API Attachments module to index only a limited amount of the text extracted from files. Doing this is important because large amounts of text, while increasing recall, can add too much data to Solr, increasing indexing time, Solr disk usage, and query time. If your documents include useful content such as tables of contents or summaries near their beginning, limiting the extracted text still populates your index with the valuable keywords your site visitors search for.
Edit the search index you are using to index your documents.
In the Processors tab, select the File attachments checkbox, and then scroll down. Change the value of Limit size of the extracted string before indexing.
Click Save.
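The effect of that setting can be sketched as a simple prefix cut. This Python sketch is illustrative only (the module itself applies the limit in PHP), and the 30,000-character default here is an arbitrary placeholder, not the module's default.

```python
def limit_extracted_text(text, max_chars=30_000):
    """Mirror the 'Limit size of the extracted string before indexing'
    setting: keep only the leading portion of the extracted text, on
    the assumption that tables of contents, summaries, and other
    keyword-rich content appear near the beginning of a document."""
    return text[:max_chars]
```

Capping the string before it reaches the index bounds both the database row size and the amount of data each document contributes to Solr.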
If your website is externally hosted and you can run Java and compile Tika properly, you can bypass some of the limits described in the External Solr instances section above.
If you're comfortable with properly configuring servers to bypass most of the mentioned limits, you can install and maintain your own Solr servers that are not hosted by Acquia.
Issues with large attachments and Solr search | Acquia Product Documentation