Indexing attachments

In addition to Drupal nodes, Acquia Search results presented to your users can include files that contain matches to the search query. File formats that can be indexed include HTML, XML, Microsoft Office documents, OpenDocument, PDF, RTF, .zip and other compression formats, text formats, audio formats, video formats and more. For a complete list of supported document formats, see the Apache Tika documentation.

Search results for attached files display a direct link to the attached file as well as the node to which it is attached:

Attachments in search results

Searching file attachments requires the Apache Solr Attachments module. The module uses the Apache Tika Content Analysis Toolkit (currently Tika version 0.10) to detect and extract metadata and structured text content from a wide variety of file formats. Once extracted, this information is indexed and made available to your users through Acquia Search.

Installing the Apache Solr Attachments module

To index attachments, you must install and enable the Apache Solr Attachments module on your website. Then, on the Apache Solr search > Configuration > Default Index page, under Configuration, select File as an entity to be indexed, and then click Save.

What to index

The Apache Solr Search configuration page now displays the Attachments tab. Use this tab to configure how file attachments are indexed.

Configuring file attachment index settings

The Attachments tab of the Apache Solr Search configuration page contains the following configuration options for indexing attachments in Apache Solr Search:

Search - File Attachments configuration tab

Item: Description
1. Excluded file extensions: A space-separated list of file extensions that are excluded from indexing. Modify this list to suit the needs of your site. Extensions are internally mapped to a MIME type, so it is not necessary to include variations that map to the same type. For example, tif is sufficient to exclude both the tif and tiff file extensions.
2. Extract using: Acquia Search includes Apache Tika for indexing documents. For best performance, select Solr (remote server).
3. Tika directory path: Leave this blank.
4. Tika jar file: Leave this set to the default value, tika-app-1.1.jar.
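
The extension-to-MIME-type behavior of the Excluded file extensions setting can be illustrated with a short Python sketch. This is not the module's code: the mapping table and the function names excluded_mime_types and is_excluded are hypothetical, and the real module's mapping covers far more types. It only shows why listing tif is enough to exclude tiff files as well.

```python
# Hypothetical extension-to-MIME table for illustration only;
# the module's real mapping is larger and may differ.
MIME_BY_EXTENSION = {
    "tif": "image/tiff",
    "tiff": "image/tiff",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "pdf": "application/pdf",
    "doc": "application/msword",
}


def excluded_mime_types(setting: str) -> set[str]:
    """Turn the space-separated extension list into a set of MIME types."""
    return {
        MIME_BY_EXTENSION[ext]
        for ext in setting.lower().split()
        if ext in MIME_BY_EXTENSION
    }


def is_excluded(filename: str, setting: str) -> bool:
    """Return True when the file's MIME type matches an excluded extension."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return False  # no extension, so nothing to look up
    mime = MIME_BY_EXTENSION.get(ext.lower())
    return mime is not None and mime in excluded_mime_types(setting)
```

Under this assumed table, is_excluded("scan.tiff", "tif jpg") is true even though tiff is not listed, because tif and tiff resolve to the same MIME type, while is_excluded("report.pdf", "tif jpg") is false.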

Index and cache controls

File attachments get indexed at the same time as their parent entities. Under Actions, you can:

Action: Description
Clear the attachment text extraction cache: Clears all extracted data.
Delete the attachments from index: Deletes all attached files on your site from the Acquia Search index. Do this if you change which file types are indexed, if your search index becomes corrupted, or if you install a new schema.xml.
Test your tika extraction: Tests whether your Tika configuration settings work.
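
The difference between the first two actions can be pictured with a toy Python model. Everything here is an assumption made for illustration: the class AttachmentIndexModel, its method names, and the idea that the cache is keyed by a content hash and survives an index deletion are not taken from the module's source. The sketch only separates two stores, the extracted-text cache and the search index, and shows which action empties which.

```python
import hashlib


class AttachmentIndexModel:
    """Toy model (hypothetical): a text-extraction cache plus a search index."""

    def __init__(self):
        self.extraction_cache = {}  # content hash -> extracted text
        self.index = {}             # file name -> indexed text
        self.extractions = 0        # how many times extraction actually ran

    def _extract(self, content: bytes) -> str:
        # Stand-in for Tika extraction, assumed to be the expensive step.
        self.extractions += 1
        return content.decode("utf-8", errors="replace")

    def index_file(self, name: str, content: bytes) -> None:
        # Reuse cached text when the same content was extracted before.
        key = hashlib.sha256(content).hexdigest()
        if key not in self.extraction_cache:
            self.extraction_cache[key] = self._extract(content)
        self.index[name] = self.extraction_cache[key]

    def clear_extraction_cache(self) -> None:
        # Models "Clear the attachment text extraction cache".
        self.extraction_cache.clear()

    def delete_attachments_from_index(self) -> None:
        # Models "Delete the attachments from index".
        self.index.clear()
```

In this model, deleting attachments from the index and reindexing reuses the cached text, whereas clearing the cache forces extraction to run again the next time a file is indexed.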