Date Published: April 26, 2024

Block excessive crawling of Drupal Views or search results

Issue

Sometimes, robot web crawlers (like Bing, Huawei Cloud, Yandex, Semrush, etc.) can attempt to crawl a Drupal View's search results pages, and may also follow links to each of the view's filtering options. This places extra load on your site. Additionally, the crawling (even when done by legitimate search engines) may not be increasing your site's visibility to users of search engines.

Therefore, we suggest blocking or re-routing this traffic to reduce resource consumption on the Acquia platform, avoid overages to your Acquia entitlements (for Acquia Search, Views & Visits, etc.), and generally help your site perform better.

Resolution

You may need multiple strategies to resolve this problem:

  • Option A) Well-behaved robots that adhere to the robots.txt specification can be told to avoid indexing Views/Search pages by adding some directives to your site.
  • Option B) For other crawlers, you may need to apply a hard block to some URL patterns.
Keep reading for implementation examples.

Option A) Robots Exclusion Protocol method

For well-behaved robots and crawlers, you can avoid excessive crawling of your Views by adding some rules to your docroot/robots.txt file. Here are some examples:
# Do not index nor follow links that have a query string
# (e.g. /search?page=123  or /search?size=small&color=red)
User-agent: *
Disallow: /*?

# If your views or search pages use a module to convert facets/filters 
# to clean URLs (e.g. /search/page/123  or /search/size/small)
# you can try disallowing the search page's URL
User-agent: *
Disallow: /search*
For more about how to write and test rules inside your docroot/robots.txt file, see https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
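
Note that wildcard rules such as Disallow: /*? are an extension to the original Robots Exclusion Protocol; some parsers (for example, Python's standard-library urllib.robotparser) only do simple prefix matching and will not interpret them. If you want to sanity-check which URLs a wildcard rule would match before deploying it, a minimal checker along the lines of this Python sketch can help. It is illustrative only, not a full robots.txt implementation, and robots_rule_matches is our own hypothetical helper:
import re

def robots_rule_matches(rule: str, path: str) -> bool:
    # Translate a Disallow rule with Google-style wildcards into a
    # regular expression: '*' matches any run of characters, and a
    # trailing '$' anchors the match to the end of the URL.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

# Expect True: the query string makes "/*?" match.
print(robots_rule_matches("/*?", "/search?page=123"))
# Expect True: "/search*" matches clean facet URLs too.
print(robots_rule_matches("/search*", "/search/size/small"))
# Expect False: a plain node URL stays crawlable.
print(robots_rule_matches("/*?", "/node/1"))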

Option B) Blocking by user agent and URL pattern

While not a perfect solution, blocking at the Apache layer with the .htaccess snippet below can help.

  • Warning: The following snippet is provided as an example; we cannot guarantee that it will block only "bad" traffic, nor that it will block all of it. We can say, however, that it has successfully helped some sites that were under a similar traffic pattern.
  • This snippet is meant to block certain User Agents when they request URLs that have query strings (like Search pages with various filtering or faceting options, listings of items with dozens or hundreds of pages, etc.).
  • The included User Agent list is neither exhaustive nor authoritative, and may need editing.
  • We recommend using it along with your own observation and analysis of the traffic in your logs; a sketch to help with that analysis follows this list. You may need to make adjustments if you are blocking too strictly, and you should consider repeating this analysis-and-blocking cycle periodically.
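
As a starting point for that log analysis, the short Python sketch below tallies which User Agents are requesting query-string URLs in an Apache combined-format access log. It is a minimal sketch under assumptions: the file name access.log and the log-line pattern are placeholders, so adjust both to wherever your platform stores its logs and to your actual log format.
import re
from collections import Counter

# Matches the request, status, size, referer, and agent fields of an
# Apache combined-format log line (an assumption; adjust to your format).
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

agents = Counter()
with open("access.log") as f:   # placeholder path
    for line in f:
        m = LOG_LINE.search(line)
        # Count only requests whose path carries a query string.
        if m and "?" in m.group("path"):
            agents[m.group("agent")] += 1

# Print the top 20 offenders so you can decide what to block.
for agent, hits in agents.most_common(20):
    print(f"{hits:8d}  {agent}")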

If you do decide to try out this code in your .htaccess, please follow the included instructions. 

# EXAMPLE ROBOT BLOCKING CODE for Search pages or views. 
#   From: https://support-acquia.force.com/s/article/4408794498199-Block-excessive-crawling-of-Drupal-Views-or-search-results
#   Robot list updated: 2024-01-22
#   NOTE: May need editing depending on your use case(s).
#
# INSTRUCTIONS:
# PLACE THIS BLOCK directly after the "RewriteEngine on" line
#   in your docroot/.htaccess file.
#
# This will block some known robots/crawlers on URLs when query arguments are present.
#   DOES allow basic URLs like /news/feed, /node/1 or /rss, etc.
#   BLOCKS only when search arguments are present like
#     /news/feed?search=XXX or /rss?page=21.
# Note: You can add more conditions if needed.
#   For example, to only block on URLs that begin with '/search', add this
#   line before the RewriteRule:
#     RewriteCond %{REQUEST_URI} ^/search
#
RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_USER_AGENT} "11A465|AddThis.com|AdsBot-Google|Ahrefs|aiohttp|alexa site audit|AlipesNewsBot|Amazonbot|Amazon-Route53-Health-Check-Service|ApacheBench|AppDynamics|Applebot|ArchiveBot|Archive-It|AspiegelBot|Assetnote|axios|azure-logic-apps|Baiduspider|Barkrowler|bingbot|BLEXBot|BLP_bbot|BluechipBacklinks|Buck|Bytespider|CatchBot|CCBot|check_http|ClaudeBot|CloudFlare-Prefetch|cludo.com bot|colly|contentkingapp|Cookiebot|CopperEgg|crawler4j|Csnibot|Curebot|curl|CyotekWebCopy|Daum|Datadog Agent|DataForSeoBot|Detectify|DotBot|Dow Jones Searchbot|DuckDuckBot|facebookexternalhit|Faraday|FeedBurner|FeedFetcher-Google|feedonomics|feroxbuster|Fess|Funnelback|Fuzz Faster U Fool|GAChecker|Ghost Inspector|GPTBot|Grapeshot|gobuster|gocolly|Googlebot|GoogleStackdriverMonitoring|Go-http-client|go-resty|GuzzleHttp|HeadlessChrome|heritrix|hokifyBot|Honolulu-bot|HTTrack|HubSpot Crawler|ICC-Crawler|Imperva|IonCrawl|jooble|KauaiBot|Kinza|LieBaoFast|linabot|Linespider|Linguee|LinkChecker|LinkedInBot|LinkUpBot|LinuxGetUrl|LMY47V|MacOutlook|Magnet.me|Magus Bot|Mail.RU_Bot|MauiBot|Mb2345Browser|MegaIndex|Microsoft Office|Microsoft Outlook|Microsoft Word|MicroMessenger|mindbreeze-crawler|mirrorweb.com|MJ12bot|monitoring-plugins|Monsidobot|MQQBrowser|msnbot|MSOffice|MTRobot|nagios-plugins|nettle|Neevabot|NewsCred|newspaper|node-fetch|Nuclei|NukeScan|okhttp|OnCrawl|Orbbot|PageFreezer|panscient.com|PetalBot|Pingdom.com|Pinterestbot|PiplBot|python-requests|Qwantify|Re-re Studio|Riddler|RocketValidator|rogerbot|RustBot|Safeassign|Scrapy|Screaming Frog|SeobilityBot|Search365bot|SearchBlox|SearchmetricsBot|searchunify|Seekport|SemanticScholarBot|SemrushBot|SEOkicks|seoscanners|serpstatbot|SessionCam|SeznamBot|Site24x7|SiteAuditBot|siteimprove|Siteimprove|SiteLockSpider|SiteSucker|SkypeRoom|Slackbot|Slurp|Sogou web spider|special_archiver|SpiderLing|StatusCake|Swiftbot|Synack|Turnitin|trendictionbot|trendkite-akashic-crawler|UCBrowser|Uptime|UptimeRobot|usasearch|UT-Dorkbot|weborama-fetcher|WhiteHat Security|WidenWebhookClient|Wget|WTWBot|www.loc.gov|Xenu Link Sleuth|Vagabondo|VelenPublicWebCrawler|Yeti|Veracode Security Scan|YandexBot|YandexImages|YisouSpider|Y!J|Zabbix|ZoominfoBot|ZoomSpider" [NC]
RewriteRule ^.* - [F,L]
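
Once the block is in place, you can verify the behavior with a short request script. The Python sketch below uses only the standard library; example.com is a placeholder for your own domain, and SemrushBot is just one agent taken from the list above. A blocked agent should receive a 403 on query-string URLs, while query-less URLs and unlisted agents should still get 200:
import urllib.request
import urllib.error

def status_for(url: str, user_agent: str) -> int:
    """Return the HTTP status code the server gives this User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

# Replace example.com with your site before running.
print(status_for("https://example.com/search?page=2", "SemrushBot"))   # expect 403
print(status_for("https://example.com/search", "SemrushBot"))          # expect 200: no query string
print(status_for("https://example.com/search?page=2", "Mozilla/5.0"))  # expect 200: agent not in list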

 
