You can exclude specific links, URL paths, and pages from a scan. Use these exclusions to stop the scan from following, scanning, and counting those links and pages.
Link exclusions let you specify patterns for links to leave out of the scan: the crawler ignores every URL that matches a pattern. A matching link is still recorded as present on the page, but it is not checked or followed.
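The behavior described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name `classify_links`, the `/calendar/` pattern, and the choice to match each pattern as a regular expression anywhere in the full URL are all assumptions made for the example.

```python
import re

def classify_links(links, exclusion_patterns):
    """Record every link, but flag links matching an exclusion pattern as not checked."""
    # Assumption: each pattern is a regular expression searched anywhere in the URL.
    compiled = [re.compile(pattern) for pattern in exclusion_patterns]
    report = []
    for url in links:
        excluded = any(rx.search(url) for rx in compiled)
        # The link stays in the report either way; only the "checked" flag changes.
        report.append({"url": url, "checked": not excluded})
    return report

links = [
    "https://domain.tld/news/article-1",
    "https://domain.tld/calendar/2025-03",  # hypothetical URL to exclude
    "https://domain.tld/about",
]
report = classify_links(links, [r"/calendar/"])
for entry in report:
    print(entry["url"], "checked" if entry["checked"] else "recorded only")
```

All three links appear in the report, which mirrors the point that an excluded link is still recorded as present on the page.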
For more information, see the user guide articles:
Path constraints let you control which pages the scan covers. With a regular expression, you can include or exclude content from the scan.
Example 1:
If you only want to scan the news section of your website, https://domain.tld/news, add a constraint with
^/news
to restrict the crawler to content in that section.
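To see which URLs a constraint such as `^/news` would select, the following Python sketch tests it against a few sample URLs. It assumes the constraint is a regular expression applied to the path part of the URL; that matching rule, the helper name `in_scope`, and the sample URLs are assumptions for illustration, not documented product behavior.

```python
import re
from urllib.parse import urlparse

def in_scope(url, constraints):
    """Return True if the URL's path matches at least one constraint."""
    # Assumption: constraints are regular expressions matched against the path only.
    path = urlparse(url).path or "/"
    return any(re.search(pattern, path) for pattern in constraints)

urls = [
    "https://domain.tld/news",
    "https://domain.tld/news/2025/launch",  # hypothetical article URL
    "https://domain.tld/products",          # hypothetical non-news URL
    "https://domain.tld/",
]
for url in urls:
    print(url, "->", "scan" if in_scope(url, [r"^/news"]) else "skip")
```

Under these assumptions, `^/news` would also match a path such as `/newsletter`, because the pattern only anchors the start of the path. A stricter pattern like `^/news(/|$)` would avoid that.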
The start URL needs to be a part of the constraints. This can be done in either of the following ways:
Use https://domain.tld/news as the start URL.
Add the constraint ^/$, which includes the frontpage.
Example 2:
Suppose you have a search results page and want to remove all the results from the scan. The results page URL could look something like:
https://domain.tld/search/results?query=test
You can create a negative constraint to do this. It could look something like: !search/results?
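A negative constraint can be sketched in Python as follows. The example assumes that a leading `!` marks a constraint as negative, that the rest is a regular expression, and that it is matched against the path plus query string. These rules and the helper name `allowed` are assumptions for illustration and may differ from how the crawler actually evaluates constraints.

```python
import re
from urllib.parse import urlparse

def allowed(url, constraints):
    """Apply positive and negative ("!"-prefixed) constraints to a URL."""
    parts = urlparse(url)
    # Assumption: constraints are matched against path + query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    positives = [c for c in constraints if not c.startswith("!")]
    negatives = [c[1:] for c in constraints if c.startswith("!")]
    # A match on any negative constraint removes the URL from the scan.
    if any(re.search(pattern, target) for pattern in negatives):
        return False
    # With no positive constraints, everything else stays in scope.
    return not positives or any(re.search(pattern, target) for pattern in positives)

print(allowed("https://domain.tld/search/results?query=test", ["!search/results?"]))
print(allowed("https://domain.tld/news/article", ["!search/results?"]))
```

If the pattern is read as a standard regular expression, the trailing `?` makes the final `s` optional rather than matching a literal question mark. The pattern still matches the results URL either way.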
For more information, see the user guide article:
Path constraints and link exclusions.
Canonical links are a way to indicate the preferred version of a webpage when there are duplicate or similar versions under different URLs. Canonical links are mostly used to help search engines understand which version to index and display in the search results, which improves SEO.
The following external resources provide more information about canonical links:
Example uses of canonical links:
Example 1: Print version of a page
Consider this URL:
https://domain.tld/page_id=32
When a print version of this page is created, many content management systems add a print parameter to the URL that looks something like:
https://domain.tld/page_id=32?print=yes
In many real-world cases, the content of these two pages is effectively or exactly the same. In the example above, these URLs register as two separate pages for web crawlers or search engines.
A common way to address this is to add a canonical tag to the print page:
https://domain.tld/page_id=32?print=yes
The tag points to the primary page:
https://domain.tld/page_id=32
To do this, insert a link tag into the head section of the print page's HTML. Here is an example:
<link rel="canonical" href="https://domain.tld/page_id=32">
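As a rough illustration of how a crawler might read this tag, the Python sketch below extracts the canonical URL from a page using the standard library's `html.parser`. The class name `CanonicalFinder`, the helper `find_canonical`, and the sample print-page markup are hypothetical; this is not a description of how any particular crawler is implemented.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before comparing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")

def find_canonical(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# Hypothetical markup for the print version of the page.
print_page = """<html><head>
<title>Page 32 (print)</title>
<link rel="canonical" href="https://domain.tld/page_id=32">
</head><body>Printable content</body></html>"""

print(find_canonical(print_page))  # https://domain.tld/page_id=32
```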
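This canonical tag tells web crawlers and search engines that the print URL is a duplicate of the primary page, and that the primary page is the preferred version to index.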
Example 2: Sortable lists
Another example is a page that displays a sortable list of items, for example, a news site with a list of articles or a store with a list of products.
Assume that https://domain.tld/list contains a list that you can sort by color, price, or size. The content on the page remains the same, but each sorted version of the page has a unique URL, such as the following:
https://domain.tld/list?sort=colors
https://domain.tld/list?sort=price
https://domain.tld/list?sort=size
In this case, you could add a canonical link to the main list like this:
<link rel="canonical" href="https://domain.tld/list">
This tag indicates that the default sort version of the page should be considered the primary version.
Add this canonical tag to alert search engines or web crawlers (like Acquia Optimize) that each of these URLs is actually a link to a page that has the same content.
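The effect on page counting can be sketched in Python by grouping crawled URLs under their canonical URL. The function name `group_by_canonical` and the input mapping are assumptions for illustration, and the canonical values are taken as already extracted from each page. This is not a description of how Acquia Optimize handles canonical tags internally.

```python
from collections import defaultdict

def group_by_canonical(pages):
    """Group URLs by canonical URL.

    `pages` maps each crawled URL to its canonical href, or None if the
    page declares no canonical tag (in which case the page stands for itself).
    """
    groups = defaultdict(list)
    for url, canonical in pages.items():
        groups[canonical or url].append(url)
    return dict(groups)

pages = {
    "https://domain.tld/list": None,
    "https://domain.tld/list?sort=colors": "https://domain.tld/list",
    "https://domain.tld/list?sort=price": "https://domain.tld/list",
    "https://domain.tld/list?sort=size": "https://domain.tld/list",
}
unique_pages = group_by_canonical(pages)
for canonical, urls in unique_pages.items():
    print(canonical, "<-", len(urls), "URLs")
```

Here the four URLs collapse into a single entry keyed by https://domain.tld/list, so the sorted variants count as one page rather than four.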
It is also possible to configure canonical tags to exclude URLs that point to identical content.
Thu Mar 13 2025 10:41:50 GMT+0000 (Coordinated Universal Time)