This article describes how the scan finds and handles the links on your website.
The scan follows the links on your website to find all of the content that is present on your domain.
A link is a unique URL that points to an HTML page or an asset such as an image, JavaScript file, CSS file, or document. All links are unique: any difference in a URL (such as a single letter change) creates a distinct link.
Examples of unique links are:
https://domain.tld/webaccessibility
https://domain.tld/web-accessibility
Or
https://domain.tld/pageId=23
https://domain.tld/pageId=24
Or
https://domain.tld/list?sort=price
https://domain.tld/list?sort=color
In each pair of links above, the two links shown are counted as two separate, unique links.
Each unique link to an HTML page on the primary domain, on an added subdomain, or at any other internal URL is counted as one page.
The scan counts each unique link only once, even if the link occurs multiple times or appears on multiple pages.
As an example, consider a site with the following structure:
Primary Domain (page 1) contains three links:
Link to page 2
Link to page 3
Link to page 4
Page 2 contains one link:
Link to page 5
Page 3 contains one link:
Link to page 5
Page 4 contains one link:
Link to page 5
Page 5 contains one link:
Link to page 1
In this example, the scanner counts:
5 unique pages (the primary domain URL plus pages 2 through 5)
5 unique links
3 occurrences of the link to page 5
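To make the counting rules concrete, here is a minimal Python sketch that models the example structure above. The page names and counting logic are illustrative only; this is not the scanner's actual code.

# Each page maps to the list of links found on it (mirrors the example above).
site = {
    "page1": ["page2", "page3", "page4"],  # the primary domain
    "page2": ["page5"],
    "page3": ["page5"],
    "page4": ["page5"],
    "page5": ["page1"],
}

unique_pages = set(site)  # every HTML page with its own URL counts as one page
unique_links = {target for targets in site.values() for target in targets}
link5_occurrences = sum(targets.count("page5") for targets in site.values())

print(len(unique_pages))    # 5 unique pages
print(len(unique_links))    # 5 unique links
print(link5_occurrences)    # 3 occurrences of the link to page 5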
External links: The scanner determines if an external link is broken, but it does not count external links as pages.
You can exclude certain links and URL paths from a scan. Use these options to stop the scan from following, scanning, and counting specific links and pages.
Link exclusions let you specify patterns that instruct the crawler to ignore all URLs that match the pattern. The link is still recorded as present on the page, but it is not checked or followed.
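As a rough illustration of this behavior, the Python sketch below records every link it sees but only marks a link to be followed when it does not match an exclusion pattern. The pattern and URLs are hypothetical, and the code is not the scanner's actual implementation.

import re

exclusion = re.compile(r"/print/")  # hypothetical exclusion pattern

links_on_page = [
    "https://domain.tld/news/article-1",
    "https://domain.tld/print/article-1",
]

for url in links_on_page:
    recorded = True                       # the link is always recorded as present on the page
    followed = not exclusion.search(url)  # excluded links are not checked or followed
    print(url, "recorded:", recorded, "followed:", followed)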
For more information, see the related User Guide articles.
Path constraints give you the option to control which pages are covered by the scan. With a regular expression, you can include or exclude content from the scan.
Example 1:
If you only want to scan the news section of your website, https://domain.tld/news, add a constraint with ^/news so that the crawler only crawls content in that section.
Important note: The start URL needs to be covered by the constraints. This can be done in multiple ways:
Change the start URL to https://domain.tld/news
or
Add an extra constraint with ^/$ so that the front page is also included.
Example 2:
Suppose you have a search results page and you want to remove all of the results from the scan. The results page could look something like this:
https://domain.tld/search/results?query=test
It is possible to create a negative constraint to do this. It could look something like: !search/results?
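The sketch below shows, in hypothetical Python, how the constraints from Example 1 and Example 2 could be evaluated against URL paths. The regular expressions are the ones shown above; the matching logic is only an approximation of what the crawler does.

import re

include = [re.compile(r"^/news"), re.compile(r"^/$")]  # Example 1: the news section plus the front page
exclude = [re.compile(r"search/results\?")]            # Example 2: the negative constraint !search/results?

def in_scope(path):
    if any(pattern.search(path) for pattern in exclude):
        return False
    return any(pattern.search(path) for pattern in include)

print(in_scope("/"))                           # True: the front page matches ^/$
print(in_scope("/news/budget-2024"))           # True: matches ^/news
print(in_scope("/search/results?query=test"))  # False: excluded by the negative constraint
print(in_scope("/about"))                      # False: no positive constraint matches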
For more information, see the user guide article:
Path Constraints and Link Exclusions.
Canonical links are a way to indicate the preferred version of a webpage when there are duplicate or similar versions under different URLs. Canonical links are mostly used to help search engines understand which version to index and display in the search results, which improves SEO.
You can read more about canonical links here:
Google: What is Canonicalization
Google: How to specify a canonical with rel="canonical" and other methods
Example uses of canonical links:
Example 1: Print version of a page
For example, consider this URL:
https://domain.tld/page_id=32
When a print version of this page is created, many CMS systems add a print parameter within the URL that looks something like:
https://domain.tld/page_id=32?print=yes
In many real-world cases, the content of these two pages is effectively or exactly the same. In the example above, these URLs register as two separate pages for web crawlers or search engines.
A common way to address this is to add a canonical tag on the print page:
https://domain.tld/page_id=32?print=yes
that points to the primary page:
https://domain.tld/page_id=32
To do this, insert a tag into the head section of the HTML. For example:
<link rel="canonical" href="https://domain.tld/page_id=32">
This canonical tag tells web crawlers and search engines that:
i) these pages contain duplicate content and
ii) the URL without the print parameter is the main version of the page.
Example 2: Sortable lists
Another example is a page that displays a sortable list of items – like a news site with a list of articles or a store with a list of products.
Assume that https://domain.tld/list contains a list that you can sort by color, price, or size. The content on the page remains the same, but each sorted version of the page has a unique URL, such as:
https://domain.tld/list?sort=colors
https://domain.tld/list?sort=price
https://domain.tld/list?sort=size
In this case, you could add a canonical link to the main list like this:
<link rel="canonical" href="https://domain.tld/list">
This tag indicates that the default sort version of the page should be considered the primary version.
Add this canonical tag to alert search engines and web crawlers (like Acquia Optimize) that each of these URLs points to a page with the same content.
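As a rough sketch of how a crawler can use this information (hypothetical Python, not Acquia Optimize's implementation), each sorted URL that declares the same canonical can be collapsed into a single page:

import re
from urllib.parse import urljoin

# Hypothetical helper: read the canonical URL from a page's HTML, if one is declared.
CANONICAL = re.compile(r'<link\s+rel="canonical"\s+href="([^"]+)"', re.IGNORECASE)

def canonical_url(page_url, html):
    match = CANONICAL.search(html)
    return urljoin(page_url, match.group(1)) if match else page_url

html = '<head><link rel="canonical" href="https://domain.tld/list"></head>'
for url in ("https://domain.tld/list?sort=colors",
            "https://domain.tld/list?sort=price",
            "https://domain.tld/list?sort=size"):
    print(canonical_url(url, html))  # all three resolve to https://domain.tld/list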
It is also possible to configure canonical tags to exclude URLs that point to identical content. For more information, see the related user guide articles.
The crawler uses a dynamic discovery process, which means it actively explores and discovers web pages on your website. It does this by systematically following links from one page to another to find all pages on your domain.
The scan uses a breadth-first approach: it starts with the initial webpage and systematically explores all of the links on a page before it moves on to the next level (depth) of pages. The crawler can scan up to 10 pages of the same domain simultaneously and, in most cases, respects the depth priority of the links, depending on the response and processing latency of each page. When a sitemap is found, all pages in the sitemap are considered to be at depth level 0 (the top).
The crawler is capped at a depth of 100 links from the start page.
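A simplified Python sketch of a breadth-first crawl with a depth cap is shown below. It is illustrative only: the real crawler also scans up to 10 pages in parallel, applies exclusions and constraints, and handles many cases not shown here. The get_links function is an assumed helper that fetches a page and returns the internal links found on it.

from collections import deque

MAX_DEPTH = 100  # links deeper than 100 levels from the start page are not followed

def crawl(start_url, get_links):
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth); the start page is at depth 0
    while queue:
        url, depth = queue.popleft()  # breadth-first: shallower pages are processed first
        if depth >= MAX_DEPTH:
            continue
        for link in get_links(url):
            if link not in seen:      # each unique link is only queued and counted once
                seen.add(link)
                queue.append((link, depth + 1))
    return seen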
The crawler also inspects robots.txt files. It detects sitemaps declared in the robots.txt file and automatically scans all links listed in the sitemap XML.
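The idea can be sketched with the Python standard library as follows. This is hypothetical, simplified code: it assumes a single plain sitemap rather than a sitemap index, and it skips error handling.

from urllib.request import urlopen
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(domain):
    # Read Sitemap: declarations from robots.txt, then collect the <loc> entries from each sitemap.
    robots = urlopen(f"https://{domain}/robots.txt").read().decode("utf-8", "replace")
    sitemaps = [line.split(":", 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith("sitemap:")]
    urls = []
    for sitemap in sitemaps:
        tree = ElementTree.fromstring(urlopen(sitemap).read())
        urls += [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]
    return urls  # pages found this way are treated as depth 0 in the crawl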
For more information, see the user guide article:
Links that the Scan Automatically Ignores.
As an alternative to a start page, you can add a sitemap. Sitemaps can be an essential tool for enhancing the effectiveness of the domain scan. Their advantages are particularly clear for large or complex websites, as well as sites that contain a great deal of multimedia content.
A sitemap essentially serves as a roadmap for the crawler: it shows an organized structure of your website, makes the site easier to navigate, and helps the scan discover URLs across your site. Making sure that every page is linked from at least one other page can be challenging on large websites. A sitemap addresses this issue by guiding the crawler to pages that might otherwise be overlooked.
Many CMS systems sort pages into categories such as news pages, event pages, and other pages created by a CMS module.
Often, a page is a normal content page that the user sets up themselves, whereas a news page created by a CMS module is categorized differently. These pages can be categorized either as a collection of pages or as a specific content type inside the CMS. The same applies to events and forms. All of this content has a unique URL and can be accessed by a user.
We have collected the answers to some common issues that you might encounter regarding broken links found by the scan.
For more information, see the related user guide articles.
If this content did not answer your questions, try searching or contacting our support team for further assistance.