Indexability refers to how URLs are technically configured, which determines whether each URL is Indexable or Not Indexable.
Search engines generally take the stance that any successful URLs (i.e., HTTP status 200) they find should be indexed by default – and they will, in the main, index everything they can find. However, you can give specific signals and directives to search engines that instruct them NOT to index specific URLs.
Setting URLs to be Not Indexable is a relatively common and straightforward task in most modern CMSs. You might want to set a URL to noindex, for instance, if it is helpful to website users but is not a page that would represent a valid search result (e.g., a ‘print’ version of a page).
However, indexing signals are often misconfigured, which can result in essential URLs not getting indexed. This matters because a page that is not indexed cannot generate organic search traffic.
What Are Robots Directives?
Robots directives are lines of code that tell search engines how to treat content in terms of crawling and indexing.
By default – that is, without any robots directives – search engines work on the basis that every URL they encounter is both crawlable and indexable. This does not mean they will necessarily crawl and index the content, but it is the default behavior should they encounter the URL.
Thus, robots directives are essentially used to change this default behavior by instructing search engines either not to crawl or not to index specific content.
How Are Robots Directives Presented To Search Engines?
There are three ways in which robots directives can be specified:
- Robots meta directives (often called ‘robots meta tags’), which work at a page level. Within the <head> of a page’s HTML, you include a meta tag like this: <meta name="robots" content="noindex, nofollow"> to control crawling and indexing for that specific URL.
- X-Robots-Tag headers, which are added to a site’s HTTP responses. They can set robots directives at a granular page level, just like meta tags, but can also apply directives across a whole site, for example by matching URL patterns with regular expressions in the server configuration.
- The robots.txt file, which usually lives at example.com/robots.txt and is typically used to tell search engine crawlers, through ‘disallow’ rules, which paths, folders, or URLs you don’t want them to crawl.
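As a rough illustration of how ‘disallow’ rules are evaluated, the sketch below uses Python’s standard-library urllib.robotparser against a made-up robots.txt. The rules, paths, and the is_crawlable helper are assumptions for illustration only, and this parser does simple prefix matching, so it may not mirror exactly how every search engine interprets a given file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /search
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://example.com/print/page-a"))  # blocked by /print/
print(is_crawlable("https://example.com/page-a"))        # no matching rule
```

This checks crawlability only. It says nothing about any robots meta tags or X-Robots-Tag headers that the URL itself might carry.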
When the ‘nofollow’ directive is used in a robots meta tag or an X-Robots-Tag header, it signals that none of the links on the page should be followed. It is also possible to specify that individual links should not be followed, by adding the rel="nofollow" attribute to a specific link.
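To make the page-level signals more concrete, here is a minimal sketch that collects robots directives from a page’s HTML and from its response headers, then treats the URL as Not Indexable if ‘noindex’ appears in either. The function and class names are hypothetical, and the logic is deliberately simplified: it ignores crawler-specific meta names and shorthand values such as ‘none’.

```python
from html.parser import HTMLParser


def split_directives(value: str) -> set:
    """Split a comma-separated directive string into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attr.get("content") or "")


def robots_directives(html: str, headers: dict) -> set:
    """Merge directives from the HTML and from any X-Robots-Tag header."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = set(parser.directives)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= split_directives(value)
    return found


def is_indexable(html: str, headers: dict) -> bool:
    """Simplified check: indexable unless a 'noindex' directive is present."""
    return "noindex" not in robots_directives(html, headers)


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_directives(page, {}))
print(is_indexable(page, {}))
print(is_indexable("<html><head></head></html>", {"X-Robots-Tag": "noindex"}))
```

A real audit would also need the HTTP status code and the canonical, since those affect indexability too.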
What Is A Canonical?
In the field of SEO, a ‘canonical’ is a way of indicating to search engines the ‘preferred’ version of a URL. So if we have two URLs with very similar content – Page A and Page B – we could put a canonical tag on Page A, which specifies Page B as the canonical URL.
To do this, we could add the rel=canonical element in the <head> section on Page A:
<link rel="canonical" href="https://example.com/page-b" />
If this were to happen, you would describe Page A as ‘canonicalized’ to Page B. This generally means that Page A will not appear in search results, whereas Page B will. As such, it can be a very effective way of stopping duplicate content from getting indexed.
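The idea of a page being ‘canonicalized’ can be sketched in a few lines of Python. The example below reads the first rel=canonical link from a page’s HTML and reports whether it points somewhere other than the page’s own URL. The helper names and the page-a URL are assumptions for illustration, and the comparison is a plain string match, so it does not account for equivalent URLs written differently.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def canonical_of(html: str):
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def is_canonicalized(page_url: str, html: str) -> bool:
    """True if the page declares a canonical other than its own URL."""
    canonical = canonical_of(html)
    return canonical is not None and canonical != page_url


page_a_html = '<head><link rel="canonical" href="https://example.com/page-b" /></head>'
print(canonical_of(page_a_html))
print(is_canonicalized("https://example.com/page-a", page_a_html))
```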
When you set up a canonical, you effectively say to search engines: ‘This is the URL I want you to index.’ People may refer to a canonical as ‘a canonical tag’, ‘rel canonical,’ or even ‘rel=canonical.’
Self-referential canonicals, where a page specifies its own clean URL as the canonical, are a useful default configuration. They are typically set up to stop duplicate, parameterized versions of the same URL from getting indexed, for example:
https://example.com/page?utm_medium=email
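One way a template might derive a self-referential canonical is to strip tracking parameters from the requested URL before writing the tag. The sketch below assumes that any parameter beginning with utm_ is tracking-only; that prefix list and the function names are illustrative assumptions rather than a rule, and a real site would need to decide which parameters genuinely change the content.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url: str, prefixes=("utm_",)) -> str:
    """Drop query parameters whose names start with a tracking prefix."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(prefixes)
    ]
    # The fragment is dropped as well, since it does not identify a separate page.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(requested_url: str) -> str:
    """Build a self-referential canonical tag for the cleaned URL."""
    href = escape(strip_tracking(requested_url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


print(strip_tracking("https://example.com/page?utm_medium=email"))
print(canonical_link_tag("https://example.com/page?utm_medium=email"))
```

With this in place, both the clean and the parameterized URL would declare the same canonical.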
How Are Canonicals Implemented?
The most common way canonicals are implemented is through a <link> tag in the <head> section of a page. So on Page A, we could specify that the canonical URL is Page B with the following:
<link rel="canonical" href="https://example.com/page-b" />
Canonicals can also be implemented through HTTP headers, where the header looks like this:
HTTP/… 200 OK
…
Link: <https://example.com/page-b>; rel="canonical"
Typically, this method is used to add canonicals to non-HTML documents such as PDFs. However, it can be used for any document.
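Because a canonical can be declared in both the HTML and the HTTP header of the same URL, the two can end up contradicting each other. As such, it is considered best practice to use only one method of assigning canonicals for each URL on a given website.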
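A simple check for this kind of conflict might look like the sketch below, which pulls the canonical target out of a Link header value and compares it with a canonical found in the HTML. The function names are hypothetical and the header parsing is intentionally basic: it handles the common form shown above but is not a full Link header parser.

```python
import re

# Matches one link-value: <target> followed by optional ;param=value pairs.
LINK_RE = re.compile(r"<([^>]*)>\s*((?:;\s*[^;,]+)*)")


def canonical_from_link_header(value: str):
    """Return the target of the first rel="canonical" link, or None."""
    for match in LINK_RE.finditer(value):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            rel_values = raw.strip().strip('"').lower().split()
            if name.strip().lower() == "rel" and "canonical" in rel_values:
                return target
    return None


def canonical_conflict(header_canonical, html_canonical) -> bool:
    """True only when both methods are used and they disagree."""
    return (
        header_canonical is not None
        and html_canonical is not None
        and header_canonical != html_canonical
    )


header_value = '<https://example.com/page-b>; rel="canonical"'
from_header = canonical_from_link_header(header_value)
print(from_header)
print(canonical_conflict(from_header, "https://example.com/page-c"))
```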
To learn more about canonicalization, see Google’s “What is canonicalization” reference page.