We do our fair share of website audits for new and prospective clients. On many of the sites we evaluate, we find one or more ways the client is unintentionally instructing search engines, like Google, to exclude their most important pages from the search index. If a page isn’t in the search index, then when a person searches on Google, they cannot find the page.

Almost always, a well-intentioned employee has accidentally told the search engine to keep away from, or not include, a web page or group of web pages. These issues tend to be technical in nature, and non-technical marketers often cannot discover that one is occurring unless they are actively looking for it.

Although there are more than a dozen ways to inadvertently prevent search engines from including web pages, here are the five most common ways that website owners shoot themselves in the foot.

#1 NoIndex, NoFollow

During a recent site audit, I Googled the client’s brand, which was also the domain name of the website. I was surprised that the first listing displayed was for a “Meet the Team” page instead of the site’s homepage. After further investigation, I discovered that the homepage wasn’t even in the search engine’s index.

The culprit was a meta tag in the HTML that instructed search engines not to include the homepage in their index. The HTML source for the homepage included this simple one-line tag:

<meta name="robots" content="noindex, nofollow">

This snippet instructed the search engine to exclude the page from the index (“noindex”) and to not pass any boost from the homepage (usually the most powerful page on the site) to the pages linked from it (“nofollow”).
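As a rough illustration of how a crawler reads this directive, here is a minimal Python sketch (standard library only; the page markup is a hypothetical example) that scans a page’s HTML for a robots meta tag:

```python
from html.parser import HTMLParser

# Hypothetical homepage HTML containing the problematic meta tag.
page_html = (
    '<html><head>'
    '<meta name="robots" content="noindex, nofollow">'
    '</head><body>Welcome!</body></html>'
)

class RobotsMetaFinder(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives = [d.strip().lower() for d in a.get("content", "").split(",")]

finder = RobotsMetaFinder()
finder.feed(page_html)
print(finder.directives)  # ['noindex', 'nofollow']
```

A simple audit script along these lines can flag noindexed pages across a site before they quietly disappear from the search results.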

#2 Robots Exclusion

There is a special file, robots.txt, that can be included on any website. This file contains instructions for crawlers, like Google’s crawling robot (“bot” for short). Each time a bot comes to crawl the content on your website, it first requests the robots.txt file so it explicitly understands what it should or shouldn’t examine.

For example: it is common for the robots.txt file to include a set of instructions that tells bots to stay away from the web-based content management system’s internal pages for managing the website.

A common mistake is that a developer uses the robots.txt file to tell bots not to crawl or index the entire website while it sits in a publicly available staging environment. Note: Google is good at finding a staging server that you may use to test your website while it is under development, so you should take steps to prevent it from appearing in the search results.

The problem can occur when the website in staging is approved by marketing and then all of the files, including the robots.txt file, are deployed from staging to production. The next time Google’s crawler visits the production website, it asks for the robots.txt file and it reads the instruction to not crawl the entire website.

If the bot doesn’t crawl your new pages, because they are blocked, they won’t appear in the search index. We had a B2B prospect come to us for help because all of the pages that promote their white papers were missing from the Google index. It turned out that someone previously didn’t want the white papers themselves in the search engine’s index, so the robots.txt file instructed search engines to avoid all content on the website beginning with “/white-papers/”. Unfortunately, this blocked not just the PDF white papers but also all of the web pages that promoted them.
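To see how such a rule behaves, here is a small Python sketch using the standard library’s robots.txt parser. The Disallow rule mirrors the one described above, and the URLs are hypothetical:

```python
from urllib import robotparser

# A robots.txt like the one described above: every URL under
# /white-papers/ is off-limits to all crawlers.
robots_txt = """\
User-agent: *
Disallow: /white-papers/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The rule blocks the promotional pages along with the PDFs.
print(rp.can_fetch("*", "https://example.com/white-papers/guide.html"))  # False
print(rp.can_fetch("*", "https://example.com/blog/seo-tips.html"))       # True
```

Because Disallow matches by path prefix, everything under /white-papers/ is blocked, whether it is a PDF or a regular web page.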

#3 Canonicals

The canonical link instructs a crawler as to the preferred URL (web address) for a web page. This optional link tag is located in the HTML source of a web page. If a bot crawls a page and the URL in the canonical link’s href differs from the address of the current page, then the current page most likely won’t be indexed by search engines.

The canonical tag looks like this:

<link rel="canonical" href="https://www.example.com/page/">

The most common mistakes are either that the domain name is wrong (for example, a staging domain instead of the production domain), or that many similar pages, such as job listings or blog posts, all share the same canonical tag. One of the websites I looked at recently had over 300 individual job listings all using the same canonical URL, “/jobs/”. None of the individual job listings appeared in the search results, even though the website owner very much wanted them to.
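A check like this is easy to automate. The following Python sketch (standard library only; the URLs and markup are hypothetical) pulls the canonical href out of a page’s HTML and flags it when it doesn’t match the page’s own URL:

```python
from html.parser import HTMLParser

# Hypothetical job-listing page whose canonical points at the listing
# index instead of the page's own URL.
page_url = "https://example.com/jobs/senior-engineer/"
page_html = '<html><head><link rel="canonical" href="https://example.com/jobs/"></head></html>'

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical" and self.canonical is None:
            self.canonical = a.get("href")

finder = CanonicalFinder()
finder.feed(page_html)
if finder.canonical and finder.canonical != page_url:
    print("Canonical mismatch: this page may be dropped in favor of", finder.canonical)
```

Running a comparison like this across every URL on a site quickly surfaces the “300 job listings, one canonical” pattern described above.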

#4 Not linking internally to the content

A search engine bot typically starts at your homepage and then crawls all the links it discovers there. The process repeats for each link the bot finds until it can’t find any new links. Any URLs on your website that the bot doesn’t find while crawling are, in most cases, not included in Google’s index.
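That crawling process is essentially a breadth-first traversal of your internal link graph. Here is a minimal Python sketch (the site structure is hypothetical) showing how a page with no internal links pointing at it is simply never discovered:

```python
from collections import deque

# Hypothetical internal-link graph: each page maps to the pages it links to.
# The white-paper page appears only in the XML sitemap, never in a link.
site_links = {
    "/": ["/about/", "/blog/"],
    "/about/": ["/"],
    "/blog/": ["/blog/post-1/"],
    "/blog/post-1/": ["/blog/"],
    "/white-papers/guide/": [],
}

def crawl(start):
    """Breadth-first crawl from the homepage, following internal links only."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in site_links.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

discovered = crawl("/")
orphans = set(site_links) - discovered
print(orphans)  # {'/white-papers/guide/'}
```

Pages that show up in the “orphans” set here are exactly the ones a crawler starting from the homepage can never reach.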

A best practice is to link to every piece of content from one or more other relevant pages within your website. Not only does this allow a search engine to find the content, it also helps the search engine better understand the context of the pages within your website. That context can help the search engine match your page to a user’s search query.

An XML sitemap is a special file that informs search engines about the content on your website that you want crawled and indexed. Relying on a sitemap alone does not guarantee that the search engine will include pages, especially ones linked only from the sitemap, in its index or search results. We recently reviewed a prospect’s website and found that more than 60% of the pages on the site were not discoverable by crawling; those pages were linked only from the XML sitemap. Fewer than half of them were in the Google search index, and only a small (single-digit) percentage actually received search engine referrals.

#5 Dynamically loaded content (Such as infinite scroll)

A search engine first looks at the HTML of a given web page to find links to other pages that it thinks it should crawl and possibly index. Some web pages don’t include, in the HTML returned when the browser requests the page, all of the content and links shown in the user’s web browser. The content that isn’t in the HTML is dynamically loaded by the user’s web browser after the page has been initially downloaded.

Dynamically loaded content might be the website’s search results displayed on a page, Twitter posts, gallery items, or a long list of blog posts. Sometimes a website uses infinite scroll to dynamically load additional content once the user scrolls to the bottom of the page. Twitter and Facebook do this to show more posts or tweets once the user has scrolled down past the most recent content.
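The problem is visible in the raw HTML itself. This Python sketch (the markup is a hypothetical infinite-scroll page) extracts links the way a basic crawler would, and finds none, because the post list only exists after JavaScript runs:

```python
from html.parser import HTMLParser

# Initial HTML of a hypothetical infinite-scroll blog page: the post feed
# is an empty container that JavaScript fills in after the page loads.
initial_html = """
<html><body>
  <h1>Blog</h1>
  <div id="post-feed"></div>
  <script src="/assets/load-posts.js"></script>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collects every href found in the raw HTML, as a basic crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

extractor = LinkExtractor()
extractor.feed(initial_html)
print(extractor.links)  # [] -- a crawler reading only this HTML finds no post links
```

A human with a browser sees dozens of posts on such a page; a crawler that reads only the initial HTML sees an empty container and moves on.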

Although there are ways to programmatically instruct bots how to view and process dynamic content, it is complicated and can introduce additional issues if not implemented carefully.

Summary

Indexing is extremely important if you want your content to appear in search engines. If you don’t give your developers the explicit requirement that web pages must appear in search results, and don’t validate that issues like the ones in this article aren’t preventing that from happening, then you risk having your hard work never lead to search engine success.