Standard for Robot Exclusion: Excluding Robots Since ’94

The Standard for Robot Exclusion, which you may know as the robots.txt file, just turned 20 years old. To mark its two decades in existence, we thought it would be illuminating to take a closer look at the robots.txt file and how it is used in today's world. A blog post written by Brian Ussery on the topic is very educational and illustrates this file's complexity.

Back in the '90s, robots essentially ran unchecked on the web, poking around in branches of certain websites where they had no business. To limit that access, the Standard for Robot Exclusion came into being (a minimal example of the file appears at the end of this post). Though its purpose is a simple one, playing bouncer to robots, the nuances of how a robot responds to a robots.txt file are intricate.

To illustrate: prohibiting robots from certain areas of a website does not guarantee their exclusion from a search engine results page (SERP). Because search engines operate on the premise of indexing as much of the web as is available to them in order to deliver the best results, a search engine that recognizes a URL on a website as relevant to a certain search query can bring that URL up on a SERP even if the URL is blocked by a robots.txt file.

So how does a webmaster block robots while simultaneously ensuring that a page is excluded from the SERPs? This double exclusion can be achieved by using a meta tag on the page(s) in question. Place the following in the head section of the page(s):

<meta name="robots" content="noindex">

Again, when dealing with robots, instructions...
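
For reference, the blocking half of the picture lives in robots.txt itself. A minimal file that asks all compliant crawlers to stay out of one directory might look like this (the /private/ path is only a placeholder for illustration, not something from the standard or the post above):

User-agent: *
Disallow: /private/

Keep in mind that, as discussed above, these directives only restrict crawling; it is the noindex meta tag that keeps a page off the SERPs.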