
Crash Course on Properly Developing Robots.txt

Written by Peter Devereaux | Mar 24, 2015 5:00:00 AM

Web workers responsible for search engine optimization always have a trick or two up their digital sleeves. And when in pursuit of better organic search results, one of the most useful is controlling bots with a website's robots.txt file.

The robots exclusion standard, also known as the robots exclusion protocol or robots.txt protocol, is a standard used by websites to communicate with web crawlers and other web robots (or bots).

The standard essentially provides instructions to the robot about which areas of the website should (or should not) be processed, scanned, crawled or indexed. The file itself lives at the root of the site (e.g., example.com/robots.txt). Since bots are used most often by search engines, the following provides some insights into what to include in a robots.txt file and what you can and can't do with this protocol.

Most websites use the wildcard * to tell all bots that they can visit all files, like so:

User-agent: *
Disallow:

Of course, it's also possible to tell bots to stay out of a website completely (by using the / mark):

User-agent: *
Disallow: /
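
These rules can also target a single crawler by name rather than every bot. For instance, the following (using the hypothetical bot name "BadBot") blocks only that crawler while leaving all others unaffected:

User-agent: BadBot
Disallow: /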

SEOs and webmasters can also exercise more granular control, indicating a specific directory that bots are advised to stay away from:

User-agent: *
Disallow: /cgi-bin/

This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

There are also several "non-standard" extensions that can prove useful if you're building your own robots.txt (rather than leaving it to software such as a content management system).

For example, many crawlers support a "Crawl-delay" parameter, which sets the number of seconds to wait between successive requests to the same server (webmasters can also modify the crawl rate of Googlebot specifically within Google Webmaster Tools).
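
A minimal example (the 10-second value here is purely illustrative) looks like this:

User-agent: *
Crawl-delay: 10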

Some crawlers also support an "Allow" directive, which counteracts a "Disallow" directive that follows (which you might want to use to allow access to one file within a folder that has otherwise been disallowed - see the example below).

Allow: /directory1/myfile.html
Disallow: /directory1/

Despite the use of the terms "allow" and "disallow", the protocol has to rely on the cooperation of the web robot, so indicating that a site should not be accessed (processed, scanned, crawled or indexed) with robots.txt does not guarantee exclusion of all web robots. There are many web robots (e.g., Semalt) that don't abide by the guidance provided within robots.txt, but it's certainly better to have one than not.
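
To see what that cooperation looks like in practice, here is a minimal sketch of how a compliant client consults these rules, using Python's standard-library urllib.robotparser (the example.com URLs are placeholders):

from urllib import robotparser

# Point the parser at the site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the file

# A well-behaved crawler asks before fetching each URL.
print(rp.can_fetch("*", "http://example.com/cgi-bin/script.cgi"))  # False if /cgi-bin/ is disallowed
print(rp.can_fetch("*", "http://example.com/index.html"))          # True if no rule blocks it

A compliant bot runs a check like this before every request; a rogue bot simply skips it, which is why robots.txt is guidance rather than enforcement.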

One of the reasons your site may be having trouble achieving high placement in search results for relevant (and competitive) terms is that the guidance provided in the robots.txt file is holding you back. Let this encourage you to analyze your brand's approach to the robots exclusion standard and see where the Web can take you next.