Tees In A Pod: Robots.txt - Be Friends with It!

Sunday, June 20, 2010

So, this is one of the first posts in what we here at Tees in a Pod affectionately call "SEO Corner". In a previous post we touched on 6 SEO Onsite Tips for your website, and in this post we're going to expand on robots.txt and why you want to be friends with it.

So, what is the purpose of the robots.txt file?
In a nutshell, a robots.txt file tells the search engines what they are allowed to crawl on your site, which in turn shapes what ends up in their index. The file is called robots.txt because it is written for the search engine bots, and it must be a plain text file sitting at the root of your site. Well-behaved spiders will look for it before they crawl. It is possible to give spiders instructions on each page of your site via the meta robots tag, but that is a lot more effort than having it all consolidated in one file.

Why Should You Use A robots.txt file?
As a site grows there may be directories on it that you want to keep out of the search engines. The way to stop the big, robots.txt-obeying bots from crawling them is to "disallow" access to that directory or file via robots.txt. Bear in mind that the file itself is publicly readable and only well-behaved bots honour it, so anything truly private still needs proper protection such as a password. Also, you might have one page for testing and one "live" page on your site. If Google sees "duplicate" content you may get docked in the rankings; by excluding the test pages you ensure that bots crawl only what you want them to crawl.

How Do You Use a robots.txt file?
Google, the King Kong of search engines, has made it insanely easy to make sure your robots.txt file is in order. If you have not yet accessed Google's Webmaster Tools, do it immediately and explore. You will benefit your site and search results by spending an hour or two rummaging around to see how Google sees your website. What errors does Google see that you don't?

During your rummaging around you should find the section "Site Configuration" which has the "crawler access" tab. It is in this tab that you make sure you have properly set up your robots.txt file.

Robots.txt files are written using what is called the "Robots Exclusion Standard", which is an agreed language for all search engine spiders. And the good news is that it is very simple to understand. So, let's have a look at some code examples and make sense of the robots.txt file.

User-agent: *
Disallow: /

In the above, User-agent: specifies the search engine bot you want to give instructions to. The * symbol indicates that all search engine bots are to follow the rules that come after it. The Disallow: command lists what those bots are disallowed from. The / symbol lets the bots know that all folders and files are disallowed. So, the above example would tell all bots not to crawl any folder or file on your site, which is possibly not the best idea. If you wanted all bots to access all folders/files, the following would do the trick:

User-agent: *
Allow: /
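If you'd like to sanity-check these two files before uploading them, here is a minimal sketch using Python's built-in urllib.robotparser module. The helper name is_allowed, the bot name ExampleBot and the example.com URL are all made up for illustration, and a real search engine's parser may differ in its details.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two files from the examples above.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nAllow: /\n"

# Placeholder page and bot name, not real ones.
PAGE = "https://example.com/shirts.html"

print(is_allowed(BLOCK_ALL, "ExampleBot", PAGE))  # False
print(is_allowed(ALLOW_ALL, "ExampleBot", PAGE))  # True
```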

Now, let's go a bit deeper and go through it line by line:

User-agent: * - talking to all spiders/bots
Disallow: / - telling them all they are not to crawl anywhere
User-agent: Googlebot - singling out Google's spider
Disallow: /cgi-bin/ - telling googlebot not to crawl cgi-bin folder
Disallow: /privatedir/ - telling googlebot not to crawl privatedir folder

But how can this be? The first two lines of the robots.txt tell all spiders to take a hike, and then we have to tell Google to take a hike again? The reason is that in line 3 we have addressed Googlebot specifically. A bot that finds a group addressed to it by name follows only that group and ignores the * group, so at that point Googlebot has license to roam all of our site. We then clip its wings a bit and disallow Googlebot from the cgi-bin folder and the privatedir folder, but it still has access to all other areas. From this you can see that once you address a bot specifically you give it access to all areas of the site unless you put some restrictions in via the Disallow statements. To emphasise this, the two examples below mean the same thing:

User-agent: *
Disallow:

User-agent: *
Allow: /

In the first example we are talking to all bots and disallowing them access to nowhere; in the second example we are talking to all bots and allowing them access to everywhere.
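To see the Googlebot example from earlier in action, here is a rough sketch, again with Python's built-in urllib.robotparser. The site address, the bot name OtherBot and the file names are placeholders we made up; the rules themselves are the ones from the line-by-line example. The last check feeds in the empty Disallow: file to show that it lets everything through, just like Allow: /.

```python
from urllib.robotparser import RobotFileParser

# The line-by-line example from above, without the annotations.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

SITE = "https://example.com"  # placeholder address

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows its own group, so only the two folders are off limits.
print(parser.can_fetch("Googlebot", SITE + "/index.html"))        # True
print(parser.can_fetch("Googlebot", SITE + "/cgi-bin/form.cgi"))  # False

# Any other bot falls back to the * group and is shut out completely.
print(parser.can_fetch("OtherBot", SITE + "/index.html"))         # False

# An empty Disallow: blocks nothing at all.
empty_disallow = RobotFileParser()
empty_disallow.parse(["User-agent: *", "Disallow:"])
print(empty_disallow.can_fetch("OtherBot", SITE + "/index.html")) # True
```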

So, when is robots.txt really useful?
As mentioned earlier, there may be areas of your site that you don't want bots to poke their noses into (private data etc.). Now, let's consider the following scenario. You may have pages which have some AdSense info in their URL which can get indexed and then be returned in the organic search results. However, when people click on these links you will get charged as it will be counted as a click. Robots.txt files can be used to exclude all URLs with certain parameters. This can be done nice and easily using the wildcard symbol, which is *.

User-agent: *
Disallow: /*?

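So, the above will tell all bots to steer clear of all URLs which contain a question mark (?). The wildcard symbol is a very powerful way to quickly eliminate files from the bots' spidering duties. It is honoured by the major search engines such as Google, though not necessarily by every bot out there.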
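If the wildcard still feels a bit abstract, the little sketch below shows roughly how a rule like /*? gets compared against a URL path. It is a hand-rolled illustration (the function name rule_matches and the sample paths are ours), not the code any search engine actually runs. We wrote it by hand because Python's built-in urllib.robotparser does not appear to interpret the * wildcard in paths.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern covers the given URL path."""
    # Escape each literal piece, then let every * stand for any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the start, so a rule works as a prefix of the path.
    return re.match(regex, path) is not None


# Hypothetical paths on our own site.
print(rule_matches("/*?", "/tees.html?ad=123"))        # True
print(rule_matches("/*?", "/tees.html"))               # False
print(rule_matches("/cgi-bin/", "/cgi-bin/form.cgi"))  # True
```

A Disallow rule applies to a URL whenever the pattern matches, so the first path would be off limits and the second would still be crawled.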

Another use of robots.txt files is to restrict access to images. There are two reasons why you'd want to do this:
- If a bot encounters an error in an image (incorrectly titled, corrupted, etc.) it will "self destruct" and not complete its spidering mission. It will then be less inclined to return to your site, which will result in less spidering and less search engine love.
- If you have a lot of pictures and all bots are indexing them, that will become a drain on your bandwidth resources. Some of your images might be found in Google Images search, but how many conversions will you get as a result of being found? One thing that is certain, though, is that if you have lots of images being crawled regularly your bandwidth usage will go up, which may result in additional charges (or slower performance) for your website from your host.
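As a rough illustration of the image case, suppose all of your pictures live in a folder called /images/ (that folder name, the bot name and the file names below are assumptions for the example, so swap in your own). The sketch checks with Python's built-in urllib.robotparser that a Disallow rule on that folder keeps bots away from the pictures while leaving the pages themselves crawlable.

```python
from urllib.robotparser import RobotFileParser

# Assumes every picture sits under a hypothetical /images/ folder.
IMAGE_RULES = """\
User-agent: *
Disallow: /images/
"""

image_parser = RobotFileParser()
image_parser.parse(IMAGE_RULES.splitlines())

# Placeholder bot name and URLs.
print(image_parser.can_fetch("ExampleBot", "https://example.com/images/tee-front.jpg"))  # False
print(image_parser.can_fetch("ExampleBot", "https://example.com/tees.html"))             # True
```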

I hope that after reading this you have a better understanding of what a robots.txt file does, how to use it and why it should be used. I strongly suggest that you go to google.com/webmasters now and make sure your robots.txt file is properly set up, and let us know what impact it has on your site in the search engine results. If you have any questions regarding the robots.txt file, please feel free to ask and I'd be more than happy to try to explain.

This post was written by Rob from LadyUmbrella ladies t-shirts.