Wednesday, July 23, 2008

Controlling Googlebot

For some webmasters, Google crawls too often (and consumes too much bandwidth); for others, it visits too infrequently. Some complain that it doesn’t visit their entire site, while others get upset when pages they never wanted accessible via search engines appear in the Google index.

To a certain extent, it is not possible to attract robots on demand: Google will visit your site often only if the site has excellent content that is updated frequently and cited often by other sites. No amount of shouting will make you popular! However, it is certainly possible to deter robots. You can control which pages Googlebot crawls and (should you wish) request a reduction in the frequency or depth of each crawl.

To prevent Google from crawling certain pages, the best method is to use a robots.txt file. This is simply an ASCII text file that you place at the root of your domain. For example, if your domain is http://www.yourdomain.com, place the file at http://www.yourdomain.com/robots.txt. You might use robots.txt to stop Google indexing your images, running your Perl scripts (for example, any forms for your customers to fill in), or accessing pages that are copyrighted. Each block of the robots.txt file lists first the name of the spider, then the directories or files it is not allowed to access on subsequent, separate lines. Googlebot also supports two wildcard extensions to the original standard: * matches any sequence of characters, and $ anchors a pattern to the end of a URL.

The following robots.txt file would prevent all robots from accessing your image or Perl script directories, and would additionally keep Googlebot out of your copyrighted material and copyright notice page (assuming you keep your images in an “images” directory, your scripts in “cgi-bin”, and your copyrighted material in a “copyright” directory). Note that a spider obeys only the single block that best matches its name, so any rules that should also apply to Googlebot must be repeated in the Googlebot block:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /copyright/
Disallow: /content/copyright-notice.html
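
For completeness, here is a short sketch of Googlebot’s wildcard matching. The paths (PDF files and a sessionid parameter) are hypothetical examples of mine, not part of the site described above:

User-agent: Googlebot
# Block any URL that ends in .pdf
Disallow: /*.pdf$
# Block any URL whose query string contains sessionid=
Disallow: /*sessionid=

Because these pattern-matching extensions are not part of the original robots.txt standard, spiders other than Googlebot may simply ignore them.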

To control Googlebot’s crawl rate, you need to sign up for Google Webmaster Tools (a process I cover in detail in the section on tracking and tuning, page 228). You can then choose one of three settings for your crawl rate: faster, normal, or slower (although faster is not always an available choice). Normal is the default (and recommended) crawl rate. A slower crawl will reduce Googlebot’s traffic on your server, but Google may not be able to crawl your site as often.
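
As an aside not covered above, some other spiders accept a Crawl-delay directive in robots.txt to space out their requests; Googlebot ignores this directive, which is precisely why the Webmaster Tools setting exists. A minimal sketch, assuming you want Yahoo!’s Slurp and Microsoft’s msnbot to wait ten seconds between requests:

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10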

You should note that none of these crawl adjustment methods is 100% reliable (particularly for spiders that are less well behaved than Googlebot). Even less likely to work are robots meta tag instructions, which you place in the head section of your web page.

However, I will include them for completeness. The meta tag to stop spiders indexing a page is:

<meta name="robots" content="NOINDEX">

The meta tag to prevent spiders from following the links on your page is:

<meta name="robots" content="NOFOLLOW">
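
The two directives can also be combined in a single tag. As a purely illustrative sketch, the head of a page that asks spiders neither to index it nor to follow its links might look like this:

<head>
<title>Private page</title>
<meta name="robots" content="NOINDEX, NOFOLLOW">
</head>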

Google is known to observe both the NOINDEX and NOFOLLOW instructions, but as other search engines often do not, I would recommend the use of robots.txt as a better method.
