Wednesday, July 23, 2008

How Googlebot first finds your site

There are essentially four ways in which Googlebot finds your new site. The first and most obvious way is for you to submit your URL to Google for crawling, via the “Add URL” form at www.google.com/addurl.html. The second is when Google finds a link to your site from another site that it has already indexed and subsequently sends its spider to follow the link. The third is when you sign up for Google Webmaster Tools (more on this on page 228), verify your site, and submit a sitemap. The fourth (and final) way is when you redirect an already indexed web page to the new page (for example using a 301 redirect, about which there is more later).
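As an illustration of that fourth route, a 301 (permanent) redirect can be set up with a single line in an .htaccess file. This is a minimal sketch only, assuming an Apache web server and using made-up page names:

Redirect 301 /old-page.html http://www.yourdomain.com/new-page.html

When Googlebot next requests the old page, the server returns the 301 status code together with the new location, and Google transfers its attention to the new URL.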

In the past you could use search engine submission software, but Google now prevents this – and prevents spammers bombarding it with new sites – by using a CAPTCHA, a challenge-response test to determine whether the user is human, on its Add URL page. CAPTCHA stands for Completely Automated Public Turing test to tell Computers
and Humans Apart, and typically takes the form of a distorted image of letters and/or numbers that you have to type in as part of the submission.

How quickly you can expect to be crawled

There are no firm guarantees as to how quickly new sites – or pages – will be crawled by Google and then appear in the search index. However, following one of the four actions above, you would normally expect to be crawled within a month and then see your pages appear in the index two to three weeks afterwards. In my experience, submission via Google Webmaster Tools is the most effective way to manage your crawl and to be crawled quickly, so I typically do this for all my clients.

What Googlebot does on your site

Once Googlebot is on your site, it crawls each page in turn. When it finds an internal link, it will remember it and crawl it, either later that visit or on a subsequent trip to your site. Eventually, Google will crawl your whole site.

In the next step I will explain how Google indexes your pages for retrieval during a search query. In the step after that I will explain how each indexed page is actually ranked. However, for now the best analogy I can give you is to imagine that your site is a tree, with the base of the trunk being your home page, your directories the branches, and your pages the leaves on the end of the branches. Google will crawl up the tree like nutrients from the roots, gifting each part of the tree with its all-important PageRank. If your tree is well structured and has good symmetry, the crawl will be even and each branch and leaf will enjoy a proportionate benefit. There is (much) more on this later.

Controlling Googlebot

For some webmasters Google crawls too often (and consumes too much bandwidth). For others it visits too infrequently. Some complain that it doesn’t visit their entire site and others get upset when areas that they didn’t want accessible via search engines appear in the Google index.

To a certain extent, it is not possible to attract robots: Google will visit your site often only if the site has excellent content that is updated frequently and cited often by other sites. No amount of shouting will make you popular! However, it is certainly possible to deter robots. You can control which pages Googlebot crawls and (should you wish) request a reduction in the frequency or depth of each crawl.

To prevent Google from crawling certain pages, the best method is to use a robots.txt file. This is simply an ASCII text file that you place at the root of your domain. For example, if your domain is http://www.yourdomain.com, place the file at http://www.yourdomain.com/robots.txt. You might use robots.txt to prevent Google indexing your images, running your Perl scripts (for example, any forms for your customers to fill in), or accessing pages that are copyrighted. Each block of the robots.txt file lists first the name of the spider, then the list of directories or files it is not allowed to access on subsequent, separate lines. Googlebot also supports simple pattern matching, such as * to match any sequence of characters and $ to mark the end of a URL (an example follows the main block below).

The following robots.txt file would prevent all robots from accessing your image or Perl script directories, and Googlebot from accessing your copyrighted material and copyright notice page (assuming you had placed images in an “images” directory and your copyrighted material in a “copyright” directory). Note that a spider obeys only the most specific block that names it, so if you also wanted to keep Googlebot out of the first two directories you would repeat those lines in its own block:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /copyright/
Disallow: /content/copyright-notice.html
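
As a further illustration of that pattern matching (a hypothetical block, using file types of my own choosing), the following would stop Googlebot crawling any URL containing a query string and any PDF file:

User-agent: Googlebot
Disallow: /*?
Disallow: /*.pdf$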

To control Googlebot’s crawl rate, you need to sign up for Google Webmaster Tools (a process I cover in detail in the section on tracking and tuning, page 228). You can then choose from one of three settings for your crawl: faster, normal, or slower (although sometimes faster is not an available choice). Normal is the default (and recommended) crawl rate. A slower crawl will reduce Googlebot’s traffic on your server, but Google may not be able to crawl your site as often.

You should note that none of these crawl adjustment methods is 100% reliable (particularly for spiders that are less well behaved than Googlebot). Even less likely to work are metadata robot instructions, which you incorporate in the meta tags section of your web page.

However, I will include them for completeness. The meta tag to stop spiders indexing a page is:

<meta name="robots" content="NOINDEX">

The meta tag to prevent spiders following the links on your page is:

<meta name="robots" content="NOFOLLOW">
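
If you want both effects, the two values can be combined in a single tag, placed in the head section of the page (a minimal sketch, with a made-up page title):

<html>
<head>
<title>My example page</title>
<meta name="robots" content="NOINDEX, NOFOLLOW">
</head>
</html>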

Google is known to observe both the NOINDEX and NOFOLLOW instructions, but as other search engines often do not, I would recommend the use of robots.txt as a better method.

Sitemaps

A sitemap (with which you may well be familiar) is an HTML page containing an ordered list of all the pages on your site (or, for a large site, at least the most important pages).

Good sitemaps help humans to find what they are looking for and help search engines to orient themselves and manage their crawl activities. Googlebot, in particular, may complete the indexing of your site over multiple visits, and even after that will return from time to time to check for changes. A sitemap gives the spider a rapid guide to the structure of your site and what has changed since last time.
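
Note that the sitemap you submit through Google Webmaster Tools is typically not the HTML page itself but an XML file following the sitemaps.org protocol. A minimal example (with a placeholder URL and dates of my own choosing) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.com/</loc>
    <lastmod>2008-07-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

You would list one url block for each page you want Google to know about, then tell Google where the file lives from within Webmaster Tools.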

Googlebot will also look at the number of levels – and breadth – of your sitemap (together with other factors) to work out how to distribute your PageRank, the numerical weighting it assigns to the relative importance of your pages.