Friday, May 4, 2007

Conserve Your Web Site's Bandwidth

A recent Reuters article, "Survey: Google draws 64 percent of search queries," confirmed what the majority of web surfers already knew: Google is the top search engine. Citing a March survey conducted by Hitwise, Reuters reported that the top three search engines account for 94.5% of all search queries on the web. Google, the market leader, holds a commanding 64.1% share, followed by Yahoo at 21.3% and MSN Search at 9.1%. This may not be big news to most web users, but for marketing folks, advertisers, and webmasters alike, it can mean the difference between a site's success and failure, and the report may also help them decide where to concentrate their resources.

As a webmaster, I am often concerned with conserving bandwidth on my web sites and with keeping each of them available to the customers and clients I hope will generate income for me. From the Reuters story, I know that 94.5% of all web queries are handled by only three search engines, so I should get more bang for my buck if I focus my resources on just them. On the subject of conserving bandwidth, I recently posted a comment on the site "How I Invented the Free Lunch," where I addressed a question (#57) raised by one of the readers: "How did you avoid the search engines from slurping up all your bandwidth?" One of the easiest ways to restrict the activities of search engines on a site, yet one often overlooked by webmasters, is a Robots.txt file.

My response on line #82 was:

**********
Most webmasters use a Robots.txt file and a Meta tag to control the activities of search engines. For example, I use the following Meta tag on some sites:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

and a Robots.txt file in the root directory that limits the access of search engines to only the major ones I want, such as Yahoo and Google. For example:

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: *
Disallow: /

I can also use the Robots.txt file to restrict which content search engines may access on my web site. For example, to keep Google and other crawlers away from my image files, thus conserving bandwidth, I use the following lines in my robots file:

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /images/

If anyone is interested, I would be happy to post an example of a Robots.txt file that you could edit yourself to meet your own needs.
**********
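
In that spirit, here is a rough sketch of what a complete Robots.txt file might look like, combining the rules above into a single file placed in the site's root directory. The /cgi-bin/ path is just a placeholder for illustration; edit the directories and user-agents to suit your own site:

# Sample robots.txt - save as /robots.txt in the web site's root
# The paths below are placeholders; adjust them for your own site.

# Google's main crawler: allowed, but kept out of scripts and images
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /images/

# Yahoo's crawler (Slurp): same rules
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /images/

# Google's image crawler: blocked entirely
User-agent: Googlebot-Image
Disallow: /

# All other robots: blocked entirely
User-agent: *
Disallow: /

Keep in mind that Robots.txt is a voluntary standard: well-behaved crawlers like Googlebot and Slurp honor it, but it will not stop bots that choose to ignore it.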

Sound too simple to be true? I know from my web stats software that each site is crawled only by the search engines I want, and that they access only the content I allow. So I can only conclude that the Robots.txt file does indeed work. For more information and tips on using Robots.txt files, please visit: http://www.robotstxt.org/
