Forum Moderators: open
I have some questions regarding the behavior of Googlebot. Does the bot crawl pages one at a time, or will it crawl all of your pages in one session?
I have seen other people complain that Googlebot crawled only one page (the index).
In our situation it is the opposite: it crawled a huge number of pages, 3,***, which worries me because of bandwidth issues. The other day it crawled all 3,*** pages in one session, and Googlebot used a number of IPs from the same block.
Honestly, it ate up roughly 400MB of bandwidth :( which prompted me to download the whole raw log file and check which pages Googlebot crawled. I saw and verified that it crawled our new pages (a bunch of content and write-ups).
How can I avoid this?
In this case we will lose bandwidth if Googlebot crawls our website more than once a month. Say it crawls 10 times: we would lose (400MB x 10 crawls) = 4GB of bandwidth in a single month.
Based on my understanding and experience, Googlebot crawls pages one at a time, not in huge batches like this.
Has anybody experienced the same? Please give any advice.
thanks
steven
I have not seen Googlebot/2.1 doing anything abnormal recently. If you have 3000 pages and you don't use robots.txt to tell Googlebot not to crawl all of them, then it may take one page every five seconds, and complete the crawl of your site in just over four hours.
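The arithmetic behind that estimate is straightforward; here is a quick sanity check, assuming a fixed rate of one fetch every five seconds (the numbers are the ones from this thread, not measurements):

```python
# Estimate total crawl time for a site at a fixed fetch rate.
pages = 3000          # approximate page count from this thread
seconds_per_page = 5  # assumed rate: one fetch every five seconds

total_seconds = pages * seconds_per_page
hours = total_seconds / 3600
print(f"{hours:.2f} hours")  # -> 4.17 hours
```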
If it is fetching pages faster than one page every few seconds, then see the Google Web site and report the problem.
If you have some parts of your site that you don't want/need to appear in Google, then use robots.txt to exclude their robot from those pages.
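For example, a minimal robots.txt along those lines might look like the following (the directory names are placeholders only; substitute the paths from your own site):

```
# Keep Googlebot out of sections that don't need to be indexed.
# Directory names below are examples only.
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/

# All other robots may crawl everything.
User-agent: *
Disallow:
```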
Jim
The real data is that we have more than 3k+ content pages, including images.
We want them in the search engine listings, but the way Googlebot crawls all 3k+ pages in one crawl session is too much.
Is there a way I could use robots.txt to instruct bots (Googlebot et al.) to crawl only 100 pages at a time, or at least fewer than it crawled the other day?
best regards
steven
You will have to make a decision: either disallow the robots from some or all of your images and pages, upgrade your server, or pay for more bandwidth. You must decide whether the cost of the bandwidth is worth the benefit of having all of your pages and images listed in Google.
There is no command to control the rate at which robots spider your site. However, they do promise not to exceed a certain rate, such as the one page every few seconds mentioned above.
Jim
<added>I wouldn't use robots.txt to prevent G from crawling html files</added>
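If the images are what is eating the bandwidth, an exclusion like this keeps the html pages crawlable while blocking the heavy files (the directory names are examples; match them to your own site layout):

```
# Block all robots from image directories only; html stays crawlable.
# Directory names are examples only.
User-agent: *
Disallow: /images/
Disallow: /photos/
```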
A Standard for Robot Exclusion [robotstxt.org]
Sample robots.txt [webmasterworld.com]
Jim