
Googlebot 2.1 Abnormal Behavior?

Googlebot 2.1 abnormal behavior: crawled a huge number of files in one session.


steven mheakyle

3:21 am on Apr 17, 2004 (gmt 0)

10+ Year Member



hello WebmasterWorld,

I have some questions regarding the behavior of Googlebot: does the bot crawl pages one at a time, or does it crawl all of your pages at once?

I have seen other people complain that Googlebot crawled only one page (the index).

In our situation it's the opposite: it crawled a huge number of pages, 3,***, which worries me because of bandwidth issues. The other day it crawled those 3,*** pages in one session, and Googlebot used a number of IP blocks to do it.

Honestly, it ate up roughly 400 MB of bandwidth :( which prompted me to download the whole raw log file and check which pages Googlebot crawled. I saw and verified that it crawled our new pages (a bunch of content and write-ups).

How can I avoid this?

If Googlebot crawls our website more than once a month, this will cost us bandwidth. Say it crawls 10 times: we would lose (400 MB x 10 crawls) = 4 GB of bandwidth in a single month.

Based on my understanding and experience, Googlebot crawls our pages one at a time, not in these huge bursts.

Has anybody experienced the same? Please share any advice.

thanks

steven

jdMorgan

4:24 am on Apr 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



steven,

I have not seen Googlebot/2.1 doing anything abnormal recently. If you have 3000 pages and you don't use robots.txt to tell Googlebot not to crawl all of them, then it may take one page every five seconds, and complete the crawl of your site in just over four hours.

If it is fetching pages faster than one page every few seconds, then see the Google Web site and report the problem.

If you have some parts of your site that you don't want/need to appear in Google, then use robots.txt to exclude their robot from those pages.
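For example, a robots.txt along these lines (the directory names here are only placeholders; substitute your own) would keep Googlebot out of sections you don't need indexed:

```
# Hypothetical example - replace the paths with your own directories
User-agent: Googlebot
Disallow: /private/
Disallow: /temp/
```

Place the file at the root of your site (e.g., www.example.com/robots.txt); well-behaved robots fetch it before crawling.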

Jim

Robino

4:31 am on Apr 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




I understand what you're saying, Steven.

There really isn't anything you can do (assuming you want your pages included in the index). G-bot seems to be very prolific lately. If it's slowing down your site, you might want to consider upgrading to a new server.

steven mheakyle

4:54 am on Apr 17, 2004 (gmt 0)

10+ Year Member



hello moderator,

The real figure is that we have more than 3k+ content pages, including images.

We want them in the search engine listings, but the way Googlebot crawls all 3k+ pages in one crawl session is too much.

Is there a way I could use robots.txt to instruct bots (Googlebot et al.) to crawl 100 pages at a time, or at least fewer than it crawled the other day?

best regards

steven

jdMorgan

5:34 am on Apr 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pages and images are crawled by different Googlebots. Do you need to have your images in Google's image search? If not, then Disallow Googlebot-Image in your robots.txt file. That will save you a lot of bandwidth.
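A minimal sketch of that exclusion:

```
# Tell Google's image crawler to skip the whole site;
# the regular Googlebot is unaffected by this block.
User-agent: Googlebot-Image
Disallow: /
```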

You will have to make a decision: either disallow the robots from some or all of your images and pages, upgrade your server, or pay for more bandwidth. You must decide whether the cost of the bandwidth is worth the benefit of having all of your pages and images listed in Google.

There is no command to control the rate at which robots spider your site. However, they do promise not to exceed a certain rate, such as the one page every few seconds mentioned above.

Jim

BReflection

12:46 pm on Apr 17, 2004 (gmt 0)

10+ Year Member



steven, Googlebot visits my site either every day or every other day, and asks for my robots.txt every single time. Perhaps you could use your robots.txt to feed Googlebot pages over time? Start with a little and slowly add more...

Just an idea.
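One way to sketch that staged approach (the directory names are hypothetical) is to disallow most sections at first, then delete one Disallow line at a time as each batch gets crawled:

```
# Stage 1: only the main pages are crawlable.
# Each week, remove one Disallow line below to open up another section.
User-agent: Googlebot
Disallow: /archive/
Disallow: /gallery/
Disallow: /write-ups/
```

Note this only spreads the crawl out over time; the total bandwidth consumed once everything is open stays the same.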

steven mheakyle

5:56 pm on Apr 17, 2004 (gmt 0)

10+ Year Member



Could anybody paste a sample robots.txt for me to manage this problem?

I am open to suggestions.

thank you

SlowMove

6:10 pm on Apr 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What kind of time interval between pages are you seeing? 3,000 pages is nothing if it happens over a few hours. If it is moving too fast, I'd check the IP address to verify it's really Googlebot. Building a first-rate search engine is difficult; slowing down a process is easy for any computer programmer.

<added>I wouldn't use robots.txt to prevent G from crawling html files</added>

jdMorgan

6:25 pm on Apr 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Could anybody paste a sample robots.txt for me to manage this problem?

A Standard for Robot Exclusion [robotstxt.org]
Sample robots.txt [webmasterworld.com]
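
A minimal example in the spirit of those references (the paths are only illustrations; adapt them to your site):

```
# Skip image search entirely to save bandwidth
User-agent: Googlebot-Image
Disallow: /

# For all other robots, exclude non-content areas
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
```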

Jim