
Search Engine Spider and User Agent Identification Forum

    
How often does googlebot hit a large new website? How many reqs/min?
DiscoStu
msg:3911279 - 8:06 pm on May 11, 2009 (gmt 0)

When Google discovers a brand new website with a lot of content, say a million pages, how often does it hit the server - i.e. what's the frequency of requests while it's spidering the site? And what about other spiders: how many spider requests can you expect when you upload a brand new site? Does it make sense to launch smaller portions of the site so as not to overload the server? Thanks

 

GaryK
msg:3911957 - 4:56 pm on May 12, 2009 (gmt 0)

I can't address the Googlebot questions, so I'll leave those to the smarter guys here. I can tell you that most relatively new dedicated web servers can't handle much more than a thousand or so concurrent visitors before they start to die under the load, and "visitors" includes both humans and bots. My sysadmin completely freaks out and makes me add another server whenever he sees any one server approaching those levels on a sustained basis. So I guess use that as a guide for your last question about overloading the server.

Pfui
msg:3912009 - 6:51 pm on May 12, 2009 (gmt 0)

If it's your site --

1.) Sign up for Google Webmaster Tools, add and verify your site, and then, in the Settings section, you can adjust the "Crawl rate." (The adjacent "Learn more" link contains specific info about default and custom rate settings.)

2.) Analyze your logs. Every bot is different and many behave differently from day to day. Control the over-zealous, dangerous and/or worthless ones with robots.txt, mod_rewrite, etc.
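
A minimal robots.txt along those lines might look like this ("BadBot" is a placeholder name, and only well-behaved crawlers honor these rules; the truly worthless ones need mod_rewrite or server-level blocks, as discussed further down the thread):

    # Block one misbehaving crawler entirely
    User-agent: BadBot
    Disallow: /

    # All other crawlers: no restrictions
    User-agent: *
    Disallow: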

If it's not your site -- don't know, sorry.

But a million pages? If they're all low-hanging fruit (no registration or membership required), good luck.

Receptional Andy
msg:3912025 - 7:18 pm on May 12, 2009 (gmt 0)

Spidering is mostly about links - the more of them, the more frequent spidering. If you have a million pages and a strong link to each of those million pages, then expect frenzied spidering activity. With few or no decent links, expect a very slow crawl (and even poorer indexing).

incrediBILL
msg:3914780 - 1:14 am on May 16, 2009 (gmt 0)

When your website is brand new and doesn't have much relevance, meaning very few inbound links, it may take a long time to get a high volume of pages completely spidered.

Megaclinium
msg:3915574 - 3:44 am on May 18, 2009 (gmt 0)

I get 200 to 300 spider hits an hour, depending on the day; about three-quarters of that tends to be Google.

Sometimes that balance switches over to Yahoo or other search engines that I haven't gotten ticked off at and blocked.

My site isn't new, but it is also huge, and I add new content regularly.

I didn't sign up for the Google panel mentioned above, but I do have a 10-second crawl delay set in robots.txt, and Google never seems to go faster than one page every 20 seconds anyway.
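
The robots.txt directive being described is presumably Crawl-delay, roughly like this (the 10-second figure mirrors the post above; Yahoo and several other crawlers honored it at the time, while Googlebot ignores it and only responds to the Webmaster Tools crawl-rate setting mentioned earlier):

    # Ask compliant crawlers to wait 10 seconds between requests
    User-agent: *
    Crawl-delay: 10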

I have Leech Protect turned on in the control panel, and Google's image bot regularly (and stupidly) tries and fails to grab my images without coming from my pages. As a result they don't show up in Google Image Search, which is no problem for me.

Other bots sometimes come in and try really fast grabs, and of course I block them.

DiscoStu
msg:3917293 - 6:15 pm on May 20, 2009 (gmt 0)

How do you go about blocking the other bots that you say try fast grabs? Do they all obey robots.txt rules, or do you use something else (like mod_rewrite in .htaccess)?

keyplyr
msg:3917565 - 5:42 am on May 21, 2009 (gmt 0)

@DiscoStu

Some do obey robots.txt, some don't. Eventually you'll gain intuitive knowledge from experience and reading these forums.

Personally, if new bots don't show immediate malice and they have a bot info page, I'll try robots.txt first and keep an eye on them. If they screw that up, they get banned via mod_rewrite in .htaccess, either by user-agent or by IP range. Better yet, block them at the server level if you have admin access.
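
A minimal .htaccess sketch of that kind of mod_rewrite ban ("badbot" and the 192.0.2.x range are placeholders; substitute the actual offender's user-agent string or IP block):

    RewriteEngine On
    # Ban by user-agent string...
    RewriteCond %{HTTP_USER_AGENT} badbot [NC,OR]
    # ...or by IP range; either match returns 403 Forbidden
    RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
    RewriteRule .* - [F]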

DiscoStu
msg:3917980 - 6:38 pm on May 21, 2009 (gmt 0)

@keyplyr

Thanks, that's what I figured, but it's good to get that part confirmed. You mention blocking at the server level, though: how is that done compared to doing it via .htaccess? I found this thread [webmasterworld.com...] about blocking bots at the server level, but it references this thread [webmasterworld.com...], and that one is all about writing rules in .htaccess...

wilderness
msg:3918021 - 7:24 pm on May 21, 2009 (gmt 0)

.htaccess and httpd.conf [google.com] will both act as firewalls. Both work server-side; httpd.conf offers more options because you control the entire server, as opposed to the portion of a server that a paid host assigns to your website.
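
A sketch of the same sort of block placed in the server config rather than .htaccess (the directory path and "badbot" user-agent are placeholders; Apache 2.2 syntax, and unlike .htaccess it requires admin access plus a config reload to take effect):

    <Directory "/var/www/example.com/htdocs">
        # Flag requests whose User-Agent contains "badbot", then deny them
        SetEnvIfNoCase User-Agent "badbot" block_bot
        Order Allow,Deny
        Allow from all
        Deny from env=block_bot
    </Directory>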

Megaclinium
msg:3919591 - 4:54 am on May 25, 2009 (gmt 0)

Disco Stu -

That first link mentions cPanel, which has an IP Deny Manager. That's all I use, since I don't seem to have access to an .htaccess file -- maybe because I'm on a shared server and the host company doesn't want to give me access in case I crash the server? I even telnetted in to look for it. I think cPanel is only for non-MS web servers, as I also have an MS-based site and it has no cPanel.
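
For what it's worth, the IP Deny Manager typically just writes deny rules into the site's .htaccess behind the scenes, roughly along these lines (the addresses are placeholders; Apache 2.2 syntax):

    # Deny a single address and a whole range, allow everyone else
    order allow,deny
    allow from all
    deny from 192.0.2.15
    deny from 198.51.100.0/24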

incrediBILL
msg:3919942 - 9:56 pm on May 25, 2009 (gmt 0)

"I don't seem to have access to .htaccess file"

You create the .htaccess file yourself in your httpdocs root directory; it's typically not there by default.
