I can't address the Googlebot questions, so I'll leave those to the smarter guys here. I can tell you most relatively new, dedicated web servers can't handle more than about a thousand or so concurrent visitors before they start to die under the load. That count includes humans and bots. My sysadmin completely freaks out and makes me add another server when he sees any one server approaching those levels on a sustained basis. So use that as a guide for your last question about overloading the server.
If it's your site --
1.) Sign up for Google Webmaster Tools, add and verify your site, and then, in the Settings section, you can adjust the "Crawl rate." (The adjacent "Learn more" link contains specific info about default and custom rate settings.)
2.) Analyze your logs. Every bot is different and many behave differently from day to day. Control the over-zealous, dangerous and/or worthless ones with robots.txt, mod_rewrite, etc.
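As a sketch of the robots.txt approach (the bot name here is just a placeholder, not a real crawler):

```text
# Shut out one misbehaving bot entirely; "SomeBadBot" is an example name
User-agent: SomeBadBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Disallow:
```

Note this only works for bots that actually honor robots.txt; the dangerous ones usually need mod_rewrite or a server-level block instead.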
If it's not your site -- don't know, sorry.
But a million pages? If they're all low-hanging fruit (no registration or membership required), good luck.
Spidering is mostly about links: the more of them, the more frequent the spidering. If you have a million pages and a strong link to each of those million pages, expect frenzied spidering activity. With few or no decent links, expect a very slow crawl (and even poorer indexing).
When your website is brand new and doesn't have much relevance, meaning very few inbound links, it may take a long time to get a high volume of pages completely spidered.
I get 200 to 300 spider hits an hour, depending on the day; about three-quarters of that tends to be Google.
Sometimes that flips over to Yahoo or one of the other search engines I haven't yet gotten ticked off at and blocked.
My site isn't new, but it is huge, and I add new content regularly.
I didn't sign up for the Google panel mentioned above, but I do have a crawl-delay of 10 seconds set in robots.txt, and Google never seems to go faster than one page every 20 seconds.
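For anyone who hasn't seen it, the robots.txt setting in question looks like this:

```text
# Ask crawlers to wait 10 seconds between page requests.
# Crawl-delay is a nonstandard directive: some engines honor it,
# others (including Google) prefer their own rate settings instead.
User-agent: *
Crawl-delay: 10
```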
I have leech protection turned on in the control panel, and Google's image bot regularly (and stupidly) tries and fails to grab my images without coming from my pages. As a result they don't show up in Google Image Search, which is no problem for me.
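The "leech protect" checkbox in most control panels just writes mod_rewrite rules like the following into .htaccess (example.com is a placeholder for your own domain):

```apache
# Hotlink/leech protection sketch: refuse image requests whose
# referer is set but isn't this site. example.com is a placeholder.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```

The empty-referer condition lets direct requests through, which is why bots that don't send a referer at all can still fetch images.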
Other bots sometimes come in and try really fast grabs, and of course I block them.
How do you go about blocking the other bots that you say try to do fast grabs? Do they all obey the robots.txt rules, or do you use something else (like mod_rewrite in .htaccess)?
Some do obey robots.txt, some don't. Eventually you'll gain intuitive knowledge from experience and reading these forums.
Personally, if new bots don't show immediate malice and they have a bot info page, I'll try robots.txt first and keep an eye on them. If they screw that up, they get banned via mod_rewrite in .htaccess, either by UA or by IP range. Better yet, block at the server level if you have admin status.
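A minimal .htaccess sketch of that kind of ban ("BadBot" and the 192.0.2.x range are placeholders, not real offenders):

```apache
# Ban by user-agent substring OR by IP range; both values are examples
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F,L]
```

The [F] flag returns a 403 Forbidden, so a blocked bot gets a cheap one-line response instead of your pages.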
Thanks, that's what I figured, but it's good to get that part confirmed. You mention blocking at the server level, though: how is that done compared to doing it via .htaccess? I found this thread [webmasterworld.com...] about blocking bots at the server level, but it references this thread [webmasterworld.com...], and that one is all about writing rules in .htaccess...
.htaccess and httpd.conf [google.com]
will both act as filters. Both work server-side; httpd.conf offers more options because you control the entire server, as opposed to the portion of the server a paid host assigns to your website.
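For example, a server-wide IP ban in httpd.conf might look like this (Apache 2.2-style syntax; the directory path and IP range are placeholders):

```apache
# httpd.conf sketch: deny one address range for the whole docroot
<Directory /var/www>
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24
</Directory>
```

The same directives work in .htaccess, but in httpd.conf they apply once for every site on the box and Apache doesn't have to re-read a file on each request.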
Disco Stu -
that first link mentions cPanel, which has an IP Deny Manager. That's all I use, as I don't seem to have access to the .htaccess file, maybe because I'm on a shared server and the host company doesn't want to give me access in case I crash the server? I even telnetted in to look for it. I think cPanel is only for non-M$ web servers, as I also have an M$-based site and it has no cPanel.
|I don't seem to have access to .htaccess file |
You create the .htaccess file yourself in your httpdocs root directory; it's typically not there by default.