1.) Sign up for Google Webmaster Tools, add and verify your site, and then, in the Settings section, you can adjust the "Crawl rate." (The adjacent "Learn more" link contains specific info about default and custom rate settings.)
2.) Analyze your logs. Every bot is different and many behave differently from day to day. Control the over-zealous, dangerous and/or worthless ones with robots.txt, mod_rewrite, etc.
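For the robots.txt part of that, a minimal sketch looks like the following. The bot name "BadBot" is a placeholder; use the User-agent string the offending crawler actually sends (you'll see it in your logs), and keep in mind that only well-behaved bots honor robots.txt at all.

```
# Block one misbehaving crawler entirely ("BadBot" is a placeholder name)
User-agent: BadBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Disallow:
```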
If it's not your site -- don't know, sorry.
But a million pages? If they're all low-hanging fruit (no registration or membership required), good luck.
Sometimes that switches over to Yahoo or other search engines that I haven't gotten ticked off at and blocked yet.
My site isn't new, but it is also huge, and I add new content regularly.
I didn't sign up for the Google panel mentioned above, but I do have a crawl delay of 10 seconds set in robots.txt, and Google never seems to go faster than one page every 20 seconds.
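For reference, that robots.txt setting is the (nonstandard) Crawl-delay directive. Some crawlers such as Yahoo's and MSN's have honored it; as far as I know Googlebot ignores it and takes its rate only from the Webmaster Tools setting, so any pacing you see from Google is its own doing.

```
# Ask compliant bots to wait 10 seconds between requests
# (nonstandard extension; not all crawlers honor it)
User-agent: *
Crawl-delay: 10
```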
I have leech protection on in the control panel, and Google's image bot regularly (and stupidly) tries and fails to grab my images without coming through my pages. As a result they don't show up in Google Image Search, which is no problem for me.
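The hotlink-protection feature in cPanel-style panels generates mod_rewrite rules roughly like the sketch below: requests for image files that don't carry a Referer from your own site get a 403. Here "example.com" is a placeholder for your domain. Note that a bot fetching images directly (with no Referer) would be let through by the first condition; drop that line if you want to refuse blank referers too, at the cost of blocking some legitimate direct requests.

```
RewriteEngine On
# Allow requests with no Referer header (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Allow requests referred from your own pages (example.com is a placeholder)
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Everything else asking for an image gets a 403 Forbidden
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```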
Other bots sometimes come in and attempt very fast grabs, and of course I block them.
Some do obey robots.txt, some don't. Eventually you'll gain intuitive knowledge from experience and reading these forums.
Personally, if new bots don't show immediate malice and they have a bot info page, I'll try robots.txt first and keep an eye on them. If they screw that up, they get banned via mod_rewrite in .htaccess, either by UA or by IP range. Better yet, do it at the server level if you have admin access.
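The mod_rewrite ban described above can be sketched like this in .htaccess. The bot name and IP range are placeholders (192.0.2.x is a documentation-only range); substitute what your logs actually show.

```
RewriteEngine On
# Ban by user-agent substring ("BadBot" is a placeholder)
RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
# Ban by IP range (placeholder documentation range)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
# Matching requests get a 403 Forbidden
RewriteRule .* - [F,L]
```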
Thanks, that's what I figured, but it's good to get that part confirmed. You mention blocking at the server level, though: how is that done compared to doing it via .htaccess? I found this thread [webmasterworld.com...] about blocking bots at the server level, but it references this thread [webmasterworld.com...], and that one is all about writing rules in .htaccess...
Both will act as firewalls, and both act server-side. The main server config apparently offers more options because you control the entire server, as opposed to the portion of the server which a paid host assigns to your website.
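As a sketch of the server-level version: with admin access you can put the same kind of directives straight into httpd.conf, where they're parsed once at startup instead of on every request (which is also why hosts on shared servers are cautious about .htaccess). The path and bot name below are placeholders.

```
# In httpd.conf (requires server admin access); Apache 2.2-style syntax
<Directory "/var/www/example">
    # Tag requests whose User-Agent matches a placeholder bot name
    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    Order Allow,Deny
    Allow from all
    # Refuse tagged requests with a 403
    Deny from env=bad_bot
</Directory>
```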
That first link mentions cPanel, which has an IP Deny Manager. That's all I use, as I don't seem to have access to the .htaccess file, maybe because I'm on a shared server and the host company doesn't want to give me access in case I crash the server? I even telnetted in to look for it. I think cPanel is only for non-M$ web servers, as I also have an M$-based site and it has no cPanel.