|Apache overload when bots are about|
It's my first post here so I hope I'm in the right place.
I have a Fedora Core 3 web server with Apache 2.0.54, PostgreSQL 8.03, PHP 5.0.4, etc. Recently, roaming bots appear to have been overloading Apache, causing the site to hang and eventually crash. After plenty of troubleshooting, changing memory modules, and finally upgrading servers, we've come to the realization that the bots are overloading the system. Here is what our robots.txt file contains:
Does anyone have any suggestions for me? I really need the bots to spider the site but I can't keep crashing and overloading the site every time. We currently have over 50,000 static pages and I would at least like some of them to get indexed.
Thanks in advance for any help.
Have you done any speed testing? That is, seeing how long it takes to build responses. Also, do you have session IDs in your urls or anything that tends to appear upon first visit that would change every time a new visitor comes?
I've found that if your responses take more than one second, new requests arrive before the earlier ones have finished. Each additional request slows the server further and extends response times, until so many requests are being processed at once that the whole server grinds to a halt and crashes.
If you have session IDs or the like in your URLs, bots will come again and again and create a new session every single time. That will cause longer responses and could result in the same situation I described above.
If you need session IDs for your application, I'd recommend that you add a small function that checks whether the request is coming from a bot. If it is, don't create a session, and don't add anything to your query strings that might change on a fresh visit.
Assuming your website is database driven, you could limit some pages to be accessible and visible to only user agents that aren't known bots.
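The bot check suggested above could look roughly like the sketch below. It is written in Python purely for illustration (the site in this thread runs PHP), and the marker list and the function names is_known_bot and should_start_session are hypothetical; build the real list from the user agents in your own logs.

```python
# Hypothetical list of lowercase substrings that identify crawlers.
# Extend it with the user agents that actually show up in your logs.
KNOWN_BOT_MARKERS = ("googlebot", "slurp", "msnbot")


def is_known_bot(user_agent):
    """Return True if the User-Agent header contains a known crawler marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)


def should_start_session(user_agent):
    """Create sessions only for clients that do not look like crawlers."""
    return not is_known_bot(user_agent)
```

User-agent strings can be faked, so this only keeps well-behaved crawlers from piling up sessions; it is not an access control.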
Thanks for the advice. We do have sessions and session IDs, but we do not pass them in URLs, so we don't think this is the problem.
With regard to dynamic pages, the whole navigation area of our site is dynamic and is present on every page. We did have this previously and we had no problems until we launched a new version of the site on a new server...
Any other ideas, please keep them coming. Thanks!
> we had no problems until we launched a new version of the site on a new server
So the basic question --and only you and your staff can answer it-- is, "Why is the new design or server so much slower?"
If no performance benchmarks were specified for the new design, or if they were not tested and verified, then there's some work yet to be done...
Thanks for the response.
Obviously, we would not have launched and changed the server if we had thought our testing was incomplete. Unfortunately, it never occurred to us that bots might overload the server with requests. Our test server, which is a clone, had no issues, but then again it was not open for bots to crawl. I will have to put this in a manual for next time: "Never launch a site without checking how it reacts to bots."
Anyway, we know that there is work to be done, we were kinda looking for some advice on where to start. Any ideas?
Try running "wget --mirror" over your site and see if it's getting caught in a loop anywhere.
Also, check which 'bots' are giving the most problem and if they're not major-league (Googlebot, Slurp, msnbot, ...) just block them using .htaccess.
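To see which user agents generate the most requests, a small log tally can help. The following is a minimal sketch, assuming the access log uses Apache's combined log format, where the User-Agent is the last quoted field on each line; the function names are made up for this example.

```python
import re
from collections import Counter

# Assumption: combined log format, so the User-Agent is the last
# double-quoted field on the line. User agents that themselves contain
# escaped quotes are not handled by this simple pattern.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count requests per User-Agent string."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_user_agents(log_lines, n=10):
    """Return the n busiest user agents as (user_agent, hits) pairs."""
    return count_user_agents(log_lines).most_common(n)
```

Passing an open log file to top_user_agents works because file objects iterate line by line. Anything near the top of the list that is not a major search engine is a candidate for blocking.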
I am working with Scott.
Thank you for the ideas; wget --mirror is a great thing. It turned out that Apache goes into an endless loop on certain URLs. If the URL contains %5C%22, which is the encoding of the characters \", it goes to #*$!. Here are the rewrite rules:
RewriteRule ^(.*)_E_(.*) $1=$2
RewriteRule ^(.*)_A_(.*) $1&$2 [N]
RewriteRule ^(.*)_PHPQ_(.*)$ $1.php?$2 [L]
A sample URL that fails is:
If the URL does not contain those % characters, it works fine. Any ideas? Thanks.
[edited by: jdMorgan at 9:14 pm (utc) on June 12, 2005]
[edit reason] Example.com [/edit]
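For readers unfamiliar with the rules above, here is a rough Python simulation of what they are meant to do on a well-formed URL. It is a sketch under assumptions, not mod_rewrite itself: it ignores escaping, per-directory prefix handling, and query-string splitting, and the sample path in the usage note is constructed for illustration from the script.php?ext=test link quoted later in the thread.

```python
import re


def apply_rules(path, max_passes=50):
    """Approximate the three RewriteRules; max_passes guards against loops."""
    for _ in range(max_passes):
        # RewriteRule ^(.*)_E_(.*) $1=$2   (no flag: fall through to next rule)
        m = re.match(r'^(.*)_E_(.*)', path)
        if m:
            path = m.group(1) + '=' + m.group(2)

        # RewriteRule ^(.*)_A_(.*) $1&$2 [N]   (N: restart from the first rule)
        m = re.match(r'^(.*)_A_(.*)', path)
        if m:
            path = m.group(1) + '&' + m.group(2)
            continue

        # RewriteRule ^(.*)_PHPQ_(.*)$ $1.php?$2 [L]   (L: stop processing)
        m = re.match(r'^(.*)_PHPQ_(.*)$', path)
        if m:
            path = m.group(1) + '.php?' + m.group(2)
        return path
    raise RuntimeError("rules did not settle within max_passes")
```

With this sketch, apply_rules("/folder/script_PHPQ_ext_E_test") gives "/folder/script.php?ext=test". Because each pattern is greedy, one _E_ or _A_ is replaced per pass, starting from the rightmost, and the [N] restart is what lets the rules consume several of them.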
I'm not sure why your rules loop -- it doesn't look like they should. But you should avoid (and eliminate) characters such as '\' and '"' from your URLs, as they are not valid characters in URLs. Yes, sometimes they work, but support is not 100% and they often cause problems such as you are now experiencing. See RFC2396 [faqs.org] - Uniform Resource Identifiers. These are the "laws" for generating and using URIs (URLs are one type of URI).
href=\"/folder/script.php?ext=test\"><img src='/images/my_btn.gif' border=0></a><br>");
The \" sequences are not parsed out by bots; instead, they treat them as part of the URL. Of course, I can get rid of the quotes altogether, but it is strange that Apache can die from a malformed URL.
Then the question is, why do 'bots even see this JS code?
I'd suggest you enclose the code in <script></script> tags to keep the 'bots out of that code. If you need to provide them with the link info, then add a <noscript></noscript> section with content specifically intended for non-JS-enabled clients. And by all means, browse your site with JS disabled to see what 10% of all your visitors will see!
document.writeln("<a class=\"menu\" href=\"/folder/script.php?ext=test\"><img src='/images/my_btn.gif' border=0></a><br>");
<a class="menu" href="/folder/script.php?ext=test"><img src="/images/my_btn.gif" border="0"></a><br>
Comment: Looking at only the above code snippet, I see no reason to use JS in this case. You might consider eliminating the JS if it's not necessary, or doing dynamic page generation server-side instead of client-side.
You could at least remove the backslashes with a careful choice of single and double quotes. Either is acceptable, so it is just a case of getting the combination right:
href='/folder/script.php?ext=test'><img src='/images/my_btn.gif' border='0'></a><br>");
The quotes are not the problem, and I removed them since href works without them as well. The question is why Apache goes into an endless loop when those \" are in the URL and URL rewriting is applied with the rules I mentioned. If the URL uses the regular .php?, & and = instead of _PHPQ_, _A_ and _E_, then Apache gives a 404 Not Found, as expected.
The problem with "invalid characters" is that Apache must "escape" them. So %22 is escaped to %2522 on the first pass, then to %252522 on the second pass, then to %25252522, and so on. If multiple rewrites are performed, the string grows ad infinitum. The reason is that Apache does not expect invalid characters in HTTP/1.1-compliant URLs.
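The growth pattern described above is easy to reproduce outside Apache. The sketch below uses Python's urllib.parse.quote to percent-encode an already encoded string again on every pass; it only illustrates repeated escaping and makes no claim about how mod_rewrite is implemented internally. The helper name escape_repeatedly is made up for this example.

```python
from urllib.parse import quote


def escape_repeatedly(fragment, passes):
    """Percent-encode the fragment once per pass and record every stage."""
    stages = [fragment]
    for _ in range(passes):
        # safe="" makes quote() encode "%" itself as "%25"
        fragment = quote(fragment, safe="")
        stages.append(fragment)
    return stages


# Three passes over an already encoded double quote:
# ['%22', '%2522', '%252522', '%25252522']
print(escape_repeatedly("%22", 3))
```

Each pass adds two characters per existing "%", so a URL that is rewritten and re-escaped in a loop never settles.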
There are three more possible answers:
1) Cloak the page so 'bots don't see the JS.
2) Sniff the browser version server-side and deliver an alternate page to NN4x, eliminating the JS dependency.
3) Get rid of those JS links. :(
Thank you, Jim and all. Your idea about continuous escaping due to rewriting seems reasonable to me. I think that is the answer.