It's my first post here so I hope I'm in the right place.
I have a Fedora Core 3 web server with Apache 2.0.54, PostgreSQL 8.0.3, PHP 5.0.4, etc. Recently, it appears that roaming bots have been overloading Apache, causing the site to hang and eventually crash. After plenty of troubleshooting, changing memory modules, and finally upgrading servers, we've come to the realization that the bots are overloading the system. Here is what our robots.txt file contains:
User-agent: *
Disallow: /cgi-bin
Disallow: /shopping
Disallow: /css
Disallow: /rb
Disallow: /templates
Disallow: /download
Disallow: /temp
Disallow: /images
Disallow: /js
Disallow: /docs
Does anyone have any suggestions for me? I really need the bots to spider the site, but I can't have them overloading and crashing it every time they come through. We currently have over 50,000 static pages, and I would at least like some of them to get indexed.
Thanks in advance for any help.
I've found that if your responses take more than one second, new requests arrive while old ones are still being processed; the server slows further, response times stretch, and eventually there are so many requests being handled at once that the whole server grinds to a halt and crashes.
If you have session IDs or the like in your URLs, bots will come back again and again and create a new session every single time; that means longer responses and can lead to the same situation I described above.
If you need session IDs for your application, I'd recommend adding a small function that checks whether the request is coming from a bot; if it is, don't create a session, and don't add anything to your query strings that would change on a fresh visit.
Assuming your website is database driven, you could also limit some pages so they are accessible and visible only to user agents that aren't known bots.
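A rough sketch of both ideas in PHP; the is_bot() helper and its user-agent list are only placeholders you'd want to tune against your own logs:

<?php
// Hypothetical helper: treat the request as a bot if the
// User-Agent contains any known crawler substring.
function is_bot() {
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
    $bots = array('googlebot', 'slurp', 'msnbot', 'crawler', 'spider');
    foreach ($bots as $bot) {
        if (strpos($agent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// Only start a session for ordinary visitors, never for crawlers;
// the same check can keep crawlers off the heaviest database-driven pages.
if (!is_bot()) {
    session_start();
}
?>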
With regard to dynamic pages: the whole navigation area of our site is dynamic and is present on every page. We had that previously as well, and we had no problems until we launched a new version of the site on a new server...
Any other ideas, please keep them coming. Thanks!
So the basic question --and only you and your staff can answer it-- is, "Why is the new design or server so much slower?"
If no performance benchmarks were specified for the new design, or if they were not tested and verified, then there's some work yet to be done...
Jim
Obviously, we would not have launched and changed servers if we had thought our testing was incomplete. Unfortunately, we never thought we might have problems with bots overloading the server with requests. Our test server, which is a clone, had no issues, but then again, it was not open for bots to crawl. I will have to put this in a manual for next time: "Never launch a site without checking how it reacts to bots."
Anyway, we know there is work to be done; we were kind of looking for some advice on where to start. Any ideas?
Also, check which 'bots' are giving you the most trouble, and if they're not major-league (Googlebot, Slurp, msnbot, ...), just block them using .htaccess.
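Something along these lines in .htaccess would do it; the "BadBot" and "SomeOtherCrawler" strings are only placeholders for whatever user agents show up in your logs:

# Flag example problem bots by User-Agent substring (case-insensitive)
SetEnvIfNoCase User-Agent "BadBot" block_bot
SetEnvIfNoCase User-Agent "SomeOtherCrawler" block_bot
# Refuse requests from anything flagged above
Order Allow,Deny
Allow from all
Deny from env=block_bot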
I am working with Scott.
Thank you for the ideas; wget --mirror is a great tool. It turns out Apache goes into an endless loop on this sort of thing: if a URL contains the %5C%22 characters, which are the encodings for \", it goes to #*$!. Here are the rewrite rules:
RewriteRule ^(.*)_E_(.*) $1=$2
RewriteRule ^(.*)_A_(.*) $1&$2 [N]
RewriteRule ^(.*)_PHPQ_(.*)$ $1.php?$2 [L]
So a sample URL that fails is:
http://www.example.com/%5C%22/script_PHPQ_param1_E_val1_A_param2_E_val2%5C%22
or
http://www.example.com/\"/script_PHPQ_param1_E_val1_A_param2_E_val2\"
If the URL does not contain those % characters, everything works fine. Any ideas? Thanks.
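One stopgap you might try while tracking down the real cause (just a sketch, assuming the rules above live in .htaccess with mod_rewrite already enabled, and that the %5C%22 sequence arrives literally in the request line): refuse such requests before the _E_/_A_/_PHPQ_ rules run, placed above them, so the [N] flag never gets a chance to loop on them.

# Return 403 Forbidden for requests containing \" encoded as %5C%22,
# before any of the rewriting rules above are applied.
RewriteCond %{THE_REQUEST} %5C%22 [NC]
RewriteRule .* - [F]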
[edited by: jdMorgan at 9:14 pm (utc) on June 12, 2005]
[edit reason] Example.com [/edit]
Jim
Thank you for the reply. I just thought Apache was not supposed to die from invalid characters in a URL, but to report a 404 or a 400 Bad Request instead. These characters appear because of JavaScript:
document.writeln("<a class=\"menu\"
href=\"/folder/script.php?ext=test\"><img src='/images/my_btn.gif' border=0></a><br>");
And \" are not parsed out by bots, instead they insert it as part of url. Of course, I can get rid of the quotes at all, but it is strange that Apache can die of malformed url.
I'd suggest you enclose the code in <script></script> tags to keep the 'bots out of that code. If you need to provide them with the link info, then add a <noscript></noscript> section with content specifically intended for non-JS-enabled clients. And by all means, browse your site with JS disabled to see what 10% of all your visitors will see!
<script language="JavaScript" type="text/JavaScript">
document.writeln("<a class=\"menu\" href=\"/folder/script.php?ext=test\"><img src='/images/my_btn.gif' border=0></a><br>");
</script>
<noscript>
<a class="menu" href="/folder/script.php?ext=test"><img src="/images/my_btn.gif" border="0"></a><br>
</noscript>
Comment: Looking at only the above code snippet, I see no reason to use JS in this case. You might consider eliminating the JS if it's not necessary, or doing dynamic page generation server-side instead of client-side.
Jim
There are three more possible answers:
1) Cloak the page so 'bots don't see the JS.
2) Sniff the browser version server-side and deliver an alternate page to NN4x, eliminating the JS dependency.
3) Get rid of those JS links. :(
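A minimal PHP sketch of option 2; the User-Agent test is only a crude illustration (NN4x identifies itself as "Mozilla/4.x" without "MSIE" or "Gecko"), and js_menu.inc.php is a hypothetical include holding the existing script block:

<?php
// Crude Netscape 4.x sniff: UA starts with "Mozilla/4." and is
// neither MSIE nor a Gecko-based browser.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_nn4 = strpos($ua, 'Mozilla/4.') === 0
       && strpos($ua, 'MSIE') === false
       && strpos($ua, 'Gecko') === false;

if ($is_nn4) {
    // Plain HTML link, no JavaScript dependency.
    echo '<a class="menu" href="/folder/script.php?ext=test"><img src="/images/my_btn.gif" border="0"></a><br>';
} else {
    // Everyone else gets the existing JS-generated link.
    include 'js_menu.inc.php';   // hypothetical include with the script block shown above
}
?>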
Jim