|Hits from Microsoft are (apparently) stress testing my website/server|
I have noticed lately that my server slows down dramatically during certain parts of the day, so I looked into the possibility of a DoS attack or particularly heavy usage from some quarters.
What I learned was rather astonishing.
On one of my websites on the server, I track script processing time for each page load. Using this value, I could see which IPs were hitting the site most when the processing times were the greatest.
A great number of the page loads in this review were coming from 207.46.*.* -- Microsoft. And a lot of these page loads weren't coming from their search spider, but rather from apparently ordinary browser configurations. This seems to be consistent with reports of Microsoft testing websites in various ways.
What I then proceeded to do was compare the average script processing time for page loads coming from 207.46.*.* to page loads coming from anywhere else.
For the last two months, the average proc time was:
MS: 2.299 secs.
Non-MS: 1.143 secs.
For the last month:
MS: 2.907 secs.
Non-MS: 1.46 secs.
For the last week, approximately:
MS: 6.078 secs.
Non-MS: 2.236 secs.
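For anyone wanting to run the same comparison, here is a minimal sketch in Python. It assumes you already have one (IP, processing seconds) pair per page load; the function name and the sample rows are made up for illustration and are not the original poster's script or data.

```python
from statistics import mean

def average_by_group(records, prefix="207.46."):
    """Return (avg_inside, avg_outside) processing time in seconds.

    records: iterable of (ip_string, seconds) pairs, one per page load.
    prefix:  textual IP prefix that defines the "inside" group.
    """
    records = list(records)
    inside = [secs for ip, secs in records if ip.startswith(prefix)]
    outside = [secs for ip, secs in records if not ip.startswith(prefix)]
    # mean() raises on an empty list, so report None for an empty group.
    return (mean(inside) if inside else None,
            mean(outside) if outside else None)

# Made-up sample rows, NOT the figures quoted in this thread.
sample = [
    ("207.46.13.5", 2.0),
    ("207.46.200.1", 4.0),
    ("198.51.100.7", 1.0),
    ("203.0.113.9", 2.0),
]
ms_avg, other_avg = average_by_group(sample)
print(f"MS: {ms_avg:.3f} secs, Non-MS: {other_avg:.3f} secs")
```

Running it separately over the last two months, the last month and the last week of log rows would give the three pairs of figures in the same form as above.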
I really couldn't believe my eyes at these results.
Does anyone have ideas for strategies to deal with this beyond simply blocking Microsoft's IP range?
I also wanted to add that over the past month, less than 3% of search engine referrals to the site I tested came from Bing. I could live without those referrals. 92.5% came from Google alone.
Here are the results from my having blocked Microsoft.
The week before blocking, the average script processing time was 1.954 seconds.
The week after blocking, the number went down to 0.753 seconds.
A 61% decline in average script processing time!
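The percentage can be checked from the two weekly averages quoted above. This is just the arithmetic, with a hypothetical helper name:

```python
def percent_decline(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100

# Weekly averages quoted above, in seconds.
decline = percent_decline(1.954, 0.753)
print(f"{decline:.1f}% decline")
```

That works out to roughly 61.5%, in line with the 61% figure.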
Can Microsoft explain what they are doing to our sites?
Here's the complete blocking code I used in my .htaccess file:
deny from 207.46.
deny from 126.96.36.199/14
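Before deploying a block like this, it can help to check which addresses a rule actually covers. The sketch below uses Python's standard `ipaddress` module. It relies on the fact that a partial address such as `207.46.` in a `deny from` line matches the same addresses as 207.46.0.0/16; the second range is left out because its notation above is unclear. The function name and the first test address are illustrative only.

```python
import ipaddress

def is_blocked(address, networks):
    """Return True if `address` falls inside any of the CIDR `networks`."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(net) for net in networks)

# "deny from 207.46." written as an explicit CIDR range.
blocked = ["207.46.0.0/16"]

print(is_blocked("207.46.13.5", blocked))    # an address inside the range
print(is_blocked("65.52.108.165", blocked))  # the msnbot host reported later in this thread
```

The second address is not covered by the 207.46. rule, which is why a separate line is needed for the other Microsoft block.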
I run a script on my weblogs. It helps me keep an eye on bots etc. Looking through the records, I first spotted this stealth bot visiting in Jan. During July, it took 20 times the bandwidth it took back then. It certainly eats enough copies of the robots.txt file to be a bot. As you said though, it always shows the UA of a browser rather than a bot.
It's lower profile, but there seems to be another block of theirs with similar activity on it.
Thanks for following up!
Can you tell me the other range Microsoft is using?
- Microsoft employees have free access to MSNBot IPs
- Microsoft have been hacked. Or something.
|it always shows the UA of a browser rather than a bot |
An `MSN-Bot' tripped the fast-scraper block [webmasterworld.com] on my forums (I'm the maintainer of that script). In order to do that, it took more than 14 pages in the space of 7 seconds - definitely a bot. Here are the stats on the (single) record:
Host lookup: msnbot-65-52-108-165.search.msn.com (checks out)
Timing: 2010-08-02 02:09:33 (2 pages)
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Crazy Browser 1.0.5)
...which is NOT a bot UA.
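The "checks out" step above is a reverse lookup followed by a forward lookup of the returned hostname. Here is a minimal sketch of that check in Python, assuming the `.search.msn.com` suffix seen in the record above is the one to accept. The function name and the injectable `reverse`/`forward` parameters are hypothetical conveniences so the logic can be exercised without network access.

```python
import socket

def verify_crawler(ip, allowed_suffix=".search.msn.com",
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end with the expected suffix.
    3. Forward-resolve that hostname and require the original IP
       to be among its addresses.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:          # socket.herror / socket.gaierror
        return False
    if not hostname.endswith(allowed_suffix):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses

# Example call (performs live DNS lookups):
# verify_crawler("65.52.108.165")
```

Note that this only confirms who owns the address. As the record above shows, a host that passes the check can still send a browser UA.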
14 pages in 7 seconds! It hasn't been that greedy on my site yet, though it has done 5 in 20 seconds. The UA details almost always start
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2
Occasionally NT 5.1 and a couple of times as MSIE 6.0
Other details vary.
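The request rates mentioned above (more than 14 pages in 7 seconds, or 5 in 20 seconds) can be tested with a sliding window per IP. The following Python sketch illustrates the general idea only; it is not the fast-scraper script referred to above, and the class name and defaults are assumptions.

```python
from collections import deque

class FastScraperDetector:
    """Flag an IP that requests more than `max_pages` within `window_secs`."""

    def __init__(self, max_pages=14, window_secs=7):
        self.max_pages = max_pages
        self.window_secs = window_secs
        self.hits = {}  # ip -> deque of recent request timestamps

    def record(self, ip, timestamp):
        """Record one page request; return True if the IP is over the limit."""
        recent = self.hits.setdefault(ip, deque())
        recent.append(timestamp)
        # Drop requests that have fallen out of the time window.
        while recent and timestamp - recent[0] > self.window_secs:
            recent.popleft()
        return len(recent) > self.max_pages

# 15 requests half a second apart: only the 15th trips the limit.
detector = FastScraperDetector()
results = [detector.record("65.52.108.165", i * 0.5) for i in range(15)]
print(results[-1], any(results[:-1]))
```

A visitor taking 5 pages in 20 seconds stays well under these defaults, so lower thresholds would be needed to catch that pattern.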
If these are claiming to be browsers rather than bots, they have weird browsing habits. The log analysis script for my site is connected to the navigation database that controls the arrangement of links across the top of pages and down the left etc. During July, this stealth bot (or whatever it is) took enough pages to grab every page on the site at least twice. Yet it never took two pages in succession that were linked together directly by those navigation links. I haven't looked at figures for previous months.
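The navigation check described above can be sketched as follows in Python: given the site's navigation links and one visitor's requests in order, count how many consecutive requests follow a link. The data structures and names here are assumptions for illustration, not the poster's log analysis script.

```python
def linked_transitions(requests, nav_links):
    """Count consecutive request pairs that follow a navigation link.

    requests:  list of page paths in the order one visitor requested them.
    nav_links: dict mapping a page path to the set of paths it links to.
    Returns (followed, total_pairs).
    """
    pairs = list(zip(requests, requests[1:]))
    followed = sum(1 for src, dst in pairs if dst in nav_links.get(src, ()))
    return followed, len(pairs)

# Hypothetical navigation structure and two request sequences.
nav = {"/": {"/about", "/products"}, "/products": {"/products/a"}}
human_like = ["/", "/products", "/products/a"]
bot_like = ["/", "/products/a", "/about", "/products"]

print(linked_transitions(human_like, nav))
print(linked_transitions(bot_like, nav))
```

A visitor that requests every page yet scores zero followed links, as described above, is very unlikely to be a person clicking through the site.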
Another odd thing: despite cache settings in the file headers, this visitor seems to take the stylesheet and JavaScript with almost every page, even when the requests are just a few seconds apart. It never seems to touch the graphics. It never even triggers a 304 to indicate the graphics haven't changed. On the other hand, if it did start taking the graphics as well, this could consume serious bandwidth.
In the site's robots.txt file, bots are banned from all subdirectories. So if this is a bot taking JavaScript etc., it's defying the robots.txt file.
Looking at the site as a whole, msnbot takes similar bandwidth to Googlebot. Adding in the figures for this visitor almost doubles that.
The other traffic I referred to is on the block AlexK mentioned.
In my case, it is between 65.52.104. and 65.52.108.
In this block it uses much less bandwidth, but is a mixture of msnbot requests and whatever lies behind these varying UAs. The requests are all but identical to what I described above. When claiming to be a visitor, again it takes the JavaScript and stylesheet with every single request, but never touches the graphics.
The Bing/MSN bot has been fine on all of my sites.
ByronM, did you conduct a similar test to see what's going on under the hood of your site? I would imagine that low-traffic sites or sites that don't use heavy scripts wouldn't notice the problem. At any rate, it's a good idea to monitor what these bots are doing to your site, even if you don't notice an overall performance problem.
They are definitely misbehaving. Ignoring robots.txt is apparently just one of the bad things they're doing.
I added a new block this morning: 188.8.131.52/16
This MS address range has been reported as attempting to access phpmyadmin among other things.
I have an invisible linked page that serves as a bot trap, disallowed in robots.txt. I cleaned up my .htaccess file today, and so far in one day, msnbot has visited that disallowed page about 10 times.
So hey, if you're having problems getting some pages indexed by Bing, just disallow them in the robots.txt file and they'll spend the day hammering them. Sigh.
Yeah did I say 10 times? Make that about 100 times now.
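To count this kind of violation from a log, Python's standard `urllib.robotparser` can apply the robots.txt rules to each requested path. A minimal sketch, assuming a trap directory named `/trap/` (a made-up path) and a list of paths already extracted from the log for one bot:

```python
from urllib.robotparser import RobotFileParser

def disallowed_hits(robots_lines, user_agent, requested_paths):
    """Return the requested paths that robots.txt forbids for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from a list of lines, no network
    return [p for p in requested_paths
            if not parser.can_fetch(user_agent, p)]

# Hypothetical robots.txt and requests; "/trap/" is an assumed path.
robots = ["User-agent: *", "Disallow: /trap/"]
paths = ["/index.html", "/trap/page.html", "/trap/page.html"]

hits = disallowed_hits(robots, "msnbot", paths)
print(len(hits), "requests for disallowed pages")
```

Any non-zero count for a crawler that claims to honour robots.txt is worth a closer look at the raw log entries.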