
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 42 message thread spans 2 pages.
MSNBot has become a constant Fast-Scraper
7 IPs crawling at max 12 pages / sec - this is out of order
AlexK




msg:4401161
 2:22 pm on Dec 24, 2011 (gmt 0)

My site has an auto-stop, block, report system; most effective for stopping script kiddies operating abusive bots. For a number of weeks now, the MSNBot has been caught in its net. Finally, I've got sick of it, and am also reporting it here:


Date of Abuse        IP                                         Rate
-------------------  -----------------------------------------  -----------------
2011-12-23 09:25:01  65.52.108.146   [forums.modem-help.co.uk]  12 pages / second
2011-12-23 01:05:57  157.55.16.219   [forums.modem-help.co.uk]  11 pages / second
2011-12-23 00:52:13  157.55.18.9     [forums.modem-help.co.uk]  11 pages / second
2011-12-22 18:18:59  207.46.13.212   [forums.modem-help.co.uk]   7 pages / second
2011-12-22 04:02:38  207.46.195.240  [forums.modem-help.co.uk]   7 pages / second
2011-12-22 00:18:28  65.52.110.200   [forums.modem-help.co.uk]   3 pages / second
2011-12-21 04:32:16  65.52.104.26    [forums.modem-help.co.uk]   9 pages / second


The ASN for each IP above is AS8075 [cidr-report.org] (MICROSOFT). Each event above has been auto-reported daily to the relevant abuse email address; naturally, no action. Each link above shows the relevant abusive activity from that IP.

Each IP caught in abusive activity gets banned from my site for a week, with a notice explaining why. At the moment, at the end of that week the MSNBot IP takes up its abusive behaviour all over again, gets stopped by the Stop-Abuse routines, reported & banned.

This abusive behaviour first began on June 20 this year and was reported in this forum [webmasterworld.com]. It continued until July [webmasterworld.com] then, thanks to a WebmasterWorld member with MS contacts, it stopped. All was then quiet until a month or so back, when it started all over again.

For the record, the odd Google IP & Yahoo! IP has occasionally got caught up in this net. However, nothing like the extent of MSNBot IPs, which I think can now be classified as endemic abuse.

 

AlexK




msg:4402400
 2:49 pm on Dec 30, 2011 (gmt 0)

MSN went on the rampage on my site yesterday. Of 10 IPs that tried to abuse one of the MH sites, 8 were SE bots, and 7 of them were MSN bots. Just one of the IPs has already been reported here. That now makes a total of 18 MSN IPs caught committing abuse within just 6 days:

Date of Abuse        IP                                         Rate
-------------------  -----------------------------------------  -----------------
2011-12-30 03:41:40  157.55.17.201   [forums.modem-help.co.uk]   9 pages / second
2011-12-30 01:37:28  65.52.104.90    [forums.modem-help.co.uk]   9 pages / second
2011-12-30 00:24:13  65.52.109.152   [forums.modem-help.co.uk]   8 pages / second
2011-12-29 19:35:35  157.55.16.221   [forums.modem-help.co.uk]   5 pages / second
2011-12-29 18:14:04  65.52.109.194   [forums.modem-help.co.uk]   8 pages / second
2011-12-29 16:06:17  207.46.204.241  [forums.modem-help.co.uk]   9 pages / second
2011-12-29 15:10:30  65.52.108.66    [forums.modem-help.co.uk]   4 pages / second

2011-12-29 07:04:01  77.75.77.17     [forums.modem-help.co.uk]   4 pages / second (seznam)

Pfui




msg:4402442
 6:18 pm on Dec 30, 2011 (gmt 0)

Alex, did you open a trouble ticket with/through Bing Webmaster Tools? I received a non-canned reply within 24 hours.

AlexK




msg:4402558
 7:52 am on Dec 31, 2011 (gmt 0)

Another day, another MSN IP. That is now 19 distinct MSN IPs abusing my site across the last week:

2011-12-30 13:05:58 :: 207.46.13.100 [forums.modem-help.co.uk] :: 6 pages / second

AlexK




msg:4402559
 8:08 am on Dec 31, 2011 (gmt 0)

@Pfui:
An email has been sent on each occasion for each IP to the MSN abuse address. That email contained full details of each access by the IP in the previous 24 hours. If Bing cannot pay attention to its own abuse address, I certainly will not pay any attention to BWT.

AWStats shows 65 different robots hitting my site so far in December (and it is actually more - the bot DB needs updating). 6 distinct IPs abusing the sites yesterday (remember, it used to be up to 50 each day). I have neither the time nor the motivation to follow up on each one. If these entities cannot behave in a decent fashion on my sites, I do not want them there. It is as simple as that.

dstiles




msg:4402721
 9:16 pm on Dec 31, 2011 (gmt 0)

AlexK - I'm assuming here that you mean 65 different miscellaneous bots NOT 65 different genuine MS bots. If the latter then I would say that is very unusual.

65 different bots per month is, in my experience, low. I get at least that many per day on my server across a few dozen sites, almost all of which are blocked either by IP (eg all known server farms and clouds are blocked by IP) or by recognition of UA (distributed bots, nutch, java, urllib etc). Most of those bots are disguised by fake browser UAs. Of bots "known" to Awstats, one site alone shows about 30 this month (December) but these include blocked ones such as urllib, alexa and jakarta.

Blocking by IP and/or UA + headers is (at least hereabouts) standard practice and you cannot run a scrape-free site without it. Even then, you can only hope to catch some of them in retrospect, after they have already eaten a site. There are so many virus-compromised servers and "home" computers around at present that adding new server farms is a continuous task but a necessary one.
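
The IP-plus-UA blocking described above can be sketched roughly as follows. This is a minimal Python illustration, not anyone's production filter: the function name `is_blocked` is hypothetical, the network ranges are placeholder documentation blocks rather than real server farms, and the UA fragments are simply the ones named in the post. It does not attempt the header checks also mentioned.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real server farms.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")
]
# User-agent fragments named in the post above.
BLOCKED_UA_FRAGMENTS = ("nutch", "java", "urllib")


def is_blocked(ip, user_agent):
    """Return True if the request matches a blocked range or UA fragment."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BLOCKED_UA_FRAGMENTS)
```

Substring matching on the UA is crude and only catches bots that announce themselves; bots hiding behind a fake browser UA would have to be caught by IP or by other means.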

tangor




msg:4402726
 10:10 pm on Dec 31, 2011 (gmt 0)

Counting SE IP addresses hitting my sites for the past year, the top 6 SE/IP counts are:

Bing 369
Yahoo 281
Baidu 270
yodao 208
Google 127
mj12 119

3, 4 and 6 only get robots.txt and honor my robots.txt (don't want them). Bing/Yahoo hit at various speeds, sometimes as high as 12pg/s, but rarely ask for more than 100 pages on any ip index visit then might shift to a different ip, asking for a different set of 100 or so pages. Once a week. Google has 12 "regular" ip, then a bunch of others... usually a half-dozen to two dozen pages then gone. Just reporting my experience with these SEs. Either a charmed life, or my sites are getting so "evergreen" that indexing has stabilized...though all new content is found very quickly.

The rest of my "robots" list for 2011 is 2700+ named bots (not just ip addresses)... none of which get more than robots.txt. Those that do not honor robots.txt get a very healthy diet of 403. I like working smart, not hard...only those I allow, and that is five...one of which has recently gone tits up (Teoma).
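
The allow-only-a-few approach described above might look something like this in outline. It is a sketch under assumptions: the post does not list the five allowed bots, so the names in `ALLOWED_BOTS` are invented placeholders, and `status_for` is a hypothetical helper that presumes the bot has already been identified by name.

```python
# Hypothetical allowlist; the post says five bots are allowed but does not
# name them all, so these entries are placeholders.
ALLOWED_BOTS = {"examplebot-a", "examplebot-b"}


def status_for(bot_name, path):
    """HTTP status to return to an identified bot requesting `path`."""
    if path == "/robots.txt":
        return 200  # every bot may read the rules
    if bot_name.lower() in ALLOWED_BOTS:
        return 200  # allowed bots get real content
    return 403      # unlisted bot ignored robots.txt and asked for a page
```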

AlexK




msg:4402730
 11:25 pm on Dec 31, 2011 (gmt 0)

dstiles:
I'm assuming here that you mean 65 different miscellaneous bots

Correct.

65 different bots per month is ... low

Agreed.

Nevertheless, so far for December:
    humans: 324,020 pages
    other: 2,393,102 pages

I make that a ratio of roughly 7:1. It was even higher in November.

AlexK




msg:4402795
 3:54 pm on Jan 1, 2012 (gmt 0)

It seemed time to put this topic to bed.

First, here is a summary of the (what I consider to be) abuse stats:


Date of Abuse        IP                                         Max Rate
-------------------  -----------------------------------------  -----------------
2011-12-21 04:32:16  65.52.104.26    [forums.modem-help.co.uk]   9 pages / second
2011-12-30 01:37:28  65.52.104.90    [forums.modem-help.co.uk]   9 pages / second
2011-12-23 09:25:01  65.52.108.146   [forums.modem-help.co.uk]  12 pages / second
2011-12-29 15:10:30  65.52.108.66    [forums.modem-help.co.uk]   4 pages / second
2011-12-29 04:22:22  65.52.109.152   [forums.modem-help.co.uk]   8 pages / second
2011-12-25 16:12:09  65.52.109.194   [forums.modem-help.co.uk]   8 pages / second
2011-12-31 19:26:33  65.52.109.26    [forums.modem-help.co.uk]   6 pages / second
2011-12-22 00:18:28  65.52.110.200   [forums.modem-help.co.uk]   3 pages / second
2011-12-23 01:05:57  157.55.16.219   [forums.modem-help.co.uk]  11 pages / second
2011-12-29 19:35:35  157.55.16.221   [forums.modem-help.co.uk]   5 pages / second
2011-12-30 03:41:40  157.55.17.201   [forums.modem-help.co.uk]   9 pages / second
2011-12-23 00:52:13  157.55.18.9     [forums.modem-help.co.uk]  11 pages / second
2011-12-25 15:35:55  157.55.38.162   [forums.modem-help.co.uk]   4 pages / second
2011-12-30 13:05:58  207.46.13.100   [forums.modem-help.co.uk]   6 pages / second
2011-12-26 05:47:33  207.46.13.144   [forums.modem-help.co.uk]  12 pages / second
2011-12-22 18:18:59  207.46.13.212   [forums.modem-help.co.uk]   7 pages / second
2011-12-22 04:02:38  207.46.195.240  [forums.modem-help.co.uk]   7 pages / second
2011-12-29 16:06:17  207.46.204.241  [forums.modem-help.co.uk]   9 pages / second

18 IPs across 10 days
start: 2011-12-21 04:32:16
end:   2011-12-31 19:26:33


(above already reported + 1 extra IP yesterday; duplicate IPs removed)

The above are gathered by an extension to a PHP routine originated via WebmasterWorld [webmasterworld.com] (a link to the block-algorithms used is in the first post on that page). In brief, a bot needs to take more than 14 pages in a space of 7 seconds; if so, they are blocked from that point on until they stop.
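
The rule just described (more than 14 pages within 7 seconds triggers a block that lasts until the bot stops) can be sketched as a sliding-window counter. This is a minimal Python illustration, not the PHP routine itself: the class name `AbuseDetector`, the method `allow`, and the choice to count refused requests towards the window are all assumptions made for the example.

```python
from collections import defaultdict, deque

MAX_PAGES = 14      # a bot must take MORE than this many pages...
WINDOW_SECONDS = 7  # ...within this many seconds to be blocked


class AbuseDetector:
    """Sliding-window request counter keyed by IP (hypothetical helper)."""

    def __init__(self, max_pages=MAX_PAGES, window=WINDOW_SECONDS):
        self.max_pages = max_pages
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request at time `now` (seconds); return False if blocked."""
        recent = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        # Refused requests are recorded too, so the block lasts until the
        # client actually slows down.
        recent.append(now)
        return len(recent) <= self.max_pages
```

Under these assumptions, a client requesting 12 pages / second has its 15th request refused just over a second after its first, and stays refused until its request rate drops back under the threshold.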

AlexK




msg:4402804
 5:00 pm on Jan 1, 2012 (gmt 0)

I view the activity detailed above as abuse on my site, and auto-report it as such to each originator's abuse email address (which they almost all ignore). The specific abuse from BingBots/MSNBots reported here falls a long way short of the very worst abuse that has been attempted on my site. Nevertheless, I expected a storm of fellow-feeling protest from other WebMasters at my experience. How often does reality depart so radically from one's dreams!

I think that it is fair to sum up the general mood as "resigned acceptance" to this situation. Indeed, tangor said "Works for me ... It's all okay with me". Seeing tangor's stats, his attitude is perfectly reasonable:

tangor:
for the past year, the top 6 SE/IP counts are:

Bing 369
Yahoo 281
Baidu 270
yodao 208
Google 127
mj12 119

Here is my comparison:


Human traffic 2011:  4,604,606 pages    891.12 GB
Other traffic 2011: 22,338,450 pages  2,196.17 GB

Bandwidth:      -------- Taken --------  Provided
--------------  ---------  ------------  --------
Googlebot       7,176,072     184.48 GB   985,771
BaiDuSpider     1,483,462   1,841.91 GB       466
(unknown)       1,118,229      19.50 GB     5,972
MSNBot          1,032,889      97.31 GB    23,865 (+ 1,894 MS-Live)
Yahoo Slurp     1,001,859       7.61 GB    27,997
Google AdSense    969,377       9.21 GB
MJ12bot           305,407       2.83 GB
Yandex bot        167,468       1.18 GB     8,851
(150 different robots)

My sites have a full complement of Content Negotiation in place (provide 304s, etc) to reduce bandwidth. It does, however, make me wonder just what value any bot provides... They are taking 83% of the pages, 71% of the bandwidth, are daily abusing the site, and giving back zilch. What a strange job being a WebMaster is.
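
For anyone wanting to check the 83% and 71% figures, they follow directly from the 2011 human/other totals quoted above. A short Python sketch of the arithmetic (the variable and function names are just illustrative):

```python
# 2011 totals as quoted in the post above.
human_pages, other_pages = 4_604_606, 22_338_450
human_gb, other_gb = 891.12, 2_196.17


def share(part, rest):
    """Percentage of the total (part + rest) accounted for by `part`."""
    return 100 * part / (part + rest)


pages_pct = share(other_pages, human_pages)
bandwidth_pct = share(other_gb, human_gb)
print(f"pages: {pages_pct:.0f}%  bandwidth: {bandwidth_pct:.0f}%")
# prints: pages: 83%  bandwidth: 71%
```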

Samizdata




msg:4402827
 7:05 pm on Jan 1, 2012 (gmt 0)

What a strange job being a WebMaster is

Being Microsoft's botmaster must be considerably stranger.

I don't think I would trade places.

...

Staffa




msg:4402833
 7:53 pm on Jan 1, 2012 (gmt 0)

Nevertheless, I expected a storm of fellow-feeling protest from other WebMasters at my experience.

We don't talk that much, we act.

dstiles




msg:4402835
 8:28 pm on Jan 1, 2012 (gmt 0)

AlexK - if, by "other traffic" you mean "they've taken the page content" then it's time to stop them. If "150 different robots" are actually unwanted or unknown bots then block them. Most bots are of no benefit whatsoever.

The chances are, if you are getting that much non-human traffic, that there is a lot of scraping going on. Reasons for this vary, but scraping has become a major "industry" since baddies realised a while ago that, by putting scraped content onto a copy of a site, google will push the copy above the original scraped site.

Bear in mind also that a LOT of non-human traffic (in my experience) is exploitable-site testing or virus-implantation attempts. In the past month such attempts have comprised about 70% of my server's security log entries. They didn't get anything, but it's still depressing looking through the logs.

On the subject of MS bot IPs - 18 in ten days is not a problem. Bots, within their milieu, are distributed and share the load so 50 or 100 bot IPs, although being unusual, would not constitute a problem PROVIDED they acted rationally together - ie took a reasonable amount of content within a reasonable time in total. You seem to be getting a very high page rate count from MS (if, indeed, ALL IPs with the bingbot UA are really bing bot IPs - see my earlier comment).
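
One common way to test whether an IP presenting the bingbot UA is genuine is a forward-confirmed reverse DNS lookup. The following Python sketch assumes that genuine crawler hostnames end in `.search.msn.com`, as Bing's own verification guidance describes; the function name `is_genuine_msnbot` and its injectable `reverse`/`forward` resolver parameters are hypothetical, added so the logic can be exercised without live DNS.

```python
import socket

# Hostname suffix assumed for genuine Bingbot/MSNBot crawler IPs.
CRAWLER_SUFFIX = ".search.msn.com"


def is_genuine_msnbot(ip, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed Bing/MSN crawler IP."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        if not host.lower().endswith(CRAWLER_SUFFIX):
            return False
        # The hostname must resolve back to the same IP; otherwise the
        # reverse record could have been set by whoever controls the IP.
        return ip in forward(host)
    except OSError:
        # Covers failed lookups (socket.herror / socket.gaierror).
        return False
```

DNS lookups are slow, so in practice a check like this would normally be cached per IP rather than run on every request.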

Your percentage of googlebot/bingbot is way off compared with my own tally. On a single site I have about 40% more hits from bingbot than from googlebot. Actual access rates are generally within reasonable limits (ie less than 2 pages/second) for all permitted bots, even for shopwiki, which is persistent.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved