
MSN's many cloaked bots. Again.

     

Pfui

11:44 pm on Aug 5, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09  /dir/filename.html
08/05 15:45:20  /dir/filename.html
08/05 15:45:31  /dir/filename.html
08/05 15:45:42  /dir/filename.html
08/05 15:45:53  /dir/filename.html
08/05 15:46:03  /dir/filename.html
08/05 15:46:14  /dir/filename.html
08/05 15:46:25  /dir/filename.html
08/05 15:46:35  /dir/filename.html
08/05 15:46:46  /dir/filename.html
08/05 15:46:57  /dir/filename.html

Same poor file. All hits were 403'd because there was no UA, and also because it was a bare MSN IP and not a bona fide MSN bot.

Mokita

8:40 am on Jun 26, 2011 (gmt 0)

5+ Year Member



@AlexK

Bingbot/MsnBot claim to honour a "Crawl-delay" directive in robots.txt:

[bing.com...]

Did you try that method before/instead of using the sledge-hammer approach?

If you had, your complaints to MS Abuse would probably carry more weight.

... Just a suggestion (as I haven't experienced the abuse you mention but I do have a "Crawl-delay" setting)

<edit> BingBot is crawling our sites heavily - but honouring the Crawl-delay </edit>

AlexK

5:54 pm on Jun 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Mokita:
(Crawl-delay) Did you try that method

OK. For the record:

User-agent: *
Crawl-delay: 90


Now, how on earth you can hear a report on msnbot hitting a site at 12 accesses per second & immediately think "Crawl delay" is, frankly, beyond me. The record on my site is currently 403 hits / sec, and I willingly accept that 12 / sec is, in comparison to 403, mediocre. However, that's rather like saying that being shot 12 times is nothing compared to being shot 403 times. You will still be dead.

If the above paragraph does not impress you, then consider that the parameter for `Crawl-delay' is an integer, and is the "interval in seconds between each request". So, my site asks MSN to wait 90 seconds between each request. They waited one-twentieth of a second. Do you think that reasonable?

I'm beginning to foam at the mouth. Had better stop typing...

PS
I hope that you picked up the point that MSN completely ignores its own public statements & robots.txt directive. It is just PR & blather. Ignore it.

dstiles

9:38 pm on Jun 26, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



AlexK: Thanks for the information.

In my own system the UA is recorded in a short-form log, along with IP, date/time, URL, referer etc. I find the UA invaluable.

Bing sometimes uses a corrupt UA (trailing underscore). Could that have been the source?

Tangor: as far as my experience goes, bingbot is a good bot, with the proviso that several scans are made using IPs that fail rDNS tests; these are banned. Also, the UA is sometimes corrupted (see above), so that is banned too. Otherwise they seem to scan well enough.
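
For anyone wanting to replicate the rDNS test, the logic is roughly as follows (a Python sketch for illustration only, not what my system actually runs):

import socket

# A bot IP passes only if its PTR ends in .search.msn.com AND that
# hostname resolves back to the same IP (forward-confirmed rDNS).
def is_genuine_msnbot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False                          # no rDNS at all
    if not host.endswith('.search.msn.com'):
        return False                          # wrong domain
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False                          # name does not resolve back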

tangor

3:11 am on Jun 28, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



@dstiles and others... My recent post above was a bit tongue-in-cheek; Bing is a good bot... but I also disallow non-resolving rDNS and botched UAs, neither of which has affected my Bing traffic. Just as G has had its share of fake UAs from scrapers, so too has Bing.

As for crawl-delay, I've found that bots which honor any crawl-delay at all seem to routinely honor values of 10 or less; delay times greater than 10 seem problematical. I doubt there is a "cut off" or "ignore", but who knows? Just sharing my experience that 10 or less seems to work 90% of the time. That said, I have sites where I don't have a crawl-delay (most are 2,000 pages or less), as there seems no point.

AlexK

5:14 am on Jun 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles:
Bing sometimes uses a corrupt UA (trailing underscore). Could that have been the source?

I had a look in the logfiles to find out:

207.46.13.98 - - [22/Jun/2011:04:50:45 +0100] "GET /faq.php HTTP/1.1" 200 13386 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" In:- Out:-:-pct. "-"
(13 x 200 OK accesses)
207.46.13.98 - - [22/Jun/2011:04:50:52 +0100] "GET /profile.php?mode=viewprofile&u=3 HTTP/1.1" 403 132 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" In:137 Out:114:83pct. "-"

...so, the answer is `no', standard UA.

Curiously, none of the first 14 accesses--before the block--were accepting compressed pages (thanks, Bing). I double-checked to make sure, and indeed my own browser gets a gzipped page.

dstiles

9:46 pm on Jun 28, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



AlexK - looks like a genuine bot.

I wonder if the problem lies in the high delay factor, as tangor says. Specifically, I wonder if there could be some kind of MS development test thingy that says, "if delay >9 then it's milliseconds" or some such. Or, more likely, you've hit a numerical storage limit that flips round to 0?

AlexK

12:34 am on Jun 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles:
I wonder if the problem lies in the high delay factor

So you reckon that BingBot attempting to hit my site roughly once every 80 milliseconds (1 sec / 12) is *my* fault. With 4 different IPs. And 5,000 attempted accesses across a 24 hour period, all at the same rate. MY FAULT? Do you think that, just possibly, there could be a minor flaw in your logic there? Such as the assumption that scraping a site at greater than once a second may be acceptable?

Well, whether it was my fault or not, each IP got blocked from my site for a week, and reported as an abusive IP to an RBL. If they continue to do so, the IP(s) will go in the site firewall, and if it recurs from enough of their IPs, the entire damn ASN will go in the firewall.

I get very few searches passed through to my site from MS, and they are sucking down bandwidth (or attempting to) like nobody's business.

At the moment, MS sits for me in the same camp as all the spam-scraper bots. If they are comfortable with that, it's fine by me.

tangor

1:29 am on Jun 29, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



AlexK... perhaps the only "fault" (and nobody is saying it is yours) is the integer used for crawl-delay. Might try something much smaller... any relief is better than no relief. That said, banning Bing or MSNbot is always an option (which kind of takes care of Yahoo, too). No sense in having high blood pressure over something like an SE bot.

Mokita

11:34 am on Jun 29, 2011 (gmt 0)

5+ Year Member



AlexK wrote:

I'm beginning to foam at the mouth.


Yes - got your message, loud and clear!

For whatever it's worth, I try to keep emotion out of bot identification/blocking - YMMV.

AlexK

7:52 pm on Jun 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



YMMV == "Your mood may vary"

dstiles

8:14 pm on Jun 29, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



AlexK - I'm sure I didn't say it was your fault!

My point was: there may be a fault in the MS delay feature OR it may be a deliberate MS test mode.

In any case, I doubt they were expecting such a long delay between pages. 1000 pages at one page per 90 seconds takes over a day. If you have fewer than 1000 pages then scan rate is not really a problem. If you have significantly more, the scan could never finish.

The best way to reduce the frequency of the scan is to set the Expires and Cache-Control values to several days, assuming (as I must from your 90s delay) that your content is fairly static. I have some sites I want scanned every day; others are fine at ten or more days.
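
In Apache that can be as little as this (an illustrative .htaccess fragment, assuming mod_expires and mod_headers are enabled; pick values to suit your own content):

# Tell crawlers and caches that HTML pages stay fresh for ten days
ExpiresActive On
ExpiresByType text/html "access plus 10 days"
Header set Cache-Control "max-age=864000, public"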

AlexK

4:40 am on Jun 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For the record, the robots.txt figure quoted was on my `www' site. Looking at it again, the abuse was on the `forums' site, and robots.txt there does *not* contain a `Crawl-Delay' parameter. Neither does www now.

65.52.110.13 [forums.modem-help.co.uk] tried to take 3,232 pages yesterday, again at a max of 12 pages / second. Simply, that is website abuse. The suggestion that such abuse can be prevented by a Crawl-Delay parameter is, to my mind, so ludicrous that it crosses the line to become bizarre. Hence my use of extreme humour in response.

I'm hoping that this is a simple mistake by the Microsoft engineers. The response to my abuse reports (nothing other than an acknowledgement of receipt) suggests that nothing is going to change quickly.

g1smd

8:09 am on Jun 30, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Whatever the issue, this looks like a serious programming error that Microsoft should be made aware of at the highest level.

dstiles

9:57 pm on Jun 30, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



One thing I may have missed here, AlexK - I assume the crawl rates you give are for pages not pages + images?

I have to say I haven't seen anything like this degree of abuse. Typically I get a repeat crawl of one single site (amongst others) of about 120 pages every day (I have expiry set to 24 hours), total rate about 1.5 pages/second (SEs usually default to max of 2 per second). Note that is pages and does not include images etc.

Obviously, if there was no Crawl-delay in the robots.txt on the site with the high scan rate, then there is something very wrong and my conjecture re: a robot test mode above 10 no longer applies.

I assume there is only a single point of access to each page: e.g. forum.example.com/index.htm and not forum2.example.com/index.htm as well, which might look like two or more sites on the same server. Would this be possible on the forum site?

One other thought: I assume no crawl rate was specified for the site in the MSN/Bing Control Panel.

AlexK

11:09 pm on Jun 30, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles:
I assume the crawl rates you give are for pages not pages + images

Correct.

if there was no robots.txt on your site

There is a robots.txt to prevent 404 errors, but it is a default:

User-agent: *
Disallow:

two or more sites on the same server. Would this be possible on the forum site?

Redirects in place to prevent it. Which work.
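
(For illustration only, the canonical-host redirect is of the usual shape -- the hostname below is a placeholder, not the exact rules in place:)

# Force every request onto the one canonical hostname
RewriteEngine On
RewriteCond %{HTTP_HOST} !^forums\.example\.com$ [NC]
RewriteRule ^(.*)$ http://forums.example.com/$1 [R=301,L]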

I assume no crawl rate was specified for the site in the MSN/Bing Control Panel

Do not use their CP.

Pfui

7:22 pm on Jul 1, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Erm... Shouldn't that be this?

User-agent: *
Disallow: /

(Methinks no / means nothing's disallowed a.k.a. everything's allowed.)

Staffa

9:29 pm on Jul 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



User-agent: *
Disallow: /

With "/" as above, nothing is allowed, i.e. everything is disallowed. That has been the content of robots.txt on one of my domains for about 6 months now, and no bot has entered except a few rogue ones, but those are then dealt with differently.

PS: usually when an IP number of an SE misbehaves really badly, I just block that number. It will come again a few more times and give up; next comes a new IP number with good manners again ;o)

AlexK

5:37 am on Jul 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pfui:
Methinks no / means nothing's disallowed a.k.a. everything's allowed

Correct. That's why it's a `default' robots.txt.

As I said, the file is there purely to prevent endless 404s.

I'm not trying to prevent bots visiting (unlike Staffa) although, as my latest site stats for June [forums.modem-help.co.uk] indicate that just 1 page in every 5 accessed from my site is being taken by a human being (17%), that may need reconsidering.

g1smd

8:35 am on Jul 2, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



There are several bots you should block by default.

A few dozen lines of code can knock 30% to 40% off total bandwidth served per month (the error messages when blocked are very small compared to the page they would have pulled).
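
The shape of it is simple enough (a short illustrative .htaccess sample only -- the UAs shown are common offenders, not a recommended or complete list):

# Refuse a handful of known scraper/downloader UAs, plus blank UAs
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|larbin|libwww-perl|Wget) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]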

AlexK

2:53 pm on Jul 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All of the top-25 worst sites are firewalled off; plus, the most that any bot can get on my site is 7 pages before the on-site routines jump in. That is, after all, how my site caught these wretched msnbots in the first place.

wilderness

3:19 am on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



One would be inclined to believe that MSN, even when traveling in disguise, would use a working UA, rather than one that has three sections with trailing double-spaces.

207.46.195.205 - - [02/Jul/2011:21:03:34 -0600] "GET /MyFolder/MyPage.html HTTP/1.1" 403 412 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)"

g1smd

4:15 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Is there a definitive list of IPs that identify as bingbot/MSN/whatever?

I have a chunk of code in .htaccess that blocks any access that identifies as Googlebot but which comes from any non-Google IP address.

I'm thinking about adding the same for other bots.
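
For Googlebot the block is roughly of this shape (simplified here for illustration; 66.249. is the familiar crawler range, but check it against Google's current published ranges before relying on it):

# Forbid anything claiming to be Googlebot from outside Google's crawler IPs
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]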

wilderness

6:07 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Rotsa luck at such a thing for MSN ;)

It wouldn't surprise me in the least if they are still using 131.107. from their "entrance in 2003" to crawl anonymously.

AlexK

9:56 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



g1smd:
Is there a definitive list of IPs that identify as bingbot/MSN/whatever?

I cannot answer for the UA, since that could change at any moment, but here is the comprehensive list of IPs & prefixes:

ASN report for AS8075 [cidr-report.org]

dstiles

10:01 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



No 131's (at least, not in my database) but I've seen a lot of 157's with proper bot UA and improper rDNS - they get blocked.

I've blocked 131.107/16 for a long time. It was always a mess and seemed to be mostly non-MS people using it (eg hosting).

I have 143 msnbot (ie bingbot) IP ranges in my database. There may be other VALID ones but they have not drawn themselves to my attention.

Pfui

10:14 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



If it helps... I okay official MSN bots --

bingbot, msnbot (includes msnbot-media and msnbot-webmaster)

-- only if they come from these Hosts/IPs:

RewriteCond %{REMOTE_HOST} !\.(bing|live|msn)\.com$
RewriteCond %{REMOTE_HOST} !\.phx\.gbl$
RewriteCond %{REMOTE_ADDR} !^65\.54\.
RewriteCond %{REMOTE_ADDR} !^65\.55\.
RewriteCond %{REMOTE_ADDR} !^157\.55\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.

(Note: That's part of a larger section and is in conjunction with white-listing -- anything else from those hosts/IPs is blocked.)
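
(One way those conditions could be completed -- a simplified sketch only, not the full whitelisting section described above -- is to forbid anything that claims to be an MSN/Bing bot yet does not come from the listed hosts/IPs:)

RewriteCond %{HTTP_USER_AGENT} (bingbot|msnbot) [NC]
RewriteCond %{REMOTE_HOST} !\.(bing|live|msn)\.com$
RewriteCond %{REMOTE_HOST} !\.phx\.gbl$
RewriteCond %{REMOTE_ADDR} !^65\.54\.
RewriteCond %{REMOTE_ADDR} !^65\.55\.
RewriteCond %{REMOTE_ADDR} !^157\.55\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteRule .* - [F]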

AlexK

5:57 am on Jul 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got a far simpler (and quicker) attitude: if they behave themselves they are ignored; if they are abusive they are stopped/blocked/reported. Whoever they are.

2 weeks now, and the msn/bingbot onslaught continues unabated. The latest two IPs to be blocked:

msnbot-65-52-110-87.search.msn.com [forums.modem-help.co.uk] : max 8 pages / sec; 3,121 pages total
msnbot-207-46-204-239.search.msn.com [forums.modem-help.co.uk] : max 3 pages / sec; 5 pages total

(3 pages / second is the trip speed on my site)
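
(The trip logic itself is nothing exotic -- roughly the following, given here as a Python sketch for illustration rather than the actual on-site code:)

import time
from collections import defaultdict, deque

WINDOW = 1.0      # seconds
MAX_HITS = 3      # pages allowed inside the window before tripping
_hits = defaultdict(deque)            # ip -> recent page-hit timestamps

def tripped(ip):
    now = time.time()
    q = _hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:  # discard hits outside the window
        q.popleft()
    return len(q) > MAX_HITS          # True -> block / report this IP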

Pfui

2:41 pm on Jul 4, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



"if they behave themselves they are ignored; if they are abusive they are stopped/blocked/reported. Whoever they are."

I doubt you'd find many in this forum that don't agree with that:)

So how about the simplest and quickest route -- a firewall rule?

dstiles

10:39 pm on Jul 4, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I agree with pfui - MANY scraper and other bad bots seem to behave themselves but their activities are insidious and odious.

AlexK - I would seriously get in touch with MS about your problem. There must be a reason for their bot's behaviour.

AlexK

5:34 am on Jul 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles:
I would seriously get in touch with MS about your problem

I've got in touch with MS each & every day that this has happened for the last 2 weeks (auto-report to abuse address). That is the address for these reports. It's not my issue if they ignore it.

Pfui:
So how about the simplest and quickest route -- a firewall rule?

As stated earlier, that is reserved for the most egregious [forums.modem-help.co.uk]. The bot-blocker is the last-line defence to weed out the abusive from the normal browsers.

"if they behave themselves they are ignored; if they are abusive they are stopped/blocked/reported. Whoever they are."
I doubt you'd find many in this forum that don't agree with that:)

How many pay more than lip service?

Latest catch yesterday: crawl-66-249-72-240.googlebot.com [forums.modem-help.co.uk] max 3 pages / sec; 215 pages

Apparently the googlebot is going back to former bad habits. There are also two more msnbots at a max of 4 / sec, all of which pales beside two IPs from Telstra, probably scraping whilst bonded, since they were blocked within 2 seconds of each other, at a combined 291 pages / second. For the record, the worst-ever on my site since I introduced these auto-reports last September was 403 pages / second. I'm proud of my site for even being able to handle that rate of scraping.