Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 152 message thread spans 6 pages: < < 152 ( 1 2 3 [4] 5 6 > >     
MSN's many cloaked bots. Again.
Pfui · msg:4182832 · 11:44 pm on Aug 5, 2010 (gmt 0)

Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
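For anyone wanting the same policy, a minimal mod_rewrite sketch (an assumption: Apache with mod_rewrite enabled; these are not Pfui's actual rules) that 403s any request arriving with a blank User-Agent:

```apache
# Deny requests that send no User-Agent header at all.
# The empty match also covers a UA logged as just "-".
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
```

The `[F]` flag returns 403 Forbidden, which is the small-response behaviour seen in the log excerpt above.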

 

Mokita · msg:4331164 · 8:40 am on Jun 26, 2011 (gmt 0)

@AlexK

Bingbot/MsnBot claim to honour a "Crawl-delay" directive in robots.txt:

[bing.com...]

Did you try that method before/instead of using the sledge-hammer approach?

If you had, your complaints to MS Abuse would probably carry more weight.

... Just a suggestion (as I haven't experienced the abuse you mention but I do have a "Crawl-delay" setting)

<edit> BingBot is crawling our sites heavily - but honouring the Crawl-delay </edit>

AlexK · msg:4331246 · 5:54 pm on Jun 26, 2011 (gmt 0)

Mokita:
(Crawl-delay) Did you try that method

OK. For the record:

User-agent: *
Crawl-delay: 90


Now, how on earth you can hear a report on msnbot hitting a site at 12 accesses per second & immediately think "Crawl delay" is, frankly, beyond me. The record on my site is currently 403 hits / sec, and I willingly accept that 12 / sec is, in comparison to 403, mediocre. However, that's rather like saying that being shot 12 times is nothing compared to being shot 403 times. You will still be dead.

If the above paragraph does not impress you, then consider that the parameter for `Crawl-delay' is an integer, and is the "interval in seconds between each request". So, my site asks MSN to wait 90 seconds between each request. They waited one-twentieth of a second. Do you think that reasonable?
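To see how far a crawler strays from a Crawl-delay, the inter-request intervals per IP can be pulled straight out of an access log. A quick illustrative sketch (the sample log lines and field layout are assumptions based on the combined-log excerpts in this thread):

```python
import re
from datetime import datetime
from collections import defaultdict

# Illustrative combined-log lines; real entries come from the access log.
LOG = """\
207.46.13.98 - - [22/Jun/2011:04:50:45 +0100] "GET /faq.php HTTP/1.1" 200 13386
207.46.13.98 - - [22/Jun/2011:04:50:45 +0100] "GET /a.php HTTP/1.1" 200 100
207.46.13.98 - - [22/Jun/2011:04:50:52 +0100] "GET /b.php HTTP/1.1" 200 100
"""

# Capture the client IP and the bracketed timestamp.
TS = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def intervals_by_ip(lines):
    """Return {ip: [seconds between successive requests]}."""
    seen = defaultdict(list)
    for line in lines:
        m = TS.match(line)
        if not m:
            continue
        ip, stamp = m.groups()
        seen[ip].append(datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z'))
    return {ip: [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
            for ip, ts in seen.items()}

print(intervals_by_ip(LOG.splitlines())['207.46.13.98'])
# [0.0, 7.0] -- both far below a requested Crawl-delay of 90
```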

I'm beginning to foam at the mouth. Had better stop typing...

PS
I hope that you picked up the point that MSN completely ignores its own public statements & robots.txt directives. It is just PR & blather. Ignore it.

dstiles · msg:4331293 · 9:38 pm on Jun 26, 2011 (gmt 0)

AlexK: Thanks for the information.

In my own system the UA is recorded in a short-form log, along with IP, date/time, URL, referer, etc. I find the UA invaluable.

Bing sometimes uses a corrupt UA (trailing underscore). Could that have been the source?

Tangor: as far as my experience goes, bingbot is a good bot with the proviso that several scans are made using IPs that fail rDNS tests: these are banned. Also the UA is sometimes corrupted (see above) so is also banned. Otherwise they seem to scan well enough.
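The rDNS test dstiles mentions is the usual forward-confirmed reverse DNS check. A hypothetical Python helper sketching it (the suffix list is an assumption; Microsoft's documented crawler hostnames in this thread all end in `.search.msn.com`):

```python
import socket

# Hostname suffixes accepted as genuine msnbot/bingbot (assumption:
# the list may be incomplete; adjust to taste).
MSN_SUFFIXES = ('.search.msn.com',)

def is_msn_hostname(host):
    """Pure check: does an rDNS name end in a known msnbot suffix?"""
    return host.rstrip('.').lower().endswith(MSN_SUFFIXES)

def verify_bot_ip(ip, suffixes=MSN_SUFFIXES):
    """Forward-confirmed rDNS: look up the IP's hostname, check the
    suffix, then resolve the name forward and require the original
    IP to appear in the results. Fails closed on any DNS error."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.lower().endswith(suffixes):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

A spoofed hostname like `msnbot.search.msn.com.evil.example` fails the suffix check, and a forged PTR record fails the forward confirmation.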

tangor · msg:4331895 · 3:11 am on Jun 28, 2011 (gmt 0)

@dstiles and others... My recent post above was a bit tongue-in-cheek; Bing is a good bot... but I also disallow non-resolving rDNS and botched UAs, neither of which has affected my Bing traffic. Just as G has had its share of fake UAs from scrapers, so too has Bing.

As for crawl-delay, I've found that bots which honor any crawl-delay at all seem to routinely honor values of 10 or less... delays greater than 10 seem problematic. I doubt there is a hard "cut-off" or "ignore" threshold, but who knows? Just sharing my experience that 10 or less seems to work 90% of the time. That said, I have sites where I don't set a crawl-delay (most are 2,000 pages or less) as there seems no point.

AlexK · msg:4331916 · 5:14 am on Jun 28, 2011 (gmt 0)

dstiles:
Bing sometimes uses a corrupt UA (trailing underscore). Could that have been the source?

I had a look in the logfiles to find out:

207.46.13.98 - - [22/Jun/2011:04:50:45 +0100] "GET /faq.php HTTP/1.1" 200 13386 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" In:- Out:-:-pct. "-"
(13 x 200 OK accesses)
207.46.13.98 - - [22/Jun/2011:04:50:52 +0100] "GET /profile.php?mode=viewprofile&u=3 HTTP/1.1" 403 132 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" In:137 Out:114:83pct. "-"

...so, the answer is `no', standard UA.

Curiously, none of the first 14 accesses--before the block--were accepting compressed pages (thanks, Bing). I double-checked to make sure, and indeed my own browser gets a gzipped page.

dstiles · msg:4332257 · 9:46 pm on Jun 28, 2011 (gmt 0)

AlexK - looks like a genuine bot.

I wonder if the problem lies in the high delay factor, as tangor says. Specifically, I wonder if there could be some kind of MS development test thingy that says, "if delay >9 then it's milliseconds" or some such. Or, more likely, you've hit a numerical storage limit that flips round to 0?

AlexK · msg:4332322 · 12:34 am on Jun 29, 2011 (gmt 0)

dstiles:
I wonder if the problem lies in the high delay factor

So you reckon that BingBot attempting to hit my site once every eight milliseconds (1 sec / 12) is *my* fault. With 4 different IPs. And 5,000 attempted accesses across a 24 hour period, all at the same rate. MY FAULT? Do you think that, just possibly, there could be a minor flaw in your logic there? Such as the assumption that scraping a site at greater than once a second may be acceptable?

Well, whether it was my fault or not, each IP got blocked from my site for a week, and reported as an abusive IP to an RBL. If they continue to do so, the IP(s) will go in the site firewall, and if it re-occurs from enough of their IPs the entire damn ASN will go in the firewall.

I get very few searches passed through to my site from MS, and they are sucking down bandwidth (or attempting to) like nobody's business.

At the moment, MS sits for me in the same camp as all the spam-scraper bots. If they are comfortable with that, it's fine by me.

tangor · msg:4332332 · 1:29 am on Jun 29, 2011 (gmt 0)

AlexK... perhaps the only "fault" (which no one is alleging) is too large an integer in crawl-delay. Might try something much smaller... any relief is better than no relief. That said, banning Bing or MSNbot is always an option (which kind of takes care of Yahoo, too). No sense in having high blood pressure over something like an SE bot.

Mokita · msg:4332477 · 11:34 am on Jun 29, 2011 (gmt 0)

AlexK wrote:

I'm beginning to foam at the mouth.


Yes - got your message, loud and clear!

For whatever it's worth, I try to keep emotion out of bot identification/blocking - YMMV.

AlexK · msg:4332690 · 7:52 pm on Jun 29, 2011 (gmt 0)

YMMV == "Your mood may vary"

dstiles · msg:4332702 · 8:14 pm on Jun 29, 2011 (gmt 0)

AlexK - I'm sure I didn't say it was your fault!

My point was: there may be a fault in the MS delay feature OR it may be a deliberate MS test mode.

In any case, I doubt they were expecting such a long delay between pages. 1000 pages at one page per 90 seconds takes over a day. If you have fewer than 1000 pages then scan rate is not really a problem. If you have significantly more the scan could never finish.

The best way to reduce the frequency of scans is to set the Expires and Cache-Control values to several days, assuming (as I must, from your 90s delay) that your content is fairly static. I have some sites I want scanned every day; others are fine at ten or more days.
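As a sketch of that suggestion (assuming Apache with mod_expires and mod_headers loaded; the ten-day figure is illustrative only):

```apache
# Tell well-behaved caches -- and crawlers that honour freshness
# headers -- that HTML stays fresh for ten days.
ExpiresActive On
ExpiresByType text/html "access plus 10 days"

# 864000 seconds = 10 days, matching the Expires setting above.
Header set Cache-Control "max-age=864000"
```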

AlexK · msg:4332882 · 4:40 am on Jun 30, 2011 (gmt 0)

For the record, the robots.txt figure quoted was on my `www' site. Looking at it again, the abuse was on the `forums' site, and robots.txt there does *not* contain a `Crawl-Delay' parameter. Neither does www now.

65.52.110.13 [forums.modem-help.co.uk] tried to take 3,232 pages at another max of 12 pages / second yesterday. Simply, that is website abuse. The suggestion that such abuse can be prevented by a Crawl-Delay parameter is, to my mind, so ludicrous that it crosses the line to become bizarre. Hence my use of extreme humour in response.

I'm hoping that this is a simple mistake by the Microsoft engineers. The response to my abuse reports (nothing other than an acknowledgement of receipt) suggests that nothing is going to change quickly.

g1smd · msg:4332936 · 8:09 am on Jun 30, 2011 (gmt 0)

Whatever the issue, this looks like a serious programming error that Microsoft should be made aware of at the highest level.

dstiles · msg:4333315 · 9:57 pm on Jun 30, 2011 (gmt 0)

One thing I may have missed here, AlexK - I assume the crawl rates you give are for pages not pages + images?

I have to say I haven't seen anything like this degree of abuse. Typically I get a repeat crawl of one single site (amongst others) of about 120 pages every day (I have expiry set to 24 hours), total rate about 1.5 pages/second (SEs usually default to max of 2 per second). Note that is pages and does not include images etc.

Obviously, if there was no robots.txt on your site with the high scan rate then there is something very wrong and my conjecture re: robot test mode above 10 no longer applies.

I assume there is only a single point access to each page: eg forum.example.com/index.htm and not forum2.example.com/index.htm as well, which might look like two or more sites on the same server. Would this be possible on the forum site?

One other thought: I assume no crawl rate was specified for the site in the MSN/Bing Control Panel.

AlexK · msg:4333340 · 11:09 pm on Jun 30, 2011 (gmt 0)

dstiles:
I assume the crawl rates you give are for pages not pages + images

Correct.

if there was no robots.txt on your site

There is a robots.txt to prevent 404 errors, but it is a default:

User-agent: *
Disallow:

two or more sites on the same server. Would this be possible on the forum site?

Redirects in place to prevent it. Which work.

I assume no crawl rate was specified for the site in the MSN/Bing Control Panel

Do not use their CP.

Pfui · msg:4333800 · 7:22 pm on Jul 1, 2011 (gmt 0)

Erm... Shouldn't that be this?

User-agent: *
Disallow: /

(Methinks no / means nothing's disallowed a.k.a. everything's allowed.)
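Pfui's reading is correct, and it's easy to verify with Python's stdlib parser (a quick illustrative check, not from the thread):

```python
from urllib import robotparser

def allows_root(robots_txt):
    """Parse a robots.txt body and ask whether '*' may fetch '/'."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch('*', '/')

# Empty Disallow: everything is allowed.
print(allows_root("User-agent: *\nDisallow:"))     # True
# "Disallow: /": everything is blocked.
print(allows_root("User-agent: *\nDisallow: /"))   # False
```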

Staffa · msg:4333866 · 9:29 pm on Jul 1, 2011 (gmt 0)

User-agent: *
Disallow: /

With "/" as above, nothing is allowed. This has been the content of robots.txt on one of my domains for about 6 months, and no bot has entered except a few rogue ones, but those are then dealt with differently.

PS : usually when an IP number of an SE really badly misbehaves then I just block that number. It will come again a few more times and give up, next comes a new IP number with good manners again ;o)

AlexK · msg:4334032 · 5:37 am on Jul 2, 2011 (gmt 0)

Pfui:
Methinks no / means nothing's disallowed a.k.a. everything's allowed

Correct. That's why it's a `default' robots.txt.

As I said, the file is there purely to prevent endless 404s.

I'm not trying to prevent bots visiting (unlike Staffa) although, as my latest site stats for June [forums.modem-help.co.uk] indicate that only about 1 page in every 6 accessed from my site is being taken by a human being (17%), that may need reconsidering.

g1smd · msg:4334075 · 8:35 am on Jul 2, 2011 (gmt 0)

There are several bots you should block by default.

A few dozen lines of code can knock 30% to 40% off total bandwidth served per month (the error messages when blocked are very small compared to the page they would have pulled).

AlexK · msg:4334142 · 2:53 pm on Jul 2, 2011 (gmt 0)

All the top-25 worst sites are firewalled off; plus, the most that any bot can get on my site is 7 pages before the on-site routines jump in. That is, after all, how my site caught these wretched msnbots in the first place.

wilderness · msg:4334336 · 3:19 am on Jul 3, 2011 (gmt 0)

One would be inclined to believe that MSN, even when traveling in disguise, would use a working UA, rather than one that has three sections with trailing double spaces.

207.46.195.205 - - [02/Jul/2011:21:03:34 -0600] "GET /MyFolder/MyPage.html HTTP/1.1" 403 412 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648)"

g1smd · msg:4334432 · 4:15 pm on Jul 3, 2011 (gmt 0)

Is there a definitive list of IPs that identify as bingbot/MSN/whatever?

I have a chunk of code in .htaccess that blocks any access that identifies as Googlebot but which comes from any non-Google IP address.

I'm thinking about adding the same for other bots.
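g1smd's pattern can be sketched roughly like this (hedged: these are not g1smd's actual rules, and the 66.249. prefix is illustrative only; Google's crawl ranges change, so a forward-confirmed rDNS check is more robust than IP prefixes):

```apache
# A claimed Googlebot UA coming from a non-Google address gets 403.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
```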

wilderness · msg:4334482 · 6:07 pm on Jul 3, 2011 (gmt 0)

Rotsa luck at such a thing for MSN ;)

It wouldn't surprise me in the least if they are still using 131.107. from their 2003 entrance to crawl anonymously.

AlexK · msg:4334585 · 9:56 pm on Jul 3, 2011 (gmt 0)

g1smd:
Is there a definitive list of IPs that identify as bingbot/MSN/whatever?

I cannot answer for the UA, since that could change at any moment, but here is the comprehensive list of IPs & prefixes:

ASN report for AS8075 [cidr-report.org]
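Checking a suspect IP against a prefix list like that is straightforward with Python's stdlib `ipaddress` module. A sketch (the prefixes below are illustrative examples only; pull the current list from the ASN report rather than hard-coding it):

```python
import ipaddress

# A few prefixes in Microsoft's space at the time of this thread --
# illustrative only; refresh from an AS8075 report before relying on it.
MSFT_NETS = [ipaddress.ip_network(n) for n in
             ('65.52.0.0/14', '157.54.0.0/15', '207.46.0.0/16')]

def in_msft_space(ip):
    """True if the address falls inside any listed prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in MSFT_NETS)

print(in_msft_space('65.52.110.13'))   # True
print(in_msft_space('8.8.8.8'))        # False
```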

dstiles · msg:4334586 · 10:01 pm on Jul 3, 2011 (gmt 0)

No 131's (at least, not in my database) but I've seen a lot of 157's with proper bot UA and improper rDNS - they get blocked.

I've blocked 131.107/16 for a long time. It was always a mess and seemed to be mostly non-MS people using it (eg hosting).

I have 143 msnbot (ie bingbot) IP ranges in my database. There may be other VALID ones but they have not drawn themselves to my attention.

Pfui · msg:4334589 · 10:14 pm on Jul 3, 2011 (gmt 0)

If it helps... I okay official MSN bots --

bingbot, msnbot (includes msnbot-media and msnbot-webmaster)

-- only if they come from these Hosts/IPs:

RewriteCond %{REMOTE_HOST} !\.(bing|live|msn)\.com$
RewriteCond %{REMOTE_HOST} !\.phx\.gbl$
RewriteCond %{REMOTE_ADDR} !^65\.54\.
RewriteCond %{REMOTE_ADDR} !^65\.55\.
RewriteCond %{REMOTE_ADDR} !^157\.55\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.

(Note: That's part of a larger section and is in conjunction with white-listing -- anything else from those hosts/IPs is blocked.)

AlexK · msg:4334669 · 5:57 am on Jul 4, 2011 (gmt 0)

I've got a far simpler (and quicker) attitude: if they behave themselves they are ignored; if they are abusive they are stopped/blocked/reported. Whoever they are.

2 weeks now, and the msn/bingbot onslaught continues unabated. The latest two IPs to be blocked:

msnbot-65-52-110-87.search.msn.com [forums.modem-help.co.uk] : max 8 pages / sec; 3,121 pages total
msnbot-207-46-204-239.search.msn.com [forums.modem-help.co.uk] : max 3 pages / sec; 5 pages total

(3 pages / second is the trip speed on my site)

Pfui · msg:4334808 · 2:41 pm on Jul 4, 2011 (gmt 0)

"if they behave themselves they are ignored; if they are abusive they are stopped/blocked/reported. Whoever they are."

I doubt you'd find many in this forum that don't agree with that:)

So how about the simplest and quickest route -- a firewall rule?

dstiles · msg:4334908 · 10:39 pm on Jul 4, 2011 (gmt 0)

I agree with pfui - MANY scraper and other bad bots seem to behave themselves but their activities are insidious and odious.

AlexK - I would seriously get in touch with MS about your problem. There must be a reason for their bot's behaviour.

AlexK · msg:4334975 · 5:34 am on Jul 5, 2011 (gmt 0)

dstiles:
I would seriously get in touch with MS about your problem

I've got in touch with MS each & every day that this has happened for the last 2 weeks (auto-report to abuse address). That is the address for these reports. It's not my issue if they ignore it.

Pfui:
So how about the simplest and quickest route -- a firewall rule?

As stated earlier, that is reserved for the most egregious [forums.modem-help.co.uk]. The bot-blocker is the last-line defence to weed out the abusive from the normal browsers.

"if they behave themselves they are ignored; if they are abusive they are stopped/blocked/reported. Whoever they are."
I doubt you'd find many in this forum that don't agree with that:)

How many pay more than lip service?

Latest catch yesterday: crawl-66-249-72-240.googlebot.com [forums.modem-help.co.uk] max 3 pages / sec; 215 pages

Apparently the googlebot is going back to its former bad habits. There are also two more msnbots at a max 4 / sec, all of which pales beside two IPs from Telstra, probably scraping whilst bonded, since they were blocked within 2 seconds of each other, at a combined 291 pages / second. For the record, the worst-ever on my site since I introduced these auto-reports last September was 403 pages / second. I'm proud of my site for even being able to handle that rate of scraping.

tangor · msg:4334980 · 6:14 am on Jul 5, 2011 (gmt 0)

Color me confused... what SIZE are these pages being taken at 3 or 4 pages/sec? HOW OFTEN? (weekly?)
I like to be indexed weekly... but most of my pages go a bit slower (they're bigger than most).

MSNbot/Bing come from many IPs, but most only ask for a dozen pages at a time... not the whole site (my experience)

And on pages that have not been updated a nice 304 is given, not the page...

Have you investigated throttling rapid access?
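The "trip speed" approach AlexK describes earlier (block once an IP exceeds N pages in a second) amounts to a per-IP sliding-window throttle. A minimal sketch (the max_hits=3 / window=1.0 figures mirror AlexK's stated 3 pages/sec trip speed; the class itself is hypothetical, not his actual code):

```python
from collections import defaultdict, deque

class Throttle:
    """Per-IP sliding window: refuse a request once an IP has made
    max_hits requests inside `window` seconds."""
    def __init__(self, max_hits=3, window=1.0):
        self.max_hits = max_hits
        self.window = window
        self.hits = defaultdict(deque)   # ip -> recent request times

    def allow(self, ip, now):
        q = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_hits:
            return False                 # tripped: 403/block this request
        q.append(now)
        return True

t = Throttle(max_hits=3, window=1.0)
print([t.allow('1.2.3.4', s) for s in (0.0, 0.1, 0.2, 0.3)])
# [True, True, True, False] -- the 4th hit inside one second trips it
```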
