Dijkgraaf

msg:4222537 | 1:27 am on Oct 27, 2010 (gmt 0) |
You don't need to disallow all non-existent URL's, only the ones you are getting hits on and don't want to get hits on anymore.
|
jdMorgan

msg:4222545 | 1:52 am on Oct 27, 2010 (gmt 0) |
I for one am not yet sure that this is a real msnbot actually controlled by MSN. The "._" at the end of the UA string and the "different" value in the "From" header push me toward "no." Everyone posting to this thread, please be explicit as to the full UA-string, the "From" header (if you log it), the IP address, and the rDNS if you check it. Thanks, Jim
|
Dijkgraaf

msg:4222552 | 2:28 am on Oct 27, 2010 (gmt 0) |
In fact two of the topics should be in other threads, as this one started about MSN cloaked bots. To discuss the MSN bot with the "._" at the end use [webmasterworld.com...] And maybe start a new thread on msnbot requesting /logs or /access_logs
|
Pfui

msg:4222604 | 5:38 am on Oct 27, 2010 (gmt 0) |
@Dijkgraaf,
1.) This thread's OP is my observing/complaining about hits from 'bare MSN IPs and not bona fide MSN bots'. Sure, threads overlap, but I think subsequent posts are on-point with my OP because they're describing hits from, well, 'bare MSN IPs and not bona fide MSN bots' -- and that includes the iffy bot with the "._"
2.) You said: "add /logs and /access_log to robots.txt and you won't get any more hits on those" and similarly: "You don't need to disallow all non-existent URL's, only the ones you are getting hits on and don't want to get hits on anymore."
I wish. But as thousands of posts in this forum attest, simply including dirs and files in robots.txt is definitely not a sure-fire way to eliminate hits to same, neither by bad bots nor, increasingly, 'good' ones.
Bottom line for me -- It's not okay for MSN IPs to do anything using anything other than what I say is okay to via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way. :)
|
Staffa

msg:4222895 | 5:47 pm on Oct 27, 2010 (gmt 0) |
| It's not okay for MSN IPs to do anything using anything other than what I say is okay to via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way. |
| Amen and that goes for other Bots as well ;o)
|
Dijkgraaf

msg:4223623 | 10:46 pm on Oct 28, 2010 (gmt 0) |
@Pfui But that's the point. You have not told MSN or other bots that they are not allowed to ask for those URLs. With the current standard, anything that isn't explicitly disallowed is allowed. I've seen some very odd requests sometimes, even from GoogleBot, including asking for some .exe files. I've put those down to either someone creating fake inbound links to get Google to scan for vulnerabilities, or GoogleBot checking for compromised sites. Yes, there are bots that won't obey those rules, but those ones you will want to ban outright anyway. I monitor the 404's occurring on my site, and if a bot starts re-visiting them I disallow it in my robots.txt file or give it a 301 to the resource they should be requesting. Either way it solves the problem, which isn't a big one to start off with, as all that would be happening is that they are getting a 404.
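For reference, a minimal Python sketch of how a compliant crawler would read such a rule, using the standard-library robots.txt parser. The robots.txt content is hypothetical; only the two paths (/logs and /access_log) come from this thread, and nothing here says how msnbot itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; only the two paths are taken from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs
Disallow: /access_log
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a robots.txt-compliant bot may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Anything not explicitly disallowed stays allowed.
    for path in ("/logs", "/access_log", "/index.html"):
        print(path, is_allowed(ROBOTS_TXT, "msnbot", path))
```

This only models a bot that obeys the file; a bot that ignores robots.txt still has to be stopped server-side.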
|
Samizdata

msg:4223648 | 12:08 am on Oct 29, 2010 (gmt 0) |
| all that would be happening is that they are getting a 404 |
| No, the bot also takes the robots.txt file. If it is a genuine Microsoft bot - and I take Jim's point that the jury is still out on that - then phishing for access logs is disgraceful behaviour for a reputable company. If it is a fake Microsoft bot using a genuine Microsoft IP then I certainly don't want it reading the robots.txt file or getting any information whatsoever. Either way, I'd say a 403 is what it deserves. ...
|
Pfui

msg:4223723 | 5:21 am on Oct 29, 2010 (gmt 0) |
Jim, about the iffy "._" bot -- see also my observations in this thread's predecessor, "MSN's many cloaked bots." (2009): [webmasterworld.com...] and yours in "Wanted: Crawler Quality Assurance Engineer" (2010): [webmasterworld.com...]
|
Pfui

msg:4224716 | 2:18 am on Nov 1, 2010 (gmt 0) |
Doggoneit. MSN just ran bingbot from a bare IP:
157.55.16.229
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
robots.txt? NO
|
keyplyr

msg:4224734 | 3:01 am on Nov 1, 2010 (gmt 0) |
| MSN just ran bingbot from a bare IP |
| Thanks for the heads-up Pfui. Guess I was asleep at the wheel and blocked this one.
157.55.16.231 - - [30/Oct/2010:01:50:32 -0700] "GET www.example.com HTTP/1.1" 403 479 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
|
Mokita

msg:4227351 | 11:04 pm on Nov 5, 2010 (gmt 0) |
Just seen coming from rdns msnbot-65-52-49-143.search.msn.com (65.52.49.143), requests for one page plus css, but no images. UA was exactly: Mozilla/4.0 (compatible (no closing bracket)
|
Pfui

msg:4228756 | 9:27 am on Nov 10, 2010 (gmt 0) |
Akin to the OP, and also mssg #:4203105, yet another hit-and-run with no UA, no robots.txt, no REF, no nothing. To yet another file. Eleven times. For the third time!
65.52.32.17 -
11/10 00:15:45 /dir/filename.html
11/10 00:15:56 /dir/filename.html
11/10 00:16:07 /dir/filename.html
11/10 00:16:17 /dir/filename.html
11/10 00:16:28 /dir/filename.html
11/10 00:16:39 /dir/filename.html
11/10 00:16:50 /dir/filename.html
11/10 00:17:00 /dir/filename.html
11/10 00:17:11 /dir/filename.html
11/10 00:17:22 /dir/filename.html
11/10 00:17:32 /dir/filename.html
Too weird.
|
Pfui

msg:4229050 | 3:32 am on Nov 11, 2010 (gmt 0) |
Same ol', same ol', the dreaded 65.52., albeit a different Host and UA than previously mentioned in this thread:
msnbot-65-52-50-54.search.msn.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707)
robots.txt? NO
Just went for root.
|
Staffa

msg:4229767 | 11:19 pm on Nov 12, 2010 (gmt 0) |
I noticed a weird event today. A regular visitor (via G.fr search) comes and views a few pages. Next comes 207.46.204.nn (msnbot IP) with exactly the same UA as the previous visitor which was what caught my attention. UA for both : Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; GTB6.6; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; HPDTDF; Tablet PC 2.0; .NET4.0C; Creative AutoUpdate v1.40.01) My log files are not public and it is either a mighty coincidence or ... a case where msn is following the visitor from France ?
|
Pfui

msg:4232259 | 6:36 am on Nov 19, 2010 (gmt 0) |
Fresh from one of my logs, emphasis mine:
msnbot-65-52-49-143.search.msn.com - - [18/Nov/2010:22:24:15 -0800] "GET / HTTP/1.1" 403 1468 "-" "Mozilla/4.0 (compatible"
Crawlus interruptus?
|
Mokita

msg:4232311 | 10:22 am on Nov 19, 2010 (gmt 0) |
Pfui: Exactly the same details as my post above on Nov 6, (including the rDNS).
|
Pfui

msg:4232499 | 7:46 pm on Nov 19, 2010 (gmt 0) |
(slaps head) Thank you for pointing that out:) Curious how it's even the same IP. And BvsB shows the exact same oddity, along with six out of seven UAs not identified as msnbot- or bing-related from 65.52.49.143: [botsvsbrowsers.com...] Unfortunately that IP's not the only one using the truncated UA. Here are 22 more: [botsvsbrowsers.com...] Clearly they know what they're doing. But until I know, non-search UAs from .search.msn.com Hosts/IPs will just keep getting 403s. Transparency. What a concept. (Note the date...) [bing.com:80...]
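A rough Python sketch of the policy stated here (403 for non-search UAs coming from .search.msn.com hosts). The token list and the function name are assumptions for illustration; the poster's actual rules live in .htaccess and are not shown in the thread.

```python
# Assumed list of substrings that mark a UA as a declared search crawler.
CRAWLER_TOKENS = ("msnbot", "bingbot")


def status_for(host: str, user_agent: str) -> int:
    """Return 403 for a .search.msn.com host whose UA does not claim to be a crawler."""
    from_msn_search = host.lower().rstrip(".").endswith(".search.msn.com")
    claims_crawler = any(token in user_agent.lower() for token in CRAWLER_TOKENS)
    if from_msn_search and not claims_crawler:
        return 403
    return 200
```

With this rule the truncated "Mozilla/4.0 (compatible" UA from msnbot-65-52-49-143.search.msn.com is refused, while a bingbot UA from the same host passes. Whether the "._" variant of the msnbot UA should also be refused is a separate decision that this sketch does not make.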
|
Pfui

msg:4232679 | 9:54 am on Nov 20, 2010 (gmt 0) |
More bare IPs from 65.52. Hit in a post-tweet swarm, 20 minutes apart. All using:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
65.52.4.249
65.52.2.10
65.52.17.79
robots.txt? NO
|
Pfui

msg:4235821 | 2:24 am on Nov 28, 2010 (gmt 0) |
Ten same-second hits to the same site. So much for MSN stating msnbot supports "Crawl-delay" in robots.txt:
Crawl Delay And The Bing Crawler, MSNBot [bing.com]
msnbot-65-55-190-9.search.msn.com
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file1
65.55.190.19
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file2
11/27 03:45:34 /file3
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file4
11/27 03:45:34 /file5
|
Pfui

msg:4242980 | 5:40 am on Dec 16, 2010 (gmt 0) |
Same Hostname caribguy reported above. At least asked for robots.txt and heeded same. But crawled 20 documents whereas bing/msnbot usually crawls 1 or 2 per server session. Also, the same weird UA was in play:
gig4-2.tuk2f-gsr-a.us.msn.net
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._
|
dstiles

msg:4243326 | 9:21 pm on Dec 16, 2010 (gmt 0) |
I block all those UAs. Apart from anything else they seem to come from non-rDNS IPs.
|
Pfui

msg:4254124 | 9:09 am on Jan 17, 2011 (gmt 0) |
The odd one came around on the 16th --
msnbot-65-55-55-205.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._
robots.txt? Yes but... "Crawl-delay" directive ignored.
-- and alternated with the usual suspects in a 15-minute period with hits from Hosts and bare IPs. Here's a partial listing:
msnbot-65-52-110-69.search.msn.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
msnbot-207-46-194-144.search.msn.com
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
157.55.16.230
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
FWIW
|
incrediBILL

msg:4254400 | 10:19 pm on Jan 17, 2011 (gmt 0) |
I'm guessing that bingbot from bare IPs is either:
- running from a test lab
- run via some MS cloud services
- being spoofed via some MS proxy
Doesn't matter, if the IP doesn't resolve to a crawler rDNS, I block it. FYI - these bare IPs are showing up on Project Honeypot: [projecthoneypot.org...]
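The "resolves to a crawler rDNS" rule is commonly implemented as a forward-confirmed reverse DNS lookup. A minimal Python sketch follows, assuming .search.msn.com is the only crawler suffix of interest (that suffix is taken from the hostnames quoted in this thread) and IPv4 only. The function and parameter names are hypothetical, and the two resolver arguments exist only so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, List


def _reverse(ip: str) -> str:
    # PTR lookup; raises socket.herror (an OSError) when there is no rDNS.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # A-record lookup; raises socket.gaierror (an OSError) on failure.
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    suffix: str = ".search.msn.com",  # assumed crawler hostname suffix
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """True only if ip -> hostname ends with `suffix` and hostname -> ip round-trips."""
    try:
        host = reverse(ip)
    except OSError:
        return False  # bare IP: no reverse DNS at all
    if not host.lower().rstrip(".").endswith(suffix):
        return False  # resolves, but not to a crawler hostname
    try:
        return ip in forward(host)  # guards against forged PTR records
    except OSError:
        return False
```

Calling is_verified_crawler("157.55.16.229") with the default resolvers performs live DNS queries, so results depend on whatever Microsoft publishes at the time.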
|
dstiles

msg:4263929 | 9:45 pm on Feb 7, 2011 (gmt 0) |
Microsoft seems to have legitimized a few more bot IPs by setting up proper rDNS for them. The range below may not be ALL of the new ones but seems to be correct for 157.55.116 (although the bot is still coming round with the "invalid" underscore UA and ignoring robots.txt on this and other ranges).
157.55.116.7 - 157.55.116.97
Bots still hitting hard but with no valid rDNS are in the range:
157.55.16.0 - 157.55.18.255
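Checking an address against start-end ranges like these takes only a few lines with Python's ipaddress module. A sketch, with the two ranges copied from this post as reported in February 2011 (they are not a current or complete list) and hypothetical constant and function names:

```python
import ipaddress

# Ranges as reported in this post; not a current or complete list.
RDNS_RANGE = ("157.55.116.7", "157.55.116.97")
BARE_RANGE = ("157.55.16.0", "157.55.18.255")


def in_range(ip: str, first: str, last: str) -> bool:
    """True if `ip` lies within the inclusive range first..last."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last)
```

For example, in_range("157.55.16.229", *BARE_RANGE) is True for the bare bingbot IP reported earlier in the thread.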
|
AlexK

msg:4330348 | 4:36 am on Jun 24, 2011 (gmt 0) |
4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots. Those routines have been augmented to record & report them (both to the tornvall RBL & also on a public webpage as a permanent record). Here are the reports; so far, the same behaviour for 4 days on the trot:
24 June:
65.52.110.75 [forums.modem-help.co.uk] : msnbot-65.52.110.75.search.msn.com : max: 9 / sec : total: 26 pages
207.46.204.238 [forums.modem-help.co.uk] : msnbot-207.46.204.238.search.msn.com : max: 11 / sec : total: 2,214 pages
207.46.199.43 [forums.modem-help.co.uk] : msnbot-207.46.199.43.search.msn.com : max: 8 / sec : total: 548 pages
23 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 3,272 pages
22 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 1,395 pages
65.52.110.64 [forums.modem-help.co.uk] : msnbot-65.52.110.64.search.msn.com : max: 7 / sec : total: 25 pages
65.52.110.72 [forums.modem-help.co.uk] : msnbot-65.52.110.72.search.msn.com : max: 12 / sec : total: 1,888 pages
207.46.195.242 [forums.modem-help.co.uk] : msnbot-207.46.195.242.search.msn.com : max: 12 / sec : total: 1,201 pages
21 June:
65-52-110-72 [forums.modem-help.co.uk] : msnbot-65-52-110-72.search.msn.com : max: 12 / sec : total: 1,424 pages
207-46-195-242 [forums.modem-help.co.uk] : msnbot-207-46-195-242.search.msn.com : max: 12 / sec : total: 926 pages
207-46-199-38 [forums.modem-help.co.uk] : msnbot-207-46-199-38.search.msn.com : max: 8 / sec : total: 26 pages
A full report has also been auto-emailed to the MS abuse address each day for each IP. Frankly, I do not expect any action.
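The "max / sec" and "total" figures in a report like this can be derived from per-request timestamps. A small Python sketch of one way to compute them, assuming hits are already available as (ip, datetime) pairs. It is not the Bad-Bot-Blocker's own code, and the sample timestamps are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Invented sample data: four hits from one IP (three within one second), one from another.
SAMPLE = [
    ("207.46.13.98", datetime(2011, 6, 23, 12, 0, 0)),
    ("207.46.13.98", datetime(2011, 6, 23, 12, 0, 0)),
    ("207.46.13.98", datetime(2011, 6, 23, 12, 0, 0)),
    ("207.46.13.98", datetime(2011, 6, 23, 12, 0, 1)),
    ("65.52.110.75", datetime(2011, 6, 23, 12, 0, 5)),
]


def summarize(hits: Iterable[Tuple[str, datetime]]) -> Dict[str, Tuple[int, int]]:
    """Return {ip: (max hits in any one-second bucket, total hits)}."""
    per_second: Dict[str, Counter] = defaultdict(Counter)
    for ip, when in hits:
        # Truncate to whole seconds so hits in the same second share a bucket.
        per_second[ip][when.replace(microsecond=0)] += 1
    return {ip: (max(c.values()), sum(c.values())) for ip, c in per_second.items()}
```

A blocker could then compare the first number of each pair against a threshold; where that threshold sits is a site-specific choice.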
|
dstiles

msg:4330766 | 9:02 pm on Jun 24, 2011 (gmt 0) |
Had a lot of hits from msnbot recently - coming round on some sites once a day - but behaviour seems reasonable. That's specific known msnbot UA plus IP, of course. There are always a few non-bot UAs or badly formatted ones, plus the odd bad rDNS IP (your IP examples are good).
|
AlexK

msg:4330817 | 11:16 pm on Jun 24, 2011 (gmt 0) |
@dstiles It's a Microsoft ASN for each IP. I've seen reports in the past for such behaviour from the msnbot, back in the day before the bingbot, but this is the first time that I've picked it up personally. The stop-block-report routines do not pick up the UA, only info (such as port number) that the hostmaster requires to confirm the abuse. Not that I expect--or have received--anything more than an acknowledgement of receipt from MS (abuse@hotmail.com).
|
dstiles

msg:4331082 | 9:27 pm on Jun 25, 2011 (gmt 0) |
Not sure what you mean by stop-block-report. Is this something on your machine that reports problems? I get my info from the site logs (I also record abuses in a separate log): that shows UA and IP. If I suspect a problem I check the rDNS using the linux Network Tools or online robtex. I've never bothered to report msnbot/bingbot abuse.
|
AlexK

msg:4331153 | 6:36 am on Jun 26, 2011 (gmt 0) |
dstiles: | Not sure what you mean by stop-block-report. Is this something on your machine that reports problems? |
| The original Bad-Bot-Blocker [forums.modem-help.co.uk], with added vitamins! Originated on WebmasterWorld [webmasterworld.com] & developed by myself. The bot-blocker is the 'stop-block' part. I've added the capability to record the abuse, auto-report (both to abuse email + to RBL site) & provide permanent evidence. I'm a bit proud of it. The constant scrapes have now almost entirely gone, although that could be because each IP now gets an auto-block. It still begs the question: what on earth are MS doing?
|
tangor

msg:4331157 | 6:59 am on Jun 26, 2011 (gmt 0) |
| 4 days ago the msnbot starting hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots. |
| Is Bing a "bad bot"? Just asking. Google is proving short on supply and Bing on the rise... which is the "bad bot"? Just curious.
|
Mokita

msg:4331164 | 8:40 am on Jun 26, 2011 (gmt 0) |
@AlexK Bingbot/MsnBot claim to honour a "Crawl-delay" directive in robots.txt : [bing.com...] Did you try that method before/instead of using the sledge-hammer approach? If you had, your complaints to MS Abuse would probably carry more weight. ... Just a suggestion (as I haven't experienced the abuse you mention but I do have a "Crawl-delay" setting) <edit> BingBot is crawling our sites heavily - but honouring the Crawl-delay </edit>
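For anyone wanting to sanity-check their own setting, Python's standard robots.txt parser can read a Crawl-delay value back out. A minimal sketch, assuming Python 3.6 or later; the 10-second value and the robots.txt content are made up for illustration, and this says nothing about whether a given bot actually obeys it.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a 10-second delay for msnbot only.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt: str, user_agent: str) -> Optional[int]:
    """Return the Crawl-delay (seconds) that applies to `user_agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)
```

Comparing that number with the gaps between a bot's hits in the access log is the evidence a complaint to MS would need.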
|