Forum Moderators: open
robots.txt? YES*
*But...
Three out of three times in as many months, Microsoft's Live Search has contained our blocked-by-robots.txt content. Each time I've written and requested removal; each time I get an auto-confirm notice and the content's removed. Then mere weeks later, more blocked content appears in their results. Like this week, when I just discovered more. I'd 403 msnbot and ALL of its kin but for the fact that doing so doesn't remove the content.
So if you list a folder in robots.txt, say:
User-agent: *
Disallow: /restricted/
where:
example.com/restricted/ is the folder to restrict...
someone else can publish the URL directly, like http://example.com/restricted, or force the spider to follow it via a redirect. And spiders will attempt to access the URL without checking robots.txt first. It's a very simple thing to test if you want, and as far as I've tested these methods, every popular spider behaves the same.
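The point being that robots.txt is purely advisory. A quick way to see what a *compliant* crawler should conclude from the file above is Python's stdlib robotparser; this is just a sketch using the example.com paths from the snippet (the page names are made up):

```python
from urllib.robotparser import RobotFileParser

# The same hypothetical robots.txt as in the example above.
rules = """User-agent: *
Disallow: /restricted/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler must not fetch anything under /restricted/ ...
print(parser.can_fetch("*", "http://example.com/restricted/page.html"))  # False

# ... while everything else is fair game.
print(parser.can_fetch("*", "http://example.com/public/page.html"))  # True
```

Nothing in the protocol enforces that verdict, though: a bot that skips the check (or lands on the URL via a redirect) can simply request the page anyway, which is exactly the behavior being described here.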
I haven't seen that from GoogleBot(s).
I have two spider-trap URIs linked to from several pages within one of the sites. Both URIs are disallowed in robots.txt.
Slurp occasionally tries to fetch them, but very rarely.
MSN bots, all of them, try to fetch these URIs on a weekly basis. When they try to fetch them, they get a plain simple 403.
Here is the kicker: IF you do site:thatdomain.tld on MSN or Live, those URIs are listed on the first page.
There is another URI that has IDs attached to it, and most of the pages link to at least one variation of URI?ID=N3. Same thing here: disallowed in robots.txt, constant attempts to crawl, each gets a 403 (none of your business, you know), and hundreds of them are listed in the site: command.
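If you want to watch for the same pattern in your own logs, a small script over the access log will surface every hit on a trap URI along with the bot that made it. A minimal sketch, assuming Apache/NCSA combined log format; the trap paths and sample lines are hypothetical:

```python
import re

# Trap paths disallowed in robots.txt (hypothetical examples).
TRAP_PATHS = ("/trap-1/", "/trap-2/")

# Minimal combined-log parser: client IP, request path, status, user agent.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def flag_trap_hits(lines):
    """Yield (ip, path, status, user_agent) for any request to a trap URI."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(TRAP_PATHS):
            yield m.group(1), m.group(2), m.group(3), m.group(4)

sample = [
    '207.46.13.5 - - [01/Jan/2009:00:00:01 +0000] "GET /trap-1/ HTTP/1.1" '
    '403 199 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '66.249.66.1 - - [01/Jan/2009:00:00:02 +0000] "GET /ok-page.html HTTP/1.1" '
    '200 512 "-" "Googlebot/2.1"',
]
for hit in flag_trap_hits(sample):
    print(hit)
```

Run that against a real log and a compliant bot should never appear in the output at all; repeat offenders hitting the traps weekly, as described above, stand out immediately.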
The fun part is when the MSN bot that FAKES referrers comes along.
[webmasterworld.com...]
A normal MSN bot will visit the page that is OK for bots to see (just a page).
A couple of seconds later, DA FAKER is here. It visits the page and pulls down several files that sit in a disallowed folder and are referenced from the allowed page: JS, CSS, and some images that are disallowed as well. DA FAKER does NOT pull down the JS, CSS, and images that are allowed to all visitors, including bots.
WHY?
I haven't seen that from GoogleBot(s).
See what happens if you force a redirect. In other words, have the 1st server emit 301 headers pointing to the restricted area on the 2nd server when it sees Googlebot, then check the log on the 2nd to see whether the page was accessed. Should be much faster to replicate.
I haven't seen a spider that doesn't do that yet. They all follow.
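The redirect test above can be sketched with Python's stdlib http.server: the first server 301s known bots into the (hypothetical) disallowed area on the second, and the second server's log then tells you whether the bot followed. The server2.example.com URL and the bait page name are placeholders; here the bot request is simulated with http.client, which never follows redirects on its own:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import http.client
import threading

# Hypothetical URL on the 2nd server, inside its robots.txt-disallowed area.
RESTRICTED = "http://server2.example.com/restricted/"

class RedirectToRestricted(BaseHTTPRequestHandler):
    """When a known bot requests the bait page, 301 it into the disallowed
    area. Whether the bot then shows up in server2's log tells you if it
    followed the redirect without rechecking robots.txt."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "").lower()
        if "googlebot" in ua or "msnbot" in ua:
            self.send_response(301)
            self.send_header("Location", RESTRICTED)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"plain page for everyone else")

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectToRestricted)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate the bot's request and inspect the raw 301 response.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/bait.html", headers={"User-Agent": "msnbot/2.0b"})
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # 301 http://server2.example.com/restricted/
server.shutdown()
```

A real spider that follows that 301 and fetches the restricted URL, without first fetching the 2nd server's robots.txt, is behaving exactly as described in this thread.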
65.55.189.114
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; livebot-searchsense/0.1; +http://search.msn.com/msnbot.htm)
robots.txt? YES
-- and two fellow travelers, the egregiously misbehaved v2.0b --
msnbot-65-55-106-163.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
robots.txt? YES
-- and this new-newcomer:
bay20-ts0.bay20.hotmail.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; livebot-searchsense/0.1; +http://search.msn.com/msnbot.htm)
robots.txt? YES
Hotmail?!
Hooboy. Too many bots, too little time.
It also makes case errors when requesting pages.
Not so sure I'm ready to allow yet another lame bot from MSN.