Forum Moderators: open
robots.txt? YES*
*But...
Three out of three times in as many months, Microsoft's Live Search has contained our blocked-by-robots.txt content. Each time I've written and requested removal; each time I get an auto-confirm notice and the content's removed. Then mere weeks later, more blocked content appears in their results. Like this week, when I just discovered more. I'd 403 msnbot and ALL of its kin but for the fact that doing so doesn't remove the content.
So if you list a folder in robots.txt, say:
User-agent: *
Disallow: /restricted/
where:
example.com/restricted/ is the folder to restrict...
someone else can publish the URL directly, like http://example.com/restricted, or force the spider to follow it via a redirect. And spiders will attempt to access the URL without checking robots.txt first. It's a very simple thing to test if you want, and as far as I've tested these methods, every popular spider behaves the same.
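The point being that robots.txt is purely advisory. A quick way to see what a *compliant* crawler should conclude from the file above is Python's stdlib robotparser; this is just a sketch using the example.com paths from the snippet (the page names are made up):

```python
from urllib.robotparser import RobotFileParser

# The same hypothetical robots.txt as in the example above.
rules = """User-agent: *
Disallow: /restricted/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler must not fetch anything under /restricted/ ...
print(parser.can_fetch("*", "http://example.com/restricted/page.html"))  # False

# ... while everything else is fair game.
print(parser.can_fetch("*", "http://example.com/public/page.html"))  # True
```

Nothing in the protocol enforces that verdict, though: a bot that skips the check (or lands on the URL via a redirect) can simply request the page anyway, which is exactly the behavior being described here.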
I haven't seen that from GoogleBot(s).
I have two spider-trap URIs linked to from several pages within one of the sites. Both URIs are disallowed in robots.txt.
Slurp occasionally tries to fetch them, but very rarely.
MSN bots, all of them, try to fetch these URIs on a weekly basis. When they try to fetch them, they get a plain simple 403.
Here is the kicker: IF you do site:thatdomain.tld on MSN or Live, those URIs are listed on the first page.
There is another URI that has IDs attached to it, and most of the pages link to at least one variation of URI?ID=N3. Same thing here: disallowed in robots.txt, constant attempts to crawl, each gets a 403 (none of your business, you know), and hundreds of them are listed in the site: command.
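If you want to watch for the same pattern in your own logs, a small script over the access log will surface every hit on a trap URI along with the bot that made it. A minimal sketch, assuming Apache/NCSA combined log format; the trap paths and sample lines are hypothetical:

```python
import re

# Trap paths disallowed in robots.txt (hypothetical examples).
TRAP_PATHS = ("/trap-1/", "/trap-2/")

# Minimal combined-log parser: client IP, request path, status, user agent.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def flag_trap_hits(lines):
    """Yield (ip, path, status, user_agent) for any request to a trap URI."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(TRAP_PATHS):
            yield m.group(1), m.group(2), m.group(3), m.group(4)

sample = [
    '207.46.13.5 - - [01/Jan/2009:00:00:01 +0000] "GET /trap-1/ HTTP/1.1" '
    '403 199 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '66.249.66.1 - - [01/Jan/2009:00:00:02 +0000] "GET /ok-page.html HTTP/1.1" '
    '200 512 "-" "Googlebot/2.1"',
]
for hit in flag_trap_hits(sample):
    print(hit)
```

Run that against a real log and a compliant bot should never appear in the output at all; repeat offenders hitting the traps weekly, as described above, stand out immediately.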
The fun part is when the MSN bot that FAKES referrers comes along.
[webmasterworld.com...]
A normal MSN bot will visit the page that is OK for bots to see (just a page).
A couple of seconds later, DA FAKER is here. It visits the page and pulls down several files that sit in a disallowed folder and are referenced from the allowed page: JS, CSS, and some images that are disallowed as well. DA FAKER does NOT pull down the JS, CSS, and images that are allowed to all visitors, including bots.
WHY?
I haven't seen that from GoogleBot(s).
See what happens if you force a redirect. In other words, have the 1st server emit 301 headers pointing to the restricted area on the 2nd server when it sees Googlebot, then check the log on the 2nd to see whether the page was accessed. Should be much faster to replicate.
I haven't seen a spider that doesn't do that yet. They all follow.
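The redirect test above can be sketched with Python's stdlib http.server: the first server 301s known bots into the (hypothetical) disallowed area on the second, and the second server's log then tells you whether the bot followed. The server2.example.com URL and the bait page name are placeholders; here the bot request is simulated with http.client, which never follows redirects on its own:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import http.client
import threading

# Hypothetical URL on the 2nd server, inside its robots.txt-disallowed area.
RESTRICTED = "http://server2.example.com/restricted/"

class RedirectToRestricted(BaseHTTPRequestHandler):
    """When a known bot requests the bait page, 301 it into the disallowed
    area. Whether the bot then shows up in server2's log tells you if it
    followed the redirect without rechecking robots.txt."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "").lower()
        if "googlebot" in ua or "msnbot" in ua:
            self.send_response(301)
            self.send_header("Location", RESTRICTED)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"plain page for everyone else")

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectToRestricted)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate the bot's request and inspect the raw 301 response.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/bait.html", headers={"User-Agent": "msnbot/2.0b"})
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # 301 http://server2.example.com/restricted/
server.shutdown()
```

A real spider that follows that 301 and fetches the restricted URL, without first fetching the 2nd server's robots.txt, is behaving exactly as described in this thread.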
65.55.189.114
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; livebot-searchsense/0.1; +http://search.msn.com/msnbot.htm)
robots.txt? YES
-- and two fellow travelers, the egregiously misbehaved v2.0b --
msnbot-65-55-106-163.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)
robots.txt? YES
-- and this new-newcomer:
bay20-ts0.bay20.hotmail.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; livebot-searchsense/0.1; +http://search.msn.com/msnbot.htm)
robots.txt? YES
Hotmail?!
Hooboy. Too many bots, too little time.
It also makes case errors when requesting pages.
Not so sure I'm ready to allow yet another lame bot from MSN.