Forum Moderators: goodroi
If you use the OPT-IN method, any bot you haven't explicitly listed is banned by default!
Example:
# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
I'll just keep preaching OPT-IN until people pay attention ;)
Feed all others a robots.txt file which disallows them.
Use the dynamic file to log accesses, user agents, etc., so you can monitor what hits your site and decide what to allow or disallow. I've written an extremely detailed, feature-rich dynamic program for this; it also monitors bot traps and automatically updates the .htaccess file to block bots that ignore robots.txt directives.
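Boiled down, the core idea is only a few lines. Here's a minimal sketch in Python CGI (the allow list, log path, and output below are illustrative placeholders, not my actual program):

# robots.py - minimal sketch of a dynamic, opt-in robots.txt handler (CGI).
# The allow list and log path below are made-up examples.
import os, sys, time

ALLOWED = ("googlebot", "slurp", "teoma")   # opt-in list
LOG = "/var/log/robots_access.log"          # hypothetical log file

ua = os.environ.get("HTTP_USER_AGENT", "-")
ip = os.environ.get("REMOTE_ADDR", "-")

# Log every robots.txt fetch so you can review later what to allow or block.
with open(LOG, "a") as log:
    log.write("%s %s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), ip, ua))

sys.stdout.write("Content-Type: text/plain\r\n\r\n")
if any(bot in ua.lower() for bot in ALLOWED):
    # Recognized bot: allow the site, but keep it out of /cgi-bin.
    sys.stdout.write("User-agent: *\nCrawl-delay: 2\nDisallow: /cgi-bin\n")
else:
    # Everyone else is banned by default.
    sys.stdout.write("User-agent: *\nDisallow: /\n")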
If you use the OPT-IN method, any bot you haven't explicitly listed is banned by default!
A well-known site (a few million visitors monthly) uses this method. It allows access to perhaps 25 robots, explicitly disallows 25 others, and at the end it even has
User-Agent: *
Disallow: /
This site probably doesn't need a robots.txt at all; it is a well-known site. What a stupid idea to allow Googlebot and disallow Teleport! Teleport users can simply change the robot signature to User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 GoogleToolbarFF 3.0.20070420
And what about this entry, isn't it just as stupid:
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow: /
Wow, Internet Explorer does not check robots.txt before browsing! And Teleport has a setting "do not honor robots.txt".
What about other not-so-well-known robots which may appear on the market and generate good traffic for you?
1. It IS very risky.
2. It IS cloaking.
Having a static file in the file system is the preferable way.
What about returning a 404 if your 'dynamic service' is temporarily broken? Have you tested it? Are you sure it won't reply with a 50x? A 403? Are you sure your dynamic content correctly handles HTTP HEAD requests? What about 304? Are you sure it generates the correct version of robots.txt when there's a redirection from another site?...
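If you insist on a dynamic file, at least test those cases. A quick sketch in Python (the URL is a placeholder for your own site):

# check_robots.py - sanity checks for a dynamic robots.txt endpoint.
import urllib.request, urllib.error

URL = "http://www.example.com/robots.txt"   # placeholder

# Plain GET: should be 200 with text/plain, never a 403, 404, or 50x.
resp = urllib.request.urlopen(URL)
print("GET:", resp.getcode(), resp.headers.get("Content-Type"))

# HEAD: a correct handler must answer HEAD requests too.
head = urllib.request.Request(URL, method="HEAD")
print("HEAD:", urllib.request.urlopen(head).getcode())

# Conditional GET: urllib surfaces a 304 reply as an HTTPError.
cond = urllib.request.Request(
    URL, headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"})
try:
    print("Conditional GET:", urllib.request.urlopen(cond).getcode())
except urllib.error.HTTPError as err:
    print("Conditional GET:", err.code)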
You have a lot of dependencies... I'd suggest treating all robots the same in our world of democracy, and disallowing access to the 'shopping cart', for instance:
User-Agent: *
Disallow: /addToCart
Disallow: /sendFeedback
Disallow: /search
The dynamically generated robots.txt solution is also bad in another, more important respect: scalability. robots.txt is just plain static text!
If you want to cloak against unwelcome User-Agents, including those that don't honor robots.txt, just do it explicitly, without piling meaningless extra behavior onto the robots.txt conventions.
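For example, refuse them at the server level. A sketch as WSGI middleware in Python (the blacklist is illustrative; the same rule belongs in .htaccess or your server config):

# block_agents.py - sketch of explicit User-Agent blocking as WSGI middleware.
# The blacklist below is illustrative only.
BLOCKED = ("teleport", "webcopier", "httrack")

def block_bad_agents(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bad in ua for bad in BLOCKED):
            # Unwelcome agent: refuse outright instead of bending robots.txt.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware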
For those who use dynamically generated content for robots.txt, with dependencies on IP, User-Agent, etc.:
1. It IS very risky.
2. It IS cloaking.
Having a static file in the file system is the preferable way.
Cloaking means showing the USER something different from what you show the SEARCH ENGINE, and the USER should never see robots.txt in the first place, so it's not cloaking.
It's also not risky, no more risky than running a database-driven website such as a forum, blog, or anything else that uses MySQL on the backend. Actually, the cloaked robots.txt is probably less risky: most people just serve up a series of static files dynamically based on the user agent, so it's far less likely to fail than the site itself, which is prone to a MySQL database crash.
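In other words, the 'dynamic' part is usually nothing more than this kind of dispatch (a sketch; the file names are made up):

# Sketch: a "dynamic" robots.txt that merely picks one of two pre-written
# static files by User-Agent. No database anywhere, so nothing to crash.
def pick_robots_file(user_agent):
    ua = user_agent.lower()
    if "googlebot" in ua or "slurp" in ua or "teoma" in ua:
        return "robots-allow.txt"    # pre-written allow version
    return "robots-denyall.txt"      # pre-written deny-all version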
FWIW, you're arguing the wrong point on the wrong website as WebmasterWorld itself uses a dynamic robots.txt file:
[webmasterworld.com...]
Sorry, but all that fear, uncertainty and doubt about dynamic robots.txt doesn't fly here.
White Hat knows it even has 'synchronize' functionality. Suppose I need to read WebmasterWorld up to five levels deep during a long trip when the Internet is not available... Can I generate a locally cached copy of your site before such a trip?
Yes, I am creating a lot of work for everyone at this Forum. Please remove my profile!
P.S.
Mozilla also has a cache! Mozilla does not honor robots.txt.
WebmasterWorld itself uses a dynamic robots.txt
There is a big difference between a browser, a robot, a cache, and a crawl.
Cache = keeping something you already downloaded
Browser = direct, human-operated viewing
Robot = automatic downloader of content
Crawl = following links automatically to discover pages and download them
Browsers are not robots, so they do not need to respect robots.txt. Disallowing well-known browser User-Agents in the robots.txt file trips up poorly written robots: no genuine browser will ever check the robots.txt file, so anything claiming to be IE4 that reads robots.txt deserves to see a Disallow.
Robots.txt, however, is only one part of a bigger picture; good management also includes honey pots, IP-based blocks, and the like.
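For reference, this is all a well-behaved robot does before fetching a page (Python's standard urllib.robotparser; the URLs are placeholders):

# A genuine crawler consults robots.txt before every fetch; a browser never does.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# True only if the rules allow this agent to fetch this URL.
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"))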
You probably published it after checking your server logs. I accessed it yesterday. Very nice promotion! Especially when someone replies to me via private email using a real name.
I'm not sure what you mean as I have no affiliation with WebmasterWorld whatsoever other than being a member, so I couldn't check any logs at all.
The information I posted is all public knowledge, which you can find by viewing the WebmasterWorld blog, which *IS* a robots.txt file; see the info at the top of the document:
[webmasterworld.com...]
Now, any questions?
Can you add real value to your script to prevent content stealing?
Robots.txt isn't designed to stop content stealing; it's designed to tell well-behaved robots how to crawl your site. If your goal is to stop content stealing using robots.txt, then you have lost the battle already.
BTW, many SEs still have redirect-related bugs around robots.txt: which version of the file should they use, the one at the initial URL or the one at the final URL? What to do with a 302? With a 301?
Dynamic robots.txt uses no redirects, at least nothing I do is redirected, therefore those issues aren't a problem.
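For crawler authors who do have to deal with redirects, one defensible policy is to follow the redirect first and obey the robots.txt of the final host. A sketch in Python (the URLs are placeholders):

# Sketch: resolve redirects first, then check robots.txt on the final host.
import urllib.parse, urllib.request, urllib.robotparser

resp = urllib.request.urlopen("http://www.example.com/old-page")  # follows 301/302
final_url = resp.geturl()                # URL after all redirects
parts = urllib.parse.urlsplit(final_url)

rp = urllib.robotparser.RobotFileParser()
rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
rp.read()
print(rp.can_fetch("MyCrawler", final_url))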