
Why use a robots.txt?

     
11:28 pm on Feb 19, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I'm debating whether or not to implement a robots.txt on our site...

I've been studying SEO for over 3 years now, my site is usually on the first page for most if not all of our key phrases, and I know the ins and outs.

However, I have never seen a really big benefit to implementing a robots.txt file on my sites.

Why do you use a robots.txt file?

11:31 pm on Feb 19, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



1. Provide instructions for robots
2. Stop 404s for robots.txt requests
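
A sketch for #2: even a minimal allow-all file stops the 404s, assuming you have nothing you need to block (an empty Disallow value permits everything):

User-agent: *
Disallow:
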
1:31 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Reason #3: protect yourself in corporate meetings from someone saying you don't know what you are doing because you don't have a robots.txt.

3:06 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



@p1

Why provide them instructions? Why not let them access all files on the site?

What files do you not let them access? JavaScript files? Includes?

3:15 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I use the Whitelisting Method and block generic folders like /nav/, /js/, etc. There is no way I'm putting a road map in there for prying eyes to start probing our structure. I keep it simple. I only allow the 4 major SEs to crawl and allow access to a few online validation tools. Anything else that adheres to the standard is not permitted.

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /js/
Disallow: /nav/

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /
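
How this parses: consecutive User-agent lines form one group that shares the rules beneath them, the empty Disallow under Mediapartners-Google leaves Google's AdSense crawler unrestricted, and the final catch-all group blocks every other robot that honors the standard.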
8:26 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



That's your entire robots.txt file?

Wow... a partner of mine sent me one that was about 100 lines long and had all kinds of stuff in it...

[edited by: tonynoriega at 8:35 pm (utc) on Feb. 20, 2009]

8:34 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



A partner of mine sent me one that was about 100 lines long and had all kinds of stuff in it.

It all depends on the size of the site and the volume of dynamics. We like to handle things at the page level with a noindex, nofollow directive instead of relying on robots.txt. Also, if there were 100 lines in that file, that is surely providing information to prying eyes that maybe they shouldn't have quick access to.

The robots.txt is fine for "general" good bot blocking but I wouldn't rely on it for managing crawler activity at the page level. We're also handling requests through various other routines and redirecting those to their appropriate destinations.
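
For reference, the page-level directive mentioned here is the standard robots meta tag, placed in the page's <head>:

<meta name="robots" content="noindex, nofollow">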

8:37 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Also, we have an includes folder that has a bunch of ASP includes in it... footer, header, etc.

If I disallow crawlers in there, are they not picking up my navigation?

Or, since the includes are pulled into index.asp, are they still presented to the bots?
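
For what it's worth, server-side includes are merged into the finished HTML before the response is sent, so a crawler fetching index.asp sees the full navigation and never requests the include files themselves; disallowing the folder only blocks direct requests for the raw files. A minimal sketch, assuming classic ASP and a hypothetical /includes/ folder:

<%@ Language="VBScript" %>
<html>
<body>
<!-- Merged server-side before the response goes out;
     bots see the finished page, not these paths. -->
<!--#include virtual="/includes/header.asp" -->
<p>Page content here.</p>
<!--#include virtual="/includes/footer.asp" -->
</body>
</html>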

9:59 pm on Feb 20, 2009 (gmt 0)

10+ Year Member



You can also put a link to your sitemap in it. Every bot that reads the robots.txt can then access the sitemap.
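
The sitemaps.org protocol defines a directive for exactly this; it stands on its own, outside any User-agent group, and takes an absolute URL (example.com is a placeholder):

Sitemap: http://www.example.com/sitemap.xml
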
10:16 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Would that be as easy as adding an "Allow" tag?

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator

Allow: /sitemap/

Disallow: /js/
Disallow: /nav/

10:21 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



> would that be as easy as adding a "Allow" tag?

Only if *all* robots in your User-Agent list recognize "Allow," which is NOT part of the Standard for Robot Exclusion, but rather a semi-proprietary "extension" to the protocol.

Be sure to check the "webmaster info" page for each robot to confirm it supports "Allow," and do the same before using any other "extension" that is not universally supported, such as wildcard paths.

Jim

[edited by: jdMorgan at 10:26 pm (utc) on Feb. 20, 2009]
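
Where a bot's documentation does confirm support, Allow can sit inside that bot's own group. Note that Allow only does anything when it carves an exception out of a broader Disallow; a path nothing disallows is already crawlable. A sketch assuming Googlebot, which documents Allow support, and a hypothetical /nav/public/ subfolder:

User-agent: googlebot
Disallow: /nav/
Allow: /nav/public/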

10:46 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I want to point out that my robots.txt file is not for everyone. I've taken a blunt-force approach this year to blocking unwanted visitors, and the robots.txt was the first step. There is a bunch more in the works with firewalls and such. IncrediBILL and Ocean10000 got me started! Stay away from those two. :)

11:42 pm on Feb 20, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Well, I see no need for all of those additional bots to crawl my site... literally, I get traffic from 3 SEs and that's it... throw in a Lycos every once in a while...

2:54 pm on Feb 23, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



tonynoriega, I am curious: what do you do if a bot or visitor accesses one of the disallowed directories (like /nav/, for example)?

I use a robots.txt but it's blank, and instead I've set up other means to direct incoming traffic.

3:29 pm on Feb 23, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Well, if my robots.txt is correctly set up, they shouldn't access any directories I put in there... of course, I believe there are bots that just ignore the robots.txt and access those directories anyway...

I really can't do anything about that, I guess... I'm hoping that implementing this shows Google and the other SE crawlers that I'm taking the time to do things right and taking more ownership of my site.

And also, it may shave a few seconds of crawl time on my site.

3:44 pm on Feb 23, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Yes, OK. Some sites use the disallowed folders as honeypots and ban further access to the site, which leads to undesired results.

My experience is that SEs won't access restricted folders listed in the robots.txt by default, but they can be forced to access them by other means, via external links for instance. The same goes for anyone who uses a browser, and it is one of the reasons I do not use the robots.txt contents to direct traffic.
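
A sketch of the honeypot idea mentioned above: disallow a directory that nothing on the site links to, and treat any client that requests it as a bot that read robots.txt and probed anyway (the /trap/ path is hypothetical, and the banning mechanism is assumed, not shown):

# Nothing on the site links to /trap/; any request for it
# came from something that mined this file for targets.
User-agent: *
Disallow: /trap/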

3:49 pm on Feb 23, 2009 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



My experience is that SEs won't access restricted folders listed in the robots.txt by default, but they can be forced to access them by other means, via external links for instance.

I think you'll want to be careful here and make sure there are no root-level pages in the Disallowed directories. If there are, you will see URI-only listings when you perform site: searches.

Any directories that aren't meant for public consumption would of course be password protected. You don't need to list those in the robots.txt file; you're only providing a map if you do. Place a noindex, nofollow directive on the login page and be done with it.

 
