
Forum Moderators: goodroi


why use a robots.txt

11:28 pm on Feb 19, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


I'm debating whether or not to implement a robots.txt on our site...

I've been studying SEO for over 3 years now, have my site on the first page for most if not all of our key phrases, and know the ins and outs.

However, I have never seen a real big benefit to implementing a robots.txt file on my sites.

Why do you use a robots.txt file?

11:31 pm on Feb 19, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12169
votes: 56


1. Provide instructions for robots
2. Stop 404s for robots.txt requests
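
For #2, even an empty ruleset does the job. A minimal robots.txt that permits everything while still answering the request could look like this:

# Applies to every robot; an empty Disallow value blocks nothing.
User-agent: *
Disallow: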
1:31 pm on Feb 20, 2009 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3121
votes: 111


Reason #3 - Protect yourself in corporate meetings from someone claiming you don't know what you're doing because you don't have a robots.txt.
3:06 pm on Feb 20, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


@p1

Why provide them instructions? Why not let them access all files on the site?

What files do you not let them access? JavaScript files? Includes?

3:15 pm on Feb 20, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12169
votes: 56


I use the Whitelisting Method and block generic folders like /nav/, /js/, etc. There is no way I'm putting a road map in there for prying eyes to start probing our structure. I keep it simple. I only allow the 4 major SEs to crawl and allow access to a few online validation tools. Anything else that adheres to the standard is not permitted.

# Whitelist: only these search engine crawlers and validators
# are allowed in, and even they are kept out of the generic folders.
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /js/
Disallow: /nav/

# The AdSense crawler may fetch everything.
User-agent: Mediapartners-Google*
Disallow:

# Every other robot that honors the standard is shut out entirely.
User-agent: *
Disallow: /
8:26 pm on Feb 20, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


That's your entire robots.txt file?

Wow... a partner of mine sent me one that was about 100 lines long and had all kinds of stuff in it...

[edited by: tonynoriega at 8:35 pm (utc) on Feb. 20, 2009]

8:34 pm on Feb 20, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12169
votes: 56


A partner of mine sent me one that was about 100 lines long and had all kinds of stuff in it.

It all depends on the size of the site and the volume of dynamics. We like to handle things at the page level with a noindex, nofollow directive instead of relying on robots.txt. Also, if there were 100 lines in that file, that is surely providing information to prying eyes that maybe they shouldn't have quick access to.

The robots.txt is fine for "general" good bot blocking, but I wouldn't rely on it for managing crawler activity at the page level. We're also handling requests through various other routines and redirecting those to their appropriate destinations.
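
For reference, that page-level directive is the standard robots meta element placed in the head of each page, e.g. (hypothetical page):

<!-- Tells compliant crawlers not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">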

8:37 pm on Feb 20, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


Also, we have an includes folder that has a bunch of ASP includes in it... footer, header, etc.

If I disallow crawlers in there, are they not picking up my navigation?

Or, since they are included in index.asp, are they still presented to the bots?
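
(For what it's worth: server-side includes are resolved before the response leaves the server, so a bot requesting index.asp receives the fully assembled HTML, navigation included. A Disallow on the includes folder only blocks direct requests for the raw fragment files. A sketch, with a hypothetical path:)

<!-- In index.asp: this directive is processed server-side, so crawlers -->
<!-- fetching /index.asp see the rendered navigation, not this tag. -->
<!--#include virtual="/includes/nav.asp"-->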

9:59 pm on Feb 20, 2009 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2005
posts:614
votes: 0


You can also put a link to your sitemap in it. Every bot that reads robots.txt can then find the sitemap.
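
Under the sitemaps.org convention, that is a single line anywhere in the file, with a full URL (hypothetical location shown):

# Points any robot that reads this file at the XML sitemap.
Sitemap: http://www.example.com/sitemap.xml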
10:16 pm on Feb 20, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


Would that be as easy as adding an "Allow" tag?:

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator

Allow: /sitemap/

Disallow: /js/
Disallow: /nav/

10:21 pm on Feb 20, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


> Would that be as easy as adding an "Allow" tag?

Only if *all* robots in your User-Agent list recognize "Allow," which is NOT part of the Standard for Robot Exclusion, but rather a semi-proprietary "extension" to the protocol.

Be sure to check the "webmaster info" page for each robot to confirm it supports "Allow," and do the same before using any other "extension" that is not universally supported, such as wild-card paths.

Jim

[edited by: jdMorgan at 10:26 pm (utc) on Feb. 20, 2009]
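
(Worth adding: under the original exclusion standard, anything not matched by a Disallow line is crawlable by default, so "Allow" is only needed to carve an exception out of a broader Disallow. In the whitelist file above, leaving /sitemap/ out of the Disallow lines already permits it, sketched here:)

User-agent: googlebot
Disallow: /js/
Disallow: /nav/
# No Disallow matches /sitemap/, so these robots may crawl it
# by default; no "Allow" line is required.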

10:46 pm on Feb 20, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12169
votes: 56


I want to point out that my robots.txt file is not for everyone. I've taken a blunt-force approach this year to blocking unwanted visitors; the robots.txt was the first step. There is a bunch more in the works with firewalls and such. IncrediBILL and Ocean10000 got me started! Stay away from those two. :)
11:42 pm on Feb 20, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


Well, I see no need for all of those additional bots to crawl my site... literally, I get traffic from 3 SEs and that's it... throw in a Lycos every once in a while...
2:54 pm on Feb 23, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


tonynoriega, I am curious: what do you do if a bot or visitor accesses one of the disallowed directories (like /nav/, for example)?

I use a robots.txt, but it's blank; instead, I've set up other means to direct incoming traffic.

3:29 pm on Feb 23, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 8, 2006
posts:1232
votes: 0


Well, if my robots.txt is correctly set up, they shouldn't access any directories I put in there... of course, I believe there are bots that just ignore the robots.txt and access those directories anyway...

I really can't do anything about that, I guess... I'm hoping this shows Google and the other SE crawlers that I'm taking the time to implement it and taking more ownership of my site.

Also, it may shave a few seconds of crawl time on my site.

3:44 pm on Feb 23, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


Yes, OK. Some sites use the disallowed folders as honeypots and ban further access to the site for anything that requests them, which can lead to undesired results.

My experience is that SEs won't access restricted folders listed in the robots.txt by default, but they can be led to them by other means, via external links for instance. The same goes for anyone using a browser, which is one of the reasons I do not use the robots.txt content to direct traffic.
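
(A honeypot of that sort is typically just a Disallowed path that nothing legitimate links to; any client requesting it anyway has ignored robots.txt and can be banned at the server or firewall. A minimal sketch, with a hypothetical /trap/ path:)

# No compliant robot should ever request this path, so any hit on
# /trap/ in the access logs marks a client that ignores robots.txt.
User-agent: *
Disallow: /trap/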

3:49 pm on Feb 23, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12169
votes: 56


My experience is that SEs won't access restricted folders listed in the robots.txt by default, but they can be led to them by other means, via external links for instance.

I think you'll want to be careful here and make sure there are no root-level pages in the Disallowed directories. If there are, you will see URI-only listings when you perform site: searches.

Any directories that aren't for public consumption should of course be password protected. You don't need to list those in the robots.txt file; you're only providing a map if you do. Place a noindex, nofollow directive on the login page and be done with it.

 
