pageoneresults

msg:3853520 | 11:31 pm on Feb 19, 2009 (gmt 0) |
1. Provide instructions for robots
2. Stop 404s for robots.txt requests
|
goodroi

msg:3853924 | 1:31 pm on Feb 20, 2009 (gmt 0) |
Reason #3 - Protect yourself in corporate meetings from someone saying you don't know what you are doing because you don't have a robots.txt.
|
tonynoriega

msg:3853986 | 3:06 pm on Feb 20, 2009 (gmt 0) |
@p1 Why provide them instructions? Why not let them access all files on the site? What files do you not let them access? JavaScript files? Includes?
|
pageoneresults

msg:3853996 | 3:15 pm on Feb 20, 2009 (gmt 0) |
I use the Whitelisting Method and block generic folders like /nav/, /js/, etc. There is no way I'm putting a road map in there for prying eyes to start probing our structure. I keep it simple. I only allow the 4 major SEs to crawl and allow access to a few online validation tools. Anything else that adheres to the standard is not permitted.

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /js/
Disallow: /nav/

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /
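
For anyone who wants to sanity-check a whitelist like this before uploading it, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how that one parser reads the file; real crawlers may interpret it differently. The paths `/index.asp` and `/js/menu.js` and the agent name `SomeOtherBot` are made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The whitelist file from the post, one directive per line.
ROBOTS_TXT = """\
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /js/
Disallow: /nav/

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


whitelist = build_parser(ROBOTS_TXT)

# A whitelisted crawler may fetch ordinary pages but not the blocked folders.
print(whitelist.can_fetch("googlebot", "/index.asp"))
print(whitelist.can_fetch("googlebot", "/js/menu.js"))

# A compliant crawler that is not named falls through to "Disallow: /".
print(whitelist.can_fetch("SomeOtherBot", "/index.asp"))
```

This only tells you what a well-behaved robot should do. It does nothing about robots that ignore the file.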
|
tonynoriega

msg:3854242 | 8:26 pm on Feb 20, 2009 (gmt 0) |
That's your entire robots.txt file? Wow... a partner of mine sent me one that was about 100 lines long and had all kinds of stuff in it... [edited by: tonynoriega at 8:35 pm (utc) on Feb. 20, 2009]
|
pageoneresults

msg:3854262 | 8:34 pm on Feb 20, 2009 (gmt 0) |
> A partner of mine sent me one that was about 100 lines long and had all kinds of stuff in it.

It all depends on the size of the site and the volume of dynamics. We like to handle things at the page level with a noindex, nofollow directive instead of relying on robots.txt. Also, if there were 100 lines in that file, that is surely providing information to prying eyes that maybe they shouldn't have quick access to? The robots.txt is fine for "general" good bot blocking but I wouldn't rely on it for managing crawler activity at the page level. We're also handling requests through various other routines and redirecting those to their appropriate destinations.
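
A page-level directive normally means a `<meta name="robots" content="noindex, nofollow">` element in the page head. As a rough sketch, assuming you want to confirm a page actually carries it, the standard `html.parser` module can pull the directives out. The class name, function name and sample markup below are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without "=value".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives present in the markup."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "<title>Login</title>"
    "</head><body></body></html>"
)

print(sorted(robots_directives(PAGE)))
```

Keep in mind that a crawler has to be able to fetch the page to see the meta element, so a page-level directive cannot take effect on a URL that robots.txt already blocks.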
|
tonynoriega

msg:3854266 | 8:37 pm on Feb 20, 2009 (gmt 0) |
Also, we have an includes file that has a bunch of ASP includes in it... footer, header, etc. If I disallow crawlers in there, are they not picking up my navigation? Or, since they are included in the index.asp, are they still presented to the bots?
|
netchicken1

msg:3854328 | 9:59 pm on Feb 20, 2009 (gmt 0) |
You can also put a link to your sitemaps in it. Every bot that reads the robots.txt can then access the sitemap.
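
The link goes in as a `Sitemap:` line with a full URL. It sits outside the User-agent groups, so it is not tied to any one robot. Below is a small sketch of reading it back with Python's `urllib.robotparser` (`site_maps()` needs Python 3.8 or later). The example.com URL and the rest of the file are placeholders, not anyone's real setup.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt with a sitemap reference.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /js/

Sitemap: http://www.example.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
print(sitemap_parser.site_maps())
```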
|
tonynoriega

msg:3854346 | 10:16 pm on Feb 20, 2009 (gmt 0) |
Would that be as easy as adding an "Allow" tag?:

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Allow: /sitemap/
Disallow: /js/
Disallow: /nav/
|
jdMorgan

msg:3854353 | 10:21 pm on Feb 20, 2009 (gmt 0) |
> would that be as easy as adding a "Allow" tag?

Only if *all* robots in your User-Agent list recognize "Allow," which is NOT part of the Standard for Robot Exclusion, but rather a semi-proprietary "extension" to the protocol. Be sure to check the "webmaster info" page for each robot to be sure it supports "Allow," and before using any other "extension" which is not universally-supported, such as wild-card paths.

Jim

[edited by: jdMorgan at 10:26 pm (utc) on Feb. 20, 2009]
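
Worth noting for the group as posted: nothing in it disallows /sitemap/, so the folder is already crawlable for the listed robots with or without the "Allow" line. A quick sketch with Python's `urllib.robotparser` illustrates this. It shows one parser's reading only, not how any particular search engine behaves, and the helper name and the `/sitemap/sitemap.xml` path are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def can_fetch_sitemap(include_allow):
    """Check /sitemap/ access for googlebot, with or without the Allow line."""
    lines = ["User-agent: googlebot", "User-agent: slurp"]
    if include_allow:
        lines.append("Allow: /sitemap/")
    lines += [
        "Disallow: /js/",
        "Disallow: /nav/",
        "",
        "User-agent: *",
        "Disallow: /",
    ]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch("googlebot", "/sitemap/sitemap.xml")


print(can_fetch_sitemap(include_allow=True))
print(can_fetch_sitemap(include_allow=False))
```

Both calls give the same answer, so leaving "Allow" out sidesteps the question of which robots support it.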
|
pageoneresults

msg:3854366 | 10:46 pm on Feb 20, 2009 (gmt 0) |
I want to point out that my robots.txt file is not for everyone. I've taken a blunt force approach this year to blocking unwanted visitors; the robots.txt was the first step. There is a bunch more in the works with firewalls and such. IncrediBILL and Ocean10000 got me started! Stay away from those two. :)
|
tonynoriega

msg:3854398 | 11:42 pm on Feb 20, 2009 (gmt 0) |
Well, I see no need for all of those additional bots to crawl my site... literally I get traffic from 3 SEs and that's it... throw in a Lycos every once in a while...
|
enigma1

msg:3855809 | 2:54 pm on Feb 23, 2009 (gmt 0) |
tonynoriega, I am curious: what do you do if a bot or visitor accesses one of the disallowed directories (like /nav/, for example)? I use a robots.txt but it's blank, and instead I've set up other means to direct incoming traffic.
|
tonynoriega

msg:3855833 | 3:29 pm on Feb 23, 2009 (gmt 0) |
Well, if my robots.txt is correctly set up, they shouldn't access any directories I put in there... of course, I believe there are bots that just ignore the robots.txt and access that directory anyway... I really can't do anything, I guess... I'm hoping what this could do is show Google and the other SE crawlers that I'm taking the time to implement this, and take more ownership of my site. And also, it may shave a few seconds off the time crawlers spend on my site.
|
enigma1

msg:3855840 | 3:44 pm on Feb 23, 2009 (gmt 0) |
Yes, OK. Some sites use the disallowed folders as honeypots and ban further access to the site, which leads to undesired results. My experience is that SEs won't access restricted folders listed in the robots.txt by default, but they can be forced to access them by other means, via external links for instance. The same goes for anyone who uses a browser, and that is one of the reasons I do not use the robots.txt content to direct traffic.
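
To make the honeypot idea concrete, here is a rough sketch that flags clients which requested a disallowed folder. Everything in it is hypothetical: the prefixes are borrowed from the robots.txt posted earlier in the thread, and the IP addresses and paths are sample data. It only reports hits and does not ban anyone, since an ordinary browser loading a page will also request files under /js/.

```python
# Folders taken from the robots.txt discussed in this thread.
DISALLOWED_PREFIXES = ("/js/", "/nav/")


def disallowed_hits(requests, prefixes=DISALLOWED_PREFIXES):
    """Map each client IP to the disallowed paths it requested."""
    hits = {}
    for client_ip, path in requests:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if path.startswith(prefixes):
            hits.setdefault(client_ip, []).append(path)
    return hits


# Hypothetical (client IP, requested path) pairs, e.g. pulled from a log.
SAMPLE_REQUESTS = [
    ("192.0.2.10", "/index.asp"),
    ("192.0.2.10", "/nav/header.asp"),
    ("198.51.100.7", "/js/menu.js"),
    ("203.0.113.5", "/about.asp"),
]

print(disallowed_hits(SAMPLE_REQUESTS))
```

A hit on its own proves very little, which is the point made above: the request could come from a followed external link or a normal visitor just as easily as from a rogue bot.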
|
pageoneresults

msg:3855859 | 3:49 pm on Feb 23, 2009 (gmt 0) |
> My experience is that SEs won't access restricted folders listed in the robots.txt by default, but they can be forced to access them by other means, via external links for instance.

I think you'll want to be careful here and make sure there are no root level pages at the Disallowed directories. If there are, you will see URI only listings when you perform site: searches. Any directories that shouldn't be for public consumption would, of course, be password protected. You don't need to list those in the robots.txt file. You're only providing a map if you do. Place a noindex, nofollow directive at the login page and be done with it.
|