Forum Moderators: goodroi
I have a store with two occurrences of a CGI script - one for the main store and the other for pulling filtered data to group my products. The bottom line is that I can reach the same product via two methods - by design or by product type - and I end up with the same content on two different pages. The only difference is in the URL - the CGI script name. I'm concerned that Google will penalize me for duplicate content, so I thought about putting the CGI script I use for filtering in my Robots.txt file. I'm not sure how to do this without some help, though.
My store is located in a sub-domain:
http:// subdomain.domain.com
The sub-domain is physically a folder off my domain, and the sub-domain points to that folder.
I have only one CGI bin off the domain, but it is linked to the sub-domain.
The URL to the filtered product is:
http:// subdomain.domain.com/cgi-bin/store/filterXXX.cgi
To ban this directory do I put the following in my Robots.txt file?
User-agent: *
Disallow: /cgi-bin/store/filterXXX.cgi/
Also, when I set up the HTML pages in the sub-domain sub-directory, I placed a Robots.txt file in there as well. I assume that's the one I should add this rule to.
Thanks for any help.
1.) Please know that robots.txt files do not ban bad bots -- they only 'work' with bots that respect robots.txt. So while you can try to keep all bots out of a directory, it's not a sure thing, sorry.
2.) Just in case... Be sure that your robots.txt file is always lowercase (not Robots.txt), and that you upload it as ASCII/text.
3.) I have a domain and a subdomain and I have one robots.txt file at the top level of each one. The domain and sub-domain have different IPs so they're different sites, but the robots.txt files are identical, and identically accessible at the top level:
http:// www.domain.com/robots.txt
http:// subdomain.domain.com/robots.txt
It sounds like your domain-subdomain setup is somewhat dissimilar (nested directories as opposed to separate server accounts, public_html and cgi-bin directories, etc.), but I still recommend the same system of using identical robots.txt files, one for each 'site,' one in each directory...
public_html/robots.txt
public_html/subdomain/robots.txt
...thus resulting in the following URLs:
http:// www.domain.com/robots.txt
http:// subdomain.domain.com/robots.txt
4.) The idea of any robot, good or bad, directly accessing /cgi-bin is not a good thing. Thus I have the following in ALL of my robots.txt files, whether the instructions pertain to generic or specific bots:
GENERIC Example:
User-agent: *
Disallow: /cgi-bin
SPECIFIC Example:
User-agent: Googlebot
Disallow: /cgi-bin
(I use other techniques to limit access to cgi-bin but they're server-specific. If you're on an Apache server, check out that Forum [webmasterworld.com] and bone up on the crazily complex but effective feature that is "mod_rewrite".)
5.) Last but not least... You'll find more info in this Forum and at The Web Robots Pages [robotstxt.org].
Here's hoping all of the above is more illuminating than confusing!
I do have this setup and it results in the same URL examples you listed. I am not on separate server accounts though. My robots.txt files are a little different because I have different stores in the domain and sub-domain.
"4.) The idea of any robot, good or bad, directly accessing /cgi-bin is not a good thing."
I've added this to both. Do I need to add a specific if I have the general? Thanks.
"(I use other techniques to limit access to cgi-bin but they're server-specific. If you're on an Apache server, check out that Forum and bone up on the crazily complex but effective feature that is "mod_rewrite".)"
Uh-oh - it sounds like there is more to worry about.
1.) "Do I need to add a specific if I have the general?"
If by general you mean...
/cgi-bin
...that's A-OK because it will Disallow the /cgi-bin directory itself as well as every URL whose path begins with "/cgi-bin" - Disallow matches by path prefix. (This is my preference; your mileage may vary.)
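That prefix behavior is easy to check with Python's standard-library robots.txt parser. Here's a quick sketch - the rule is the generic example from above, and the test URLs are hypothetical paths on the setup described in this thread:

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the generic rule discussed above
# instead of fetching a live robots.txt file.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin",
])

# Disallow matches by path prefix, so the directory itself
# and everything beneath it are blocked...
print(rp.can_fetch("*", "http://subdomain.domain.com/cgi-bin"))                 # False
print(rp.can_fetch("*", "http://subdomain.domain.com/cgi-bin/store/shop.cgi"))  # False

# ...but a path that merely *contains* "cgi-bin" further along is not:
print(rp.can_fetch("*", "http://subdomain.domain.com/docs/cgi-bin-notes.html")) # True
```

So a single `Disallow: /cgi-bin` line covers the whole script directory for any bot that honors robots.txt.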
2.) "Uh-oh - it sounds like there is more to worry about."
There is ALWAYS more to worry about. :)
But you're paying attention to details and that will stand you in good stead no matter what. So will paying attention to each of your domains' server logs because they tell you your sites' vital statistics -- and also which bots run roughshod over robots.txt (thus becoming immediately ban-worthy).
Does anyone have any feedback for my original post? I actually have everything set up under the same cgi-bin, but the URLs differ because the filter adds a directory off the script directory, so my products show up with two different URLs. This isn't related to my domain setup at all; it's because I have a regular product page and a "by product type" filter. I need to block one so it isn't spidered.
These two URLs end up on the same product page:
http:// sub-domain.domain.com/cgi-bin/store/shop.cgi/product1
http:// sub-domain.domain.com/cgi-bin/store/shop.cgi/filtered_section/product1
Thanks
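Based on the two URLs you listed, a Disallow on the longer prefix should block only the filtered copy while leaving the main product pages crawlable. Here's a sketch checking that rule with Python's standard-library robots.txt parser - it assumes the filtered duplicates all live under the one /filtered_section/ prefix shown above:

```python
from urllib.robotparser import RobotFileParser

# Proposed rule: block only the filtered duplicates,
# leaving the main shop.cgi product pages crawlable.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/store/shop.cgi/filtered_section",
])

base = "http://sub-domain.domain.com"
print(rp.can_fetch("*", base + "/cgi-bin/store/shop.cgi/product1"))                   # True
print(rp.can_fetch("*", base + "/cgi-bin/store/shop.cgi/filtered_section/product1"))  # False
```

That rule would go in the robots.txt at the sub-domain's top level. Note it only keeps compliant bots off the duplicate URLs; it doesn't remove the duplication itself.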