Forum Moderators: goodroi


Disallow for sub-domain directory?


palmpal

7:28 am on Jan 22, 2006 (gmt 0)

10+ Year Member



Hello,

I have a store with two occurrences of a CGI script - one for the main store and the other for pulling filtered data to group my products. The bottom line is that I can get to the same product via two methods - by design or by product type - and I end up with the same content on two different pages. The only difference is the URL - the CGI script name. I am concerned that Google will penalize me for duplicate content, so I thought about putting the CGI script that I use for filtering in my Robots.txt file. I'm not sure how to do this, though, without some help.

My store is located in a sub-domain:
http://subdomain.domain.com

The sub-domain is physically a folder off my domain; the sub-domain name points to that folder.

I only have one CGI bin off the domain but it is linked to the sub-domain.

The URL to the filtered product is located:
http://subdomain.domain.com/cgi-bin/store/filterXXX.cgi

To ban this directory do I put the following in my Robots.txt file?

User-agent: *
Disallow: /cgi-bin/store/filterXXX.cgi/
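As an aside, the trailing slash in a rule like that matters, because a Disallow line is a simple path-prefix match. This can be checked with Python's standard-library robots.txt parser; the host and script names below are just the thread's placeholders:

```python
from urllib.robotparser import RobotFileParser

def allowed(rules: str, url: str) -> bool:
    """Parse a robots.txt body and ask whether '*' may fetch the URL."""
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

base = "http://subdomain.domain.com/cgi-bin/store/filterXXX.cgi"

# With a trailing slash, only URLs *under* the script path are blocked:
with_slash = "User-agent: *\nDisallow: /cgi-bin/store/filterXXX.cgi/"
print(allowed(with_slash, base))                # True  (the script itself stays crawlable)
print(allowed(with_slash, base + "/product1"))  # False

# Without the slash, the prefix match blocks the script itself as well:
no_slash = "User-agent: *\nDisallow: /cgi-bin/store/filterXXX.cgi"
print(allowed(no_slash, base))                  # False
```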

Also, when I set up the HTML pages in the sub-domain's sub-directory, I placed a Robots.txt file in there as well. I assume that is the one I would add this to.

Thanks for any help.

Pfui

3:26 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your questions have been sitting for a bit so I'll take my best shot:)

1.) Please know that robots.txt files do not ban bad bots -- they only 'work' with those bots respecting robots.txt. So while you can try to keep all bots out of any directory, it's not a sure thing, sorry.

2.) Just in case... Be sure that your robots.txt file is always lowercase (not Robots.txt), and that you upload it as ASCII/text.

3.) I have a domain and a subdomain and I have one robots.txt file at the top level of each one. The domain and sub-domain have different IPs so they're different sites, but the robots.txt files are identical, and identically accessible at the top level:

http://www.domain.com/robots.txt
http://subdomain.domain.com/robots.txt

It sounds like your domain-subdomain setup is somewhat dissimilar (nested directories as opposed to separate server accounts, public_html and cgi-bin directories, etc.), but I still recommend the same system of using identical robots.txt files, one for each 'site,' one in each directory...

public_html/robots.txt
public_html/subdomain/robots.txt

...thus resulting in the following URLs:

http://www.domain.com/robots.txt
http://subdomain.domain.com/robots.txt

4.) The idea of any robot, good or bad, directly accessing /cgi-bin is not a good thing. Thus I have the following in ALL of my robots.txt files, whether the instructions pertain to generic or specific bots:

GENERIC Example:

User-agent: *
Disallow: /cgi-bin

SPECIFIC Example:

User-agent: Googlebot
Disallow: /cgi-bin
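Those two stanzas behave as expected under Python's standard-library parser; a quick sketch (the host name is the thread's placeholder, and "SomeOtherBot" is an invented name for a bot the file doesn't mention):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /cgi-bin
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is blocked from anything whose path starts with /cgi-bin...
print(rp.can_fetch("Googlebot", "http://subdomain.domain.com/cgi-bin/store/shop.cgi"))      # False
# ...but a bot not named in the file falls through to "allowed".
print(rp.can_fetch("SomeOtherBot", "http://subdomain.domain.com/cgi-bin/store/shop.cgi"))   # True
```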

(I use other techniques to limit access to cgi-bin but they're server-specific. If you're on an Apache server, check out that Forum [webmasterworld.com] and bone up on the crazily complex but effective feature that is "mod_rewrite".)

5.) Last but not least... You'll find more info in this Forum and at The Web Robots Pages [robotstxt.org].

Here's hoping all of the above is more illuminating than confusing!

Dijkgraaf

9:01 pm on Jan 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In the main robots.txt file you will probably want to disallow the sub-domain's directory to avoid a duplicate-content penalty.
Otherwise a robot could request
http://domain.com/sub-domain/...
and get the same content as
http://subdomain.domain.com/...
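A sketch of what that main-domain rule could look like, assuming (hypothetically) the sub-domain's folder off the main docroot is named /subdomain/ - substitute the real folder name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for http://domain.com/robots.txt only;
# the folder name /subdomain/ is a stand-in for the real one.
rules = "User-agent: *\nDisallow: /subdomain/"

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The duplicate path through the main domain is blocked...
print(rp.can_fetch("*", "http://domain.com/subdomain/cgi-bin/store/shop.cgi"))  # False
# ...while the rest of the main site stays crawlable.
print(rp.can_fetch("*", "http://domain.com/index.html"))                        # True
```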

palmpal

2:07 pm on Jan 26, 2006 (gmt 0)

10+ Year Member



Thanks for responding!


"I have a domain and a subdomain and I have one robots.txt file at the top level of each one."

I do have this setup and it results in the same URL examples you listed. I am not on separate server accounts though. My robots.txt files are a little different because I have different stores in the domain and sub-domain.

"4.) The idea of any robot, good or bad, directly accessing /cgi-bin is not a good thing."

I've added this to both. Do I need to add a specific if I have the general? Thanks.

"(I use other techniques to limit access to cgi-bin but they're server-specific. If you're on an Apache server, check out that Forum and bone up on the crazily complex but effective feature that is "mod_rewrite".)"

Uh-oh - it sounds like there is more to worry about.

palmpal

2:12 pm on Jan 26, 2006 (gmt 0)

10+ Year Member



Dijkgraaf - I actually did have this in my domain robots.txt file, but I deleted the trailing slash on my directory Disallow. It sounds like that is better than:

/sub-domain directory/

Pfui

3:30 pm on Jan 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



palmpal, hi back atcha!

1.) "Do I need to add a specific if I have the general?"

If by general you mean...

/cgi-bin

...that's A-OK because Disallow works as a prefix match -- it blocks the /cgi-bin directory and anything whose URL path begins with /cgi-bin. (This is my preference; your mileage may vary.)

2.) "Uh-oh - it sounds like there is more to worry about."

There is ALWAYS more to worry about:)

But you're paying attention to details and that will stand you in good stead no matter what. So will paying attention to each of your domains' server logs because they tell you your sites' vital statistics -- and also which bots run roughshod over robots.txt (thus becoming immediately ban-worthy).

palmpal

4:18 pm on Jan 26, 2006 (gmt 0)

10+ Year Member



That is on my list of things to do. Apparently my FastStats program is analyzing the domain and sub-domain together, so I have to make a change to separate the log files. My web host does not offer support for this setup (although they have done quite a bit for me over the years).

Does anyone have any feedback for my original post? I actually have things set up under the same cgi-bin, but the URL is different because the filter adds a directory off the script directory. My products show up with two different URLs. This is not related to my domain at all but to the fact that I have a regular product page and a "by product type" filter. I need to ban one so it is not spidered.

These two URLs end up on the same product page:

http://sub-domain.domain.com/cgi-bin/store/shop.cgi/product1

http://sub-domain.domain.com/cgi-bin/store/shop.cgi/filtered_section/product1

Thanks
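For what it's worth, and assuming every filtered URL shares the /filtered_section/ path segment (that segment and the product names are the thread's placeholders), a single prefix Disallow in the sub-domain's robots.txt would block only the filtered duplicates while leaving the canonical pages crawlable. A sketch with Python's standard-library parser:

```python
from urllib.robotparser import RobotFileParser

rules = "User-agent: *\nDisallow: /cgi-bin/store/shop.cgi/filtered_section/"

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The filtered duplicate is blocked...
print(rp.can_fetch("*", "http://sub-domain.domain.com/cgi-bin/store/shop.cgi/filtered_section/product1"))  # False
# ...while the canonical product URL stays crawlable.
print(rp.can_fetch("*", "http://sub-domain.domain.com/cgi-bin/store/shop.cgi/product1"))                   # True
```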