Exclude by filename, not directory?

joergnw10

2:33 pm on Sep 29, 2005 (gmt 0)

10+ Year Member



Google has started to index printable versions of my product pages, ignoring the robots meta tag.
I would now like to exclude these files via robots.txt - but each of them sits in a different product folder. Is there a way of disallowing all files ending like
/product1/productname1PRINT.htm
/product2/productname2PRINT.htm
....

by just including something that has the effect of *PRINT.htm?

tommytx

3:02 pm on Sep 29, 2005 (gmt 0)

10+ Year Member



Isn't that a good thing? Even though you don't necessarily want folks to go there, one more page in Google can't be bad unless it has private info you don't want shown. Does it actually start printing automatically, which would of course annoy them? Could you use a JavaScript link at the top, something like <a href="javascript:window.print()">Print</a>, so printing only happens when the visitor chooses it? The syntax may not be exact, but you get the idea.

joergnw10

3:22 pm on Sep 29, 2005 (gmt 0)

10+ Year Member



Thanks Tommytx. The printable pages are basically duplicates of the product pages, just with smaller pictures. And there are no links on them at all, so if someone lands straight on one of these pages from Google there is nowhere else for them to go. The printer dialog does come up automatically via JavaScript, but you still need to choose whether to print or cancel.
I could change all this if necessary - but for the moment it would be much easier if I could just add a line to the robots file.

joergnw10

8:18 am on Oct 4, 2005 (gmt 0)

10+ Year Member



Right, so I guess there is no way to make this work. I don't suppose /print.htm would solve this?

Dijkgraaf

11:34 pm on Oct 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No easy solution, I'm afraid. You may need to restructure your files a bit: create a print folder, e.g. /print/product1/productname1.htm, and then you can use Disallow: /print
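
As a rough sketch, assuming the printable pages really are all moved under a single /print/ folder, the robots.txt would then only need one standard rule (the trailing slash keeps it from also matching paths such as /printable.htm):

User-agent: *
Disallow: /print/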

Lord Majestic

1:12 pm on Oct 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google supports pattern matching (it is not part of the de facto standard and won't be supported by other crawlers), so if you want to target Google specifically you can use the * wildcard in a section for User-agent: Googlebot
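
For example, a sketch of a Googlebot-only section using Google's non-standard wildcard (the filename pattern is taken from the paths earlier in the thread; crawlers that don't understand * would need standard rules in their own sections):

User-agent: Googlebot
Disallow: /*PRINT.htm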

joergnw10

6:49 pm on Oct 6, 2005 (gmt 0)

10+ Year Member



Thanks everyone. I have decided to put in some work after all and have moved the files into a separate folder, so now all bots are disallowed from them. I will keep the Google solution in mind for the future though. Would the correct syntax have been
/*print.htm
?

Lord Majestic

6:52 pm on Oct 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, that looks like the correct syntax for Google, though I can't recommend this approach.

joergnw10

7:01 pm on Oct 6, 2005 (gmt 0)

10+ Year Member



Why not?

Lord Majestic

7:09 pm on Oct 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's a non-standard extension -- you have to cater for other bots anyway, so using it is kind of pointless unless you want only Googlebot to crawl your site and nobody else.

joergnw10

7:25 pm on Oct 6, 2005 (gmt 0)

10+ Year Member



I see what you mean. It's just that I thought it might come in handy if you think Google has a stricter policy regarding duplicate content - then you could disallow Google from those pages, while the other bots could still crawl the whole site (including, as in this case, the printable/duplicate versions of pages), and you might get more pages indexed with the other search engines.

Lord Majestic

11:37 pm on Oct 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My line of reasoning is this: if you think you need to exclude some pages from crawling, then you will need to use standard robots.txt methods for the other bots, which means no wildcards or pattern matching. So if you do it for all search engines (and why wouldn't you want to?), then adding wildcard entries just for Google becomes redundant.

idolw

5:55 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So does that mean it is impossible to prevent certain files or even individual URLs from being crawled?

Lord Majestic

6:04 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, it does not mean that - it just means that you can't do it EASILY without adding lots of lines to robots.txt.
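
To illustrate the standard, wildcard-free approach being described: every printable page gets its own Disallow line (the paths below are just the examples from the start of the thread):

User-agent: *
Disallow: /product1/productname1PRINT.htm
Disallow: /product2/productname2PRINT.htm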

idolw

6:52 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, Lord Majestic! How about the following situation?
I have the following mess on my site:
5 subdomains that exist only virtually, for programming purposes. They are not real subdomains (different categories) but only URL rewrites. All files are kept in one directory, no matter which subdomain they belong to. In fact, our files are only certain parts of the pages visited by the user, as the menus have been prepared once and are loaded according to the URL the user types in.
Example:
if you type in the URL subdomain1.mysite.com/pagesubdomain1 it will show you a page belonging to the topic of subdomain1, with menus characteristic of subdomain1;
however, if you type in the URL subdomain2.mysite.com/pagesubdomain1 it will show you the same page but with different menus.

That solution is very convenient for programming (no need to transfer information between real subdomains and no need to purchase multiple SSL certificates, one per subdomain). Unfortunately, it is very susceptible to any mistake in link placement.
We did not realize that until recently, when a mistake was made. SEs indexed the same pages under various subdomains, making our own pages duplicate content, which may lead to a penalty.

Here comes the question:
how do I use robots.txt in this particular situation? Do we just need to create 'real' subdomains (directories on the server), create several robots.txt files, one per subdomain, disallow the URLs that don't belong to that subdomain, and put the files into the purpose-created subdomain directories?

Lord Majestic

7:18 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The standard requires crawlers to fetch robots.txt for the full host name (including subdomain and optional port), so if you don't want SEs crawling data on those subdomains then each subdomain will have to serve its own robots.txt.

idolw

7:23 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So creating a real directory for each virtual subdomain, only to hold a robots.txt file, is the solution?
And each robots.txt file should list the URLs that we do not want to be crawled?
Am I right?

Lord Majestic

7:30 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, you can have a script served under the name robots.txt: it would check whether the requested host is a subdomain and, if so, return the robots.txt for subdomains (say, disallowing all URLs), but if it is the main domain then return the standard robots.txt.
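
A minimal sketch of that idea in Python (the host name and rules are invented for illustration; in practice the same check would live in whatever already serves your pages, with requests for /robots.txt routed to it):

from http.server import BaseHTTPRequestHandler, HTTPServer

MAIN_HOST = "www.mysite.com"  # hypothetical main domain

# Normal rules for the main site vs. "block everything" for the virtual subdomains.
MAIN_RULES = "User-agent: *\nDisallow: /print/\n"
SUBDOMAIN_RULES = "User-agent: *\nDisallow: /\n"

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        # Strip any port and compare the requested host against the main domain.
        host = (self.headers.get("Host") or "").split(":")[0].lower()
        body = (MAIN_RULES if host == MAIN_HOST else SUBDOMAIN_RULES).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), RobotsHandler).serve_forever()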

idolw

7:36 pm on Oct 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that, Lord Majestic!
I apologize for spamming you with sticky