Forum Moderators: phranque

Banning all bots from URLs containing certain words

One final effort at un-indexing a bunch of query-stringed URLs

MatthewHSE

12:39 pm on Jul 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I've explained in a few previous posts on this topic, my site uses a lot of query-string URLs to perform certain actions. For instance, users can click a link to highlight a forum post. The URL to highlight the post is dynamically generated just for that post and has a long query string attached. Naturally, Google (and other bots) "click" all those links. But the links don't actually "go" anywhere; they only reload the same page. There are other similar situations on my site. As a result, I have tons of pages listed on Google that really don't need to be there, and I'm concerned that I may eventually get penalized for duplicate content.

So, is there any way I can use .htaccess to serve all robots a 403 when they try to visit a URL that contains certain words? If so, how would I do it? I'd like to be able to filter on a few keywords so those URLs don't get indexed anymore; after that I'll get going on the long and tedious task of removing all those URLs from the Google index . . .

Unless, of course, duplicate content won't be a problem in this case and the extra "pages" might help my rankings? ;)

Thanks,

Matthew

robotsdobetter

12:48 pm on Jul 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you tried cloaking it? Or how about using JavaScript for those links?

Cloaking would be best: just have the search engines' bots redirected to the main page for that subject.

MatthewHSE

1:05 pm on Jul 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Unfortunately JavaScript isn't an option; I didn't write the scripts that generate those links. I'm also reluctant to try cloaking, since this would be my first attempt at it and I know you can wind up with serious SE woes if cloaking backfires. (That said, is there a simple, safe way to cloak for this purpose, and where would I start to get some good info?)

robotsdobetter

1:24 pm on Jul 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think you should have a problem with this, because you are not using it to "spam" the search engines. Not only that, the search engines have a hard time detecting it.

Take a look at these...
[webmasterworld.com...]

[dmoz.org...]

[webmasterworld.com...]

Sanenet

3:31 pm on Jul 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If one of the query strings is present in the URL, insert a robots meta tag with "noindex, noarchive, nofollow" into the page's head. That way Google will still visit the page, but won't keep a copy, won't index it, and won't follow links off it.
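For anyone unsure what that looks like, a minimal sketch of the tag to emit in the page's head when your script detects one of the query-string keywords (the detection logic itself lives in whatever generates the page):

```html
<!-- Output only when the URL's query string matches one of your keywords -->
<meta name="robots" content="noindex,noarchive,nofollow">
```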

rubenski

7:44 am on Jul 14, 2004 (gmt 0)

10+ Year Member



What Sanenet suggests would be a good solution. If that is not possible, you could use mod_rewrite (assuming you are running the Apache web server).

I don't know exactly what rule you would need, as I don't know what your URLs look like. In any case, you need a two-part rule using RewriteCond and RewriteRule. Here you would use RewriteCond to check the User-Agent header and see if it is a bot.
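As a rough sketch, something like the following in .htaccess — the bot names and the query-string keyword "highlight" are placeholders, so substitute your own:

```apache
RewriteEngine On
# If the user-agent looks like a known bot (case-insensitive match)...
RewriteCond %{HTTP_USER_AGENT} (googlebot|slurp|msnbot) [NC]
# ...and the query string contains one of the keywords...
RewriteCond %{QUERY_STRING} highlight [NC]
# ...answer with 403 Forbidden instead of serving the page.
RewriteRule .* - [F]
```

Note that RewriteRule patterns don't match the query string, which is why the keyword check goes in a RewriteCond against %{QUERY_STRING}.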

There is one issue with this, however: serving different content to SE bots is itself something you can get penalized for. I don't know whether that applies to a 403 as well, though.