
Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt for Query Strings
Can I stop bots from following links with query strings?
Nick_W




msg:1529380
 6:38 am on Oct 3, 2003 (gmt 0)

So it could index and follow on index.php

...but NOT follow/index index.php?x=43&y=454

with me?

A shove in the right direction would be most appreciated ;-)

Nick

 

BlueSky




msg:1529381
 6:58 am on Oct 3, 2003 (gmt 0)

I think Googlebot is the only one that reads wildcard patterns (a limited, regex-like syntax) in robots.txt, and he does it quite well too. That's what I use to keep him in line on my site. If index.php is in the top-level directory, you could do something like this:

User-agent: Googlebot
Disallow: /index.php?x=*$

For the others, can you move the script into a directory of its own so you can disallow that directory? If not, I think you can stop them via rewrite rules that feed them, say, a 410 when they try to access the link with variables.
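A minimal sketch of how a crawler that honors the two special characters mentioned in this thread (`*` as a wildcard and `$` as end-of-URL) might test a path against a Disallow rule. The function names `rule_to_regex` and `is_disallowed` are made up for illustration; this is not Googlebot's actual code, only an approximation of the matching behavior described above.

```python
import re


def rule_to_regex(rule_path):
    """Translate a Disallow path containing * and $ into a compiled regex.

    Hypothetical helper: '*' matches any run of characters, and a trailing
    '$' anchors the rule to the end of the URL. Everything else is literal.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with ".*" for each '*'.
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(rule_path, url_path):
    """True if url_path (path plus query string) matches the rule from its start."""
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    for url in ("/index.php", "/index.php?x=43&y=454", "/index.php?y=1"):
        print(url, is_disallowed("/index.php?x=*$", url))
```

Under these assumptions, `/index.php?x=*$` only catches URLs whose query string begins with `x=`, while the bare `/index.php` stays crawlable.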

Nick_W




msg:1529382
 7:08 am on Oct 3, 2003 (gmt 0)


User-agent: Googlebot
Disallow: /index.php?x=*$

Sounds great! Could it be expanded to disallow any query string on index.php?

Nick

BlueSky




msg:1529383
 7:30 am on Oct 3, 2003 (gmt 0)

Sure. I think only the wildcard * and the end-of-string $ are allowed. Maybe do it like this:

Disallow: /index.php?*$

One thing you might want to consider is modifying your script to add the noindex, nofollow meta tag on pages with variables in the URL. That is what I did on certain features, and the bots don't touch those pages, except little Googlebot. So I ended up using wildcard patterns in robots.txt to keep him away from those.

[edit corrected typos]

[edited by: BlueSky at 7:35 am (utc) on Oct. 3, 2003]
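A rough sketch of the meta tag idea from the post above, for a script that can be modified. The function name `robots_meta_tag` is hypothetical, and the sketch assumes the page template can call it with the requested URL and drop the result into the page's head section.

```python
from urllib.parse import urlsplit


def robots_meta_tag(url):
    """Return a noindex, nofollow meta tag for URLs that carry a query string.

    Hypothetical helper: returns an empty string for URLs without variables,
    so the plain index.php page stays indexable.
    """
    if urlsplit(url).query:
        return '<meta name="robots" content="noindex, nofollow">'
    return ""


if __name__ == "__main__":
    print(repr(robots_meta_tag("/index.php")))
    print(repr(robots_meta_tag("/index.php?x=43&y=454")))
```

Note that a URL ending in a bare `?` has an empty query string, so this sketch treats it like the plain page.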

Nick_W




msg:1529384
 7:33 am on Oct 3, 2003 (gmt 0)

Unfortunately, modifying the script is not an option. It's a very complex pre-made script and it could take me a week to do ;)

I'll give your suggestion a try though. Thanks very much for the help, much appreciated!

Nick

Yidaki




msg:1529385
 8:07 am on Oct 3, 2003 (gmt 0)

Nick, actually I have the same problem. On September 21 I asked the same question:
"Robots.txt disallow: /index.php? Then /index.php?param=example still allowed?" [webmasterworld.com]

It looks like a gray area where nobody seems to have a definite answer; not even the robots specs cover this. From a look at Google's own robots.txt, it seems that at least Google has an answer for this:

Disallow: /mac?

But www.google.com/mac [google.com] is indexed.

So I *guess* that index.php will get indexed but index.php?param=foo will not if index.php? is disallowed. I suppose you wouldn't even have to use an asterisk. OTOH, Google doesn't treat robots.txt the same way other bots do, so I'm not sure how they would behave ...

I really need an answer to this because I want to avoid being crawled for duplicate content (rewritten URLs + dynamic URLs). Might be a good idea to run a test ...
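A small sketch of the guess above, assuming a crawler that does plain prefix matching (a rule blocks any URL that starts with the rule's path, as the original robots.txt convention describes). The function name `blocked_by_prefix` is made up for illustration, and real bots may behave differently, which is exactly what a live test would have to confirm.

```python
def blocked_by_prefix(disallow_path, url_path):
    """True if url_path starts with disallow_path (plain prefix matching).

    An empty Disallow value is treated as "nothing is blocked".
    """
    if not disallow_path:
        return False
    return url_path.startswith(disallow_path)


if __name__ == "__main__":
    rule = "/index.php?"
    for url in ("/index.php", "/index.php?param=foo", "/mac"):
        print(url, blocked_by_prefix(rule, url))
```

Under that assumption, no asterisk is needed: `/index.php?` already covers every URL with a query string on index.php, while leaving `/index.php` itself open.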

Nick_W




msg:1529386
 8:26 am on Oct 3, 2003 (gmt 0)

Let ya know in a few days ;)

Nick

Yidaki




msg:1529387
 8:40 am on Oct 3, 2003 (gmt 0)

>Let ya know in a few days

So you're gonna run the test!? Great, thanks! Awaiting your report! (:

Nick_W




msg:1529388
 7:35 am on Oct 15, 2003 (gmt 0)

Well,

Disallow index.php?

does NOT stop G indexing index.php with any query string. Trying index.php?*$ now...

Nick

Yidaki




msg:1529389
 6:08 pm on Oct 15, 2003 (gmt 0)

>does NOT stop G indexing index.php with any query string.

Hm... that would mean Google expects not to be crawled in such cases (since they disallow /mac?), but Googlebot himself DOES crawl such URLs. Funny!

Thanks for testing, Nick!


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved