Forum Moderators: phranque


Dynamic pages - what a headache!


mrdch

2:44 pm on Feb 23, 2003 (gmt 0)

10+ Year Member



I am trying to figure out how to block Google from indexing dynamic pages. I looked on this site here [webmasterworld.com] and here [webmasterworld.com] and even here [webmasterworld.com] and I still don't know how to do it :(

What is the right syntax - if there is one...

The dynamic pages are all in one directory. I still want them to work for users, but I don't want Google to grab them. On the other hand, I want it to pick up the page WITHOUT any parameters! So, the following is what I need...

OK for users and Google:
/mysite/dynamic/page.php

OK for users but NOT for Google
/mysite/dynamic/page.php?version=1

So how to do it?

Thanks for any help!

MC

hakre

2:53 pm on Feb 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hi mrdch,

robots.txt might not be your best friend in this case; use meta tags for this:

<meta name="robots" content="noindex" />

and place it into the head of the html returned by /mysite/dynamic/page.php?version=1 only.
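here is a minimal sketch of how page.php could decide that on its own (this assumes any query string at all marks the version you want kept out of the index - adjust the check to your own parameters):

<?php
// sketch: emit the noindex tag only when the request arrived
// with query-string parameters (e.g. ?version=1)
$noindex = !empty($_GET);
?>
<html>
<head>
<?php if ($noindex) { ?>
<meta name="robots" content="noindex" />
<?php } ?>
<title>your title here</title>
</head>
<body>
...rest of your content here...
</body>
</html>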

mrdch

11:57 pm on Feb 23, 2003 (gmt 0)

10+ Year Member



Hi Hakre,

Thanks for the advice - I can see how it can be really handy!

Problem is - I don't ACTUALLY have pages with a parameter - but I don't want those URLs to turn into an error for users, in case someone still has them bookmarked or something like that.

So my question still remains:

Does anyone know of the correct SYNTAX to keep Googlebot from spidering dynamic pages - using the robots.txt file?

Thanks

MC

aspdesigner

10:31 am on Feb 25, 2003 (gmt 0)

10+ Year Member



If Google supports the "newer" working draft version (if you can call 1997 "new"!) of the robots.txt standard here -

[robotstxt.org...]

and if Google's description of their extensions is correct, then this should work -

User-agent: Googlebot
Allow: /mysite/dynamic/page.php$
Disallow: /mysite/dynamic

Per the working draft spec above, the robot should use the first matching pattern. If the URL has no parameters, then the first pattern should match, and the page will be read. If the URL contains parameters, then the first pattern will NOT match, but the second pattern (the disallow) will.

You will need to try this to see if it works, as it depends on Google supporting the later "working draft" spec, as well as supporting the "$" extension in an "Allow" line.

hakre

10:40 am on Feb 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hi aspdesigner,

doesn't the disallow section override the allow section in that working draft version? do you know?

mrdch

11:03 am on Feb 25, 2003 (gmt 0)

10+ Year Member



Hi,

I asked Lisa, a Forum member, how exactly she blocked dynamic pages from being spidered.

Her reply was:

User-Agent: googlebot
Disallow: /*.html?

For my own purposes, I am going to try

User-Agent: googlebot
Disallow: /*.php?

as I don't need any dynamic pages to be spidered.

Hope that helps

MC

aspdesigner

11:40 am on Feb 25, 2003 (gmt 0)

10+ Year Member




doesn't the disallow section override the allow section in that working draft version? do you know?


No, it says first matching line -

"To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used."

jmilk

6:07 pm on Mar 4, 2003 (gmt 0)

10+ Year Member



At the top of your .php scripts, simply check the user agent (in $_SERVER['HTTP_USER_AGENT']) and if it's google, clear all your parameters...

<?php
// If Googlebot is requesting the page, drop the incoming
// parameters so the script serves the parameter-free version.
if (strstr(strtoupper($_SERVER['HTTP_USER_AGENT']), "GOOGLE")) {
    unset($_GET['id']);
    unset($_GET['param']);
    unset($_GET['req']); // etc...
}

// ...rest of your content here...
?>

andreasfriedrich

6:44 pm on Mar 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld [webmasterworld.com] jmilk.

Be sure to read Marcia's WebmasterWorld Welcome and Guide to the Basics [webmasterworld.com] post.

Could you please explain what you are trying to achieve with this approach? AFAIK this won't really help with query strings in the URL since a) when you reset the parameters Googlebot will have already requested that page, and you cannot change the URL unless you return a 301 status code and give the new location in the Location header field, and b) most scripts rely on the parameters to know which content to produce.
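As a sketch, such a 301 could be done at the top of the script like this (the host name is only a placeholder):

<?php
// Sketch: permanently redirect any request that arrives with a
// query string to the same script without parameters.
if (!empty($_SERVER['QUERY_STRING'])) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . $_SERVER['PHP_SELF']);
    exit;
}
?>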

Andreas

jmilk

8:36 am on Mar 5, 2003 (gmt 0)

10+ Year Member



Thanks for the kind welcome. I guess I was a bit quick to the keys there, misreading the first message "...don't want Google to grab them. On the other hand, I want it to pick the page WITHOUT any parameters!"

I read it as him not wanting google to supply any parameters to his scripts -- this is similar to a problem I ran into myself recently, where I basically clear the parameters if the page is accessed without a valid referrer.
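Roughly like this, in case it's useful to anyone (just a sketch - the host name is a placeholder, and you may want a stricter referrer check):

<?php
// Sketch: drop all query-string parameters unless the request
// came with a referrer from our own site.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (!strstr($referer, 'www.example.com')) {
    $_GET = array(); // clear everything the query string supplied
}
?>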

I'll be sure to read the _entire_ thread from now on before answering :)