Forum Moderators: goodroi


disallowing based on part of a query string

         

jibbajabba

12:55 pm on Nov 25, 2003 (gmt 0)



I have PHP scripts with URLs like:

[mydomain.com...]

I want robots to crawl those pages, but not pages where the query string is:

[mydomain.com...]

It's a bit unusual, but basically the pages have numbers for filenames (e.g. 1, 2, 3...), and each page is a PHP script built by another page-publishing app. As far as I can tell there's no way to do the exclusion via robots.txt, but I thought I'd check. I'd appreciate other suggestions as well.

Thanks for any help.

bakedjake

5:45 pm on Nov 28, 2003 (gmt 0)




jdMorgan brings up a good point here [webmasterworld.com]:

Since a query string is not technically part of a URL (it is instead an argument passed to an agent at a specific URL), is a robot expected/required to recognize different query string values as part of the URL for the purposes of matching a Disallow directive? My guess is that it is not a good idea to depend on any standard behaviour of different robots with respect to query strings. This may be another good argument in favor of using URL rewriting to make dynamic URLs look like static ones.

I'd recommend following his advice and rewriting the URLs if you can. Barring that, you could deny spiders access in PHP by detecting the User-Agent of known spiders and serving them only the print-friendly page.
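The rewriting approach could look something like this in an Apache .htaccess file; this is only a sketch, and the script name (page.php), parameter (id), and URL pattern are assumptions, not anything from the thread:

```apache
RewriteEngine On

# Serve a "static-looking" URL like /articles/123.html from the
# underlying dynamic script (names here are hypothetical)
RewriteRule ^articles/([0-9]+)\.html$ /page.php?id=$1 [L]
```

You would then publish and link only the static-looking URLs, so that's all the crawlers ever see.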

I've had mixed luck with query-string parameters in robots.txt in the past; it's simply not reliable.
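The user-agent check mentioned above could be sketched roughly like this (shown in Python for brevity; the spider names and URLs are assumptions for illustration, not a definitive list):

```python
# Minimal sketch of serving known crawlers only the print-friendly page.
# The spider substrings and URL names below are assumptions.
KNOWN_SPIDERS = ("googlebot", "slurp", "msnbot")

def is_known_spider(user_agent: str) -> bool:
    """Case-insensitive substring match against known crawler names."""
    ua = user_agent.lower()
    return any(bot in ua for bot in KNOWN_SPIDERS)

def url_for(user_agent: str, page_id: int) -> str:
    """Send crawlers to the print-friendly page, everyone else to the full page."""
    if is_known_spider(user_agent):
        return f"/print.php?id={page_id}"  # hypothetical print-friendly URL
    return f"/page.php?id={page_id}"       # hypothetical dynamic URL
```

The obvious caveat is that this only catches spiders you know about; anything with an unrecognized User-Agent gets the normal page.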

drisol

4:39 pm on Dec 2, 2003 (gmt 0)



Hello,

I have a similar problem and I don't want to open a new thread.

I don't want spiders to crawl my dynamic pages.
They should only be allowed to crawl my "static" web pages, which match the following patterns:
- domain.com/xx,xx,xx,xx.html
- domain.com/faq,xx,xx,****x.html
- domain.com/forum,xx,xx,****x.html

I tried it in the following way:


User-agent: *
Disallow: /?p=
Disallow: /?l=
Disallow: /?refID=
Disallow: /?sesID=

(all of my dynamic pages start with one of these URLs)
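For what it's worth, the original robots.txt standard matches Disallow values as plain URL prefixes, with no wildcard support. A quick sketch with Python's urllib.robotparser, which implements that prefix matching (example.com is a placeholder domain), shows how rules like the ones above are read:

```python
# Sketch: how a prefix-matching robots.txt parser reads these rules.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /?p=
Disallow: /?l=
Disallow: /?refID=
Disallow: /?sesID=
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The homepage itself is still allowed:
print(parser.can_fetch("*", "http://example.com/"))              # True
# A URL beginning with /?p= is blocked:
print(parser.can_fetch("*", "http://example.com/?p=42"))         # False
# Matching is by prefix, so other paths are unaffected:
print(parser.can_fetch("*", "http://example.com/faq,1,2.html"))  # True
```

By that reading, `Disallow: /?p=` only blocks query strings on the root path itself, and different crawlers may treat query strings differently anyway, so behaviour like the above is not guaranteed across spiders.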

After adding this to my robots.txt, Googlebot spidered only my homepage "/" and none of my other pages.

Can someone explain this?
Is there a solution to my problem?
It should work with all crawlers, not only with Googlebot ( Disallow: /*? )
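On that Disallow: /*? point: the * wildcard is an extension that Googlebot (and some other major crawlers) understand, but it is not part of the original robots.txt standard, so a sketch like the following only covers crawlers that support it:

```
# Wildcard extension: understood by Googlebot and some other
# crawlers, but not guaranteed for all of them
User-agent: Googlebot
Disallow: /*?p=
Disallow: /*?l=
Disallow: /*?refID=
Disallow: /*?sesID=
```

For everything else, URL rewriting (so crawlers only ever see the static-looking URLs) is the safer bet.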

Thanks in advance,
Daniel