Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt disallow: /index.php?
then /index.php?param=example still allowed?
Yidaki




msg:1527474
 6:20 pm on Sep 20, 2003 (gmt 0)

I want to allow Googlebot to crawl my index page, index.php, but I want to disallow index.php?param=example. So if I add the robots.txt entry Disallow: /index.php? will index.php still get crawled? I saw in Google's own robots.txt [google.com] that they themselves exclude /mac? but I assume this would still allow /mac ...!? Is this valid? Can I use it? I couldn't find ANY info about it, either on Google or within the W3C specs ...

In clear words:

Disallow: /this.php?

-> /this.php?param=example
==> DOES NOT get crawled
-> /this.php
==> DOES get crawled?

 

keeper




msg:1527475
 2:26 am on Sep 21, 2003 (gmt 0)

Following on from your Google example,
A direct request to:
/mac
Shows this URL has both pagerank and cache info.
However:
/groups
(which is blocked in the robots.txt) has neither.

From this behaviour I would take the leap and say that:
/this.php will be crawled and indexed if your robots.txt has a directive to disallow: /this.php?

At least for Google; I'm not sure how the other engines will handle it. I checked the robots documentation as well and couldn't find any specific examples.
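
For illustration, here is a minimal sketch of the prefix matching described in A Standard for Robots Exclusion, assuming the robot compares each Disallow value against the requested path with the query string still attached. That assumption is exactly what this thread is unsure about, and the helper name is_disallowed is hypothetical, not part of any real crawler.

```python
from typing import Iterable


def is_disallowed(path_and_query: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if any non-empty Disallow value is a prefix of the request.

    Assumption: the robot matches against path + "?" + query, not the path alone.
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(bool(rule) and path_and_query.startswith(rule) for rule in disallow_rules)


rules = ["/this.php?"]  # i.e. "Disallow: /this.php?"

for request in ("/this.php?param=example", "/this.php"):
    verdict = "blocked" if is_disallowed(request, rules) else "allowed"
    print(f"{request} -> {verdict}")
```

Under that assumption, /this.php?param=example is blocked and /this.php is allowed, because the rule's trailing "?" never appears in the bare path. A robot that strips the query string before matching would behave differently.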

moltar




msg:1527476
 3:42 am on Sep 21, 2003 (gmt 0)

Hmm. I have the reverse situation. I blocked every SE from accessing /cgi-bin/script.pl, but Googlebot still picked up all the pages with parameters (/cgi-bin/script.pl?something=here). Now there are a bunch of them in the index, but they have no info.

jdMorgan




msg:1527477
 4:44 am on Sep 21, 2003 (gmt 0)

Regarding Moltar's comment:
> but they have no info.

Google and Ask Jeeves have a behaviour which is different from most other spiders: If either of these spiders finds a link to a page, they will list the page, regardless of whether robots.txt disallows crawling of that page. If the page is disallowed, they won't crawl (fetch) it, but they will list it by URL. Other search engine spiders interpret a Disallow as meaning "don't mention this page at all," but the "listing" behaviour of Google and AJ is not explicitly defined by A Standard for Robots Exclusion; all it describes is fetching behaviour.

Yidaki brings up another grey-area question: Since a query string is not technically part of a URL (it is instead an argument passed to an agent at a specific URL), then is a robot expected/required to recognize different query string values as part of the URL for the purposes of matching a Disallow directive? My guess is that it is not a good idea to depend on any standard behaviour of different robots with respect to query strings. This may be another good argument in favor of using URL rewriting to make dynamic URLs look like static ones.

Just some comments...
Jim
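
To make the grey area concrete, the sketch below shows how the same Disallow value gives opposite answers depending on whether a robot keeps or drops the query string before matching. The function name match_target and the example.com URL are hypothetical, and no claim is made here about which behaviour any particular spider actually follows.

```python
from urllib.parse import urlsplit


def match_target(url: str, include_query: bool) -> str:
    """Build the string a robot would compare against Disallow values.

    include_query=True  -> path plus "?query" (query treated as part of the URL)
    include_query=False -> path only (query treated as an argument, not the URL)
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if include_query and parts.query:
        target += "?" + parts.query
    return target


rule = "/index.php?"  # i.e. "Disallow: /index.php?"
url = "http://www.example.com/index.php?param=example"

for include_query in (True, False):
    target = match_target(url, include_query)
    blocked = target.startswith(rule)
    print(f"include_query={include_query}: {target} blocked={blocked}")
```

With the query string kept, the rule matches and the URL is blocked. With it dropped, the target is just /index.php and the rule never matches, which is why relying on one interpretation across all robots looks risky.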

voltus




msg:1527478
 12:17 pm on Sep 21, 2003 (gmt 0)

In my case, Google crawls pages that use parameters, like this:

subcategory.php?param=16&subcat=blablaa ==> allowed
subcategory.php?param=16 ==> allowed
or even just page.php ==> allowed

So don't worry about using parameters, as long as you can see that your pages are being crawled by Googlebot.

I have more than 2000 pages and the number is still increasing, all using PHP with parameters.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved