mod_rewrite and robots.txt

General discussion on using mod_rewrite with robots.txt

         

Vaddy

1:06 am on Apr 6, 2005 (gmt 0)

10+ Year Member



Hi to all pros and newbies!

I think this is probably not an easy topic to discuss, but let's try to get an answer.

Let's say we have dynamic pages such as

/index.php?page=1 and
/index.php?page=2

Let's assume we use mod_rewrite to make them more user- and bot-friendly and convert them to something like:

/apples.html and
/oranges.html

Our httpd.conf entries will look something like this:

RewriteRule ^/apples\.html$ /index.php?page=1
RewriteRule ^/oranges\.html$ /index.php?page=2

We want to keep our site clean, so in addition to replacing the dynamic links with pseudo-static ones (apples.html and oranges.html) we want to play it safe and add the following record to robots.txt:

Disallow: /index.php

Question: let's say we have a huge site and possibly missed a few links to the dynamic pages. What is going to happen?

1. The spider will index all HTML pages and ignore all index.php URLs
2. The spider will index all pages, including index.php
3. The spider will ignore all pages because of the RewriteRule

Does anyone here have a CLEAR understanding of this?

Thank you,

Vaddy

ThomasB

12:25 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Vaddy, first of all welcome to WebmasterWorld!

The second option will take place. The spider will index all pages except index.php. The 3rd option is very unlikely, as the spider doesn't know about the URL rewriting; it could only find the un-rewritten URLs through links to them, and those would not be indexed because of the correct robots.txt file. You might want to rename index.php to content.php or similar, as some search engines have had problems in the past differentiating between the following URLs:
example.com/
example.com/index.html
example.com/index.php
example.com/index.asp
example.com/default.html

On some occasions they were all treated as the same page. In that case you might lose your index page, which you probably would not like.
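
To make the reasoning above concrete, here is a minimal Python sketch of how a crawler might apply the Disallow record. It assumes plain prefix matching against the requested URL path; the helper name is_blocked and the list of requested paths are hypothetical, and real crawlers may differ in detail.

```python
# Hypothetical helper: a simplified robots.txt check using plain prefix matching.
def is_blocked(url_path, disallow_rules):
    """Return True if url_path starts with any non-empty Disallow prefix."""
    return any(bool(rule) and url_path.startswith(rule) for rule in disallow_rules)

rules = ["/index.php"]  # taken from "Disallow: /index.php"

# The crawler only sees the URLs it requests. It never sees the internal
# target that mod_rewrite maps /apples.html to on the server side.
requested = ["/apples.html", "/oranges.html", "/index.php?page=1", "/index.php", "/"]
crawlable = [path for path in requested if not is_blocked(path, rules)]
print(crawlable)  # ['/apples.html', '/oranges.html', '/']
```

Under these assumptions the rewritten .html URLs stay crawlable, and only the missed links that still point at index.php are skipped.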

Vaddy

3:06 pm on Apr 6, 2005 (gmt 0)

10+ Year Member



Hi Thomas,

Thank you for the reply. I guess you mean the "first option"?

The spider will crawl all pages except those beginning with index.php. That excludes any index.php?...... URLs with variables. In other words, the crawler will not index any pages beginning with index.php.....

What about the site's default page?

Let's say the site has index.php as its default page...

www.site.com/ >>>> index.php is set as default page.

I guess a good workaround would be to change robots.txt to:

Disallow: /index.php?

Vad
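
A small Python sketch of the difference between the two records, again assuming simple prefix matching (the function name blocked_by_prefix is hypothetical and this is not any particular crawler's implementation):

```python
# Hypothetical helper: does a single Disallow prefix block this URL path?
def blocked_by_prefix(url_path, prefix):
    return bool(prefix) and url_path.startswith(prefix)

paths = ["/", "/index.php", "/index.php?page=1"]

# For each path: (blocked by "/index.php", blocked by "/index.php?")
comparison = {
    path: (blocked_by_prefix(path, "/index.php"), blocked_by_prefix(path, "/index.php?"))
    for path in paths
}
for path, (broad, narrow) in comparison.items():
    print(path, broad, narrow)
```

With the narrower "/index.php?" prefix, the bare /index.php and the site root are no longer blocked, while every URL that carries a query string still is.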

ThomasB

7:02 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, I did mean the 1st option. Your workaround should be fine for Google if you add an "*" at the end:
/index.php?*

Vaddy

9:04 pm on Apr 6, 2005 (gmt 0)

10+ Year Member



Thank you!

Do I really need the wildcard "*"?

ThomasB

4:18 pm on Apr 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd say so, though I haven't tested it myself.

Vaddy

4:34 pm on Apr 7, 2005 (gmt 0)

10+ Year Member



I tested it yesterday: no, you do not need it.
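
One way to see why the trailing "*" would be redundant is the Python sketch below. It assumes a wildcard-aware matcher in which "*" stands for any run of characters and every rule is matched from the start of the path. The helper names are hypothetical, and the "$" end anchor that some crawlers support is deliberately left out.

```python
import re

# Hypothetical helper: turn a Disallow value into a regex, treating '*' as
# "any run of characters". Everything else is matched literally.
def rule_to_regex(rule):
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))

def rule_matches(rule, url_path):
    # re.match anchors at the start only, so a rule acts as a prefix pattern.
    return rule_to_regex(rule).match(url_path) is not None

paths = ["/", "/index.php", "/index.php?", "/index.php?page=1", "/apples.html"]
same_result = all(
    rule_matches("/index.php?", p) == rule_matches("/index.php?*", p) for p in paths
)
print(same_result)  # True
```

Because a rule already matches as a prefix, appending "*" changes nothing in this model, which is consistent with the test result reported above.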