Block Search engines from Spidering My Dynamic Pages

Invision Power Board

         

chopin2256

5:52 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



I have an Invision Power Board forum, and I used a mod to convert all dynamic pages to html. I want only the html files to be spidered, but I am not sure which forum files to block in robots.txt. The forum is stored like this:

www.example.com/forum/

I have blocked one file so far, /forum/index.php, but Invision Power Board shows that Google is still spidering all my dynamic links. I don't want this; I want only the html pages spidered. How do I know which files to block?

There are 2 files that came with the mod, and both are stored in this directory: www.example.com/forum/

There is an htaccess file and a php file... so does this mean I should block ALL files except the 2 mod files? Can someone help me figure out which files to block in robots.txt? I'd be willing to write out all the directory folders and files if someone thinks they can help me. Thanks!

jdMorgan

8:24 pm on Aug 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The most effective approach - the one that works across all search engines, regardless of their support for query strings in robots.txt - is to 301-redirect the dynamic URLs to the static URLs. It's a bit tricky to do this while avoiding an infinite loop, but can be done.

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /path_to_dynamic_pages\.php
RewriteRule ^path_to_dynamic_pages\.php$ http://example.com/path_to_static_pages.html [R=301,L]

The above is intended to show the RewriteCond/RewriteRule construct needed to do the redirect without causing it to loop when combined with your existing rewrite. It does not show the back-references needed to redirect to the correct static page.

The use of %{THE_REQUEST} prevents the redirect from taking place if the original HTTP request was for a static URL, but was rewritten to a dynamic URL by your existing rule.

For reference, the value of %{THE_REQUEST} is exactly what you see in your raw access logs, something like

GET /widgets.php?color=blue HTTP/1.1
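
As a rough illustration of why testing the request line avoids the loop, here is a small Python sketch (not Apache code). It assumes the dynamic script is /widgets.php, as in the log line above; the static name widgets-blue.html and the function name are made up for the example.

```python
import re

# Same shape as the RewriteCond pattern above, applied to the raw request line.
# Assumption: the dynamic script is /widgets.php, as in the sample log line.
DYNAMIC_REQUEST = re.compile(r"^[A-Z]+ /widgets\.php")


def should_redirect(the_request: str) -> bool:
    """Return True only when the client itself asked for the dynamic URL."""
    return DYNAMIC_REQUEST.search(the_request) is not None


# The client asked for the dynamic URL directly: redirect it.
print(should_redirect("GET /widgets.php?color=blue HTTP/1.1"))  # True

# The client asked for a (hypothetical) static URL. An internal rewrite to
# widgets.php does not change the request line, so there is no redirect
# and therefore no loop.
print(should_redirect("GET /widgets-blue.html HTTP/1.1"))  # False
```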


Jim

chopin2256

11:24 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



Thank you for your response. Actually, this way seems better than blocking the spiders, so thank you for bringing it to my attention. I'd like to give you some example code, so maybe you can help me integrate what you said into it. This is only part of the code.

-------------code-------------------
RewriteEngine On

# DO THE TOPIC URLS
RewriteRule ^(.*)-t([0-9]*)-s([0-9]*)\.html(.*)$ index.php?showtopic=$2&st=$3
RewriteRule ^(.*)-t([0-9]*)\.html(.*)$ index.php?showtopic=$2$3
--------------end of code--------------------

So this changes all topic urls to html. However, the coder forgot to 301 redirect the urls! How would I integrate the 301's into this code?

jdMorgan

11:52 pm on Aug 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The coder did it almost right; you might want to add an [L] flag to each of those rules for the sake of efficiency.

Note above that I said you need to add additional rules to 301-redirect client-requested dynamic URLs to the static ones. So you need to add to what you have, not modify it or replace it.

You'll need to do the 'reverse transform function' of the patterns in the rules you already have, and plug that into the example code I posted. Post your best effort and we can help.

The major problem I see is that it may be impossible to determine the part of the .html filepath that precedes "-t" from the information in the dynamic URL. That is a problem you'll need to solve at the page-naming level before implementing the redirect code. In order to make this work, the static and dynamic URLs must contain the same information; it can be in a different form, but each must contain all the information needed to unambiguously reconstruct the other. Maybe this is not a problem. If any dynamic URL that contains "-t" maps to a static URL that begins with "topic" then there is no problem, but I can't tell from the examples posted here.

Jim

chopin2256

1:27 am on Aug 4, 2005 (gmt 0)

10+ Year Member



"If any dynamic URL that contains "-t" maps to a static URL that begins with "topic" then there is no problem, but I can't tell from the examples posted here."

An example of how the -t works:

The forum is labeled "Music", so that's tagged with -f in the html name. The html will look like music-f32.html

Let's say the topics in the forum are labeled as follows:

Mikes Piano Composition
Chopin's sonata
Listen to my music now

The html's will look like this:

Mikes-Piano-Composition-t45.html
Chopins-sonata-t23.html
Listen-to-my-music-now-t345.html

Now the dynamics may look like this:

index.php?showtopic=214

The 214 would be what would follow the -t. (example-t214.html)
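
The naming scheme in the examples above could be sketched in Python like this. It is only a guess at what the mod does: the function name is hypothetical, and the rule (drop apostrophes, turn spaces into hyphens, append -t and the topic number) is inferred from the three sample titles, so the real mod may treat other characters differently.

```python
import re


def static_topic_url(title: str, topic_id: int) -> str:
    """Build a static file name such as 'Chopins-sonata-t23.html'.

    Assumption: apostrophes are dropped and runs of whitespace become a
    single hyphen, which is what the three sample titles suggest.
    """
    slug = title.replace("'", "")
    slug = re.sub(r"\s+", "-", slug.strip())
    return f"{slug}-t{topic_id}.html"


print(static_topic_url("Mikes Piano Composition", 45))  # Mikes-Piano-Composition-t45.html
print(static_topic_url("Chopin's sonata", 23))          # Chopins-sonata-t23.html
print(static_topic_url("Listen to my music now", 345))  # Listen-to-my-music-now-t345.html
```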

Everything follows this format. Everything works beautifully, except now I have duplicate content. I don't know if this helps, but let me know what else I can do. Do you need more code?

PS. I sent you the URL of my site if it would help.

jdMorgan

3:48 am on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



On many sites, it is not possible to rewrite from static to dynamic and also redirect from dynamic to static, using only a mod_rewrite solution, because the original rewrite was coded so that information was lost.

A mod_rewrite solution only works with static and dynamic URLs that mirror each other, so that each type of URL can be constructed, given only the other. The static and dynamic URLs must be 'symmetrical' -- easy to convert from one to the other and back again with no loss of information.

So the problem is that given "Chopins-sonata-t23.html" mod_rewrite can correctly produce "index.php?showtopic=23".

But given only "index.php?showtopic=23" mod_rewrite cannot produce "Chopins-sonata-t23.html"; the "Chopins-sonata" information is lost.
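
To make the loss of information concrete, here is a small Python sketch that applies the same regular expression as the second topic rule posted earlier. The function name and the second file name are invented for the demonstration.

```python
import re

# Same pattern as the rule: ^(.*)-t([0-9]*)\.html(.*)$
STATIC_TOPIC = re.compile(r"^(.*)-t([0-9]*)\.html(.*)$")


def to_dynamic(path: str):
    """Mimic the static-to-dynamic rewrite: index.php?showtopic=$2$3."""
    match = STATIC_TOPIC.match(path)
    if match is None:
        return None
    # Group 1 (the title part) is captured but never used in the target.
    return f"index.php?showtopic={match.group(2)}{match.group(3)}"


print(to_dynamic("Chopins-sonata-t23.html"))  # index.php?showtopic=23
print(to_dynamic("Anything-else-t23.html"))   # index.php?showtopic=23
```

Two different static names produce the same dynamic URL, so no pattern can run the mapping backwards without some outside source for the title.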

The simplest answer may be to move the dynamic-to-static 301-redirect function into PHP, where you can look up the topic number (23) in your database, produce the full static URL (Chopins-sonata-t23.html), and then issue a 301-redirect from within PHP. You will still need to use the server variable %{THE_REQUEST} and pass that to your PHP script, in order to avoid the rewrite-redirect loop problem. From your posted examples, it looks like you can't do it with a pure mod_rewrite approach.
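
The lookup-then-redirect logic might look roughly like the following, sketched in Python rather than PHP so it can be run on its own. Everything here is an assumption for illustration: TOPIC_TITLES stands in for the forum database, the /forum/ prefix comes from the first post, the slug rule is guessed from the sample file names, and the st= page offset is ignored.

```python
import re

# Hypothetical stand-in for the forum database: topic id -> topic title.
TOPIC_TITLES = {23: "Chopin's sonata", 45: "Mikes Piano Composition"}

# Only a request line in which the client itself asked for the dynamic URL matches.
DYNAMIC_TOPIC = re.compile(r"^[A-Z]+ /forum/index\.php\?showtopic=([0-9]+)[& ]")


def redirect_for(the_request: str):
    """Return (301, location) for a client-requested dynamic topic URL, else None."""
    match = DYNAMIC_TOPIC.match(the_request)
    if match is None:
        return None  # static URL or some other page: serve normally, no loop
    title = TOPIC_TITLES.get(int(match.group(1)))
    if title is None:
        return None  # unknown topic: nothing to redirect to
    # Assumed slug rule: drop apostrophes, spaces become hyphens.
    slug = re.sub(r"\s+", "-", title.replace("'", "").strip())
    return 301, f"/forum/{slug}-t{match.group(1)}.html"


print(redirect_for("GET /forum/index.php?showtopic=23 HTTP/1.1"))
# (301, '/forum/Chopins-sonata-t23.html')
print(redirect_for("GET /forum/Chopins-sonata-t23.html HTTP/1.1"))
# None
```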

Jim

chopin2256

4:28 am on Aug 4, 2005 (gmt 0)

10+ Year Member



I don't think I'm going to be able to do this (it is very complicated), but here are a few more attempts. Remember, the mod came with 2 files: an htaccess file and a php file.

This is what I am thinking to put in the htaccess file for the topics section, according to what you said:

------------htaccess code--------------------------
#Do the topics
RewriteCond {THE_REQUEST} ^[A-Z]+\ /index.php?showtopic=$2$3\.php

RewriteRule ^(.*)-t([0-9]*)-s([0-9]*)\.html(.*)$ index.php?showtopic=$2&st=$3 [R=301,L]

RewriteRule ^(.*)-t([0-9]*)\.html(.*)$ index.php?showtopic=$2$3 [R=301,L]
-------------------end-----------------

Along with this code, the corresponding php code exists...and this is it:

------------php code---------------------------
// Do the topics
$ibforums->skin['_wrapper'] = preg_replace("#index.php\?showtopic=([0-9]*)\"#ie","\$FURL->create_topic_url('\\1')", $ibforums->skin['_wrapper'],1);
$ibforums->skin['_wrapper'] = preg_replace("#index.php\?showtopic=([0-9]*)&hl=\"#ie","\$FURL->create_topic_url('\\1')", $ibforums->skin['_wrapper'],1);
$ibforums->skin['_wrapper'] = preg_replace("#index.php\?showtopic=([0-9]*)&st=([0-9]*)\"#ie","\$FURL->create_topic_url('\\1','\\2')", $ibforums->skin['_wrapper'],1);
-------------------end----------------------------

Now, from what you say, I have to modify these php lines in order to make the 301's work?

jd01

9:27 pm on Aug 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Guys-

This is a slight variation of what I came up with over in the Yahoo forum. It appeared to accomplish the goal and be more efficient than the code that was being presented. By removing the (.*) catch-all from the beginning of the rule, we only have to verify the end of the URL for a match; we aren't passing the beginning anyway.

I had not seen the other letters (EG -f for forum) involved, so my example was not quite accurate, but I think we could accomplish the same thing in a better way for the future growth of the board:

RewriteRule ([a-z]{1})([0-9]+)\.html$ /index.php?showtopic=$2 [L]

Added: Obviously won't work the way it is, but wanted to let you know the direction I was going.

We could also use the same rules, and just remove the start anchor and the 'catch-all'. I am not sure why the (.*) is after the .html, but if the URLs end in .html this one is also unnecessary. I have removed the R=301 flag from below, since our goal is static and it was saying the static URL had permanently moved to the dynamic version.

RewriteRule -t([0-9]+)-s([0-9]+)\.html(.*)$ index.php?showtopic=$1&st=$2 [L]

RewriteRule -t([0-9]+)\.html(.*)$ index.php?showtopic=$1$2 [L]

Chopin - Sorry I have been so long in getting back to you. I tried to think of a solution that you could implement without a great deal of work and learning a new language or two, but the only thing I could come up with is Jim's idea: do it in the PHP, or you really can't do it...

My suggestion is this:
1. Make sure every link points to the static version of the page.
2. Double check and make sure all rewrites to static URLs are working.
3. Deny access to the php file with a query string using:

RewriteCond %{THE_REQUEST} index\.php\? [NC]
RewriteRule \.php$ - [F]

This will deny access whenever the client's original request contains index.php followed by a ? character. You will still be able to access the file index.php itself, but if there is a ? after it, access will be denied.
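
For anyone who wants to check that behaviour before touching the server, here is a minimal Python sketch of the two tests made by the rule pair above. It is a simulation, not Apache itself; the function name and the sample request lines are made up, and `path` stands for the URL path that RewriteRule sees (without the query string).

```python
import re

COND = re.compile(r"index\.php\?", re.IGNORECASE)  # RewriteCond pattern with [NC]
RULE = re.compile(r"\.php$")                        # RewriteRule pattern


def is_forbidden(the_request: str, path: str) -> bool:
    """True when both the rule pattern and the condition match (403 response)."""
    return RULE.search(path) is not None and COND.search(the_request) is not None


# Client asked for the dynamic URL with a query string: denied.
print(is_forbidden("GET /forum/index.php?showtopic=23 HTTP/1.1", "index.php"))  # True
# Client asked for plain index.php: allowed.
print(is_forbidden("GET /forum/index.php HTTP/1.1", "index.php"))  # False
# Client asked for a static page that was internally rewritten to index.php:
# the request line still shows the .html name, so it is allowed.
print(is_forbidden("GET /forum/Chopins-sonata-t23.html HTTP/1.1", "index.php"))  # False
```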

Since the forum is new and there probably are not too many URLs, I think it's best to just send away anyone who tries to access them.

If you do not want to block access, you could always redirect back to the index. That is not ideal, but it would send the spiders back through the site instead of sending them away.

RewriteCond %{THE_REQUEST} index\.php\? [NC]
RewriteRule \.php$ /index.php [R=301,L]

Justin