Problem rewriting Googlebot to HTML mirror page for Flash site

Jijwilhosting

12:02 pm on Apr 13, 2010 (gmt 0)

10+ Year Member



Hi,

I need to create an HTML mirrored version of a Flash site, in order to improve its SERP ranking. The HTML page(s) will only restructure the content that is already present in the Flash site, so no black hatting is involved here.

I've been experimenting with various RewriteCond and RewriteRule directives, but to no avail. Here's what I currently have:

RewriteEngine On
RewriteBase /

RewriteCond %{HTTP_USER_AGENT} .*Googlebot.*
RewriteCond %{QUERY_STRING} !.*spider_delegate.php.*
RewriteRule ^(.*)$ spider_delegate.php?url=$1


The back-reference $1 always contains the script's own filename, rather than what I'm expecting, namely the actual REQUEST_URI of the original request. The Flash pages are obtainable through URLs formatted like so: http://www.example.com/#/projects . I want to provide HTML content at the same place for Google, but the rest of the world would see the Flash site.

It would be greatly appreciated if anybody could point me in the right direction, because I'm stumped!

Thanks, Ro

jdMorgan

1:48 pm on Apr 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looks like you checked the wrong variable in your second RewriteCond. You likely want to check $1 itself or %{REQUEST_URI} for loop-prevention (to prevent spider_delegate.php from being rewritten to itself recursively, as it is doing now). The two methods are identical in this case, except for the presence of a leading slash on the value in %{REQUEST_URI}.

You may wish to validate Googlebot requests first (as shown in the second code snippet below) before doing this rewrite.

RewriteCond %{HTTP_USER_AGENT} Googlebot/
RewriteCond $1 !^spider_delegate\.php$
RewriteRule ^(.*)$ /spider_delegate.php?url=$1 [L]
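For comparison, here is a sketch of the same rule using %{REQUEST_URI} for loop prevention instead of $1. It assumes the code lives in the .htaccess file in the document root and that spider_delegate.php sits in that same directory, as the original post suggests. The comments trace what appears to happen with the original rule set in that context.

```apache
# With the original rules, a Googlebot request for /projects seems to go like this:
#   pass 1: $1 = "projects"; the query string is empty, so the negated
#           QUERY_STRING condition is true and the request is rewritten
#           to spider_delegate.php?url=projects
#   pass 2: $1 = "spider_delegate.php"; the query string is "url=projects",
#           which still does not contain the script name, so the rule
#           fires again and the script ends up with url=spider_delegate.php
#
# Checking the requested URL-path instead stops the second pass.
# Note the leading slash, which %{REQUEST_URI} has and $1 does not.
RewriteCond %{HTTP_USER_AGENT} Googlebot/
RewriteCond %{REQUEST_URI} !^/spider_delegate\.php$
RewriteRule ^(.*)$ /spider_delegate.php?url=$1 [L]
```

Use one form or the other, not both; in this setup they should behave the same.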

Note that it is not necessary to put ".*" at the start of a pattern if that pattern is not start-anchored with a "^" and it is similarly unnecessary to put ".*" at the end of a pattern if that pattern is not end-anchored with a "$". See the regular-expressions tutorial cited in our Apache Forum Charter.
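As a small illustration of the anchoring point, the conditions below are alternatives that should accept the same set of user-agent strings. Only the last line is active; the others are left as comments so the block can be pasted without stacking redundant conditions.

```apache
# An unanchored pattern matches anywhere in the tested string,
# so all three of these are equivalent:
#   RewriteCond %{HTTP_USER_AGENT} .*Googlebot.*
#   RewriteCond %{HTTP_USER_AGENT} ^.*Googlebot.*$
#   RewriteCond %{HTTP_USER_AGENT} Googlebot
#
# Anchors are only needed when position matters, for example to
# require that the string *starts* with a given token:
#   RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0
RewriteCond %{HTTP_USER_AGENT} Googlebot
```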

Literal periods (and other reserved characters) in regex patterns should be escaped with a "\" as shown.

Neither RewriteRule patterns nor RewriteConds examining %{REQUEST_URI} will "see" query strings; a query string is data appended to the URL, and mod_rewrite does not treat it as part of the URL-path itself. To test the query string, examine %{QUERY_STRING} in a RewriteCond.
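To make the split between URL-path and query string concrete, here is a sketch for a made-up request. The URL, the "lang" parameter, and projects-en.html are hypothetical names used only for illustration, and the values shown assume the rule sits in the document-root .htaccess file.

```apache
# Hypothetical request: http://www.example.com/projects?lang=en
#   RewriteRule pattern is matched against:  projects     (no query string)
#   %{REQUEST_URI} contains:                 /projects    (no query string)
#   %{QUERY_STRING} contains:                lang=en
#
# So a query-string test has to be written as a RewriteCond:
RewriteCond %{QUERY_STRING} (^|&)lang=en(&|$)
RewriteRule ^projects$ /projects-en.html [L]
```

Because the substitution here contains no "?", the original query string is passed through unchanged. When the substitution does add its own query string, as the spider_delegate.php rule does, the original one is replaced unless the [QSA] flag is added.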

Optional: Block spoofed Googlebots with simple validation before doing the rewrite:

# Return 403-Forbidden to Googlebot spoofers, except for requests
# for robots.txt and for the custom 403 error response page itself.
RewriteCond %{HTTP_USER_AGENT} Googlebot/
RewriteCond %{HTTP:From} !^googlebot\(at\)googlebot\.com$
RewriteCond $1 !^robots\.txt$
RewriteCond $1 !^path-to-custom-403-error-page\.html$
RewriteRule ^(.*)$ - [F]

If you don't have a custom 403 error document declared, you won't need the exclusion for its URL-path.
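For reference, a custom 403 page is declared with an ErrorDocument directive along these lines. The path is the same placeholder used in the exclusion above, not a real file name, so substitute your own.

```apache
# Placeholder path; it must match the URL-path excluded in the
# RewriteCond above, or the 403 response page itself will be blocked.
ErrorDocument 403 /path-to-custom-403-error-page.html
```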

Jim

Jijwilhosting

8:05 pm on Apr 13, 2010 (gmt 0)

10+ Year Member



Hey Jim,

Thanks for your quick reply! It's all working swimmingly now. Only I just found out that the Flash developer decided not to work with individual pages (he used a 'http://www.example.com/#/section' structure), so in the end my scenario won't pan out the way I figured anyway.

Oh well, thanks again for helping me so quickly!

Bye, Ro