Forum Moderators: phranque

Old google indexed pages

what to do with, how to deal ...

         

editordude

7:27 pm on Jun 5, 2005 (gmt 0)

10+ Year Member



When I first made my site, its navigation had every page in the home directory, linked to like so: domain.com/?page=videos and so on.

Now google still has indexed all these old pages but not the new equivalents. How would I best be able to deal with these old urls?

Basically I want google to remove those old pages and index the new ones (which are now linked as such: videos/) and so on.

Would it be best to set up a permanent re-direct or is there a better method?

As it stands those old pages load my homepage (seeing as it's index.php?page=videos), which is why I doubt anything will be done on G's side.

Thanks in advance for any help.

jd01

8:42 pm on Jun 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi editordude,

You are correct, a permanent redirect to the 'new' corresponding file is usually the best and most effective method of forcing the changes to take place. One of the benefits is, you will (at some point in time), get credit for any inbound links to the old page transferred to the new pages.

Justin

editordude

9:17 pm on Jun 5, 2005 (gmt 0)

10+ Year Member



Thanks for the response.

The main problem here is that redirecting won't be easy unless I have a rule for every single page that has moved, as there was no 'formula' I used when linking pages so regular expressions won't work here. I shall give it a go, thanks. =D

So just to confirm the correct code to be used in this case would be something like so:

RewriteRule  ^index\.php?page=(.*)/$ http://www.domain.com/newlocation/ [R=301,G]

Or should I omit the G, as a permanent re-direct says all that needs to be said?

jd01

9:59 pm on Jun 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This line always confuses me...

'formula' I used when linking pages so regular expressions won't work here

(.*) = Regular expression.

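This will not redirect anything, because RewriteRule matches only the URL path and never sees the query string (the stuff after the ?), so no information is passed to the rule. The unescaped ? in the pattern just makes the preceding 'p' optional. Also note that G means 'Gone' (410), so in a redirect it should be replaced with L. The rule as posted: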
RewriteRule ^index\.php?page=(.*)/$ http://www.domain.com/newlocation/ [R=301,G]

To redirect everything to a single location, you would need this:
RewriteCond %{QUERY_STRING} ^page=.+
RewriteRule ^index\.php$ http://www.domain.com/newlocation/ [R=301,L]

It seems, though, that if the example in your first post is accurate (page=videos to /videos/), this would work:

RewriteCond %{QUERY_STRING} ^page=(.+)
RewriteRule ^index\.php$ http://www.domain.com/%1/ [R=301,L]

(.+) = stores 1 or more of 'anything except a line break' after page= in %1
/%1/ = retrieves anything stored in %1 and puts it in as a directory name.

Maybe I am missing something, but if you used names on your original page= and there is a corresponding new location (sometimes using only a portion of the original string), you can usually use regular expressions creatively to catch most of your changes...

The reason for emphasis on trying to find a way, is that it is *much* more effective to redirect to a corresponding 'new' location, than to send everything to a single page.
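
A slightly stricter variant of the rule above, as a sketch only: it assumes your old page names contain nothing but letters, digits, hyphens and underscores, so anything unexpected after page= is simply not redirected. The trailing ? on the target keeps the old query string from being carried over to the new URL.

```apache
RewriteEngine on

# Assumption: page names look like "videos" or "disco2", nothing more exotic.
# The character class replaces (.+), and $ rejects any extra parameters.
RewriteCond %{QUERY_STRING} ^page=([A-Za-z0-9_-]+)$
# The trailing ? stops the original query string being appended to the target.
RewriteRule ^index\.php$ http://www.domain.com/%1/? [R=301,L]
```

With this, index.php?page=videos would go to /videos/, while something like index.php?page=videos&x=1 would be left alone for another rule to handle.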

Hope this helps.

Justin

editordude

10:09 pm on Jun 5, 2005 (gmt 0)

10+ Year Member



Sure does, thank you very much!
The query string help has been great and I'm sure with a bit of playing I'll have it.

For the most part page=videos pretty much matches its new location, as in /videos/, which is neat and should save time. There are a few cases that don't, so I'll make a separate rule for each of those pages.

Have a related question on an issue I've yet to come across till now, the bottom of my root .htaccess looks like this:


RewriteCond %{HTTP_REFERER} !^http://www.domain.com/.*$ [NC]

RewriteRule .*\.(avi¦rar¦mp3¦wmv)$ http://www.domain.com [R,NC]

Now how would I add rules for the permanent redirect of old pages after this? I assume any further rewrite rules would be interfered with by the RewriteCond above them?

Your help is greatly appreciated!

jd01

10:42 pm on Jun 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From your other thread:

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://(www\.)?domain\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?otherdomain\.com/ [NC]
RewriteRule .*\.(avi¦rar¦mp3¦wmv)$ http://www.domain.com [R=301,NC]

1. By removing the trailing .*$ from the RewriteCond patterns you save some processing, because an unanchored end already means 'and everything else'.

2. By making the www optional, a single condition covers both the www and non-www forms of the host name.
(www\.)? = the ? means 0 or 1 of the immediately preceding character or group.

3. The default of the R flag is 302, temporary... with all the recent issues concerning 302's it is advisable to define any R with a 301, permanent, value.

4. Additional rules should follow the above ruleset, with a blank line between each individual set:

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://(www\.)?domain\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?otherdomain\.com/ [NC]
RewriteRule .*\.(avi¦rar¦mp3¦wmv)$ http://www.domain.com [R=301,NC]

RewriteCond %{QUERY_STRING} ^page=(.+)
RewriteRule ^index\.php$ http://www.domain.com/%1/ [R=301,L]

5. Critical: If you are redirecting pages with a query string to specific locations, you MUST insert those redirects after the first ruleset, and before the second. If you add them after the second ruleset, they will never be processed, because the second ruleset uses a 'catch all' and will redirect everything it qualifies.

By inserting the specific page redirects in the middle, you can redirect those and stop processing before you get to the 'catch all', so they will be processed accurately.
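
Put together, the ordering might look something like the sketch below. The page=aboutstaff to staff.php mapping is only a made-up example of a page whose new name differs, and the pipes are typed as solid | characters, which is what Apache needs.

```apache
Options +FollowSymlinks
RewriteEngine on

# 1. Hotlink protection for media files.
RewriteCond %{HTTP_REFERER} !^http://(www\.)?domain\.com/ [NC]
RewriteRule \.(avi|rar|mp3|wmv)$ http://www.domain.com/ [R=301,NC,L]

# 2. Specific redirects first (hypothetical example of a renamed page).
RewriteCond %{QUERY_STRING} ^page=aboutstaff$
RewriteRule ^index\.php$ http://www.domain.com/staff.php? [R=301,L]

# 3. Catch-all last: everything else maps page=name to /name/.
RewriteCond %{QUERY_STRING} ^page=(.+)
RewriteRule ^index\.php$ http://www.domain.com/%1/? [R=301,L]
```

Because each redirect carries the L flag, a request that matches a specific rule in step 2 stops there and never reaches the catch-all.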

Justin

editordude

11:25 pm on Jun 5, 2005 (gmt 0)

10+ Year Member



Thank you so much for the help, clearing up many things I've wanted to know!

All you've said makes perfect sense and I've made the changes you recommended as well as learning from them so thanks. =D

Have a few questions in response to your reply:

1. You replaced | with ¦ in the following line:
(avi¦rar¦mp3¦wmv)

Why was this? My keyboard shows the same broken symbol on the 'pipe' key, but when typed it comes out solid, like so: |. I tried switching to what you have shown but it failed to work, whereas the | does. Just curious as to whether they serve a different purpose.
Edit: Seems as if this is a webmasterworld thing, changing the pipe symbol to this version: '¦', so no worries there then.

2. For the 'catch-all' following line:
RewriteRule ^index\.php$ [domain.com...] [R=301,L]
You used a permanent re-direct; wouldn't this cause problems when it comes to search engines? I assume google would think that the re-directed page is its new home and update its index to whatever is shown, which in most cases will be a 404 error?

(As there will be links such as page=aboutstaff, for example, which do not exist at all anymore, or if they do, could now be at staff.php in the home root and not necessarily in the staff/ directory.)

3. So just to confirm, putting blank lines between rules splits them up and apache sees them as different rules altogether? (As if they were in different .htaccess files.)

That is all for now, thank you for the amazing help, really do feel like I've come an awfully long distance tonight thanks to your help!

Edit: I apologize if I'm asking too many questions, but here's a baffling thought I can't get my head round:

How does apache know which directory's .htaccess to run and match?

For example we have this url:
domain.com/lyrics/album/song/

and this is successfully rewritten internally to:
domain.com/lyrics/view.php?title=song

with the .htaccess being in the lyrics directory.

Now how does apache know that the .htaccess in the lyrics directory was the one to use? What if there was a lyrics/album/ directory which also had a .htaccess, and the rule matched there too?

jd01

2:31 am on Jun 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1. You are correct, it is the board. It displays the solid pipe as ¦, so change it back to | before using any code copied from here.

2. A permanent redirect says, 'the page has moved here... include this 'new' page (the location the page is moved to) instead of the old page.' The point of the 301 is to AVOID a 404...

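3. Yes, they are separate rulesets, although it is not the blank line that does it: Apache ignores blank lines. RewriteCond lines apply only to the first RewriteRule that follows them, so each RewriteRule ends one ruleset and the next condition or rule starts another. The blank line is there for readability.

htaccess works recursively throughout the directory structure. This means if you have an .htaccess file in your 'root' or main directory, it applies to all directories below it, unless there is another .htaccess along the way. For rewrite rules, that file then supersedes the original by default and the processing happens there... basically the .htaccess closest to the destination file wins.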

This can be extremely processor intensive, because if a file qualifies for a rule in the root, then a rule in the sub, then another rule in the actual directory, you have just wasted a good deal of processing along the way... the result will be correct, but the method can be made more efficient. That is why it is extremely important to 'disqualify' anything except the actual rule for the end result as quickly as possible.

Upon any request for a page, Apache will check each directory along the way for an htaccess file and apply rules accordingly. EG

When a request is made for yoursite.com/your/file/here.html Apache checks like this:
yoursite.com/.htaccess
yoursite.com/your/.htaccess
yoursite.com/your/file/.htaccess // this is the one I use.

Before it will ever return the actual file.
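
For the lyrics example asked about above, a directory-level file might look roughly like this. It is a sketch only: the pattern is my assumption about the URL layout (album/song/), not something taken from your actual file.

```apache
# Hypothetical /lyrics/.htaccess
RewriteEngine on
RewriteBase /lyrics/

# Inside /lyrics/ the pattern is matched with the "lyrics/" prefix stripped,
# so a request for /lyrics/album/song/ is seen here as "album/song/".
# The album part is matched but ignored; the song part is captured as $1.
RewriteRule ^[^/]+/([^/]+)/$ view.php?title=$1 [L]
```

If a real lyrics/album/ directory existed with its own rewrite rules in an .htaccess, then as far as I know that file would take over for requests below it and the rules above would not run, unless inheritance were switched on with RewriteOptions.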

So, depending on the structure of the site and the number of rewrites, you can be more effective by putting your .htaccess in specific directories than in the main directory... Again, this depends on structure, number of rewrites and other variables.

I personally use this method extensively, because I serve a little over 20k files from about 20 php pages. To avoid loading or processing unnecessary rules, I use directory-specific .htaccess files, so if you access a part of my site that does not use rewrites, no rewrite rules are loaded or processed. Apache checks the entire path before applying any rules anyway, so by putting the rules only at the end of the path, I load just the rules for the directory the file is actually in, and can 'fail' or 'disqualify' rules based on the first letter of a file name (in most cases).

If I placed the rules in my main htaccess, I would have to process to the second sub-level (variable/variable/what-I-need-to-know) before I could even disqualify one rule, then repeat that process until a rule qualified. This would add considerably to my processing time. By letting Apache go straight to the actual directory, then failing rules based on the first letter of the file name, I can be extremely efficient, even though I use some extensive rewrites.

If I had access to the httpd.conf file, I would use it exclusively for the rules, because there is actually more benefit in having them in the actual server configuration, but since this is not an option for me I try to avoid loading or processing anything that is not absolutely necessary.
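
For anyone who does have access to the server configuration, the same redirect could be sketched like this. The paths are placeholders, and the access-control line assumes Apache 2.4 syntax. Note that in server context the pattern sees the full path, including the leading slash.

```apache
# Hypothetical virtual host; ServerName and paths are placeholders.
<VirtualHost *:80>
    ServerName www.domain.com
    DocumentRoot "/var/www/domain"

    RewriteEngine on
    # Server context: the path is matched with its leading slash.
    RewriteCond %{QUERY_STRING} ^page=([A-Za-z0-9_-]+)$
    RewriteRule ^/index\.php$ http://www.domain.com/%1/? [R=301,L]

    <Directory "/var/www/domain">
        # With overrides disabled, Apache does not look for .htaccess files.
        AllowOverride None
        Require all granted
    </Directory>
</VirtualHost>
```

The rules are then read once at startup rather than on every request, which is where the benefit comes from.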

Hope this helps again, and is not too confusing.

Justin

Edited for Clarity

Added: In reading back through the last few paragraphs, I did not explain the process directory information very well, but am not sure how to do it any better... Sorry. Just know there are times for both in the main directory, and in the end of the path, and which is more efficient depends on how quickly you can move through the rules.

editordude

8:07 am on Jun 6, 2005 (gmt 0)

10+ Year Member



I can't thank you enough, jd01. It makes perfect sense, and it's surprising how easily it sinks in when explained well, rather than spending hours on the apache documentation pages. Not saying I shall get lazy and rely on people, but now that I have many of the basics in hand I will feel more comfortable finding my way around the manual.

The only minor problem with the code you helped me reach is that when it re-directs the query string (which it does so well) I end up with this, for example:

RewriteCond %{QUERY_STRING} ^page=discography
RewriteRule ^index\.php$ [domain.com...] [R=301,L]

Results in:

[domain.com...]

Nothing major, but I would much prefer it without the query string appended onto the end of the url.

jd01

8:12 am on Jun 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, forgot...

To append a blank query string (which replaces the original one), add a question mark to the end of the target URL in the rule:

RewriteRule ^index\.php$ http://www.domain.com/sub/discography.php? [R=301,L]
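
On Apache 2.4 and later there is also a QSD (query string discard) flag that does the same job without the trailing question mark. A sketch, assuming a server new enough to support it:

```apache
RewriteCond %{QUERY_STRING} ^page=discography$
# QSD drops the incoming query string from the redirect target.
RewriteRule ^index\.php$ http://www.domain.com/sub/discography.php [R=301,L,QSD]
```

On older versions the flag is not recognised, so the question mark remains the portable choice.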

Justin

Glad I could help =)

editordude

9:37 am on Jun 6, 2005 (gmt 0)

10+ Year Member



Thank you! The perfect solution. I originally thought it would add a ? to the end of the url; thankfully it doesn't.

Seems to have caused another little 'problem'.

Rather than using cpanel I've decided upon using .htaccess files set up manually (also to serve as a learning experience). Everything seems to work well, but within cpanel's 'hotlink protection' page I have the following:

^page=disco
^page=disco2
^page=videos/
[(www\.)?domain\.com...]
[(www\.)?domain\.com...]

Doesn't seem to affect anything, but I would like to know why cpanel recognises this and includes the other patterns, which are on separate lines and should have nothing to do with the protection?

Thanks again