Forum Moderators: phranque

I need 301 of bad pages

redirect pages to sitemap

         

Ivanna

4:21 pm on Oct 26, 2012 (gmt 0)

10+ Year Member



I am now the admin for a website with many 404 errors. Looking back through three years of AWStats reports, I see the problem started a long time ago. Many pages do not exist but are still in the SE indexes. I want to remove them from the SE indexes or redirect them to another page.

How I think this happened:
The page names suggest they are links to very rude and inappropriate sites. This was perhaps a virus that is long gone, but the links remain in the SE indexes. Now when the SEs crawl, they get a lot of 404 errors. All the pages are in the same nonexistent folder, /orderform_files/www/

I think it will be better for the site if I redirect them to sitemap.php. Is this true?

Is this the correct line to redirect all URLs under /orderform_files/www/ to /sitemap.php?

redirectMatch 301 ^/orderform_files/www/ /sitemap.php


Thank you for your help.

g1smd

5:10 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The 404 response from your server tells the search engine that there is nothing to see at those URLs.

There is nothing further to do.

You certainly do not want to "adopt" those URLs by redirecting requests for them to some other place.

Google crawls every URL that it has ever seen, forever, just in case the status ever changes again in the future. There is nothing to fix.

Ivanna

5:54 pm on Oct 26, 2012 (gmt 0)

10+ Year Member



Thank you.
I think I read that 404 is bad for a site; that is the reason for my question.

I have one more question. This site has canonical issues, so I used code I found on this forum that you wrote two days ago :-)

Should it be best practice to add this to .htaccess for every site as standard procedure?

# Redirect index.html and .htm to folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]

# Redirect non-canonical to www
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

g1smd

6:24 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that code should be on almost every site.

Do adapt the \.html? part to specify the actual extensions in use, such as .php and so on.

Those redirects should appear after any RewriteRules that block malicious requests and after any RewriteRules that redirect more-specific requests, and before any RewriteRules that perform internal rewrites.
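As a rough sketch of that ordering, a ruleset might be laid out as below. The blocking pattern, the old-page/new-page redirect and the product.php rewrite are hypothetical placeholders invented for illustration; only the index and www rules come from this thread.

```apache
RewriteEngine On

# 1. Blocking rules first (hypothetical pattern, for illustration only)
RewriteCond %{QUERY_STRING} (^|&)eval= [NC]
RewriteRule ^ - [F]

# 2. More-specific redirects next (hypothetical URLs)
RewriteRule ^old-page\.html$ http://www.example.com/new-page [R=301,L]

# 3. General redirect: index.html and index.htm to folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]

# 4. General redirect: non-canonical hostname to www
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# 5. Internal rewrites last (hypothetical script name)
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [L]
```

Putting the external redirects ahead of the internal rewrites means a visitor is sent to the canonical URL before any internal filename can leak into a redirect target.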

Ivanna

6:35 pm on Oct 26, 2012 (gmt 0)

10+ Year Member



So
*index\.html?\

becomes
*index\.php?\


yes?

g1smd

6:38 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The question mark is needed only if you want to match both index.ph and index.php requests.

Don't forget there are two places within the ruleset that need to be modified in exactly the same way.

You can also use the "|" OR operator to cater for both .html and .php requests.

Ivanna

7:29 pm on Oct 26, 2012 (gmt 0)

10+ Year Member



"The question mark is needed only if you want to match both index.ph and index.php requests."
So it is not wrong to leave the question mark in, for safety?

"Don't forget there are two places within the ruleset that need to be modified in exactly the same way."
Yes, they are both index\.html?

"You can also use the "|" OR operator to cater for both .html and .php requests."
Like this?
*index\.html?\ |*index\.php?\ HTTP/

g1smd

7:34 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No.

index\.(html?|php)


Having found index. once, why search for it again?

Put the one that is most likely to match first in the list.

Ivanna

7:46 pm on Oct 26, 2012 (gmt 0)

10+ Year Member



Now I understand. I will add this to my saved code folder :-)

Thank you very much for your help and patience with me.

Ivanna

Ivanna

8:17 pm on Oct 26, 2012 (gmt 0)

10+ Year Member



One more thing, please:
should the $ be inside or outside?
RewriteRule ^(([^/]+/)*)index\.(php? |html)$ http://www.example.com/$1 [R=301,L]

g1smd

9:45 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The $ is in the right place.

Look at the question mark and the following space again. The question mark is in the wrong place.

Don't forget the preceding RewriteCond too.

lucy24

9:59 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



$ in patterns means "This is the very end of the utterance" so if it is present at all it would normally be the last thing-- unless you're setting up a choice like
/(blah$|foobar)
where you are looking for either "/blah and then nothing more" or "/foobar, possibly followed by other things".

Similarly ^ if used goes at the very beginning.

g1smd

10:18 pm on Oct 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'd probably write
/(blah$|foobar)

as
/(blah|foobar.*)$

so as to avoid a later addition like
/(blah$|foobar)new-bits-on-the-end$

which would leave the blah branch unable to ever match, because nothing can follow the end-of-string anchor.

Ivanna

4:58 pm on Oct 27, 2012 (gmt 0)

10+ Year Member



Ok, so now I have this:

# Redirect index.php and .ph or html and .htm to folder

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(php|html?)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php|html?)$ http://www.example.com/$1 [R=301,L]

Question:
Are there no spaces in this (php|html?) part?
I want to check for php and ph, or html and htm. Do I need only one question mark?
Will this now be generic code I can add to all .htaccess files?

I understand ^ is the start of the code and $ is the end. What is $1 ?

Thank you again
Ivanna

g1smd

6:34 pm on Oct 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There should be no spaces in the php or html bit.

You don't need to check for .ph at all. :)

You should be able to add this code to most sites.

$1 represents the folder name (if present) captured by the outermost ( ) in the rule pattern and this is then immediately re-used in the rule target. This is the power of RewriteRule, being able to re-use bits of the original URL request.
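To make the capture concrete, here is a sketch of what $1 would hold for a few requests. It assumes the rule sits in the .htaccess file at the site root, where the leading slash is not part of the path the pattern sees. The example paths are hypothetical.

```apache
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(php|html?)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php|html?)$ http://www.example.com/$1 [R=301,L]

# Requested path           $1 (outermost group)   Redirect target
# /index.php               (empty)                http://www.example.com/
# /shop/index.html         shop/                  http://www.example.com/shop/
# /shop/toys/index.htm     shop/toys/             http://www.example.com/shop/toys/
```

The inner group $2 holds only the last folder matched and $3 holds the extension; neither is used in the target.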

After the index rule, you'll need the standard non-www/www canonical rule (in the post near the top of the page).

Ivanna

6:44 pm on Oct 27, 2012 (gmt 0)

10+ Year Member



I have saved this to use on all websites

# Redirect index.php, index.html and index.htm to folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(php|html?)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php|html?)$ http://www.example.com/$1 [R=301,L]

# Redirect non-canonical to www
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]


Thank you for all your help
Ivanna

lucy24

9:55 pm on Oct 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Final question, so long as we're here:

In the /index.thingummy rule, why does the Request condition need to look at the complete request? Are there situations where the element "index.thingummy" can occur at the end of the filename (as specified in the Rule) but somewhere else in the original Request? And the "somewhere else" is permissible (that is, doesn't trigger the Condition)?

g1smd

10:26 pm on Oct 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It looks at the complete request to ensure that the bits we want to match are in the right part of the request, and are framed right.

All hell can break loose when a bot sends a completely scrambled malicious request header that happens to contain the one thing that a minimalist RegEx pattern was looking out for.

In the same way you'd specify ^bar$ in a RewriteRule RegEx pattern to ensure that requests for foobar and barf did not match, you specify
^[A-Z]{3,9}\ /
and
\ HTTP/
when looking at THE_REQUEST with a RewriteCond. The bit in between those items is then an exactly specified path, and if you don't make provision for parameters, requests with parameters will not be matched.

The
^[A-Z]{3,9}\ 
also reminds you that all methods are being processed. Sometimes you might replace that with the literal
GET
or
POST
or a NOT variant of those.
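A sketch of those method-specific variants, using the same index rule. The REQUEST_METHOD condition in the second variant is only one possible way to express "not POST" and is an assumption, not necessarily the form intended here.

```apache
# Variant 1: act only on GET requests (literal method instead of [A-Z]{3,9}).
RewriteCond %{THE_REQUEST} ^GET\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]

# Variant 2 (alternative to variant 1): every method except POST.
# RewriteCond %{REQUEST_METHOD} !^POST$
# RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
# RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
```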


Compare that with a minimalist ruleset:

RewriteCond %{THE_REQUEST} index
RewriteRule .* ................. [L]

This ruleset will be quite happy with a request like
GET ../../../../../index?eval{some_malicious_code} HTTP/1.1
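For contrast, here is a sketch of how the fully anchored condition from earlier in the thread should treat a few request lines. The first two request lines are hypothetical examples; the third is the one quoted above.

```apache
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]

# GET /folder/index.html HTTP/1.1
#   condition matches: method, space, path ending in index.html, space, HTTP/
# GET /folder/index.html?x=1 HTTP/1.1
#   no match: "?x=1" sits between "index.html" and " HTTP/"
# GET ../../../../../index?eval{some_malicious_code} HTTP/1.1
#   no match: the path does not start with "/" and has no ".html" before " HTTP/"
```

Requests that fail the condition are not redirected by this rule; they simply continue to whatever rules follow.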

lucy24

1:40 am on Oct 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Sometimes you might replace that with the literal GET or POST or a NOT variant of those."

Heh. I've got a general block on POST -- except for analytics -- because I don't use it. Same for .php in requests.

"In the same way you'd specify ^bar$ in a RewriteRule RegEx pattern to ensure that requests for foobar and barf did not match"

Or, in my case, the generic 410 on "paintings/paintings" also matched the not-at-all-gone "paintings/paintingstyles". Ahem.

lucy24

5:23 am on Oct 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Awright, what's the WebmasterWorld equivalent of a Treppenwitz? *

Using my current wording, with hasty detour to add $ to Pattern:

#1. Request meets mod_rewrite before mod_dir has run (the more likely pattern):

RewriteCond %{THE_REQUEST} index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]

= requested filename must end in "index.html" (because it can't have come from anywhere else), though it may also contain "index.html" elsewhere. Redirect does not check whether the directory really exists.

If evil robot wants to carry out its schemes, it must subtract "index.html" from its request and try again.

#2. Request meets mod_rewrite after mod_dir has run:

RewriteCond %{THE_REQUEST} index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]

=
(A) request was for an actual directory, which either contains a named index.html file or has auto-indexing enabled (in my case this applies only to a few image-only directories), causing mod_dir to append "index.html", AND the element "index.html" occurs somewhere in the query string (it can't be in the path, or mod_dir wouldn't have found the directory and served up an index)
OR
(B) request was for a filename ending in "index.html" as above. If evil robot, et cetera as above.

#3 Either way: If the condition is not met, the rule fails and the request is sent along unchanged rather than being redirected. In the case of evil robots, is this a better outcome?

It also occurs to me that since I don't in fact use query strings -- and certainly not for anything ending in "index.html" -- I could perfectly well change the target to

...example.com/$1?

thereby whacking off one more line of attack from Ukraine and points east.
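A sketch of that change, assuming the loose condition used in this post. A target ending in a bare "?" makes mod_rewrite issue the redirect without the original query string. On Apache 2.4 and later the QSD flag has the same effect; the commented-out line shows that alternative and assumes a 2.4+ server.

```apache
RewriteCond %{THE_REQUEST} index\.html
# Trailing "?" discards any query string from the redirected URL.
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1? [R=301,L]

# Apache 2.4+ alternative to the rule above (use one or the other):
# RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L,QSD]
```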


* The additional question you think of after the time to edit your post has passed.