Forum Moderators: phranque

Returning 410s for select .phps

         

holidayscalendar

2:24 pm on Nov 19, 2011 (gmt 0)

10+ Year Member



Hello,

I am having a similar issue to the person in this thread

[webmasterworld.com...]

but I would like to get 410s for certain .php URLs. If I kill all the .php it will take my site down as I am still using .php.

I have no idea what much of the .htaccess code means. So far I have this - I needed to do a 301 redirect from one site to another because the search engines wouldn't settle on one or the other.

Options +Indexes +FollowSymLinks

RewriteEngine on

# 301 Redirect from example.com to www.example.com
rewritecond %{http_host} ^example.com
rewriteRule ^(.*) http://www.example.com/$1 [R=301,L]

ErrorDocument 404 /pagenotfound.php
ErrorDocument 404 /index.php*pagenotfound.php

The certain URLs I would like to get rid of are of this type: http://www.example.com/index.php?option=com_jevents&task=month.calendar&catids=2&Itemid=2&year=2012&month=04&day=19 These all contain option=com_jevents&task=month...

The ones I would like to keep are like this: http://www.example.com/index.php?mo=12&yr=2011

Any thoughts or suggestions would be greatly appreciated.

[edited by: engine at 9:25 am (utc) on Nov 22, 2011]
[edit reason] examplified [/edit]

lucy24

11:45 pm on Nov 19, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some basics: Always use example.com in your examples. This applies to all Forums, but it's especially important here because we need to see exactly what you typed. Yes, it works both with and without www.

Put general rules like ErrorDocument near the very top of your htaccess. If you put two consecutive "ErrorDocument 404" directives, either the second will overwrite the first or your server will complain. (I don't know offhand which one, but you don't want either ;))

With mod_rewrite, start with the most specific rules and end with the most general. So the with/without www. redirect comes at the very end, picking up only those requests that haven't already been redirected for other reasons.
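A minimal sketch of that ordering, assuming the .htaccess file sits in the document root and assuming a hypothetical retired page called old-page.php (the page names are illustrative and not from this thread). With the specific rule first, a request for example.com/old-page.php is redirected once, straight to its final URL, instead of being redirected to the www host and then redirected again.

```apache
RewriteEngine On

# Specific rule first: one retired page (hypothetical name) moves to its replacement
RewriteRule ^old-page\.php$ http://www.example.com/new-page.php [R=301,L]

# General rule last: anything not already redirected goes to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^(.*) http://www.example.com/$1 [R=301,L]
```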

Here is the boilerplate on query strings. Study it and then let's see what you come up with.

Query Strings

The Query String is the part of a URL after the question mark; it is where the parameters go. Question = query.

By default, rewrites simply ignore the query string. That is, mod_rewrite stashes the query in a safe place, does its stuff to the part before the question mark, and then reappends the original query.

Changing a Query

#1 To delete a query, add a ? to the end of your rewrite target.
#2 To replace a query—or create a new one—add ?blahblah to the rewrite target. The blahblah can be either literal text, or stuff you captured earlier. (#1 and #2 are really the same thing: you're just replacing the query with either something or nothing.)
#3 To add to an existing query, again put ?blahblah at the end of the target, but also add [QSA] to your flags (the bracketed items at the end of the Rule). It stands for "Query String Append", meaning that the blahblah is to be added to the existing query—if any—instead of replacing it.
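The three cases might look like this in practice. This is a sketch only: the file names and the page=archive parameter are made up for illustration.

```apache
RewriteEngine On

# 1. Delete the query: the trailing ? gives the target an empty query string
RewriteRule ^old\.php$ /new.php? [R=301,L]

# 2. Replace the query: /list.php?x=1 is served as /index.php?page=archive
RewriteRule ^list\.php$ /index.php?page=archive [L]

# 3. Add to the query: /list2.php?x=1 is served as /index.php?page=archive&x=1
RewriteRule ^list2\.php$ /index.php?page=archive [QSA,L]
```

Apache 2.4 and later also accept a [QSD] flag as an alternative to the trailing ? in case 1.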

Getting the Query

You only need to retrieve the original query if
#1 you want the rewrite to behave differently depending on what the query was
or
#2 you need to change or delete the query

Add a Condition that says

RewriteCond %{QUERY_STRING} blahblah


using your ordinary Regular Expressions, anchors and ! as needed.

To test whether there was a query at all

RewriteCond %{QUERY_STRING} .


which simply means "If the query contains at least one character of any kind".

If you need to capture any of the query, use parentheses as usual. In the rewrite target, the captures will be %1, %2 etc instead of $1, $2 etc, because they are coming from a Condition instead of the Rule. Each set is separately numbered, so the first capture from the Rule will still be $1.
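As an illustration of the two sets of captures, here is a sketch that assumes a query of the form mo=12&yr=2011 and a purely hypothetical target layout. It is only meant to show where %1, %2 and $1 come from, not to suggest these URLs should be redirected.

```apache
RewriteEngine On

# %1 and %2 come from the Condition: the month and year in the query string
RewriteCond %{QUERY_STRING} ^mo=([0-9]+)&yr=([0-9]+)$
# $1 comes from the Rule: the word "index" captured from the path.
# /index.php?mo=12&yr=2011 is redirected to /index/2011/12 (the trailing ? drops the old query)
RewriteRule ^(index)\.php$ /$1/%2/%1? [R=301,L]
```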

holidayscalendar

8:04 am on Nov 21, 2011 (gmt 0)

Thank you for your help. I made a few changes but I had difficulty understanding the bottom section re: the query information.

Options +Indexes +FollowSymLinks 

RewriteEngine on

ErrorDocument 404 /pagenotfound.php

# 301 Redirect from example.com to www.example.com
rewritecond %{http_host} ^example.com
rewriteRule ^(.*) http://www.example.com/$1 [R=301,L]

phranque

9:49 am on Nov 21, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, holidayscalendar!

# 301 Redirect from example.com to www.example.com
rewritecond %{http_host} ^example.com
rewriteRule ^(.*) http://www.example.com/$1 [R=301,L]


i would suggest it's better to do your hostname canonicalization this way:
# 301 Redirect not www.example.com to www.example.com
RewriteCond %{HTTP_HOST} !^www.example.com$
RewriteRule ^(.*) http://www.example.com/$1 [R=301,L]

this will fix all the non-www, ip address, wildcard subdomain, default port specified, etc cases.

regarding the 410 response, you can do that using the G flag with the RewriteRule directive.
in your case something like this should do it:
# 410 status code for certain query strings
RewriteCond %{QUERY_STRING} [whatever pattern]
RewriteRule ^.* - [G]

g1smd

5:00 pm on Nov 21, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Minor corrections:
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$


Escape literal periods and allow HTTP/1.0 requests, which can arrive without a Host header.
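Put together with the rule from the previous post, the corrected redirect would read roughly as follows. The only new piece is the comment; the hostname is the thread's example.com placeholder.

```apache
# 301 redirect anything that is not www.example.com to www.example.com.
# The (...)? makes the hostname optional, so an empty Host value also
# counts as a match and is left alone instead of being redirected.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*) http://www.example.com/$1 [R=301,L]
```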

holidayscalendar

8:06 pm on Nov 21, 2011 (gmt 0)

Thank you all for your help. I somewhat feel like a high school student probing for the answer but I am not sure what all of the symbols mean. Phranque, if I am looking to selectively get a 410 response for only some php pages - if those pages included the word jevents - is that even possible? If so, would that be what goes in the [pattern] area and is there a certain way I need to write that with the symbols, such as an asterisk or dollar sign before and / or after the word?

lucy24

11:34 pm on Nov 21, 2011 (gmt 0)

if I am looking to selectively get a 410 response for only some php pages - if those pages included the word jevents - is that even possible?

Sure. That's what Regular Expressions are all about. When you say "those pages", are you still talking about the query string as in your original (obfuscated) example?

And just to make sure: anything containing this query string is simply gone, gone? That's not the same thing as deciding that you're no longer going to process that specific query so you need to tell the search engines to ignore it.

If you really mean that anything containing "jevents" is gone with the wind, it becomes a lot easier because you don't need to capture anything. You will need a RewriteCond looking at the query string; go back and read the boilerplate. Since the Rule itself doesn't see the query, you'll want to constrain it in some other way: at a minimum, by saying something like
\.php$
instead of
.*
so Apache doesn't waste its time looking at Conditions that will never apply. (That is, if it has received a request for images or a stylesheet, it means the parent document has already been declared Good To Go, so you don't need to check further.)

holidayscalendar

7:54 am on Nov 22, 2011 (gmt 0)

If the word jevents comes after .php do you think it will still work?

Also, will it be something to the effect of

Rewrite Cond %{HTTP_HOST} \.php\(.*)jevents(.*)
Rewrite Rule ^.* - [G]

lucy24

8:40 am on Nov 22, 2011 (gmt 0)

If something comes after .php then either you have severely malformed URLs or we are talking about query strings. And if your hostname includes the string ".php?" I'm getting out of here because I don't want to be standing too close when your server melts down.

There are only so many different ways we can say: use a Condition that looks at the %{QUERY_STRING}. There are lots of recent threads about it if you need to swipe someone else's code to use as a starting point.

g1smd

8:55 am on Nov 22, 2011 (gmt 0)

Is your domain name
http://something.php.something.jevents.something.com/
?

That's what your condition is looking for.

phranque

9:37 am on Nov 22, 2011 (gmt 0)

note that RewriteCond and RewriteRule are each one word, not two.
make sure whatever pattern you use in a RewriteCond or RewriteRule directive is as specific as possible.
i.e. if it's possible to have "djevents" in a query string then you better look for something more specific than "jevents", or add a condition to exclude the "djevents" case.

regarding your last sample of a regular expression:
- a backslash is an escape character so \(.*) isn't going to do anything useful for you
- a (.*) at the end of that pattern is also useless unless you are capturing a group of characters for subsequent usage in a mod_rewrite directive.

lucy24

11:08 pm on Nov 22, 2011 (gmt 0)

a backslash is an escape character so \(.*) isn't going to do anything useful for you

Ooh, how did I miss that? It changes the whole thing from "capture any old stuff" to "there has to be a literal open-parenthesis in this location" ... and I don't care to speculate what happens with the stray closing parenthesis. The text editor would simply dig in its heels and inform me that I've got an invalid Regular Expression. Apache might, at its discretion, return 500 errors. Give or take ;) I am not prepared to put this particular question to the test.

holidayscalendar

8:28 am on Nov 30, 2011 (gmt 0)

Hello again. I am very sorry that I have not been able to respond in over a week - this is my part-time job, while my full-time job is a 4-month-old baby. I greatly appreciate your help with my question.

Lucy - I finally caught what the query string part meant - that that is what comes after the .php in the URL. Basically I would be looking to see if the word jevents was in the query string. There would be no way of finding that word again accidentally in any of my other URLs. I will check to see if I can find similar code in other threads. Will post what I come up with - I am starting to understand this a little finally thanks to your responses.

holidayscalendar

8:44 am on Nov 30, 2011 (gmt 0)

Holy crap it worked. I am forever indebted to you - I have spent months working on this.

# 410 Response for jevents URLs
Rewritecond %{QUERY_STRING} [\.php$jevents]
RewriteRule ^.* - [G]

Thank you so, so much.

phranque

11:48 am on Nov 30, 2011 (gmt 0)

i doubt that's what you wanted.
it "works" (410 status code response) for that url as well as many others that were unintended.
the square brackets define a character class and within those brackets most special characters lose their meanings.
therefore your pattern will match any query string that contains any of the following letters or characters:
ehjnpstv$.
including jevents=foo.
and by the way, the .php would not typically be in the query string.
the query string is only that part of the url after the first '?'.
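For comparison, a sketch of what the condition was probably meant to say. It assumes the unwanted URLs are always requested as a .php file with jevents somewhere in the query string, as in the examples at the top of the thread.

```apache
RewriteEngine On

# No square brackets: the seven letters must appear together and in this order
RewriteCond %{QUERY_STRING} jevents
# The Rule only sees the path, so \.php$ limits the check to .php requests;
# - means "no substitution" and [G] answers with 410 Gone
RewriteRule \.php$ - [G]
```

A request such as /index.php?mo=12&yr=2011 contains no "jevents" and is left untouched by this rule.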

lucy24

9:40 pm on Nov 30, 2011 (gmt 0)

Come to think of it, if "jevents" is the only query that will ever contain the letter j, then that's all you have to look for :)

Rewritecond %{QUERY_STRING} j

without anchors. And you can put \.php$ in the "pattern" part of the Rule so your server only has to check requests for pages.

Y'know, I was staring at that RewriteCond thinking "What the ### do the brackets mean?" Didn't even think of grouping!

phranque

10:08 am on Dec 1, 2011 (gmt 0)

Didn't even think of grouping!

not grouping in the sense of capture - it is for classification of characters.
for example the class of alphabetic characters would be [A-Za-z]
(the hyphen is a special character in this case)

holidayscalendar

1:55 pm on Dec 1, 2011 (gmt 0)

So if I wanted exactly the word jevents, and all of those characters in that order would I leave the brackets off?

Rewritecond %{QUERY_STRING} jevents

g1smd

2:33 pm on Dec 1, 2011 (gmt 0)

jevents - "contains" jevents - there could be other stuff before or after.

^jevents - "begins" jevents - there could be other stuff after.

jevents$ - "ends" jevents - there could be other stuff before.

^jevents$ - "exactly" jevents - nothing else.
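The same anchors can be combined with alternation to pin the match to one whole parameter, so that something like "djevents" or "option=com_jeventsextra" would not trigger it. A sketch, assuming the parameter is always written option=com_jevents as in the example URL at the top of the thread:

```apache
RewriteEngine On

# option=com_jevents must start the query or follow an &,
# and must end the query or be followed by an &
RewriteCond %{QUERY_STRING} (^|&)option=com_jevents(&|$)
RewriteRule ^index\.php$ - [G]
```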

You need to read some RegEx tutorials that explain how this works.

You need to know ^stuff stuff$ [stuf] [stuf]+ [stuf]* [^stuf] [^stuf]+ [^stuf]* (stuff) ^(stuff) ^(stuff)+ ^(stuff)* (stuff)$ (stuff)+$ (stuff)*$ (this|that) . .+ .* (.) (.+) (.*) and all of those preceded with ! for "NOT" and all of those in any combination.

holidayscalendar

5:11 pm on Dec 1, 2011 (gmt 0)

I found a good one here that helps understand this even more, should anybody else come across this thread.

[webmasterworld.com...]

lucy24

8:16 pm on Dec 1, 2011 (gmt 0)

not grouping in the sense of capture - it is for classification of characters.

Yes, that's what I meant. But anything containing php in brackets would match for every page. So I guess the rule would "work" if you only tested it on requests that you wanted to work, and didn't think of reverse-testing on requests that you didn't want to work.

I found a good one here

Wow, that's serious searching.
:: shuffling papers ::
Whoops, no it isn't, it's reading the Forums Library. Um. Possibly I should read it myself. Ahem.

phranque

5:46 am on Dec 2, 2011 (gmt 0)

You need to know ^stuff stuff$ [stuf] [stuf]+ [stuf]* [^stuf] [^stuf]+ [^stuf]* (stuff) ^(stuff) ^(stuff)+ ^(stuff)* (stuff)$ (stuff)+$ (stuff)*$ (this|that) . .+ .* (.) (.+) (.*) and all of those preceded with ! for "NOT" and all of those in any combination.

and for conditional patterns used in mod_rewrite directives you also need to know about patterns starting with < > or =, which compare against a plain string instead of a regular expression.
and you also need to know about stuff itself, which in this case is more precisely called a perl compatible regular expression:
http://perldoc.perl.org/perlre.html [perldoc.perl.org]
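A small sketch of the = form, reusing the thread's example.com hostname. In a RewriteCond, a pattern starting with = is compared as a plain string and not as a regular expression, so the periods need no escaping and no anchors are involved.

```apache
RewriteEngine On

# True only when the Host value is exactly the string example.com
RewriteCond %{HTTP_HOST} =example.com
RewriteRule ^(.*) http://www.example.com/$1 [R=301,L]
```

Unlike the negated regex version earlier in the thread, this catches only the bare hostname and not IP addresses or other subdomains.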

phranque

5:47 am on Dec 2, 2011 (gmt 0)

But anything containing php in brackets would match for every page.

not if you are matching against QUERY_STRING.

lucy24

6:47 am on Dec 2, 2011 (gmt 0)

Oops, right, query string. But that bracketed material also contains the letter "e"-- twice, for good measure-- and "s" and "t". That should cover at least 3/4 of your queries :)