

Redirecting incoming jpg URLs, stripping characters after the .jpg

         

kidcobra

5:58 pm on Feb 3, 2011 (gmt 0)

10+ Year Member



Hi.

I started getting a bunch of incorrect incoming URLs from Google over the last several days, all looking for jpg files. The URLs all have one thing in common: they have stuff appended after the real URL for each jpg. Here are four sample incoming URLs Google requested that triggered 404 errors:

http://example.com/blog/wp-content/uploads/2008/08/sample.jpg" width="37" height="50" alt="image"

http://example.com/folder/folder/photo/sample.jpg" width="39" height="50" alt="image"

http://example.com/blog/wp-content/uploads/2008/10/sample.jpg?q=sample+present+2008

http://example.com/folder/folder/photo/sample.jpg?q=sample+art+festival

I tried to correct it by simply stripping off whatever appears after the .jpg in any incoming request, since I could not imagine why I would ever want anything after the .jpg. Here is what I did:

RewriteRule ^([^.]+)\.jpg([^.]+)$ http://example.com/$1.jpg [R=301,L]

It apparently worked, and I got no more 404s when testing incoming requests. But jpg URLs that had a ? right after the g in the appended stuff, such as the last two of the four above, were returning 200 codes for Google. So I added this to grab and redirect those:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule \.jpg$ http://example.com/%1? [R=301,L]

It appears to be working, putting out 301's checking with tools online that check http requests and responses (don't think I'm supposed to name the website here), as well as seeing 301's in the logs and no more 404's are appearing in my logs. However, I feel like I am missing something for the following reasons:

1. The grab-everything code at the start of each rewrite is not the same; I adapted two different examples I found for the two different redirects.

2. Why did the first redirect not pick up the .jpg? stuff? With my limited knowledge, I thought it would.

3. Now, when Google crawls these, the log shows a 301, but some do not show Google then going to the correct shorter URL and putting out a 200 code, and others do. Not sure if this matters.

4. I feel like there must be a more elegant or maybe the better word is more efficient solution than two stepping it.

5. From checking various incoming requests in today's access log where Google got a 301, I think the redirect may be chained. If I check an incoming URL such as the one below, it returns a 301, but to a shorter URL that still has stuff appended after the .jpg. If I then check that somewhat shorter (but not yet fully corrected) URL, it goes to the right one ending in .jpg:

http://example.com/folder/folder/photo/sample.jpg%22%20width=%2276%22%20height=%2250%22%20alt=%22image%22/%3E%3C/a%3E%20%3C/div%3E%20%3Cdiv%20class=%22c0%20r%22%3E%3Ca%20href=%22/m/imgres?q=sample+sample1

6. Finally, I have this in my htaccess file, which was put there to deal with a very specific group of incomings (.jpg with amp appended so .jpgamp) a while ago, and which I likely should remove as redundant.

RewriteRule ^([^.]+)\.jpgamp$ http://example.com/$1.jpg [R=301,L]

Anyway, because of 1 through 6, even though all the 404s are gone and everything shows a 301 in the logs, I am uncomfortable and wanted to ask for the view of someone who actually knows what they are doing and is not winging it like yours truly. Any help would be greatly appreciated.

And the last thing, which may be relevant (to the last URL in No 5. above), is that I have in my htaccess file the great redirect code which I got from a discussion on these pages at this topic: [webmasterworld.com...] which deals with encoded incoming URL's, and the two above redirects (well all three counting the .jpgamp one) are before it in the htaccess file.

Greg

g1smd

11:53 pm on Feb 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month





This code will only match if there are more characters OTHER THAN a question mark or period after the .jpg part.

RewriteRule ^([^.]+)\.jpg([^.]+)$ http://example.com/$1.jpg [R=301,L]



Test your existing code by requesting example.com/image.jpgxyz and example.com/image.jpgxyz?abcdef from your server.

RewriteRule patterns match ONLY the path part of the URL request.

The original query string is also automatically reappended in the redirect.

To test the query string, you need a preceding RewriteCond testing the QUERY_STRING value.

To clear the query string data, the redirect target should have a question mark appended to it.
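
As a rough sketch of those last two points: the hostname is the placeholder used in this thread, the rule is assumed to live in the .htaccess file in the document root, and the single-dot condition is just one assumed way to test for a non-empty query string.

```apache
# Hypothetical sketch: strip any query string from requests whose path ends in .jpg
RewriteEngine On

# RewriteRule sees only the path, so the query string is tested separately.
# A single "." matches when the query string contains at least one character.
RewriteCond %{QUERY_STRING} .
# The trailing "?" on the target discards the original query string.
RewriteRule ^(.+\.jpg)$ http://example.com/$1? [R=301,L]
```

Once the redirect has been followed, the query string is empty, so the condition no longer matches and the rule should not fire a second time.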

kidcobra

12:03 pm on Feb 4, 2011 (gmt 0)

10+ Year Member



Thanks for the help.

I added a ? at the end of the first rewrite rule as shown, and it stopped the chained redirects. And I deleted the previous rewrite rule that stripped amp specifically, since it was redundant.

So I am left with the following:

RewriteRule ^([^.]+)\.jpg([^.]+)$ http://example.com/$1.jpg? [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule \.jpg$ http://example.com/%1? [R=301,L]

This clears everything after .jpg in every case I checked, except for incoming URLs where the appended stuff after .jpg begins with one or more unencoded spaces rather than a character. Spaces later in the appended stuff do not matter; the problem only occurs when one or more spaces come right after the g in .jpg, before any other character.

Would trying to cover this corner case essentially require an inefficient chained redirect as the only way to deal with it? If it's not important, would you recommend I not bother?

jdMorgan

9:25 pm on Feb 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, your first rule should handle the appended-space problem just as it is... The "[^.]+" pattern means "match one or more characters not containing a literal period here."

I may have missed something, but I don't see why you don't just use ".+" there, to catch *anything* that is appended to the ".jpg".

A single-rule solution would then be:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^.]+\.jpg)[^\ ]+\ HTTP/
RewriteRule \.jpg http://example.com/%1? [R=301,L]

Jim

kidcobra

11:38 pm on Feb 8, 2011 (gmt 0)

10+ Year Member



Thanks Jim. I used "[^.]+" because I just copied it from this old code I had, whose sole purpose was dealing with the amp appendage on a .jpg extension: RewriteRule ^([^.]+)\.jpgamp$ http://example.com/$1.jpg [R=301,L] . So I came up with the seemingly brilliant idea of replacing amp with the same code that was in front of .jpg, to catch everything (instead of just amp), on the assumption that since the code was catching everything in front, it would catch everything after as well. Obviously, I'm in the "a little knowledge is dangerous" category, and I was trying to work off existing code that I did not fully understand so as not to take down the entire website. Your code catches the situation with the literal periods, which mine was not getting.

In any event, I did replace my two cobbled together rules, with your much more elegant and complete solution, and it works like a charm and I appreciate your thoughts and your help.

One thing I didn't mention in my discussion was the problem of a good number of old files with .JPG extensions in caps, which were not being picked up by my code. My solution was to just copy the same code I had for .jpg, change .jpg to .JPG in three places in the two rules, and it worked. Is this the simplest and most efficient way to grab those, basically just doing the same with your more compact version?

g1smd

12:30 am on Feb 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You could use \.[Jj][Pp][Gg] or \.(JPG|jpg) here.

Even more simply, use the [NC] flag to match aNyCase.
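
For illustration, a sketch of Jim's single rule with the [NC] flag applied. This assumes the flag is wanted on both the condition and the rule, since each pattern contains the literal jpg; it is not tested against this particular site.

```apache
# Sketch only: case-insensitive variant of the single-rule redirect from earlier in the thread.
# [NC] on the condition lets ".jpg" in the pattern match .JPG, .Jpg, etc. in the request line.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^.]+\.jpg)[^\ ]+\ HTTP/ [NC]
# [NC] on the rule does the same for the path test.
# %1 is copied exactly as requested, so the original case of the filename is preserved.
RewriteRule \.jpg http://example.com/%1? [R=301,L,NC]
```

Because %1 keeps the requested case, a request for sample.JPG plus appended junk should redirect to sample.JPG rather than sample.jpg, which matters if the files on disk use the uppercase extension.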

kidcobra

6:11 pm on Feb 17, 2011 (gmt 0)

10+ Year Member



Thanks to both of you guys for helping me out on this. I got it working like a charm, which would never have happened without your guidance, and I appreciate it. Greg