Hi.
Over the last several days, Google has been requesting a bunch of incorrect URLs for jpg files. They all have one thing in common: extra stuff is appended after the real URL of each jpg. Here are four sample URLs Google requested that triggered 404 errors:
http://example.com/blog/wp-content/uploads/2008/08/sample.jpg" width="37" height="50" alt="image"
http://example.com/folder/folder/photo/sample.jpg" width="39" height="50" alt="image"
http://example.com/blog/wp-content/uploads/2008/10/sample.jpg?q=sample+present+2008
http://example.com/folder/folder/photo/sample.jpg?q=sample+art+festival
My plan was simply to strip whatever appears after the .jpg from any incoming request, since I cannot imagine why I would ever want anything after the .jpg. Here is what I did:
RewriteRule ^([^.]+)\.jpg([^.]+)$ http://example.com/$1.jpg [R=301,L]
It apparently worked, and I got no more 404s when testing incoming requests. But URLs with a ? right after the .jpg, such as the last two of the four above, were returning 200 codes to Google instead of redirecting. So I added this to catch and redirect those:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^?]*)\?
RewriteRule \.jpg$ http://example.com/%1? [R=301,L]
It appears to be working: online tools that check HTTP requests and responses (I don't think I'm supposed to name the website here) report 301s, and my logs show 301s and no more 404s. However, I feel like I am missing something, for the following reasons:
1. The "grab everything" code at the start of each rewrite is not the same, because I adapted two different examples I found for the two redirects.
2. Why did the first redirect not pick up the .jpg? URLs? With my limited knowledge, I thought it would.
3. Now, when Google crawls these, the log shows a 301. For some of them the log then shows Google requesting the correct shorter URL and getting a 200, but for others it does not. Not sure if this matters.
4. I feel like there must be a more elegant (or maybe the better word is efficient) solution than doing it in two steps.
5. From checking various requests in today's access log where Google got a 301, I think the redirect may be chained. If I check one such as the URL below, it returns a 301, but to a shorter URL that still has stuff appended after the .jpg. If I then check that somewhat shorter (but not yet fully corrected) URL, it redirects to the right one ending in .jpg:
http://example.com/folder/folder/photo/sample.jpg%22%20width=%2276%22%20height=%2250%22%20alt=%22image%22/%3E%3C/a%3E%20%3C/div%3E%20%3Cdiv%20class=%22c0%20r%22%3E%3Ca%20href=%22/m/imgres?q=sample+sample1
6. Finally, I have this in my htaccess file. I put it there a while ago to deal with a very specific group of incoming requests (.jpg with amp appended, so .jpgamp), and I should probably remove it as redundant:
RewriteRule ^([^.]+)\.jpgamp$ http://example.com/$1.jpg [R=301,L]
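For what it is worth, here is an untested sketch of what a single combined rule might look like. It rests on two assumptions from my reading of the mod_rewrite documentation: that a RewriteRule pattern is matched only against the URL path and never sees the query string (which would explain item 2), and that a trailing ? on the substitution drops the original query string (without it, the first rule would carry the ?q=... part over to the new URL, which might be the chain in item 5). It also assumes the htaccess file sits in the site root, that RewriteEngine On is already set, and that no directory name contains .jpg.

```apache
# Untested sketch: assumes this sits in the site-root .htaccess,
# after "RewriteEngine On", in place of the separate .jpg rules.

# Condition 1: the request has a non-empty query string (sample.jpg?q=...)
RewriteCond %{QUERY_STRING} . [OR]
# Condition 2: something follows .jpg in the path (sample.jpg" width=..., sample.jpgamp)
RewriteCond %{REQUEST_URI} \.jpg.
# Capture the path up to and including the first .jpg; the trailing ?
# drops the query string, so one 301 goes straight to the clean URL.
RewriteRule ^(.+?\.jpg) http://example.com/$1? [R=301,L]
```

If this holds up, a request for a clean .jpg URL with no query string would fail both conditions and pass through untouched, and the .jpgamp case in item 6 would be covered too, since amp is just more text after .jpg.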
Anyway, because of 1 through 6, even though all the 404s are gone and everything shows a 301 in the logs, I am uncomfortable and wanted to ask for the view of someone who actually knows what they are doing and is not winging it like yours truly. Any help would be greatly appreciated.
One last thing, which may be relevant to the last URL in item 5 above: my htaccess file also contains the great redirect code I got from a discussion on these pages at this topic: [webmasterworld.com...], which deals with encoded incoming URLs. The two redirects above (well, all three counting the .jpgamp one) come before it in the htaccess file.
Greg