

Apache Web Server Forum

    
Problem with googlebot and htaccess
I have a problem with htaccess and googlebot
ro101




msg:4542028
 3:33 pm on Feb 3, 2013 (gmt 0)

Hi all,

I have a problem with my site. I'm trying to watermark all my images in order to prevent hotlinking from other sites but I have a weird problem.

It appears to be working fine and the watermark is shown, but if I look at my page through avivadirectory.com/bethebot/ to see how Googlebot "sees" my site, the images appear with the watermark too.

I know that something should be changed in .htaccess to let Googlebot and Googlebot-Image see all my pictures without that watermark, but can't figure out what.

This is what I have in my htaccess right now. Note that I added two lines for the Googlebot user agents, but they are not working. Does anyone have an idea? Thanks a lot.

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www.mysite.com [NC]
RewriteRule ^(.*\.(png|gif|jpe?g))$ http://www.mysite.com/wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]
</IfModule>

 

g1smd




msg:4542036
 4:05 pm on Feb 3, 2013 (gmt 0)

As the "not googlebot" condition is unanchored, it also covers the "not googlebot-image" situation.

What happens when someone accesses via the non-www version of your site?

The leading .* in the rule pattern is very inefficient, and should be replaced with something more exact.

Literal periods in patterns should be escaped.

Why are you 302 redirecting to a different URL?

You don't need the IfModule container around this code.

wilderness




msg:4542055
 5:34 pm on Feb 3, 2013 (gmt 0)

It's NOT your htaccess that applies the script, rather your PHP.

Applying watermarks in this manner (rather than embedding watermarks) presents users with something very similar to the Java-no-save-image option.
The false-prevention is easily circumvented.

In any event, if you prevent a user, and allow Google, what prevents the user from going to Google to retrieve the image (the latter without your knowledge)?

The simple solution is to list Google (and other major SEs) in your robots.txt, and specifically in your image directories.

Unless you're a photographer selling images, there is NOT any benefit to offering your images to SEs.
Visitors who are looking for specifically named images couldn't care less about the other content on your website(s), so there is no benefit in serving the image.

ro101




msg:4542151
 1:23 am on Feb 4, 2013 (gmt 0)

Hi,

G1SMD - How can I see the non www version of my site?

WILDERNESS - I'm trying to show a watermark in the new Google Images engine so that users click on it and come to my site. I've done that, but if I check the site with the tool I shared in the first post, it shows that Googlebot sees the watermarked images instead of the original ones when it accesses my site (this doesn't happen to a user who accesses it). I want to show Googlebot-Image the normal photo when it accesses my website crawling for images.

wilderness




msg:4542159
 1:56 am on Feb 4, 2013 (gmt 0)

Then you need to correct the syntax errors as g1smd advised, and then answer the question he asked.

lucy24




msg:4542162
 2:02 am on Feb 4, 2013 (gmt 0)

"What happens when someone accesses via the non-www version of your site?"

They get redirected to the with-www version, of course. You've posted the code yourself at least 8,000 times ;)

Genuine image requests will never come from anything but the canonical form of your sitename, representing the page that your user is actually on. If the domain name is in the wrong form, it's forged and deserves to be blocked.
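
For readers who have not seen it, the non-www to www redirect mentioned above is usually written along these lines. This is only a minimal sketch, assuming the canonical hostname is www.example.com (a placeholder, not the poster's real domain) and that the site is served over plain http:

```apache
RewriteEngine On

# Redirect any request whose Host header is not exactly the canonical name.
# The trailing ? on the group also accepts an empty Host (HTTP/1.0 clients),
# so those requests are not redirected in a loop.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

With a rule like this placed ahead of the image rules, a page requested as example.com/... is sent to the www form first, so image requests made by real browsers carry the canonical hostname in the referer.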

incrediBILL




msg:4542166
 2:50 am on Feb 4, 2013 (gmt 0)

Another point, "Googlebot [NC]" is just silly.

If you want to validate Googlebot, you use their spelling and case as-is, which avoids spoofers that use "googlebot" or something equally wrong, which "[NC]" allows.
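
To make the point concrete, here is a sketch of the case-sensitive form of the condition. It is a fragment meant to sit in front of the existing RewriteRule, and the user-agent strings in the comments are illustrative examples rather than a complete list:

```apache
# No [NC]: "Googlebot" must appear with exactly this casing.
# The condition is true (so the rule may apply) only when the user agent
# does NOT contain "Googlebot". It therefore exempts, for example,
#   Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
#   Googlebot-Image/1.0
# but does not exempt "googlebot" or "GOOGLEBOT".
RewriteCond %{HTTP_USER_AGENT} !Googlebot
```

This only filters out sloppy spoofers. A string match on its own cannot prove that a request really comes from Google.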

ro101




msg:4542248
 1:24 pm on Feb 4, 2013 (gmt 0)

Thanks for the responses. Sorry to insist, but I'm not a programmer and it's very difficult for me to understand all the advice you are giving me. Could someone paste the code I shared in the first post with the modifications I should make to test? Thanks all again.

lucy24




msg:4542302
 5:14 pm on Feb 4, 2013 (gmt 0)

"Could someone paste"

Nope, not in this forum :)

To recapitulate:

RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]

Condition #2 is contained within Condition #1, since neither has a closing anchor.

The [NC] tag here is wrong, because it will permit incorrectly cased spoofers to get in.

RewriteCond %{HTTP_REFERER} !^$
ymmv but it is safer to express this as !^-?$

RewriteCond %{HTTP_REFERER} !^http://www.example.com [NC]

Here too the [NC] is wrong. Although domain names (unlike the rest of a URL) are case-insensitive, you've only got one canonical form. Anything else would be a forged referer.

RewriteRule ^(.*\.(png|gif|jpe?g))$ http://www.example.com/wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]

The use of a full protocol-and-domain changes the intended rewrite into a redirect-- a 302 at that. In the case of images, this is not just unwanted but will break the rule.

Never use .* or .+ in non-final position. Find a formulation that will keep the Regular Expression from having to backtrack at the end. Here the simplest form is
^([^.]+\.(png|jpe?g|gif))

assuming that unlike, ahem, apache dot org, you have no URLs that contain periods anywhere other than immediately before extensions

You have correctly left out NC. The form jpe?g is conventional but probably redundant unless you really use both. List only the extensions that actually occur on your site; anything else can get a 404 up front.

And, finally, make sure that the php at the end of your rewrite returns the appropriate 404 if the request was for an image file that doesn't exist. Not as crucial as for page requests, but still a good habit.
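
One way to get the 404 behaviour without relying on the script is to let Apache check for the file first. This is an alternative to the PHP-side check described above, not something suggested in the thread. It assumes the rules live in the .htaccess at the site root and that the script path is the one from the first post:

```apache
# Only rewrite when the requested image exists on disk. Requests for missing
# files skip the rule and are left to the server's normal handling,
# typically a 404 response.
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^([^.]+\.(png|jpe?g|gif))$ /wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]
```

In a real ruleset this condition would sit alongside the user-agent and referer conditions.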

Whether rewriting from an image extension (jpg et cetera) to php will work at all is a whole nother question.
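
Putting the points of this recap together, the ruleset might end up looking something like the sketch below. It is untested and rests on several assumptions: www.example.com stands in for the real canonical hostname, the .htaccess sits at the site root, and watermark.php accepts the image path in its img parameter as in the first post. The closing $ on the pattern is carried over from the original rule, and the trailing slash in the referer test is an extra precaution not mentioned in the thread.

```apache
RewriteEngine On

# One case-sensitive, unanchored test covers both Googlebot and Googlebot-Image.
RewriteCond %{HTTP_USER_AGENT} !Googlebot
# Let through requests with an empty referer, or a literal "-".
RewriteCond %{HTTP_REFERER} !^-?$
# Let through requests referred by the canonical hostname. Periods are escaped,
# there is no [NC], and the trailing slash stops a hostname such as
# www.example.com.other.tld from matching.
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/
# No protocol or hostname in the target, so this is an internal rewrite
# rather than a 302 redirect.
RewriteRule ^([^.]+\.(png|jpe?g|gif))$ /wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]
```

The IfModule wrapper is left out, as suggested earlier in the thread.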


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About
© Webmaster World 1996-2014 all rights reserved