
Forum Moderators: Ocean10000 & incrediBILL & phranque


Problem with googlebot and htaccess

I have a problem with htaccess and googlebot

     

ro101

3:33 pm on Feb 3, 2013 (gmt 0)



Hi all,

I have a problem with my site. I'm trying to watermark all my images in order to prevent hotlinking from other sites but I have a weird problem.

It appears to be working fine and the watermark is shown, but when I look at my page through avivadirectory.com/bethebot/ to see how googlebot "sees" my site, the images appear with the watermark there too.

I know that something should be changed in .htaccess to let Googlebot and Googlebot-Image see all my pictures without that watermark, but can't figure out what.

This is what I have in my htaccess right now. Note that I added two lines for the Google bot user agents, but they are not working. Does anyone have an idea? Thanks a lot.

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www.mysite.com [NC]
RewriteRule ^(.*\.(png|gif|jpe?g))$ http://www.mysite.com/wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]
</IfModule>

g1smd

4:05 pm on Feb 3, 2013 (gmt 0)




As the "not googlebot" condition is unanchored, it also covers the "not googlebot-image" situation.

What happens when someone accesses via the non-www version of your site?

The leading .* in the rule pattern is very inefficient, and should be replaced with something more exact.

Literal periods in patterns should be escaped.

Why are you 302 redirecting to a different URL?

You don't need the ifModule container around this code.

wilderness

5:34 pm on Feb 3, 2013 (gmt 0)




It's NOT your htaccess that applies the script, rather your PHP.

Applying watermarks in this manner (rather than embedding watermarks) presents users with something very similar to the Java-no-save-image option.
The false-prevention is easily circumvented.

In any event, if you block a user and allow google, what prevents the user from going to google to retrieve the image (the latter without your knowledge)?

The simple solution is to list google (and other major SE's) in your robots.txt, and specifically in your image directories.

Unless you're a photographer selling images, there is NOT any benefit to offering your images to SE's.
Visitors who are looking for specifically named images couldn't care less about the other content on your website(s), so serving them the image brings you no benefit.

ro101

1:23 am on Feb 4, 2013 (gmt 0)



Hi,

G1SMD - How can I see the non www version of my site?

WILDERNESS - I'm trying to show a watermark in the new google images engine so that users click on it and reach my site. I've done that, but if I check the site with the tool I shared in the first post, it shows me that googlebot sees the watermarked images instead of the original ones when it accesses my site (this doesn't happen to a user who accesses it). I want to show Googlebot-Image the normal photo when it accesses my website crawling for images.

wilderness

1:56 am on Feb 4, 2013 (gmt 0)




Then you need to correct the syntax errors as g1smd advised, and then answer the question he asked.

lucy24

2:02 am on Feb 4, 2013 (gmt 0)




What happens when someone accesses via the non-www version of your site?

They get redirected to the with-www version, of course. You've posted the code yourself at least 8,000 times ;)

Genuine image requests will never come from anything but the canonical form of your sitename, representing the page that your user is actually on. If the domain name is in the wrong form, it's forged and deserves to be blocked.

incrediBILL

2:50 am on Feb 4, 2013 (gmt 0)




Another point, "Googlebot [NC]" is just silly.

If you want to validate Googlebot, use their spelling and case as-is. That shuts out spoofers that send "googlebot" or something equally wrong, which "[NC]" would let through.

ro101

1:24 pm on Feb 4, 2013 (gmt 0)



Thanks for the responses. Sorry to insist, but I'm not a programmer and it's very difficult for me to understand all the advice you are giving me. Could someone paste the code I shared in the first post with the modifications I should make, so I can test it? Thanks all again.

lucy24

5:14 pm on Feb 4, 2013 (gmt 0)




Could someone paste

Nope, not in this forum :)

To recapitulate:

RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]

Condition #2 is contained within Condition #1, since neither has a closing anchor.

The [NC] tag here is wrong, because it will permit incorrectly cased spoofers to get in.
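
A minimal sketch of those two points folded together, assuming nothing else in the file depends on the second line. The single unanchored, case-sensitive pattern also covers "Googlebot-Image", so one condition is enough. It only looks at the user-agent string, which anyone can send, so it is not proof that the visitor really is Google:

```apache
# The rule is skipped for any user agent containing "Googlebot" (exact case).
# Unanchored, so this also matches "Googlebot-Image/1.0".
RewriteCond %{HTTP_USER_AGENT} !Googlebot
```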

RewriteCond %{HTTP_REFERER} !^$

ymmv but it is safer to express this as !^-?$

RewriteCond %{HTTP_REFERER} !^http://www.example.com [NC]


Here too the [NC] is wrong. Although domain names (unlike the rest of a URL) are case-insensitive, you've only got one canonical form. Anything else would be a forged referer.
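
Put together, the two referer conditions might look like this. It is only a sketch: www.example.com stands in for the real hostname, the periods are escaped as advised earlier in the thread, and the trailing slash after the hostname is an added assumption meant to stop lookalikes such as www.example.com.evil.test from passing. It also assumes the site is served over plain http only.

```apache
# The rule is skipped when the referer is empty (or a bare "-")
RewriteCond %{HTTP_REFERER} !^-?$
# The rule is skipped when the referer is the site itself, canonical form only
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/
```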

RewriteRule ^(.*\.(png|gif|jpe?g))$ http://www.example.com/wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]


The use of a full protocol-and-domain changes the intended rewrite into a redirect-- a 302 at that. In the case of images, this is not just unwanted but will break the rule.

Never use .* or .+ in non-final position. Find a formulation that will keep the Regular Expression from having to backtrack at the end. Here the simplest form is
^([^.]+\.(png|jpe?g|gif))

assuming that unlike, ahem, apache dot org, you have no URLs that contain periods anywhere other than immediately before extensions

You have correctly left out NC. The form jpe?g is conventional but probably redundant unless you really use both. List only the extensions that actually occur on your site; anything else can get a 404 up front.
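
One way the rule line could look with both changes applied, as an untested sketch. It assumes the .htaccess file sits in the site's document root and that watermark.php accepts the path in img= exactly as the original rule passed it. The closing $ is carried over from the original rule, and the root-relative target (no protocol or domain) is what keeps this an internal rewrite.

```apache
# Internal rewrite: the visitor still sees the image URL, but watermark.php answers.
# [^.]+ stops at the first period, so the pattern never has to backtrack.
RewriteRule ^([^.]+\.(png|jpe?g|gif))$ /wp-content/plugins/watermarknewplugin/watermark.php?img=$1 [L]
```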

And, finally, make sure that the php at the end of your rewrite returns the appropriate 404 if the request was for an image file that doesn't exist. Not as crucial as for page requests, but still a good habit.

Whether rewriting from an image extension (jpg et cetera) to php will work at all is a whole nother question.