bad-bot script catching lots of Google referrers

Is anybody else seeing this? Is this a Gmail thing?

         

stapel

7:46 pm on Sep 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have installed a "bad bot" script to block naughty spiders and scrapers.

[webmasterworld.com...]
[webmasterworld.com...]

Over the last month or so, I have suddenly been catching lots of people coming from a Google IP range. It does not appear that spiders are going where they don't belong; instead, it appears that people are coming from their Gmail accounts. I don't have Gmail, so I don't know what sort of setup and options there might be, and thus can't guess what people might be doing to get themselves banned.

Has anybody else seen this recently? And/or have any advice?

Thank you.

Eliz.

stapel

5:52 pm on Sep 12, 2005 (gmt 0)

I'm thinking now that perhaps I'm seeing the results of the Google toolbar installed in Firefox. The toolbar may be pre-fetching all links on each page, which includes the "bad bot" script link.

WebmasterWorld: Google Enables Firefox Prefetching [webmasterworld.com]

Google Webmasters FAQ [google.com]

I have inserted the "RewriteCond %{X-moz} ^prefetch" line into my .htaccess file. I'll post again in a few days, reporting on any changes.

Eliz.

jdMorgan

6:24 pm on Sep 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Eliz,

Can you look at these accesses in your raw log file to confirm? That would be the simplest way to figure out what these accesses are.

Jim

stapel

8:11 pm on Sep 12, 2005 (gmt 0)

jdMorgan said:
Can you look at these accesses in your raw log file to confirm?

The logfiles show what looks like perfectly innocent surfing with a Firefox browser. That is, the user does not appear to be trying to scrape scores of pages with, say, FrontPage or WebCapture. The users appear to be browsing. But they are "browsing" from Google IP addresses, which makes me wonder whether it is less the users browsing than Google's toolbar doing something in the background.
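
One way to check this, assuming you can edit the main server or virtual-host configuration (many shared hosts don't allow it), might be to log the X-moz request header alongside the usual fields, so that prefetch requests stand out from ordinary clicks. A rough sketch only; the format nickname and the log file name are made up for illustration:

```apache
# Server or virtual-host config only: LogFormat and CustomLog are not
# allowed in .htaccess files.
# The usual "combined" fields, plus the X-moz request header as a final
# column (logged as "-" when the header is absent).
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{X-moz}i\"" combined_xmoz

# Hypothetical log file name; point this wherever your other logs live.
CustomLog logs/access_xmoz_log combined_xmoz
```

Any hit on the trap link that shows "prefetch" in the last column would then have been a background fetch rather than a deliberate visit.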

Eliz.

jdMorgan

8:22 pm on Sep 12, 2005 (gmt 0)

I agree; it sounds like prefetching, which Google enables on the top results for some searches.

Blocking requests that carry the "X-moz: prefetch" header should take care of it, but you can also add exclusions for search engines and WAP/WML (internet-enabled phone) devices to prevent this and similar problems. These exclusions can be added either to mod_rewrite (if you're using it to create traps) or to the script itself, by examining the user-agent and the remote IP address.
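
As a rough illustration of the mod_rewrite approach, assume the trap is created by rewriting a hidden link to the script, and that the .htaccess file sits in the document root. The two paths, the user-agent names, and the IP range below are all placeholders, not taken from any real setup:

```apache
RewriteEngine On

# Send a visitor to the trap script only if it is NOT a known
# search-engine user-agent, NOT a WAP device, and NOT coming from an
# excluded address range. Successive RewriteCond lines are ANDed.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteCond %{HTTP_USER_AGENT} !(UP\.Browser|WAP)
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.
# Hypothetical hidden link and script path.
RewriteRule ^trap-link\.html$ /cgi-bin/trap.pl [L]
```

Anything matching one of the exclusions simply never reaches the script, so its IP is never written to the ban list.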

Jim

stapel

10:09 pm on Sep 12, 2005 (gmt 0)

I think I understand what you mean when you refer to search engines: add a bot exclusion to the "robots.txt" file. But I'm not sure about the rest.

How would one exclude the WAP cell-phone stuff?

The "bad bot" script is a Perl script in the cgi-bin folder. Bad bots and scrapers follow a link to it, and their IPs are added to the .htaccess file. Are you saying that the script should be writing IPs to itself?

What do you mean by adding exclusions to mod_rewrite? ("mod_rewrite" is in the .htaccess file, right? So you're saying that certain agents, such as "pre-fetch", should be directed elsewhere when they make the call for the "bad bot" script...?)

Thank you.

Eliz.

stapel

3:43 am on Oct 16, 2005 (gmt 0)

I kept having the same trouble with, as near as I can tell, Firefox pre-fetching, so I changed the coding in my .htaccess file from:

    RewriteCond %{X-moz} ^prefetch [NC]

...to:

    RewriteCond %{HTTP_X-MOZ} ^prefetch [NC]

This seems to have led to a sizeable decrease in problems. (But this might just be a coincidental pause, so I'll keep watching.)

However, I think there may be a separate problem with Google's Gmail that passes through the UUNet servers. Does anybody know if users can view web pages inside the Gmail interface or something? I'm seeing people who, according to the logs, only clicked on a link in a Gmail e-mail and then, for no readily apparent reason, immediately followed the other links on the page (including the hidden bad-bot link), just like with pre-fetching.

(What, exactly, is Google up to these days?)

I'll update if/when further information becomes available.

Eliz.

stapel

11:18 pm on Oct 24, 2005 (gmt 0)

FYI: There are different configurations of the "block pre-fetching command" shown on various web sites. I've added the following to my .htaccess file:

    RewriteCond %{HTTP_X-MOZ} ^prefetch [NC,OR]
    RewriteCond %{HTTP:X-MOZ} ^prefetch [NC,OR]

I can't verify that either of these is "the" correct command formatting, but I'm not getting any server errors, and the Google- and/or Firefox-related problems seem to have dwindled to nothing, or nearly so. (When you're not quite sure what the problem was, it can sometimes be difficult to be certain if it's been solved.)
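
For reference, a minimal sketch of what a complete rule might look like. It assumes the %{HTTP:header} form, which is the syntax the mod_rewrite documentation gives for reading an arbitrary request header, and it assumes that answering prefetch requests with a 403 is acceptable; treat it as a starting point rather than a tested recipe:

```apache
RewriteEngine On

# %{HTTP:X-moz} reads the X-moz request header; Firefox prefetch
# requests send "X-moz: prefetch".
# This is the only (and therefore last) condition, so it carries no OR flag.
RewriteCond %{HTTP:X-moz} ^prefetch [NC]

# No substitution ("-"): just refuse the request with 403 Forbidden.
RewriteRule .* - [F]
```

A refused prefetch never follows the hidden link, so the visitor's IP never reaches the bad-bot script.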

Eliz.

treeline

1:14 am on Oct 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Take a look at the extensions for Firefox. Several attempt to speed up browsing by prefetching, in the background, everything linked from the page. Faster browsing for the user, busier servers for us.