List of Spiders to Block

blocking spiders for web analytics reporting

         

chicagoSEO

10:51 pm on Jan 6, 2006 (gmt 0)

10+ Year Member



So I don't have access to log files, but I need to exclude spider IP addresses in my web analytics tool. Does anyone have a list handy of common spider IP addresses that I should block?

Thanks!

keyplyr

7:29 am on Jan 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Welcome to WebmasterWorld [webmasterworld.com], chicagoSEO.

One man's trash is another's treasure!

incrediBILL

7:53 am on Jan 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's a no-win scenario trying to block spiders; they just keep coming and changing IPs.

Your best bet is to use your .htaccess to block everything that doesn't have Mozilla in the user agent, then expressly allow Google / Yahoo / MSN by ranges of IP addresses, then whitelist other bots as needed.
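For anyone who wants to see the decision logic spelled out, here is a minimal sketch in Python rather than .htaccess syntax. It is one reading of the rule above, not an actual config: the function name `is_allowed` is made up for illustration, and the IP ranges are reserved documentation ranges standing in for search engine ranges you would have to verify yourself.

```python
import ipaddress

# Hypothetical whitelist: these are reserved documentation ranges (RFC 5737),
# NOT real search engine ranges. Substitute ranges you have verified yourself.
ALLOWED_BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(ip: str, user_agent: str) -> bool:
    """Return True if the request should be served (illustrative only)."""
    addr = ipaddress.ip_address(ip)
    # Step 1: whitelisted crawler ranges always get in.
    if any(addr in net for net in ALLOWED_BOT_NETWORKS):
        return True
    # Step 2: everything else must at least claim to be a browser.
    return "Mozilla" in user_agent
```

Note that anything sending a Mozilla user agent still passes this check, so on its own it only stops clients that don't bother to disguise themselves.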

Then use AlexK's PHP script to snare scrapers pretending to be an HTTP client:
[webmasterworld.com...]

Dijkgraaf

4:44 am on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you actually trying to block the spiders from visiting your site, or are you just trying to filter them out of your stats?

incrediBILL

6:26 am on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not blocking legitimate spiders at all.

I'm blocking scrapers and other web aggregators that aren't providing any value to my web site.

Basically Google, MSN, Yahoo, Teoma and Gigablast (on the fence with them) get in and EVERYTHING else gets the boot.

To protect my content and intellectual property, I've changed my site to completely OPT-OUT for all crawlers; access is by whitelist invitation only.

I have the site locked down so hard the only way you'll steal 1,000 pages is with a minimum of 250 IPs, and they better not be on the same block.
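The arithmetic behind that claim: 1,000 pages spread over 250 IPs works out to 4 pages per IP, so it reads like a per-IP page cap plus some limit on neighbouring addresses. A rough sketch of that kind of quota in Python follows. The cap of 4, the /24 grouping, the block cap of 8, and the name `allow_page` are all assumptions inferred from the numbers above, not the actual setup, and a real version would reset counts over some time window, which is left out here.

```python
import ipaddress
from collections import Counter

PAGES_PER_IP = 4       # assumed: 1,000 pages / 250 IPs
PAGES_PER_BLOCK = 8    # assumed cap for a whole /24; purely illustrative

ip_hits = Counter()     # pages served per individual IP
block_hits = Counter()  # pages served per /24 block


def allow_page(ip: str) -> bool:
    """Count a page request and report whether it is still within quota."""
    # Group IPv4 addresses by their /24 block.
    block = ipaddress.ip_network(f"{ip}/24", strict=False)
    if ip_hits[ip] >= PAGES_PER_IP or block_hits[block] >= PAGES_PER_BLOCK:
        return False
    ip_hits[ip] += 1
    block_hits[block] += 1
    return True
```

With these numbers a single IP is cut off on its fifth page, and a third IP from the same /24 gets nothing once its neighbours have used up the block's allowance.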

chicagoSEO

4:43 pm on Jan 20, 2006 (gmt 0)

10+ Year Member



Thanks keyplyr.

Yeah, I sort of figured it's a no-win situation. The problem is that we transitioned from HitBox to IndexTools and reported traffic shot up big time, which I don't necessarily buy. So I am trying to block the bulk of the spider visits that might be artificially inflating page views.

I wish I could block the user agents, but IndexTools only allows you to block IP addresses.
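To make the goal concrete, this is roughly what IP-based exclusion does to a page view count. It is a sketch of the general idea, not how IndexTools works internally; `count_page_views` is a hypothetical helper and the excluded range is a reserved documentation range, not a real spider range.

```python
import ipaddress

# Hypothetical exclusion list using a documentation range (RFC 5737).
EXCLUDED = [ipaddress.ip_network("192.0.2.0/24")]


def count_page_views(visitor_ips):
    """Count page views whose IP is not in an excluded range."""
    return sum(
        1
        for ip in visitor_ips
        if not any(ipaddress.ip_address(ip) in net for net in EXCLUDED)
    )
```

For example, `count_page_views(["192.0.2.4", "203.0.113.9", "203.0.113.9"])` counts only the two views from the non-excluded address.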

incrediBILL

5:59 pm on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you're attempting to block access via IndexTools (if it's the IndexTools I'm thinking of), you aren't really blocking spiders. IndexTools uses JavaScript to function, which spiders don't run, so they wouldn't show up in IndexTools in the first place.

chicagoSEO

7:49 pm on Jan 20, 2006 (gmt 0)

10+ Year Member



That's good to know. Guess I should be happy about the increase in traffic!