
Webmaster General Forum

    
Bandwidth issue: scraper bots are eating my site
experienced

10+ Year Member



 
Msg#: 4315608 posted 5:32 am on May 21, 2011 (gmt 0)

Is there any way to protect my site from scraper bots? They are eating my website's bandwidth and I am very tired of it now. The spam bots use almost 600% of the bandwidth that my real users view.

Is there any way I can allow only the top engines to crawl my sites, so I don't even get the so-so bots?

I have been trying to block IPs with .htaccess, but it is very hard as the IPs change regularly.

 

londrum

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4315608 posted 10:18 am on May 21, 2011 (gmt 0)

There's a script in the PHP library on this site that I used to use. It lets you block anything that takes more than 10 pages every 5 seconds, or whatever. You could try that. But I use an open-source tool called "Bad Behavior" now, which is a bit better. You could try googling that.
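Not the actual script from the PHP library, but here's a minimal sketch of that kind of throttle in PHP, included near the top of every page (e.g. via auto_prepend_file). The thresholds and temp-file storage are placeholders, not a tested solution:

<?php
// Rough throttle sketch (not the forum library's script): count how many
// requests an IP has made in the last $window seconds and refuse the page
// once it passes $max_hits. A real version needs stale-file cleanup and a
// whitelist so good crawlers never get throttled.
$window   = 5;   // seconds
$max_hits = 10;  // pages allowed per window

$ip   = $_SERVER['REMOTE_ADDR'];
$file = sys_get_temp_dir() . '/hits_' . md5($ip);

// Keep only the timestamps that fall inside the current window.
$recent = array();
if (is_file($file)) {
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $t) {
        if ((int) $t > time() - $window) {
            $recent[] = (int) $t;
        }
    }
}
$recent[] = time();
file_put_contents($file, implode("\n", $recent), LOCK_EX);

if (count($recent) > $max_hits) {
    header('HTTP/1.1 403 Forbidden');
    exit('Too many requests.');
}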

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 11:08 am on May 21, 2011 (gmt 0)

Whitelist in robots.txt first (i.e. allow only what you want), then .htaccess all who disobey... a much shorter list of do-nots...
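For the robots.txt half, a whitelist looks roughly like this (the crawler names are only examples; pick your own short list). Anything that ignores it is a candidate for the .htaccess block list:

# Whitelist-style robots.txt sketch: the named crawlers may fetch everything,
# everyone else is disallowed. Only well-behaved bots will honor this.
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /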

centipede



 
Msg#: 4315608 posted 6:53 pm on May 21, 2011 (gmt 0)

robots.txt will do you no good against badly behaved scraper bots. They don't give a crap about your robots.txt rules.

Like @londrum said, get a plug-in or firewall or some other software tool that will automatically throttle those bottom-feeders.

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 8:02 pm on May 21, 2011 (gmt 0)

This is true, but will more swiftly reveal which are the bad bots. :)

The other method is to whitelist which UAs are allowed and not worry.

experienced

10+ Year Member



 
Msg#: 4315608 posted 11:02 am on May 23, 2011 (gmt 0)

I have just checked my access log and I can see the bad bots are coming from all different IP ranges. Plus the same bot, something identifying itself as "gecko", is using almost 100 IP addresses to crawl all my images and pages, pretty much every day.

What should I do? I have tried blocking the IPs, but now they have a new range.

Can I also tell from the log whether a visitor is a bot or a real user?

I simply want to allow some of the top engines and their IP addresses in .htaccess and have everything else go away.

Please suggest.

londrum

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4315608 posted 11:36 am on May 23, 2011 (gmt 0)

Do they all have the same user_agent? You could try blocking by that instead, in your .htaccess file.
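Something along these lines in .htaccess, for example ("BadScraper" is just a placeholder for whatever string shows up in your logs, and this assumes mod_rewrite is available):

RewriteEngine On
# Refuse any request whose User-Agent contains the placeholder string "BadScraper"
RewriteCond %{HTTP_USER_AGENT} BadScraper [NC]
RewriteRule .* - [F]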

centipede



 
Msg#: 4315608 posted 4:46 pm on May 23, 2011 (gmt 0)

@experienced The bots are a little slice of evil, I'm not about to defend them.

But maybe the problem isn't really the bots; it's your bandwidth. Most web hosting providers give way more bandwidth than their clients need, so if you're worried about the incremental cost of the bandwidth that a handful of bots are using up, I'm guessing your monthly bandwidth allocation isn't big enough in the first place.

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 5:08 pm on May 23, 2011 (gmt 0)

plus the same bot like gecko is using almost 100 IP address

Gecko is a rendering engine found in certain browsers... you might want to avoid using "gecko" alone as a user_agent string to block.

netmeg

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 5:55 pm on May 23, 2011 (gmt 0)

(oh, for Crawl Wall... hint hint hint)

Even though it's *called* the Search Engine Spider and User Agent forum, you'll probably find some hints and tricks here:

[webmasterworld.com...]

incrediBILL is the master on this topic.

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 9:22 pm on May 23, 2011 (gmt 0)

Whitelist your robots.txt and .htaccess file, then install a script to stop scrapers that speed through collecting pages; you can find one in our PHP forum and several on the web.

Food for thought, long since posted:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

JAB Creations

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4315608 posted 10:52 pm on May 23, 2011 (gmt 0)

Going by user agents only goes so far. While it's certainly part of an effective way of denying bots anything except an HTTP 403, it ultimately comes down to whether you understand the technical differences between legitimate browsers, legitimate search spiders and illegitimate bots.

- John

walkman



 
Msg#: 4315608 posted 9:49 am on May 24, 2011 (gmt 0)

My site was very recently scraped by a sc*mb*g from, say, Asia. Not a robot per se; it looked like a browser signature and a home connection. Took every page in 5 minutes. Ideally we should be able to set rules like: if you request x pages every 5 seconds with no images, or xx pages straight, you get banned for 24 hours.

statguy

5+ Year Member



 
Msg#: 4315608 posted 2:28 pm on May 24, 2011 (gmt 0)

Hi,

This is what I do in my httpd.conf file:

# Flag any request whose User-Agent contains "XYZ" or "ABC" (case-insensitive)
SetEnvIfNoCase User-Agent "XYZ" bad_bot
SetEnvIfNoCase User-Agent "ABC" bad_bot
# Refuse flagged requests everywhere on the site
<LocationMatch "/">
Deny from env=bad_bot
</LocationMatch>

So I look at my access log, identify the robots that I want to exclude and add a line in the above, replacing XYZ with the name of the bot in my logs (or a substring of the name that is unique to this robot).

Does that make sense?

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 2:48 pm on May 24, 2011 (gmt 0)

Does that make sense?


Yes, it's called blacklisting. It never ends and it's a waste of time: you're always chasing new user agents that arrive endlessly, daily, not to mention that some bots use random gibberish UAs, and the size of the ever-growing Apache config slows down your server.

Whitelisting is the only way to stop the insanity.

You allow all your favorite bots (Google, Slurp, Bing), allow legit browsers and smartphones, and everything else gets bounced. Done. The Apache config is minuscule compared to blacklisting, the server runs as fast as it should, and it kicks the junk to the curb in a blink.
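A bare-bones sketch of that in Apache 2.2 .htaccess terms; the UA substrings are purely illustrative, and a real whitelist has to be tested against your own logs so legitimate visitors don't get locked out:

# Mark the user agents you actually want: the search engines you care about
# plus anything that looks like a real browser. Everything unmarked is refused.
SetEnvIfNoCase User-Agent "Googlebot|Slurp|bingbot" allowed_ua
SetEnvIfNoCase User-Agent "Mozilla|Opera" allowed_ua

Order Allow,Deny
Allow from env=allowed_ua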

waynne

10+ Year Member



 
Msg#: 4315608 posted 3:23 pm on May 24, 2011 (gmt 0)

ZB Block is effective and pretty strict. It kills most spammers, hackers and scrapers in their tracks and saves me about 40% of my monthly bandwidth. I was also able to drop the resources on my cloud server and save money.

Oh, and drop in a "bot trap" as well. Create a robots.txt that forbids spiders from entering a directory, /bot/ for example, then just log the IPs of all visitors that hit this directory and ban them. I have these auto-added to a Deny from line in .htaccess, and legitimate users can even remove themselves.
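A minimal sketch of that kind of trap, assuming robots.txt already carries "Disallow: /bot/" and that something like this script answers anything under /bot/; the paths are assumptions and the self-removal page is left out:

<?php
// Hypothetical /bot/index.php -- robots.txt forbids this directory, so
// anything that reaches it has ignored robots.txt.
$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';

// Log the offender for later review.
file_put_contents(dirname(__FILE__) . '/trapped.log',
    date('c') . " $ip $ua\n", FILE_APPEND | LOCK_EX);

// Auto-append a deny rule to the site's .htaccess (the path is an assumption,
// and an Order directive must already be in place there for this to bite).
file_put_contents($_SERVER['DOCUMENT_ROOT'] . '/.htaccess',
    "Deny from $ip\n", FILE_APPEND | LOCK_EX);

header('HTTP/1.1 403 Forbidden');
echo 'Automated clients are not welcome here.';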

hugh

5+ Year Member



 
Msg#: 4315608 posted 8:21 am on May 25, 2011 (gmt 0)

vBulletin owners can update their spiders_vbulletin.xml to track bot activity in real time with Who's Online...

[vbulletin.com...]

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 10:57 am on May 25, 2011 (gmt 0)

vBulletin owners can update their spiders_vbulletin.xml to track bot activity in real time with Who's Online...


Waste of time.

By the time you've found the new spider to add to the list you've already been scraped, and now you're dragging some big fat spider list around for no reason; it's easily defeated by randomly changing the bot name.

Whitelisting solves that problem.

Whitelisting is only defeated by bots using browser UAs, which requires a script to detect in real time. How you detect this: bots coming from data centers aren't humans. Real browser UAs should never originate from a hosting data center (except for screenshot services), so they get trapped trying to bypass the whitelist. Catch-22, gotcha.
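One hedged way to approximate that check is a reverse-DNS lookup on the visitor's IP compared against a list of hosting-provider domains you maintain yourself; the suffixes below are examples only, and whitelisted crawlers should be verified separately (e.g. forward-confirmed reverse DNS for Googlebot):

<?php
// Sketch: a request claiming a browser UA but resolving back to a hosting
// network gets treated as a scraper. Build and maintain your own suffix list.
$ip   = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip);   // reverse DNS; may return the bare IP on failure

$datacenter_suffixes = array('amazonaws.com', 'googleusercontent.com', 'ovh.net');

foreach ($datacenter_suffixes as $suffix) {
    if ($host && substr($host, -strlen($suffix)) === $suffix) {
        header('HTTP/1.1 403 Forbidden');
        exit('Browser user agents from hosting networks are not accepted here.');
    }
}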

If people are serious about stopping scrapers, these amateur-hour blacklists won't cut it. They never did; they're a waste of time, which is why I never published my spider list, and it's huge, because it has no value whatsoever to people trying to stop spiders except a false sense of protection.

justawriter

5+ Year Member



 
Msg#: 4315608 posted 11:45 am on May 25, 2011 (gmt 0)

I'm with netmeg - what have we got to do to get you back working on CrawlWall, Bill... we've been hanging out for it ever since you first mentioned it on Twitter.

Stuart

brizad

10+ Year Member



 
Msg#: 4315608 posted 9:03 pm on May 27, 2011 (gmt 0)

I've read all of this and get that whitelisting is the way to go. But I'm not a coder and don't know how to write the scripts and code you guys are talking about.

So is there some code I can get to just paste into my .htaccess, httpd.conf or whatever?

Thanks

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 9:32 pm on May 27, 2011 (gmt 0)

Good place to start: [webmasterworld.com...]

brizad

10+ Year Member



 
Msg#: 4315608 posted 9:49 pm on May 27, 2011 (gmt 0)

Yes, thank you. That was referenced above, and as I said, I'd read all of that.

As far as I can tell, there's nothing definitive in any of that. As incrediBILL said in the post you linked to, "I wouldn't use the following AS-IS without a bit more work".

And again, not being a coder, I don't even know what "a bit more work" refers to. And that post was written five years ago, and so much has changed with the spiders, mobile, etc., that I'd imagine much of what it talks about is out of date anyway.

So what I was asking is if there's the exact code, scripts, etc. that I can copy/paste into my site and have it work. Plus some instructions for what exactly needs to be done.

Sorry, it's all beyond my knowledge base. I tried to hire someone to do it but couldn't find anyone that knew how to do it.

tangor

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4315608 posted 10:20 pm on May 27, 2011 (gmt 0)

You won't get turnkey copy-paste solutions here unless you've got some concept code you've attempted first...

mod_rewrite
SetEnvIf

Best friends... and YOUR list of who you want to let in for fun and games at your website... that list will be different for every webmaster.

(edit) What works for US sites will be different from UK, RU, CN, etc. This forum is international in scope so posting "use this code" to "accomplish this" may not work for all.
