Is there any way to protect my site from scraper bots? They are eating my bandwidth and I am very tired of it now. The spam bots are using almost six times the bandwidth that my real users do.
Is there any way I can allow only the top search engines to crawl my sites, so I don't get every so-so bot?
I have been trying to block IPs in .htaccess, but it is very hard because the IPs change regularly.
There's a script in the PHP library on this site that I used to use. It lets you block anything that requests more than, say, 10 pages every 5 seconds. You could try that. But I now use an open-source tool called "Bad Behavior", which is a bit better. You could try googling that.
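The rate-limit idea can be sketched in a few lines. This is an illustrative stand-in for a script like the one mentioned above, not its actual code; the window and limit values are the kind of thing you'd tune for your own traffic:

```python
from collections import defaultdict, deque
import time

WINDOW = 5.0   # seconds to look back
LIMIT = 10     # max pages allowed per window (tune to taste)

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def should_block(ip, now=None):
    """Record a request from ip; return True once it exceeds LIMIT per WINDOW."""
    if now is None:
        now = time.monotonic()
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    return len(q) > LIMIT
```

In practice you'd call `should_block()` at the top of each page request and serve a 403 (and perhaps log the IP) when it returns True.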
I have just checked my access log and I can see the bad bots are coming from all different IP ranges. The same bot (its UA contains "Gecko") is using almost 100 IP addresses to crawl all my images and pages, every day.
What should I do? I have tried blocking the IPs, but now they have a new range.
Can I also tell from the log whether a visitor is a bot or a real user?
I simply want to allow some of the top engines and their IP addresses in .htaccess and have everything else go away.
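On telling bots from users in the log: the crudest signal is the user-agent string. A rough sketch (the keyword list is illustrative, and scrapers can fake a browser UA, so treat it as a first pass only; requests that never fetch images or CSS are another giveaway):

```python
import re

# Flag a request as a likely bot if its user agent contains common
# crawler keywords. Real bots can fake a browser UA, so this only
# catches the ones that identify themselves.
BOT_UA = re.compile(r"bot|crawl|spider|slurp|curl|wget", re.IGNORECASE)

def looks_like_bot(user_agent):
    return bool(BOT_UA.search(user_agent))
```

You'd run each log line's UA field through this to get a first rough split of your traffic.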
@experienced The bots are a little slice of evil, I'm not about to defend them.
But maybe the problem isn't really the bots, it's your bandwidth. Most web hosting providers provide way more bandwidth than their clients need. If you're worried about the incremental cost of the bandwidth that a handful of bots are using up, I'm guessing that you don't have enough monthly bandwidth allocation.
Going by user agents only gets you so far. While it's certainly part of an effective way of denying bots anything except an HTTP 403, it ultimately comes down to whether you understand the technical differences between legitimate browsers, legitimate search spiders, and illegitimate bots.
My site was very recently scraped by a sc*mb*g from, say, Asia. Not a robot per se; it had a browser signature and what looked like a home connection. It took every page in 5 minutes. Ideally we should be able to set a rule like: if you request x pages in 5 seconds with no images, or xx pages in a row, you get banned for 24 hours.
So I look at my access log, identify the robots that I want to exclude, and add a line to the above, replacing XYZ with the name of the bot in my logs (or a substring of the name that is unique to that robot).
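For reference, a blacklist rule of that shape might look like this in .htaccess (Apache 2.2-style syntax; "XYZ" and "BadBot" are placeholders for the names you find in your own logs):

```apache
# Mark offending user agents (case-insensitive substring match)
SetEnvIfNoCase User-Agent "XYZ" bad_bot
SetEnvIfNoCase User-Agent "BadBot" bad_bot

# Refuse marked requests with a 403
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```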
Yes, it's called blacklisting, and it never ends. It's a waste of time: you're always chasing new user agents that appear daily, some bots use random gibberish UAs, and the never-ending Apache ruleset slows down your server.
Whitelisting is the only way to stop the insanity.
You allow all your favorite bots (Google, Slurp, Bing), allow legit browsers and smartphones, and everything else gets bounced. Done. The Apache ruleset is minuscule compared to a blacklist, the server runs as fast as it should, and junk gets kicked to the curb in a blink.
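A minimal sketch of that whitelist in .htaccess, assuming mod_rewrite is available; the user-agent substrings are illustrative and would need tuning against your own traffic:

```apache
RewriteEngine On
# Let through the big engines and anything presenting a browser-like UA;
# everything else gets a 403. Substrings are examples, not an exhaustive list.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|bingbot) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera) [NC]
RewriteRule .* - [F,L]
```

Note that bots can fake a "Mozilla" UA, so on its own this only filters out the lazy ones.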
ZB Block is effective and pretty strict. It kills most spammers, hackers, and scrapers in their tracks and saves me about 40% of my monthly bandwidth. I was also able to drop the resources on my cloud server and save money.
Oh, and drop in a "bot trap" as well. Create a robots.txt that forbids spiders from entering a directory, /bot/ for example, then log the IPs of everything that hits that directory and ban them. I have these auto-added to a Deny from list in .htaccess, and legitimate users can even remove themselves.
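The robots.txt side of the trap is just a `Disallow: /bot/` rule under `User-agent: *`; anything that ignores it and requests /bot/ anyway is a misbehaving bot. The auto-ban step could be sketched like this (helper name and .htaccess layout are illustrative, assuming Apache 2.2-style Deny directives):

```python
# Hypothetical sketch of the trap's ban step: whatever script serves /bot/
# records the visitor's IP and appends a "Deny from" line to .htaccess.

def add_ban(htaccess_text, ip):
    """Return htaccess_text with a 'Deny from <ip>' line appended (idempotent)."""
    line = "Deny from " + ip
    if line in htaccess_text.splitlines():
        return htaccess_text
    return htaccess_text.rstrip("\n") + "\n" + line + "\n"
```

A self-removal page for legitimate users would do the reverse: strip the matching line and write the file back.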
vBulletin owners can update their spiders_vbulletin.xml to track bot activity in real time with Who's Online...
Waste of time.
By the time you've found the new spider to add to the list, you've already been scraped. And now you're dragging some big fat spider list around for no reason; it's easily defeated by randomly changing the bot name.
Whitelisting solves that problem.
Whitelisting is only defeated by bots using browser UAs, and detecting those requires a script that works in real time. The way you detect them: bots coming from data centers aren't humans. Real browser UAs should never originate from a hosting data center (except for screenshot services), so they get trapped trying to bypass the whitelist. Catch-22, gotcha.
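One way to sketch that data-center check: reverse-resolve the client IP (e.g. with `socket.gethostbyaddr`) and flag hostnames belonging to known hosting providers. The suffix list here is illustrative and far from complete:

```python
# Hostname suffixes of some well-known hosting providers (examples only).
HOSTING_SUFFIXES = (".amazonaws.com", ".googleusercontent.com", ".ovh.net")

def is_hosting_hostname(hostname):
    """True if a reverse-DNS hostname belongs to a known hosting provider."""
    return hostname.lower().endswith(HOSTING_SUFFIXES)
```

Genuine search engines also crawl from data centers, of course, so you'd whitelist their UAs first; Google documents verifying Googlebot via forward-confirmed reverse DNS, which slots neatly into the same check.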
If people are serious about stopping scrapers, these amateur-hour blacklists won't cut it. They never did; they're a waste of time. That's why I never published my spider list, and it's huge: it has no value whatsoever to people trying to stop spiders, except a false sense of protection.
Yes, thank you. That was referenced above, and as I said, I'd read all of that.
As far as I can tell, there's nothing definitive in any of that. As incredibill said in that post you linked to "I wouldn't use the following AS-IS without a bit more work".
And again, not being a coder, I don't even know what "a bit more work" refers to. And that post was written 5 years ago, and so much has changed with spiders, mobile, etc., that I'd imagine much of what it talks about is out of date anyway.
So what I was asking is if there's the exact code, scripts, etc. that I can copy/paste into my site and have it work. Plus some instructions for what exactly needs to be done.
Sorry, it's all beyond my knowledge base. I tried to hire someone to do it but couldn't find anyone who knew how.