| 10:18 am on May 21, 2011 (gmt 0)|
There's a script in the PHP library on this site that I used to use. It lets you block anything that takes more than 10 pages every 5 seconds, or whatever threshold you choose. You could try that. But I now use an open-source tool called "Bad Behavior", which is a bit better. You could try googling that.
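The PHP script isn't linked here, but the idea behind that kind of throttle is simple enough to sketch. A minimal version in Python (the thresholds, names, and sliding-window approach are my own illustration, not the original script's):

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: ban any client requesting more than
# MAX_REQUESTS pages within WINDOW_SECONDS.
MAX_REQUESTS = 10
WINDOW_SECONDS = 5

hits = defaultdict(deque)   # ip -> timestamps of recent requests
banned = set()

def allow_request(ip, now=None):
    """Return False (and ban the IP) once it exceeds the rate limit."""
    now = time.time() if now is None else now
    if ip in banned:
        return False
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        banned.add(ip)
        return False
    return True
```

In a real deployment this check would run per request (and the ban list would be persisted somewhere), but the core logic is just a sliding window per IP.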
| 11:08 am on May 21, 2011 (gmt 0)|
Whitelist in robots.txt first (i.e. allow only what you want), then .htaccess all who disobey... much shorter list of do-nots...
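For reference, a whitelist-style robots.txt might look like this (the allowed crawlers here are just examples; pick your own). Anything that crawls the disallowed paths anyway has identified itself for your .htaccess blocklist:

```
# Allow only the crawlers you want (empty Disallow = allow everything)
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

# Everyone else: stay out entirely
User-agent: *
Disallow: /
```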
| 6:53 pm on May 21, 2011 (gmt 0)|
robots.txt will do you no good against badly behaved scraper bots. They don't give a crap about your robots.txt rules.
Like @londrum said, get a plug-in or firewall or some other software tool that will automatically throttle those bottom-feeders.
| 8:02 pm on May 21, 2011 (gmt 0)|
This is true, but will more swiftly reveal which are the bad bots. :)
The other method is to whitelist which UAs are allowed and not worry about it.
| 11:02 am on May 23, 2011 (gmt 0)|
I have just checked my access log and I can see the bad bots are coming from all different IP ranges; one bot identifying as "gecko" is using almost 100 IP addresses to crawl all my images and pages every day.
What should I do? I have tried blocking the IPs, but now they have a new range.
Can we also tell from the log whether a visitor is a bot or a user?
I simply want to allow some of the top engines and their IP addresses in .htaccess and have the rest go away.
| 11:36 am on May 23, 2011 (gmt 0)|
Do they all have the same user-agent? You could try blocking by that instead, in your .htaccess file.
| 4:46 pm on May 23, 2011 (gmt 0)|
@experienced The bots are a little slice of evil, I'm not about to defend them.
But maybe the problem isn't really the bots, it's your bandwidth. Most web hosting providers give their clients way more bandwidth than they need. If the incremental cost of the bandwidth that a handful of bots use up is a real worry, I'm guessing you don't have a big enough monthly bandwidth allocation in the first place.
| 5:08 pm on May 23, 2011 (gmt 0)|
|plus the same bot like gecko is using almost 100 IP address |
Gecko is a rendering engine found in certain browsers, and "Gecko" (or "like Gecko") appears in most legitimate browser user-agent strings... you might want to avoid using "gecko" alone as a user-agent string to block.
| 5:55 pm on May 23, 2011 (gmt 0)|
(oh, for Crawl Wall... hint hint hint)
Even though it's *called* Search Engine Spider and User Agent forum, you'll probably find some hints and tricks here:
incrediBILL is the master on this topic.
| 9:22 pm on May 23, 2011 (gmt 0)|
Whitelist in your robots.txt and .htaccess file, then install a script to stop scrapers that speed through collecting pages; you can find one in our PHP forum and several on the web.
Food for thought, long since posted:
| 10:52 pm on May 23, 2011 (gmt 0)|
Going by user agents only goes so far. While it's certainly part of an effective way of denying bots anything except an HTTP 403, it ultimately comes down to whether you understand the technical differences between legitimate browsers, legitimate search spiders, and illegitimate bots.
| 9:49 am on May 24, 2011 (gmt 0)|
My site was very recently scraped by a sc*mb*g from, say, Asia. Not a robot per se; it looked like a browser signature on a home connection. It took every page in 5 minutes. Ideally we should be able to set a rule like: if you request x pages in 5 seconds with no images, or xx pages in a row, you get banned for 24 hours.
| 2:28 pm on May 24, 2011 (gmt 0)|
This is what I do in my httpd.conf file (the Deny needs an Order/Allow context to take effect; adjust the Directory path to your own document root):
SetEnvIfNoCase User-Agent "XYZ" bad_bot
SetEnvIfNoCase User-Agent "ABC" bad_bot
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>
So I look at my access log, identify the robots that I want to exclude, and add a line to the above, replacing XYZ with the name of the bot as it appears in my logs (or a substring of the name that is unique to that robot).
Does that make sense?
| 2:48 pm on May 24, 2011 (gmt 0)|
Yes, it's called blacklisting, and it never ends. It's a waste of time: you're always chasing new user agents that appear daily, some bots use random gibberish UAs anyway, and the ever-growing Apache config slows down your server.
Whitelisting is the only way to stop the insanity.
You allow all your favorite bots (Google, Slurp, Bing), allow legit browsers and smartphones, and everything else gets bounced. Done. The Apache config is minuscule compared to blacklisting, the server runs as fast as it should, and junk gets bounced to the curb in a blink.
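A minimal sketch of that whitelist in .htaccess using mod_rewrite. The spider tokens are the usual ones; the browser test is deliberately rough and only illustrates the structure (a production whitelist would be far more careful):

```apache
RewriteEngine On
# Let the major spiders through.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Slurp) [NC,OR]
# Let ordinary browsers through (very rough: nearly all send "Mozilla").
RewriteCond %{HTTP_USER_AGENT} ^Mozilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Opera
# Matched something allowed: skip the deny rule below.
RewriteRule ^ - [S=1]
# Everything else gets a 403.
RewriteRule ^ - [F]
```

The skip flag ([S=1]) is what turns this into a whitelist: allowed requests jump over the catch-all [F] rule, and anything unrecognized falls through to the 403.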
| 3:23 pm on May 24, 2011 (gmt 0)|
ZBblock is effective and pretty strict. It kills most spammers, hackers, and scrapers in their tracks and saves me about 40% of my monthly bandwidth. I was also able to drop the resources on my cloud server and save money.
Oh, and drop in a "bot trap" as well. Create a robots.txt rule forbidding spiders from entering a directory, /bot/ for example, then log the IPs of all clients that hit that directory and ban them. I have these auto-added to a Deny from list in .htaccess, and legitimate users can even remove themselves.
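The trap's recording side is only a few lines. A sketch in Python of the "auto-add a Deny line" step (the blocklist filename and format are assumptions; the poster's actual script isn't shown):

```python
import os

# Hypothetical file of "Deny from" lines that .htaccess-level config reads.
BLOCKLIST = "blocked_ips.conf"

def trap_hit(ip, blocklist=BLOCKLIST):
    """Record an IP that entered the trap directory, once per IP."""
    line = f"Deny from {ip}\n"
    existing = ""
    if os.path.exists(blocklist):
        with open(blocklist) as f:
            existing = f.read()
    if line not in existing:
        with open(blocklist, "a") as f:
            f.write(line)
```

Since robots.txt forbids the directory, only clients that ignore robots.txt ever reach the trap, which is exactly the population you want banned.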
| 8:21 am on May 25, 2011 (gmt 0)|
vBulletin owners can update their spiders_vbulletin.xml to track bot activity in real time with Who's Online...
| 10:57 am on May 25, 2011 (gmt 0)|
|vBulletin owners can update their spiders_vbulletin.xml to track bot activity in real time with Who's Online... |
Waste of time.
By the time you've found the new spider to add to the list you've already been scraped, and now you're dragging some big fat spider list around for no reason; it's easily defeated by randomly changing the bot name.
Whitelisting solves that problem.
Whitelisting is only defeated by bots using browser UAs, which requires a script to detect in real time. The way you detect them: bots coming from data centers aren't humans. Real browser UAs should never originate from a hosting data center (except for screenshot services), so those bots trap themselves by trying to bypass the whitelist. Catch-22, gotcha.
If people are serious about stopping scrapers, these amateur-hour blacklists won't cut it and never did. That's why I never published my spider list, and it's huge: it has no value whatsoever to people trying to stop spiders, except a false sense of protection.
| 11:45 am on May 25, 2011 (gmt 0)|
I'm with Netmeg - what have we got to do to get you back working on CrawlWall, Bill... we've been hanging out for it ever since you first mentioned it on Twitter.
| 9:03 pm on May 27, 2011 (gmt 0)|
I've read all of this and get that whitelisting is the way to go. But I'm not a coder and don't know how to write the scripts and code you guys are talking about.
So is there some code I can get to just paste into my .htaccess, .httpd.conf or whatever?
| 9:32 pm on May 27, 2011 (gmt 0)|
Good place to start: [webmasterworld.com...]
| 9:49 pm on May 27, 2011 (gmt 0)|
Yes, thank you. That was referenced above, and as I said, I'd read all of that.
As far as I can tell, there's nothing definitive in any of that. As incrediBILL said in the post you linked to: "I wouldn't use the following AS-IS without a bit more work".
And again, not being a coder, I don't even know what "a bit more work" refers to. And that post was written 5 years ago, and so much has changed with spiders, mobile, etc., that I'd imagine much of what it covers is out of date anyway.
So what I was asking is if there's the exact code, scripts, etc. that I can copy/paste into my site and have it work. Plus some instructions for what exactly needs to be done.
Sorry, it's all beyond my knowledge base. I tried to hire someone to do it but couldn't find anyone that knew how to do it.
| 10:20 pm on May 27, 2011 (gmt 0)|
You won't get turnkey copy-paste solutions here unless you've got some concept code you've attempted first...
Best friends... and YOUR list of who you want to let in for fun and games at your website... that list will be different for every webmaster.
(edit) What works for US sites will be different from UK, RU, CN, etc. This forum is international in scope so posting "use this code" to "accomplish this" may not work for all.