Is there a list of user-agents to block as of today?

I'm launching a new website soon and want to avoid scrapers

shamrock

10:34 pm on Apr 4, 2014 (gmt 0)

10+ Year Member



I plan to launch a new website soon with original content, and want to prevent, as far as possible, access by known scrapers. Is there an up-to-date (2014) blacklist of user-agents I should block? The sticky posts here are from 2008, and I'm not sure they're still relevant.

thanks

incrediBILL

11:20 pm on Apr 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would check out the ultimate blacklist over at Perishable Press as they have a compact version that rocks:

[perishablepress.com...]

There's some other stuff over there worth looking at too:

[perishablepress.com...]

Beware of any list of this type: you're putting your trust in someone else's choices, with unknown consequences for your site.

Good luck!

keyplyr

12:10 am on Apr 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My warning is even stronger than Bill's.

No two websites are the same. What one webmaster sees as a bad agent, another will welcome, even solicit. Some sites block all social media bots; other sites flourish because of them.

Many so-called block lists contain antiquated user agents that no longer exist, or that have since become *legit* and can now be controlled via robots.txt (example: HTTrack).
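For instance, a crawler like HTTrack honors robots.txt by default, so it can be asked to stay out with just two lines (assuming the bot is well-behaved; a bot that ignores robots.txt needs a server-side block instead):

```
# robots.txt - politely turn away HTTrack
User-agent: HTTrack
Disallow: /
```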

I personally would not blindly install someone else's block list without researching each and every UA very closely.

shamrock

12:17 am on Apr 5, 2014 (gmt 0)

10+ Year Member



OK, but if I don't want HTTrack, for example, to access my website, does it matter whether I block it via .htaccess rather than robots.txt?

incrediBILL

12:41 am on Apr 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robots.txt doesn't block squat.

It's just a file that tells bots whether they're allowed in, and it's entirely up to the bot to honor it.

.htaccess actually has teeth and kicks them to the curb.

I do both. I'm nice to good bots: my robots.txt is a short whitelist that names the bots I allow and tells all others to go away. Then I block the hell out of the bad ones in .htaccess and PHP scripts, in case they ignore robots.txt, which most do.
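A whitelist-style robots.txt along those lines might look like this (the allowed bot names here are examples only, not recommendations):

```
# Whitelist approach: name the bots you welcome, disallow everyone else.
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

# Everything else: stay out (honored only by well-behaved bots)
User-agent: *
Disallow: /
```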

Making a big robots.txt file is just a waste of time.
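On the .htaccess side, a minimal sketch of user-agent blocking with mod_rewrite might look like the following. The agent strings are placeholders; per keyplyr's warning above, research each UA before blocking it on your own site:

```
# Return 403 Forbidden to requests whose User-Agent matches any listed string.
# Agent names are illustrative examples - vet your own list carefully.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|SiteSnagger) [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this is enforced by the server, so it works even against bots that never read robots.txt (though a scraper can still evade it by spoofing its user-agent).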

tangor

6:06 am on Apr 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm with incrediBILL. We had a similar discussion in January: [webmasterworld.com...]

Good insights re: whitelisting (who we let in) vs. blacklisting (who we keep out).