

Generic Bot Filtering Criteria

What keywords do you use to filter?

         

wilderness

2:51 pm on May 2, 2008 (gmt 0)





System: The following 16 messages were cut out of thread at: http://www.webmasterworld.com/search_engine_spiders/3640095.htm [webmasterworld.com] by incredibill - 9:28 am on May 3, 2008 (PST -8)


capture, grab, fetch, reap, download, crawl, spider, link, and many more similarly worded UAs should be added for denial whenever they're noticed.
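A rough .htaccess sketch of that kind of keyword filter (untested; adjust the word list to taste):

RewriteEngine On
# deny any visitor whose user-agent contains one of the generic bot keywords
RewriteCond %{HTTP_USER_AGENT} (capture|grab|fetch|reap|download|crawl|spider|link) [NC]
RewriteRule .* - [F]

Bear in mind that broad words like "crawl", "spider", and "link" will also match some legitimate UAs, which is exactly the trade-off discussed further down the thread.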

incrediBILL

4:28 pm on May 2, 2008 (gmt 0)




My server blocked it because of the "http://" in the user agent ;)

No free advertising allowed on my server!

This seems to cause a problem only for people who install BSalsa, oops!
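For anyone who wants to try the same thing, a bare-bones sketch (not my exact rules) looks something like:

RewriteEngine On
# refuse any visitor whose user-agent advertises a URL
RewriteCond %{HTTP_USER_AGENT} http:// [NC]
RewriteRule .* - [F]

On its own this also catches search engine bots that put their info URL in the UA, so it only makes sense behind a whitelist.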

Hobbs

6:06 pm on May 2, 2008 (gmt 0)




Bill,
You're blocking Yahoo!
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Hobbs

6:18 pm on May 2, 2008 (gmt 0)




wilderness,
You're blocking Apple browser
Safari/413 UP.Link/6.3.1.15.0

wilderness

6:35 pm on May 2, 2008 (gmt 0)




wilderness,
You're blocking Apple browser
Safari/413 UP.Link/6.3.1.15.0

Hobbs,
If Apple is dense enough to use such a term in their UA?
They deserve denying.

The same goes for similar exceptions like "crawl" or "spider": Alta Vista or somebody else uses one of these and gets denied every time they visit my sites.

Don

Lord Majestic

7:39 pm on May 2, 2008 (gmt 0)




My server blocked it because of the "http://" in the user agent

Googlebot 2.1: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

It is a good custom to include in the user-agent a link back to the creator explaining what the bot is doing and why. I think people in this forum are the first ones to cry foul over bots that don't include such information in their user-agents.

As for blind blocking on the basis of keywords, it is just sad really - all you achieve is encouraging bot writers to avoid providing this information in the user-agent.

incrediBILL

10:32 pm on May 2, 2008 (gmt 0)




You're blocking Yahoo!

No, Yahoo is whitelisted.

What happens beyond my whitelist is black magic :)

incrediBILL

11:26 pm on May 2, 2008 (gmt 0)




It is a good custom to include in user-agent a link back to the creator explaining what the bot is doing and what for

Well stated from the perspective of a bot operator.

From the perspective of a webmaster, this feature is often abused, specifically on blogs, and it's called "REFERRER SPAM", which is why I block the "http://" in the first place.

If the user agent is whitelisted then it bypasses my referrer spam block.
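In stripped-down .htaccess terms it's roughly this (a simplified sketch - the UA names are just examples, and the real checks use more than the UA string):

RewriteEngine On
# whitelisted crawlers skip the rest of the rules
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteRule .* - [L]
# anything else carrying a URL in its UA is refused
RewriteCond %{HTTP_USER_AGENT} http:// [NC]
RewriteRule .* - [F]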

Lord Majestic

2:35 am on May 3, 2008 (gmt 0)




Well, I get lots of email spam but I don't block all the email I get, and I don't whitelist senders either, because you never know whether a new email from a stranger might actually be important - with this "block everyone but G/Y/M" attitude you are just making it harder for new search engines to appear.

Obviously it is your choice, but holding it against legit bot operators - who obey robots.txt in the first place - that they actually put a URL back to their site in the UA (you'd be the first to complain if it was not done) so that webmasters can decide whether or not to block them is wrong in my view.

This referer spam is overrated anyway - if you don't publish your log reports to the whole world then it won't affect you, and in any case search engines can trivially detect backlinks from such log reports - in fact, if they see lots of those spammed URLs in logs, it is easier for them to decide that the page was spammed. Effectively this spam approach helps weed out spam.

You guys just need to take it easy - your efforts are probably not making the smallest dent in spammers' activities anyway, so there is really no need to take it so personally.

IGMC.

Hobbs

7:15 am on May 3, 2008 (gmt 0)




What happens beyond my whitelist is black magic

Disclaimers for the coding-challenged, please. Limited knowledge is more dangerous than full ignorance, and I'm still floating somewhere in the middle :-)

If Apple is dense enough to use such a term in their UA?
They deserve denying.

wilderness, that's scary stuff - "link" is not an inherently harmful word. I'll give you reap, grab, and capture, but link?

wilderness

1:15 pm on May 3, 2008 (gmt 0)




wilderness, that's scary stuff, "link" is not an offensive harm meaning word, I'll give you reap, grab and capture, but link?

Really?
How many link grabbing tools are out there?
I use Xenu now and again to verify my own sites; however, the majority of the time it's denied.
There are many other similar tools.
Most of these run through entire sites in seconds.

Hobbs

2:02 pm on May 3, 2008 (gmt 0)




Sure, but you're blocking Apple Safari visitors instead of adding two more lines to your htaccess. I just block
^link
findlinks
and Xenu for Xenu Link Sleuth
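i.e., something along these lines (a sketch, not tested):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^link [NC,OR]
RewriteCond %{HTTP_USER_AGENT} findlinks [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Xenu [NC]
RewriteRule .* - [F]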

wilderness

3:12 pm on May 3, 2008 (gmt 0)




Hobbs,
You're really making a big deal over what, to me personally, is a non-issue.

Each webmaster must decide what is beneficial or detrimental to their own websites.

The Mac/Apple users who visit my widget sites are a smaller percentage than California residents, who in turn are exceeded by visitors from Oceanic countries.

And just to get your goat ;)
I also have the following OS/browser UAs denied:
Linux, Opera, and a couple of others. In other instances the denials are based on multiple criteria (UA & IP, or UA & Referer, or Referer & IP, etc., etc.) - see the sketch below.
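A rough example of pairing a UA with an IP range in mod_rewrite (the address range here is just a placeholder, not one of my actual blocks):

RewriteEngine On
# both conditions must match before the request is refused
RewriteCond %{HTTP_USER_AGENT} Opera [NC]
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]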

In addition, nearly all cell phones and/or PDAs are denied at my sites. The majority of my pages are lengthy articles and contain simply too much material for these small viewers.
It's not my intention (regardless of future technologies) to provide multiple views for different OSes.

There was a time when my web pages were my primary focus.
Today, however, my websites are simply a tool that makes available a very small selection of high-profile articles and images from my immense accumulation of older materials.
Should visitors not desire to conform to my restrictions (whether they are aware of the restrictions or not), then they may simply go somewhere else (non-existent) and view the materials.
(A broader explanation of this preference and my widgets is simply beyond the scope of this forum.)

Don

Hobbs

3:22 pm on May 3, 2008 (gmt 0)




I was thinking more along the lines of best practice for the benefit of others reading the thread, Don. Of course I understand and agree that each website's situation is different, even when the disease is the same :-)

wilderness

3:47 pm on May 3, 2008 (gmt 0)




Hobbs,
No matter how much detail or explanation we might include in each thread?
It will NEVER be enough.

How many times has Bill stated that he uses MULTIPLE criteria for his whitelisting?
And yet he's taken to task on "beginning with ["!...]
Go figure.

Don

incrediBILL

6:10 pm on May 3, 2008 (gmt 0)




Obviously it is your choice but holding against legit bot operators who obey robots.txt

Hundreds of copies of Nutch and Heritrix out there obey robots.txt and put a path to their server in the user agent with "http://" but that doesn't make them legit IMO until they have a viable service that can send traffic.

your efforts are probably not making a smallest dent in spammers activities anyway, there is really no need to take it so personal.

Here's an example of why it becomes personal.

One of them hit my server hard the other day during a period of heavy load. With all the visitors and legit bots on at the same time, their requests for 8K pages in a few seconds - a literal DoS - caused a complete server overload.

It took a couple of minutes for it to clear up, even with the automatic bot blocker nailing them quickly, because all of the legit visitors and bots were by then backlogged as the whole thing snowballed out of control.

So I can either:

a) block as much as possible, or
b) buy bigger server hardware, upgrading from dual to quad CPUs.

At the moment, blocking seems to be the cheapest solution, and I'd still need to do it even with quad CPUs for many other reasons as well.