Filtering Out Really Hard To Find Bad Bots


blend27 - 3:42 am on Jan 21, 2013 (gmt 0)


@lucy24
Logs of course have one advantage over real-time activity: you can see what the next request will be.

Now take the knowledge you have learned, create a MySQL/MSSQL schema, and log all that info:

request headers
robots.txt access
URI requested/QueryString/Referrer
UAs
IPs (including rDNS)
hosting ranges
country ranges (2 indexed views: search the allowed view first; if not found, search the not-allowed view, then log the data and block)
media files access
speed of access
Errors, redirects / Click Path / Scrape Path
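
To make the list above a bit more concrete, here is a minimal sketch of one request-log table and one query against it. Everything in it is an assumption for illustration: the table and column names are made up, the sample URI is hypothetical, and SQLite stands in for MySQL/MSSQL only so the example runs anywhere. A real setup would likely split this across several tables.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# and SQLite stands in for MySQL/MSSQL so the sketch runs anywhere.
SCHEMA = """
CREATE TABLE requests (
    id          INTEGER PRIMARY KEY,
    ts          TEXT NOT NULL,      -- ISO 8601 timestamp of the request
    ip          TEXT NOT NULL,
    rdns        TEXT,               -- NULL until the lookup completes
    user_agent  TEXT,
    uri         TEXT NOT NULL,
    query       TEXT,
    referrer    TEXT,
    status      INTEGER,            -- errors and redirects show up here
    headers     TEXT                -- header names, in the order received
);
CREATE INDEX idx_requests_ua_uri_ts ON requests (user_agent, uri, ts);
CREATE INDEX idx_requests_ip_ts ON requests (ip, ts);
"""


def open_log(path=":memory:"):
    """Open (or create) the log database and install the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_request(conn, ts, ip, user_agent, uri, query="", referrer="",
                status=200, headers="", rdns=None):
    """Insert one request row."""
    conn.execute(
        "INSERT INTO requests "
        "(ts, ip, rdns, user_agent, uri, query, referrer, status, headers) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (ts, ip, rdns, user_agent, uri, query, referrer, status, headers),
    )


def count_hits(conn, ua_substring, uri, start_ts, end_ts):
    """Count requests for one URI by UAs containing ua_substring,
    with start_ts <= ts < end_ts (timestamps compared as ISO strings)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM requests "
        "WHERE user_agent LIKE ? AND uri = ? AND ts >= ? AND ts < ?",
        (f"%{ua_substring}%", uri, start_ts, end_ts),
    ).fetchone()
    return row[0]
```

With rows logged this way, a question like "how many times did a given bot hit this URI in a given week" becomes a single call such as `count_hits(conn, "Googlebot", "/widgets/672", "2004-04-08T00:00:00", "2004-04-15T00:00:00")`.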

You will be surprised how much real-time data matters and how useful it is nowadays. And how much faster.

I have 9 tables with 3GB of data in MSSQL, with a sub-domain on one of the busiest sites that is used for WebServices that spit out all that data live to other sites I own. 7 queries, all together under a second. Authenticated access only.

I could tell you how many times GoogleBot crawled URI #672 in the second week of April 2004, when a particular UA first showed up on the site, and which geographical area in the US or CA was more interested in "curly red widgets" on Black Friday/Cyber Monday of 2008. Oh, and that iPhone and iPad based UAs send request headers in a different order altogether :).

I could ban/unban an IP/range based on that info on more than 2 dozen sites via a custom BlackBerry app that I wrote.
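
For anyone wondering what banning "an IP/range" can look like in code, here is a minimal sketch using Python's standard `ipaddress` module. The `BanList` class and its method names are hypothetical, the list lives in memory only, and nothing here reflects how the app described above actually works.

```python
import ipaddress


class BanList:
    """In-memory ban list holding single IPs and CIDR ranges (illustrative only)."""

    def __init__(self):
        self._networks = set()

    def ban(self, ip_or_range):
        # strict=False accepts input like "203.0.113.77/24" and
        # normalizes it to the enclosing network, 203.0.113.0/24.
        self._networks.add(ipaddress.ip_network(ip_or_range, strict=False))

    def unban(self, ip_or_range):
        # discard() is a no-op if the entry was never banned.
        self._networks.discard(ipaddress.ip_network(ip_or_range, strict=False))

    def is_banned(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self._networks)
```

A single IP is stored as a /32 (or /128) network, so IPs and ranges go through the same membership check. The linear scan is fine for a short list; a large list would want a sorted or tree-based structure.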

It's a lot more fun that way.

And no, it's not on an Apache/PHP platform, sorry ;)
-----------------------------------------------------------

@incrediBILL

I have a function that does lookups and takes advantage of Java classes.

In short:
function rdnsLookUp(address) {
    // Load the java.net.InetAddress class
    var iaclass = CreateObject("java", "java.net.InetAddress");
    // Wrap the IP address string in an InetAddress object
    var addr = iaclass.getByName(address);
    // Return the reverse-resolved host name; if the lookup fails,
    // Java returns the IP address as text instead
    return addr.getCanonicalHostName();
}

The problem with running rDNS requests against IPs that have no rDNS record is that the lookup USUALLY takes 4-5 seconds. So if someone ran a DDoS-style scrape, that would slow down the server a bit. So I time out the requests after 2 seconds (no more), then a scheduled task that runs on the back burner (diff app pull) picks them up. Mostly these are hosting ranges.
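
The time-out-then-defer approach can be sketched roughly like this. It is a Python stand-in, not the code running on the server described above: `rdns_with_timeout`, `drain_deferred`, the `deferred` backlog, and the pool size of 8 are all assumptions made for the example. The resolver is passed in as a parameter so the logic can be exercised without touching real DNS.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool. A lookup that times out keeps running in its worker
# thread, but the request that asked for it no longer waits on it.
_pool = ThreadPoolExecutor(max_workers=8)


def system_rdns(ip):
    """Reverse-resolve an IP with the system resolver; None if it fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def rdns_with_timeout(ip, deferred, timeout=2.0, resolver=system_rdns):
    """Return the host name, or None if no answer arrives within `timeout`.

    IPs that time out are appended to `deferred` (a hypothetical backlog,
    e.g. a list or a DB-backed queue) for a background job to retry.
    """
    future = _pool.submit(resolver, ip)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        deferred.append(ip)
        return None


def drain_deferred(deferred, resolver=system_rdns):
    """Background pass: resolve the backlog with no time limit.

    Returns a dict mapping each IP to its host name (or None).
    """
    results = {}
    while deferred:
        ip = deferred.pop()
        results[ip] = resolver(ip)
    return results
```

The live request path calls `rdns_with_timeout` and moves on after at most 2 seconds; a scheduled job calls `drain_deferred` later and stores the results. Note that the pool can still fill up under a heavy scrape, so a real deployment would also want to cap or deduplicate pending lookups.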


Thread source: http://www.webmasterworld.com/search_engine_spiders/4536448.htm