Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Bingbot with one domain and two IPs
What's the point of that?

 11:46 pm on Oct 29, 2012 (gmt 0)

Hey guys, new to this forum, so excuse me if this is common knowledge or already discussed, but..

I do a reverse/forward DNS lookup on my server and send a 403 to bots that don't pass the test. Recently I noticed that a lot of requests from search.msn.com get a 403 as well.

My initial thought was that it's a fake Bingbot. But on further investigation I discovered the following pattern:

When requests from one such IP address crawl my web pages, sometimes within the same second I get a 200 for one file request followed by a 403 for another file request, and so on. Strange. So I test the reverse DNS, and it's always msnbot-157-55-33-113.search.msn.com. But doing a forward DNS lookup on that name, I get two results (you can check for yourself with a lookup tool):

Name: msnbot-157-55-33-113.search.msn.com
Name: msnbot-157-55-33-113.search.msn.com

In practice, a request from this bot resolves to one of the two addresses and passes the lookup in about 50 percent of cases, whereas in the other cases it resolves to the other address and (rightly) gets a 403 from my server.

So, as the IP is officially from Microsoft, why do they do that? Have you experienced this before? How do you handle that? I certainly won't let the Bingbot in, if it can't verify itself. It produces massive 403 errors in my logs.
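
One possible explanation (an assumption, since the post doesn't show the checking code) is that the forward step compares the client IP against only the first address returned, so a hostname with two A records passes roughly half the time. Below is a minimal Python sketch of a reverse/forward check that accepts a match against any returned address. The function names, the `.search.msn.com` suffix list and the injectable `reverse`/`forward` parameters are illustrative choices, not anything from the original setup.

```python
import socket


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of all IPv4 addresses the hostname resolves to."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except OSError:
        return set()


def is_verified_bot(ip, allowed_suffixes=(".search.msn.com",),
                    reverse=reverse_lookup, forward=forward_lookup):
    """Reverse/forward DNS check that tolerates multiple A records."""
    host = reverse(ip)
    if host is None:
        return False
    host = host.rstrip(".").lower()
    if not host.endswith(allowed_suffixes):
        return False
    # Compare against ALL forward results, not only the first one.
    return ip in forward(host)
```

With live DNS this would be called as `is_verified_bot(client_ip)`; the `reverse` and `forward` parameters exist only so the logic can be exercised with stub resolvers.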



 8:42 am on Oct 30, 2012 (gmt 0)

Thought another would have answered this by now.

There are others that do header checks and may chirp in later (I don't)

I use the following and in addition have lines for all the MS-Bing Previews and broken UA's, denying access to the latter.

RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^70\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^157\.[45][0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.[67][0-9]\.
RewriteCond %{HTTP_USER_AGENT} !(Bingbot|msnbot) [NC]
RewriteRule .* - [F]

FWIW, there are similar threads on Google. Try a search on "fake googlebot".

The major SE's are utilizing so many supplemental tools that are not related to their primary crawlers. (even though they may come from the same IP ranges). Denying these supplemental tools doesn't hinder your SERP's.
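
To sanity-check range patterns like the ones in the rule above before putting them in htaccess, they can be run through any regex engine. The sketch below uses Python's `re` module, which handles these particular patterns the same way Apache's regex engine does. The helper names are hypothetical, the test addresses are arbitrary examples, and the first pattern is written with the closing bracket after `2[1-9`.

```python
import re

# The six address patterns from the rule, one per RewriteCond line.
BING_RANGES = [
    r"^131\.253\.(2[1-9]|3[0-9]|4[0-7])\.",
    r"^65\.5[2-5]\.",
    r"^70\.37\.",
    r"^157\.[45][0-9]\.",
    r"^207\.46\.",
    r"^207\.[67][0-9]\.",
]

# [NC] in the rule corresponds to a case-insensitive match here.
BING_UA = re.compile(r"(Bingbot|msnbot)", re.IGNORECASE)


def in_bing_range(ip):
    """True if the address matches any of the [OR]-joined patterns."""
    return any(re.search(pattern, ip) for pattern in BING_RANGES)


def is_blocked(ip, user_agent):
    """Mirror the rule: inside the ranges AND the UA is not Bingbot/msnbot."""
    return in_bing_range(ip) and not BING_UA.search(user_agent)
```

For example, `is_blocked("157.55.0.1", "Mozilla/5.0")` is true, while the same address with a UA containing "bingbot" is let through, and an address outside the listed ranges is never touched by this rule.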


 4:10 am on Nov 15, 2012 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^70\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^157\.[45][0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.[67][0-9]\.
RewriteCond %{HTTP_USER_AGENT} !(Bingbot|msnbot) [NC]
RewriteRule .* - [F]

So you are saying these are all IPs for fake bingbots and msn bots? Where are the IPs for the real ones then?


 6:02 am on Nov 15, 2012 (gmt 0)

Other way around. He's saying: These are the known Bing IPs. If anything from those ranges other than the bingbot or msnbot shows up, kick it out the door.

This is based on the assumption that a non-bingbot coming from bingbot territory is up to no good. I really, really hope it isn't part of some vast ongoing Cloaking test, where the winners are the people who don't block :(

wilderness, I'm assuming that in real life at least some of those [OR]s are on single pipe-delimited lines, changed to [OR] for forums readability.


 7:26 am on Nov 15, 2012 (gmt 0)

wilderness, I'm assuming that in real life at least some of those [OR]s are on single pipe-delimited lines, changed to [OR] for forums readability.

You know what they say about assume ;)

Have I imposed upon some copyright infringement, or even worse, upset "Zeus"?
If merely the latter, the winter months are rapidly approaching and there won't be much sun anyway (at least in the Northern Hemisphere).


 9:48 am on Nov 15, 2012 (gmt 0)

You know what they say about assume

Un asino viejo sabe mas que un potro. (An old donkey knows more than a colt.)


 11:00 am on Nov 15, 2012 (gmt 0)

I do realize that you're attempting to help, and thanks for that thoughtfulness.

My closed-mindedness has nothing to do with you or "tenure".

My only interest is status-quo, which I've explained (projects) previously.

I've no interest in any of the following:
Windoze 15 or anything after XP.
Apache 8.0
most other computer skills.
I may never even progress to a multi-core CPU.
"i" anything, or anything else related to social media and/or new machine mediums, is beyond my interests.

FWIW ten years ago the "big hype" was CSS and Sam Spade.
I used Sam Spade for about five minutes and didn't like it.

Website designers are still using tables to layout pages and the SE's pick up the content just fine, despite the mess.

KISS (Keep it Simple and Stupid).

My htaccess is by far larger (due to continent restrictions) in lines and kb's than that of anybody who participates here, and it's been cut down in size/lines 2-3 times.
Despite some outdated syntax (there are newer, even more efficient methods of doing most anything) it functions just fine.


 10:05 pm on Nov 15, 2012 (gmt 0)

Heh. It was just my way of saying "Yes, I do know what they say about..."

You may be gratified to hear that there's a recently published study showing that table-based layout loads up much faster than all the stuff with divs that you're supposed to use. So do a couple other standard no-nos.

But seriously: mslina, still there? The underlying point was: Don and I both block non-bingbots coming from bing ranges. This is based on the, ahem, assumption that those non-robots aren't doing anything important.

There could well be a matching htaccess rule that says the exact opposite: Anything claiming to be bingbot (or googlebot or whatever) coming from the wrong IP would also get blocked. That version would look like this (allowing for one tiny adjustment that I couldn't resist):

RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.(46|[67][0-9])\.
RewriteCond %{HTTP_USER_AGENT} (Bingbot|msnbot) [NC]
RewriteRule .* - [F]

Notice how all ! (not) have been toggled, and all [OR] removed (i.e. replaced with default AND).

[0-9] can always be replaced with \d for a savings of three bytes ;)

If you wanted to make a rule like that, put the USER_AGENT first. Whenever a rule has multiple conditions, try to start with the one most likely to fail. RewriteCond is sudden-death. Once a condition has failed, any remaining conditions aren't even evaluated. (Groups of lines with [OR] behave as a single condition.)
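
The ordering advice can be illustrated with a small Python model of how a RewriteCond list is walked. This is a simplified sketch, not Apache's actual code: each condition is a hypothetical tuple of (variable, pattern, nocase, or_next), a leading `!` negates the pattern, and the function also returns how many conditions it had to test. The addresses and user agents in the example are arbitrary.

```python
import re


def evaluate(conditions, env):
    """Return (all_matched, number_of_conditions_tested)."""
    tested = 0
    group_ok = False   # has the current [OR] group matched yet?
    in_group = False   # are we inside an unfinished [OR] group?
    for var, pattern, nocase, or_next in conditions:
        if not group_ok:
            tested += 1
            negate = pattern.startswith("!")
            regex = pattern[1:] if negate else pattern
            flags = re.IGNORECASE if nocase else 0
            hit = re.search(regex, env[var], flags) is not None
            group_ok = hit != negate
        if or_next:
            in_group = True
            continue               # the group continues on the next line
        if not group_ok:
            return False, tested   # sudden death: the rest is skipped
        group_ok = False           # default AND: start a new group
        in_group = False
    return (group_ok or not in_group), tested


UA = ("HTTP_USER_AGENT", r"(Bingbot|msnbot)", True, False)
IP1 = ("REMOTE_ADDR", r"!^157\.[45][0-9]\.", False, False)
IP2 = ("REMOTE_ADDR", r"!^207\.(46|[67][0-9])\.", False, False)

visitor = {"HTTP_USER_AGENT": "Mozilla/5.0 Firefox",
           "REMOTE_ADDR": "203.0.113.9"}

ua_first = evaluate([UA, IP1, IP2], visitor)
ip_first = evaluate([IP1, IP2, UA], visitor)
```

For an ordinary visitor, `ua_first` stops after one test while `ip_first` needs all three before reaching the same "no match" result, which is the reason for putting the USER_AGENT condition first.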


 2:03 am on Nov 16, 2012 (gmt 0)

Still here, just trying to make sense of it all (logs). To me, looking at logs is like looking at all the ugliness and bad things in the world. Scrapers, hackers, malicious attempts, why-o-why is the world so bad?

Oops. Yeah I missed the "!" part of the wilderness' code and that would of course throw my fragile mind off. Good to see all points of view and learn how you all write those rewrite rules and conditions.

Thanks for the clarification and code, lucy. You are officially the Apache goddess around here or would that be god?

I have added that bit to my htaccess. Gracias!


 3:18 am on Nov 16, 2012 (gmt 0)

Another point of view...

I have seen legit M$ bots come from other M$ ranges on occasion, so I also allow those, as well as a couple additional UAs:

RewriteCond %{HTTP_USER_AGENT} (Bingbot|Bing\ Mobile\ |msnbot|MSRBOT) [NC]
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^70\.37\.
RewriteCond %{REMOTE_ADDR} !^131\.253\.[2-4][0-9]\.
RewriteCond %{REMOTE_ADDR} !^131\.107\.
RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved