
Bing Search Engine News Forum

    
Bing ignoring robots.txt?
ken_b

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4441555 posted 4:50 pm on Apr 16, 2012 (gmt 0)

Several months ago I disallowed my image folder/directory in robots.txt.

Initially this worked well, and within a short period of time the number of my images showing up on Bing Image Search for a site:www.example.com search dropped by about 4,000 images and showed only 4 random images that were in folders that hadn't been disallowed.

Yesterday the number was back up to 3,200, all in the disallowed image folder.

Anyone have an idea why that would happen?


 

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributors Of The Month



 
Msg#: 4441555 posted 10:07 pm on Apr 16, 2012 (gmt 0)

Have you looked in your logs to see who's been visiting the images? A while back I had to physically block the plainclothes MSIEbot because it doesn't pay any attention to robots.txt, although the ordinary bingbot and msn-media both do.

ken_b




 
Msg#: 4441555 posted 10:32 pm on Apr 16, 2012 (gmt 0)

I don't spend a lot of time looking at my raw logs. Is "MSIEbot" what I should be looking for?

lucy24




 
Msg#: 4441555 posted 12:27 am on Apr 17, 2012 (gmt 0)

Well, the name will be part of the question. Look for anything in the IP ranges

:: shuffling papers ::

65.52-55...
157.54-60...
207.46...

They're pretty fluid about which robot comes from where, much more so than the ordinary googlebot or even yandexbot. The plainclothes bot always picks up a stylesheet to go with the page. By now it must have a drawer full of my errorstyles.css :)
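
For anyone who can edit the main server configuration, one way to narrow the search is to tag requests from those ranges and write them to a separate log. The following is a minimal sketch, assuming Apache with mod_setenvif and mod_log_config loaded and a "combined" log format already defined; the variable name ms_range and the log path are hypothetical.

```apache
# Server or virtual-host config only: CustomLog is not allowed in .htaccess.
# Tag requests whose address starts with 65.52-55, 157.54-60 or 207.46.
SetEnvIf Remote_Addr "^(65\.5[2-5]|157\.(5[4-9]|60)|207\.46)\." ms_range

# Hypothetical log file that receives only the tagged requests.
CustomLog "logs/ms_ranges.log" combined env=ms_range
```

These prefixes cover far more than one crawler, so the user-agent column in the resulting log still has to be read to tell the declared bingbot from anything unidentified.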

ken_b




 
Msg#: 4441555 posted 12:43 am on Apr 17, 2012 (gmt 0)

Thanks lucy24;
I'll dig around a bit and see what I can find using that info.

v3Exceed



 
Msg#: 4441555 posted 9:13 pm on Aug 5, 2012 (gmt 0)

We manage hundreds of smaller business sites. We have found that Bing does indeed ignore the robots.txt directives for whatever reason, and placing a removal request is NOT a reasonable expectation for a real business.

For a while, we had been watching scraper bots from Eastern European countries scrape copies of our clients' websites for a variety of malicious reasons. We had even seen copies of a client's site hosting advertising on a foreign-owned network.

In order to combat this, we integrated a bot trap. The trap is triggered only when a bot ignores robots.txt and requests information we have intentionally disallowed.

Well guess who we consistently catch... Bing. In all of its iterations and from all of its IPs, Bing goes after the information we have asked not to be indexed. Even adjusting the robots.txt to specifically address the syntax that Microsoft says Bing will honor does not work.

Our final solution is to ignore Bing completely, since Yandex, Google, Yahoo and the rest all index correctly without fail. We don't have the time or energy to constantly ask Bing to remove entries that they shouldn't have indexed in the first place.

The idea of using the noindex tag is moot, because our websites are entirely generated on the fly.
There is no sub directory or index to tag separately.

Bing should really get its act together before it is further relegated to the abyss where the Zune currently resides.

It is in Bing's interest to conform to the robots.txt standard, since the websites that developers produce are what give a search engine its value, especially if it expects to compete with Google, or even Yahoo, in the future.

Thanks ..A

lucy24




 
Msg#: 4441555 posted 10:24 pm on Aug 5, 2012 (gmt 0)

Your post is kinda fuzzy on the difference between crawling and indexing. Entirely different processes. I do know from personal observation that bing doesn't seem to care about the "noindex" instruction in a meta tag. But it has to crawl the page in order to read-- or ignore, as the case may be-- anything on it including the meta.

Last time I looked, yahoo was using bing's information rather than do their own crawling.

v3Exceed



 
Msg#: 4441555 posted 11:42 pm on Aug 5, 2012 (gmt 0)

From a client's perspective, there is no real difference between crawling and indexing, and for the purpose of this thread it's not really relevant either. The crawling of the external sites and the inclusion within Bing's index are fundamentally entwined within the scope of this thread.

The major concern for us is that Bing triggers the bot trap and then gets blocked from any of the information on the site. Although some developers will argue that these scraper bots pose no risk, we and others who employ bot traps and honeypots do so because we recognize a real threat from these scraper bots.

Our clients are still listed on Bing, but from links on other sites and not directly crawled. As I understand it, Yahoo uses some Bing, some Yandex and some other crawlers in their search results so not being directly listed in Bing hasn't really made much of an impact.

The first file the search bots are supposed to grab if present is the robots.txt. Then based on this information they crawl the site to include the site folders and information within their index. When the search bots ignore the robots.txt there really isn't any way for developers to direct the bot to the right information except by internal links or site map.

Regardless, if Google and others can follow the simple directives of a robots.txt there is no excuse for Bing not to follow suit. The suggestion that links can always be removed from Bing may work for a person with a few sites, but in volume this just doesn't work.

Thanks ..A

revrob

5+ Year Member



 
Msg#: 4441555 posted 7:26 pm on Sep 28, 2012 (gmt 0)

I've virtually given up on bingbot, having tried a whole variety of methods via robots.txt and .htaccess. Even when I had all the bingbot IP ranges supposedly banned, I found that bingbot was occasionally accessing bulky media files in disallowed folders, even from an IP address that should have been totally banned and that was getting a 403 response everywhere else on my site. It seemed to be able to evade the Rewrite to [F] commands when accessing a minority of pdf and jpg files (which were also restricted in robots.txt, but bing didn't care about that either).

My current experiment is to use a rewrite command to send all the various MS IP ranges I can identify, to visit robots.txt, whatever it is they are asking for, where they can chew on the disallow directive for bingbot that they are so keen to ignore.

User-agent: bingbot
Disallow: /

Here is what I have put up this afternoon in .htaccess

RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]
RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]
RewriteCond %{REMOTE_ADDR} ^65\.52\.([0-9]|[1-4][0-9]|5[0-5])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]

I have robots.txt listed in the "don't rewrite" section near the beginning of .htaccess.
RewriteCond %{REQUEST_URI} !/robots\.txt$
RewriteCond %{REQUEST_URI} !^/robots\.txt$

I'm now waiting to see if that works or if some of the bingbot visits will continue to somehow evade it.

I have had a couple of bingbot visits since putting that code up, which have redirected nicely to robots.txt

If MS are not prepared to observe robots.txt then I am not prepared to let them read anything EXCEPT robots.txt

The only other legit bot I have trouble with is Yahoo Slurp! which also has a habit of ignoring robots.txt directives but I have managed to tame that one via .htaccess.

lucy24




 
Msg#: 4441555 posted 2:33 am on Sep 29, 2012 (gmt 0)

RewriteRule .* http://www.example.com/robots.txt [L]

Hate to break it to you, but that isn't a Rewrite. It's a temporary redirect. To keep it as a rewrite, leave off the protocol-plus-domain part of the target.

You don't need all those separate Rules. If the IPs won't all fit on one line, make them into separate conditions separated by OR.

RewriteCond %{REQUEST_URI} !/robots\.txt$
RewriteCond %{REQUEST_URI} !^/robots\.txt$

The second line is contained within the first line, so it's superfluous.

But, er, that's for another forum ;)

I locked out Yahoo ages ago. Belt and suspenders: IP block and UA both.

revrob

5+ Year Member



 
Msg#: 4441555 posted 7:20 am on Sep 29, 2012 (gmt 0)

Thanks for the helpful reply. I understand the second half of your reply about the separate Rules not being necessary and will implement your advice about [OR].

Presumably this is what you had in mind?

RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$ [OR]
RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.52\.([0-9]|[1-4][0-9]|5[0-5])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [L]

I don't understand the first bit of your reply and what you mean by "leave off the protocol-plus-domain part of the target". Could you translate that with an example, please? And could you clarify the difference between a Rewrite and a temporary redirect, especially in terms of what the visiting bot receives by way of a response?

As for the final part "that's for another forum ;-)" - I think I got that particular bit of code FROM another WW forum ;-) - where should I go to chat about that please? This WW site is where I have learnt everything (so far very little) I know about .htaccess and the help is much appreciated.

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4441555 posted 7:38 am on Sep 29, 2012 (gmt 0)

You mentioned that you wanted a "rewrite". Your code generates a 302 redirect, not an internal rewrite.

I don't think you need either. After the Conditions, you should simply block access with something like:

RewriteRule . - [F]

The "other forum" would be the Apache forum here at WebmasterWorld.
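
Put together with the conditions posted earlier in the thread, that suggestion might look like the sketch below. It assumes the same three IP ranges, shortened to prefixes because REMOTE_ADDR is always a well-formed address, and it assumes robots.txt should stay reachable; neither assumption is a verified list of Bing addresses.

```apache
RewriteEngine On

# Ranges copied from the earlier posts (assumed, not verified):
# 157.54-60.x.x, 131.253.21-47.x, 65.52.0-55.x
RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\. [OR]
RewriteCond %{REMOTE_ADDR} ^131\.253\.(2[1-9]|3[0-9]|4[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.52\.([0-9]|[1-4][0-9]|5[0-5])\.

# Let matching clients still fetch robots.txt itself.
RewriteCond %{REQUEST_URI} !^/robots\.txt$

# "-" means no substitution; [F] answers 403 Forbidden.
RewriteRule ^ - [F]
```

The last condition carries no [OR], so it is ANDed with the group of IP conditions above it.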

revrob

5+ Year Member



 
Msg#: 4441555 posted 8:50 am on Sep 29, 2012 (gmt 0)

Thank you for your reply.

At the moment I do want to send those IP ranges to robots.txt and see what happens. AFAICS there is one bingbot IP range that behaves itself, by requesting robots.txt and obeying it, for example 65.55.52.111
NetRange: 65.52.0.0 - 65.55.255.255
CIDR: 65.52.0.0/14
and that is the range I don't restrict.

The others seem to be uncontrollable. I will probably put an [F] rule to replace the redirect to robots in due course.

I'd still be grateful if someone could explain exactly the difference between a 302 redirect and an internal rewrite.
(BTW - I don't control my own web server, so only have access to .htaccess and robots.txt controls)

Once again - thank you.

lucy24




 
Msg#: 4441555 posted 10:42 am on Sep 29, 2012 (gmt 0)

To a human:
A redirect means that your browser's address bar changes, and you are now on a different page than the one you originally asked for. (Browsers ordinarily do this without asking your permission. The site says "Go around the back" and your browser obliges.)

A rewrite means that your address bar doesn't change, but you're seeing content that lives somewhere else. A special kind of rewrite that everyone has met is the 404 page: Your address bar will say www.example.com/directory/pagename.html or whatever you typed in, but the page you are looking at will be an error page that lives somewhere else entirely.

To a robot:
A redirect is a message that the stuff you want to see is somewhere else. Robots, unlike humans, can choose not to follow redirects. That is: they can't ignore the redirect and barge on to the page they originally asked for. But they can go away and try the new URL later-- or not at all. ("Oh, right, /foobar.html. I was there yesterday.")

But robots are powerless against rewrites. They don't know they've been rewritten, any more than humans do, and they can't ignore the rewrite.

If you redirect a robot to robots.txt it will say (in Robot) "Haha, very funny, I'll come back later when you're in a better mood". If you rewrite it to robots.txt, it will end up there whether it wants to or not.

How To
In mod_rewrite, there are two overlapping ways to create a redirect. One is to include the full protocol-plus-domain in the target: http://www.example.com et cetera. The other is to use a flag saying [R]. Or, preferably, [R=301]. Either one by itself will turn a rewrite into a redirect, but you should do both together, for reasons that have nothing to do with bing.

To make a rewrite, you simply leave out both of those things. Keep the [L] flag, because you always use it. But change the target to say only /robots.txt

And then sit back and wait for them to start yapping about Duplicate Content as they see that every one of your pages says the exact same thing.

revrob

5+ Year Member



 
Msg#: 4441555 posted 11:05 am on Sep 29, 2012 (gmt 0)

So to make a rewrite - that the bot can't avoid:
would this do?

RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* /robots.txt [L]

And the redirect: - that the bot can see and then just go away
- is this correct?
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* http://www.example.com/robots.txt [R=301][L]

Does the choice of which one to use actually make a difference to frequency of unwanted visits from a badly behaved bot like bingbot? Or does it just speed up their visit and get rid of them quicker?

Once again many thanks for your patience.

g1smd




 
Msg#: 4441555 posted 11:10 am on Sep 29, 2012 (gmt 0)

Redirect:
RewriteRule ^cats\.html http://www.example.com/pets [R=301,L]
When www.example.com/cats.html is requested, tell the browser or bot to make a new request for www.example.com/pets. The address bar will change to the new URL when the browser makes a new request for www.example.com/pets.
A redirect is a URL to URL translation.

Rewrite:
RewriteRule ^pets$ /pets.html [L]
When www.example.com/pets is requested, show the user the content of the file /pets.html and leave the address bar showing the same URL the user originally requested.
A rewrite is a URL to file translation.
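
Applied to the robots.txt case discussed earlier in the thread, the two forms might look like the following. This is a sketch only: the single IP condition is copied from the earlier posts as a stand-in for the full list, and www.example.com is the placeholder domain used throughout. Note that multiple flags go inside one pair of brackets, separated by commas.

```apache
RewriteEngine On

# Option 1: internal rewrite. The client receives the contents of
# /robots.txt for whatever URL it asked for and is never told a new URL.
RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ /robots.txt [L]

# Option 2: external redirect. The client receives a 301 response
# pointing at robots.txt and decides for itself whether to follow it.
# Use this in place of Option 1, not alongside it.
# RewriteCond %{REMOTE_ADDR} ^157\.(5[4-9]|60)\.
# RewriteCond %{REQUEST_URI} !^/robots\.txt$
# RewriteRule ^ http://www.example.com/robots.txt [R=301,L]
```

In both options the REQUEST_URI condition keeps a request for robots.txt itself from being rewritten or redirected again.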

revrob

5+ Year Member



 
Msg#: 4441555 posted 4:24 pm on Sep 29, 2012 (gmt 0)

Thanks for all the replies. I think I've got it. So far the result of having the original temporary redirect with redundant bits in place has been that the bad bingbot IPs have made far fewer visits, but then it is the weekend. I'll monitor progress over a couple of weekdays, then try out some of the corrected alternatives above.

Once again many thanks.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved