Search Engine Spider and User Agent Identification Forum

MSN bot finds php robots.txt

 10:54 pm on Jul 9, 2008 (gmt 0)

Does anyone know if MSN bot has figured out a way to tell whether robots.txt is being rendered by a PHP script?

Today MSN bot directly hit my PHP version of robots.txt (robots.#*$!.php) with a GET and then immediately hit robots.txt with a GET. No other bots or requests for robots.txt have ever tried, let alone succeeded in, accessing the PHP version. While it wouldn’t be impossible to guess the #*$!, it isn’t all that obvious, and I have to believe the MSN bot knew what it was looking for.




 10:11 am on Jul 12, 2008 (gmt 0)

I would disallow the .php URL in robots.txt, OR, better yet, I would set up an internal rewrite (that's a rewrite, and NOT a redirect) to /this-does-not-exist so that direct requests for the .php URL return a 404. That 404 would not affect the ability of the script to operate and do its thing.
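
For illustration, the rewrite-to-404 idea above might look something like this in .htaccess. This is only a sketch under assumptions: robots-example.php is a placeholder for the real (masked) script name, and the rules are assumed to live in the document root's .htaccess with mod_rewrite enabled. The THE_REQUEST condition is one common way to tell a direct client request apart from an internal rewrite; it is not necessarily what was used on the site in question.

```apache
RewriteEngine On

# Direct client requests for the script itself: rewrite internally to a
# path that does not exist, so the client gets a 404.
# THE_REQUEST holds the original request line (e.g. "GET /robots.txt HTTP/1.1")
# and does not change after an internal rewrite, so this condition is true
# only when the client asked for the .php URL by name.
# robots-example.php is a hypothetical placeholder name.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /robots-example\.php[?\ ]
RewriteRule ^robots-example\.php$ /this-does-not-exist [L]

# Requests for robots.txt are served by the script via an internal rewrite
# (no R flag, so nothing is sent back to the client as a redirect).
RewriteRule ^robots\.txt$ /robots-example.php [L]
```

With rules along these lines, a request for /robots.txt should still be answered by the script, while a request for the script's own URL should get a 404.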


 11:33 pm on Jul 13, 2008 (gmt 0)

Hi g1smd,

I asked for a review of my .htaccess rewrite logic over in the Apache forum. The problem was the order of my rewrites. I did the internal rewrite from robots.txt to robots.$!@#.php first, which worked, except that later I did an external (301) redirect appending www, which exposed my internal rewrite. Most (all other) bots that I've logged went after robots.txt with the www., so I never saw my internal rewrite being exposed. Thanks to Jim, all fixed now.
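
As a rough sketch of the ordering issue (not the actual rules reviewed in the Apache forum), the usual arrangement is to put the external www redirect ahead of any internal rewrites. Here example.com and robots-example.php are placeholders, and the rules are assumed to sit in the document root's .htaccess.

```apache
RewriteEngine On

# 1. External redirect first: send non-www requests to the www host while
#    the requested path is still the one the client asked for.
#    example.com is a placeholder hostname.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 2. Internal rewrite second: only requests that already arrived on the
#    www host get this far, so the script name never appears in a redirect.
#    robots-example.php is a hypothetical placeholder name.
RewriteRule ^robots\.txt$ /robots-example.php [L]
```

If the order were reversed, a request for robots.txt on the non-www host could be rewritten to the script path first and then redirected, which would put the script's URL in the Location header sent to the bot.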

So, msnbot must try to get robots.txt using URLs both with and without the www. Maybe they've learned that the technique can yield results.



 8:26 am on Jul 14, 2008 (gmt 0)

Yes, there might be a completely different website at domain.com compared to www.domain.com, just as there might be different sites at forums.domain.com and store.domain.com - it's just another subdomain, after all.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved