I keep thinking I could send GoogleBot into an infinite loop with a robots.txt record along these lines:
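# Hypothetical record for illustration: the robot has to fetch robots.txt
# to discover that it is not allowed to fetch robots.txt.
User-agent: Googlebot
Disallow: /robots.txt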
LMAO, but I really want to try it some time, just to see what SEs do...
Consider that the recommended robots.txt record to deny all robots access to all resources is:
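User-agent: *
Disallow: /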
As a result, it is reasonable to assume that all properly-coded 'bots feel entitled to fetch robots.txt from all sites, regardless of the content seen in that file when previously fetched: the file is the only place a robot can learn a site's access policy, so it must remain fetchable even when everything else is disallowed.
If you really want to play around with forcing the search engines into loops, then you'd better use a throwaway domain. Traditionally, search engines have blacklisted URLs that cause loops.
I might throw it up on one someday, just because I want to see what they do with it, but it was just one of those funny thoughts I had.
I would guess they will continue to request it, because they think it's their Internet. Personally, though, I think that if a domain is disallowed in the robots.txt, they should not spider the domain again (including the robots.txt) unless the owner changes and resubmits the robots.txt. And if I were to disallow the robots.txt itself, then they should simply stop requesting it and keep following the rules as they stood at the last time of spidering.
Really, honestly, it's my domain, and if I kick you out via the robots.txt, then you should not request it; and if I tell you to keep out of the whole thing, I mean the whole thing, including the robots.txt...
No need to experiment, as I have already done so. Several robot exclusion records on my sites have "Disallow: /" in them, and no unfortunate or unexpected effects have resulted.
There's actually one site that indexes and caches robots.txt files themselves, and I disallowed them as above. The result was that they removed my robots.txt file from their results, as I wished. This is just another interesting "edge case," though not the one you're inquiring about.
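Something along these lines does the trick, with a placeholder standing in for that site's user-agent token:

# "ExampleBot" is a placeholder; use the actual robot's user-agent token.
User-agent: ExampleBot
Disallow: /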
What disallowing the robots.txt file actually does is stop the contents of the robots.txt file appearing in the SERPs as text, if someone links to it and the file is fetched and parsed as if it were a file with content rather than configuration data.
I fed it a "Disallow: /" and it quit doing that, as you otherwise correctly surmise.
Google does, for sure... It has all the locations listed as URL-only even though they are all noindexed, even the 404 page, which is what GBot should get for most of the locations if they were actually requested.
Yahoo has what would be the index page as URL-only and no other URLs listed.
Bing has all the listed locations as URL-only, but I'm not sure if this comes from the robots.txt or from their handling of noindex pages.
IMO Bing may be treating the entire site as disallowed, because on the sites I check there, 404 and 410 pages that do not and have never existed are usually not listed, yet here some such locations are listed, and I think that if they had actually been requested they would not be listed, because of the status code returned by the server.
Based on my results I have to disagree with g1smd's post about what disallowing the robots.txt does.
Did either of you ever actually test exactly what I said I wanted to do with the major search engines, so we know whether a change was made or whether some sites are treated differently, or did you draw your conclusions some other way?
On a 'posting note': personally, I think it's probably a good idea to remind people to always test for themselves, especially when posting as what would probably be considered an authority on the subject. If I hadn't followed up on the posts and information in this thread, I might very well have just installed the disallow when I needed it, based on your and jdMorgan's posts, and that could have been really ugly a couple of weeks later. That's just my personal opinion, though, and I'll let the two of you decide whether it's a good idea or not. (No offense intended to either of you.)
Thanks for letting me know you really did test, because I got totally different results than I expected based on the earlier posts in this thread.