Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Does Anyone Know What Happens If.
TheMadScientist




msg:4023895
 5:00 pm on Nov 12, 2009 (gmt 0)

Does anyone know what happens if I disallow my robots.txt in my robots.txt?
Seriously, has anyone ever tried this on a 'nothing' domain just to see what SE spiders do?

I keep thinking I could send GoogleBot into an infinite loop with:

User-agent: *
Disallow: /robots.txt

LMAO, but I really want to try it some time, just to see what SEs do...

 

jdMorgan




msg:4023914
 5:36 pm on Nov 12, 2009 (gmt 0)

It's not really a 'meta-problem' as you might imagine it.

Consider that the recommended robots.txt record to deny all robots access to all resources is

User-Agent: *
Disallow: /

If the problem you imagine actually existed, then this would send all 'bots into a loop, since "/" is a prefix-match for "/robots.txt".

As a result, it is reasonable to assume that all properly-coded 'bots feel entitled to fetch robots.txt from all sites, regardless of the content seen in that file when previously fetched.
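The prefix-match point above can be checked with a quick sketch using Python's standard `urllib.robotparser`. This only illustrates the matching rule and the "always fetch robots.txt" behaviour described here; `is_allowed` and `crawl_decision` are made-up helper names, and nothing in it is a claim about how any particular search engine is actually coded.

```python
from urllib.robotparser import RobotFileParser

# The two records discussed in this thread.
DENY_ALL = ["User-agent: *", "Disallow: /"]
DENY_ROBOTS_ONLY = ["User-agent: *", "Disallow: /robots.txt"]


def is_allowed(robots_lines, path, agent="*"):
    """Apply the parsed rules literally, using prefix matching on the path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)


def crawl_decision(robots_lines, path, agent="*"):
    """Hypothetical crawler policy: robots.txt itself is always fetchable,
    and every other path is subject to the rules."""
    if path == "/robots.txt":
        return True
    return is_allowed(robots_lines, path, agent)


if __name__ == "__main__":
    # Taken literally, both records match /robots.txt by prefix.
    print(is_allowed(DENY_ALL, "/robots.txt"))
    print(is_allowed(DENY_ROBOTS_ONLY, "/robots.txt"))
    # A crawler that exempts robots.txt never gets stuck on either one.
    print(crawl_decision(DENY_ALL, "/robots.txt"))
    print(crawl_decision(DENY_ALL, "/index.html"))
```

Taken literally, "Disallow: /" already covers /robots.txt, so a robot that could loop on "Disallow: /robots.txt" would loop on the standard deny-all record too. The exemption in `crawl_decision` is an assumption about what a sensibly written robot does.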

Jim

goodroi




msg:4024461
 4:18 pm on Nov 13, 2009 (gmt 0)

As Jim says, the bots will still look at robots.txt. The big search engines do a really good job of avoiding loops, because loops waste a lot of resources, which the search engines hate.

If you really want to play around with forcing the search engines into loops, then you'd better use a throwaway domain. Traditionally, search engines have blacklisted URLs that cause loops.

TheMadScientist




msg:4024495
 5:19 pm on Nov 13, 2009 (gmt 0)

I was totally kidding, but thanks for the replies...

I might throw it up on one someday, just because I want to see what they do with it, but it was just one of those funny thoughts I had.

I would guess they will continue to request it, because they think it's their Internet, but personally, I think if a domain is disallowed in the robots.txt they should not spider the domain again (including the robots.txt), unless the owner changes and resubmits the robots.txt, and if I were to disallow the robots.txt, then they should just not request it and follow the rules at the last time of spidering.

Really, honestly, it's my domain and if I kick you out of the robots.txt, then you should not request it, and if I tell you to keep out of the whole thing I mean the whole thing, including the robots.txt...

jdMorgan




msg:4024542
 6:39 pm on Nov 13, 2009 (gmt 0)

You'd have to know that you needed to 'resubmit' your robots.txt file, then, if you'd previously disallowed all fetching and robots.txt handling worked as you propose. I think that's a bad idea, myself. If you don't want any fetching at all, then black hole that client at the firewall, or 403 all requests and serve a blank (0-byte) page as the custom 403 error document.
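As a rough illustration of the "403 all requests and serve a blank page" option, here is a minimal WSGI sketch in Python. The blocked user-agent string `ExampleBot` and the helper names are hypothetical. On a real site this would more likely be done in the web server or firewall configuration than in application code.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical list of user-agent substrings to refuse.
BLOCKED_AGENT_SUBSTRINGS = ("ExampleBot",)


def app(environ, start_response):
    """Return 403 with a 0-byte body to blocked clients, 200 to everyone else."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(s.lower() in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        headers = [("Content-Type", "text/plain"), ("Content-Length", "0")]
        start_response("403 Forbidden", headers)
        return [b""]
    body = b"hello\n"
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def simulate_request(wsgi_app, user_agent):
    """Call the app in-process and return (status, body) for inspection."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(simulate_request(app, "ExampleBot/1.0"))
    print(simulate_request(app, "Mozilla/5.0"))
```

Matching on the User-Agent header only works for robots that identify themselves honestly. Blocking by IP address at the firewall does not depend on that.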

No need to experiment, as I have already done so. Several robot exclusion records on my sites have "Disallow: /" in them, and no unfortunate or unexpected effects have resulted.

There's actually one site that indexes and caches robots.txt files themselves, and I disallowed them as above. The result was that they removed my robots.txt file from their results, as I wished. This is just another interesting "edge case," though not the one you're inquiring about.

Jim

g1smd




msg:4024631
 9:34 pm on Nov 13, 2009 (gmt 0)

What adding Disallow: /robots.txt actually does is stop the contents of the robots.txt file appearing in the SERPs as text, if someone links to it and the file is fetched and parsed as if it were a file with content rather than configuration data.

jdMorgan




msg:4024659
 10:34 pm on Nov 13, 2009 (gmt 0)

True, except in the case I cited above, where a particular 'institute of higher learning' fetches robots.txt and republishes it (as 'content,' in your terms) even without any external links.

I fed it a "Disallow: /" and it quit doing that, as you otherwise correctly surmise.

Jim

TheMadScientist




msg:4036820
 10:13 am on Dec 4, 2009 (gmt 0)

Uh, actually, I tried it on a site I have noindexed, and all 3 major search engines appear to have treated the entire site as disallowed.

Google does for sure... It has all locations listed as URL only when they are all noindexed, even the 404 page, which is what GBot should get for most of the locations if they were actually requested.

Yahoo has what would be the index page as URL only and no other URLs listed.

Bing has all listed locations as URL only, but I'm not sure if this is from the robots.txt or their handling of noindex pages.

TheMadScientist




msg:4037041
 4:30 pm on Dec 4, 2009 (gmt 0)

Just a note RE the preceding:

IMO Bing may be treating the entire site as disallowed, because on the sites I check there, 404 and 410 pages for locations that do not exist and never have existed are usually not listed. On this site some such locations are listed, and I think if they had actually been requested they would not be listed, because of the status code returned by the server.

Based on my results I have to disagree with g1smd's post about what disallowing the robots.txt does.

Did either of you ever actually test exactly what I said I wanted to do with the major search engines, so we know a change was made or some sites are treated differently, or did you draw your conclusions some other way?

g1smd




msg:4037354
 11:50 pm on Dec 4, 2009 (gmt 0)

I tested the Disallow: /robots.txt directive with both Google and Yahoo a year or more ago, and that discussion is somewhere in the annals of WebmasterWorld history.

TheMadScientist




msg:4040519
 6:07 am on Dec 10, 2009 (gmt 0)

It's good to know that the way they handle it now appears to be different, or that it's not the same for all sites. I'm actually glad I tested, because it could have been a big deal (especially for future readers) if people had just read your and jdMorgan's posts and installed it on a site they needed indexed, thinking there would be no unexpected or undesired results. My results were definitely unexpected, and could have been very undesired on a site I needed indexed but wanted to keep the robots.txt from showing for.

On a 'posting note', personally, I think it's probably a good idea to remind people to always test for themselves, especially when posting as what would probably be considered an authority on the subject. If I hadn't followed up on the posts or information in this thread, I may very well have just installed the disallow if I needed to, based on your and jdMorgan's posts, which could have been really ugly a couple of weeks later. But that's my personal opinion, and I'll let the two of you decide if it's a good idea or not. (No offense intended to either of you.)

Thanks for letting me know you did really test, because I got totally different results than I expected based on earlier posts in this thread.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved