Your impression is correct that the purpose of robots.txt is just to control what obedient bots will and won't crawl.
If the bot doesn't find a robots.txt, it assumes it's okay to crawl anything and everything it finds under that domain. If you're okay with that, then no worries. :)
|Under status it says 404 (Not found). |
I would put up a blank one, if for nothing more than to stop logging 404's in the server logs.
A robots.txt file is also helpful for blocking bots from indexing directories that contain scripts. If you have a very plain site and you would be OK with having the engines crawl everything, then you do not need a robots.txt.
As for XML sitemaps: just because the engines say they would like you to do something doesn't mean you should do it. For good webmasters I see no benefit from XML sitemaps, but we should discuss that in a separate thread ;)
I hardly use robots.txt* and I don't use sitemaps, and I don't seem to suffer at all.
*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.
|*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'. |
And that is exactly what robots.txt is for -- To save bandwidth and control cooperative robots' crawling of your site.
Along with that comes an improvement in the usability/validity of your log files and stats, since they won't be full of 404-Not Found errors resulting from robots trying to fetch the customary robots.txt file.
You don't *have* to have a robots.txt file, but even if you don't need the robots-control facility it provides, adding one that's either blank or that contains the customary "allow everything" directives is a very good idea, if just to keep your access log and error log clean, and avoid skewing your stats with all those errors from attempted robots.txt fetches.
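For reference, the customary "allow everything" robots.txt is just this (an empty Disallow means nothing is off-limits):

```
User-agent: *
Disallow:
```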
My feeling is, in the spirit of KISS (Keep It Simple, Stupid), that unless you NEED it, don't put it up at all, and filter out any (spurious) errors from its absence in other ways.
I do agree with Jim that it would be ideal to make a robots.txt and allow everything.
Damon also makes a good point about keeping it simple, since I have helped many people whose sites fell out of the search engines because they made a badly formatted robots.txt and never used a validator to verify it was correct. This is not to say robots.txt is hard; it is more a matter of not being a lazy webmaster.
Besides the "normal" search engines, there are specialty search engines you may want to allow or block. At a minimum, it's good to be aware of them.
Internet Archive Wayback Machine: Takes a periodic snapshot of your site, making it available for browse/search years after pages may have been taken down. To block it, put these lines in your robots.txt file:
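The Archive's crawler has historically identified itself as ia_archiver, so the block looks like this:

```
User-agent: ia_archiver
Disallow: /
```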
Google Images, Yahoo Image Search, PicSearch: These crawlers look for images on your site, make a best-guess as to their content, and make it easy for everyone to view or download. Depending on whether you think this is good or bad, you may want to block them. Add these lines to your robots.txt file:
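A sketch of the block rules. Googlebot-Image is Google's image crawler; Yahoo-MMCrawler and psbot are the user-agent names the Yahoo and PicSearch image crawlers have used, but check them against your own logs before relying on them:

```
User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: psbot
Disallow: /
```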
The only thing I do with my robots.txt file is to disallow crawling of my images directory. I have enough problems with hotlinked images as it is.
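Assuming the images live under a /images/ directory, that's a one-rule file:

```
User-agent: *
Disallow: /images/
```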
Some of my sites get hit really hard by spiders (especially overseas ones) that have no reason to be spidering them, sucking up a ton of bandwidth, and we block for that reason.
I block out print pages and also AJAX-driven pages (Google ripped through my JS hidden links and had me very confused for a while).
Not sure how relevant this is today, but still worth a read perhaps:
It's a handy place for a blog [webmasterworld.com], too. Of course, that approach may not be for everybody. ;)
Even if you aren't concerned about inappropriate content being spidered by the major SEs, a robots.txt file will avoid clogging up your logs with 404 errors.
We use it to help eliminate bad traffic. Without it, we appeared at the top of results for odd queries; "terms of trade" was the one that eventually made us realize that by not using robots.txt to exclude non-(SEO)-marketing pages we were in danger of losing focus.
I've read countless hundreds of documents about robots.txt files and am still not completely clear on some issues regarding them.
My understanding is: Robots will go through each and every page on your website they can find whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of things listed in the robots.txt file and not to add those pages to the indexes. I don't think this will ever change, because the robots also gather statistics for G (and others).
How many pages does the average website block? For a search engine company to know this they would need to spider them all.
That being said I think you need to use them to ensure that some things do not get indexed like "member profiles" etc unless you have a better way. A BETTER way to keep that content hidden is to make the links to profiles etc show up only when a user is logged in.
This thread has me wondering if a webmaster needs to hide anything at all because everything is a potential link back to your site from a search engine... but then I remember that our sites get rated by a machine that can't fully comprehend the content. Oh joy!
|Robots will go through each and every page on your website they can find whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of things listed in the robots.txt file and not to add those pages to the indexes. |
Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.
Robots.txt was originally conceived as a way for Webmasters to prevent robots from consuming excess bandwidth, and to keep them from executing cgi scripts. However, now that the Web has gone commercial, there are many other good reasons to Disallow spiders from fetching various URLs.
A second control mechanism exists in the HTML <meta name="robots" content="noindex"> tag; its function is different, and the file containing it must not be Disallowed in robots.txt, or the robots won't be able to fetch it to "read" it.
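For example, a page that robots may crawl but that should stay out of the index would carry the tag in its head section (the title here is just illustrative):

```
<head>
  <title>Example deprecated page</title>
  <meta name="robots" content="noindex">
</head>
```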
See www.robotstxt.org and www.w3.org for authoritative information.
I tried this over a month ago and the Google image bot is still eating up my bandwidth like mad. I am spending an extra 10 dollars a month on bandwidth because of Google. What am I doing wrong?
|User-agent: Googlebot-Image |
I'm uncertain if this will work, but can you stop it in .htaccess?
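One possible approach (a sketch, assuming Apache with mod_setenvif) is to deny requests by User-Agent. Note this only works when the bot actually sends that User-Agent string:

```
# .htaccess: deny any request whose User-Agent contains "Googlebot-Image"
SetEnvIfNoCase User-Agent "Googlebot-Image" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```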
|Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file. |
If they already have pages indexed or cached that have recently been disallowed in a revised robots.txt, they won't immediately remove those pages from the index.
Quite a dilemma for webmasters.
If a search engine crawls a page and then you block it with robots.txt, that page will eventually fall out of the search index.
If you accidentally allowed content to be published on two sites, you need to choose one and block the other or you will have problems with the engines.
Not sure where the dilemma is.
Set up robots.txt to prevent search engines from over-crawling.
It happened to me once and I paid a hefty price for it...
Be aware that not every spider obeys robots.txt. There are some nasty bots out there, including but not limited to: 1) spambots that harvest email addresses from your contact forms or guestbook pages; 2) scrapers that scrape your site for free content to be used in their spammy doorway pages; 3) downloader programs that suck your bandwidth by downloading your entire site; 4) programs that are out on the web looking for copyright infringements so they can sue people; 5) viruses & worms; 6) data mining programs; 7) hackers; 8) DDOS attacks, etc.
Many of these will actually read robots.txt to see what you are trying to hide or protect, and then go straight to the restricted content. For this reason I use a dynamic robots.txt page.
Through proper use of .htaccess and mod_rewrite, every time robots.txt is requested, my server invisibly serves a PHP page instead (it looks the same to the client), which detects what bot or browser is making the request. For search engine spiders I serve the real robots.txt content for proper indexing, and for all others I simply disallow everything.
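A sketch of that setup, assuming Apache with mod_rewrite and a hypothetical robots.php script. The user-agent patterns are illustrative only, and since UA strings are trivially spoofed, treat this as a deterrent rather than a security boundary:

```
# .htaccess: internally rewrite requests for robots.txt to robots.php
RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [L]
```

```
<?php
// robots.php -- serve real rules to recognized SE spiders, blanket-deny others
header('Content-Type: text/plain');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/Googlebot|Slurp|msnbot/i', $ua)) {
    // the real robots.txt content for legitimate spiders (example rules)
    echo "User-agent: *\nDisallow: /cgi-bin/\n";
} else {
    // everyone else gets a blanket disallow
    echo "User-agent: *\nDisallow: /\n";
}
```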