
Forum Moderators: goodroi


Why should I have a robots.txt file?

Isn't it OK if I don't?

11:36 pm on Dec 29, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 4, 2003
posts:537
votes: 0


This might be a dumb question, but I did not see one like it anywhere, including the forum library.

Between my clients' sites and my own, I oversee about 50-80 websites. None are more than 200 pages. None have a robots.txt file on the server. I always thought robots.txt was for very large sites, or for when site owners wanted to control what the engines spidered. We have nothing to hide on any of our sites, so we have never used them. As far as I am concerned, anyone (including robots) can look at the sites.

I'm now adding XML sitemaps because the major engines say they prefer that we do so. I notice that in Google Webmaster Tools, when logged in, there is a link in the Diagnostic area for robots.txt analysis. It says Last Downloaded with today's date. Under status it says 404 (Not found). That makes sense, because there isn't one. Google then says: "We check for a new robots.txt file approximately once per day."

Is this like XML sitemaps where the search engines prefer that we have a robots.txt file? I'll put them in if the major engines really want me to, but I see no particular reason to otherwise.

7:58 am on Dec 30, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 22, 2002
posts:902
votes: 0


Your impression is correct that the purpose of robots.txt is just to control what obedient bots will and won't crawl.

If the bot doesn't find a robots.txt, it assumes it's okay to crawl anything and everything it finds under that domain. If you're okay with that, then no worries. :)

12:56 pm on Dec 30, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 29, 2002
posts:1954
votes: 0


Under status it says 404 (Not found).

I would put up a blank one, if for nothing more than to stop logging 404s in the server logs.

1:58 pm on Dec 30, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3080
votes: 67


Robots.txt is also helpful in blocking the bots from indexing directories that contain scripts. If you have a very plain site and you would be OK with having the engines crawl everything, then you do not need a robots.txt.
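
For example, a minimal robots.txt keeping all compliant bots out of a script directory might look like this (assuming the scripts live in /cgi-bin/; adjust the path to match your server):

User-agent: *
Disallow: /cgi-bin/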

as for xml sitemaps, just because the engines say they would like you to do something doesn't mean you should do it. for good webmasters i see no benefit from xml sitemaps. but we should discuss that in a separate thread ;)

2:01 pm on Dec 30, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 25, 2004
posts:2156
votes: 0


Hi,

I hardly use robots.txt* and I don't use sitemaps, and I don't seem to suffer at all.

*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.
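
As a sketch of what I mean (the bot name and path here are examples only), a record blocking just one named crawler from a deprecated mirror directory looks like:

User-agent: Slurp
Disallow: /mirror/

Bots not named by a matching User-agent line simply keep crawling those URLs as normal.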

Rgds

Damon

3:56 pm on Dec 30, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.

And that is exactly what robots.txt is for -- to save bandwidth and control cooperative robots' crawling of your site.

Along with that comes an improvement in the usability/validity of your log files and stats, since they won't be full of 404-Not Found errors resulting from robots trying to fetch the customary robots.txt file.

You don't *have* to have a robots.txt file, but even if you don't need the robots-control facility it provides, adding one that's either blank, or that contains

User-agent: *
Disallow:


is a very good idea, if just to keep your access log and error log clean, and avoid skewing your stats with all those errors from attempted robots.txt fetches.

Jim

5:04 pm on Dec 30, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 25, 2004
posts:2156
votes: 0


My feeling is, in the spirit of KISS (Keep It Simple, Stupid), that unless you NEED it, don't put it up at all, and filter out any (spurious) errors from its absence in other ways.

Rgds

Damon

1:11 pm on Dec 31, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3080
votes: 67


I do agree with Jim that it would be ideal to make a robots.txt and allow everything.

Damon also makes a good point about keeping it simple: I have helped many people whose sites fell out of the search engines because they made a badly formatted robots.txt. The webmasters never used a validator to verify the robots.txt was correct. This is not to say robots.txt is hard; it is more a matter of not being a lazy webmaster.

4:36 pm on Dec 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 6, 2005
posts:670
votes: 0


Besides the "normal" search engines, there are specialty search engines you may want to allow or block. At a minimum, it's good to be aware of them.

Internet Archive Wayback Machine: Takes a periodic snapshot of your site, making it available for browse/search years after pages may have been taken down. To block it, put these lines in your robots.txt file:

User-agent: ia_archiver
Disallow: /

Google Images, Yahoo Image Search, PicSearch: These crawlers look for images on your site, make a best-guess as to their content, and make it easy for everyone to view or download. Depending on whether you think this is good or bad, you may want to block them. Add these lines to your robots.txt file:

User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: psbot
Disallow: /

4:38 pm on Dec 31, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 7, 2003
posts:472
votes: 0


The only thing I do with my robots.txt file is to disallow crawling of my images directory. I have enough problems with hotlinked images as it is.
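
For reference, such a file is just this (assuming the directory is /images/):

User-agent: *
Disallow: /images/

Keep in mind this only keeps compliant crawlers out of image search; it does nothing against hotlinking itself, which is better handled with a referrer check in .htaccess.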
6:45 pm on Dec 31, 2006 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12678
votes: 144


Some of my sites get hit really hard by spiders (especially overseas) that have no reason to be spidering them - sucking up a ton of bandwidth - and we block for that reason.
7:11 pm on Dec 31, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 5, 2002
posts:529
votes: 0


I block out print pages and also AJAX-driven pages (Google ripped through my JS hidden links and had me very confused for a while).
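
Remember that Disallow values match by URL prefix, so a couple of lines (the paths here are hypothetical) can cover whole branches at once:

User-agent: *
Disallow: /print/
Disallow: /ajax/

Note that Disallow: /print (no trailing slash) would also block /print.html and /printable/, so mind the trailing slashes.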
12:44 pm on Jan 1, 2007 (gmt 0)

Full Member

10+ Year Member

joined:Aug 12, 2003
posts:203
votes: 0


Not sure how relevant this is today, but still worth a read perhaps:
[webmasterworld.com...]
3:53 am on Jan 2, 2007 (gmt 0)

Administrator

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 2, 2000
posts:9685
votes: 0


It's a handy place for a blog [webmasterworld.com], too. Of course, that approach may not be for everybody. ;)

Even if you aren't concerned about inappropriate content being spidered by the major SEs, a robots.txt file will avoid clogging up your logs with 404 errors.

4:48 pm on Jan 2, 2007 (gmt 0)

Senior Member

joined:Mar 8, 2002
posts:2897
votes: 0


We use it to help eliminate bad traffic. Without it we appear at the top for odd searches, "terms of trade" being the one that eventually made us realize that by not using robots.txt to exclude non-(SEO)-marketing pages, we were in danger of losing focus.
2:55 am on Jan 3, 2007 (gmt 0)

Junior Member

5+ Year Member

joined:Oct 18, 2006
posts:139
votes: 0


I've read countless hundreds of documents about robots.txt files and am still not completely clear on some issues regarding them.

My understanding is: Robots will go through each and every page on your website they can find whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of the things listed in it and not to add those pages to the indexes. I don't think this will ever change, because the robots also gather statistics for G (and others).

How many pages does the average website block? For a search engine company to know this they would need to spider them all.

That being said, I think you need to use them to ensure that some things do not get indexed, like "member profiles" etc., unless you have a better way. A BETTER way to keep that content hidden is to make the links to profiles etc. show up only when a user is logged in.

This thread has me wondering if a webmaster needs to hide anything at all because everything is a potential link back to your site from a search engine... but then I remember that our sites get rated by a machine that can't fully comprehend the content. Oh joy!

3:52 am on Jan 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Robots will go through each and every page on your website they can find whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of the things listed in it and not to add those pages to the indexes.

Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.

Robots.txt was originally conceived as a way for Webmasters to prevent robots from consuming excess bandwidth, and to keep them from executing cgi scripts. However, now that the Web has gone commercial, there are many other good reasons to Disallow spiders from fetching various URLs.

A second control mechanism exists in the HTML <meta name="robots" content="noindex"> tag. Its function is different, and the file containing it must not be Disallowed in robots.txt, or the robots won't be able to fetch it to "read" it.

See www.robotstxt.org and w3c.org for authoritative information.

Jim

11:01 pm on Jan 3, 2007 (gmt 0)

New User

10+ Year Member

joined:June 19, 2005
posts:13
votes: 0


User-agent: Googlebot-Image
Disallow: /
I tried this over a month ago and the Google image bot is still eating up my bandwidth like mad. I am spending an extra 10 dollars a month on bandwidth because of Google. What am I doing wrong?
4:36 pm on Jan 4, 2007 (gmt 0)

New User

5+ Year Member

joined:Feb 27, 2006
posts:30
votes: 0


I'm uncertain if this will work, but can you stop it in .htaccess?
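
Maybe something along these lines (an untested sketch, assuming mod_rewrite is available; check your raw logs for the exact user-agent string first):

# Refuse all requests from clients identifying as Googlebot-Image
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot-Image [NC]
RewriteRule .* - [F]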
4:41 pm on Jan 4, 2007 (gmt 0)

New User

5+ Year Member

joined:Feb 27, 2006
posts:30
votes: 0


Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.

If they already have pages indexed or cached that have since been disallowed in a revised robots.txt, then they won't remove those pages they have already indexed.

Quite a dilemma for webmasters.

9:17 pm on Jan 4, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3080
votes: 67


If a search engine crawls a page and then you block it with robots.txt, that page will eventually fall out of the search index.

If you accidentally allowed content to be published on two sites, you need to choose one and block the other, or you will have problems with the engines.

Not sure where the dilemma is.

6:50 am on Jan 7, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:May 3, 2005
posts:41
votes: 0


Set up robots.txt to prevent search engines from over-crawling your site.

It happened to me once and I paid a hefty price for it...

8:45 am on Jan 11, 2007 (gmt 0)

New User

5+ Year Member

joined:Aug 2, 2006
posts:20
votes: 0


Be aware that not every spider obeys Robots.txt. There are some nasty bots out there, including but not limited to: 1) spambots that harvest email addresses from your contact forms or guestbook pages; 2) scrapers that scrape your site for free content to be used in their spammy doorway pages; 3) downloader programs that suck your bandwidth by downloading your entire site; 4) programs that are out on the web looking for copyright infringements so they can sue people; 5) viruses & worms; 6) data mining programs; 7) hackers; 8) DDOS attacks, etc.

Many of these will actually go to Robots.txt to see what you are trying to hide or protect, and go straight to the restricted content. For this reason I use a dynamic robots.txt page.

Through proper use of .htaccess and mod_rewrite, every time robots.txt is requested my server invisibly serves a PHP page instead (it looks the same to the viewer), which detects what bot or browser is viewing it. For search engine spiders I serve the real robots.txt content for proper indexing, and for all others I simply disallow everything.
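
The rewrite side of that is a one-liner (a sketch only; robots.php stands for whatever you name your script, which must send a text/plain Content-Type and branch on the user-agent):

# Invisibly serve the PHP script whenever robots.txt is requested
RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [L]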