
Sitemaps, Meta Data, and robots.txt Forum

Why should I have a robots.txt file?
Isn't it OK if I don't?
beren
msg:3203374
 11:36 pm on Dec 29, 2006 (gmt 0)

This might be a dumb question, but I did not see one like it anywhere, including the forum library.

Between my clients' sites and my own, I oversee about 50-80 websites. None are more than 200 pages. None have a robots.txt file on the server. I always thought robots.txt was for very large sites, or for site owners who wanted to control what the engines spidered. We have nothing to hide on any of our sites, so we have never used them. As far as I am concerned, anyone (including robots) can look at the sites.

I'm now adding XML sitemaps because the major engines say they prefer that we do so. I notice that in Google Webmaster Tools, when logged in, in the Diagnostic area there is a link for robots.txt analysis. It says Last Downloaded and has today's date. Under status it says 404 (Not found). Which makes sense because there isn't one. Google then says: "We check for a new robots.txt file approximately once per day."

Is this like XML sitemaps where the search engines prefer that we have a robots.txt file? I'll put them in if the major engines really want me to, but I see no particular reason to otherwise.


pleeker
msg:3203616
 7:58 am on Dec 30, 2006 (gmt 0)

Your impression is correct that the purpose of robots.txt is just to control what obedient bots will and won't crawl.

If the bot doesn't find a robots.txt, it assumes it's okay to crawl anything and everything it finds under that domain. If you're okay with that, then no worries. :)

The Contractor
msg:3203690
 12:56 pm on Dec 30, 2006 (gmt 0)

Under status it says 404 (Not found).

I would put up a blank one, if for nothing more than to stop logging 404s in the server logs.

goodroi
msg:3203699
 1:58 pm on Dec 30, 2006 (gmt 0)

A robots.txt file is also helpful for blocking bots from crawling directories that contain scripts. If you have a very plain site and you are OK with having the engines crawl everything, then you do not need a robots.txt.
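For example, a minimal robots.txt that keeps cooperative bots out of a script directory (the /cgi-bin/ path here is just illustrative) would be:

User-agent: *
Disallow: /cgi-bin/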

As for XML sitemaps, just because the engines say they would like you to do something doesn't mean you should do it. For good webmasters I see no benefit from XML sitemaps, but we should discuss that in a separate thread ;)

DamonHD
msg:3203700
 2:01 pm on Dec 30, 2006 (gmt 0)

Hi,

I hardly use robots.txt* and I don't use sitemaps, and I don't seem to suffer at all.

*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.

Rgds

Damon

jdMorgan
msg:3203751
 3:56 pm on Dec 30, 2006 (gmt 0)

*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.

And that is exactly what robots.txt is for -- To save bandwidth and control cooperative robots' crawling of your site.

Along with that comes an improvement in the usability/validity of your log files and stats, since they won't be full of 404-Not Found errors resulting from robots trying to fetch the customary robots.txt file.

You don't *have* to have a robots.txt file, but even if you don't need the robots-control facility it provides, adding one that's either blank, or that contains

User-agent: *
Disallow:


is a very good idea, if just to keep your access log and error log clean, and avoid skewing your stats with all those errors from attempted robots.txt fetches.

Jim

DamonHD
msg:3203781
 5:04 pm on Dec 30, 2006 (gmt 0)

My feeling is, in the spirit of KISS (Keep It Simple, Stupid), that unless you NEED it, don't put it up at all, and filter out any (spurious) errors from its absence in other ways.

Rgds

Damon

goodroi
msg:3204362
 1:11 pm on Dec 31, 2006 (gmt 0)

I do agree with Jim that it would be ideal to make a robots.txt and allow everything.

Damon also makes a good point about keeping it simple. I have helped many people whose sites fell out of the search engines because they made a badly formatted robots.txt and never used a validator to verify it was correct. This is not to say robots.txt is hard; it is more a story of not being a lazy webmaster.

jwolthuis
msg:3204470
 4:36 pm on Dec 31, 2006 (gmt 0)

Besides the "normal" search engines, there are specialty search engines you may want to allow or block. At a minimum, it's good to be aware of them.

Internet Archive Wayback Machine: Takes a periodic snapshot of your site, making it available for browse/search years after pages may have been taken down. To block it, put these lines in your robots.txt file:

User-agent: ia_archiver
Disallow: /

Google Images, Yahoo Image Search, PicSearch: These crawlers look for images on your site, make a best-guess as to their content, and make it easy for everyone to view or download. Depending on whether you think this is good or bad, you may want to block them. Add these lines to your robots.txt file:

User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: psbot
Disallow: /

wrgvt
msg:3204472
 4:38 pm on Dec 31, 2006 (gmt 0)

The only thing I do with my robots.txt file is to disallow crawling of my images directory. I have enough problems with hotlinked images as it is.
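In robots.txt terms that's just (assuming the images live in a single /images/ directory):

User-agent: *
Disallow: /images/

Note that this only stops cooperative crawlers; it won't stop a browser, or a hotlinker's page, from requesting the images directly.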

netmeg
msg:3204536
 6:45 pm on Dec 31, 2006 (gmt 0)

Some of my sites get hit really hard by spiders (especially overseas) that have no reason to be spidering them - sucking up a ton of bandwidth - and we block for that reason.

AhmedF
msg:3204543
 7:11 pm on Dec 31, 2006 (gmt 0)

I block out print pages and also AJAX-driven pages (Google ripped through my JS hidden links and had me very confused for a while).
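For example (the /print/ and /ajax/ paths here are hypothetical):

User-agent: *
Disallow: /print/
Disallow: /ajax/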

Bones
msg:3204933
 12:44 pm on Jan 1, 2007 (gmt 0)

Not sure how relevant this is today, but still worth a read perhaps:
[webmasterworld.com...]

rogerd
msg:3205416
 3:53 am on Jan 2, 2007 (gmt 0)

It's a handy place for a blog [webmasterworld.com], too. Of course, that approach may not be for everybody. ;)

Even if you aren't concerned about inappropriate content being spidered by the major SEs, a robots.txt file will avoid clogging up your logs with 404 errors.

Receptional
msg:3205920
 4:48 pm on Jan 2, 2007 (gmt 0)

We use it to help eliminate bad traffic. Without it we appeared at the top for odd searches; "terms of trade" was the one that eventually made us realize that by not using robots.txt to exclude non-(SEO)-marketing pages we were in danger of losing focus.

Kurgano
msg:3206475
 2:55 am on Jan 3, 2007 (gmt 0)

I've read hundreds of documents about robots.txt files and I'm still not completely clear on some of the issues.

My understanding is: robots will go through each and every page on your website they can find, whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of the things listed in it and not to add those pages to the indexes. I don't think this will ever change, because the robots also gather statistics for G (and others).

How many pages does the average website block? For a search engine company to know this they would need to spider them all.

That being said, I think you need to use them to ensure that some things, like member profiles, do not get indexed, unless you have a better way. A BETTER way to keep that content hidden is to make the links to profiles show up only when a user is logged in, as in the sketch below.
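A minimal sketch of that idea, assuming a PHP session-based login (the $_SESSION['user_id'] key and the profile URL are hypothetical):

<?php
session_start();
// Only logged-in users ever see the profile links,
// so crawlers never discover those URLs in the first place.
if (!empty($_SESSION['user_id'])) {
    echo '<a href="/profile.php?id=123">View profile</a>';
}
?>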

This thread has me wondering if a webmaster needs to hide anything at all because everything is a potential link back to your site from a search engine... but then I remember that our sites get rated by a machine that can't fully comprehend the content. Oh joy!

jdMorgan
msg:3206505
 3:52 am on Jan 3, 2007 (gmt 0)

Robots will go through each and every page on your website they can find, whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of the things listed in it and not to add those pages to the indexes.

Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.

Robots.txt was originally conceived as a way for Webmasters to prevent robots from consuming excess bandwidth, and to keep them from executing cgi scripts. However, now that the Web has gone commercial, there are many other good reasons to Disallow spiders from fetching various URLs.

A second control mechanism exists in the HTML <meta name="robots" content="noindex"> tag. Its function is different, and the file containing it must not be Disallowed in robots.txt, or the robots won't be able to fetch it to "read" it.
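For example, placed in the <head> of any page you want crawled but kept out of the index (the title is just an example):

<head>
<title>Widget price list</title>
<meta name="robots" content="noindex">
</head>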

See www.robotstxt.org and w3c.org for authoritative information.

Jim

Shawna
msg:3207380
 11:01 pm on Jan 3, 2007 (gmt 0)

User-agent: Googlebot-Image
Disallow: /

I tried this over a month ago and the Google image bot is still eating up my bandwidth like mad. I am spending an extra 10 dollars a month on bandwidth because of Google. What am I doing wrong?
Glitzer
msg:3208145
 4:36 pm on Jan 4, 2007 (gmt 0)

I'm uncertain if this will work, but can you stop it in .htaccess?
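Something along these lines might work (a sketch assuming Apache with mod_rewrite enabled; test it before relying on it):

RewriteEngine On
# Return 403 Forbidden to anything identifying itself as Googlebot-Image
RewriteCond %{HTTP_USER_AGENT} Googlebot-Image [NC]
RewriteRule .* - [F]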

Glitzer
msg:3208153
 4:41 pm on Jan 4, 2007 (gmt 0)

Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.

But if they already have pages indexed or cached that have recently been disallowed in a revised robots.txt, they won't remove the pages they've already indexed.

Quite a dilemma for webmasters.

goodroi
msg:3208550
 9:17 pm on Jan 4, 2007 (gmt 0)

If a search engine crawls a page and then you block it with robots.txt, that page will eventually fall out of the search index.

If you accidentally allowed content to be published on two sites, you need to choose one and block the other or you will have problems with the engines.

Not sure where the dilemma is.

piplio
msg:3211037
 6:50 am on Jan 7, 2007 (gmt 0)

Set up robots.txt to prevent search engines from over-crawling your site.

It happened to me once and I paid a hefty price for it...
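One lever for this is the nonstandard Crawl-delay directive, which Yahoo's Slurp and msnbot honor (Googlebot ignores it; for Google you can adjust the crawl rate in Webmaster Tools instead). For example, to ask Slurp to wait 10 seconds between requests:

User-agent: Slurp
Crawl-delay: 10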

hybrid6studios
msg:3215726
 8:45 am on Jan 11, 2007 (gmt 0)

Be aware that not every spider obeys robots.txt. There are some nasty bots out there, including but not limited to: 1) spambots that harvest email addresses from your contact forms or guestbook pages; 2) scrapers that scrape your site for free content to be used in their spammy doorway pages; 3) downloader programs that suck your bandwidth by downloading your entire site; 4) programs that are out on the web looking for copyright infringements so they can sue people; 5) viruses & worms; 6) data mining programs; 7) hackers; 8) DDOS attacks, etc.

Many of these will actually go to Robots.txt to see what you are trying to hide or protect, and go straight to the restricted content. For this reason I use a dynamic robots.txt page.

Through proper use of .htaccess and mod_rewrite, every time my server is asked for robots.txt it invisibly serves a PHP page instead (the URL looks the same to the viewer), which detects what bot or browser is making the request. For search engine spiders I serve the real robots.txt content for proper indexing, and for all others I simply disallow everything. A sketch of the idea is below.
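A minimal sketch of that setup, assuming Apache with mod_rewrite and PHP (the bot names, paths, and file name are illustrative, not my actual rules):

# .htaccess: silently hand robots.txt requests to a PHP script
RewriteEngine On
RewriteRule ^robots\.txt$ robots.php [L]

# robots.php:
<?php
header('Content-Type: text/plain');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/Googlebot|Slurp|msnbot/i', $ua)) {
    // Real rules for the major engines
    echo "User-agent: *\nDisallow: /cgi-bin/\n";
} else {
    // Everyone else is told to stay out entirely
    echo "User-agent: *\nDisallow: /\n";
}
?>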
