Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Does every website need to use robots.txt?


solaris1

8:53 am on Apr 8, 2016 (gmt 0)

10+ Year Member



Hi friends. I am very confused about robots.txt. Is it necessary to apply a robots.txt file to a website? Could it affect our website's ranking if we don't use a robots.txt file?

[edited by: Andy_Langton at 8:59 am (utc) on Apr 8, 2016]
[edit reason] No specific website please, see forum charter! [/edit]

Andy Langton

9:06 am on Apr 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi solaris1, and welcome to WebmasterWorld [webmasterworld.com]! :)

The "official" name for what robots.txt does is robots exclusion - i.e. stopping search engines from visiting or indexing certain pages. If you want to allow all search engines to access everything, then you don't need robots.txt at all. If there is no robots.txt, search engines are allowed access to everything.

If your site has a lot of pages that are not the sort of thing you want Google to find (e.g. duplicates, empty pages) then it's possible that using robots.txt to exclude these pages would help performance.
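
For illustration (the directory names here are invented), a robots.txt that keeps well-behaved crawlers out of, say, printer-friendly duplicates and empty search-result pages might look like:

```
User-agent: *
Disallow: /print/
Disallow: /search/
```

Anything not listed stays crawlable.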

Some sites use a blank robots.txt to avoid seeing 404s in their server logs, but this is certainly not required.

Wilburforce

9:20 am on Apr 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No, you don't have to use it.

However, posting a blank file won't do any harm. And if you want to allow all robots access to all areas of the site (which is what will happen anyway if you post nothing, or post a blank file), you could add the following (do NOT add any characters after the colon):

User-agent: *
Disallow:

which expressly allows all robots to access everything.
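
For contrast, the opposite rule (note the single slash after the colon) tells all robots to stay out of everything, so it's worth being careful not to confuse the two:

```
User-agent: *
Disallow: /
```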

AnujSinghal

12:36 pm on Apr 8, 2016 (gmt 0)



The robots.txt file is useful for preventing crawlers from accessing specific directories or files on your website. If you don't need that kind of restriction on your website, then you simply don't need a robots.txt file.

lucy24

8:37 pm on Apr 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Could it affect our website's ranking if we don't use a robots.txt file?

Where's that FUD subforum when we need it? There is a heck of a lot of speculation about things that search engines do and do not like, but I must admit I have never heard it suggested that the mere existence or nonexistence of a "robots.txt" file is a ranking factor.

Seriously though... Is there absolutely nothing on your site that you'd prefer to keep off-limits?

:: detour for random experiment with assorted Big Names, winding up with my local library* just for yucks ::

Er, WebmasterWorld, isn't there supposed to be a "User-Agent:" line before the "Disallow:" lines? And gosh, I wonder what the IRS has against the discobot that merits an exclusion all its own :)

Conclusion: everyone has a robots.txt


* Hate to break it to you, locallibrary dot org, but that particular search engine ignores robots.txt

tangor

9:09 pm on Apr 8, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you need one? No.
Should you be without one? No.

The benefit of robots.txt is to manage your bandwidth and where your stuff is indexed (scraped, stolen, perverted, even by well-meaning folks) by notifying those bots that DO RESPECT your directives to stay away.

While you can deal with the bad bots by other methods, if you can keep the respectful ones out (the ones that bring no value to YOU, the site owner), that reduces your workload, your bandwidth (costs) and other associated ills.

If nothing else, robots.txt is a first filter on who behaves and who does not. Useful in that regard.

Jepakazol

9:16 am on Apr 10, 2016 (gmt 0)

10+ Year Member



One important note - if you optimize your website for Baidu the answer might be different.

[searchengineland.com...]
"Robots.txt: Baidu does not like websites with a robots.txt file. If one currently exists, it should be removed. Any important rules that would normally be set up in the robots.txt file should be set in the .htaccess file or IIS server settings."

I never checked it myself, but I saw it a while ago regarding Baidu SEO.
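
If you did go down that route, a hypothetical .htaccess equivalent of a "Disallow: /private/" rule for one bot, assuming Apache with mod_rewrite enabled (the bot name and directory are made up), might look like:

```
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
RewriteRule ^private/ - [F]
```

Note that unlike robots.txt, which only politely asks, this actively returns 403 Forbidden to matching requests.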

lucy24

8:31 pm on Apr 10, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Baidu does not like websites with a robots.txt file.

What possible difference can it make, since they ignore it anyway?

anallawalla

11:51 pm on Apr 10, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Here is a very simple reason for having even an empty robots.txt file. Every spider asks for it and this makes an entry in the server log. If it is not found, a 404 error is added to the server log. So, you may not want to have all these 404s distracting you in the log, should you ever want to look at it. :)

tangor

12:28 am on Apr 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Even 404s are instructive.

Well-behaved bots are in and out and leave you alone (at least once a week, as they will be back to see if you changed your mind). A 200 then means you have to filter THAT out of the traffic to find out what you have. The 404 you are already filtering out, so you never see it. At least that is the way I do it, though I do look at the 404s in a separate report.

robots.txt will not protect any site. It just lets us know who is naughty or nice.

Wilburforce

7:08 am on Apr 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The 404 you are already filtering out so you never see it.


What are you looking at? If I want to go forensic on something I use the raw server logs.

In those I might very occasionally feel the need to look at 404s (especially if e.g. GSC is reporting crawl errors), and having every robots.txt request in the results would make it much more of a pain than it needs to be.

I would never filter for 200 (why would anybody, unless they had no traffic at all?), and if I want to know who is looking at robots.txt, I filter for robots.txt.

Am I missing something?
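
That kind of filtering is simple to sketch. A minimal Python illustration (the log lines are invented examples in Apache "combined" format):

```python
# Made-up access-log lines in Apache "combined" format.
log_lines = [
    '66.249.66.1 - - [11/Apr/2016:09:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 24 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [11/Apr/2016:09:00:02 +0000] "GET /widgets.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '203.0.113.5 - - [11/Apr/2016:09:01:07 +0000] "GET /missing.html HTTP/1.1" 404 209 "-" "BadBot/1.0"',
    '203.0.113.5 - - [11/Apr/2016:09:01:08 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "BadBot/1.0"',
]

def parse(line):
    """Pull out the client IP, request path and status code."""
    parts = line.split()
    return {"ip": parts[0], "path": parts[6], "status": int(parts[8])}

hits = [parse(l) for l in log_lines]

# Who is asking for robots.txt (and what status they got):
robots_requests = [h for h in hits if h["path"] == "/robots.txt"]

# 404s with the robots.txt noise filtered out:
real_404s = [h for h in hits if h["status"] == 404 and h["path"] != "/robots.txt"]
```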

aakk9999

10:27 am on Apr 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It should be noted, though, that a robots.txt that gives a server error (HTTP 500) can be problematic and can impact ranking. A couple of years ago I saw a website get deindexed because their server returned HTTP 500 instead of 404 Not Found on Googlebot's robots.txt requests.

Uploading an empty robots.txt (which resulted in the robots.txt response code being 200 OK) solved the problem and the site came back pretty much immediately. The old thread is here:

Google drops site after only 10 days of robots.txt returning HTTP 500
June 2013
https://www.webmasterworld.com/google/4580855.htm [webmasterworld.com]

There was also a thread on this on Seoroundtable:

Google: Can't Crawl Your Robots.txt Then We Stop Crawling Your Site
Jan 2014
https://www.seroundtable.com/google-crawl-robots-txt-17906.html [seroundtable.com]

Note that "Can't Crawl" in the post above refers to HTTP 500, not to receiving HTTP 404 Not Found.
It also appears that the same issue (deindexing) can occur if the robots.txt response code is HTTP 401 Unauthorized or HTTP 403 Forbidden.
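
Putting the reports in this thread together, the behaviour can be sketched as a small lookup (a simplification of what posters observed, not an official Google specification):

```python
def crawl_policy(status: int) -> str:
    """Rough sketch of how Googlebot reportedly reacts to the HTTP
    status returned for robots.txt (per this thread, simplified)."""
    if status == 200:
        return "parse robots.txt and obey its rules"
    if status == 404:
        return "assume no restrictions and crawl everything"
    if status in (401, 403) or 500 <= status <= 599:
        return "treat the site as uncrawlable; crawling may stop"
    return "behaviour varies; retry later"
```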

Robert Charlton

1:35 am on Apr 12, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



robots.txt that gives server error (HTTP 500)

In general, if you don't want to disallow anything, a blank robots.txt with a properly configured 404 error page is the preferred setup.

If you don't have a robots.txt file, though, a custom 404 error page that doesn't return a true 404 can lead to problems.

I can't find the reference, but I remember that Matt Cutts had said that Google would prefer no robots.txt to a robots.txt that returns a soft 404 (ie, a redirect to a custom error page that returns a 200). Among other things, that setup wastes a lot of server resources, even if Google ultimately does figure it out. This would happen every time your site is requested, as Google always looks for robots.txt first.

Obviously, returning a 500 is even worse.

PS: Here's a classic post from tedster that's worth rereading if you're not thoroughly familiar with the issues.

Custom Error Pages - Beware the Server Header Status Code
April, 2008
https://www.webmasterworld.com/google/3626149.htm [webmasterworld.com]