Forum Moderators: Robert Charlton & goodroi


What should go in robots.txt file in 2019


TomSnow

3:51 pm on Dec 3, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



About to launch a website redesign. Want to make sure I don't screw up indexing by leaving something out of robots.txt.

At the moment the setup is:

User-agent: *
Disallow: /wp-admin/
Disallow:/wp-includes/
Sitemap: http://example.com/sitemap.xml

I'm assuming this allows Google to crawl and index ALL pages on my site except the WordPress directories.

The sitemap is automatically generated by Yoast, and I will double-check that all the pages I want indexed are in it, and that all the pages I want noindexed are left out.

For pages I don't want indexed, I'm using the meta robots directive "noindex", as directed by Google (they no longer support "noindex" in the robots.txt file).
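For anyone landing here later, the page-level alternatives Google still supports look like this (just a sketch; the header form is useful for non-HTML files such as PDFs):

```
<!-- in the page's <head> -->
<meta name="robots" content="noindex">

# or, as an HTTP response header (e.g. for PDFs)
X-Robots-Tag: noindex
```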

Anything I'm missing?

Thanks!

[edited by: goodroi at 3:54 pm (utc) on Dec 3, 2019]
[edit reason] Examplified [/edit]

goodroi

4:03 pm on Dec 3, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you are selling downloads or have private information, you'd probably want to add those directories. But remember that robots.txt is a friendly suggestion & not real security, so you probably want stronger access control.

I like to include a few honeypot subdirectories in the exclusion list, like /payment-database/, & then ban any IP that tries to access them.
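To sketch the banning side (just an illustration - it assumes a common/combined Apache-style access log and my hypothetical /payment-database/ bait directory), pulling the offending IPs out of the log could look like:

```python
import re

# Hypothetical bait directories listed in robots.txt but never linked anywhere
HONEYPOT_PATHS = ("/payment-database/",)

# Client IP and request path from a common/combined-format access log line
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def honeypot_ips(log_lines):
    """Return the set of client IPs that requested any honeypot path."""
    hits = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m.group(2).startswith(HONEYPOT_PATHS):
            hits.add(m.group(1))
    return hits
```

Feed the result to whatever deny list or firewall you normally use for banning.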

not2easy

6:21 pm on Dec 3, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I agree that file permissions and 'honeypots' can help avoid problems. Anything in WP that can be changed from its defaults is a plus for site security. For one example, I always change the SQL table prefix from its "wp_" default, to force scripted injection attempts to guess at it. Every bit helps.

Just a few minor suggestions:
Disallow:/wp-includes/
should have a space as in
Disallow: /wp-includes/


Sitemap: http://example.com/sitemap.xml
did you mean the http: part or should that be
Sitemap: https://example.com/sitemap.xml
It is a lot easier to go to https when you are setting up a new site, especially one that includes WP, than to go back and make all the changes to switch later - so change it before your first URL goes live if you can.

IF there is more than one sitemap as your post suggests, all of them should be listed in your robots.txt file.

TomSnow

12:45 am on Dec 4, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks, you guys! Always helpful.

tangor

9:23 am on Dec 4, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My honeypot is /secretsauce/

Wanna guess how many have been banned? :)

No5needinput

1:38 pm on Dec 4, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



846

skaterpunk

1:45 pm on Dec 4, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm under the impression that this question is an ongoing debate.

You mention the default robots.txt setup of:

User-agent: *
Disallow: /wp-admin/
Disallow:/wp-includes/
Sitemap: http://example.com/sitemap.xml

I have read many articles stating that those exclusions are dated, and that you should just leave the robots.txt as:

User-agent: *

Unless of course you have some very specific exclusions beyond a basic setup.

lucy24

6:25 pm on Dec 4, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Disallow: /wp-admin/
Disallow: /wp-includes/
Are these two directories ever visibly linked (that is, visible to a robot, not necessarily to a human) from publicly accessible pages? If not, the lines would seem to be a waste of time, since nobody requests these directories except malign robots that don't honor--or even look at--robots.txt anyway. And then, if a malign robot does look at robots.txt, you've now told them "Yup, these directories do in fact exist on my site".

skaterpunk

1:46 pm on Dec 6, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



@lucy24

On the same note, adding:
Sitemap: http://example.com/sitemap.xml

is a repetitive step if you are already submitting your sitemap(s) with a WP plugin, as most WP sites do.

Hence leaving the robots.txt file blank, for the reasons mentioned.

not2easy

2:35 pm on Dec 6, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



@lucy24 - yes, pages do link to resources in /wp-includes/, which is where the default versions of .css, .js and other resources are stored. I do include Allows; I probably should have mentioned that earlier:
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Allow: /wp-content/*/*.jpg
Allow: /wp-includes/css/
Allow: /wp-includes/js/
The reason for the disallows is that I have seen '/wp-admin/' URLs listed under 'errors' (due to permissions) in GSC and wondered why Google would be looking around in those default folders.
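For anyone wondering how mixed Disallow/Allow lines like those interact: per Google's documentation, the longest matching rule wins, and on a length tie the Allow wins. A rough Python sketch of that matching logic (my own illustration, not Google's actual code; '*' is a wildcard and '$' anchors the end of the path):

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path, rules):
    """rules is a list of (allow, pattern) pairs. The longest matching
    pattern wins; on a length tie, Allow wins. No match means allowed."""
    best_len, allowed = -1, True
    for allow, pattern in rules:
        if rule_to_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, allowed = len(pattern), allow
    return allowed
```

So with the rules above, /wp-includes/css/style.css is crawlable (the Allow is the longer match) while anything under /wp-admin/ stays blocked.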

@skaterpunk just expanding on your point to clarify - IF you have not connected your GSC account to your sitemap generator (I have not), then you should list the sitemap in robots.txt, because it does not get submitted automatically unless you have set it up to do that. By the same token, you can submit a sitemap manually or automatically via a plugin, but that does not mean that Google will retrieve it on any given crawl. They visit without checking your robots.txt each time - as opposed to Bing, which will check robots.txt six times per page request.

lucy24

6:23 pm on Dec 6, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... and if your sitemap is the default name and location /sitemap.xml then any robot (search engine or otherwise) worth its salt will find it anyway. In fact, some will ask for it even if your actual sitemap has a different name and/or location. I've got one site so tiny, I just have a sitemap.txt, named as such in robots.txt, but that hasn't stopped some robots--both good and bad--from requesting sitemap.xml.

skaterpunk

1:30 pm on Dec 7, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



@not2easy - fair point if you aren't submitting a sitemap to GSC. I do, and my sitemap is crawled regularly without any mention in robots.txt.