Forum Moderators: Robert Charlton & goodroi

Google Indexing Wordpress CMS Admin Folder URLs

Rysk100

8:12 am on Jul 4, 2018 (gmt 0)

5+ Year Member



For some reason, Google has been aggressively crawling and now indexing sets of CMS admin URLs.

It crawled and indexed over 50 /wp-includes URLs. I have since blocked these in robots.txt.

It has now crawled and indexed a similar number of URLs from the /app folder, e.g. /app/mu-plugins/advanced-custom-fields/images/add-ons/, into the primary index.

The problem is I can't block this folder in robots.txt as it contains the CSS and JS of the actual site, which I understand Google needs in order to render the site. Giving Google access to those files is now considered best practice.

Why is Google suddenly finding and indexing these URLs?
What can I do to stop Google from crawling and indexing them?

Rysk100

8:53 am on Jul 4, 2018 (gmt 0)

5+ Year Member



Maybe one solution is to set noindex for these files/URLs with the X-Robots-Tag HTTP header via .htaccess:

<Files ~ "\app $">
Header append X-Robots-Tag "noindex"
</Files>

Does this seem right?
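
One way to sanity-check a rule like this is to look at the X-Robots-Tag value the server actually returns for an /app URL (for example with `curl -I`) and see whether it would tell Googlebot not to index. The following is a minimal Python sketch under stated assumptions, not a definitive parser: `has_noindex` and `VALUED_DIRECTIVES` are hypothetical names, and the handling of the optional "user-agent:" prefix is simplified.

```python
# Directives written as "name: value"; their colon is not a user-agent prefix.
VALUED_DIRECTIVES = {
    "unavailable_after", "max-snippet", "max-image-preview", "max-video-preview",
}

def has_noindex(header_value, user_agent="googlebot"):
    """Return True if an X-Robots-Tag value tells `user_agent` not to index."""
    value = header_value.strip().lower()
    prefix, sep, rest = value.partition(":")
    prefix = prefix.strip()
    # Treat "<name>: ..." as a crawler-specific rule only when <name> is a
    # single token that is not itself a valued directive (simplification).
    if sep and "," not in prefix and prefix not in VALUED_DIRECTIVES:
        if prefix != user_agent:
            return False  # rule is addressed to a different crawler
        value = rest
    directives = {d.strip() for d in value.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    return bool(directives & {"noindex", "none"})

print(has_noindex("noindex"))           # value sent by the Header line above
print(has_noindex("bingbot: noindex"))  # addressed to another crawler
```

Note that Google can only see this header if it is allowed to fetch the URL, so the path must not also be disallowed in robots.txt.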

keyplyr

11:28 am on Jul 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Rysk100 - why not simply noindex the folders directly in the robots.txt?

Rysk100

11:32 am on Jul 4, 2018 (gmt 0)

5+ Year Member



Unless I'm very mistaken, robots.txt stops crawling, not indexing. Unlike /wp-admin and /wp-includes, the files under /app contain the site's CSS, images and JS, which Google needs to be able to crawl.

keyplyr

11:37 am on Jul 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My point is, you said you were going to append a header tag to noindex the app folder, and I suggested to just write it directly in the robots.txt.

In robots.txt just disallow the URLs you don't want indexed. Then remove them from the index in Google Search Console with the removal tool. That way they won't be crawled in the future & reindexed.

[fix typo]

[edited by: keyplyr at 11:55 am (utc) on Jul 4, 2018]

TorontoBoy

11:44 am on Jul 4, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I am also interested. I've never had Google or any other search engine index my WP admins. Did you set permissions on your folders to not allow viewing? [codex.wordpress.org...]

not2easy

1:28 pm on Jul 4, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You can disallow folders with robots.txt and allow specific file types within those folders.
Disallow: /wp-includes/
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Note that 'Allow' follows 'Disallow'
Allow: /*.css
Allow: /*.js
covers all .js and .css in disallowed folders - those would be after all disallows. Also note that this applies to Google's bots and 'some' others - not all; also note that bad bots don't even bother to read robots.txt.

If you want to know whether it works, use the robots.txt Tester in GSC (not the 'new' version).
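
For a rough offline check before using the tester, the matching can be approximated in a few lines of Python. This is only a sketch under an assumption: it follows the precedence Google documents for its own crawlers (the rule with the longest matching path wins, and Allow wins a tie), with `*` as a wildcard and a trailing `$` as an end anchor. `is_allowed` and `_rule_to_regex` are hypothetical helper names, and other crawlers may resolve conflicts differently.

```python
import re

def _rule_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """rules: list of (directive, pattern) pairs for one user-agent group."""
    best_len, best_allow = -1, True  # no matching rule means crawling is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _rule_to_regex(pattern).match(path):
            allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length the Allow rule wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow

rules = [
    ("Disallow", "/wp-includes/"),
    ("Allow", "/wp-includes/js/"),
    ("Allow", "/wp-includes/css/"),
]
print(is_allowed(rules, "/wp-includes/js/jquery/jquery.js"))  # True
print(is_allowed(rules, "/wp-includes/class-wp.php"))         # False
```

Under the same assumption, a short wildcard rule such as Allow: /*.css is shorter than Disallow: /wp-includes/ and so would not outrank it for files inside that folder, which is a good reason to confirm the real behaviour in the tester.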

aristotle

6:28 pm on Jul 4, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Fake Googlebots might try to crawl those files looking for weaknesses that a hacker could exploit.

Robert Charlton

7:43 pm on Jul 11, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Mod's note: This discussion continued by OP under new topic...

X-Robots Noindex or 403 Forbidden?
https://www.webmasterworld.com/google/4910541.htm

phranque

10:48 am on Jul 16, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google has been aggressively crawling and now indexing sets of CMS admin URLs

Unlike /wp-admin and /wp-includes...

requests for /wp-includes/ paths should get a 403 (Forbidden) status code.
see this from the Codex:
https://codex.wordpress.org/Hardening_WordPress#WP-Includes

requests for /wp-admin/ paths should get a 401 status code which is typically a challenge for HTTP Basic Authentication.
see this from the Codex:
https://codex.wordpress.org/Hardening_WordPress#WP-Admin

both the 401 and 403 status code responses will prevent google from indexing the requested url.
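
to audit this, the status codes seen for the affected paths can be filtered for anything that is not a 401 or 403. a minimal Python sketch, assuming the (path, status) pairs have already been collected from a crawl or a server log; `BLOCKING_STATUSES`, `still_exposed` and the sample data are hypothetical:

```python
# Status codes that, per the post above, keep a requested URL out of the index.
BLOCKING_STATUSES = {401, 403}

def still_exposed(results):
    """Return the paths whose response status does not block indexing.

    results: iterable of (path, status) pairs, e.g. from a log export.
    """
    return [path for path, status in results if status not in BLOCKING_STATUSES]

# Hypothetical sample data for illustration only.
sample = [
    ("/wp-includes/", 403),
    ("/wp-admin/", 401),
    ("/app/mu-plugins/advanced-custom-fields/images/add-ons/", 200),
]
print(still_exposed(sample))
```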