X-Robots Noindex or 403 Forbidden?
2:07 pm on Jul 11, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


Further to my post on [webmasterworld.com...]

I added the X-Robots-Tag noindex directive to the HTTP header response for the /app directory, which Google was indexing and which I didn't want indexed.
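The response for those URLs now carries the header, something like this (an illustration, not our exact config):

HTTP/1.1 200 OK
Content-Type: text/css
X-Robots-Tag: noindex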

Our Dev Ops guy, though, added a 403 response to this folder (which should have been there originally), and now, from what I understand, Google can't act on the noindex directive because of the 403.

What should I do:

1. Keep the 403 response code - Google will eventually remove the URLs because of this
2. Remove the 403 response code so Google can see the X-Robots-Tag noindex directive
7:40 pm on July 11, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12402
votes: 410


Mod's note: I've changed the spelling in the thread title from the incorrect "No Index" to the corrected "Noindex".

8:05 pm on July 11, 2018 (gmt 0)

Senior Member from FR 

WebmasterWorld Senior Member leosghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Feb 15, 2004
posts:7139
votes: 413


Our Dev Ops guy
..<= you have a "dev ops guy"* you have my sincere commiserations..

*one should never dignify such a spurious title / job description with even Camel Case..

If the folder cannot be accessed due to the 403..the directive cannot be read, by G, nor anyone / thing else, other than those with password access to your site..and any eventual hackers that might be intrigued enough to take a look..

<snort>dev ops guy</snort>
8:45 pm on July 11, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


If you don't want Google crawling the /app directory, why can't you simply disallow it in robots.txt? Sure, Google always kicks up a fuss when there's something it is not allowed to crawl, but the appropriate response is to ignore them. Is the directory filled with pages that are constantly getting linked from other people's sites, so there is a real danger of its content showing up in SERPs?
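At its simplest that's just two lines (adjust the path to your own):

User-agent: *
Disallow: /app/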

I may not want to know why two different people--in this case you and Dev Ops Guy--have the independent power to modify responses or govern access on the same site.

with even Camel Case
It's only Camel Case if you write it "DevOps". Otherwise it's Title Case.
8:53 pm on July 11, 2018 (gmt 0)

Senior Member from FR 

WebmasterWorld Senior Member leosghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Feb 15, 2004
posts:7139
votes: 413


True..But..can / should "dev ops guy" be a "title"? ( use of Title Case would imply such )..

I would consider it to be more of an expletive..
6:27 am on July 12, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


lucy24 - the /app folder contains the site's CSS / JS etc. - Google needs to be able to crawl these URLs, see: [webmasterworld.com...]

Is anyone able to directly answer my question as to how best to remove the 100+ /app and /wp/wp-includes URLs that are now in Google's primary index?

1. Keep the 403 response code - Google will eventually re-crawl and remove the URLs because of this response code
2. Remove the 403 response code so Google can see the X-Robots-Tag noindex directive
8:42 am on July 12, 2018 (gmt 0)

Senior Member from FR 

WebmasterWorld Senior Member leosghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Feb 15, 2004
posts:7139
votes: 413


Already did answer that above..remove the 403..
8:56 am on July 12, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


Leosghost - thanks for the reply

Right, but Google also says that to remove a URL permanently you should 404/410 or block by requiring a password, which in a roundabout way is what a 403 does
(see 'Make Removal Permanent' in [support.google.com...])

Google isn't explicitly saying 403, but it's the same intention for a hard removal?

I say this as my R+D doesn't want these CMS directories /app, /wp/wp-includes open to users
8:57 am on July 12, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


What should I do
Your questions were answered in the last discussion you started.

If you block the folder with a 403, robots can't crawl it to see the noindex directive that would remove the files from the index.

But you went ahead and 403'd the folder anyway. And now you started another thread about the same issue.
9:10 am on July 12, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


I didn't want to block via robots.txt as the /app folder contains the site's actual CSS / JS, which I understand it is now best practice to allow crawling of.

I also didn't like the solution offered of disallowing crawling of the folder with some exceptions for the CSS and JS files, as I read elsewhere that this isn't foolproof and Google will often take the first disallow as the stronger directive.
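For reference, the suggested pattern was something along these lines (paths illustrative):

User-agent: Googlebot
Disallow: /app/
Allow: /app/*.css
Allow: /app/*.js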

So, I set up an X-Robots-Tag - then our dev ops person (not me) put a 403 on the /app folder, creating a new issue for me and thus a new thread.

If you follow the chain of events: 1) my questions in the 1st thread weren't really answered, 2) this is now a separate issue because of the 403

Thanks anyway
9:22 am on July 12, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


Google will often take the first disallow as the stronger directive
You are misinformed. That is not accurate.

You can test your robots.txt in GSC.
12:33 pm on July 12, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4558
votes: 363


I read elsewhere that this isn't foolproof and Google will often take the first disallow as the stronger directive

The instructions to be sure that the "Allow" follows the "Disallow" are from Google. I use them myself and I know they work as expected. As mentioned, you can test and verify in your GSC account. The Header set X-Robots-Tag "noindex" directive does not prevent crawling of anything, but a disallow does. If you disallow folders, or respond with a 403 error, Google can't see the noindex header. Until they know that files in that folder should not be indexed, the files will remain indexed.
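For example, in an Apache .htaccess or vhost, something like this sketch (the FilesMatch pattern is only an illustration - match it to your own files):

<FilesMatch "\.(css|js)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>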

A 403 response will eventually remove the files from the index, but noindex headers tell Google not to index the files. If you have files that you do not want indexed, the best way to manage them is to move them to a folder, password protect that folder, and Disallow it. Then you can use Google's tools to remove URLs from the index, as the indexed URLs will return a 404 error, and Google's requests will read the X-Robots headers that tell them to remove the URLs from the index. If you absolutely need to keep those files where they are, then the choice is up to you.
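A minimal sketch of the password-protect step, assuming Apache (the paths are placeholders for your own):

<Directory "/path/to/private-folder">
  AuthType Basic
  AuthName "Restricted"
  AuthUserFile /path/to/.htpasswd
  Require valid-user
</Directory>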

In your case, the 403 response might prevent more damage because not all robots care about X-Robots headers and Disallow directives. If Google found and indexed them, they are likely in many other places.

5:10 pm on July 12, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15934
votes: 887


Google also says
Tough on them. You don't have to do everything google says.
9:28 pm on July 12, 2018 (gmt 0)

New User

joined:June 29, 2018
posts:12
votes: 3


Google aren't stupid, that's for sure - there are millions of websites on WordPress that don't have this problem. If Google are indexing it, there is something very, very wrong. Maybe it's time to ask your "DevOps" to explain exactly WTF you're doing in the includes and why Google have chosen to index it. Solve the problem, don't hide it!

wp-includes contains the core functionality of WordPress and basically shouldn't be played about with, so if you've coded there it's time to sack the "DevOps". Sometimes a badly coded functions.php file in a theme can allow malicious code to be uploaded to the includes folder, so maybe it's a hack. Google have given you an "Easter egg" and you're trying to hide it.
9:51 pm on July 12, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


not all robots care about X-Robots headers
Bing has said they do not support the X-Robots header.
11:32 am on July 15, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


For those of you who actually read and responded to my question - thank you
10:00 pm on July 15, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11870
votes: 244


I say this as my R+D doesn't want these CMS directories /app, /wp/wp-includes open to users

what is the purpose of these directories?
blocking all requests for a set of urls is a different problem than removing those urls from the google index.
10:16 pm on July 15, 2018 (gmt 0)

New User

joined:June 29, 2018
posts:12
votes: 3


I say this as my R+D doesn't want these CMS directories /app, /wp/wp-includes open to users


Good luck updating WordPress if you're coding in wp-includes. You do know that there is a plugin folder for coding and hooking into includes, right?

Honestly I think your R+D are full of #*$!

As for the /app folder, it sounds like a Magento install trying to combine with WordPress, probably through FishPig. You do know WooCommerce is better, right?

My advice is to be honest and comprehensive in your questions, and not to start multiple threads on the same topic.

I'm done answering you now, tbh. Two threads and you haven't listened to a damn thing anyone said.
6:14 am on July 16, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


I'm sorry that these 2 posts have caused you such emotional pain. No one demanded that you reply to me.
Good luck
6:31 am on July 16, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10569
votes: 1123


There's no magic answer, other than the ones expressed. 403 as a control is not the usual answer to anything. Real question is ... what is there to hide?
7:24 am on July 16, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11870
votes: 244


No one demanded that you reply to me

same here, but if you wanted my opinion on your OP you would reply to my reply:
what is the purpose of these directories?
8:06 am on July 16, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


/wp/wp-includes - this is the core WP directory. I have since blocked crawling of this folder via /robots.txt and Google has since de-indexed 90% of the URLs from this folder.

/app - this contains the site's CSS, images, plug-ins, etc., e.g.:
/app/mu-plugins/amazon-web-services/vendor/aws/Monolog/
/app/plugins/anspress-question-answer/templates/js-template/
Google has indexed hundreds of these URLs.
10:33 am on July 16, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11870
votes: 244


which of these directories contains resources necessary for properly rendering your content?
you probably want googlebot crawling such resources but not indexing them.

addressing the discussed solutions:

- the X-Robots-Tag is the best solution for googlebot (and any other search engines that support it) as it allows crawling but not indexing of resources necessary for rendering content.

- the 403 response code removes the url from the index eventually, but might make properly rendering content impossible.
what happens when non-googlebot user agents request these resources?
if you are showing googlebot a different response than a non-googlebot request, it might be seen as a form of cloaking.
btw the X-Robots-Tag is irrelevant with a 403 status code.

- disallowing the googlebot crawl with robots.txt will eventually remove these urls from the index, most likely because googlebot discovered them within your documents as embedded resources (eg images) and external resources (eg css/js).
however disallowing the googlebot crawl with robots.txt might also make properly rendering content impossible.
typically if you disallow crawling for a path discovered in an anchor element, the url will remain indexed with the typical "A description for this result is not available because of this site's robots.txt" snippet.

basically the answer to your OP is: your solution is likely correct and your DOg is wrong.
(assuming my assumptions about your applications are correct)
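you can sanity-check what each user agent receives with curl (example.com standing in for your domain):

curl -I -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://www.example.com/app/example.js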
11:12 am on July 16, 2018 (gmt 0)

New User

joined:July 5, 2017
posts: 30
votes: 1


@phranque - that is great, thank you.
Seems like the overall consensus is to serve Google the X-Robots-Tag noindex directive, removing the 403 server response on these /app folders (files that Google may need to render the site correctly). I will ask dev ops if there's another solution that can allow the serving of the X-Robots-Tag directive whilst stopping visitors from accessing files/folders under the /app directory.
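One idea I'll put to them (an Apache sketch, assuming that's our stack - the path is a placeholder): leave the files fetchable so Google can render the site, turn off directory listings, and keep the noindex header:

<Directory "/var/www/site/app">
  Options -Indexes
  Header set X-Robots-Tag "noindex"
</Directory>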