

2 Indexing Issues

     
9:30 am on Mar 10, 2019 (gmt 0)

New User

joined:July 5, 2017
posts: 28
votes: 1


I am facing two indexing problems with a new company/site:

1. Their development/staging site was indexed (before I started working with them).
I removed the URLs via Google Search Console and then set a sitewide noindex via the meta robots tag.
Google is still crawling and indexing the URLs, though. Specifically, the permalink-unfriendly URLs, e.g. https://www2.bestwidgets.com/?p=383, are showing in the index; these 301/canonicalize to https://www2.bestwidgets.com/green-widgets
Why is this happening? Why is Google not respecting the noindex directive? What can I do?

2. I need to remove a number of PDF URLs from the index. We are using nginx; the web administrator set the following X-Robots-Tag directive:

location ~* .(doc|pdf)$ {add_header X-Robots-Tag "noindex, noarchive, nosnippet";}

After two weeks these URLs still haven't been removed. Is there a problem with the directive? How long until they are removed?
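For reference, one quick way to check whether the header is actually being served is to look at the raw response headers, e.g. with curl. A minimal sketch (assuming a hypothetical PDF at /brochure.pdf; substitute a real URL):

# request only the response headers for a pdf and look for X-Robots-Tag
curl -sI "https://www2.bestwidgets.com/brochure.pdf" | grep -i "x-robots-tag"

# expected output if the directive works:
#   X-Robots-Tag: noindex, noarchive, nosnippet
# no output means the header is not being sent for that url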

Thanks...




5:18 am on Mar 11, 2019 (gmt 0)

not2easy (Administrator from US)

joined:Dec 27, 2006
posts:4198
votes: 264


If this line is doing anything, it is definitely not noindexing any HTML or PHP pages, because it applies specifically to .doc and .pdf files:
location ~* .(doc|pdf)$ {add_header X-Robots-Tag "noindex, noarchive, nosnippet";} 

I would look at the source code of the pages you want noindexed and see what kind of robots meta information is in the head section of those pages.
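For example, the page source can be checked from the command line as well (a sketch, using the /green-widgets URL from the opening post):

# fetch the page html and look for any robots meta tag in the source
curl -s "https://www2.bestwidgets.com/green-widgets" | grep -i 'name="robots"'

# a tag like <meta name="robots" content="index, follow"> here would
# conflict with the intended sitewide noindex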

You do not mention how these pages are generated, but because permalinks are mentioned I am guessing it is a WordPress site. Ideally those old permalinks would be 301 redirected to the new permalinks. Those questions are covered in the Apache [webmasterworld.com] Forum.
6:58 am on Mar 11, 2019 (gmt 0)

New User

joined:July 5, 2017
posts: 28
votes: 1


Thanks.

To answer your questions:

1. Yes, I specifically want to exclude these .pdf URLs.
2. Yes, it's a WP site.
8:08 am on Mar 11, 2019 (gmt 0)

phranque (Administrator)

joined:Aug 10, 2004
posts:11608
votes: 193


i would do some or all of the following (a shell sketch of several of these checks follows the list):
- check the robots.txt file and make sure those urls are not being excluded from crawling
- run a header checker on those pdf requests to see if the X-Robots-Tag header is being supplied properly with the response
- check the server access log file and make sure the /?p=383 url is being requested by googlebot and is getting a 301 response
- run a header checker on the /?p=383 url request to see if the Location header is being supplied properly with the response
- check the server access log file and make sure the /green-widgets url is being requested by googlebot and is getting a 200 response
- use the URL Inspection tool on GSC to see what G reveals about those urls
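several of these checks can be run from a shell, roughly like this (a sketch -- the access log path is a common default, adjust for your setup; the pdf header check is sketched earlier in the thread):

# fetch robots.txt to confirm the urls are not excluded from crawling
curl -s "https://www2.bestwidgets.com/robots.txt"

# follow the redirect chain for /?p=383: expect a 301 with a Location
# header pointing at /green-widgets, then a 200
curl -sIL "https://www2.bestwidgets.com/?p=383" | grep -iE "^(http|location)"

# see whether googlebot is requesting these urls and what status it gets
grep -i "googlebot" /var/log/nginx/access.log | grep -E "p=383|green-widgets"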
1:51 pm on Mar 11, 2019 (gmt 0)

not2easy (Administrator from US)

joined:Dec 27, 2006
posts:4198
votes: 264


I understand that you wanted to exclude the .pdf files; the question was about how you are noindexing the pages (the unwanted permalink URLs) and whether those URLs actually contain a noindex signal in the head section of the pages. If you view your page source and see a robots meta tag that says
<meta name="robots" content="index, follow">
that would explain why they are still being crawled and indexed. You did not mention setting noindex in the WP Settings, where it would need to be done.
You wrote: ".. and then set a sitewide noindex via meta robots tag. Google is still crawling and indexing the URLs though."
Sending mixed signals would likely be the answer to "Why is this happening? Why is google not respecting the noindex directive?"
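If WP-CLI is available on the server, the WordPress-level setting can be confirmed directly (a sketch; WP-CLI is an assumption, it is not mentioned in this thread):

# WordPress stores Settings > Reading > "Discourage search engines from
# indexing this site" in the blog_public option: 1 = visible, 0 = discouraged
wp option get blog_public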
12:40 am on Mar 12, 2019 (gmt 0)

tangor (Senior Member from US)

joined:Nov 29, 2005
posts:9233
votes: 780


"Why is this happening? Why is google not respecting the noindex directive? What can I do?"

g (or any search engine) never forgets a url it has encountered. It will continue to test it for YEARS afterwards. Just the nature of the beast. You do NOT have the ability to ERASE those undesired urls from their index. Just won't happen.

So, mitigate as much as possible then ignore it from here on out and build what you WANT them to index. Just make sure you continue to report the old undesired as 410 ... FOREVER.
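One way to keep reporting them gone: have the server return 410 for the retired staging URLs (in nginx, a return 410; directive inside a matching location block), then spot-check the response. A minimal sketch, with a hypothetical retired URL:

# print just the http status for a retired staging url -- expect 410
# (/old-staging-page is a hypothetical path; substitute a real removed url)
curl -s -o /dev/null -w "%{http_code}\n" "https://www2.bestwidgets.com/old-staging-page"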
10:26 am on Mar 17, 2019 (gmt 0)

New User

joined:July 5, 2017
posts: 28
votes: 1


@phranque

- check the robots.txt file and make sure those urls are not being excluded from crawling
> no they aren't

- run a header checker on those pdf requests to see if the X-Robots-Tag header is being supplied properly with the response
> Seems the directive for the X-Robots-Tag wasn't correct: the header is not showing in either Screaming Frog or online HTTP header tools

- run a header checker on the /?p=383 url request to see if the Location header is being supplied properly with the response
> Yes, it returns a 301 followed by a 200 (to /green-widgets)

- use the URL Inspection tool on GSC to see what G reveals about those urls
> For the /?p=383 URL it shows a 200 response, crawled and indexed
> For the /green-widgets URL it shows not indexed due to the noindex tag (which is correct)


10:33 am on Mar 17, 2019 (gmt 0)

New User

joined:July 5, 2017
posts: 28
votes: 1


@not2easy

- The staging site has a sitewide meta robots noindex directive set (via settings > appearances). This has not been overridden by any page-specific directives to index these pages. As stated above, the permalink URLs (/?p=383 etc.) 301 correctly to the /green-widgets URLs, which have a noindex tag.

- The pdf URLs are a separate problem on the production site.
5:35 pm on Mar 17, 2019 (gmt 0)

New User

joined:Mar 17, 2019
posts:1
votes: 0


location ~* \.(txt|log|pdf|doc|js)$ {
    # the dot is escaped (\.) so it matches a literal "." before the
    # extension, unlike the unescaped "." in the directive quoted above
    add_header X-Robots-Tag noindex;
}
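After a change like this, the configuration would typically be validated and reloaded, then the header re-checked (a sketch; service management and the test URL vary by setup):

# validate the nginx configuration, then reload it if the test passes
sudo nginx -t && sudo nginx -s reload

# re-check the header on a pdf (hypothetical path, as earlier in the thread)
curl -sI "https://www2.bestwidgets.com/brochure.pdf" | grep -i "x-robots-tag"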