Forum Moderators: Robert Charlton & goodroi

2 Indexing Issues


Rysk100

9:30 am on Mar 10, 2019 (gmt 0)

5+ Year Member



I am facing two indexing problems on a new company / site.

1. Their development / staging site was indexed (before I started working with them).
I removed the URLs via Google Search Console and then set a sitewide noindex via the meta robots tag.
Google is still crawling and indexing the URLs, though. Specifically, the permalink-unfriendly URLs, e.g. https://www2.bestwidgets.com/?p=383, are showing in the index; these 301/canonicalize to https://www2.bestwidgets.com/green-widgets
Why is this happening? Why is Google not respecting the noindex directive? What can I do?

2. I need to remove a number of PDF URLs from the index. We are using nginx; the web administrator set the following X-Robots-Tag directive:

location ~* .(doc|pdf)$ {add_header X-Robots-Tag "noindex, noarchive, nosnippet";}

After two weeks these URLs still haven't been removed. Is there a problem with the directive? How long until they will be removed?

Thanks...



[edited by: not2easy at 5:07 am (utc) on Mar 11, 2019]
[edit reason] unlinked for readability [/edit]

not2easy

5:18 am on Mar 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If this line is doing anything, it is definitely not noindexing any HTML or PHP pages, because it applies only to .doc and .pdf files:
location ~* .(doc|pdf)$ {add_header X-Robots-Tag "noindex, noarchive, nosnippet";} 

I would look at your source code for the pages you want noindexed and see what kind of robots meta information is in the headers of those pages.

You do not mention how these pages are generated, but because permalinks are mentioned I am guessing it is a WordPress site. Ideally those old permalinks would be 301 redirected to the new permalinks. Those questions are covered in the Apache [webmasterworld.com] Forum.

Rysk100

6:58 am on Mar 11, 2019 (gmt 0)

5+ Year Member



Thanks

To answer your question

1. Yes, I specifically want to exclude these .pdf URLs
2. Yes, it's a WP site

phranque

8:08 am on Mar 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would do some or all of the following:
- check the robots.txt file and make sure those urls are not being excluded from crawling
- run a header checker on those pdf requests to see if the X-Robots-Tag header is being supplied properly with the response
- check the server access log file and make sure the /?p=383 url is being requested by googlebot and is getting a 301 response
- run a header checker on the /?p=383 url request to see if the Location header is being supplied properly with the response
- check the server access log file and make sure the /green-widgets url is being requested by googlebot and is getting a 200 response
- use the URL Inspection tool on GSC to see what G reveals about those urls
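The header checks in the list above can be scripted. A minimal Python sketch (stdlib only; the URL is a placeholder, and note that urlopen follows redirects by default, so it reports the final response rather than the raw 301):

```python
from urllib.request import Request, urlopen

def noindex_in_headers(headers):
    """Return True if any X-Robots-Tag header value contains 'noindex'.

    `headers` is an iterable of (name, value) pairs, e.g. as returned
    by HTTPResponse.getheaders() or copied from any header-checker tool.
    """
    return any(
        name.lower() == "x-robots-tag" and "noindex" in value.lower()
        for name, value in headers
    )

def check_url(url):
    """HEAD-request `url` and report (status, noindex-header-present).

    urlopen follows redirects, so a 301 chain ends at the final 200;
    to inspect the intermediate 301/Location header itself, use a
    dedicated header checker or curl -I.
    """
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return resp.status, noindex_in_headers(resp.getheaders())
```

The helper can also be run against headers pasted from Screaming Frog or an online tool, without touching the network.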

not2easy

1:51 pm on Mar 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I understand that you wanted to exclude the .pdf files. The question was about how you are noindexing the pages (the unwanted permalink URLs) and whether you could see that those URLs actually contained a noindex signal in the head section of the pages. If you view your page source and see a robots meta tag that says
 <meta name="robots" content="index, follow">
that would explain why they are still being crawled and indexed. You did not mention setting noindex in the WP Settings, where it would need to be done.

".. and then set a sitewide noindex via meta robots tag. Google is still crawling and indexing the URLs though."

Sending mixed signals would likely be the answer to:

"Why is this happening? Why is google not respecting the noindex directive?"
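Checking the page source for the robots meta tag, as suggested above, can be done with a quick script. A rough Python sketch (a regex scan that assumes the name attribute comes before content, which WordPress normally emits; a real HTML parser is more robust):

```python
import re

def robots_meta_content(html):
    """Return the content attribute of the first <meta name="robots">
    tag found in `html`, or None if no such tag is present.

    Assumes attribute order name-then-content; use an HTML parser
    (html.parser, lxml) for anything more than a spot check.
    """
    m = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)["\']',
        html,
        re.IGNORECASE,
    )
    return m.group(1) if m else None
```

Fetch the staging page, run its source through this, and you can see immediately whether the page is sending "noindex" or a conflicting "index, follow".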

tangor

12:40 am on Mar 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Why is this happening? Why is google not respecting the noindex directive? What can I do?"

g (or any search engine) never forgets a url it has encountered. It will continue to test it for YEARS afterwards. Just the nature of the beast. You do NOT have the ability to ERASE those undesired urls from their index. Just won't happen.

So, mitigate as much as possible then ignore it from here on out and build what you WANT them to index. Just make sure you continue to report the old undesired as 410 ... FOREVER.
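If you go the 410 route suggested above, it can be done directly in nginx. A sketch (the hostname and paths are placeholders, not taken from the thread; adjust to whatever the retired staging URLs actually are):

```nginx
# Return 410 Gone for the whole retired staging host.
server {
    server_name staging.example.com;
    return 410;
}

# Or, inside an existing server block, 410 only specific old paths:
location /old-staging-section/ {
    return 410;
}
```

A 410 tells crawlers the resource is gone deliberately, which some SEOs prefer over 404 for faster de-indexing; either way the response must keep being served, since as noted the URLs will be re-tested for years.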

Rysk100

10:26 am on Mar 17, 2019 (gmt 0)

5+ Year Member



@phranque

- check the robots.txt file and make sure those urls are not being excluded from crawling
> no they aren't

- run a header checker on those pdf requests to see if the X-Robots-Tag header is being supplied properly with the response
> It seems the directive for the X-Robots-Tag wasn't correct; the header is not showing in either Screaming Frog or online HTTP header tools

- run a header checker on the /?p=383 url request to see if the Location header is being supplied properly with the response
> Yes, it returns a 301 followed by a 200 (to /green-widgets)

- use the URL Inspection tool on GSC to see what G reveals about those urls
> For the /?p=383 URL it shows a 200 response, crawled and indexed
> For the /green-widgets URL it shows not indexed due to noindex tag (which is correct)

[edited by: Rysk100 at 10:41 am (utc) on Mar 17, 2019]

Rysk100

10:33 am on Mar 17, 2019 (gmt 0)

5+ Year Member



@not2easy

- The staging site has a sitewide meta robots noindex directive set (via settings > appearances). This has not been overridden by any page-specific directives to index these pages. As stated above, the permalink URLs (/?p=308 etc.) 301 correctly to the /green-widgets URLs, which have a noindex tag.

- The PDF URLs are a separate problem, on the production site.

abdrahimben19

5:35 pm on Mar 17, 2019 (gmt 0)

5+ Year Member



location ~* \.(txt|log|pdf|doc|js)$ {
    add_header X-Robots-Tag noindex;
}
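Two nginx details beyond the escaped dot are worth checking here (assumptions about the setup, not something visible in the thread). First, add_header only fires on 2xx/3xx responses unless the always parameter is given. Second, an add_header inside a location block suppresses any add_header directives inherited from the server level, so headers set higher up must be repeated. A fuller sketch:

```nginx
location ~* \.(pdf|doc)$ {
    # `always` attaches the header to every response code,
    # not just 2xx and 3xx.
    add_header X-Robots-Tag "noindex, noarchive, nosnippet" always;
    # Note: any add_header lines defined at the server level are NOT
    # inherited into this block once add_header appears here; repeat
    # them if they are still needed for these files.
}
```

After editing, the configuration has to be reloaded (e.g. nginx -s reload) before the header shows up, and a header checker on a live .pdf URL will confirm it.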