Forum Moderators: Robert Charlton & goodroi

page is noindexed, but still shows in SERP with a Google notice

     
5:34 pm on Jun 27, 2013 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 15, 2011
posts: 66
votes: 0


I have a page which I noindexed many months ago (in meta and robots.txt), and it shows for a site operator + keyword search.

the description says:

A description for this result is not available because of this site's robots.txt learn more.

Clicking on learn more takes me here:

[support.google.com...]

Anyone see this before?
8:12 pm on June 27, 2013 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3439
votes: 321


If you disallow it in robots.txt, Google can't crawl the page to see the noindex meta tag.
8:19 pm on June 27, 2013 (gmt 0)

Preferred Member

10+ Year Member Top Contributors Of The Month

joined:June 19, 2005
posts: 362
votes: 12


If it can't crawl it, why would Google want it in the index?
9:07 pm on June 27, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15446
votes: 739


Because someone, somewhere, has linked to it.
9:17 pm on June 27, 2013 (gmt 0)

New User

5+ Year Member

joined:June 10, 2013
posts:3
votes: 0


GoodROI and lucy both got it right.

A robots.txt file doesn't prevent a page from showing up in the SERP; it only prevents it from being crawled. If that page is linked to from enough outside sources, the URL will still show in a SERP, but without any additional information (meta description etc.).

A meta no-index tag is on the specific page and prevents that page from showing in the SERP altogether... the only catch is the page has to be crawled for the crawler to find the meta no-index tag.

So the moral of the story is: take that page out of your robots.txt.
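A minimal sketch of the point above, using Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration. If a URL is disallowed, a well-behaved crawler never requests it, so any meta noindex on that page goes unread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-page.html
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/private-page.html"
    if not can_crawl(ROBOTS_TXT, "Googlebot", url):
        # The crawler stops here, so it never sees a noindex on the page.
        print("Blocked by robots.txt: the noindex on this page cannot be read.")
```

To get the noindex honored, the Disallow line for that page would have to go, so that the crawler can fetch the page and read the tag.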
10:34 pm on June 27, 2013 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4161
votes: 262


And make sure it is not in your sitemap if you have one.
10:51 pm on June 27, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15446
votes: 739


If a page is in the sitemap, and it isn't roboted-out, will this override a "noindex" on the page itself? g### does occasionally hint that they will disregard a site owner's expressed wishes if they feel like it. (Where "if they feel like it" is shorthand for a long and complicated explanation that I can't lay my hands on at the moment.)
12:04 am on June 28, 2013 (gmt 0)

Junior Member

5+ Year Member

joined:Aug 14, 2012
posts:79
votes: 0


I have seen pages that are linked internally without a rel=nofollow attribute, with a noindex header, and blocked in robots.txt - and they still show up in the SERPs with the description showing:

"A description for this result is not available because of this site's robots.txt learn more."
12:17 am on June 28, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15446
votes: 739


with a noindex header, and blocked in robots.txt

If the page is roboted-out, the search engine cannot see the "noindex" header.
2:08 am on June 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


@lucy24, If a page linked to from other sites is password protected but not roboted out, what will be the status on Google SERPS?
2:39 am on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


If a page linked to from other sites is password protected


"password protected" as in a "401 status code" response or a "redirect to login page"?
2:45 am on June 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


I meant "password protected" as in a "401 status code" response but would be interested in knowing the answers for both...
3:37 am on June 28, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15446
votes: 739


Closely related question:
If a page comes with
X-Robots-Tag "noindex"
(as some of my non-html pages do)
will this directive be honored in html pages that don't have a meta robots tag?

:: detour here to make sure the header has been working as intended with my non-page files ::

Here, again, the search engine will only see the header if it is allowed to receive the page. But it's an alternative way of conveying the same information. Useful if for example you don't want the index to hint at the existence of anything within a particular directory, even if you happened to forget the meta on one page.
7:37 am on June 28, 2013 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12225
votes: 361


Related discussion....

Pages are indexed even after blocking in robots.txt
http://www.webmasterworld.com/google/4490125.htm [webmasterworld.com]
9:05 am on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


I meant "password protected" as in a "401 status code" response but would be interested in knowing the answers for both...


as far as i know google does not index any 4xx status code responses.

these are reported in GWT as "URL Errors".
you can see these, grouped with any 401, 403 and 407 responses, by going to "Health"/"Crawl Errors"/"Access denied".

as far as redirecting to a login page, that depends on what status code is used for the redirect.
9:08 am on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


If a page comes with
X-Robots-Tag "noindex"
(as some of my non-html pages do)
will this directive be honored in html pages that don't have a meta robots tag?


X-Robots-Tag is intended for resources that are non-html documents and therefore cannot provide a meta robots noindex element, but the X-Robots-Tag HTTP Response header works equally well for any Content-Type.
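As a rough illustration of the two equivalent signals discussed here, the sketch below checks a response for noindex in either the X-Robots-Tag header or a meta robots element. The function and class names are hypothetical, and the parsing is simplified: it ignores user-agent-prefixed header values such as "googlebot: noindex".

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def is_noindexed(headers: dict, body: str = "") -> bool:
    """Return True if the header or the HTML body carries a noindex directive."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            tokens = {t.strip().lower() for t in value.split(",")}
            if "noindex" in tokens:
                return True
    parser = MetaRobotsParser()
    parser.feed(body)
    return "noindex" in parser.directives
```

The header check works for any Content-Type, which is why it suits PDFs, images and other non-html files; the meta check only applies when there is an HTML body to parse.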
10:04 am on June 28, 2013 (gmt 0)

Full Member

5+ Year Member

joined:Feb 25, 2011
posts: 257
votes: 0


OK, so let's say you wanted to remove these URLs that are in a subfolder. In GWT, would you use the following syntax?

www.domain.com/*/folderofurlsyouwanttoremove/

Note that is a subfolder.
10:50 am on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


OK, so let's say you wanted to remove these URLs that are in a subfolder. In GWT, would you use the following syntax?

why are you removing these urls?
are the urls in the index?
are they meta robots (or X-Robots-Tag) noindexed?
is the directory excluded from crawling?
are these urls getting 404/410 responses?

When NOT to use the URL removal tool - Webmaster Tools Help:
http://support.google.com/webmasters/answer/1269119 [support.google.com]


www.domain.com/*/folderofurlsyouwanttoremove/

it appears you can't use wildcarding when specifying the removal url.

Find the URL of a page - Webmaster Tools Help:
http://support.google.com/webmasters/answer/63758 [support.google.com]
11:05 am on June 28, 2013 (gmt 0)

Full Member

5+ Year Member

joined:Feb 25, 2011
posts: 257
votes: 0


Thanks phranque, the pages are being redirected now. Panda smack with trackback URLs on a WordPress site - dupe content.
11:27 am on June 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


as far as i know google does not index any 4xx status code responses.


Thanks phranque. Yes it shouldn't and that should be the right behavior.

But will it appear in the form of link-only stubs in SERPS with a description that is similar to this one?

"A description for this result is not available because of this site's robots.txt learn more."

If it doesn't even show up in the SERPS, why does Google choose to show link-only stubs for robots.txt excluded pages? The argument that those links are found on other sites should hold good for password protected pages as well, shouldn't it?
11:47 am on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


"A description for this result is not available because of this site's robots.txt learn more. "


if you have excluded googlebot from crawling a url it will never see the 4xx response and therefore doesn't know that the content is password-protected.
11:49 am on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


the pages are being redirected now.


if the pages are being redirected then the urls are not suitable for a removal request.
11:59 am on June 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


No, sorry for the confusion if any. They aren't excluded in robots.txt, only password protected.

My question - is there any differential treatment for password protected pages vs robots.txt excluded pages in Google SERPS? We know that robots.txt excluded pages show up as link only stubs in SERPS with a description posted by the OP. But what about password protected pages? If they don't show up in SERPS at all, why this differential treatment?
12:15 pm on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


the difference is that a 401 is unambiguous.
1:25 pm on June 28, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


My question - is there any differential treatment for password protected pages vs robots.txt excluded pages in Google SERPS? We know that robots.txt excluded pages show up as link only stubs in SERPS with a description posted by the OP. But what about password protected pages? If they don't show up in SERPS at all, why this differential treatment?


Yes, there is a different treatment.

As others said above - if the page is excluded from crawling via robots.txt, Google is only told it is not allowed to crawl the page and therefore will not be (should not be) requesting it. Hence it cannot see any other directive such as:

- HTTP response code (including 301, 401, 403, 404, etc.)
- on-page robots meta such as noindex, noodp, etc.

Hence pages that are roboted out may show in SERPs as Google was only told not to crawl them and does not know about any other directive or response code that might result in different page handling.

Think of it like this:

If I forbid you to ring the doorbell, you cannot tell whether I am at home or not, in fact you cannot even see if it was me living at this address nor even whether the door exists. All that you know is that someone talked about my door.
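The reasoning in this thread can be written down as a toy decision function. This is only a model of what posters here describe, not Google's documented or actual logic; the function name and the outcome labels are made up for the sketch.

```python
def likely_serp_outcome(blocked_by_robots_txt: bool,
                        status_code: int,
                        has_noindex: bool,
                        linked_from_elsewhere: bool) -> str:
    """Toy model of the thread's reasoning, not Google's real behavior."""
    if blocked_by_robots_txt:
        # The crawler never requests the URL, so status_code and has_noindex
        # are invisible to it. Only links from elsewhere hint the URL exists.
        return "url-only stub" if linked_from_elsewhere else "not listed"
    if 400 <= status_code < 500:
        # The crawler made the request and was told the resource is unavailable.
        return "not listed"
    if has_noindex:
        # The crawler fetched the page and could read the noindex directive.
        return "not listed"
    return "normal listing"


if __name__ == "__main__":
    # The OP's case: disallowed in robots.txt and noindexed, but linked to.
    print(likely_serp_outcome(True, 200, True, True))
```

Note that in the robots.txt branch the status code and the noindex flag are never consulted, which is the doorbell analogy in code form.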
3:25 pm on June 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


Hence pages that are roboted out may show in SERPs as Google was only told not to crawl them and does not know about any other directive or response code that might result in different page handling.


But doesn't the same hold true for password protected pages? By password protection we are telling them, they are not allowed to crawl. Google is forced to obey because it cannot get past the password. But Google does get a hint that a page exists at that URL, as someone has linked to the password protected page. Why aren't they showing the password protected URLS in the SERPS with a boilerplate description, like they do for roboted out URLs?
6:46 pm on June 28, 2013 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11568
votes: 182


Why aren't they showing the password protected URLS in the SERPS with a boilerplate description, like they do for roboted out URLs?


a 4xx response means for all practical purposes the requested resource doesn't exist or is not available.
excluding a robot from crawling says nothing about the status of the resource for a live visitor - it's only an instruction for the robot.
6:49 pm on June 28, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


But doesn't the same hold true for password protected pages? By password protection we are telling them, they are not allowed to crawl.


I can see what you are thinking but I believe it is not the same. If a page is password protected, then visitors cannot see the page either unless they know the password. Visitors that know the password may be only a selected few - and if a visitor knows the password, they would probably know the URL too. Hence it is probably not good for Google to have such a page in its index, because if a click from SERPs requires a password it is most likely a bad experience for visitors coming from SERPs.

But restricting access via robots.txt is for bots only - they are not allowed to go there, but visitors see the page.

Unless you are cloaking, of course, and have a page password protected for bots only - then visitors would see it without password, but Google would not know this, so why should it show in its index.


<added>Which is pretty much what phranque said in the post above.</added>
7:46 pm on June 28, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15446
votes: 739


www.domain.com/*/folderofurlsyouwanttoremove/

Note that is a subfolder.

The wild-card formulation only makes sense if you have a bunch of different directories all containing a subdirectory with the same name-- obvious example, a group of directory-specific /images/ subdirectories. Is that what you're aiming at?

Hence it is probably not good for Google to have such a page in its index, because if a click from SERPs requires a password it is most likely a bad experience for visitors coming from SERPs.

That doesn't seem to stop sites from doing it. At the access level it's done with a "Satisfy any" directive: visitor has to either know the password, or be the googlebot. It's absolutely infuriating to the human visitor, but the sites don't seem to care. "The full text of this article-- including the content you searched for-- is only available to logged-in members."
8:32 pm on June 28, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


@lucy
I wonder if we have our wires crossed. The question I was answering was why pages returning HTTP 401 to Googlebot are not in the index (as opposed to pages excluded by robots.txt, which may be).

I am not entirely sure what you are referring to - are you saying pages responding with 401 are in the index? Or perhaps are you referring to "first click free" situations?
This 67 message thread spans 3 pages.
 
