page is noindexed, but still shows in SERP with a Google notice




SEOPanda

5:34 pm on Jun 27, 2013 (gmt 0)

I have a page that I noindexed many months ago (in a meta tag and in robots.txt), yet it still shows up for a site: operator + keyword search.

The description says:

A description for this result is not available because of this site's robots.txt – learn more.

Clicking on learn more takes me here:

[support.google.com...]

Anyone see this before?

phranque

6:48 am on Jul 2, 2013 (gmt 0)

there's nothing in robots.txt that names the googlebot?

that was my next question - not "googlebot" specifically, but any substring of googlebot's user agent string.

for example:

User-agent: bot
...

and none of those exclusions in your robots.txt fragment would necessarily match a /.../review/ subdirectory as indicated in your access log sample:

/[REMOVED_BY_ME]/review/[REMOVED_BY_ME].html

it has been mentioned numerous times in this thread that the noindex directive is irrelevant when you have excluded googlebot from crawling that url.

Point being?

it's not useful information for your problem statement.

These pages show up in the SERPs from time to time with the "description blocked by robots.txt" statement.

if the description is blocked, so are all other meta elements.
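to make that concrete, here is a toy model of a compliant crawler (not google's actual code, just a sketch). the names `crawler_view` and `RobotsMetaFinder` are made up for illustration. the point is only that a robots.txt block stops the fetch, so a meta noindex inside the page is never read:

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def crawler_view(html, blocked_by_robots_txt):
    """Toy model: what a compliant crawler learns about one URL."""
    if blocked_by_robots_txt:
        # The page is never fetched, so nothing inside it is ever read.
        return {"fetched": False, "noindex_seen": False}
    finder = RobotsMetaFinder()
    finder.feed(html)
    return {"fetched": True, "noindex_seen": "noindex" in finder.directives}


# Hypothetical page carrying a meta noindex.
PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'

print(crawler_view(PAGE, blocked_by_robots_txt=True))
# {'fetched': False, 'noindex_seen': False}
print(crawler_view(PAGE, blocked_by_robots_txt=False))
# {'fetched': True, 'noindex_seen': True}
```

in this model the noindex only takes effect once the url is crawlable again.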

indyank

7:41 am on Jul 2, 2013 (gmt 0)

I believe this is what happens, though Google and many here might not agree.

Password protected pages - Googlebot cannot access them, so there is nothing to store in their DB for those URLs and they are discarded, i.e. the pages are completely ignored and nothing goes into their DB.

robots.txt excluded pages - They get stored in their DB but they use the robots.txt rules (which are also stored in their DB for every site) to hide the real descriptions and show only the boilerplate description in the SERPs.

Convergence

8:03 am on Jul 2, 2013 (gmt 0)

Cross-check: AND there's nothing in robots.txt that names the googlebot?

that was my next question - not "googlebot" specifically, but any substring of googlebot's user agent string.

for example:
User-agent: bot

No - not needed.

and none of those exclusions in your robots.txt fragment would necessarily match a /.../review/ subdirectory as indicated in your access log sample:

BINGO - I literally sit corrected. Yes. Human error on our part. Wow.

You guys/gals are correct.

When we tested the robots.txt in WMT we tested:

example.com/review/ and not example.com/niche-cat/review/

Now when I tested it correctly with

Disallow: */review/

We get that it is blocked.
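For anyone who wants to check this kind of thing outside WMT, here is a rough sketch of wildcard-style path matching. It assumes our old rule was effectively a plain `Disallow: /review/`; the function name `rule_matches` and the sample paths are made up for illustration, and this is not Google's tester, just an approximation of the documented behavior ('*' matches any run of characters, a trailing '$' anchors the end, everything else is a prefix match):

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, which gives prefix matching.
    return re.match(regex, path) is not None


# Hypothetical paths modelled on the ones in this thread.
nested = "/niche-cat/review/some-page.html"
top_level = "/review/some-page.html"

print(rule_matches("/review/", nested))      # False
print(rule_matches("*/review/", nested))     # True
print(rule_matches("*/review/", top_level))  # True
```

The first line is the mistake we made: a rule starting with /review/ only covers paths that begin with /review/, so the nested directory slipped through.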

Thanks for your patience AND the time to help talk this through.

Our mystery is solved. Always wondered why it was just on this one site, lol.

Thanks again!

phranque

8:29 am on Jul 2, 2013 (gmt 0)

thanks for posting a follow-up, Convergence!

you have to admit that if (the) google made a practice of ignoring robots.txt exclusions it would be difficult to hide that fact and you wouldn't have to look too far to find hundreds of "googlez ignoring robots.txt!" blog posts and forum threads.

i think you're okay but make sure you retest example.com/review/ for exclusion.

phranque

8:30 am on Jul 2, 2013 (gmt 0)

robots.txt excluded pages - They get stored in their DB but they use the robots.txt rules (which are also stored in their DB for every site) to hide the real descriptions and show only the boilerplate description in the SERPs.

the truth is in your server access log.
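a quick way to look, as a sketch only: the snippet below assumes the apache/nginx "combined" log format, and the function name `googlebot_hits` and the sample lines are made up. user agent strings can be spoofed, so treat this as a first pass rather than proof:

```python
import re

# Assumes the "combined" log format; adjust the pattern for your own server.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path_fragment):
    """Return (path, status) for requests whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m and "googlebot" in m["agent"].lower() and path_fragment in m["path"]:
            hits.append((m["path"], m["status"]))
    return hits


# Made-up sample lines, not taken from anyone's real log.
sample = [
    '66.249.66.1 - - [02/Jul/2013:08:00:00 +0000] "GET /niche-cat/review/a.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [02/Jul/2013:08:01:00 +0000] "GET /niche-cat/review/a.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

print(googlebot_hits(sample, "/review/"))
# [('/niche-cat/review/a.html', '200')]
```

if a url you believe is excluded shows up here with a 200, the bot fetched it and the exclusion is not matching.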

[edited by: phranque at 8:31 am (utc) on Jul 2, 2013]

Convergence

8:31 am on Jul 2, 2013 (gmt 0)

Hi phranque,

Just tested each exclusion, TWICE. LOL - NOW it's correct.

Thanks, again!

PS: It was like this for a LONG, LONG time - oh, my...

phranque

8:54 am on Jul 2, 2013 (gmt 0)

It was like this for a LONG, LONG time - oh, my...

that history can blind you to the problem, since you will assume "it has always worked".

Disallow: */review/

note that while googlebot respects wildcarding you can't expect all well-behaved bots to respect this rule since this extension to the robots exclusion protocol was introduced by google.
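a small sketch of what that means for a bot that only does literal prefix matching. `prefix_only_blocked` is a made-up name and the path is hypothetical; real bots vary, so this only models the simplest case:

```python
def prefix_only_blocked(disallow_rules, path):
    """Model a bot that treats every Disallow value as a literal path prefix."""
    # An empty Disallow value means "allow everything", so skip empty rules.
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


path = "/niche-cat/review/some-page.html"  # hypothetical URL path

# The '*' is taken literally, so the wildcard rule blocks nothing here.
print(prefix_only_blocked(["*/review/"], path))           # False
# An explicit prefix works for wildcard-aware and prefix-only bots alike.
print(prefix_only_blocked(["/niche-cat/review/"], path))  # True
```

so if other crawlers matter to you, listing the explicit directory prefixes alongside the wildcard rule is the safer option.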
