That one page is not properly indexed by Google: the title and description are omitted, but the link is still present. It's made worse by the fact that none of the 'real' pages appear in the SERPs so far.
Why does that 1 page appear? What other measures should/could have been taken in order to prevent a page (or even its link) from appearing at all in the SERPs?
Webmaster Tools has a robots.txt checking utility - it can be quite helpful.
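For anyone who wants to run the same kind of check locally, here is a rough sketch using Python's standard urllib.robotparser. The robots.txt rules and the /privacy.html path are assumptions for illustration only; the actual file was never posted in this thread. It only tells you whether a compliant crawler may fetch a URL, not whether the URL can appear in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the real file was not posted in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /privacy.html
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/",
        "http://www.example.com/privacy.html",
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

With these assumed rules the home page is reported as crawlable and the privacy page as blocked.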
You mean the home page isn't showing?
Well, it wasn't, but it is now! A few hours later and the home page is now at the top of the list (doing a regular search for the company name), and the privacy page has dropped off the SERPs! Good.
However, if I do a site search (i.e. site:www.example.com) and include omitted results, then the privacy page (or at least a link to it) is still there. No matter really in this case, but why is it there at all?
Webmaster Tools has a robots.txt checking utility...
Yes, Webmaster Tools reports that the privacy page/URL is restricted by robots.txt. However, that is not preventing the URL from appearing in Google's SERPs. The title and description are not shown, and the page does not appear to have been spidered by Google (no record in my stats).
why is it there at all?
Because it's excluded in robots.txt and there's a link to it somewhere on the web.
Disallow in robots.txt is NOT the same as noindex. Disallowing keeps them from spidering the page (in theory; there have been reports that they occasionally do anyway, but that could simply be a glitch or an incorrect robots.txt). But since there is a link to it, they still consider it a page, so they show a URL-only listing. It's not an error. It's something they clearly state they do. If you do not want the page to show in the index (results), then either use the URL removal tool, or allow them to crawl the page and put a noindex tag on it, and it will disappear.
(Note, however, that while Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.)
Removing Your Own Content from Google [google.com]
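To make the noindex side concrete: the tag only helps if the crawler is allowed to fetch the page and read it. Below is a minimal sketch, using Python's standard html.parser, of checking whether a page carries a robots noindex meta tag. The sample HTML and the helper names are made up for illustration; this is not how Google processes the tag, just a way to verify your own markup.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page markup, for illustration only.
SAMPLE = (
    '<html><head><meta name="robots" content="noindex, follow">'
    "<title>Privacy</title></head><body></body></html>"
)

if __name__ == "__main__":
    print(has_noindex(SAMPLE))
```

If the same URL is also disallowed in robots.txt, the crawler never fetches the page and so never sees the tag, which is why the advice above is to allow crawling when you go the noindex route.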
Instead of fretting about this for a month, work on content and back-links. Once Google finds a few links, things will sort themselves out.
Because it's excluded in robots.txt and there's a link to it somewhere on the web. ... Disallow in robots.txt is NOT the same as noindex. ...
No links elsewhere on the web (I'm pretty sure) - just quite a few internal ones. But yes, thanks for the explanation. It was that 1 URL getting listed before all the others that threw me initially.
Many thanks for the clarity - no fret. ;)
No links elsewhere on the web (I'm pretty sure) - just quite a few internal ones.
Just to overly clarify:
By somewhere on the web, I meant somewhere on the web, which includes your website...
It was that 1 URL getting listed before all the others that threw me initially.
My guess is it's because there's really not much to process, score, rank, organize, attempt to determine the meaning of, etc. since it's only a URL they're listing. No HTML to parse. No language to attempt to decipher. No 'similar pages' to compare to. IOW: It's just a URL, which means it takes less time to process than a page.