|Word file (noindexed) showing first with site: command|
The site: command appears to have been increasingly broken for a few years now but I've never seen this before.
The first result for the query is a Word file (blocked in robots.txt).
The site has never had any SEO/SEM work done on it and it isn't a strong one, with very few links indeed, but still.
When the domain root isn't the first result for a site: operator search, in my experience it is a sign that something isn't right with the site - especially if the situation continues for an extended period. It's worth digging to discover what that problem might be - technical canonical errors can do it, and at some period in Google's history, even certain penalties.
URLs of many kinds can still be in the index, even if crawling is disallowed in robots.txt.
|Be careful about disallowing search engines from crawling your pages. Using the robots.txt protocol on your site can stop Google from crawling your pages, but it may not always prevent them from being indexed. For example, Google may index your page if we discover it by following a link from someone else's site. To display it in search results, Google will need to display a title of some kind and because we won't have access to any of your page content, we will rely on off-page content such as anchor text from other sites. (To truly block a URL from being indexed, you can use meta tags.) |
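To illustrate the distinction in that quote, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the URLs are hypothetical (the real file from this thread is not shown). It only answers "may this URL be crawled?", which is all robots.txt controls; a `False` result says nothing about whether the URL can still appear in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /docs/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The Word file is blocked from crawling, the home page is not.
doc_allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/docs/guide.doc")
home_allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/")
print(doc_allowed, home_allowed)
```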
To keep a non-HTML file (such as a Word document) out of the index altogether, the best bet is the X-Robots-Tag HTTP header [code.google.com]
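As a rough way to verify such a header once it is in place, the sketch below checks whether a list of X-Robots-Tag values asks a given crawler not to index the file. The function name and the sample values are assumptions for illustration, and the parsing is deliberately simplified rather than a full implementation of Google's rules. Note that the crawler has to be able to fetch the file to see the header at all, so a robots.txt disallow on the same URL would hide it.

```python
# Directives that carry their own "name: value" form, so a colon after them
# does not indicate a user-agent prefix.
DIRECTIVES_WITH_VALUES = {
    "unavailable_after",
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
}


def header_blocks_indexing(header_values, user_agent="googlebot"):
    """Return True if any X-Robots-Tag value tells user_agent not to index."""
    for raw in header_values:
        value = raw.strip().lower()
        prefix, sep, rest = value.partition(":")
        prefix = prefix.strip()
        # Simplifying assumption: a single token before the first colon that
        # is not a known directive name is treated as a user-agent prefix.
        if sep and prefix and "," not in prefix and prefix not in DIRECTIVES_WITH_VALUES:
            if prefix != user_agent.lower():
                continue  # the rule targets a different crawler
            value = rest
        directives = {part.strip() for part in value.split(",")}
        # "none" is shorthand for "noindex, nofollow".
        if "noindex" in directives or "none" in directives:
            return True
    return False


print(header_blocks_indexing(["noindex, nofollow"]))
print(header_blocks_indexing(["bingbot: noindex"]))
```

In practice the list of values would come from the response headers of the Word file's URL, fetched with whatever HTTP client you prefer.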
Thanks for the X-Robots tag link, I'll check it out.
I've just redeveloped the site with my standard CMS / redirects etc, on a server I'm familiar with, so unless I've made a glaring mistake (and I'm gonna check everything again now just to be sure) I don't think it means there's something wrong with the site either.
Google is still replacing old urls with my new ones so it could be something to do with that - the old site did have duplicate content issues but <50 pages so nothing severe.
What should we look at if the root of the site isn't the first result in a site: search?
What could be the possible cause? Top sites like Amazon, OLX, even Google don't return the root as the first page in their site: search... does that mean their SEO is twisted?
Check both of these searches:
One should return zero results.
I've run into legitimate instances where the root did not show first. Usually it is because I have nofollowed an item, and I get some rather sparse results.
In terms of documents, I find the best way to keep them out of results is to put them all in one directory, disallow that directory in robots.txt, and add rel=nofollow to the links pointing to them.
Overall though, I wouldn't worry that it shows up in a site: search.
|Top sites like Amazon, OLX, even Google don't return the root as the first page in their site: search... does that mean their SEO is twisted? |
In those cases, I'd say it means they don't use their domain root to generate a strong page - and the much greater portion of backlinks point to internal pages. Clearly those sites are not having a major problem.
If your site falls into the same description as those sites - multiple millions of strong pages, to the degree that key pages are internal and not the domain root - then it would be fine. But for a site such as you described, it sounds like there may well be an issue.
Since you just redeveloped the site, and you said in the opening post that it had few backlinks - did you take care with those URLs that do have backlinks? Did that Word document attract the bulk of existing backlinks for some reason?
No, the Word doc was only added as part of the new site. Linked to from one page in the site. No external links.
site:example.com -inurl:www returns zero results.
None of the redirects from old pages to new are chained.
Nothing stupid in robots.txt
404 handler returns a 404 at the requested url, rather than 301ing to a 404 page.
Home page is second result.
Other uploaded files (three, all PDFs, linked to from more than one page) only show when you request omitted results after doing a site: search.
The Word doc was provided by a practitioner who has his own well-established site (firstnamelastname.com)
The page that links to the Word doc mentions that name, but does not link to him (and neither does any other page in the site; not even the URL in text).
He has a page with similar (but not identical, not even in any part) content to the Word doc on his site; if you Google the name of the Word doc then that page is 6th, and the exact title of the Word doc is in the meta desc.
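For anyone wanting to repeat the "no chained redirects" check from the list above, here is a minimal sketch that works on a plain old-URL to new-URL mapping. The mapping and paths are hypothetical; in practice they would come from your own redirect rules or a crawl of the old URLs.

```python
# Hypothetical old -> new redirect map, for illustration only.
HYPOTHETICAL_REDIRECTS = {
    "/old-a": "/new-a",
    "/old-b": "/old-c",  # points at another redirect, so this one is chained
    "/old-c": "/new-c",
}


def redirect_chain(start, redirects, max_hops=10):
    """Follow start through the redirects mapping and return every URL visited."""
    chain = [start]
    seen = {start}
    current = start
    while current in redirects and len(chain) <= max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            break  # redirect loop: stop rather than spin forever
        seen.add(current)
    return chain


def is_chained(start, redirects):
    """Return True if reaching the final URL takes more than one redirect hop."""
    return len(redirect_chain(start, redirects)) > 2


for old_url in HYPOTHETICAL_REDIRECTS:
    print(old_url, redirect_chain(old_url, HYPOTHETICAL_REDIRECTS))
```

A single hop gives a chain of two URLs; anything longer means an old URL passes through an intermediate redirect before landing on its new home.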
Google seems to favor non php/html filetypes in some situations. I know of several sites where a KML version of the page always outranks the htm version. I know of many pages where the pdf version trumps the non pdf version and keywords where you will find many top ranked pdf files above html/php results.
Something may be wrong with the sites this happens to; however, there definitely seems to be a rankings benefit to having a different file extension on pages right now. That being said, Google ignoring your robots.txt suggests you've got some things to work out.
I just wanted to add that you have to consider time too, age of page plays a role. It might be that a pdf file will outrank an html version at first but give Google enough time and they'll usually right the ship so to speak.
|Home page is second result. |
With a Word document first, I'd expect it to be a temporary glitch that lasts a few weeks at the most.
If you're still in this position in a few months' time, there really is a problem.
"Not first" with the home page buried tens or hundreds of entries down the list is a whole other ball game to "not first" with just one oddball listing above it.
Thanks for all input and replies. I think it is a glitch; I'll keep an eye on it.