| 2:07 pm on Nov 6, 2012 (gmt 0)|
I would double-check the robots.txt and then triple-check it. A single typo in robots.txt can cause huge problems. Also check whether you are using the meta robots tag on the page.
Then check your log files to see how Googlebot has been behaving on your website and for how long. This can help pin down the day the issue was introduced to your website.
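One quick way to sanity-check a robots.txt file offline is Python's built-in `urllib.robotparser`; a minimal sketch (the rules and URLs here are hypothetical, not the OP's actual site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a single stray character can flip a rule.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Check what Googlebot is allowed to fetch under these rules.
print(rp.can_fetch("Googlebot", "https://www.example.com/"))
print(rp.can_fetch("Googlebot", "https://www.example.com/private/page.html"))
```

Point it at your real file (via `set_url()` and `read()`) and test the exact URL Google is complaining about.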
| 6:06 pm on Nov 6, 2012 (gmt 0)|
Just a long shot: it's not a WordPress blog needing an adjustment to the privacy settings?
| 6:21 pm on Nov 6, 2012 (gmt 0)|
Definitely not a typo. I've gone into GWT, fetched as Googlebot, and checked blocked URLs, and everything looks fine. What's weird is that I'm seeing it on 2 websites (on different C-class IPs), and one of them doesn't even have a robots.txt file, so the error message can't be correct.
| 7:18 pm on Nov 6, 2012 (gmt 0)|
What about Googlebot getting someone else's robots.txt? Are you on shared hosting?
| 11:57 pm on Nov 6, 2012 (gmt 0)|
could it possibly be another hostname (subdomain) serving a different robots.txt?
| 12:31 am on Nov 7, 2012 (gmt 0)|
|I've gone in to GWT, fetched as Googlebot |
Double-checking here: "fetch as googlebot" meaning that you've fetched the specific page it claims not to be able to show? And there's no problem getting to it?
After you Fetch as Googlebot there's a button called something like "index this page right away". Did you click it?
Oh and, ahem, you do have one canonical form of your domain name, right? No with-and-without www to confuse the robot?
| 1:28 am on Nov 7, 2012 (gmt 0)|
Having a domain accessible both with and without the www won't confuse a bot any more than texas.example.com and california.example.com would. Having the same content available on both can create duplicate-content issues for algorithms, but search algos actually handle that way better now than they used to, because they basically just pick one version to show in the results when both are available with the same content.
If your page is really blocked via robots.txt, you should see a URL-only listing in the results, so if you see more than that (like the title), I would suggest starting by looking at the robots header/meta tag.
You don't happen to use the 'noarchive' robots header or meta tag, do you? (That's the first one I would look at personally, because I would not be surprised at all if it's the issue.)
robots.txt block = URL only in the results.
robots header/meta noindex = no page in the results, so that's not the issue.
robots header/meta noarchive = my first guess as to the problem. (I know it messes up the preview 'snapshot' of a page, and I'm pretty sure it's tied to the description in some way too.)
Those are really the only 3 possibilities outside of a huge glitch at Google, because fetching the correct robots.txt file from each accessible domain/subdomain is critical when you're running a bot, and the chances of them getting that wrong after as long as they've been running a bot are Very slim.
The only other thing I can think of as remotely possible is an erroneous redirect, but again, that should cause your page to be URL-only in the results if somehow you're redirecting your robots.txt to some other domain/subdomain. That's easy enough to check by typing in your domain/robots.txt and making sure you stay at your domain.
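That redirect check can be scripted too; a small sketch (pure string comparison, no network involved; the final URL would come from whatever HTTP client you use to fetch the robots.txt and follow redirects):

```python
from urllib.parse import urlparse

def stayed_on_host(requested_url: str, final_url: str) -> bool:
    """True if a fetch of robots.txt ended on the same hostname it started on."""
    return urlparse(requested_url).hostname == urlparse(final_url).hostname

# A redirect to another subdomain would hand the bot that host's robots.txt.
print(stayed_on_host("http://www.example.com/robots.txt",
                     "http://www.example.com/robots.txt"))   # True
print(stayed_on_host("http://www.example.com/robots.txt",
                     "http://cdn.example.com/robots.txt"))   # False
```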
| 2:32 am on Nov 7, 2012 (gmt 0)|
|Having a domain accessible both with and without the www won't confuse a bot any more than texas.example.com and california.example.com would ... Having the same content available on both can create duplicate content issues for algorithms, but it's actually handled by search algo's way better now than it used to be, because they basically just pick one to show in the results when both are available with the same content. |
the point is that the different hostnames are different urls and also can have different robots.txt, so although the search engines usually/eventually figure it out, there are things you can do to screw that up.
| 3:28 am on Nov 7, 2012 (gmt 0)|
Deleted, because I wasn't thinking when I replied...
Honestly, I'd wager it's a noarchive meta tag or header...
[edited by: TheMadScientist at 4:15 am (utc) on Nov 7, 2012]
| 3:52 am on Nov 7, 2012 (gmt 0)|
Unless you Purposely separate the www/non-www (which you would Know you did, because you have to physically, purposely change the server settings), they show Exactly the same content (including robots.txt). By default they run out of the same directory on the server in any hosting account I know of where you don't have to 'do it yourself', so how can you 'screw that up' and/or 'confuse a bot'?
In other words, you Cannot have two different robots.txt files for www and non-www Unless you Purposely make it so you can, meaning: Canonicalization should not matter a bit in this situation.
You're not going to 'confuse a bot' by having the same robots.txt file for both the www and non-www versions of the domain, and you can't even accidentally 'screw it up' yourself, or you would Know to check both. It takes a higher knowledge level to serve separate files on each than it does for them to serve duplicates of each other, AND if you had them set up not to serve duplicates, you would NOT want to canonicalize them, because that would defeat the purpose of serving different files from each. (Like say you know something about site speed and want to serve your 'cookieless files' from the non-www and files requiring a cookie from the www, to keep the upstream requests from the browser down, or something crazy like that.)
Two separate robots.txt files between www and non-www are like serving a 410 error: it doesn't 'just happen'. You can't even 'accidentally upload' one with an error and one without on the www / non-www versions, unless you've purposely made it so you can, and if you made it so you could, there would be a reason behind it.
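For what it's worth, splitting them really is an explicit server change. A hypothetical Apache setup (paths and names are made up for illustration, not anyone's actual config):

```apache
# Typical shared-hosting default: one vhost answers for both hostnames,
# serving one directory, and therefore one robots.txt.
<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com
    DocumentRoot /var/www/site
</VirtualHost>

# To get DIFFERENT robots.txt files you'd have to split them on purpose:
# <VirtualHost *:80>
#     ServerName www.example.com
#     DocumentRoot /var/www/site-www
# </VirtualHost>
# <VirtualHost *:80>
#     ServerName example.com
#     DocumentRoot /var/www/site-bare
# </VirtualHost>
```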
Sorry, but canonicalization is neither the issue nor the answer for this one.
| 5:14 am on Nov 7, 2012 (gmt 0)|
all i'm doing is describing a hypothetical situation which would result in that robots.txt message as described by the OP in the snippet.
the situation you described (meta robots noarchive) would not cause that message.
the meta robots noarchive prevents the "Cached" link from appearing but otherwise doesn't affect the snippet.
| 5:51 am on Nov 7, 2012 (gmt 0)|
Your hypothetical situation doesn't make any more sense than lucy24's comment about confusing bots. The noarchive tag used to behave as you described, but I removed it from all my sites after running it for years, because in some cases it was treated like nosnippet. Why? I really don't have a clue, but in some cases it was... It could have just been a glitch, but I haven't ever used nosnippet, yet I had no preview for some key pages, so I pulled the noarchive and 24 hours later the preview was back... Go figure.
I don't feel like arguing or even giving answers any more ... If you and others want to believe canonicalization has something to do with the issue described, even though it defies logic, reason, and server settings/bot interaction, then go ahead...
I Will Add: I assure you, I didn't pull the noarchive tag off of 50,000+ pages lightly ... It was the only tag I ran on all pages across all sites and it was there for years without issue ... I thought for quite a while before I pulled it.
[edited by: TheMadScientist at 6:16 am (utc) on Nov 7, 2012]
| 5:58 am on Nov 7, 2012 (gmt 0)|
@Uber_SEO - have you used the robots.txt validation tool in WMT to see if there's an issue that could affect the home page?
Also, is this "home page" actually the domain root? Or is it some other, longer URL?
| 6:40 am on Nov 7, 2012 (gmt 0)|
|I'm seeing it on 2 websites (on different C-class IPs), and one of them doesn't even have a robots.txt file, so the error message can't be correct. |
It almost has to be somewhere you wouldn't normally look and something you wouldn't normally think it was, imo. Since you don't have a robots.txt file on one of them, it seems like it almost has to be a misbehaving robots tag, much like the issue I had for some unknown reason with the noarchive.
(I checked everything before I pulled the noarchive from mine, because I didn't want to, but it was the only thing that made any sense at all as a possible cause since everything else was 'wide open' for indexing and there's No Way I'd put nosnippet on a site ... It may be something completely different on yours, but there's not much else that makes much sense as a starting point for 'digging' to me.)
(I really just posted again to say, 'Hey, tedster! Been a long time since we've posted in the same thread ... Good to see you and congrats again on that little award you picked up, it was VERY well deserved.' I did try to be helpful above so I don't get snipped too hard lol)
| 3:38 pm on Nov 13, 2012 (gmt 0)|
Just thought I'd report back on this, as it may turn out to be useful to someone else. It turned out that the hosting company had rolled out some IP-blocking technology to their servers, which had blocked Google from accessing the domains. No idea why they did this; I'm trying to get to the bottom of it.
Effectively, Google's message in the SERPs was incorrect.
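If anyone else hits this: one way to confirm from your access logs whether the visitor your host blocked really was Googlebot is the reverse-DNS / forward-confirm check Google itself recommends (genuine Googlebot IPs resolve to *.googlebot.com or *.google.com and resolve back to the same IP). A rough sketch:

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the hostname, then forward-confirm it."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the name must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# 127.0.0.1 reverse-resolves to localhost (if at all), so this is False.
print(is_real_googlebot("127.0.0.1"))
```

Run it over the IPs your host blocked; any that pass were the real crawler being locked out.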