Does anyone have a similar problem?
Googlebot has visited these pages tons of times over the past few months. However, the cache remains out of date.
Thanks.
Set up a "fake sitemap" page that points to all of the pages that you do NOT want to be listed, and host that page on another site. Google will spider the page, see the links, and when it follows them it will pick up the 301 status for them all.
If your internal links do not include the domain name, and you also have no redirect, then a single incoming link to non-www is enough to get most of the site indexed under the wrong version. There is nothing in place to make the correction and add the www anywhere on the site. The 301 redirect will correct that within a few months, or weeks if you are lucky.
I have had trouble getting Googlebot to spider my sites properly again after I was hit by the Google 302 bug and hijackers. Since then I have not been present in the Google SERPs; once a month I see a single hit for my main keyword, but then it's gone. Maybe Google spidered the non-www version when I was hijacked and now I'm filtered because of those 400 non-www pages. Just a theory.
No. Google will need to directly visit each and every non-www page in order to "see" the redirect for that page. That is why you need to set up a fake sitemap page that lists every page that you want removed. You need one link to each page that you no longer want to be indexed.
Once the redirect is in place, asking for any non-www page forces the browser or bot over to the www version of that page: it is impossible to navigate around the site as non-www.
Before the redirect was put in place, and for all pages linked using relative links, entry at any point of the site (any page, and any version: www or non-www) caused the whole site to be spidered as whatever the first page was (www or non-www). If the internal links all contained www.domain.com/... then this would force the site to be indexed as www; but, be clear, this would not stop any non-www pages that were linked from external sites from also being indexed (because the non-www URL would still serve "content" when asked {"duplicate" content too!}). The redirect forces the canonicalisation, on a per-page basis, whether or not the internal links are relative or absolute.
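To illustrate why relative links spread whichever hostname the crawler happened to enter on, here is a tiny sketch (example.com is just a stand-in):

# A relative link inherits whatever host the current page was fetched under,
# so one non-www entry point can pull a whole relatively-linked site in as non-www.
from urllib.parse import urljoin

rel_link = "products/widgets.html"   # a typical relative internal link

print(urljoin("http://www.example.com/index.html", rel_link))
# -> http://www.example.com/products/widgets.html

print(urljoin("http://example.com/index.html", rel_link))
# -> http://example.com/products/widgets.html  (same content, second URL)

Same page, two URLs, which is exactly the duplicate-content situation the per-page 301 cleans up.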
I have 21,000 pages, and I do not know whether they are indexed as www or non-www. The page counts under "site:" are way inflated: Google reports that we have 83,400 pages. Not sure how this happened. So...
- Should I create a 200-page (100 links per page) sitemap listing every URL in the non-www format, hosted on another site? (A rough sketch of generating such pages follows these questions.)
- If I do this, will the other site be penalized for having 20,000 links pointing back to my site? Will both domains be penalized?
- I have been using a sitemap at Google. Why does this not correct the issue?
All thoughts are appreciated.
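On the 200-page sitemap question, generating those link pages can be scripted rather than done by hand. A rough sketch, assuming the non-www URLs are already collected in a text file; the file names and page layout here are placeholders only:

# make_fake_sitemaps.py - split a list of non-www URLs into simple HTML pages
# of 100 links each, for hosting on another domain (all names are placeholders)
from html import escape

LINKS_PER_PAGE = 100

with open("non_www_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

chunks = [urls[i:i + LINKS_PER_PAGE] for i in range(0, len(urls), LINKS_PER_PAGE)]

for n, chunk in enumerate(chunks, start=1):
    links = "\n".join(f'<a href="{escape(u)}">{escape(u)}</a><br>' for u in chunk)
    with open(f"fake-sitemap-{n:03d}.html", "w") as out:
        out.write(f"<html><body>\n{links}\n</body></html>\n")

print(f"wrote {len(chunks)} pages")

With 21,000 URLs that comes out to 210 pages; each one just has to be reachable so Googlebot follows the links and picks up the 301 for every listed URL.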
Lorel - No, that would not do it. I think those 2 pages are totally unique; they fit the topic of my site, and my writing cannot get more unique. No, I think I know what's wrong: it's former hijack pages whose cache is still in Google's DB, so there are extra versions of those pages out there. It's a shame they don't update this supplemental DB.
If someone has hijacked your pages, then changing the text on those pages by at least 15% should remove the penalty, as they will no longer be duplicates of each other.
I also have this theory: many sites that were hijacked or hit by the Google 302 bug are still not reappearing in the SERPs because of all those old caches floating around. If the hijacker and 302 links still have the old cache, where they duplicated the original site, listed in Google, then the original site still has trouble.
It is my theory that even though Google "claims" to have removed those penalties related to the 302 redirects (because it no longer lists them under the site: command), that doesn't mean it's so. It's just harder to find evidence of the 302 redirects now.
I think I'm now filtered for duplicates because of all those old caches that are still in the SERPs, especially since only 1-2 times a month do I get a single hit from my main keyword in the logs.
Maybe telling apart "fact", "fantasy" and "fell through the cracks" is getting too much for all these search engines... chasing the "largest number of pages indexed is with us" award is not the way to inspire confidence if this is the result...
BTW, the page was not an old hijacked one, but from an impeccable Italian site dealing with real-time news and financials...
For one of my static sites, "G" has almost as many pages in their supplemental results as the Wayback Machine does... maybe we have stumbled on another "to be used later" resource here... or maybe they have their eye off the ball again...
Yes, when I use the command: -www inurl:yoursite.com
It reports only 11 results on 216.239.59.104.
On 216.239.37.99 it is reporting 2190 results. However, it will only show me three of those results. I am not sure what the other 2187 results are.
Then at least a month ago, several datacentres (like [216.239.37.99...] etc) reverted to indexes from 7 or 8 months ago. I have no idea why they are hanging on to such old data, but I suspect it is something to do with fixing the 302 redirect problem.
Compare [216.239.59.104...] and [216.239.37.99...] for example. Are they both very old, or just one?
Both have very old SERPs - stuff back from Nov 1, 2004. However, .104 has stuff from Mar 28, 2004. That content no longer exists on our site; it is basically 404.
Anyone else have stuff from 2004?