I just now did an allinurl: check for my site, leaving out the www.
All my regular www pages show up as such, but 3 pages stubbornly remain
listed without the www. Worse, they are URL only, with no description.
Those 3 pages are listed both ways: www with a short snippet, and non-www
which is URL only.
Searching by keywords, only the proper www-version shows at all.
1) What are the likeliest causes of this? Old incoming links maybe?
2) Is it any cause for concern? PageRank splitting or whatever?
3) Is there something I should do about it? If so what?
It doesn't look like any emergency in any case. Thanks in advance -Larry
In March the redirect was added from www to non-www (the opposite of what I normally do). All of the non-www pages appeared within days, and almost all had a title and description. The www site listings were slowly fixed in the index, either losing their title and description, or just disappearing from the index. The cache date was updated for all of the pages that remained in the index.
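A host-level 301 like the one described is usually a few lines of server config. Here is a minimal sketch for Apache with mod_rewrite, using example.com purely as a placeholder (not the actual site in question):

```apache
# Minimal sketch of a www -> non-www 301, assuming Apache with
# mod_rewrite enabled; "example.com" is a placeholder domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

The [R=301] flag is what makes the redirect permanent; a plain Redirect directive defaults to a 302, which Google treats quite differently.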
After a few weeks, only a few www pages were left, then many suddenly reappeared in the index again. This time they were fixed by putting a "fake sitemap" on another site, linking to all the stuff we didn't want indexed, so that Google would recrawl those URLs. After a few weeks, all was fine again; at the beginning of May, most of the www pages dropped out again.
At the end of May, just as the Bourbon update was beginning, the index suddenly went back to the version that Google had displayed back in January, and the cache dates were all from December 2004 and January 2005 too. The index contained both www and non-www pages again, and many URL-only listings too.
It stayed this way for nearly a month, and then was fixed all by itself, all except for two www pages which still remain. The latest changes happened much less than two weeks ago.
Google did struggle with the two versions for about two updates before it was completely straightened out. For me, the PR consolidated too - which was the benefit I was after.
Here is that thread:
The interesting thing about that thread was that g1smd posted in it. I have a lot of respect for g1smd and read his posts carefully. I was surprised to see him posting later on about the 301 - he could have solved his problem back in October...
Yes, I am aware of the redirects and have been recommending them for a very long time, but I haven't yet personally gotten around to every webmaster to suggest that they use them too. :-)
I wonder if Google Sitemaps could be used to force G to crawl the remaining pages?
So submit a site map of the non-www pages (or whatever you want to get rid of), forcing Google to crawl them, see the 301, and close out that version of the site?
Anyone want to give it a shot?
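For what it's worth, the "fake sitemap" could be as small as this: a hypothetical sitemap file listing the URLs of the version you want Google to refetch (and see the 301 on), with example.com again standing in for the real domain:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sitemap listing the redirected (www) URLs so that
     Googlebot refetches them, sees the 301, and drops that version. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/index.html</loc></url>
  <url><loc>http://www.example.com/page2.html</loc></url>
</urlset>
```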
One of the two remaining URL-only listings for www pages has today dropped out of a site: search. The page that URL pointed to has not existed for several years.
The other rogue www listing still remains. The 118 non-www pages are still listed, all of them with full title and description. The 301 redirect (from www to non-www) has been in place since mid-March.
Here is the odd bit.
Today, 223 of the 224 /cgi-bin pages that exist have suddenly reappeared in a site: search. They are all shown as URL-only listings. These pages have been disallowed in the robots.txt file since mid-March. The URL of that robots.txt file was submitted to the Google URL Console in late March too (and at that time, all of the /cgi-bin pages dropped out of the index within a few days), and they had stayed out until today.
They found them and put them all back as a special holiday gift just for you ;).
Half of them need a password to get in. Without it you get a 401 error. It seemed pointless to have them in the index as URL-only entries.
The other half of the 224 cgi-bin pages are pages where people can submit information. Those pages are near-identical to each other. Again, it was not necessary for Google to index those. We didn't want people coming to a submission page directly from a Google result. We wanted them to see the site content first. If they then want to submit something, there is a link on every page to do so.
In March adding the /cgi-bin folder URL to robots.txt got them all out of the index within a few days (the robots.txt URL was submitted to the Google URL Console for removal).
Google has correctly listed the 118 real HTML content pages; we didn't need the 224 site management cgi-bin pages listed too. The cgi-bin pages are not duplicates of the site content, 112 are public submission pages, and 112 require a password to get in.
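For anyone following along, the robots.txt rule being described amounts to no more than this (path as given above; the wildcard excludes every crawler from the folder, not just Googlebot):

```
User-agent: *
Disallow: /cgi-bin/
```

Note that Disallow only stops crawling; as this thread shows, Google can still keep (or re-add) URL-only listings for disallowed pages unless they are also put through the URL Console.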
In the Google URL Console, pages that were requested for removal between 2005-03-28 and 2005-04-04 are showing as expired and so have been added back into the listings.
I don't know why they have been re-added, as the robots.txt file is still on the site, and still says that they should not be indexed.
Note: A few months ago many people were reporting that this removal would be for 6 months. These pages are back in after only 3 months.
However, there are other pages (on other sites) that were removed in March (in fact all of the pages removed in the rest of March, all before the 27th) that still show as "complete" and are still removed from the index.
I wish now that I had made a note of which pages were removed by submitting the URL of the robots.txt file, and which were removed by submitting the URL of a page or folder that needed delisting and letting the bot see the 404 status that some were giving out.
<edit>It looks to me that pages asked to be removed by submitting the URL of the page are dropped for 6 months or more (that is, they are still out after 4 months), and that pages removed by submitting the URL of the robots.txt file that mentions them as "disallowed" are removed for only three months (as they are back in on the 91st day).</edit> -- no, see the rest of this, I already disproved it:
Oh my! I removed all trace of a local site that had closed down by submitting the URL of each page, individually, to the URL Console. I did that in mid-March. The entries from March 26th still show as "complete" in the console, but those from March 28th show as "expired".

A load of pages have actually been added back into the Google index, with full title and description and a cache from nearly a year ago. Every link in Google's index goes to a 404 error page, as the site has been offline for 9 months (there is a "we have closed down" notice on the main index page; all of the rest of the content is long gone).

I cannot understand why Google didn't check the status of the pages before adding them back into the index. As they are all 404 (and were submitted as 404 pages), and have been 404 for 9 months, why add them back in?