RonnieG - 10:14 am on Nov 13, 2006 (gmt 0)
Duh! Very similar is not the same as exact duplicate. Mine is a real estate site. mytown is distinctly not the same as yourtown. This is not the same as simple color change or shoe size difference. I have examined several other similar sites, where this kind of minor difference in wording is common, and is not penalized.
False.
What old index.htm? Where did that come from, unless from a bad IBL outside of my control, which is not my problem, and should result in a 404.
Case specifically referenced was www.mysite/Default.aspx vs. a lower case version www.mysite/default.aspx, of which only the Default.aspx version was found in the site: results, with no cached page. And I had to request the additional omitted results to even see that. The mysite.com/ home page is indexed, and apparently is appropriately redirected to the sole Default.aspx url, for both www and non-www. So it seems that steveb randomly guessed that this might be an issue and picked up on what might be a common issue with some sites.
However, this site is hosted on IIS, which unlike *nix servers, is not case sensitive, and the target url was the same exact canonical url in any case, not a separate page. I tested several other variations of the same url with random capitalization of various other letters in the URL, and they all went to the same proper and unique canonical URL. I did the same random letter capitalization test with several internal urls, including various letters in the folder names, all with the same clean results. Of course they all show the same content, since the landing page url is the same exact file! All this shows is that the site is IIS hosted. Nothing more.
So what does that prove? The cached results were mostly March-April-May 2006, the same time lots of sites were first being hit by the same supplemental issues we have been discussing here, as evidenced by hundreds of posts on a now-locked WebmasterWorld thread that had to be continued 7 times to handle all the posts.
All this shows is that the supplementals issues discussed in those extensive threads are still affecting some sites, and that the pages in the supplementals have not yet recovered from those issues. Since few or none of the pages have the usual problems that would cause them to be penalized / made supplemental / left out of main index purely for their individual page issues, this also seems to support my point that there is still some kind of a site-wide site size and/or PR threshold being applied to what interior pages are allowed in the main index.
My site's home page is PR3, and has been there for over a year. This may not be wonderful, but it is not a PR0-PR2. G webmaster tools show my site crawl history chart and page hit numbers, which indicate a monthly full crawl, with daily hits to at least my home page, and an average of 9-16 pages per day, which is about what I would expect given the dynamic content of a few of the pages and periodic content updates. My web logs show similar crawl rates for googlebot/2.1. So the issue is NOT that the pages are not getting crawled. They, and/or the site, are just not "good enough" for G's main index for some reason.
It is just possible that, after 8-9 months in supplemental hell except for the home page, the next time the Gbot is in the neighborhood checking my home page and xml sitemap, and hopefully also crawling all of the pages of my site and counting them, it is just waiting for that magic moment when it crosses the mysterious threshold of time/pagecount/IBLs/etc., that means it can finally be allowed to index my interior pages again, perhaps because I have added another 30-40 pages to the site and have acquired more IBLs to my home page and to a few of the interior pages as well. In the meantime, following the lead of some of the suggestions in the old threads, I have deleted and re-submitted my xml site map, and submitted a re-inclusion request through webmaster tools.
No.
G needs to be able to better recognize and index true quality content of legitimate sites, on a page by page basis, regardless of site size, as well as recognize and discount spam links. My site has a number of spam links from scraper sites and others that I never solicited or authorized. Those absolutely should be discounted, and it appears that they are, but I should not be penalized for them. On the other hand, I also have legitimate and relevant links from other small businesses in my industry, but G does not seem to be crediting those at all.
[edited by: RonnieG at 10:25 am (utc) on Nov. 13, 2006]
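
For anyone who wants to repeat the random-capitalization test described in the post, here is a minimal sketch in Python. It is not what RonnieG actually ran. The function names and the `http://www.example.com/...` URL in the usage note are placeholders invented for illustration. The sketch flips letter case at random in the path only, then asks where each variant ends up after redirects.

```python
import random
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def case_variants(url, count=5, seed=None):
    """Return `count` copies of `url` with letter case flipped at random in the path."""
    rng = random.Random(seed)
    parts = urlsplit(url)
    variants = []
    for _ in range(count):
        # Only the path is touched; scheme and host are left exactly as given.
        path = "".join(
            ch.swapcase() if ch.isalpha() and rng.random() < 0.5 else ch
            for ch in parts.path
        )
        variants.append(urlunsplit(parts._replace(path=path)))
    return variants


def final_url(url):
    """Follow redirects and return the URL the server finally answered on."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.geturl()


def distinct_targets(urls, resolve=final_url):
    """Run every variant through `resolve` and collect the distinct results."""
    return {resolve(u) for u in urls}
```

Calling `distinct_targets(case_variants("http://www.example.com/Folder/Default.aspx"))` against your own site returns a set of final URLs. One element means every variant was redirected to a single spelling. More than one means the server answered on several spellings without redirecting, which is worth knowing even when the underlying file is the same.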
steveb said:
... Near duplicate descriptions (the same basic sentence on multiple pages but with something like a different color substituted in the sentence text ... multiple URLs showing the same content, no redirect from the old index.htm page to a new default file, ... and caches from six or more months ago (these are NOT supplementals from October like some sites have, but rather May and March). Google needs to alter its crawl priorities so it crawls these legitimate small sites, and allow them to be beaten by legitimate large sites, rather than its current priority of favoring blog comment spam links at the expense of both small and large legitimate sites.
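
On the crawl-priority question, the web-log check mentioned in the reply (counting googlebot/2.1 hits per day) is easy to repeat. The following is a rough sketch under stated assumptions, not a description of anyone's actual setup. It assumes a space-separated log whose first field is the date, as in the W3C extended format IIS can write, and it treats any line mentioning the marker string as a Googlebot hit without verifying the requester. The function name and the log file name in the usage note are hypothetical.

```python
from collections import Counter


def googlebot_hits_per_day(lines, marker="googlebot/2.1"):
    """Count log lines per date whose text mentions `marker` (case-insensitive)."""
    hits = Counter()
    for line in lines:
        if line.startswith("#"):
            # Header/directive lines such as "#Fields: ..." carry no requests.
            continue
        if marker in line.lower():
            # Assumption: the first space-separated field is the date.
            date = line.split(" ", 1)[0]
            hits[date] += 1
    return hits
```

To use it, open the log and pass the file object straight in, for example `with open("access.log") as f: print(googlebot_hits_per_day(f))`. A steady daily count supports the point that the pages are being crawled. It says nothing about why they are or are not in the main index.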