Forum Moderators: Robert Charlton & goodroi
How I noticed this is that we have a huge directory of content arranged alphabetically, with each letter being a separate page (a.html, for example). From my front page I have a.html linked, and then all the content links on that page. The content that starts with the letter 'a' is all indexed. The pages like b.html and c.html are also indexed, but the individual content pages aren't.
So what this means is that Google assigns an overall site PR which tells it how many levels down it will index. In my limited research it seems that a site with a front page of PR 5 will get indexed three levels down, and a site of PR 6 will get indexed four levels down. The sites below PR 5 I have looked at are barely getting spidered.
When doing this, keep in mind that your front page counts as a level. So if you are only PR 5, it seems that if you have a huge directory you shouldn't split it up into sections; just have one huge page with links to it all. This of course totally hoses usability, but you will get spidered.
Also, externally linked pages will get spidered: a few of the pages listed under the other letters are indexed because they are linked from blogs and other sites. This is what is happening across the board on my site and the others I have looked at.
Count your levels getting spidered and you will notice how deep they are going. For me, it's three levels and that is it, except for the externally linked individual pages I mentioned.
[edited by: tedster at 6:16 pm (utc) on May 22, 2006]
[edit reason] formatting [/edit]
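Counting how deep your pages sit can be sketched as a breadth-first pass over the site's internal link graph. This is a hypothetical illustration (the page names and the depth cutoff are made up, not taken from the post):

```python
from collections import deque

def click_depth(links, home):
    """BFS from the home page; returns each page's level,
    where the home page itself counts as level 1 (as noted above)."""
    depth = {home: 1}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical site: front page -> a.html -> content pages
site = {
    "/": ["a.html", "b.html"],
    "a.html": ["apple.html", "anchor.html"],
    "apple.html": [],
}
levels = click_depth(site, "/")
# Pages deeper than whatever cutoff your PR seems to earn
# would be the ones at risk of not being spidered.
at_risk = [p for p, d in levels.items() if d > 2]
```

Pages past the cutoff would be the ones to watch, or to pick up external links for, as the post suggests.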
I think I know the root of the problem. All their algorithms are approximate (with numerical or statistical error), and when you run 30 different algorithms on billions of pages, numerical and statistical errors do accumulate.
Anyone who knows the basics of numerical analysis and/or probability/statistics will know what I'm talking about.
The accumulation of numerical and statistical error across 30 different algorithms and billions of pages makes Google less relevant, and at this point I'm seeing better results in Y! SERPs.
I had 80% of unique content pages dropped, with the majority of the remaining gone into supplemental.
Another extremely interesting find is that I am searching for the titles of my 'supplemental' pages and they don't even come up. Before, at least, G would rank supplementals if it didn't find more relevant pages in its clean index.
Has anyone else noticed this?
If Google can't properly index 'blogs' that they have created, then how do you expect them to correctly index the rest of the sites?
It signifies a major shift in G policy when indexing the web. They are no longer organising the world's information, only the information they see as important, effectively 'censoring' the rest.
Which again prompts the thought:
Why drop so many pages?
Lack of capacity or simply a bug?
There is no other answer.
This is not a shift in priorities as Mr Cutts claims; this is G in serious trouble.
Why do you think crawling is now prioritised?
Why did Bigdaddy shrink the 'Google' web instead of expanding it?
Why did Matt talk about indexing more pages before Bigdaddy, and then start saying that it's indexing only (or mainly) 'important' pages after the Bigdaddy bug was spotted?
With Google resolving canonical issues and even indexing more JavaScript, everyone should see more pages, not fewer, after the last update.
There are thousands of webmasters who keep reporting problems, with millions of sites hit by this.
And this is why this thread keeps going on.
:)
Your ignorance is forgiven as a newbie...
They do not drop pages; they are reindexing with certain priorities (see Matt Cutts' blog). That will take some time and patience from webmasters, and at this point it is useless for this thread to go on.....
1984bb, why are you reading and posting in this thread so frequently if it is "useless"? Hypocrisy and immaturity at its best.
On another note... I have seen a significant reduction in spidering on a site of mine during the past month, but there is no real reason for it in terms of IBLs or page "reputation". So I wonder if part of their "priority" crawling has to do with the condition of a site's indexing? That is, if pages are already indexed properly, use minimal redirects, and do not change frequently, then perhaps they are spidered less. That would not mean the page is less important, which many might automatically assume if they see googlebot visiting less. Just a thought.
Getting that supplemental mess cleaned up is a big task, so they really do need to prioritize.
Anyway, I hope he continues to write about what is going on; every bit helps.
4 distinct websites about different genres.
2 have been registered and indexed in Google longer than 3 years.
2 were indexed in Nov 2004
All have a 301 redirect in place from non-www to www. All take part in link exchanges to a moderate degree. All of the sites have some one-ways, some recips and some outbound links to high quality sites in the respective genre.
All are using Google sitemaps, and every site has a link to a sitemap page that lists every page in the site, and this link is near the top of the page code for each page.
A site: search shows more pages listed now than last week for all four sites. A site search for each site shows the sitemap.html listed 5th-8th on average.
A site: search shows the root URL with www as the first listing in Google.
Hope this helps...
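As an aside, the non-www to www 301 these sites use is typically set up in Apache with mod_rewrite. A minimal .htaccess sketch, with example.com standing in for the real domain:

```apache
RewriteEngine On
# Catch requests for the bare domain (case-insensitive)
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
# Permanently redirect them to the www version, keeping the path
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```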
Another issue that has been discussed and applies to me is that many of the pages are similar across the website. I have the same 30 resources (apparel, travel, venues, etc.) under each of 50-odd regions. Some of them do not have substantial content (listings/ads) yet, so they would probably appear identical to a machine.
But why should this be a bad thing? It's a completely natural occurrence in a hierarchical-type resource website. There's only so creative one can be before it looks sillier than the so-called "duplication".
Who is it that we complain to? We were told to be patient. Now what?
One thing I should further add is that around Sept I applied the 301's to all four websites I mentioned previously and also removed all links pointing home with the keyword in the anchor.
Hope it helps...
I also see that only the first 2 levels beyond the home page are being indexed. Even the 2nd level is sketchy. I completely overhauled the site in Sept., 2005. It was indexed well right off the bat, so link structure is not the problem.
Possibly it could help to create a complete sitemap and link to it from a main part of every page as part of the navigation. Treat the sitemap as an important page in your site, as it really is a directory of your entire site. Place the link to the sitemap high up in the page code to further help establish importance. If Google deems it important, it's likely to deem the outgoing links on it important as well.
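Generating such a complete sitemap page can be as simple as the following sketch (the page list and titles are placeholders, not from the thread):

```python
def build_sitemap_html(pages):
    """pages: list of (url, title) tuples; returns a plain HTML
    sitemap page linking every page on the site."""
    items = "\n".join(
        f'<li><a href="{url}">{title}</a></li>' for url, title in pages
    )
    return f"<html><body><h1>Site Map</h1>\n<ul>\n{items}\n</ul></body></html>"

# Hypothetical page list
html = build_sitemap_html([
    ("/a.html", "A listings"),
    ("/b.html", "B listings"),
])
```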
If you mean by having regions / duplication the following scenario:
this-widget-nantucket.html
this-widget-katmandu.html
and each page is really about the same content, with reworded text for the region and Title, then I could see why you might be nailed with a duplicate penalty.
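For what it's worth, one common way (not necessarily what Google does) to gauge whether two reworded region pages look near-identical to a machine is word-shingle overlap. A hypothetical sketch, with made-up page text:

```python
def shingles(text, k=3):
    """Set of k-word shingles from a page's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two region pages differing only in the place name
nantucket = "widget shops and widget repair services in Nantucket listed by town"
katmandu = "widget shops and widget repair services in Katmandu listed by town"
score = similarity(nantucket, katmandu)
```

A high score on pages like these is exactly the "identical to a machine" situation described above.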
(I review products for my site under different categories)
1. Changed the location from www.domain.com/product/ to www.domain.com/product.asp
2. Replaced underscores with hyphens in file/image names.
3. Moved many links that are currently not linking back to my site to my links page (they were on the side menu on every page, blog style).
4. Kept my 5 top traffic trades on the menu, and added only 5 other link-trade links on my menu.
three are straight trades --
Link 1 (PR 4)
Link 2 (PR 0)
Link 3 (PR 4)
two are a-b-c trades --
Link 4 (IN = 3 / OUT = PR 0)
Link 5 (IN = 3 / OUT = PR 0)
While i'm at it, my 5 traffic trades =
Link 1 (PR 4)
Link 2 (PR 4)
Link 3 (PR 3)
Link 4 (PR 4)
Link 5 (PR 0) *should I remove?*
So my question is the last one, should I remove the PR 0 traffic trade? The traffic isn't all that great.
Also, will the links on the links.asp page hurt me? Should I scrap them? A lot of them are not even hardlink trades, they're sort of traffic trades where if I send traffic, my link appears on their page in a hardlink format usually. Not an ideal way, so I may remove them if they hurt.
Thanks for any tips! Hopefully my changes show some results, as another page was dropped today :( Down to 5 listed, yikes.
Possibly it could help to create a complete sitemap and link to it from a main part on every page as part of the navigation.
Why should I? Thank you for the advice, you might be right, but why should I have to play games? Obviously the way I was doing things was just fine until now. And it's technically sound. Visitors find their way around.
Besides, such a sitemap would be a complete nightmare for any user.
If you mean by having regions / duplication the following scenario:
this-widget-nantucket.html
this-widget-katmandu.html
Something like that...
Something Region 1
--Region 1 apparel
--Region 1 food
--Region 1 lodging
Something Region 2
--Region 2 apparel
--Region 2 food
--Region 2 lodging
and each page is really about the same content, with reworded text for the region and Title, then I could see why you might be nailed with a duplicate penalty.
As I have mentioned elsewhere, why should this be suspect? It's a perfectly common, natural and legitimate practice - an obvious way to organize regional content in particular.
Google's policies are forcing people to do unnatural things to get around their "prevention" techniques.
I have been spending my valuable time putting together articles and resources for my visitors. I don't have time to play games with these people.
And they say in their webmaster guidelines not to create a site for search engines, but for visitors (wording to that effect). So it really shouldn't be a requirement to have a sitemap if your navigation is good enough for visitors to potter around your site and find the page they require.
I'm sure if MSN or Yahoo! started asking for a sitemap submission people would be up in arms.
Ha!
Why should I? Thank you for the advice, you might be right, but why should I have to play games?
You are right, but I thought you were asking for tips (and not a conversation about the ethics of Google).
If you choose to chase Google traffic then playing the game is inevitable.
Having said that, I don't think a sitemap link on every page is a hindrance to your visitors whatsoever; it doesn't have to be in your middle content area.
I have found Matt Cutts' blog to be very helpful; I am glad they allow someone to post as much as he has on this subject. It made me take a closer look at my site and see some problems which I think are hurting my rankings [such as the mysite.com vs. www.mysite.com problem, and bad outside links].
And that is the point. Now everyone looks for problems on their sites because of bad rankings and lost pages. But before Bigdaddy it worked very well with indexing, spidering and caching.
Why doesn't someone suggest that Google has some more problems with the new Bigdaddy infrastructure? E.g. there is Matt Cutts explaining that there is a new spidering technique based on incoming links, but on the other hand the Google Sitemaps team writes on their blog that they are looking into why so many pages are dropping out of the index (and that they are hopefully not the reason). That does not match.
As a database engineer I know what it means to handle a few million datasets. But Google handles billions, with cache + old cache + deleted pages and so on. There are not just 100 rows of code but millions, and each row has the potential to run into an error. So if one goes wrong it will hurt the whole system. Finding that small error is the task, and it will take time.
IMO
Firstly, it looks better and is perhaps easier for users to remember, and secondly it was aimed at helping rankings etc.
Well, that's all gone to pot, since Google hardly ranks any of my product pages now, and not one single new page since May 2005.
I'm now assuming that Google thinks my product pages are four directories deep when I could easily have them at the top level. Given my low PR, I assume that's why my indexed page count is low.
I suppose I might as well go back to using the first example if Google will then index all my product pages. Has anyone any experience of doing this and seeing more pages indexed (with a low PR)?
category pages are like this:
www.mysite.com/first-category.php
www.mysite.com/second-category.php
product pages are like this:
www.mysite.com/products/00001/brandname/first-model.html
www.mysite.com/products/00002/brandname/new-model.html
www.mysite.com/products/00003/brandname/other-model.html
They don't follow a hierarchical structure as such. What you end up getting is this:
www.mysite.com/first-category.php
www.mysite.com/products/00001/brandname/first-model.html
www.mysite.com/products/00002/brandname/new-model.html
www.mysite.com/second-category.php
www.mysite.com/products/00002/brandname/new-model.html
www.mysite.com/products/00003/brandname/other-model.html
As opposed to
www.mysite.com/first-category/00001/brandname/first-model.html
www.mysite.com/first-category/00002/brandname/new-model.html
www.mysite.com/second-category/00002/brandname/new-model.html
www.mysite.com/second-category/00003/brandname/other-model.html
which results in duplication of pages.
That's why I didn't go down the hierarchical route.
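The duplication point can be made concrete with a quick sketch: a product filed under two categories gets two distinct hierarchical URLs, but only one flat /products/ URL. The helper below is hypothetical, just to illustrate the two schemes:

```python
def product_url(product_id, brand, model, category=None):
    """Hypothetical URL builder for the two schemes discussed above."""
    if category:
        # Hierarchical: one URL per category the product appears in
        return f"/{category}/{product_id:05d}/{brand}/{model}.html"
    # Flat: a single canonical URL regardless of category
    return f"/products/{product_id:05d}/{brand}/{model}.html"

# A product listed in two categories:
categories = ["first-category", "second-category"]
hierarchical = {product_url(2, "brandname", "new-model", c) for c in categories}
flat = {product_url(2, "brandname", "new-model")}
# hierarchical has 2 URLs for the same content; flat has just 1
```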
I agree with Jo1ene. What is the problem in offering surfers the information they need in a well laid out form? After all, if you've got, say, a restaurant guide, does G want to see one massive page with every restaurant on it, or the guide split down into regions so it's easy for surfers to use? It really seems like G is asking webmasters to start going backward rather than forward.
www.mysite.com/00001-brandname-first-model.html
www.mysite.com/00002-brandname-new-model.html
It makes no difference to me, although those URLs will get remarkably long once I replace brandname and model with the real details.