I noticed this because we have a huge directory of content arranged alphabetically, with each letter being a separate page (a.html, for example). From my front page I have a.html linked, and then all the content links on that page. The content that starts with the letter 'a' is all indexed. The pages like b.html and c.html are also indexed, but the individual content pages linked from them aren't.
So, what this means is that Google is giving an overall site PR which tells it how many levels down it will index. In my limited research it seems that a site with a front page of PR 5 will get indexed three levels down, and a site of PR 6 will get indexed four levels down. Those below PR 5 I have looked at are barely getting spidered.
When doing this, keep in mind that your front page counts as a level. So if you are only PR 5 and have a huge directory, it seems you shouldn't split it up into sections; just have one huge page with links to it all. This of course totally hoses usability, but you will get spidered.
Also, externally linked pages will get spidered, as a few of the pages listed under the other letters are indexed, as they are linked in blogs and other sites. This is across the board what is happening on my site and the others I have looked at.
Count your levels getting spidered and you will notice how deep they are going. For me, three levels and that is it except for the externally linked individual pages I have seen.
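For anyone who wants to count levels on their own site, here is a rough Python sketch. It assumes you already have a crawl of your internal links as a dictionary; the page names and the three-level cutoff are made up to mirror the directory described above, and the cutoff is only the guess from this thread, not anything Google has documented.

```python
from collections import deque

def click_levels(links, home):
    """Breadth-first search from the home page; the home page is level 1."""
    levels = {home: 1}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in levels:  # first visit is the shortest click path
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels

# Hypothetical link graph modeled on the alphabetical directory above.
links = {
    "index.html": ["a.html"],
    "a.html": ["apple.html", "b.html", "c.html"],
    "b.html": ["banana.html"],
    "c.html": ["cherry.html"],
}

levels = click_levels(links, "index.html")
MAX_LEVEL = 3  # this thread's guess for a PR 5 site, not a documented limit
too_deep = sorted(page for page, lvl in levels.items() if lvl > MAX_LEVEL)
print(too_deep)
```

Anything listed in too_deep sits below the third level, which is where the unindexed content pages in this example end up.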
[edited by: tedster at 6:16 pm (utc) on May 22, 2006]
[edit reason] formatting [/edit]
Websites are about linking, not typing some 150 character URL into the address bar. There is no reason to have two pages. Just link to one single page from two or more sections.
The thing is, I retained my original directory; I just link to the flat list in a place where users rarely go. That way I get the best of both worlds. If you have noticed, forums have long used long flat lists of topics that are the top results for many searches. This isn't new.
Hierarchical folder structures with breadcrumb navigation can do very well. They have good internal linking both up and down the structure, as well as supplying lots of nice keyword-loaded anchor text on all of the internal links.
Not to critique jolene, as this is about intention, but I have seen many of these sites, usually ones which list:
hotels this location ¦ hotels that location
etc. at the footer. I think it's unfortunate that Google seems to penalize this for the rare site that uses it with good intentions.
The funny thing is that since I put the flat navigation in site maps, Google indexed all those pages, and liked them so much, I guess, that now it is crawling deep in other areas. I just think there is too much to gain in doing it this way to simply shove it off as being unhelpful. I think this method is especially helpful for those with PRs of 5 and below, as it will vastly improve the page rank dilution of deep pages. It helped everything for me. A 7 times increase in traffic and a 90k increase in pages indexed after only one week is certainly proof enough for me.
You have to give the bots multiple paths, you have to distribute your PR wisely, and you have to have a volume of links (good PR or not) to every page.
Google has rewarded well laid out structures for a long time, and now it is even more true. Pages with one single link to them at the end of a five click daisy chain are going to have a hard time ever getting found by the new weakbot.
Of course you can. Why the hell can't you, other than it "might not be feasible" or "google only likes 100 links" blah.
Steve you are missing the point of this thread. And here it is.
For some of us Google is only indexing down to a certain level, seemingly based on PR. So a PR 5 site may only get up to 3 levels indexed (depending on certain factors only Google knows about). In other words, it does not matter if you link 300 pages or just one page off of level 3; it won't get indexed or stay indexed. Related articles and sections linking to each other do not matter either...they won't get indexed.
Now this being the case; HOW CAN GOOGLE SUCCESSFULLY DISTRIBUTE PR WHEN NOT ALL PAGES ARE INDEXED AND CONTRIBUTING TO THE DISTRIBUTION?
Keep in mind also that it may not be wise to keep bulk content on higher levels from higher pr pages as this may not be feasible from a business/visitor perspective, structurally, and presentation wise.
Since there is a problem that those pages are not being indexed, the TEST/THEORY here is to somehow raise that deep content up to a level where it can get indexed, yet not take away from user experience. One way is to flatten the structure to where the bots hit the deep content on the third level. This can be done by creating a table of contents or a sitemap that is flattened so the bots access most content on a higher indexable level. I believe this is what tsm26 did.
Another way is to increase links to main pages and deep pages. But now they must be the "right" kind of links and all that jazz, making this a fix that will take much time and effort. (Even though we have many thousands of incoming links throughout our site we still have problems.) This is something that can and should be addressed, but what do you do in the meantime?
The idea of flattening the site a bit is to help speed up the process a bit and get those important deep content pages indexed to where they CAN help distribute internal PR like they should. For some sites this may be all that is needed as PR distribution can be restored. For others it may require this and finding better links to completely restore the site but at least you can get some pages back in the index immediately.
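As a rough illustration of the flattened sitemap idea, here is a small Python sketch that splits a list of deep content URLs into flat sitemap pages. The file names, the URL pattern and the 100 links per page figure are assumptions for the example, not what tsm26 actually used. If the home page links to each sitemap page, every content page ends up two clicks from the home page, which is the third level as it is counted in this thread.

```python
def build_sitemap_pages(urls, links_per_page=100):
    """Split content URLs into flat sitemap pages of at most links_per_page links."""
    if links_per_page <= 0:
        raise ValueError("links_per_page must be positive")
    pages = {}
    for start in range(0, len(urls), links_per_page):
        # Hypothetical naming scheme: sitemap-1.html, sitemap-2.html, ...
        name = f"sitemap-{start // links_per_page + 1}.html"
        pages[name] = urls[start:start + links_per_page]
    return pages

# Made-up URLs standing in for deep content pages.
content_urls = [f"/content/page-{i}.html" for i in range(250)]
sitemap_pages = build_sitemap_pages(content_urls, links_per_page=100)
for name, page_links in sitemap_pages.items():
    print(name, len(page_links))
```

Raising links_per_page gives fewer, larger sitemap pages, which is the trade-off being argued about here.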
This is what this thread is all about and I don't see one bit of a problem in testing a "flattened site map".
They are actually only one click away while my other product pages are all two clicks away (first click to a category, then click to a product). However, because of my directory structure they look as if they are buried deep on the site.
OK, my fault I suppose. But when I developed this site a couple of years ago this indexing by structure was never an issue.
I put some new pages up 4 weeks ago, linked from the index (PR4) with several one-way links from themed sites (friends), and that page took 3 weeks to get crawled, and even now the links leaving the new page (3rd level) aren't getting touched.
AND to cap it all, my page count, which for the last 4 weeks has slowly been going up, has just taken another hit and dropped from 343 pages to 303, but I suppose that's better than the figure of 2 months ago: 243!
Which is what I said. You can't put 57k links on one page and expect that to "work".
Directory structure level has zero to do with anything and never did and there is no reason to bring that into a discussion. Pages seven or eight folder levels down get indexed as easily as anything else. Googlebot follows links, it doesn't make address bar judgments.
"This can be done by creating a table of contents or a sitemap that is flattened so the bots access most content on a higher indexable level."
And to beat the dead horse, this is what Google Guy has been recommending for years, and it goes back to what I said. You need to give the bots different ways to a page, and you need to have sufficient PR and/or raw volume of links to pages you want crawled. One link from a PR5 page might do the trick, or 100 PR3 links.
Getting deep content to rank has always meant sacrificing some power of the top level pages. Now this becomes a higher priority since you aren't just striving to get deep content ranked, but indexed at all. If pages aren't indexed, they can't help your top pages with anchor text or send their small amount of PR back, or send bots crawling back.
You are thinking in terms of PR distribution. My statement was about crawl priority, in that sites under certain conditions with a PR of 5 will only get so many levels crawled. PR 4 gets 2 levels. All of my sites see this same effect. Now incoming links to deep content can change things a bit, but since Google has changed that aspect, who knows. If other sites are suffering from this same occurrence then deep content linked from deep content may not have much of an effect.
"Directory structure level has zero to do with anything and never did and there is no reason to bring that into a discussion. Pages seven or eight folder levels down get indexed as easily as anything else. Googlebot follows links, it doesn't make address bar judgments."
This is NOT about harddrive folder address bar / judgments! This is about how many levels off the index page your content is getting indexed, no matter what the file name or file directory it is in. I can place all files under root or 40 directories down and link them all I want. What we are seeing is that Google wants to index our PR 5 site down to 3 levels. This means 2 clicks off of the home page. All links are crawled and indexed that are linked off of the home page. All links are crawled and indexed off of those pages. It won't index (or rather, won't keep indexed for very long) the pages those 3rd level pages are linking to. Do you get what I am saying?
"And to beat the dead horse, this is what Google Guy has been recommending for years, and it goes back to what I said. You need to give the bots different ways to a page, and you need to have sufficient PR and/or raw volume of links to pages you want crawled. One link from a PR5 page might do the trick, or 100 PR3 links."
This has never been the case. Why until now has there NEVER EVER been a problem indexing and keeping content on any of our sites and other people's sites? It never mattered if a page had 1 link or 5000 links. If it was linked somewhere on the site accessible to googlebot it would be crawled and indexed. But it isn't necessarily a problem of getting crawled; it is more that pages at a certain level won't get indexed or they won't stay indexed. Hence dropped pages.
Our site has a PR of 5, which should be sufficient, plus thousands of incoming links to internal content, which should be more than sufficient to get 4 levels indexed.
The site map worked because the content got MOVED up 1 level in Googlebot's eyes. Now those links came from brand NEW site map pages with pr 0 so this looks as if this isn't necessarily directly PR distribution related but may be related to PR of the site (again 5) and crawl priorities based on that. To add to that, the sheer amount of links diluting the PR distribution from those site maps means they shouldn't have all that much effect on PR anyway.
"If pages aren't indexed, they can't help your top pages with anchor text or send their small amount of PR back, or send bots crawling back. "
This is exactly my point. MAKING THE SITE FLAT THROUGH THE USE OF A LINKS PAGE DIRECTLY OFF THE HOME PAGE SO WE CAN GET THOSE DEEP PAGES INDEXED SO WE CAN GET THE PR OF THOSE PAGES ACCOUNTED FOR AND DISTRIBUTED BACK THROUGH THE SITE IS WHAT WE ARE TRYING TO ACCOMPLISH!
Now if this were a PR distribution thing then please tell me how tsm26's HUGE link page(s) passed enough PR to get all of those pages indexed. The thing is the PR would get so diluted it would hardly have any effect. This makes me believe it has less to do with PR distribution and more to do with sheer levels and priorities based on unknown factors.
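To put a number on the dilution argument, here is a minimal Python sketch based on the published PageRank formula, where a page splits its score evenly across its outgoing links after applying a damping factor. The 0.85 damping value is the one suggested in the original PageRank paper, and the page score of 1.0 and the link counts are illustrative assumptions; what Google actually computes is not public.

```python
def pr_share_per_link(page_score, outlinks, damping=0.85):
    """Score passed through each link: damping * page_score / outlinks."""
    if outlinks <= 0:
        raise ValueError("a page needs at least one outgoing link")
    return damping * page_score / outlinks

# Illustrative numbers only: same page score, different link counts.
category_page = pr_share_per_link(1.0, 100)
huge_link_page = pr_share_per_link(1.0, 10_000)
print(category_page, huge_link_page, category_page / huge_link_page)
```

Under this model a 10,000 link page passes roughly a hundredth as much per link as a 100 link page with the same score, which is the dilution being described.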
Huh? No, I'm thinking of a page with 57,000 links on it. How can you say such a thing is possible?
"This is NOT about harddrive folder address bar / judgments!"
That is what webdevfv is talking about, folder depth.
"This is about how many levels off the index page your content is getting indexed no matter what the file name or file directory it is in."
That is what I've been talking about. The issue involves the extreme importance of pagerank, and the importance of multiple paths to pages, and distributing pagerank (and volume of links) to those various paths.
"Do you get what I am saying?"
You seem to be stating the obvious now. You probably should go back and read the previous posts. Websites with a best page PR of a PR4 are in extremely dire straits. There is very little (though some) you can do to get maximum crawl value. Sites where the best PR page is a PR6 have a lot more to work with, including the ability to screw things up and only get 100 pages indexed instead of 10,000 or more.
"This has never been the case. Why until now has there NEVER EVER been a problem indexing and keeping content on any of our sites and other people's sites?"
You need to read more previous threads, but obviously the problem is more acute now. However, the basic solution remains EXACTLY the same... page rank, good structure, sacrifice some value of top pages to rescue a great volume of lower pages.
"It never mattered if a page had 1 link or 5000 links."
Of course it did. It just matters more now.
"If it was linked somewhere on the site accessible to googlebot it would be crawled and indexed."
That's just false. IMDb and Amazon have always had many thousands of unindexed pages, despite many links to them.
"Now those links came from brand NEW site map pages with pr 0 so this looks as if this isn't necessarily directly PR distribution related"
No, pages created today have pagerank close to right away, whether the green bar shows it or not. Pagerank is pagerank, not the green bar. A new page linked from a PR8 page will instantly have links regularly crawled off it, because it is in reality instantly a PR7 or PR6 page, regardless of the green bar.
"MAKING THE SITE FLAT THROUGH...etc"
I've posted several times now about flattening and redistributing PR, as well as the value of many links and crawl paths. And, I've pointed out this is the advice Google Guy has been giving for years. But, you can't have a flat structure for a 57,000 page site. It's impossible. You can manipulate the structure one way or another to make it flatter or less flat, but flat is an impossibility.
There isn't something mysterious going on here. Google is just crawling weaker than before, by design. Single path sites were never optimum, but they are in much more trouble now. Big sites with low PR are in serious trouble, although better off with pages not indexed at all than with a high percentage of low reputation supplementals in the index.
These are mostly pages that don't change often -- which is another factor in the crawl pattern, I realize. Different domains most likely get profiled differently and crawled by different rules.
Set results to 100 per page, and it's fascinating to see a homepage rank 12th for a search, with no indented result, but then do a search with the -uncommon word and see two more appropriate results ranking 13th and 14th or wherever. I see this normally where individual pages have problems of some sort.
Url A -> dropped
Url B with 301 to Url A.
Url B is very old and Supplemental in the index, but Googlebot still visits this URL regularly.
First Url B, then Url A.
I still have 20% of my pages in the index and the bot visits these regularly. Most of these pages have no old URL with a 301.
I changed my URL system in October 2005. Can it be that Google does not cope with that? Should I send a 410 instead of a 301?
However, there also are a few counterexamples (Pages dropped out without 301, Pages in with 301).
I also did not have any problems on Bigdaddy at the end of February (fully indexed).
What has happened in my experience is this change in algorithm has raised focus from individual content pages to overall site results. This makes it so more and more results for searches I do on Google return main pages and then I have to go search for the results lower. It also means that new content is hard to find. When I am looking up problems with recent software patches, mysql etc. I am now getting more and more old forum sites that are not recent but are the only ones who got single posts indexed because of their high main page PR. This lack of deep indexing is hurting both sides. It has ruined my ability to find many topics newer than a month old on Google except rss feeds from news organizations.
Your assertion about good internal linking structure only goes so far. You can take two different sites with phpBB installed with the exact same options and linking depth and one will get the single posts indexed while the other won't, because one site with a PR of 6 gets four levels indexed while the other only gets down to three. And since PR is just going crazy right now it is hard to figure out what to do. My blog, which only has 12 incoming links according to MSN and Yahoo, is the same PR as our site, which has over 900 incoming external links, about 400 of which are quality non-directory links.
All of this is combining IMO into a nightmare of an algorithm. Sites that have been shutdown for years are still showing up in the top ten for important searches and the new relevant content rich sites are getting deindexed down so you can't find what is needed. I used to be able to find my answers in forums for programming problems in a few minutes and on the first page of results. For the first time ever now I have had to go into the fourth and fifth pages and go into Google groups or other sites. In this age of Digg, Myspace, blogging etc, putting so much emphasis on age of site and links runs counter to what people want. I know they have an interesting dilemma, because you have to protect against fly by night spam sites, but this certainly isn't working.
Sorry, but I don't understand what you are trying to ask. What does one have to do with the other? It's like you are asking if this was a PR distribution thing why is the text black?
"Maybe you meant feasible?"
No I meant not possible. Google reads 101k. Making all 57,000 pages on your site have 57,000 links on them is both ridiculous and irrelevant.
What I am asking is based on what you have been saying about PR distribution. If this problem is because of PR distribution how can a HUGE links page such as tsm26's pass enough pr to get those pages indexed. Why were those pages dropping out to begin with but are indexed and sticking now just because he flattened the structure. IMO there wouldn't be enough PR to pass into those deep content pages to make any more difference than going through the site with the normal directory structure.
To me this says there is a tad more going on here than just PR distribution.
Again, how does PR distribution relate to this? You are asking a non sequitur. Most obviously, a huge amount of links on the Yahoo main page is different than a huge amount of links on a PR0 page. Besides that, I can't figure out how you connect the two things.
"Why were those pages dropping out to begin with but are indexed and sticking now just because he flattened the structure."
Again, two different things. Pages dropping out doesn't necessarily relate to them getting back in, but again, most obviously, someone put more links to a page. How is that not an obvious reason why a page might get crawled more often? As I've mentioned too many times now, flattening will often lead to crawling of more pages but probably less often.
"IMO there wouldn't be enough PR to pass into those deep content pages to make any more difference than going through the site with the normal directory structure."
PR is just one factor in crawling. I don't want to keep repeating the same thing, but in some cases one link from a PR5 page might accomplish your goal while in another 100 links from PR3 pages might.
"To me this says there is a tad more going on here than just PR distribution"
Of course. It's beating a dead horse now, but besides what I've mentioned five times, there are other factors like domain reputation, bad neighborhood factors, duplicate content on a domain, (a very huge one) staticness of the content on a domain, etc etc etc.
If you want your pages crawled, make many paths to them, get more pagerank, get a good domain reputation, update the pages every week, get rid of duplicate content like non-www/www, make sure your key pathway (usually the main page) has no problems like /index.asp being indexed separately, and create bot-friendly easy to crawl sitemap pages (as many as you need).
Little has changed fundamentally, but getting things right is more important and more difficult. Googlebot used to be like Arnold Schwarzenegger, now it is like Pee Wee Herman. You have to do a lot more to help it get your heavy work done. That's not to say it could do everything before, it just means however much it could do before, it is not as strong now.
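For the non-www/www and /index.asp duplicates, a rough Python sketch of picking one canonical URL (for example as the target of a 301) could look like the following. The host name www.example.com and the list of index file names are placeholders, and this only shows the URL logic, not the server configuration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default-document names; adjust to whatever your server uses.
INDEX_FILES = ("index.asp", "index.html", "index.php")

def canonical_url(url, preferred_host="www.example.com"):
    """Map www/non-www and /index.* variants onto one canonical URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host  # collapse both host variants into one
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_FILES:
        path = head + "/"  # /index.asp and / are the same page
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

for variant in ("http://example.com/index.asp",
                "http://www.example.com/",
                "http://EXAMPLE.com"):
    print(variant, "->", canonical_url(variant))
```

All three variants in the loop come out as the same URL, so only one version is left for the bot to index.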
AGAIN LISTEN UP THIS ISN'T about pages not being crawled. It is about them being crawled and added/not added/dropping.
"get more pagerank"
Getting more PR can be achieved in two ways: (#1) incoming links and (#2) creation of new pages, which can vote however you please. WITH that being said, the problem with PR is that a site which is NOT fully indexed gets much of the #2 PR creation ability taken away. At least it does not work to its full potential. You are left with #1, which is a time consuming process. It is something to be addressed for sure, just as internal voting is, which is possibly a quicker fix. For the most part you want to take care of internal issues first before moving to external issues. This is what we are working towards.
"make many paths to them"
Not a problem plenty of incoming links into the thousands plus a well done pyramid structure. No orphans plus alternative routes.
"update the pages every week"
Makes no difference from our experience.
"get rid of duplicate content like non-www/www"
Not an issue either
"make sure your key pathway (usually the main page) has no problems like /index.asp being indexed separately"
Not an issue. Taken care of long ago. Newer sites suffering from similar problems never once allowed such duplicates.
"create bot-friendly easy to crawl sitemap pages (as many as you need)"
Depends - if the indexing ONLY goes 3 levels deep your sitemaps MUST point to all content to make it all level 3. Deep content from level 4 to infinity must be placed in the site map so that all pages are accessed directly off of the site map page. That site map page must be directly off the home page. Now with a site with 50k pages tell me how you can cram all of that in and not break the 100 link barrier on your main page and on the site map. The thing is you can't. This is why tsm26 created huge site maps. There was little choice in the matter.
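The arithmetic behind that can be sketched in a few lines of Python. It assumes a strict tree where every page spends all of its links on the next level down and the home page counts as level 1, so the numbers are upper bounds; the 100 link figure and the three level cutoff are the ones quoted in this thread, not confirmed limits.

```python
def max_pages_at_level(links_per_page, level):
    """Upper bound on pages reachable at a level (home page is level 1)."""
    return links_per_page ** (level - 1)

def links_needed(total_pages, level):
    """Smallest links-per-page value that fits total_pages at the given level."""
    n = 1
    while n ** (level - 1) < total_pages:
        n += 1
    return n

print(max_pages_at_level(100, 3))  # capacity of level 3 at 100 links per page
print(links_needed(50_000, 3))     # links per page needed to fit 50k pages at level 3
```

With 100 links per page, level 3 tops out at 10,000 pages, so 50k pages only fit at level 3 if the home page and every site map page carry well over 200 links each.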
I believe you are forgetting what you wrote earlier:
"You have to give the bots multiple paths, you have to distribute your PR wisely, and you have to have a volume of links (good PR or not) to every page."
You have repeatedly mentioned distributing PR. I am saying this has less to do with distributing PR.
You said you must have volumes of links. This isn't a problem with Google not "finding" links. Google crawls level 3 and it knows what links are on those pages. It crawls level 4 and knows what links are on those pages and what is linking to those pages. But the fact is that level 4 is not sticking regardless of incoming links or how many links are pointing to it. It is about what level in your site structure the page resides.
BTW - you never needed a volume of links. You just at least need one link so the page is not orphaned.
Again you state the obvious. We know that. What are you not understanding? Pages have always dropped out of the index. (They have also gone URL only or supplemental... two things it appears Google may be doing less of now, dropping pages fast instead.)
"I am saying this has less to do with distributing PR."
Less, more, so what? You seem like you need only ONE answer. There isn't one. You have to do a lot of things, many of which are very important, some of which are less important, but all of which help.
"But the fact is that level 4 is not sticking regardless of incoming links or how many links are pointing to it"
Stop fixating on your own site and you'll be better off.
"You just at least need one link so the page is not orphaned."
Again, this is utter nonsense, and I don't want to go over this same ground again and again. Amazon and IMDb are examples of sites where "one link" wasn't enough. In fact, dozens of links weren't. One reason is duplicate content. All versions could be discarded because of dupes. But that is trivia at this point.
"I just am surprised you can't seem to admit that maybe just maybe Google's algo is going a bit wacko and that *gasp* maybe all those phds made a few mistakes."
Um, Google makes mistakes all the time, both screw ups and bad policies. What does that have to do with anything?
You guys just seem to want to complain instead of work on making things work. Good luck.
What I don't understand is that you keep reverting back to the crawling. Here are your own words:
"If you want your pages crawled"
You KEEP repeating this and I have said repeatedly that this is NOT the problem. What part do you not understand?
"Stop fixating on your own site and you'll be better off."
Weak argument. Won't go there. Stop fixating on your ego a second ok.
"You guys just seem to want to complain instead of work on making things work. Good luck. "
Ummm until you arrived we were doing just that - making things work. So good luck and hope to see you in another thread. Again drop the da#n ego for a second and LISTEN to what we have said and a temporary solution that was presented.
"Um, Google makes mistakes all the time, both screw ups and bad policies. What does that have to do with anything?"
EVERYTHING. If they messed up and there is a possible temporary way around it, I'll take it until they fix it. The policies they make are what WE have to live by if we want their traffic. IT matters A LOT.
"You guys just seem to want to complain instead of work on making things work. Good luck. "
That is exactly what we are doing: making things work. I said I made a fix by bringing everything up, and as of today I have 73k pages indexed, up from 700. That sounds like making things work to me. My traffic went from 3k page views on the days they were gone to 19k page views today, and a ton in advertising revenue.
What I am saying is that it is awful I had to resort to that. They should be penalizing the too flat structure rather than rewarding it. I put a well organized pyramid structure in and what do I get. Deindexing. When I move to a flat one, they say "hey we will gobble up all 10k links on your one page no problem". How ridiculous is that.
What we are offering in this whole thread is a solution to help alleviate things in the near term by flattening out navigation. We are complaining so that in the long term, Google (if they are actually reading anymore) will reward lower PR sites (<=5) that have good navigation. Until then I will keep raking in the traffic from that "impossible" index page of 10k content links.