Part 2:
The crawler page obviously just has links to other pages, so how do I ensure that the pages linked to from the crawler page get indexed, while the crawler page itself does not? Does that make sense? I just don't want the crawler page to show up in search results.
thanks!
aron hoekstra
[google.com...]
Beth
Where do you get this information from?
From this forum (do a search for "101k") and from experience. Look at the page size shown in the SERPs. It never exceeds 101K. When you find a page that is listed as 101k, view the cached version of the page, then scroll down (way down). You will see that the bottom of the page is cut off...
Having more than 100 links per page (especially if they lack any kind of description) is asking for trouble.
The only way you are going to come close to getting 80k urls indexed is if you develop a site structure that allows Googlebot to crawl them naturally.
Each link does have a unique description. Does this not matter?
The page was well under the 101k limit, but I really don't recommend trying to get pages of links much bigger than that crawled - it must be possible to logically break it down a bit more.
maybe it just doesn't cache pages over 100k, but will process them?
Well, it depends on what you mean by "process them". Google will index the pages, but GoogleBot stops reading after 101k. So any text or links beyond this point will not be "known" to Google: after 101k, the text won't be considered for scoring purposes and it won't be able to follow any links beyond that point.
One page with links to 20 main category pages.
|-> each main category page has 35 or so links to sub-category pages
|--> each sub-category page has 100 or so links to product pages
That's a total of about 70,000 links.
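For illustration only, here is a minimal Python sketch of that kind of split, assuming you start from a flat list of product URLs (all names and URLs below are made up, not from this thread):

def chunk(items, size):
    # yield consecutive slices of at most `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

product_urls = ["/product-%d.html" % n for n in range(70000)]   # placeholder URLs

subcategory_pages = list(chunk(product_urls, 100))   # ~700 pages, 100 product links each
category_pages = list(chunk(subcategory_pages, 35))  # ~20 pages, 35 sub-category links each

print(len(category_pages), "category pages,", len(subcategory_pages), "sub-category pages")

Split that way, every page stays well under the 100-link and 100K figures discussed above.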
I think it's better to have multiple smaller pages, if only to be safe (a timeout, or a page that can't be fetched during the crawl).
If you don't want the crawler page itself to be indexed, use <meta name="robots" content="noindex, follow">, but this doesn't always work. I've seen noindex'd pages show up anyway.
Of course I could be totally wrong!
Why isn't google able to get to all the pages on your site through your normal navigation? If you are doing something that is causing Google to have problems, like using JS menus, you might want to reconsider your site design. Sitemaps are meant to help things along, not necessarily to replace good navigation.
I think that this myth of the 101k limit to indexing has been explained in a few threads with actual examples given, so I won't go any further on that one :)
I don't buy the PR argument either for how deep google will crawl. It seems like a huge pile of extra data to carry around when there is a much easier way to do it.
Pick a highly connected site or two and just start crawling.
Just for kicks let's start with dmoz.org and yahoo.com. Crawl their home pages and add all the links into the queue. Then just start working your way through the queue.
Of course this is all based on old deepbot behavior, so it might not matter any more.
The reason that it may look like PR plays a major influence is that higher PR sites are likely to have a couple of things going for them. They have a good chance of being closer to the root pages of the crawl, and they are more likely to have quite a few deep links.
My first month my root page was indexed and that was it. It got a PR4. My second month, two of my deepest pages got links from a site that was in both DMOZ and Yahoo. Both of those pages that had no PR were crawled before my PR4 page. Not only that, but every page that those pages linked to was crawled at almost the same time, including the root page. Everything in that section of the site got crawled, and not much from the other sections.
So while the depth may appear to be PR based, I think site structure and deep links will serve you better when trying to get monster sites crawled, than will a high PR root page.
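Just to make the "crawl from a seed and work the queue" idea concrete, here is a toy breadth-first sketch in Python. The link graph is invented for illustration; it is not meant to model what Googlebot actually does, only to show why pages sitting close to well-linked seeds get fetched early regardless of PR:

from collections import deque

# toy link graph, purely illustrative
links = {
    "dmoz-home":  ["site-a", "site-b"],
    "yahoo-home": ["site-a", "site-c"],
    "site-a":     ["site-a/deep-page"],
    "site-b":     [],
    "site-c":     [],
    "site-a/deep-page": [],
}

def crawl_order(seeds, graph):
    # breadth-first: crawl the seeds, queue their links, then work the queue
    queue, seen, order = deque(seeds), set(seeds), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(crawl_order(["dmoz-home", "yahoo-home"], links))

In a queue like that, a deep page linked from one of the early pages gets fetched long before a page that is only reachable through many hops, which is consistent with the behavior described above.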
I have a 2 month old site with over 9,500 pages. I created one index page that linked to 1,300 other pages. Those 1,300 pages link to the remaining 8,200 pages in the site. In just 1 day (this week) fredbot (freshdeepbot) crawled over 6,500 pages. This would not have been possible without visiting most of the links on the index page with 1,300 links.
I did further research on my situation and here is what I found to be true:
I have 9,500 pages.
I created 7 index pages, each linking to around 1,357 pages.
Fredbot visited around 6,300 pages in one day.
Since the links in my pages are alphabetically listed, I was quickly able to determine which links were visited by searching for the pages on www3.
It seems that the cutoff point (at least in my case) was around 900 links per page.
So to get all of my pages in the index I'm going to have to further divide these 9,500 links.
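If it helps, here is a rough Python sketch of that kind of further split, assuming plain <a href> links with the URL as anchor text. The ~900-link and 100K figures come from this thread; everything else (the URLs, the layout) is made up:

MAX_LINKS = 900
MAX_BYTES = 100 * 1024

def split_into_index_pages(urls):
    # group links into index pages so no page exceeds ~900 links or ~100K of link HTML
    pages, current, current_bytes = [], [], 0
    for url in urls:
        link_html = '<a href="%s">%s</a>\n' % (url, url)
        if current and (len(current) >= MAX_LINKS or current_bytes + len(link_html) > MAX_BYTES):
            pages.append(current)
            current, current_bytes = [], 0
        current.append(link_html)
        current_bytes += len(link_html)
    if current:
        pages.append(current)
    return pages

urls = ["http://www.example.com/page-%04d.html" % n for n in range(9500)]
pages = split_into_index_pages(urls)
print(len(pages), "index pages; largest has", max(len(p) for p in pages), "links")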
It seems that the cutoff point (at least in my case) was around 900 links per page.
Hmmm...I have seen a few ~900 link-pages (listed as 101k in Google's cache). notsleepy, it would be interesting if you could test to see whether the size of the HTML of those pages, up to the ~900 link cutoff point, is equal to 101K.
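One way to run that test, just as a sketch: save the raw HTML of one of those pages locally ("linkpage.html" below is a placeholder filename), keep only the first 101K, and count the links that survive the cut. This only mimics a simple byte cutoff, not whatever Google actually does:

import re

CUTOFF = 101 * 1024   # the 101K size shown in the cache

with open("linkpage.html", "rb") as f:
    html = f.read()

truncated = html[:CUTOFF].decode("utf-8", errors="ignore")
links_in_first_101k = re.findall(r'<a\s[^>]*href=', truncated, flags=re.IGNORECASE)

print("page is", len(html), "bytes;",
      "first 101K contains", len(links_in_first_101k), "links")

If the count comes out near 900 on those pages, that would line up with the cutoff notsleepy saw.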
BigDave... It does seem like a lot of data that would need to be stored, which is why I don't think it's done... especially with changes coming that would rerank stuff on the fly :)
GoogleGuy expanded upon that in this forum somewhere by saying that perhaps the better way to look at it was to keep pages to 100K. He didn't explain *exactly* why 100K, but if he says it, there must be a good reason - so why tempt fate?
Then there's the issue of pages that are user friendly...