Part 2:
The crawler page obviously just has links to other pages, so how do I ensure that the pages linked to from the crawler page get indexed, while the crawler page itself does not? Does that make sense? I just don't want the crawler page to show up in search results.
thanks!
aron hoekstra
[google.com...]
Beth
Where do you get this information from?
From this forum (do a search for "101k") and from experience. Look at the page size shown in the SERPs. It never exceeds 101K. When you find a page that is listed as 101k, view the cached version of the page, then scroll down (way down). You will see that the bottom of the page is cut off...
Having more than 100 links per page (especially if they lack any kind of description) is asking for trouble.
The only way you are going to come close to getting 80k urls indexed is if you develop a site structure that allows Googlebot to crawl them naturally.
Each link does have a unique description. Does this not matter?
The page was well under the 101k limit, but I really don't recommend trying to get pages of links much bigger than that crawled - it must be possible to logically break it down a bit more.
maybe it just doesn't cache pages over 100k, but will process them?
Well, it depends on what you mean by "process them". Google will index the pages, but GoogleBot stops reading after 101k. So any text or links beyond this point will not be "known" to Google: after 101k, the text won't be considered for scoring purposes and it won't be able to follow any links beyond that point.
One page with links to 20 main category pages.
|-> each main category page has 35 or so links to sub-category pages
|--> each sub-category page has 100 or so links to product pages
That's a total of about 70,000 links.
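For illustration only, here is a minimal Python sketch of that kind of split, assuming you start from a flat list of product URLs (all names and URLs below are made up, not from this thread):

def chunk(items, size):
    # yield consecutive slices of at most `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

product_urls = ["/product-%d.html" % n for n in range(70000)]   # placeholder URLs

subcategory_pages = list(chunk(product_urls, 100))   # ~700 pages, 100 product links each
category_pages = list(chunk(subcategory_pages, 35))  # ~20 pages, 35 sub-category links each

print(len(category_pages), "category pages,", len(subcategory_pages), "sub-category pages")

Split that way, every page stays well under the 100-link and 100K figures discussed above.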
I think it's better to have multiple smaller pages, if only to be safe (a timeout, or a page that can't be fetched during the crawl).
If you don't want the crawler page itself to be indexed, use <meta name="robots" content="noindex, follow">, but this doesn't always work. I've seen noindex'd pages show up anyway.
Of course I could be totally wrong!
Why isn't google able to get to all the pages on your site through your normal navigation? If you are doing something that is causing Google to have problems, like using JS menus, you might want to reconsider your site design. Sitemaps are meant to help things along, not necessarily to replace good navigation.
I think that this myth of the 101k limit to indexing has been explained in a few threads with actual examples given, so I won't go any further on that one :)
I don't buy the PR argument either for how deep google will crawl. It seems like a huge pile of extra data to carry around when there is a much easier way to do it.
Pick a highly connected site or two and just start crawling.
Just for kicks let's start with dmoz.org and yahoo.com. Crawl their home pages and add all the links into the queue. Then just start working your way through the queue.
Of course this is all based on old deepbot behavior, so it might not matter any more.
The reason that it may look like PR plays a major influence is that higher PR sites are likely to have a couple of things going for them. They have a good chance of being closer to the root pages of the crawl, and they are more likely to have quite a few deep links.
My first month my root page was indexed and that was it. It got a PR4. My second month, two of my deepest pages got links from a site that was in both DMOZ and Yahoo. Both of those pages that had no PR were crawled before my PR4 page. Not only that, but every page that those pages linked to was crawled at almost the same time, including the root page. Everything in that section of the site got crawled, and not much from the other sections.
So while the depth may appear to be PR based, I think site structure and deep links will serve you better when trying to get monster sites crawled, than will a high PR root page.
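Just to make the "crawl from a seed and work the queue" idea concrete, here is a toy breadth-first sketch in Python. The link graph is invented for illustration; it is not meant to model what Googlebot actually does, only to show why pages sitting close to well-linked seeds get fetched early regardless of PR:

from collections import deque

# toy link graph, purely illustrative
links = {
    "dmoz-home":  ["site-a", "site-b"],
    "yahoo-home": ["site-a", "site-c"],
    "site-a":     ["site-a/deep-page"],
    "site-b":     [],
    "site-c":     [],
    "site-a/deep-page": [],
}

def crawl_order(seeds, graph):
    # breadth-first: crawl the seeds, queue their links, then work the queue
    queue, seen, order = deque(seeds), set(seeds), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(crawl_order(["dmoz-home", "yahoo-home"], links))

In a queue like that, a deep page linked from one of the early pages gets fetched long before a page that is only reachable through many hops, which is consistent with the behavior described above.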
I have a 2 month old site with over 9,500 pages. I created one index page that linked to 1,300 other pages. Those 1,300 pages link to the remaining 8,200 pages in the site. In just 1 day (this week) fredbot (freshdeepbot) crawled over 6,500 pages. This would not have been possible without visiting most of the links on the index page with 1,300 links.
I did further research on my situation and here is what I found to be true:
I have 9,500 pages.
I created 7 index pages, each linking to around 1,357 pages.
Fredbot visited around 6,300 pages in one day.
Since the links in my pages are alphabetically listed, I was quickly able to determine which links were visited by searching for the pages on www3.
It seems that the cutoff point (at least in my case) was around 900 links per page.
So to get all of my pages in the index I'm going to have to further divide these 9,500 links.
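If it helps, here is a rough Python sketch of that kind of further split, assuming plain <a href> links with the URL as anchor text. The ~900-link and 100K figures come from this thread; everything else (the URLs, the layout) is made up:

MAX_LINKS = 900
MAX_BYTES = 100 * 1024

def split_into_index_pages(urls):
    # group links into index pages so no page exceeds ~900 links or ~100K of link HTML
    pages, current, current_bytes = [], [], 0
    for url in urls:
        link_html = '<a href="%s">%s</a>\n' % (url, url)
        if current and (len(current) >= MAX_LINKS or current_bytes + len(link_html) > MAX_BYTES):
            pages.append(current)
            current, current_bytes = [], 0
        current.append(link_html)
        current_bytes += len(link_html)
    if current:
        pages.append(current)
    return pages

urls = ["http://www.example.com/page-%04d.html" % n for n in range(9500)]
pages = split_into_index_pages(urls)
print(len(pages), "index pages; largest has", max(len(p) for p in pages), "links")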
It seems that the cutoff point (at least in my case) was around 900 links per page.
Hmmm...I have seen a few ~900 link-pages (listed as 101k in Google's cache). notsleepy, it would be interesting if you could test to see whether the size of the HTML of those pages, up to the ~900 link cutoff point, is equal to 101K.
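One way to run that test, just as a sketch: save the raw HTML of one of those pages locally ("linkpage.html" below is a placeholder filename), keep only the first 101K, and count the links that survive the cut. This only mimics a simple byte cutoff, not whatever Google actually does:

import re

CUTOFF = 101 * 1024   # the 101K size shown in the cache

with open("linkpage.html", "rb") as f:
    html = f.read()

truncated = html[:CUTOFF].decode("utf-8", errors="ignore")
links_in_first_101k = re.findall(r'<a\s[^>]*href=', truncated, flags=re.IGNORECASE)

print("page is", len(html), "bytes;",
      "first 101K contains", len(links_in_first_101k), "links")

If the count comes out near 900 on those pages, that would line up with the cutoff notsleepy saw.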
BigDave... It does seem like a lot of data that would need to be stored, which is why I don't think it's done... especially with changes coming that would rerank stuff on the fly :)
GoogleGuy expanded upon that in this forum somewhere by saying that perhaps the better way to look at it was to keep pages to 100K. He didn't explain *exactly* why 100K, but if he says it, there must be a good reason - so why tempt fate?
Then there's the issue of pages that are user friendly...