GoogleBot Question(s)

7:27 pm on Jun 18, 2003 (gmt 0)

New User

10+ Year Member

joined:June 18, 2003
posts:6
votes: 0


Part 1:
I have a page that I will be using as my "crawler page"; it contains nearly 80,000 links to pages within my web site. The file is over 5 MB in size and I'm on a rather slow DSL line. My question is: will GoogleBot time out if it takes more than a minute or so to load the page? I have heard about people breaking their crawler pages into 100k files. Is this necessary?

Part 2:
The crawler page obviously just has links to other pages. How do I ensure that the pages linked to from the crawler page get indexed, while the crawler page itself does not? Does that make sense? I just don't want the crawler page to show up in search results.

thanks!

aron hoekstra

[edited by: WebGuerrilla at 7:34 pm (utc) on June 18, 2003]
[edit reason] no urls please [/edit]

7:44 pm on June 18, 2003 (gmt 0)

New User

10+ Year Member

joined:May 28, 2003
posts:9
votes: 0


I wouldn't recommend a page with more than 100 links.
7:50 pm on June 18, 2003 (gmt 0)

New User

10+ Year Member

joined:June 18, 2003
posts:6
votes: 0


why not?
7:53 pm on June 18, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 13, 2003
posts:103
votes: 0


Yes, it is necessary to break them down to 100K or less. GoogleBot stops crawling after 101K.
7:57 pm on June 18, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 2, 2002
posts:446
votes: 0


The google guidelines say "If the site map is larger than 100 or so links, you may want to break the site map into separate pages."

[google.com...]

Beth

[edited by: WebGuerrilla at 8:14 pm (utc) on June 18, 2003]
[edit reason] Added Link [/edit]

8:10 pm on June 18, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 17, 2003
posts:687
votes: 0


> GoogleBot stops crawling after 101K.

Where do you get this information from?

8:13 pm on June 18, 2003 (gmt 0)

New User

10+ Year Member

joined:June 18, 2003
posts:6
votes: 0


So is it 100k or 100 links?

I can fit 1,000 links within a 100k document.

If it's 100k, then I need 80 pages to link to the 80,000 links.

If it's 100 links, then I need 800 pages.

8:16 pm on June 18, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2000
posts:2176
votes: 0



Having more than 100 links per page (especially if they lack any kind of description) is asking for trouble.

The only way you are going to come close to getting 80,000 URLs indexed is if you develop a site structure that allows Googlebot to crawl them naturally.

8:20 pm on June 18, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 13, 2003
posts:103
votes: 0


Where do you get this information from?

From this forum (do a search for "101k") and from experience. Look at the page size shown in the SERPs. It never exceeds 101K. When you find a page that is listed as 101k, view the cached version of the page, then scroll down (way down). You will see that the bottom of the page is cut off...

8:25 pm on June 18, 2003 (gmt 0)

New User

10+ Year Member

joined:June 18, 2003
posts:6
votes: 0


Having more than 100 links per page (especially if they lack any kind of description) is asking for trouble.
The only way you are going to come close to getting 80,000 URLs indexed is if you develop a site structure that allows Googlebot to crawl them naturally.

Each link does have a unique description. Does this not matter?

8:26 pm on June 18, 2003 (gmt 0)

New User

10+ Year Member

joined:June 18, 2003
posts:6
votes: 0


view the cached version of the page, then scroll down (way down). You will see that the bottom of the page is cut off...

Maybe it just doesn't cache pages over 100k, but will still process them?

8:28 pm on June 18, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 3, 2001
posts:368
votes: 3


I've just looked and found a page with 200 links that all got crawled last month. This page was three levels down the directory structure and I couldn't figure out a way to break it down any further.

The page was well under the 101k limit, but I really don't recommend trying to get pages of links much bigger than that crawled - it must be possible to logically break it down a bit more.

8:30 pm on June 18, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 13, 2003
posts:103
votes: 0


Maybe it just doesn't cache pages over 100k, but will still process them?

Well, it depends on what you mean by "process them". Google will index the pages, but GoogleBot stops reading after 101k. So any text or links beyond that point will not be "known" to Google: past 101k, the text won't be considered for scoring purposes, and GoogleBot won't be able to follow any links beyond that point.

8:32 pm on June 18, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 26, 2001
posts:1422
votes: 0


>exceeds 101K.<

I think you can hang your hat on that.

8:40 pm on June 18, 2003 (gmt 0)

New User

10+ Year Member

joined:June 18, 2003
posts:6
votes: 0


OK, so what I could do would be this:


Crawler Page (80 links)

--Link to Page 1--> page with 1000 links

--Link to Page 2--> page with 1000 links

--Link to Page 3--> page with 1000 links

...

--Link to Page 80--> page with 1000 links

-----------------

total: 80,000 links

right? ;)

9:32 pm on June 18, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 10, 2003
posts:157
votes: 0


That still has too many links per page. Googlebot won't like it. I have about 70k pages that Google crawls. They can be accessed multiple ways through links on the site (not just the site map), so PR can trickle down better. But, in case you were interested, my "Site Map" looks like this:

One page with links to 20 main category pages.
|-> each main category page has 35 or so links to sub category pages
|-|-> each sub category page has 100 or so links to product pages

total of about 70,000 links.
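
For what it's worth, a rough Python sketch of that kind of tiered split is below; the chunk sizes, filenames, and page template are placeholders for illustration, and real category data is better than splitting a flat list blindly:

# Sketch only: split a flat list of (url, anchor text) pairs into a
# three-level site map: one top page -> section pages -> sub-pages of
# ~100 product links each. Numbers and filenames are made up.

def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def write_page(filename, title, links):
    rows = "\n".join(f'<li><a href="{url}">{text}</a></li>' for url, text in links)
    html = f"<html><head><title>{title}</title></head><body><ul>\n{rows}\n</ul></body></html>"
    with open(filename, "w") as f:
        f.write(html)

def build_sitemap(product_links, per_subpage=100, subpages_per_section=35):
    section_links = []
    section_size = per_subpage * subpages_per_section
    for s, section in enumerate(chunks(product_links, section_size)):
        sub_links = []
        for p, sub in enumerate(chunks(section, per_subpage)):
            name = f"sitemap_{s}_{p}.html"
            write_page(name, f"Site map {s}-{p}", sub)
            sub_links.append((name, f"Site map {s}-{p}"))
        write_page(f"sitemap_{s}.html", f"Site map section {s}", sub_links)
        section_links.append((f"sitemap_{s}.html", f"Site map section {s}"))
    write_page("sitemap.html", "Site map", section_links)

# e.g. build_sitemap([("http://www.example.com/p1.html", "Product 1"), ...])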

9:40 pm on June 18, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


I believe that the 101k figure is being described incorrectly here... that is the limit on what Google will cache, not the limit on what they will crawl.
I have many pages much larger, as do a lot of people, that get crawled very nicely and are definitely more than 101k. And yes, every link on the page gets crawled (even the 500th link), although I agree that you should attempt to cut down the number of links on a page. And I have seen no effect on differing PRs of the pages as to whether the bots (both the old deep and the new fresh bots) crawl all links.
10:22 pm on June 18, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts:1377
votes: 0



[webmasterworld.com...]
I've seen similar posts as well.

I think it's better to have multiple smaller pages, if only to be safe (timeouts, or the page can't be fetched during the crawl).

If you don't want the page to be indexed, use <meta name="robots" content="noindex,follow">, but this doesn't always work. I've seen noindex'd pages show up anyway.
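
A minimal sketch of how generated link pages could carry that tag, assuming you build the pages yourself (the helper and constant names are made up):

# Assumed helper: stamp a robots meta into each generated site-map page so
# the link pages themselves stay out of the index while their links are
# still followed. Naive string insert, fine for pages we generate ourselves.
NOINDEX_FOLLOW = '<meta name="robots" content="noindex,follow">'

def add_noindex(html):
    return html.replace("<head>", "<head>" + NOINDEX_FOLLOW, 1)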

10:24 pm on June 18, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts:1377
votes: 0


Marval:
I have seen no effect on differing PRs of the pages as to whether the bots (both the old deep and the new fresh bots) crawl all links.

Could you explain what you mean by this please?

10:55 pm on June 18, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 12, 2003
posts:73
votes: 0


I would recommend the following:
-Do not have more than 100 links on your site map; better safe than sorry.
-Build eight two-level sitemaps and link to them all from your index page.
-On every page, link back to both of the sitemaps in your sitemap tree that are relevant to that page.
-Really think about the site design, and try to avoid six or more levels. On every level of the design, link back to the page that links to it, and link to all relevant pages below it and on the same level that relate to the page content. The further down you go, the fewer links.

Of course I could be totally wrong!

11:03 pm on June 18, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member bigdave is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 19, 2002
posts:3454
votes: 0


nullvalue,

Why isn't google able to get to all the pages on your site through your normal navigation? If you are doing something that is causing Google to have problems, like using JS menus, you might want to reconsider your site design. Sitemaps are meant to help things along, not necessarily to replace good navigation.

12:43 am on June 19, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


HitProf... I misread the statement about scoring. Some in the past have "intimated" that the PR of a page decides how far the bot will go through its links (a supposed limit to the number it will index), and I was stating my experience on my pages, where I've seen a PR2 and a PR5 page get indexed exactly the same no matter how many links existed.

I think this myth of a 101k limit to indexing has been explained in a few threads with actual examples given, so I won't go any further on that one :)

1:15 am on June 19, 2003 (gmt 0)

New User

10+ Year Member

joined:June 12, 2003
posts:13
votes: 0


nullvalue

I'm guessing that you are talking about some sort of feed site that you just want crawled.
If you have category data available, you could possibly split it that way.
Otherwise, your structure in msg #15 should get crawled OK, in my opinion.

If your site contains spammy stuff or duplication then you could be in trouble, but I guess that's a different problem.
1:39 am on June 19, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member bigdave is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 19, 2002
posts:3454
votes: 0


Marval,

I don't buy the PR argument either for how deep google will crawl. It seems like a huge pile of extra data to carry around when there is a much easier way to do it.

Pick a highly connected site or two and just start crawling.

Just for kicks let's start with dmoz.org and yahoo.com. Crawl their home pages and add all the links into the queue. Then just start working your way through the queue.

Of course this is all based on old deepbot behavior, so it might not matter any more.
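
A toy sketch of that kind of queue-based crawl, purely to illustrate the idea (nothing like Google's real crawler; the regex link extraction and page budget are simplifications):

import re
from collections import deque
from urllib.request import urlopen

def crawl(seed_urls, max_pages=1000):
    # Seed the queue, then pop a URL, fetch it, pull out its links,
    # append any unseen ones, and repeat until the budget runs out.
    queue = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=30).read().decode("utf-8", "replace")
        except Exception:
            continue  # timed out or unreachable; just move on
        fetched += 1
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# e.g. crawl(["http://dmoz.org/", "http://www.yahoo.com/"])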

The reason it may look like PR plays a major role is that higher-PR sites are likely to have a couple of things going for them. They have a good chance of being closer to the root pages of the crawl, and they are more likely to have quite a few deep links.

My first month, my root page was indexed and that was it. It got a PR4. My second month, two of my deepest pages got links from a site that was in both DMOZ and Yahoo. Both of those pages, which had no PR, were crawled before my PR4 page. Not only that, but every page those pages linked to was crawled at almost the same time, including the root page. Everything in that section of the site got crawled, and not much from the other sections.

So while the depth may appear to be PR based, I think site structure and deep links will serve you better than a high-PR root page when trying to get monster sites crawled.

2:10 am on June 19, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 10, 2003
posts:106
votes: 0


I buy the 100k limit for caching but I disagree with the 200 link limit for crawling.

I have a 2 month old site with over 9,500 pages. I created one index page that linked to 1,300 other pages. Those 1,300 pages link to the remaining 8,200 pages in the site. In just 1 day (this week) fredbot (freshdeepbot) crawled over 6,500 pages. This would not have been possible without visiting most of the links on the index page with 1,300 links.

2:29 am on June 19, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 10, 2003
posts:106
votes: 0


I need to retract some of what I said.

I did further research on my situation, and here is what I find to be true:

I have 9,500 pages.

I created 7 index pages, each linking to around 1,357 pages.

Fredbot visited around 6,300 pages in one day.

Since the links in my pages are alphabetically listed, I was quickly able to determine which links were visited by searching for the pages on www3.

It seems that the cutoff point (at least in my case) was around 900 links per page.

So to get all of my pages in the index, I'm going to have to further divide these 9,500 links.
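
Something like the sketch below is what that further split could look like; the 500-links-per-page figure is only a guess on my part, not a known safe limit:

def split_links(urls, per_page=500):
    """Break an alphabetised list of URLs into pages of at most per_page links."""
    return [urls[i:i + per_page] for i in range(0, len(urls), per_page)]

# 9,500 URLs at 500 per page -> 19 link pages instead of 7 pages of ~1,357 each
urls = [f"http://www.example.com/page{i}.html" for i in range(9500)]
print(len(split_links(urls)))  # 19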

1:43 pm on June 19, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 13, 2003
posts:103
votes: 0


It seems that the cutoff point (at least in my case) was around 900 links per page.

Hmmm... I have seen a few ~900-link pages (listed as 101k in Google's cache). notsleepy, it would be interesting if you could test whether the size of the HTML of those pages, up to the ~900-link cutoff point, is equal to 101K.
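
One rough way to run that check: fetch the raw HTML, find the byte offset where the Nth anchor starts, and compare it to 101 * 1024 = 103,424 bytes (the URL and link number below are placeholders):

import re
from urllib.request import urlopen

def offset_of_nth_link(url, n):
    # Fetch the raw HTML and return the byte offset at which the nth
    # anchor tag begins, or None if the page has fewer than n links.
    html = urlopen(url, timeout=30).read()
    anchors = list(re.finditer(rb"<a\s", html, re.IGNORECASE))
    if len(anchors) < n:
        return None
    return anchors[n - 1].start()

# e.g. offset_of_nth_link("http://www.example.com/bigpage.html", 900)
# a value around 103,424 would line up with the 101k figure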

7:45 pm on June 19, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


This theory of a maximum number of links has been tested and reported here before (not sure what the appropriate search would be), and no limit on the number of links per page anywhere near as low as the 101k figure would suggest was found... I think we're repeating something that has already been done.

BigDave... it does seem like a lot of data that would need to be stored, which is why I don't think it's done... especially with changes coming that would rerank stuff on the fly :)

9:10 pm on June 19, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts:1377
votes: 0


Thanks for explaining, Marval; I just didn't know how to interpret your text.
9:35 pm on June 19, 2003 (gmt 0)

Junior Member

joined:Mar 6, 2003
posts:170
votes: 0


Google's site says to try to limit it to 100 links per page.

GoogleGuy expanded upon that in this forum somewhere by saying that perhaps the better way to look at it was to keep the pages to 100K. He didn't expand upon *exactly* why 100K, but if he says it, there must be a good reason - so why tempt fate?

Then there's the issue of pages that are user friendly...
