Forum Moderators: open


101k site indexing limit?


tejas_shah

4:07 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



Is there any tool that would tell me how much of my site is being crawled? It is my understanding that only 101k of your page is crawled. The tools I have come across, such as
[searchengineworld.com...]

crawl the entire site. If the spider crawls the entire site, then the hypothesis that "only 101k of your site is being crawled" is incorrect.

?

Tejas

[edited by: heini at 5:49 pm (utc) on April 24, 2003]
[edit reason] No tools please per TOS / thanks! [/edit]

ikbenhet1

4:09 pm on Apr 24, 2003 (gmt 0)

10+ Year Member




The cache of your page shows exactly what was crawled, so you can see what is included in that 101k by clicking on the cached version of your page.

rogerd

4:13 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



tejas_shah, the often mentioned 101K limit is a Google preference, not any inherent limitation of spiders.

tejas_shah

4:19 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



Doesn't Google's cache show a snapshot of the page from when Google crawled it? It does not show <u>how much</u> of your page was crawled.

Basically, the hypothesis "Google crawls only 101k of your site" is still a hypothesis, huh?

Tejas

PatrickDeese

4:31 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Basically, the hypothesis "Google crawls only 101k of your site" is still a hypothesis, huh?

Hypothesize this:

Look for any keyword combination on Google.

In general, unless the site has requested that Google not cache it, the result will tell you the file size.

Try to find one that is bigger than 101K.

For instance, certain forums *cough* slashdot *cough* have huge audiences and their comment pages can generate many thousands of comments, yet if you search for this forum, you will see there isn't a single one larger than 101K.

---
edit:
by "this forum", i meant slashdot. sorry it was unclear.

[edited by: PatrickDeese at 4:59 pm (utc) on April 24, 2003]

BGumble

4:32 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



Testing whether or not Googlebot continues crawling these large pages seems simple. I searched for a unique phrase found at the bottom of that big list (past the 101k mark) and it was not found in Google, though searching for unique phrases that appear before the 101k cutoff works well.

PD>>yet if you search for this forum, you will see there isn't a single one larger than 101K.

But he was saying it's possible Googlebot CRAWLS further than 101k but will only cache the first 101k. By the test above, it seems to stop cold.
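BGumble's unique-phrase test can be automated. Here is a rough Python sketch (not from this thread; the filler text, token format, and sizes are all made up for illustration) that generates an oversized test page with a unique marker roughly every 10KB. After Google crawls the page, searching for each marker shows how far into the page indexing actually reached.

```python
# Generate a test page with unique marker tokens at known byte offsets.
# The token format ("zqxtest...") is an invented, unguessable string so
# it can only be found by crawling this page.

FILLER = "lorem ipsum dolor sit amet " * 10  # ~270 bytes of padding

def build_test_page(total_kb=150, marker_every_kb=10):
    """Return HTML with a unique marker roughly every marker_every_kb KB."""
    parts = ["<html><head><title>crawl depth test</title></head><body>"]
    size = sum(len(p) for p in parts)
    next_marker = 0
    while size < total_kb * 1024:
        if size >= next_marker:
            # Marker encodes its approximate offset in KB.
            marker = "zqxtest%06dkb" % (size // 1024)
            parts.append("<p>%s</p>" % marker)
            next_marker += marker_every_kb * 1024
        parts.append("<p>%s</p>" % FILLER)
        size = sum(len(p) for p in parts)
    parts.append("</body></html>")
    return "".join(parts)

page = build_test_page()
print(len(page))  # total size in bytes, a bit over 150KB
```

Once such a page is indexed, a search for a marker past the 101k offset (versus one before it) gives a direct answer to the question in this thread.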

tejas_shah

4:35 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



One more thing. Is the 101k limit per site or per web page? It is per web page, right?

BGumble,

Thank you, this is what i was looking for.

PatrickDeese

Point taken. thanks.

Tejas

BGumble

5:01 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



I spent 15 minutes coding 101k.php just for this thread. I uploaded it to my site and, using my IP address, I posted a link to the tool. That page had NO links to my site, NO links at all, and made NO reference to my real site.

Why was that link deleted while stickysauce.com can stay, even though it has banner advertising and links out to many other pages on the site? Isn't my tool just as valid as theirs? Especially since it performed a function that is not available elsewhere!

amoore

5:17 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



Googlebot doesn't stop pulling the page after 101k, but it may very well not index or cache the latter parts of the pages it gets. Here are some examples of Googlebot pulling large pages from my site:

mysql> select timeserved, useragent, remoteip, bytes from requests where useragent like '%ooglebot%' order by bytes desc limit 10;


+---------------------+----------------------------------------------------+-------------+--------+
| timeserved          | useragent                                          | remoteip    | bytes  |
+---------------------+----------------------------------------------------+-------------+--------+
| 2003-03-18 17:25:12 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.65 | 216615 |
| 2003-03-28 12:54:02 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.48 | 214929 |
| 2003-03-18 17:35:37 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.37 | 213792 |
| 2003-03-21 12:18:41 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.79 | 211522 |
| 2003-04-13 04:15:03 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.65 | 210099 |
| 2003-04-11 08:37:30 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.65 | 208899 |
| 2003-04-11 10:06:27 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.48 | 207742 |
| 2003-03-28 12:30:40 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.41 | 202404 |
| 2003-04-06 13:18:42 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.55 | 202393 |
| 2003-04-11 09:47:30 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.52 | 201193 |
+---------------------+----------------------------------------------------+-------------+--------+
10 rows in set (4.16 sec)

As you can see, my webserver served Googlebot about 200k on those pages. Google may ignore everything past 101k, though.

I wonder if that means that you can put stuff at the bottom of large pages that Google dislikes, but some of the other bots still like?
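For anyone who logs to flat files rather than a database, the top-10 query above can be approximated over an Apache combined-format access log. This is a rough sketch; the log format, sample lines, and field layout are assumptions, so adjust the regex for your own server.

```python
import re

# Parse Apache "combined" log lines and keep the largest responses
# served to Googlebot, mirroring the ORDER BY bytes DESC query above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} '
    r'(?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

def largest_googlebot_hits(lines, limit=10):
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or "ooglebot" not in m.group("agent"):
            continue
        size = m.group("bytes")
        if size != "-":  # "-" means no body was sent
            hits.append((int(size), m.group("ip"), m.group("time")))
    return sorted(hits, reverse=True)[:limit]

# Two made-up sample lines: one Googlebot hit, one ordinary browser hit.
sample = [
    '64.68.82.65 - - [18/Mar/2003:17:25:12 +0000] "GET /big.html HTTP/1.0" 200 '
    '216615 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '1.2.3.4 - - [18/Mar/2003:17:26:00 +0000] "GET / HTTP/1.0" 200 '
    '512 "-" "Mozilla/4.0"',
]
top = largest_googlebot_hits(sample)
print(top)
```

Note that the logged byte count is the response size on the wire, which matters for the gzip question raised later in this thread.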

jrobbio

5:22 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



Hi, I have a question on this. Is the 101kb limit the absolute (uncompressed) size, or the amount the crawler actually downloads? I use gzip on my site, which saves me loads of bandwidth, although the number stated in the Google SERPs is the uncompressed size. Just some food for thought: it would be nice if it gave the compressed size instead.

Rob
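A quick way to see how far apart the two numbers can be is to gzip a page body locally. A minimal sketch (the sample HTML is just a stand-in for a real page):

```python
import gzip

# Compare the on-the-wire (gzip) size of a page with its uncompressed
# size, which is what the Google SERPs appear to report.
html = ("<html><body>" + "<p>some very repetitive content</p>" * 3000
        + "</body></html>").encode("ascii")

compressed = gzip.compress(html)
print("uncompressed: %d bytes" % len(html))
print("compressed:   %d bytes" % len(compressed))
# Repetitive HTML compresses extremely well, so a page can be well over
# 101k uncompressed while transferring only a small fraction of that.
```

This is why the distinction matters: if the limit were applied to the transferred (compressed) size, gzipped pages could carry far more than 101k of markup.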

jrobbio

5:24 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



BGumble, I'd be happy to check it out if you sticky mail it to me.

tejas_shah

5:54 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



I wonder if that means that you can put stuff at the bottom of large pages that Google dislikes, but some of the other bots still like?

hmmm, interesting... good question...

tejas_shah

6:08 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



Do you guys know if there is a tool that would tell you how much of your web site (not web page) was crawled, where the bot stopped, and how deep it went?

I am giving programmers ideas about what kind of demand is out there, so how much of the profit do I get? :)

Tejas

jimbeetle

6:48 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Finally found it. Msg 11 in this thread [webmasterworld.com]: GoogleGuy states that "...100K is a hard limit. All pages should be shorter than that."

<emphasis added>

matrix_neo

12:15 pm on May 2, 2003 (gmt 0)

10+ Year Member



Hi jimbeetle,

Is this 101k limit GoogleGuy mentions only for the HTML file, or for the whole web page including images? Hope that's a valid question.

ericjunior

12:38 pm on May 2, 2003 (gmt 0)

10+ Year Member



matrix_neo - it's just for the pure HTML; images are not included.

HayMeadows

1:39 pm on May 2, 2003 (gmt 0)

10+ Year Member



We are testing this on a site right now, but perhaps someone can jump in here and verify that offloading your JavaScript into external files is also a good way of getting under this size limitation?
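Since only the HTML itself seems to count toward the limit, one way to estimate the potential savings before committing is to measure how many bytes the inline script blocks occupy. A rough sketch, assuming a hypothetical page and a deliberately simplistic regex that only matches bare <script> tags without attributes:

```python
import re

# Bytes occupied by inline <script>...</script> blocks. Real pages would
# need a regex that also handles <script type=...> and similar variants.
SCRIPT_RE = re.compile(r"<script>.*?</script>", re.DOTALL)

def inline_script_bytes(html):
    """Total bytes taken up by inline script blocks in the page."""
    return sum(len(m) for m in SCRIPT_RE.findall(html))

# Hypothetical page with one large inline script.
page = (
    "<html><head><script>"
    + "function f(){}" * 500        # stand-in for a big inline script
    + "</script></head><body>hello</body></html>"
)
# Replacing the inline block with an external reference leaves only the
# short <script src> tag in the HTML that counts toward the limit.
saved = inline_script_bytes(page) - len('<script src="f.js"></script>')
print("bytes saved by offloading:", saved)
```

The same approach works for inline CSS: a <link> to an external stylesheet costs a few dozen bytes of HTML regardless of how large the stylesheet itself is.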