
101k site indexing limit?

4:07 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 25, 2003
posts:43
votes: 0


Is there any tool which would tell me how much of my site is being crawled? It is my understanding that only 101k of your page is crawled. The tools I have come across, such as
[searchengineworld.com...]

crawl the entire site. If the spider crawls the entire site, then the hypothesis that "only 101k of your site is being crawled" is incorrect.

Tejas

[edited by: heini at 5:49 pm (utc) on April 24, 2003]
[edit reason] No tools please per TOS / thanks! [/edit]
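
For a quick single-page check along these lines, here is a minimal PHP sketch (the URL is a placeholder, and it assumes allow_url_fopen is enabled) that fetches a page and reports its size against the 101k figure under discussion:

<?php
// Fetch a page and report its size against the commonly cited 101k limit.
// The URL is a placeholder; requires allow_url_fopen to be enabled.
$url = 'http://www.example.com/bigpage.html';
$html = file_get_contents($url);
if ($html === false) {
    die("Could not fetch $url\n");
}
$bytes = strlen($html);
$limit = 101 * 1024;
printf("%s is %d bytes (%.1f KB)\n", $url, $bytes, $bytes / 1024);
if ($bytes > $limit) {
    printf("Anything past byte %d may be ignored by the indexer.\n", $limit);
} else {
    echo "Under 101k, so the whole page should be indexable.\n";
}
?>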

4:09 pm on Apr 24, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 18, 2002
posts:638
votes: 0



The cache of your page shows exactly what was crawled, so you can see what is included in that 101k by clicking on the cached link for your page.
4:13 pm on Apr 24, 2003 (gmt 0)

Administrator

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 2, 2000
posts:9687
votes: 1


tejas_shah, the often-mentioned 101K limit is a Google preference, not an inherent limitation of spiders.
4:19 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 25, 2003
posts:43
votes: 0


Doesn't Google's cache show a snapshot of the page as it was when Google crawled it? It does not show *how much* of your page was crawled.

Basically, the hypothesis "Google crawls only 101k of your site" is still just a hypothesis, huh?

Tejas

4:31 pm on Apr 24, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 6, 2003
posts:2523
votes: 0


Basically, the hypothesis "Google crawls only 101k of your site" is still just a hypothesis, huh?

Hypothesize this:

Search Google for any keyword combination.

In general, unless the site has requested that Google not cache it, the result will show the file size.

Find one that is bigger than 101K.

For instance, certain forums *cough* Slashdot *cough* have huge audiences and their comment threads can generate many thousands of posts, yet if you search for this forum, you will see there isn't a single one larger than 101K.

---
edit:
by "this forum", i meant slashdot. sorry it was unclear.

[edited by: PatrickDeese at 4:59 pm (utc) on April 24, 2003]

4:32 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 17, 2003
posts:172
votes: 0


Testing whether or not Googlebot continues crawling these large pages seems simple. I searched for a unique phrase found at the bottom of that big list and it was not found in Google, though searching for unique phrases that appear before the 101k cutoff works fine.

PD>>yet if you search for this forum, you will see there isn't a single one larger than 101K.

But he was saying it's possible Googlebot CRAWLS further than 101k but will only cache the first 101k. By the test above, it seems to stop cold.
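
A test page of the kind described here is easy to generate. Below is a minimal PHP sketch (the marker format and 150k page size are arbitrary choices for illustration) that plants a unique, searchable token at every kilobyte:

<?php
// Emit a test page with a unique, searchable marker at every kilobyte,
// so you can later check which markers made it into Google's index.
// The marker format and 150k page size are arbitrary choices.
header('Content-Type: text/html');
echo "<html><head><title>crawl depth test</title></head><body>\n";
$filler = str_repeat('x', 1000); // roughly 1k of padding per block
for ($kb = 1; $kb <= 150; $kb++) {
    // e.g. "crawltestmarker0101" lands near the 101k point
    printf("<p>crawltestmarker%04d %s</p>\n", $kb, $filler);
}
echo "</body></html>\n";
?>

Once the page has been crawled, searching for, say, crawltestmarker0050 versus crawltestmarker0140 should show roughly where indexing stops.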

4:35 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 25, 2003
posts:43
votes: 0


One more thing: does the 101k limit apply per site or per web page? It's per page, right?

BGumble,

Thank you, this is what I was looking for.

PatrickDeese

Point taken, thanks.

Tejas

5:01 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 17, 2003
posts:172
votes: 0


I spent 15 minutes coding 101k.php just for this thread. I uploaded it to my site and, using my IP address, I posted a link to the tool. That page had NO links to my site, NO links at all, and made NO reference to my real site.

Why was that link deleted when stickysauce.com can stay, even though it has banner advertising and links out to many other pages on the site? Isn't my tool just as valid as theirs? Especially since it performs a function that is not available elsewhere!

5:17 pm on Apr 24, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:Nov 30, 2001
posts:373
votes: 0


Googlebot doesn't stop pulling the page after 101k, but it may very well not index or cache the latter parts of the pages it gets. Here are some examples of Googlebot pulling large pages from my site:

mysql> select timeserved, useragent, remoteip, bytes from requests where useragent like '%ooglebot%' order by bytes desc limit 10;


+---------------------+----------------------------------------------------+-------------+--------+
| timeserved          | useragent                                          | remoteip    | bytes  |
+---------------------+----------------------------------------------------+-------------+--------+
| 2003-03-18 17:25:12 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.65 | 216615 |
| 2003-03-28 12:54:02 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.48 | 214929 |
| 2003-03-18 17:35:37 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.37 | 213792 |
| 2003-03-21 12:18:41 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.79 | 211522 |
| 2003-04-13 04:15:03 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.65 | 210099 |
| 2003-04-11 08:37:30 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.65 | 208899 |
| 2003-04-11 10:06:27 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.48 | 207742 |
| 2003-03-28 12:30:40 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.41 | 202404 |
| 2003-04-06 13:18:42 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.55 | 202393 |
| 2003-04-11 09:47:30 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) | 64.68.82.52 | 201193 |
+---------------------+----------------------------------------------------+-------------+--------+
10 rows in set (4.16 sec)

As you can see, my webserver served Googlebot around 200k on those pages. Google may still ignore everything past 101k, though.

I wonder if that means you can put stuff at the bottom of large pages that Google dislikes, but that some of the other bots still like?
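
For anyone who logs to flat files rather than a database, here is a rough PHP equivalent of the query above against a combined-format Apache access log (the log path and field layout are assumptions about your setup):

<?php
// Scan a combined-format Apache access log for Googlebot requests whose
// response exceeded 101k. The log path is an assumption about your setup.
$limit = 101 * 1024;
$fh = fopen('/var/log/apache/access.log', 'r');
if (!$fh) {
    die("Cannot open log\n");
}
while (($line = fgets($fh)) !== false) {
    if (strpos($line, 'Googlebot') === false) continue;
    // combined format: ... "GET /path HTTP/1.x" status bytes "referer" "agent"
    if (preg_match('/"[A-Z]+ (\S+)[^"]*" \d{3} (\d+)/', $line, $m)
            && (int) $m[2] > $limit) {
        printf("%7d bytes  %s\n", $m[2], $m[1]);
    }
}
fclose($fh);
?>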

5:22 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 15, 2003
posts:169
votes: 0


Hi, I have a question on this. Is the 101k the absolute size of the page or the amount the crawler downloads? I use Gzip on my site, which saves me loads of bandwidth, but the number stated in the Google SERPs is the uncompressed size. Just some food for thought: it would be nice if it gave the compressed size instead.

Rob
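
On the Gzip point, the two numbers are easy to compare locally. A quick sketch (the file name is a placeholder, and it assumes PHP's zlib extension) contrasting a page's raw size with what gzip would actually send over the wire:

<?php
// Compare a page's raw size (the figure Google's SERPs report) with its
// gzipped size (what actually crosses the wire). The file name is a
// placeholder; requires PHP's zlib extension for gzencode().
$html = file_get_contents('bigpage.html');
$raw  = strlen($html);
$gz   = strlen(gzencode($html, 6)); // compression level 6 assumed
printf("uncompressed: %6d bytes (%.1f KB)\n", $raw, $raw / 1024);
printf("gzipped:      %6d bytes (%.1f KB)\n", $gz, $gz / 1024);
printf("saving:       %.0f%%\n", 100 * (1 - $gz / $raw));
?>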

5:24 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 15, 2003
posts:169
votes: 0


BGumble, I'd be happy to check it out if you sticky-mail it to me.
5:54 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 25, 2003
posts:43
votes: 0


I wonder if that means you can put stuff at the bottom of large pages that Google dislikes, but that some of the other bots still like?

hmmm, interesting... good question...

6:08 pm on Apr 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 25, 2003
posts:43
votes: 0


Do you guys know if there is a tool which would tell you how much of your web site (not web page) was crawled, where the bot stopped, and how deep it went?

I'm giving programmers ideas about what kind of demand is out there, so how much of the profit do I get? :)

Tejas
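
Until such a tool exists, here is a rough PHP sketch of the idea against a combined-format Apache access log (the log path is an assumption): it counts the unique URLs Googlebot fetched and reports the deepest path level it reached.

<?php
// Summarize Googlebot's coverage from a combined-format access log:
// how many unique URLs it fetched and the deepest directory level it
// reached. The log path is an assumption about your setup.
$urls = array();
$maxdepth = 0;
$deepest = '';
$fh = fopen('/var/log/apache/access.log', 'r');
if (!$fh) {
    die("Cannot open log\n");
}
while (($line = fgets($fh)) !== false) {
    if (strpos($line, 'Googlebot') === false) continue;
    if (!preg_match('/"[A-Z]+ (\S+)/', $line, $m)) continue;
    list($path) = explode('?', $m[1], 2); // drop any query string
    $urls[$path] = true;
    $depth = substr_count($path, '/');
    if ($depth > $maxdepth) {
        $maxdepth = $depth;
        $deepest = $path;
    }
}
fclose($fh);
printf("unique URLs crawled: %d\n", count($urls));
printf("deepest path level:  %d (%s)\n", $maxdepth, $deepest);
?>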

6:48 pm on Apr 24, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 26, 2002
posts:3295
votes: 9


Finally found it. In msg #11 of this thread [webmasterworld.com], GoogleGuy states that "...100K is a hard limit. All pages should be shorter than that."

<emphasis added>
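
Given that quote, one practical precaution is to audit your own pages against the limit. A quick sketch (the doc root path is a placeholder) that flags any HTML file at or over the 100K mark:

<?php
// Walk a document root and flag HTML files at or over the 100K "hard
// limit" quoted above. The doc root path is a placeholder.
$limit = 100 * 1024;
$it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/www/html'));
foreach ($it as $file) {
    if (!$file->isFile()) continue;
    if (!preg_match('/\.html?$/i', $file->getFilename())) continue;
    if ($file->getSize() >= $limit) {
        printf("%7d bytes  %s\n", $file->getSize(), $file->getPathname());
    }
}
?>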

12:15 pm on May 2, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Feb 27, 2003
posts:298
votes: 0


Hi jimbeetle,

Is the 101k limit GoogleGuy mentions only for the HTML file, or for the whole web page including images? I hope that's a valid question.
