It's also been requesting some pages that don't exist and never have, which is a bit strange. Any thoughts on this, anyone?
I use subdomains for one of my domains, and the bot has been asking for index.dom and www.sd.dom, which don't exist. It has also fetched the sd.dom pages that do exist (sd is a range of subdomain names, dom is my domain).
I've just noticed that this bot is responsible for a cached page carrying the fresh date tag from the 27th, so make of it what you will.
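On the requests for pages that have never existed: if you want to see exactly what it's asking for, something like this against the access log should do it. This is only a rough sketch - it assumes the standard Apache common/combined log format and the 64.68.82.* range people are reporting, so adjust the pattern for your own setup:

# list the paths the bot requested that came back 404, most frequent first
grep '^64\.68\.82\.' access_log | awk '$9 == 404 {print $7}' | sort | uniq -c | sort -rn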
If you're asking whether you'll see crawls from the IP range formerly known as deepbot, I would say "probably not." If you're asking whether we'll continue to deeply crawl and index the web, then the answer is "definitely yes." (GoogleGuy at msg 20 of the Google June 2003: Update Esmeralda [webmasterworld.com] thread.)

Hopefully that makes sense. The index that's coming out now was crawled by what people call freshbot. But freshbot can crawl deeply too. It would make more sense for people to rename it deepfreshbot.
So starting about two weeks ago the behavior of the bots changed, and there is no separate freshbot and deepbot anymore.
Patience is a virtue. That's the exact same thing I was seeing, but now I am seeing what everyone else here is reporting (or at least I was yesterday).
The freshbot IP (at least that's what it used to be) is grabbing hundreds of pages. But I must agree that it is a slow rate, same as others mentioned: about one page every minute or two.
Yes, this is kind of a deep crawl, but it is slower than before. This could mean that the net is growing faster than Google's infrastructure.
Doubtful. More boxes & bandwidth are cheaper than losing competitive advantages, especially come IPO time. Remember, this is a company bringing in hundreds of millions in revenue per year with very low overhead. Read some of the MS papers. They're crawling a good portion of the visible web with a very modest cluster of boxes. Crawling is very unlikely to become a limitation.
From "Mercator: A Scalable, Extensible Web Crawler" [research.compaq.com]:
Our production crawling machine is a Digital Ultimate Workstation with two 533 MHz Alpha processors, 2 GB of RAM, 118 GB of local disk, and a 100 Mbit/sec FDDI connection to the Internet. We run Mercator under srcjava, a Java runtime developed at our lab [10]. Running on this platform, a Mercator crawl run in May 1999 made 77.4 million HTTP requests in 8 days, achieving an average download rate of 112 documents/sec and 1,682 KB/sec.

These numbers indicate that Mercator's performance compares favorably with that of the Google and the Internet Archive crawlers. The Google crawler is reported to have issued 26 million HTTP requests over 9 days, averaging 33.5 docs/sec and 200 KB/sec [4]. This crawl was performed using four machines running crawler processes, and at least one more machine running the other processes. The Internet Archive crawler, which also uses multiple crawler machines, is reported to fetch 4 million HTML docs/day, the average HTML page being 5KB [21]. This download rate is equivalent to 46.3 HTML docs/sec and 231 KB/sec. It is worth noting that Mercator fetches not only HTML pages, but documents of all other MIME types as well. This effect more than doubles the size of the average document downloaded by Mercator as compared to the other crawlers.
That's as of 1999. I'm guessing they've only gotten better at it since then.
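Those per-second figures are just total requests divided by elapsed seconds, if anyone wants to sanity-check them - this is my own back-of-the-envelope math, not from the paper:

awk 'BEGIN { printf "Mercator: %.1f docs/sec\n", 77400000 / (8*24*3600) }'   # ~112
awk 'BEGIN { printf "Google:   %.1f docs/sec\n", 26000000 / (9*24*3600) }'   # ~33.4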
We saw 5-7 pages a second in the earlier deep crawls; today I see a rate of about one page per minute...
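If you want to put an actual number on the rate instead of eyeballing it, counting hits per minute from the access log works. A rough sketch, assuming the standard Apache log format and the 64.68.82.* range (swap in whatever range you're seeing):

# count requests per minute from the bot range; the log is chronological, so uniq -c is enough
grep '^64\.68\.82\.' access_log | awk '{print substr($4, 2, 17)}' | uniq -c | tail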
So fredbot got polite and didn't want to overload your server...what's the problem?
[jason@www logs]$ cat access_log | awk '{print $1}' | grep 64.68.82 | wc
413 413 4956
(413 matches)
The majority of the URLs crawled were from pages that now have NOINDEX meta tags on them, and have had them for at least 2 months. You can draw your own conclusions from that :)
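If anyone wants to run the same check on their own site, one rough way is to pull the paths the bot requested and grep the corresponding files for the meta tag. This is only a sketch: it assumes static .html files served from /var/www/html (use your own document root) and the 64.68.82.* range:

# paths requested by the 64.68.82.* bots, deduplicated
grep '^64\.68\.82\.' access_log | awk '{print $7}' | sort -u > crawled_paths.txt
# list which of those files carry a noindex meta tag
while read p; do grep -lis 'noindex' "/var/www/html$p"; done < crawled_paths.txt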
And for reference, here are the IPs seen:
[jason@www logs]$ cat access_log | awk '{print $1}' | grep 64.68.82 | sort | uniq
64.68.82.12
64.68.82.16
64.68.82.25
64.68.82.26
64.68.82.27
64.68.82.30
64.68.82.31
64.68.82.32
64.68.82.33
64.68.82.34
64.68.82.35
64.68.82.36
64.68.82.37
64.68.82.41
64.68.82.45
64.68.82.46
64.68.82.50
64.68.82.51
64.68.82.52
64.68.82.54
64.68.82.55
64.68.82.56
64.68.82.65
64.68.82.68
64.68.82.70
64.68.82.71
64.68.82.74
64.68.82.77
64.68.82.78
64.68.82.79
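For what it's worth, adding uniq -c to the same pipeline shows how the 413 requests were spread across those addresses (just a variation on the command above):

cat access_log | awk '{print $1}' | grep 64.68.82 | sort | uniq -c | sort -rn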
This could mean that the net is growing faster than Google's infrastructure.
I don't believe the size of the net or Google's capacity is the issue, but rather something else specific to Google. I'm seeing 64.68.82.xx over the last few days. At the same time I'm seeing AV more frequently, and AV is indexing the changes within 24 hours. What Freshie got 72 hours ago is still a no-show on Google.
I had a few pages from a new site indexed for a few days, and now they're gone. It looks like even though the bots are unified, the crawling and the indexing that results from it are not.
<added> Kind of weird, but I'm seeing it make repeated requests for the same page - and getting server 200s, so I know it's getting the page... seeing the same thing on the 64 bots... loads of repeated requests for the same page, and all good responses from the server </added>
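To see which pages are being re-requested like that, something along these lines works against a standard access log (again just a sketch assuming the 64.68.82.* range; swap in whatever range you're watching):

# successful (200) requests per path, most-repeated first
grep '^64\.68\.82\.' access_log | awk '$9 == 200 {print $7}' | sort | uniq -c | sort -rn | head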
GG said the new deepbot won't be using the same IP, but he also added that they were going to give it a new IP... I don't remember the sentence well; try searching the last update thread.
I wanted to write about some bot behavior on 64.68.84-85, 5 days before the big update, but it was deleted because of the premoderation.
I think freshbot was optimized and is still running on 64.68.82.*, and the new deepbot is running on the 64.68.84-85.* IP range... it is weird that 5 days after I saw 84-85 there was the big "E" update.
/SwiZZ