It's also been requesting some pages that don't exist and never have, which is a bit strange. Any thoughts on this, anyone?
I use subdomains for one of my domains, and the bot has been asking for index.dom and www.sd.dom, which don't exist. It has also fetched the sd.dom pages that do exist (sd is a range of subdomain names, dom is my domain).
I've just noticed that this bot is responsible for a cached page carrying the fresh date tag from the 27th, so make of it what you will.
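On the requests for pages that have never existed: if you want to see exactly what it's asking for, something like this against the access log should do it. This is only a rough sketch - it assumes the standard Apache common/combined log format and the 64.68.82.* range people are reporting, so adjust the pattern for your own setup:

# list the paths the bot requested that came back 404, most frequent first
grep '^64\.68\.82\.' access_log | awk '$9 == 404 {print $7}' | sort | uniq -c | sort -rn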
If you're asking whether you'll see crawls from the IP range formerly known as deepbot, I would say "probably not." If you're asking whether we'll continue to deeply crawl and index the web, then the answer is "definitely yes." (GoogleGuy at msg 20 of the Google June 2003: Update Esmeralda [webmasterworld.com] thread.)

Hopefully that makes sense. The index that's coming out now was crawled by what people call freshbot. But freshbot can crawl deeply too. It would make more sense for people to rename it deepfreshbot.
So starting about two weeks ago the behavior of the bots changed, and there is no separate freshbot and deepbot anymore.
Patience is a virtue. That's the exact same thing I was seeing, but now I am seeing what everyone else here is reporting (or at least I was yesterday).
The freshbot IP (at least that's what it used to be) is grabbing hundreds of pages. But I must agree that it is a slow rate, same as others mentioned: about one page every minute or two.
Yes, this is kind of a deep crawl, but it is slower than before. This could mean that the net is growing faster than Google's infrastructure.
Doubtful. More boxes & bandwidth are cheaper than losing competitive advantages, especially come IPO time. Remember, this is a company bringing in hundreds of millions in revenue per year with very low overhead. Read some of the MS papers. They're crawling a good portion of the visible web with a very modest cluster of boxes. Crawling is very unlikely to become a limitation.
From "Mercator: A Scalable, Extensible Web Crawler" [research.compaq.com]:
Our production crawling machine is a Digital Ultimate Workstation with two 533 MHz Alpha processors, 2 GB of RAM, 118 GB of local disk, and a 100 Mbit/sec FDDI connection to the Internet. We run Mercator under srcjava, a Java runtime developed at our lab [10]. Running on this platform, a Mercator crawl run in May 1999 made 77.4 million HTTP requests in 8 days, achieving an average download rate of 112 documents/sec and 1,682 KB/sec.

These numbers indicate that Mercator's performance compares favorably with that of the Google and the Internet Archive crawlers. The Google crawler is reported to have issued 26 million HTTP requests over 9 days, averaging 33.5 docs/sec and 200 KB/sec [4]. This crawl was performed using four machines running crawler processes, and at least one more machine running the other processes. The Internet Archive crawler, which also uses multiple crawler machines, is reported to fetch 4 million HTML docs/day, the average HTML page being 5KB [21]. This download rate is equivalent to 46.3 HTML docs/sec and 231 KB/sec. It is worth noting that Mercator fetches not only HTML pages, but documents of all other MIME types as well. This effect more than doubles the size of the average document downloaded by Mercator as compared to the other crawlers.
That's as of 1999. I'm guessing they've only gotten better at it since then.
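Those per-second figures are just total requests divided by elapsed seconds, if anyone wants to sanity-check them - this is my own back-of-the-envelope math, not from the paper:

awk 'BEGIN { printf "Mercator: %.1f docs/sec\n", 77400000 / (8*24*3600) }'   # ~112
awk 'BEGIN { printf "Google:   %.1f docs/sec\n", 26000000 / (9*24*3600) }'   # ~33.4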
We saw 5-7 pages a second in the earlier deep crawls; today I see a rate of about one page per minute...
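If you want to put an actual number on the rate instead of eyeballing it, counting hits per minute from the access log works. A rough sketch, assuming the standard Apache log format and the 64.68.82.* range (swap in whatever range you're seeing):

# count requests per minute from the bot range; the log is chronological, so uniq -c is enough
grep '^64\.68\.82\.' access_log | awk '{print substr($4, 2, 17)}' | uniq -c | tail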
So fredbot got polite and didn't want to overload your server...what's the problem?
[jason@www logs]$ cat access_log | awk '{print $1}' | grep 64.68.82 | wc
413 413 4956
(413 matches)
The majority of the URLs crawled were from pages that now have NOINDEX meta tags on them, and have had them for at least 2 months. You can draw your own conclusions from that :)
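If anyone wants to run the same check on their own site, one rough way is to pull the paths the bot requested and grep the corresponding files for the meta tag. This is only a sketch: it assumes static .html files served from /var/www/html (use your own document root) and the 64.68.82.* range:

# paths requested by the 64.68.82.* bots, deduplicated
grep '^64\.68\.82\.' access_log | awk '{print $7}' | sort -u > crawled_paths.txt
# list which of those files carry a noindex meta tag
while read p; do grep -lis 'noindex' "/var/www/html$p"; done < crawled_paths.txt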
And for reference, here are the IPs seen:
[jason@www logs]$ cat access_log | awk '{print $1}' | grep 64.68.82 | sort | uniq
64.68.82.12
64.68.82.16
64.68.82.25
64.68.82.26
64.68.82.27
64.68.82.30
64.68.82.31
64.68.82.32
64.68.82.33
64.68.82.34
64.68.82.35
64.68.82.36
64.68.82.37
64.68.82.41
64.68.82.45
64.68.82.46
64.68.82.50
64.68.82.51
64.68.82.52
64.68.82.54
64.68.82.55
64.68.82.56
64.68.82.65
64.68.82.68
64.68.82.70
64.68.82.71
64.68.82.74
64.68.82.77
64.68.82.78
64.68.82.79
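For what it's worth, adding uniq -c to the same pipeline shows how the 413 requests were spread across those addresses (just a variation on the command above):

cat access_log | awk '{print $1}' | grep 64.68.82 | sort | uniq -c | sort -rn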
This could mean that the net is growing faster than Google's infrastructure.
I don't believe the size of the net or Google's capacity is the issue, but rather something else specific to Google. I'm seeing 64.68.82.xx over the last few days. At the same time I'm seeing AV more frequently, and AV is indexing the changes within 24 hours. What Freshie got 72 hours ago is still a no-show on Google.
I had a few pages from a new site indexed for a few days, and now they're gone. It looks like even though the bots are unified, the crawling and the indexing that results from it are not.
<added> Kind of weird, but I'm seeing it make repeated requests for the same page - and getting server 200s, so I know it's getting the page... seeing the same thing on the 64 bots... loads of repeated requests for the same page, and all good responses from the server </added>
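To see which pages are being re-requested like that, something along these lines works against a standard access log (again just a sketch assuming the 64.68.82.* range; swap in whatever range you're watching):

# successful (200) requests per path, most-repeated first
grep '^64\.68\.82\.' access_log | awk '$9 == 200 {print $7}' | sort | uniq -c | sort -rn | head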
GG said the new deepbot won't be using the same IP, but he also added that they were going to give it a new IP... I don't remember the sentence well; try searching the last update thread.
I wanted to write about some bot behavior on 64.68.84-85, 5 days before the big update, but it was deleted because of the premoderation.
I think freshbot was optimized and is still running on 64.68.82.*, and the new deepbot is running on the 64.68.84-85.* IP range... it is weird that 5 days after I saw 84-85 there was the big "E" update.
/SwiZZ