Crawlzilla/1.0

Checked robots.

pendanticist

5:55 pm on Oct 6, 2004 (gmt 0)

66.235.180.*** - - [05/Oct/2004:18:08:04 -0700] "GET /robots.txt HTTP/1.0" 200 1681 "-" "Crawlzilla/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"
66.235.180.*** - - [05/Oct/2004:18:08:04 -0700] "GET / HTTP/1.0" 200 20402 "-" "Crawlzilla/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"

A "Page cannot be displayed" warning accompanies the link...

...and Google gives you nothing for Crawlzilla/1.0 [google.com] as of this date.

10/06/04 10:59:48 IP block 66.235.180.***
Trying 66.235.180.*** at ARIN
Trying 66.235.180 at ARIN
OrgName: HopOne Internet Corporation
OrgID: HOPO
Address: 1010 Wisconsin Avenue N.W.
City: Washington
StateProv: DC
PostalCode: 20007-3603
Country: US

Pendanticist.

[edited by: volatilegx at 8:45 pm (utc) on Oct. 10, 2004]

crawlzilla

6:11 pm on Oct 11, 2004 (gmt 0)

Hello,

We are the owners of Crawlzilla/1.0.

We have been running worldwide crawls with this crawler and would like to hear about any problems with our crawler/spider.

Currently it's a mix of both Heritrix and Nutch. We are aware that the Nutch crawler can be naughty at times and not obey robots.txt, but we're not sure whether we have the problem fixed.

As the gentleman before me stated, we don't have much information up about what's going on. We hope to have full information up as soon as we can, as well as a feedback forum so that you can let us know about any problems our robots may cause and we can fix them as soon as possible.

Here is a list of the robots owned by us that are currently in testing:

Crawlzilla/1.0
Crawlzilla-webbot-beta
Crawlzilla-bot-crawler-beta
Crawlzilla~spider-bot.1.0

Once again, if our crawlers and spiders are misbehaving, please let us know. Our goal is not to harm anyone's site, nor do we want anyone to get the wrong impression. We have currently indexed approx. 4 terabytes of data, and currently have an index of somewhere around 800 million pages spread across 55 servers.

We will have the engine public around Nov 1, 2004, as we are still testing and fixing bugs.

If you would like more information, please do not hesitate to contact us at crawler@crawlzilla.com

Thank you.

volatilegx

6:41 pm on Oct 11, 2004 (gmt 0)

crawlzilla, welcome to WebmasterWorld and thank you for the information :)

ncw164x

6:47 pm on Oct 11, 2004 (gmt 0)

Welcome to WebmasterWorld crawlzilla,

I wish you the best of luck with the launch of your new search engine

ncw164x

pendanticist

7:15 pm on Oct 11, 2004 (gmt 0)

Thank You for the introduction.

Welcome! :)

Pendanticist.

wilderness

12:44 am on Oct 12, 2004 (gmt 0)

We have been running worldwide crawls with this crawler

I'm likely the most difficult participant in this forum to appease :(

Perhaps you could tell us when you began spidering with a UA?

We have currently indexed approx. 4 terabytes of data, and currently have an index of somewhere around 800 million pages spread across 55 servers.

Likely much of that was spidered WITHOUT a UA?

I have this log line from March of 2004:

66.235.184.65 - - [24/Mar/2004:18:41:37 -0800] "GET /myfolder[case error]/mypage.htm HTTP/1.1" 404 - "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Perhaps I'm overlooking something?
Most everybody here recognizes me as narrow-minded; however, you have a history of crawling without identity, and yet you still request access even though your pages are not functional?

Thanks in advance

pendanticist

12:59 am on Oct 12, 2004 (gmt 0)

Crawlzilla/1.0
Crawlzilla-webbot-beta
Crawlzilla-bot-crawler-beta
Crawlzilla~spider-bot.1.0

66.235.180.244 - - [11/Oct/2004:17:12:28 -0700] "GET /robots.txt HTTP/1.0" 200 1681 "-" "Crawlzilla-web-bot/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"

Musta added a new one....
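Since the user-agent names keep changing, matching on the common prefix may be more practical than listing each variant. A minimal Python sketch, assuming the server writes Apache-style combined log lines like the ones quoted in this thread, and assuming every variant starts with "Crawlzilla" (the names `LOG_RE`, `CRAWLZILLA_RE` and `crawlzilla_hits` are made up for illustration):

```python
import re

# Combined log format:
# ip identity user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumption: all variants begin with "Crawlzilla" (checked case-insensitively).
CRAWLZILLA_RE = re.compile(r'^crawlzilla', re.IGNORECASE)


def crawlzilla_hits(lines):
    """Yield (ip, request, agent) for lines whose user agent starts with Crawlzilla."""
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and CRAWLZILLA_RE.match(match.group("agent")):
            yield match.group("ip"), match.group("request"), match.group("agent")
```

Feeding it the lines of an access log (for example `crawlzilla_hits(open("access.log"))`, with a hypothetical file name) yields one tuple per matching request. A crawler that sends a browser user agent would not be caught this way.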

crawlzilla

2:34 am on Oct 12, 2004 (gmt 0)

Hi Wilderness,

66.235.184.65 - - [24/Mar/2004:18:41:37 -0800] "GET /myfolder[case error]/mypage.htm HTTP/1.1" 404 - "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

Looking at the date, 24/Mar/2004, I can tell you without a shadow of a doubt that is not us. We only started crawling on 8/5/2004, and that was mostly test crawls on a very small number of sites. Although the IP may match, we did not own that IP at that time. We only acquired that IP in September.

Sorry for any confusion.

crawlzilla

2:41 am on Oct 12, 2004 (gmt 0)

Also, I would like to add that this is the spider you will be seeing crawling your site from now on: Crawlzilla-web-bot. We are now done fully testing all of our robots and we are happiest with this one. If for some reason it disobeys robots.txt, please let us know either here or at (snip).

Thank you,
Mike Dell
(snip)

[edited by: volatilegx at 2:17 pm (utc) on Oct. 12, 2004]
[edit reason] No signatures please [/edit]
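For anyone who wants to see how a given robots.txt would treat the Crawlzilla-web-bot user agent named above, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the example.com URLs are invented for illustration, and the result reflects how Python's parser reads the file, which is not necessarily how Crawlzilla itself interprets it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, made up for this example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Crawlzilla
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Crawlzilla-web-bot/1.0", "Mozilla/4.0"):
        print(agent, is_allowed(EXAMPLE_ROBOTS_TXT, agent, "http://example.com/index.html"))
```

Python's parser matches a `User-agent` token as a case-insensitive substring of the requesting agent's name, so the single `Crawlzilla` record above also covers the hyphenated variants. Comparing these answers with what actually shows up in the access log is one way to tell whether a robot is disobeying the file.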

wilderness

10:40 am on Oct 12, 2004 (gmt 0)

you have a history of crawling without identity

My apologies.

Many thanks for your willingness to respond and participate.
