66.235.180.*** - - [05/Oct/2004:18:08:04 -0700] "GET /robots.txt HTTP/1.0" 200 1681 "-" "Crawlzilla/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"
66.235.180.*** - - [05/Oct/2004:18:08:04 -0700] "GET / HTTP/1.0" 200 20402 "-" "Crawlzilla/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"
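For anyone wanting to pull entries like these out of their own logs, here is a minimal sketch. It assumes the server writes the common Apache "combined" log format, which the two lines above appear to follow. The names `LOG_PATTERN` and `parse_line` are made up for illustration.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The first log line from this thread, with the masked octet left as posted.
sample = (
    '66.235.180.*** - - [05/Oct/2004:18:08:04 -0700] '
    '"GET /robots.txt HTTP/1.0" 200 1681 "-" '
    '"Crawlzilla/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"'
)

entry = parse_line(sample)
print(entry["host"], entry["request"], entry["status"])
print(entry["agent"])
```

The host field is matched as any run of non-space characters, so masked addresses such as the ones posted here still parse.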
A "page cannot be displayed" warning accompanies the link...
...and Google gives you nothing for Crawlzilla/1.0 [google.com] as of this date.
10/06/04 10:59:48 IP block 66.235.180.***
Trying 66.235.180.*** at ARIN
Trying 66.235.180 at ARIN
OrgName: HopOne Internet Corporation
OrgID: HOPO
Address: 1010 Wisconsin Avenue N.W.
City: Washington
StateProv: DC
PostalCode: 20007-3603
Country: US
Pendanticist.
[edited by: volatilegx at 8:45 pm (utc) on Oct. 10, 2004]
We are the owners of Crawlzilla/1.0.
We have been running worldwide crawls with this crawler and would like to hear about any problems with our crawler/spider.
Currently it's a mix of both Heritrix and Nutch. We are aware that the Nutch crawler can be naughty at times and not obey robots.txt, but we're not sure whether we have the problem fixed.
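As a rough illustration of what "obeying robots.txt" means in practice, here is a sketch using Python's standard `urllib.robotparser`. It is not how Heritrix or Nutch do it internally. The rules, the `Crawlzilla` user-agent token, and the example.com URLs are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied inline so nothing is fetched.
# The "Crawlzilla" token is an assumption; the thread does not say which
# token the crawler actually honors.
RULES = """\
User-agent: Crawlzilla
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, rules=RULES):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler would run a check like this before every request.
print(allowed("Crawlzilla/1.0", "http://www.example.com/index.html"))
print(allowed("SomeOtherBot/2.0", "http://www.example.com/index.html"))
print(allowed("SomeOtherBot/2.0", "http://www.example.com/private/x.html"))
```

A crawler that skips this check, or applies it only to some of its fetchers, is the kind of misbehavior webmasters end up spotting in their logs.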
As the gentleman before me stated, we don't currently have much information up about what's going on. We hope to have full information up as soon as we can, along with a feedback forum where you can report any problems our robots cause so that we can fix them as soon as possible.
Here is a list of the robots owned by us that are currently in testing:
Crawlzilla/1.0
Crawlzilla-webbot-beta
Crawlzilla-bot-crawler-beta
Crawlzilla~spider-bot.1.0
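For webmasters who want to flag these names in their logs, here is a small sketch. It assumes each robot sends a user-agent string that begins with the name exactly as listed above. That holds for the Crawlzilla/1.0 log lines earlier in this thread, but it is only a guess for the other three. `KNOWN_AGENTS` and `is_listed_agent` are hypothetical names.

```python
# Names copied from the list above; matching by prefix is an assumption.
KNOWN_AGENTS = (
    "Crawlzilla/1.0",
    "Crawlzilla-webbot-beta",
    "Crawlzilla-bot-crawler-beta",
    "Crawlzilla~spider-bot.1.0",
)


def is_listed_agent(user_agent):
    """Return True if the user-agent string starts with one of the listed names."""
    ua = user_agent.strip().lower()
    return any(ua.startswith(name.lower()) for name in KNOWN_AGENTS)


seen = "Crawlzilla/1.0 (Crawlzilla; http//www.crawlzilla.com; crawler@crawlzilla.com)"
print(is_listed_agent(seen))
print(is_listed_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"))
```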
Once again, if our crawlers and spiders are misbehaving, please let us know. Our goal is not to harm anyone's site, nor do we want anyone to get the wrong impression. We have currently indexed approximately 4 terabytes of data and have an index of somewhere around 800 million pages spread across 55 servers.
We will have the engine public around Nov. 1, 2004, as we are still testing and fixing bugs.
If you would like more information, please do not hesitate to contact us at crawler@crawlzilla.com.
Thank you.
We have been running worldwide crawls with this crawler
I'm likely the most difficult participant in this forum to appease :(
Perhaps you could say when you began spidering with a UA?
We have currently indexed approximately 4 terabytes of data and have an index of somewhere around 800 million pages spread across 55 servers.
Likely much of that was spidered WITHOUT a UA?
I have this log line from March of 2004:
66.235.184.65 - - [24/Mar/2004:18:41:37 -0800] "GET /myfolder[case error]/mypage.htm HTTP/1.1" 404 - "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
Perhaps I'm overlooking something?
Most everybody here recognizes me as narrow-minded; however, you have a history of crawling without identity, and yet you still request access even though your pages are not functional?
Thanks in advance
66.235.184.65 - - [24/Mar/2004:18:41:37 -0800] "GET /myfolder[case error]/mypage.htm HTTP/1.1" 404 - "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
Looking at the date, 24/Mar/2004, I can tell you without a shadow of a doubt that is not us. We only started crawling on 8/5/2004, and that was mostly test crawls on a very small number of sites. Although the IP may be matching, we did not own that IP at that time. We only acquired that IP in September.
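A side note on the "matching" IP: the March hit came from 66.235.184.65, while the October crawls came from 66.235.180.***, so whether the two "match" depends on how large a block you compare. Here is a sketch with Python's standard `ipaddress` module. The prefix lengths are picked only for illustration, since the actual size of the allocation is not given in this thread. The final octet of the October address is masked in the logs, so `.1` below is a placeholder.

```python
import ipaddress


def same_network(ip_a, ip_b, prefix_len):
    """Return True if ip_b falls in the network of size prefix_len containing ip_a."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


march_ip = "66.235.184.65"   # from the March 2004 log line
october_ip = "66.235.180.1"  # placeholder: last octet is masked in the thread

# Prefix lengths here are illustrative assumptions, not the real allocation.
for prefix_len in (24, 16):
    print(prefix_len, same_network(march_ip, october_ip, prefix_len))
```

Compared as /24 blocks the two addresses differ. They only coincide if a much wider range is treated as a single owner.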
Sorry for any confusion.
Thank you,
Mike Dell
(snip)
[edited by: volatilegx at 2:17 pm (utc) on Oct. 12, 2004]
[edit reason] No signatures please [/edit]