First one is gw.ocg-corp.com 209.126.176.3
Sometimes it reverse-resolves and sometimes it doesn't.
They aren't in NetSol's database as of 2002-05-13.
They've grabbed 18,000 pages from my site starting on May 6th,
and are still going strong. Never seen them before this.
Below is the only thing I can find on them, which has to do
more with their upstream provider, I presume.
California Regional Internet, Inc. (NETBLK-CARI)
8929A COMPLEX DRIVE
SAN DIEGO, CA 92123
US
Netname: CARI
Netblock: 209.126.128.0 - 209.126.207.255
Maintainer: CALI
Coordinator:
California Regional Intranet, Inc. (IC63-ARIN) sysadmin@cari.net
858-974-5080
Domain System inverse mapping provided by:
NS1.ASPADMIN.COM 216.98.128.74
NS2.ASPADMIN.COM 216.98.128.75
ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE
Record last updated on 18-Mar-2002.
Database last updated on 12-May-2002 19:57:36 EDT.
Last three traceroute hops:
19 94 ms 94 ms 94 ms g7.ba21.b006588-1.san01.atlas.cogentco.com [66.28.66.106]
20 97 ms 94 ms 94 ms ge1-2.gw65-02.kmc01.sdcix.net [66.28.28.126]
21 94 ms 94 ms 93 ms gw.ocg-corp.com [209.126.176.3]
________________________________________
Here's the second one I can't find:
66.237.60.* (last quad varies a lot)
They've grabbed 33,000 pages this month alone. In April they got
46,000 from the main site and 64,000 from the mirror site. In
March it was only 300 from one site. This bot will even do dynamic
pages if they aren't disallowed.
Their provider appears to be:
XO Communications (NETBLK-XOX1-BLK-2)
1400 Parkmoor Avenue
San Jose, CA 95126-3429
US
Netname: XOX1-BLK-2
Netblock: 66.236.0.0 - 66.237.255.255
Maintainer: XOXO
Coordinator:
DNS and IP ADMIN (DIA-ORG-ARIN) hostmaster@CONCENTRIC.NET
(408) 817-2800
Fax: (408) 817-2630
Domain System inverse mapping provided by:
NAMESERVER1.CONCENTRIC.NET 207.155.183.73
NAMESERVER2.CONCENTRIC.NET 207.155.184.72
NAMESERVER3.CONCENTRIC.NET 206.173.119.72
NAMESERVER.CONCENTRIC.NET 207.155.183.72
Last hops on traceroute:
13 54 ms 54 ms 53 ms ge5-3-1.RAR1.Washington-DC.us.xo.net [64.220.0.222]
14 143 ms 132 ms 132 ms p1-0-0.RAR1.SanJose-CA.us.xo.net [65.106.0.38]
15 145 ms 134 ms 133 ms p0-0-0-1.RAR2.SanJose-CA.us.xo.net [65.106.1.66]
16 134 ms 132 ms 132 ms p15-0.DCR1.DC-Fremont-CA.us.xo.net [65.106.2.154]
17 135 ms 133 ms 133 ms 205.158.60.38
18 134 ms 135 ms 145 ms 205.158.60.50
19 136 ms 133 ms 133 ms 66.237.60.44
Both of these bots check the robots.txt file regularly. Sorry, I don't log User-agent.
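For anyone who does want the agent string in their logs, Apache's stock "combined" format adds Referer and User-Agent to the common log format. A httpd.conf sketch (the log path is a placeholder):

```apache
# The "combined" format appends Referer and User-Agent in quotes:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/httpd/access_log combined
```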
Why!! Would you allow somebody to travel YOUR site and absorb YOUR bandwidth when you aren't pleased with their activity?
CARI
Netblock: 209.126.128.0 - 209.126.207.255
QNI1BLK
Netblock: 209.126.0.0 - 209.126.127.255
209.126.208+ is not in use.
Solution:
deny 209.126.
I denied the XO blocks
66.236. & 66.237.
from the following
"Openfind data gatherer, Openbot/3.0+(robot-response@openfind.com.tw;+http://www.openfind.com.tw/robot.html)", 66.237.60.45, 04/08/02
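In .htaccess terms, denies on partial addresses like those above look something like this (Apache mod_access syntax; a trailing-dot partial IP matches everything under that prefix):

```apache
# .htaccess sketch: allow everyone, then deny the listed prefixes.
Order Allow,Deny
Allow from all
Deny from 209.126.
Deny from 66.236.
Deny from 66.237.
```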
I found some info on Openfind, and put in a Disallow for their Openbot. Here's the info:
[commercenet.org.tw...]
On that larbin_2.6.2, I'll try that as the User-agent in robots.txt and see if it goes away. It looks like larbin is sort of a French equivalent to Wget -- a spider that anyone can download and use. It's a general crawler, not connected to any indexing system.
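For the record, the robots.txt entries would look like this, with the agent names taken from the log lines in this thread (no guarantee larbin honors robots.txt at all, as noted below, but a compliant bot would obey these):

```
User-agent: Openbot
Disallow: /

User-agent: larbin_2.6.2
Disallow: /
```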
209.126.176.3 - - [12/May/2002:00:37:34 -0700] "GET / HTTP/1.0" 403 - "-" ""Opera/6.01 larbin@unspecified.mail"
217.127.255.221 - - [12/May/2002:00:39:42 -0700] "GET / HTTP/1.0" 200 10182 "" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
209.126.176.3 - - [12/May/2002:00:44:50 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" ""Opera/6.01 (larbin@unspecified.mail)"
Amazing coincidence that another IP read the robots in the exact order of log entries.
Dear sir:
Please advise who has the domain gw.ocg-corp.com [209.126.176.3], or who I should contact for this information.
They are not listed in the Network Solutions database.
They have crawled 18,000 pages on our site since May 6.
The User-agent is larbin_2.6.2, which is a downloadable crawler from France, and is not connected to any indexing system. No email address is listed in the User-agent.
A search on Google for either the domain name or the IP number (sometimes it doesn't reverse-resolve) reveals some crawled access-log statistics from some sites that make them available to Google. It appears that this particular bot has been very active on a number of sites.
Regards,
[deleted]
I have 312 robots.txt accesses from 209.126.176.3 since May 6, and 198 from gw.ocg-corp.com. I was fooled by this into thinking he was a "good" bot; apparently it's either done automatically by the larbin software and can get overridden, or he's doing it as a smokescreen.
It's curious that he can be so active around the web with just one IP number. There are a lot of innocent websites on that same Class C, judging from a lookup of all 255 numbers. He must have that one IP connected to a T3 or something, and a round-robin of PCs pumping out the GETs.
[webmasterworld.com...]
He's all over the web. I looked at the logs in other domains on our Class C, and he's been busy there too.
You have to "grep" separately for 209.126.176.3 and gw.ocg-corp.com because it only reverse-resolves about half the time.
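A one-pass alternative is grep's alternation, so neither form slips through. A sketch (the sample log lines here are made up to mirror the ones above; point the grep at your real access log):

```shell
# Made-up sample log: the same crawler shows up as the raw IP or as the
# reverse-resolved hostname, depending on whether the PTR lookup worked.
cat > sample_log <<'EOF'
209.126.176.3 - - [12/May/2002:00:37:34 -0700] "GET / HTTP/1.0" 403 -
gw.ocg-corp.com - - [12/May/2002:00:40:10 -0700] "GET /a.html HTTP/1.0" 403 -
217.127.255.221 - - [12/May/2002:00:39:42 -0700] "GET / HTTP/1.0" 200 10182
EOF

# -E enables alternation, -c counts matching lines; dots are escaped so
# they match literal dots rather than any character.
grep -Ec '209\.126\.176\.3|gw\.ocg-corp\.com' sample_log
# prints: 2
```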
Webmasters, feel free to complain to the sysadmin. This guy should be taken down. He gives bots a bad name, which is not what we need these days.
Don't just block -- block AND complain!
This guy at [phonezilla.net...] saw it, too (perhaps one of you) and mentions it in his blog.
Assuming that you never need to forward-resolve from a domain name to an IP number (which this guy didn't do and couldn't do since he wasn't in the root servers), how easy is it for someone to forge a domain for reverse-resolving?
Does it take the cooperation of the sysadmin of the name servers? Does the upstream provider have to be involved? I don't know much about DNS. I suspect that this guy's upstream provider is only involved to the point of a single 4-quad IP number, and everything else this guy is doing himself. But the fact that he's forging the domain, forging the e-mail in the User-agent, forging the User-agent, and ignoring the contents of robots.txt, ought to be enough to get the upstream provider involved.
It's the wild, wild west on the web. Let's go rob some banks! The New York Times has a piece today about highly-organized credit-card black marketeering with stolen numbers, most of it happening in Russia, where they sell the numbers from chat rooms.
But this guy was just sucking up websites. Where's the profit? What's on his mind? Is he hoping to harvest something interesting this way?
Everyman,
When I first began with htaccess, I was naive enough to think that if I mailed ISPs and backbone providers about these infractions, showing violations of their own TERMS, the providers might take some action :-(
The few responses I did receive were quite interesting. The majority of the time there was no response at all. As a result I ceased notification and just add the denies.
I was also fortunate (or unfortunate) to learn early about leniency. Every time I denied a solitary lone IP # (ex: 209.126.176.3), the visitor quickly returned with a different # in the last quad.
As a result I've learned to deny blocks at the highest level possible, while still keeping some kind of minimum, WITHOUT causing myself too much work in the process.
Take this block as an example:
Netblock: 209.126.128.0 - 209.126.207.255
I would never take the time to list the third quad (128-207) individually, unless I had a valued visitor contained in that block.
As a result, a portion of innocent Missouri suffers (QNI1BLK, Netblock: 209.126.0.0 - 209.126.127.255).
I would never allow a visitor (IP) to travel my site as extensively as you have allowed these two.
At times it is really surprising what visitors think they can do while visiting a website, without retaliation or restrictions imposed.
I've had visitors read a 403 page dozens of times because they just can't believe they've been stopped.
I had a DE IP one night use software repeatedly for nearly four hours in an attempt to overload the site's CPU and gain entry, because I had implemented a 403 response page. When I removed the 403 response page, the visitor tried once more and has not been back since (at least not that I'm aware of).
The content of OUR websites and their pages is our property and our home. You wouldn't allow a visitor to ramble around every nook and cranny of your personal home, would you?
Why allow it on your website?
There are entire continents and regions which can gain no benefit from my website's content (except to grab email addys), yet they persist in attempting to visit.
I would never allow a visitor (IP) to travel my site as extensively as you have allowed these two.
I watch CPU load mostly. Our bandwidth isn't a problem. The CPU load was a problem when the site was 99 percent dynamic. Now it's duplicated with static files. As long as the bots stay out of cgi-bin, they don't even make a dent in CPU load any more, no matter how fast they crawl. And if they get into cgi-bin, by now I have auto-detection that locks out anyone fetching too fast.
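The auto-detection isn't spelled out here; as a rough sketch of the idea, you could count hits per IP over a recent slice of the access log and flag anything over a threshold (the threshold, file names, and sample IPs below are all invented for the example):

```shell
# Toy fast-fetcher detector: flag any IP with more than MAX hits in a
# recent log slice. MAX and the file names are made up.
MAX=5

# A made-up sample of recent client IPs, one per request:
cat > recent_ips <<'EOF'
66.237.60.44
66.237.60.44
66.237.60.44
66.237.60.44
66.237.60.44
66.237.60.44
10.0.0.1
10.0.0.1
EOF

# sort+uniq -c counts hits per IP; awk prints only the offenders.
sort recent_ips | uniq -c | awk -v max="$MAX" '$1 > max {print $2}'
# prints: 66.237.60.44
```

The flagged addresses could then be fed into a deny list or firewall rule.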
But the conversion to static files keeps CPU load at least ten times lower than serving the same pages through CGI. I don't have a lot to worry about anymore.
These two bots were very polite -- about once every thirty seconds they'd GET a page. By Google's crawling standards, that's extremely polite. That's why it took me so long to get curious. Both of these bots seemed tame. The difference is that I need Google, but I don't need a bot from Taiwan and another that forges everything he can possibly forge, and breaks all the rules. But I admit, they flew under my radar for a time. Speaking of impolite, AltaVista can go absolutely crazy -- even getting into some insane loop on static files, believe it or not.
This guy won't be back, unless he reads this thread and wants to teach me a lesson. The nice thing about having your own router is that to him, it looks like we are off-line. His requests just hang until his bot times out -- no response at all. That's what it means to "black hole" him; the router discards anything from his IP number as if it had never arrived in the first place, and our server never sees it. Very clean and efficient. He won't waste time on us when there are so many more sites to crawl out there.
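On a Linux-based router, that kind of black hole can be set up with either a packet-filter drop or a null route. These are sketches, not the poster's actual configuration; exact syntax varies by router, and both commands need root:

```
# Inbound filter: packets from the IP are discarded with no reply,
# so the server never sees them (Linux iptables example):
iptables -A INPUT -s 209.126.176.3 -j DROP

# Alternative on routers without a packet filter: null-route the return
# path, so any reply toward that address is silently discarded.
ip route add blackhole 209.126.176.3/32
```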