
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


Two mystery spiders

Heavy crawlers

     
7:26 pm on May 13, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


Can anyone identify these two spiders?

First one is gw.ocg-corp.com 209.126.176.3

Sometimes it reverse-resolves and sometimes it doesn't.

They aren't in NetSol's database as of 2002-05-13.

They've grabbed 18,000 pages from my site starting on May 6th,
and are still going strong. Never seen them before this.

Below is the only thing I can find on them, which has to do
more with their upstream provider, I presume.

California Regional Internet, Inc. (NETBLK-CARI)
8929A COMPLEX DRIVE
SAN DIEGO, CA 92123
US

Netname: CARI
Netblock: 209.126.128.0 - 209.126.207.255
Maintainer: CALI

Coordinator:
California Regional Intranet, Inc. (IC63-ARIN) sysadmin@cari.net
858-974-5080

Domain System inverse mapping provided by:

NS1.ASPADMIN.COM 216.98.128.74
NS2.ASPADMIN.COM 216.98.128.75

ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE

Record last updated on 18-Mar-2002.
Database last updated on 12-May-2002 19:57:36 EDT.

Last three traceroute hops:

19 94 ms 94 ms 94 ms g7.ba21.b006588-1.san01.atlas.cogentco.com [66.28.66.106]
20 97 ms 94 ms 94 ms ge1-2.gw65-02.kmc01.sdcix.net [66.28.28.126]
21 94 ms 94 ms 93 ms gw.ocg-corp.com [209.126.176.3]

________________________________________

Here's the second one I can't find:

66.237.60.* (last quad varies a lot)

They've grabbed 33,000 pages this month alone. In April they got
46,000 from the main site and 64,000 from the mirror site. In
March it was only 300 from one site. This bot will even do dynamic
pages if they aren't disallowed.

Their provider appears to be:

XO Communications (NETBLK-XOX1-BLK-2)
1400 Parkmoor Avenue
San Jose, CA 95126-3429
US

Netname: XOX1-BLK-2
Netblock: 66.236.0.0 - 66.237.255.255
Maintainer: XOXO

Coordinator:
DNS and IP ADMIN (DIA-ORG-ARIN) hostmaster@CONCENTRIC.NET
(408) 817-2800
Fax- - - (408) 817-2630

Domain System inverse mapping provided by:

NAMESERVER1.CONCENTRIC.NET 207.155.183.73
NAMESERVER2.CONCENTRIC.NET 207.155.184.72
NAMESERVER3.CONCENTRIC.NET 206.173.119.72
NAMESERVER.CONCENTRIC.NET 207.155.183.72

Last hops on traceroute:

13 54 ms 54 ms 53 ms ge5-3-1.RAR1.Washington-DC.us.xo.net [64.220.0.222]
14 143 ms 132 ms 132 ms p1-0-0.RAR1.SanJose-CA.us.xo.net [65.106.0.38]
15 145 ms 134 ms 133 ms p0-0-0-1.RAR2.SanJose-CA.us.xo.net [65.106.1.66]
16 134 ms 132 ms 132 ms p15-0.DCR1.DC-Fremont-CA.us.xo.net [65.106.2.154]
17 135 ms 133 ms 133 ms 205.158.60.38
18 134 ms 135 ms 145 ms 205.158.60.50
19 136 ms 133 ms 133 ms 66.237.60.44

Both of these bots check the robots.txt file regularly. Sorry, I don't log User-agent.

7:30 pm on May 13, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 5, 2001
posts:2729
votes: 8


I'd like to say that the first one you mentioned, 'gw.ocg-corp.com', has been all over my sites this morning.
9:21 pm on May 13, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


Perhaps I'm dense and overbearing, but I have a question before adding some information.

Why would you allow somebody to travel YOUR site and absorb YOUR bandwidth when you're not pleased with their activity?

CARI
Netblock: 209.126.128.0 - 209.126.207.255

QNI1BLK
Netblock: 209.126.0.0 - 209.126.127.255

209.126.208+ is not in use.
Solution:
deny 209.126.

I denied the XO blocks
66.236. & 66.237.

from the following
Openfind data gatherer, Openbot/3.0+(robot-response@openfind.com.tw;+http://www.openfind.com.tw/robot.html)", 66.237.60.45, 04/08/02
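For reference, denies like those might look like the following sketch in Apache 1.3-era .htaccess syntax (this assumes the standard mod_access directives; a partial IP address matches every address beginning with those octets):

```
# .htaccess sketch (Apache 1.3 mod_access). A partial address such as
# "209.126" covers the whole 209.126.0.0 - 209.126.255.255 range.
<Limit GET POST>
  Order Allow,Deny
  Allow from all
  Deny from 209.126
  Deny from 66.236
  Deny from 66.237
</Limit>
```

With Order Allow,Deny, a matching Deny overrides the Allow, so everyone gets in except the listed blocks.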

11:18 pm on May 13, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


My apologies for replying further.

I might add that I had previously put the deny for 209.126.
in place on May 8th.

209.126.176.3 - - [07/May/2002:20:45:19 -0700] "GET /robots.txt HTTP/1.0" 200 2108 "-" "larbin_2.6.2 (larbin2.6.2@unspecified.mail)"

11:35 pm on May 13, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


Thanks, wilderness.

I found some info on Openfind, and put in a Disallow for their Openbot. Here's the info:

[commercenet.org.tw...]

On that larbin_2.6.2, I'll try that as the User-agent in robots.txt and see if it goes away. It looks like larbin is sort of a French equivalent to Wget -- a spider that anyone can download and use. It's a general crawler, not connected to any indexing system.
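The robots.txt entries for both bots might look like this sketch (most crawlers do substring matching on the User-agent token, so a bare "larbin" would also cover version suffixes; this assumes the bot honors robots.txt at all):

```
# robots.txt sketch: shut out both bots by User-agent token
User-agent: Openbot
Disallow: /

User-agent: larbin
Disallow: /
```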

12:51 am on May 14, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


here's some more info which I neglected to add previously:

209.126.176.3 - - [12/May/2002:00:37:34 -0700] "GET / HTTP/1.0" 403 - "-" ""Opera/6.01 larbin@unspecified.mail"
217.127.255.221 - - [12/May/2002:00:39:42 -0700] "GET / HTTP/1.0" 200 10182 "" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
209.126.176.3 - - [12/May/2002:00:44:50 -0700] "GET /robots.txt HTTP/1.0" 403 - "-" ""Opera/6.01 (larbin@unspecified.mail)"

Amazing coincidence that another IP read the robots.txt in that exact order of log entries.

12:54 am on May 14, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


To: sysadmin@cari.net

Dear sir:

Please advise who has the domain gw.ocg-corp.com [209.126.176.3], or who I should contact for this information.

They are not listed in the Network Solutions database.

They have crawled 18,000 pages on our site since May 6.

The User-agent is larbin_2.6.2, which is a downloadable crawler from France, and is not connected to any indexing system. No email address is listed in the User-agent.

A search on Google for either the domain name or the IP number (sometimes it doesn't reverse-resolve) reveals some crawled access-log statistics from some sites that make them available to Google. It appears that this particular bot has been very active on a number of sites.

Regards,
[deleted]

1:42 am on May 14, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


Darn, he's playing around with the User-agent; my robots.txt disallow will never work. Also, I saw an entry from a blogger via Google, who complained that this guy was ignoring his robots.txt and going into forbidden places on his site. Maybe the larbin_2.6.2 is just accessing the robots.txt to make itself look good, but doesn't care what's in it. I think I'll get him blocked at the router.

I have 312 robots.txt accesses from 209.126.176.3 since May 6, and 198 from gw.ocg-corp.com. I was fooled by this into thinking he was a "good" bot; apparently it's either done automatically by the larbin software and can get overridden, or he's doing it as a smokescreen.

It's curious that he can be so active around the web with just one IP number. There are a lot of innocent websites on that same Class C, judging from a lookup of all 255 numbers. He must have that one IP connected to a T3 or something, and a round-robin of PCs pumping out the GETs.

1:47 am on May 14, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


I thought that IP looked familiar.

[webmasterworld.com...]

2:13 am on May 14, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


So we have proof now that he a) changes the User-agent, which means that b) he necessarily ignores everything in robots.txt, and c) he uses either invalid or at least inconsistent email addresses, and d) NetSol has never heard of him.

He's all over the web. I looked at the logs in other domains on our Class C, and he's been busy there too.

You have to "grep" separately for 209.126.176.3 and gw.ocg-corp.com because it only reverse-resolves about half the time.

Webmasters, feel free to complain to the sysadmin. This guy should be taken down. He gives bots a bad name, which is not what we need these days.

Don't just block -- block AND complain!

2:25 am on May 14, 2002 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4875
votes: 15


I have seen this one going for a long time,

just by looking at WebTrends (my puny modem can't import my logs every day).

I see
larbin_2.6.2 larbin2.6.2@unspecified.mail
larbin_2.6.2 (larbin2.6.2@unspecified.mail)

I wonder what they are up to

3:34 am on May 14, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 4, 2002
posts:73
votes: 0


One of the "mystery" user agents I came across [webmasterworld.com ] was using an IP from cari/cali. Perhaps an easy proxy for folks with nefarious ideas.

This guy at [phonezilla.net ] saw it, too (perhaps one of you) and mentions it in his blog.

4:08 am on May 14, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


We've got him black-holed at the router now for our entire Class C.

Assuming that you never need to forward-resolve from a domain name to an IP number (which this guy didn't do and couldn't do since he wasn't in the root servers), how easy is it for someone to forge a domain for reverse-resolving?

Does it take the cooperation of the sysadmin of the name servers? Does the upstream provider have to be involved? I don't know much about DNS. I suspect that this guy's upstream provider is only involved to the point of a single 4-quad IP number, and everything else this guy is doing himself. But the fact that he's forging the domain, forging the e-mail in the User-agent, forging the User-agent, and ignoring the contents of robots.txt, ought to be enough to get the upstream provider involved.
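On the DNS question: the in-addr.arpa zone for a netblock is delegated to whichever nameservers the block's holder designates (here, NS1/NS2.ASPADMIN.COM), so whoever controls the IP can put any name at all in the PTR record. No registration and no outside cooperation is needed, which is why the name never had to be in the root servers. The giveaway is forward-confirmed reverse DNS: resolve the PTR name forward and see whether the IP comes back. A sketch with stubbed lookups standing in for real dig queries (the canned data mirrors this thread: the PTR claims gw.ocg-corp.com, but that name has no forward record):

```shell
# Forward-confirmed reverse DNS (FCrDNS) check. The two stub functions
# stand in for real lookups: dig +short -x "$ip" and dig +short "$name".
ptr_lookup()     { echo "gw.ocg-corp.com"; }  # PTR record, set by the IP's owner
forward_lookup() { echo ""; }                 # no A record: the name isn't registered

ip="209.126.176.3"
name=$(ptr_lookup "$ip")
if forward_lookup "$name" | grep -qx "$ip"; then
    echo "forward-confirmed"
else
    echo "PTR name not confirmed, likely forged"
fi
# prints: PTR name not confirmed, likely forged
```

A legitimate host passes this check because its forward A record points back at the same IP; a vanity or forged PTR fails it.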

It's the wild, wild west on the web. Let's go rob some banks! The New York Times has a piece today about highly-organized credit-card black marketeering with stolen numbers, most of it happening in Russia, where they sell the numbers from chat rooms.

But this guy was just sucking up websites. Where's the profit? What's on his mind? Is he hoping to harvest something interesting this way?

4:33 am on May 14, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


<snip>To: sysadmin@cari.net</snip>

Everyman,
When first beginning with htaccess, I was actually naive enough to think that if I mailed ISPs and backbone providers about these infractions, pointing out the violations of their own TERMS, the providers might take some action :-(

The few responses I did receive were quite interesting; the majority of the time there was no response at all. As a result I ceased notification and just add the denies.

I was also fortunate (or unfortunate) to learn early about leniency. Every time I denied a solitary lone IP (e.g. 209.126.176.3), the visitor quickly returned with a different number in the last octet.
As a result I've learned to explore blocks to the highest level possible, while still maintaining some kind of minimum, WITHOUT causing myself too much work in the process.
For example, with this block:
Netblock: 209.126.128.0 - 209.126.207.255
I would never take the time to list the third octet individually, unless I had a valued visitor contained in that block.
As a result, a portion of innocent Missouri suffers (QNI1BLK,
Netblock: 209.126.0.0 - 209.126.127.255).

I would never allow a visitor (IP) to travel my site as extensively as you have allowed these two.
At times it is really surprising what visitors think they can do while visiting a website, without retaliation or restrictions imposed.
I've had visitors read a 403 page dozens of times because they just can't believe they've been stopped.
I had a DE IP one night use software repeatedly for nearly four hours in an attempt to overload the site's CPU and gain entry, because I had implemented a 403 response page. When I removed the 403 response page, the visitor tried once more and has not been back since (at least not that I'm aware of).

The content of OUR websites and their pages is our property and our home. You wouldn't let a visitor ramble around every nook and cranny of your personal home, would you?
Why allow it on your website?

There are entire continents and regions which can gain no benefit from my websites' content (except to grab email addresses), yet they persist in attempting to visit.

5:09 am on May 14, 2002 (gmt 0)

Preferred Member

joined:Apr 13, 2001
posts:360
votes: 0


"I would never allow a visitor (IP) to travel my site as extensively as you have allowed these two."

I watch CPU load mostly. Our bandwidth isn't a problem. The CPU load was a problem when the site was 99 percent dynamic. Now it's duplicated with static files. As long as the bots stay out of cgi-bin, they don't even make a dent in CPU load any more, no matter how fast they crawl. And if they get into cgi-bin, by now I have auto-detection that locks out anyone fetching too fast.
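That kind of auto-detection, locking out anyone who hits cgi-bin too fast, can be sketched as a per-IP tally over a recent log slice. The filename and entries below are hypothetical; a real version would window by timestamp and feed the flagged IPs into the deny list:

```shell
# Tally cgi-bin hits per client in a recent log slice and print any
# client over the threshold: candidates for an automatic block.
MAX=2
cat > /tmp/log_slice <<'EOF'
209.126.176.3 "GET /cgi-bin/search?q=a"
209.126.176.3 "GET /cgi-bin/search?q=b"
209.126.176.3 "GET /cgi-bin/search?q=c"
217.127.255.221 "GET /index.html"
EOF
awk -v max="$MAX" '/cgi-bin/ {n[$1]++}
    END {for (ip in n) if (n[ip] > max) print ip}' /tmp/log_slice
# prints 209.126.176.3
```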

But the conversion to static files is at least ten times better than CGI in terms of keeping CPU load down. I don't have a lot to worry about anymore.

These two bots were very polite -- about once every thirty seconds they'd GET a page. By Google's crawling standards, that's extremely polite. That's why it took me so long to get curious. Both of these bots seemed tame. The difference is that I need Google, but I don't need a bot from Taiwan and another that forges everything he can possibly forge, and breaks all the rules. But I admit, they flew under my radar for a time. Speaking of impolite, AltaVista can go absolutely crazy -- even getting into some insane loop on static files, believe it or not.

This guy won't be back, unless he reads this thread and wants to teach me a lesson. The nice thing about having your own router is that to him, it looks like we are off-line. His requests just hang until his bot times out -- no response at all. That's what it means to "black hole" him; the router discards anything from his IP number as if it had never arrived in the first place, and our server never sees it. Very clean and efficient. He won't waste time on us when there are so many more sites to crawl out there.
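For anyone without a dedicated router, the same silent-discard behavior is available at the firewall or host level. Two illustrative sketches (syntax varies by platform and these are examples, not exact commands for any particular box):

```
# Linux netfilter: drop everything from the IP with no response at all,
# so the client's requests simply hang until they time out
iptables -A INPUT -s 209.126.176.3 -j DROP

# Cisco-style inbound access list: discard the one IP, permit the rest
access-list 101 deny ip host 209.126.176.3 any
access-list 101 permit ip any any
```

A DROP (as opposed to a REJECT) sends nothing back, which is what makes the site appear to be off-line from the bot's point of view.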
