
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


ia archiver

random hits from China

     
9:23 pm on Mar 30, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Anyone else seeing random ia_archiver hits from Chinese broadband IPs?

Either ia_archiver has gone distributed or it's being forged in the hopes of scraping or worse.

A quick check on one IP does not show any open ports but it was only one IP.
2:05 am on Mar 31, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13736
votes: 458


fwiw, ia_archiver is one of only two robots in my experience who have promptly answered e-mail along the lines of "Is this your robot?" Don't remember their contact address offhand, but it's somewhere at archive dot org and wasn't hard to find.
2:32 am on Mar 31, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8600
votes: 381



Block both the Internet Archive and China :)
9:17 am on Mar 31, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14663
votes: 99


I've been seeing a ton of ia_archiver from China and it requests and then FAILS to honor robots.txt so far.

Then it endlessly asks for index.html like a spoiled brat when I serve it up a nice steaming 403 forbidden:

2012-03-18,16:58:30,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:01:09,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:02:19,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:08:43,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:37,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:39,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:14:07,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:18:21,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:18:53,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:19:32,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:19:59,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:20:49,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:23:11,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,18:04:26,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,18:15:54,114.218.226.235,"ia_archiver","/index.html"

Just a small sample of one of their tantrums.
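For anyone who wants to put numbers on a burst like this, here is a minimal Python sketch. It assumes the log is comma-separated in the column order shown above (date, time, IP, user agent, path). The function names and the shortened three-line sample are illustrative only, not part of anyone's actual setup.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Three lines copied from the sample above; a real log would be read from a file.
SAMPLE = '''2012-03-18,16:58:30,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:01:09,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:02:19,114.218.226.235,"ia_archiver","/index.html"
'''

def parse_hits(text):
    """Yield (timestamp, ip, user_agent, path) tuples from the CSV log text."""
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        date, time, ip, ua, path = row
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        yield stamp, ip, ua, path

def hits_per_ip(text, ua_substring="ia_archiver"):
    """Count requests per IP whose user agent contains ua_substring."""
    return Counter(ip for _, ip, ua, _ in parse_hits(text) if ua_substring in ua)

def gaps_in_seconds(text):
    """Return the gaps between consecutive requests, in seconds."""
    stamps = sorted(stamp for stamp, *_ in parse_hits(text))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

if __name__ == "__main__":
    print(hits_per_ip(SAMPLE))
    print(gaps_in_seconds(SAMPLE))
```

Counting per IP and looking at the gaps makes it easier to tell a single retrying client from a distributed crawl.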
10:23 am on Mar 31, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13736
votes: 458


114.216.0.0/13 is China?! I've never set eyes on anything in the whole range. (Thank you, Spotlight.)

:: scrambling to add yet another line to Yellow Peril* section of htaccess ::

Nothing ever asks for my index.html. They pick something at random from either /fun/ or /ebooks/ ** and gobble gobble gobble until they get bored and move on to some other site.


* Uhm. Sorry.
** If they tried it with /fonts/ or /hovercraft/ they'd find me waiting for them with a bazooka.
2:04 pm on Mar 31, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


fwiw, ia_archiver is one of only two robots in my experience who have promptly answered e-mail along the lines of "Is this your robot?"


lucy,
I'd had multiple IP's of ia denied since 1999.
Four-five years ago I decided that I'd like my then-sites in their archive and was unable to determine all the IP's (at least the primary one which would remove the "exclusion note from their server"), and contacted them.

They did respond, however when I explained the IP confusion and asked for confirmation of the IP's they used?
That was the end of communications ;)

Today, I'm content that the 4-5-year-ago change failed, because I would not be pleased with my sites in their archive.
8:45 pm on Mar 31, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Bill - similar pattern here.

I've blocked ia_archiver for pretty-near ever.

Lucy - I have China under various ISPs going down to 114.208 (below that, down to 200, is a KR block) and up to 114.255.
10:07 pm on Mar 31, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13736
votes: 458


:: detour to investigate as expeditiously as possible before Whois starts yattering about carbon-based lifeforms* ::

114.208-211 + 114.212 + 114.213 + 114.214 + 114.215 + 114.216-223 + 114.224-239 + 114.240-255
=
114.208-223 = 114.208.0.0/12
114.224-255 = 114.224.0.0/11

If it were the US I'd mark all those providers separately, but if I'm blocking the whole country it doesn't matter.
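The merge above can be double-checked with Python's standard `ipaddress` module. This is only a sketch: the `PIECES` list restates the second-octet ranges from the sum (with the sixth piece taken as 216-223 so that the pieces are contiguous), and `collapse_pieces` is a made-up helper name.

```python
import ipaddress

# (first, last) second-octet ranges from the sum above, all under 114.x.
PIECES = [(208, 211), (212, 212), (213, 213), (214, 214), (215, 215),
          (216, 223), (224, 239), (240, 255)]

def collapse_pieces(first_octet, pieces):
    """Expand each (lo, hi) range into /16 networks and merge them."""
    nets = [ipaddress.ip_network(f"{first_octet}.{second}.0.0/16")
            for lo, hi in pieces
            for second in range(lo, hi + 1)]
    return list(ipaddress.collapse_addresses(nets))

if __name__ == "__main__":
    for net in collapse_pieces(114, PIECES):
        print(net)
```

The two blocks cannot be merged further: 114.208-255 spans 48 /16s, which is not a power of two, so it takes one /12 plus one /11.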

Oops, here's another at 116.0.8.0/21 (nagvaaqtara) and 116.0.24.0/21 (nanijara) with Australia tucked in between. But those are too small to bother blocking unless they start actively vexing me.

Now, if you could do Allow/Deny in layers... ("Block this whole A class except this B piece, and then block these C ranges except for this D bit in the middle") Yes, I realize you could spell it all out in mod_rewrite, but Regular Expressions simply weren't designed for a binary system.
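Outside of htaccess, the "layers" idea is essentially a most-specific-match lookup. Here is a rough Python sketch of that, not a drop-in replacement for Allow/Deny. The `RULES` table uses made-up ranges chosen purely to show the nesting, and `decide` is a hypothetical helper.

```python
import ipaddress

# Hypothetical rules (not real allocations): the most specific match wins.
RULES = [
    ("114.0.0.0/8", "deny"),     # block the whole "A class" ...
    ("114.32.0.0/12", "allow"),  # ... except this piece ...
    ("114.40.0.0/16", "deny"),   # ... then re-block a range inside it ...
    ("114.40.8.0/24", "allow"),  # ... except one small bit in the middle
]

def decide(ip, rules=RULES, default="allow"):
    """Return the action of the longest-prefix rule containing ip."""
    addr = ipaddress.ip_address(ip)
    best = None
    for cidr, action in rules:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best[1] if best else default

if __name__ == "__main__":
    for ip in ("114.218.226.235", "114.33.1.1", "114.40.1.1", "114.40.8.9"):
        print(ip, decide(ip))
```

Because the comparison is on prefix length rather than rule order, the rules can be listed in any order and nested as deeply as needed.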


* They can't be too serious, though, because I've never been asked to redo a captcha.
10:58 pm on Mar 31, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8600
votes: 381


@Lucy - I don't bother with China servers, colos, etc. I just block the country ranges (of which there are many).

China country range
114.0.0.0 - 114.255.255.255
114.0.0.0/8
3:49 am on Apr 1, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13736
votes: 458


Rotten luck for those poor Filipinos at 114.108.192-255 :) No idea why I've got that particular range labeled, but it's the only thing in the area that isn't China-colored. No, wait, there are some Thais in the neighborhood too. Can't remember what they were looking for, but I've had bona fide human visitors from Thailand. Well, some of those blocked Chinese visitors were probably human too, but that's just tough on them.

Don't think they were doing outside research for TIA, though.

While tucking away China I stumbled across what must be the world's smallest country block. Papua New Guinea has a quarter of a D range (aa.bb.cc.dd/26) somewhere. Even Guam has bigger pieces. Heh.
4:12 am on Apr 1, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


If keyplyr is catching heat for a mere 114 Class A?

Try this on for size?
# Deny (403) any request whose first octet is 110-119 or 121-126
RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\.
RewriteRule ^ - [F]
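To see exactly which first octets those two patterns cover, they can be run against every value from 0 to 255. This is a small Python sketch. Python's `re` handles these simple patterns the same way Apache's regex engine does, but `blocked_first_octets` is only an illustrative helper.

```python
import re

# The two REMOTE_ADDR patterns from the conditions above.
PATTERNS = [re.compile(r"^11[0-9]\."), re.compile(r"^12[1-6]\.")]

def blocked_first_octets(patterns=PATTERNS):
    """Return the first octets (0-255) matched by at least one pattern."""
    return [octet for octet in range(256)
            if any(p.search(f"{octet}.0.0.1") for p in patterns)]

if __name__ == "__main__":
    print(blocked_first_octets())
```

Each matched octet is a full /8, so the list is worth checking before deploying a pattern like this.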
7:40 pm on Apr 1, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


The 114/8 block - there's Australia, NZ, Japan in there as well as IN, PH and TW (all of which I block on a per-site basis). Also BD and a few others. I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.

Apart from that there are a lot of /8 ranges assigned to APNIC, lots of which resolve to CN and similar. 114 is merely one of the newest.
10:07 pm on Apr 1, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.


h. None of the above
11:21 pm on Apr 1, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8600
votes: 381


I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.

In over a year blocking this China range, I have yet to deny even one legit, human visitor from AU, NZ or JP and I check every single request that is denied on a daily basis.

Just FYI - besides blocking all the hackers, scrapers, clippers, etc. coming from China, the main reason I block all China country ranges is political. Vietnam, Thailand, and others using those ranges get blocked as a bonus :)
12:27 am on Apr 2, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13736
votes: 458


RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]

Got a friend at ^120. ?

Punch line:
I made up an IP wholly at random to double-check, and landed squarely on
120.128.0.0/14
:: detour to shared htaccess ::

the main reason I block all China country ranges is political

Ay-yup.
1:50 am on Apr 2, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Got a friend at ^120. ?


Perhaps, I've just not had a malicious visitor from that Class A yet.

At one time I had "12[0-6]", however and without digging deeply, I cannot tell you what prompted me to change it.
9:31 pm on Apr 2, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14663
votes: 99


I'm thinking the "ia_archiver" hits from China are fakes (like I ever thought otherwise) because all the legit hits appear to come from AWS or archive.org only and have the following UA:

"ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"

The UA that I see coming from archive.org itself is the following:

"ia_archiver(OS-Wayback)"

The bots from China never check robots.txt, only go straight for index.html, and the non-China bots only check robots.txt and seem to go away after.
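Those observations can be turned into a rough first-pass filter. The Python sketch below is a heuristic only: `KNOWN_UAS` holds just the two strings quoted above, `classify` is a made-up helper, and user agents are trivially forged, so a match proves nothing on its own. Checking the source IP would be stronger, but the thread does not give the full ranges.

```python
# User-agent strings quoted above as coming from AWS or archive.org.
KNOWN_UAS = {
    "ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)",
    "ia_archiver(OS-Wayback)",
}

def classify(user_agent, fetched_robots_txt):
    """Label a visitor using the two signals described in the post."""
    if "ia_archiver" not in user_agent:
        return "not ia_archiver"
    if user_agent in KNOWN_UAS and fetched_robots_txt:
        return "plausibly genuine"
    # Bare or altered UA, or a client that skipped robots.txt.
    return "suspect"

if __name__ == "__main__":
    print(classify("ia_archiver", fetched_robots_txt=False))
    print(classify("ia_archiver(OS-Wayback)", fetched_robots_txt=True))
```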

However, they often ask multiple times a day like some spoiled brat that just won't take "NO!" for an answer:

2012-01-17,07:47:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,07:47:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,09:11:50,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,09:12:08,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,17:19:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,17:19:21,207.241.224.41,"ia_archiver(OS-Wayback)"

Yup, asked 2 times in the same second, twice that day, and multiple times during the day, and that's the genius server from their own IP range.

How often do they think robots.txt gets updated in a day?

If you wanted to have fun, try scraping via AWS using ia_archiver's UA, as people that allow it would probably think it's legit!
2:14 am on Apr 15, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14663
votes: 99


This has turned into madness: it has doubled from 50 attempts per day from China to 100 now. It's very insistent on getting something, and it's never gonna happen.
9:14 pm on Apr 15, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Yeah. Still coming at me, too. There must be a better way of killing them outright. :(