
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
ia archiver
random hits from China
dstiles




msg:4435423
 9:23 pm on Mar 30, 2012 (gmt 0)

Anyone else seeing random ia_archiver hits from Chinese broadband IPs?

Either ia_archiver has gone distributed or it's being forged in the hopes of scraping or worse.

A quick check on one IP does not show any open ports but it was only one IP.

 

lucy24




msg:4435485
 2:05 am on Mar 31, 2012 (gmt 0)

fwiw, ia_archiver is one of only two robots in my experience who have promptly answered e-mail along the lines of "Is this your robot?" Don't remember their contact address offhand, but it's somewhere at archive dot org and wasn't hard to find.

keyplyr




msg:4435488
 2:32 am on Mar 31, 2012 (gmt 0)


Block both the Internet Archive and China :)

incrediBILL




msg:4435526
 9:17 am on Mar 31, 2012 (gmt 0)

I've been seeing a ton of ia_archiver from China and it requests and then FAILS to honor robots.txt so far.

Then it endlessly asks for index.html like a spoiled brat when I serve it up a nice steaming 403 forbidden:

2012-03-18,16:58:30,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:01:09,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:02:19,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:08:43,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:37,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:39,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:14:07,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:18:21,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:18:53,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:19:32,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:19:59,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:20:49,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:23:11,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,18:04:26,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,18:15:54,114.218.226.235,"ia_archiver","/index.html"

Just a small sample of one of their tantrums.
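For anyone wanting to put numbers on a burst like the one above, here is a minimal sketch that counts hits and measures the gaps between them. It assumes the log is comma-separated in the order date, time, IP, UA, path, as in the sample; the function name `summarize` is made up for illustration, and only five of the lines are copied in.

```python
import csv
from collections import Counter
from datetime import datetime

# Five lines copied from the sample above (date,time,ip,"ua","path").
LOG = '''2012-03-18,16:58:30,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:01:09,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:02:19,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:37,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:39,114.218.226.235,"ia_archiver","/index.html"
'''

def summarize(log_text):
    """Return (hits per (ip, ua, path), gaps in seconds between consecutive hits)."""
    hits = Counter()
    stamps = []
    for date, time, ip, ua, path in csv.reader(log_text.splitlines()):
        hits[(ip, ua, path)] += 1
        stamps.append(datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"))
    gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
    return hits, gaps

hits, gaps = summarize(LOG)
print(hits.most_common(1))
print(gaps)
```

The gap list makes the irregular retry timing easy to see, including the requests that arrive only a couple of seconds apart.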

lucy24




msg:4435541
 10:23 am on Mar 31, 2012 (gmt 0)

114.216.0.0/13 is China?! I've never set eyes on anything in the whole range. (Thank you, Spotlight.)

:: scrambling to add yet another line to Yellow Peril* section of htaccess ::

Nothing ever asks for my index.html. They pick something at random from either /fun/ or /ebooks/ ** and gobble gobble gobble until they get bored and move on to some other site.


* Uhm. Sorry.
** If they tried it with /fonts/ or /hovercraft/ they'd find me waiting for them with a bazooka.

wilderness




msg:4435572
 2:04 pm on Mar 31, 2012 (gmt 0)

fwiw, ia_archiver is one of only two robots in my experience who have promptly answered e-mail along the lines of "Is this your robot?"


lucy,
I'd had multiple IP's of ia denied since 1999.
Four-five years ago I decided that I'd like my then-sites in their archive and was unable to determine all the IP's (at least the primary one which would remove the "exclusion note from their server"), and contacted them.

They did respond, however when I explained the IP confusion and asked for confirmation of the IP's they used?
That was the end of communications ;)

Today, I'm content that the 4-5-year-ago change failed, because I would not be pleased with my sites in their archive.

dstiles




msg:4435678
 8:45 pm on Mar 31, 2012 (gmt 0)

Bill - similar pattern here.

I've blocked ia_archiver for pretty-near ever.

Lucy - I have China under various ISPs going down to 114.208 (below that, down to 200, is a KR block) and up to 114.255.

lucy24




msg:4435712
 10:07 pm on Mar 31, 2012 (gmt 0)

:: detour to investigate as expeditiously as possible before Whois starts yattering about carbon-based lifeforms* ::

114.208-211 + 114.212 + 114.213 + 114.214 + 114.215 + 114.216-223 + 114.224-239 + 114.240-255
=
114.208-223 = 114.208.0.0/12
114.224-255 = 114.224.0.0/11
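The arithmetic above can be double-checked with Python's standard `ipaddress` module. This is only a sketch: the per-provider pieces are written out as CIDR blocks (the 216-223 piece as a /13), and the variable names are made up for illustration.

```python
import ipaddress

# The per-provider pieces, expressed as CIDR blocks.
pieces = [
    "114.208.0.0/14",  # 114.208-211
    "114.212.0.0/16",
    "114.213.0.0/16",
    "114.214.0.0/16",
    "114.215.0.0/16",
    "114.216.0.0/13",  # 114.216-223
    "114.224.0.0/12",  # 114.224-239
    "114.240.0.0/12",  # 114.240-255
]

# collapse_addresses merges adjacent and overlapping networks
# into the smallest equivalent list of CIDR blocks.
merged = list(ipaddress.collapse_addresses(ipaddress.ip_network(p) for p in pieces))
print([str(n) for n in merged])
```

The two results cannot merge any further, because a single block covering 208-255 would have to start at 192.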

If it were the US I'd mark all those providers separately, but if I'm blocking the whole country it doesn't matter.

Oops, here's another at 116.0.8.0/21 (nagvaaqtara) and 116.0.24.0/21 (nanijara) with Australia tucked in between. But those are too small to bother blocking unless they start actively vexing me.

Now, if you could do Allow/Deny in layers... ("Block this whole A class except this B piece, and then block these C ranges except for this D bit in the middle") Yes, I realize you could spell it all out in mod_rewrite, but Regular Expressions simply weren't designed for a binary system.
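Outside of htaccess, the layered idea is easy to express as "the most specific matching range wins". A minimal sketch follows, assuming the decision is made in application code rather than by Apache; the ranges in `RULES` are private-network placeholders, not real country blocks, and `decide` is a hypothetical name.

```python
import ipaddress

# Hypothetical layered rules, purely illustrative ranges:
# block the whole A, except this B, then block this C, except this D bit.
RULES = [
    ("10.0.0.0/8", "deny"),
    ("10.20.0.0/16", "allow"),
    ("10.20.30.0/24", "deny"),
    ("10.20.30.128/25", "allow"),
]

def decide(ip, rules=RULES, default="allow"):
    """Return the action of the longest (most specific) prefix containing ip."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net_text, action in rules:
        net = ipaddress.ip_network(net_text)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best[1] if best else default

for ip in ("10.1.2.3", "10.20.1.1", "10.20.30.5", "10.20.30.200", "192.0.2.1"):
    print(ip, decide(ip))
```

Because the comparison is on prefix length rather than rule order, the rules can be listed in any order and the nesting still resolves the same way.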


* They can't be too serious, though, because I've never been asked to redo a captcha.

keyplyr




msg:4435727
 10:58 pm on Mar 31, 2012 (gmt 0)

@Lucy - I don't bother with China servers, colos, etc. I just block the country ranges (of which there are many).

China country range
114.0.0.0 - 114.255.255.255
114.0.0.0/8

lucy24




msg:4435770
 3:49 am on Apr 1, 2012 (gmt 0)

Rotten luck for those poor Filipinos at 114.108.192-255 :) No idea why I've got that particular range labeled, but it's the only thing in the area that isn't China-colored. No, wait, there are some Thais in the neighborhood too. Can't remember what they were looking for, but I've had bona fide human visitors from Thailand. Well, some of those blocked Chinese visitors were probably human too, but that's just tough on them.

Don't think they were doing outside research for TIA, though.

While tucking away China I stumbled across what must be the world's smallest country block. Papua New Guinea has a quarter of a D range (aa.bb.cc.dd/26) somewhere. Even Guam has bigger pieces. Heh.

wilderness




msg:4435774
 4:12 am on Apr 1, 2012 (gmt 0)

If keyplyr is catching heat for a mere 114 Class A?

Try this on for size?
RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]
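To see exactly which first octets those two conditions cover, the patterns can be run against every possible leading octet. This is a rough sketch using Python's `re` module, which treats these simple patterns the same way mod_rewrite does; `blocked` and `covered` are names made up for illustration.

```python
import re

# The two patterns from the RewriteCond lines above.
PATTERNS = [re.compile(r"^11[0-9]\."), re.compile(r"^12[1-6]\.")]

def blocked(ip):
    """True if the dotted-quad string matches either pattern."""
    return any(p.search(ip) for p in PATTERNS)

# Try every possible first octet and keep the ones that match.
covered = [n for n in range(256) if blocked(f"{n}.0.0.1")]
print(covered)
```

That comes to sixteen /8 blocks: 110 through 119 and 121 through 126, with 120 left out.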

dstiles




msg:4435925
 7:40 pm on Apr 1, 2012 (gmt 0)

The 114/8 block - there's Australia, NZ, Japan in there as well as IN, PH and TW (all of which I block on a per-site basis). Also BD and a few others. I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.

Apart from that there are a lot of /8 ranges assigned to APNIC, lots of which resolve to CN and similar. 114 is merely one of the newest.

wilderness




msg:4435976
 10:07 pm on Apr 1, 2012 (gmt 0)

I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.


h. None of the above

keyplyr




msg:4435992
 11:21 pm on Apr 1, 2012 (gmt 0)

I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.

In over a year blocking this China range, I have yet to deny even one legit, human visitor from AU, NZ or JP and I check every single request that is denied on a daily basis.

Just FYI - besides blocking all the hackers, scrapers, clippers, etc. coming from China, the main reason I block all China country ranges is political. Vietnam, Thailand, and others using those ranges get blocked as a bonus :)

lucy24




msg:4436007
 12:27 am on Apr 2, 2012 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]

Got a friend at ^120. ?

Punch line:
I made up an IP wholly at random to double-check, and landed squarely on
120.128.0.0/14
:: detour to shared htaccess ::

the main reason I block all China country ranges is political

Ay-yup.

wilderness




msg:4436031
 1:50 am on Apr 2, 2012 (gmt 0)

Got a friend at ^120. ?


Perhaps, I've just not had a malicious visitor from that Class A yet.

At one time I had "12[0-6]", however and without digging deeply, I cannot tell you what prompted me to change it.

incrediBILL




msg:4436413
 9:31 pm on Apr 2, 2012 (gmt 0)

I'm thinking the "ia_archiver" hits from China are fakes (like I ever thought otherwise) because all the legit hits appear to come from AWS or archive.org only and have the following UA:

"ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"

The UA that I see coming from archive.org itself is the following:

"ia_archiver(OS-Wayback)"

The bots from China never check robots.txt, only go straight for index.html, and the non-China bots only check robots.txt and seem to go away after.

However, they often ask multiple times a day like some spoiled brat that just won't take "NO!" for an answer:

2012-01-17,07:47:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,07:47:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,09:11:50,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,09:12:08,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,17:19:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,17:19:21,207.241.224.41,"ia_archiver(OS-Wayback)"

Yup, asked 2 times in the same second, twice that day, and multiple times during the day, and that's the genius server from their own IP range.

How often do they think robots.txt gets updated in a day?

If you wanted to have fun, try scraping via AWS using ia_archiver's UA, as people that allow it would probably think it's legit!
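The UA-plus-source check described in this post could be sketched roughly as below. Big assumption: the trusted list is a placeholder holding only the single archive.org address that shows up in the log above; the real archive.org and AWS ranges would have to be confirmed with the operators and are not filled in here. The name `classify` is hypothetical.

```python
import ipaddress

# ASSUMPTION: placeholder allowlist. Only 207.241.224.41 appears in the
# logs in this thread; real ranges would need confirming with the operator.
TRUSTED_NETS = [ipaddress.ip_network("207.241.224.41/32")]

def classify(ip, user_agent):
    """Label a request by claimed UA and whether the source IP is trusted."""
    if "ia_archiver" not in user_agent:
        return "other"
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in TRUSTED_NETS):
        return "plausible"
    return "suspect"

print(classify("114.218.226.235", "ia_archiver"))
print(classify("207.241.224.41", "ia_archiver(OS-Wayback)"))
```

As the post points out, a UA match alone proves nothing, so the IP test is the part doing the real work.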

incrediBILL




msg:4440983
 2:14 am on Apr 15, 2012 (gmt 0)

This has turned into madness. It has doubled from 50 attempts per day from China to 100 now. It's very insistent on getting something, and it's never gonna happen.

dstiles




msg:4441129
 9:14 pm on Apr 15, 2012 (gmt 0)

Yeah. Still coming at me, too. There must be a better way of killing them outright. :(


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved