
Search Engine Spider and User Agent Identification Forum

    
ia archiver
random hits from China
dstiles

Msg#: 4435421 posted 9:23 pm on Mar 30, 2012 (gmt 0)

Anyone else seeing random ia_archiver hits from Chinese broadband IPs?

Either ia_archiver has gone distributed or it's being forged in the hopes of scraping or worse.

A quick check on one IP does not show any open ports but it was only one IP.

 

lucy24

Msg#: 4435421 posted 2:05 am on Mar 31, 2012 (gmt 0)

fwiw, ia_archiver is one of only two robots in my experience who have promptly answered e-mail along the lines of "Is this your robot?" Don't remember their contact address offhand, but it's somewhere at archive dot org and wasn't hard to find.

keyplyr

Msg#: 4435421 posted 2:32 am on Mar 31, 2012 (gmt 0)


Block both the Internet Archive and China :)

incrediBILL

Msg#: 4435421 posted 9:17 am on Mar 31, 2012 (gmt 0)

I've been seeing a ton of ia_archiver from China and it requests and then FAILS to honor robots.txt so far.

Then it endlessly asks for index.html like a spoiled brat when I serve it up a nice steaming 403 forbidden:

2012-03-18,16:58:30,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:01:09,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:02:19,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:08:43,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:37,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:13:39,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:14:07,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:18:21,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:18:53,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:19:32,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:19:59,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:20:49,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:23:11,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,18:04:26,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,18:15:54,114.218.226.235,"ia_archiver","/index.html"

Just a small sample of one of their tantrums.
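
For anyone who wants to measure a tantrum like this rather than eyeball it, here is a minimal sketch that counts repeat requests and the gaps between them. It assumes the log keeps the five-field layout shown above (date, time, IP, quoted UA, quoted path); the function name `summarize` is made up for illustration.

```python
import csv
from collections import Counter
from datetime import datetime

# Sample lines copied from the log excerpt above (date,time,ip,"ua","path").
SAMPLE = '''\
2012-03-18,16:58:30,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:01:09,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:02:19,114.218.226.235,"ia_archiver","/index.html"
2012-03-18,17:08:43,114.218.226.235,"ia_archiver","/index.html"
'''

def summarize(lines):
    """Count requests per (ip, ua, path) and collect the gaps between them in seconds."""
    counts = Counter()
    last_seen = {}
    gaps = {}
    for date, time, ip, ua, path in csv.reader(lines):
        key = (ip, ua, path)
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        counts[key] += 1
        if key in last_seen:
            delta = int((stamp - last_seen[key]).total_seconds())
            gaps.setdefault(key, []).append(delta)
        last_seen[key] = stamp
    return counts, gaps

counts, gaps = summarize(SAMPLE.splitlines())
for key, n in counts.items():
    print(key, n, gaps.get(key, []))
```

Irregular gaps of a minute or several, as in this sample, suggest a retry loop rather than a scheduled crawl, but that is only a guess from the timing.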

lucy24

Msg#: 4435421 posted 10:23 am on Mar 31, 2012 (gmt 0)

114.216.0.0/13 is China?! I've never set eyes on anything in the whole range. (Thank you, Spotlight.)

:: scrambling to add yet another line to Yellow Peril* section of htaccess ::

Nothing ever asks for my index.html. They pick something at random from either /fun/ or /ebooks/ ** and gobble gobble gobble until they get bored and move on to some other site.


* Uhm. Sorry.
** If they tried it with /fonts/ or /hovercraft/ they'd find me waiting for them with a bazooka.
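
A quick way to confirm that the /13 quoted in this post really covers the address in the log sample posted earlier in the thread, sketched with Python's standard `ipaddress` module. It says nothing about which country the block is registered to; that still needs a Whois lookup.

```python
import ipaddress

block = ipaddress.ip_network("114.216.0.0/13")
hit = ipaddress.ip_address("114.218.226.235")  # IP from the earlier log sample

print(block[0], "-", block[-1])  # first and last address covered by the /13
print(hit in block)              # membership test
```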

wilderness

Msg#: 4435421 posted 2:04 pm on Mar 31, 2012 (gmt 0)

fwiw, ia_archiver is one of only two robots in my experience who have promptly answered e-mail along the lines of "Is this your robot?"


lucy,
I'd had multiple IP's of ia denied since 1999.
Four-five years ago I decided that I'd like my then-sites in their archive and was unable to determine all the IP's (at least the primary one which would remove the "exclusion note from their server"), and contacted them.

They did respond, however when I explained the IP confusion and asked for confirmation of the IP's they used?
That was the end of communications ;)

Today, I'm content that the 4-5-year-ago change failed, because I would not be pleased with my sites in their archive.

dstiles

Msg#: 4435421 posted 8:45 pm on Mar 31, 2012 (gmt 0)

Bill - similar pattern here.

I've blocked ia_archiver for pretty-near ever.

Lucy - I have China under various ISPs going down to 114.208 (below that, down to 200, is a KR block) and up to 114.255.

lucy24

Msg#: 4435421 posted 10:07 pm on Mar 31, 2012 (gmt 0)

:: detour to investigate as expeditiously as possible before Whois starts yattering about carbon-based lifeforms* ::

114.208-211 + 114.212 + 114.213 + 114.214 + 114.215 + 114.216-223 + 114.224-239 + 114.240-255
=
114.208-223 = 114.208.0.0/12
114.224-255 = 114.224.0.0/11
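
The merge above can be checked mechanically. This is a small sketch using `ipaddress.collapse_addresses` from the Python standard library; it treats every second octet from 208 through 255 as one /16, which is an assumption about how the individual provider ranges line up.

```python
import ipaddress

# One /16 per second octet from 208 to 255, matching the list above.
pieces = [ipaddress.ip_network(f"114.{n}.0.0/16") for n in range(208, 256)]

# collapse_addresses merges adjacent networks into the fewest CIDR blocks.
merged = list(ipaddress.collapse_addresses(pieces))
print(merged)
```

The two blocks cannot be folded into one because a /10 starting at 114.192 would also take in 114.192-207.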

If it were the US I'd mark all those providers separately, but if I'm blocking the whole country it doesn't matter.

Oops, here's another at 116.0.8.0/21 (nagvaaqtara) and 116.0.24.0/21 (nanijara) with Australia tucked in between. But those are too small to bother blocking unless they start actively vexing me.

Now, if you could do Allow/Deny in layers... ("Block this whole A class except this B piece, and then block these C ranges except for this D bit in the middle") Yes, I realize you could spell it all out in mod_rewrite, but Regular Expressions simply weren't designed for a binary system.


* They can't be too serious, though, because I've never been asked to redo a captcha.
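
The layered Allow/Deny idea in this post is essentially longest-prefix matching: the most specific range that contains the address wins. Here is a rough sketch of that logic outside Apache. The rule list uses private-range placeholders, not real country blocks, and `decide` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical layered rules; private-range addresses are placeholders only.
RULES = [
    ("deny",  "10.0.0.0/8"),     # block the whole A class...
    ("allow", "10.1.0.0/16"),    # ...except this B piece
    ("deny",  "10.1.2.0/24"),    # ...then block this C range
    ("allow", "10.1.2.128/25"),  # ...except for this bit in the middle
]

def decide(address, rules=RULES, default="allow"):
    """Return the action of the most specific (longest-prefix) matching rule."""
    ip = ipaddress.ip_address(address)
    best_action, best_len = default, -1
    for action, cidr in rules:
        net = ipaddress.ip_network(cidr)
        if ip in net and net.prefixlen > best_len:
            best_action, best_len = action, net.prefixlen
    return best_action

for probe in ("10.9.9.9", "10.1.9.9", "10.1.2.5", "10.1.2.200", "192.0.2.1"):
    print(probe, decide(probe))
```

Because the comparison is on prefix length, the order of the rules does not matter, which is the part that is awkward to express with regular expressions.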

keyplyr

Msg#: 4435421 posted 10:58 pm on Mar 31, 2012 (gmt 0)

@Lucy - I don't bother with China servers, colos, etc. I just block the country ranges (of which there are many).

China country range
114.0.0.0 - 114.255.255.255
114.0.0.0/8
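
For scale, the start-to-end range and the /8 above are the same thing, and it is a very large net to cast. A short sketch with the standard `ipaddress` module shows the conversion and the address count; it makes no claim about who holds the addresses inside the block.

```python
import ipaddress

first = ipaddress.ip_address("114.0.0.0")
last = ipaddress.ip_address("114.255.255.255")

# summarize_address_range turns a first/last pair into CIDR blocks.
nets = list(ipaddress.summarize_address_range(first, last))
print(nets, nets[0].num_addresses)
```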

lucy24

Msg#: 4435421 posted 3:49 am on Apr 1, 2012 (gmt 0)

Rotten luck for those poor Filipinos at 114.108.192-255 :) No idea why I've got that particular range labeled, but it's the only thing in the area that isn't China-colored. No, wait, there are some Thais in the neighborhood too. Can't remember what they were looking for, but I've had bona fide human visitors from Thailand. Well, some of those blocked Chinese visitors were probably human too, but that's just tough on them.

Don't think they were doing outside research for TIA, though.

While tucking away China I stumbled across what must be the world's smallest country block. Papua New Guinea has a quarter of a D range (aa.bb.cc.dd/26) somewhere. Even Guam has bigger pieces. Heh.
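
To put "a quarter of a D range" in numbers: a /26 holds 64 addresses, one fourth of a /24. The sketch below uses 192.0.2.0, a reserved documentation address, purely as a placeholder, since the post does not give the real one.

```python
import ipaddress

# 192.0.2.0 is a documentation placeholder; the real address is not given in the post.
quarter = ipaddress.ip_network("192.0.2.0/26")
whole = ipaddress.ip_network("192.0.2.0/24")

print(quarter.num_addresses, whole.num_addresses // quarter.num_addresses)
```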

wilderness

Msg#: 4435421 posted 4:12 am on Apr 1, 2012 (gmt 0)

If keyplyr is catching heat for a mere 114 Class A?

Try this on for size?
RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]
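
To see exactly which first octets those two conditions catch, here is a small sketch that runs the same patterns through Python's `re` module. Apache uses PCRE rather than Python's engine, but for patterns this simple the two should agree; treat that as an assumption. The helper name `first_octets_matched` is made up.

```python
import re

# The two patterns from the RewriteCond lines above.
patterns = [re.compile(r"^11[0-9]\."), re.compile(r"^12[1-6]\.")]

def first_octets_matched():
    """Return every first octet (0-255) whose addresses match either pattern."""
    matched = []
    for octet in range(256):
        addr = f"{octet}.0.0.1"
        if any(p.search(addr) for p in patterns):
            matched.append(octet)
    return matched

print(first_octets_matched())
```

The result covers 110 through 119 and 121 through 126, so 120 is not blocked by this pair. The escaped dot also keeps `^11[0-9]\.` from matching a plain 11.x.x.x address.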

dstiles

Msg#: 4435421 posted 7:40 pm on Apr 1, 2012 (gmt 0)

The 114/8 block - there's Australia, NZ, Japan in there as well as IN, PH and TW (all of which I block on a per-site basis). Also BD and a few others. I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.

Apart from that there are a lot of /8 ranges assigned to APNIC, lots of which resolve to CN and similar. 114 is merely one of the newest.

wilderness

Msg#: 4435421 posted 10:07 pm on Apr 1, 2012 (gmt 0)

I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.


h. None of the above

keyplyr

Msg#: 4435421 posted 11:21 pm on Apr 1, 2012 (gmt 0)

I would be very wary about blocking it as a /8 unless I were very sure I didn't want (eg) AU, NZ and JP.

In over a year blocking this China range, I have yet to deny even one legit, human visitor from AU, NZ or JP and I check every single request that is denied on a daily basis.

Just FYI - besides blocking all the hackers, scrapers, clippers, etc. coming from China, the main reason I block all China country ranges is political. Vietnam, Thailand, and others using those ranges get blocked as a bonus :)

lucy24

Msg#: 4435421 posted 12:27 am on Apr 2, 2012 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]

Got a friend at ^120. ?

Punch line:
I made up an IP wholly at random to double-check, and landed squarely on
120.128.0.0/14
:: detour to shared htaccess ::

the main reason I block all China country ranges is political

Ay-yup.

wilderness

Msg#: 4435421 posted 1:50 am on Apr 2, 2012 (gmt 0)

Got a friend at ^120. ?


Perhaps, I've just not had a malicious visitor from that Class A yet.

At one time I had "12[0-6]"; however, without digging deeply, I cannot tell you what prompted me to change it.

incrediBILL

Msg#: 4435421 posted 9:31 pm on Apr 2, 2012 (gmt 0)

I'm thinking the "ia_archiver" hits from China are fakes (like I ever thought otherwise) because all the legit hits appear to come from AWS or archive.org only and have the following UA:

"ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"

The UA that I see coming from archive.org itself is the following:

"ia_archiver(OS-Wayback)"

The bots from China never check robots.txt, only go straight for index.html, and the non-China bots only check robots.txt and seem to go away after.

However, they often ask multiple times a day like some spoiled brat that just won't take "NO!" for an answer:

2012-01-17,07:47:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,07:47:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,09:11:50,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,09:12:08,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,17:19:21,207.241.224.41,"ia_archiver(OS-Wayback)"
2012-01-17,17:19:21,207.241.224.41,"ia_archiver(OS-Wayback)"

Yup, asked 2 times in the same second, twice that day, and multiple times during the day, and that's the genius server from their own IP range.

How often do they think robots.txt gets updated in a day?

If you wanted to have fun, try scraping via AWS using ia_archiver's UA, as people that allow it would probably think it's legit!
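
Since the UA string alone is trivially forged, one option is to trust an "ia_archiver" claim only when the source IP is also on a known list. The sketch below assumes just that. Its trusted list holds only the single archive.org address that appears in the log in this post; real ranges would have to be confirmed with the crawler's operator. The names `TRUSTED_NETWORKS` and `classify` are hypothetical.

```python
import ipaddress

# Assumption: only the one archive.org address seen in the log above is trusted.
# A real list would hold ranges confirmed with the crawler's operator.
TRUSTED_NETWORKS = [ipaddress.ip_network("207.241.224.41/32")]

def classify(ip, user_agent):
    """Label a request 'other', 'plausible' or 'suspect' for ia_archiver claims."""
    if "ia_archiver" not in user_agent:
        return "other"
    address = ipaddress.ip_address(ip)
    if any(address in net for net in TRUSTED_NETWORKS):
        return "plausible"
    return "suspect"

print(classify("207.241.224.41", "ia_archiver(OS-Wayback)"))
print(classify("114.218.226.235", "ia_archiver"))
```

"Plausible" is deliberately weaker than "verified": an IP match only shows the request came from an address you chose to trust.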

incrediBILL

Msg#: 4435421 posted 2:14 am on Apr 15, 2012 (gmt 0)

This has turned into madness, it doubled up from 50 attempts per day from China to 100 now, it's very insistent on getting something and it's never gonna happen.

dstiles

Msg#: 4435421 posted 9:14 pm on Apr 15, 2012 (gmt 0)

Yeah. Still coming at me, too. There must be a better way of killing them outright. :(


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved