"The Alexa crawler (robot), which identifies itself as ia_archiver in the HTTP "User-agent" header field..."
Okay, fine. That's old news. So is the fact that Alexa's been a real pain to deal with, results-wise: surprisingly inept, incorrect, and borderline legally actionable. (See: "Alexa: Now showing other sites owned [webmasterworld.com]")
Alas, here's the new bad news:
---
2.) Alexa's not sending ANY agent info at all. None.
Yes, its IPs are asking for robots.txt, but since I block blank UAs, 'it' keeps hitting me, always and only via IP. Almost every single day it asks for robots.txt and umpteen other good files and gets redirected away, AND it tries to follow umpteen BAD links.
Last month, Alexa and I were A-OK. From one site:
209.237.238.224 - - [06/Mar/2006:10:48:18 -0800] "GET /robots.txt HTTP/1.0" 200 7134 "-" "ia_archiver"
209.237.238.224 - - [08/Mar/2006:02:30:57 -0800] "GET /robots.txt HTTP/1.0" 200 7134 "-" "ia_archiver"
209.237.238.224 - - [14/Mar/2006:17:53:20 -0800] "GET /robots.txt HTTP/1.0" 200 7166 "-" "ia_archiver"
209.237.238.224 - - [22/Mar/2006:20:00:22 -0800] "GET /robots.txt HTTP/1.0" 200 7643 "-" "ia_archiver"
209.237.238.224 - - [24/Mar/2006:10:12:18 -0800] "GET /robots.txt HTTP/1.0" 200 7643 "-" "ia_archiver"
209.237.238.224 - - [28/Mar/2006:17:19:28 -0800] "GET /robots.txt HTTP/1.0" 200 7641 "-" "ia_archiver"
209.237.238.224 - - [29/Mar/2006:19:16:02 -0800] "GET /robots.txt HTTP/1.0" 200 7641 "-" "ia_archiver"
This month, after April 2, we're not. From two sites:
209.237.238.224 - - [18/Apr/2006:21:52:52 -0700] "GET /robots.txt HTTP/1.0" 302 211 "-" ""
209.237.238.224 - - [18/Apr/2006:21:52:52 -0700] "GET /dir1/file1.html HTTP/1.0" 302 211 "-" ""
209.237.238.224 - - [18/Apr/2006:21:53:25 -0700] "GET /dir2/file2.html HTTP/1.0" 302 211 "-" ""
209.237.238.177 - - [19/Apr/2006:20:11:51 -0700] "GET /robots.txt HTTP/1.0" 302 211 "-" ""
209.237.238.177 - - [19/Apr/2006:20:11:51 -0700] "GET /dir3/file3.html HTTP/1.0" 302 211 "-" ""
209.237.238.229 - - [20/Apr/2006:23:08:44 -0700] "GET /robots.txt HTTP/1.0" 302 197 "-" ""
209.237.238.229 - - [20/Apr/2006:23:08:44 -0700] "GET / HTTP/1.0" 302 197 "-" ""
209.237.238.229 - - [20/Apr/2006:23:08:50 -0700] "GET /badlink.php?x=BAD1 HTTP/1.0" 302 206 "-" ""
209.237.238.229 - - [20/Apr/2006:23:08:59 -0700] "GET /file.html HTTP/1.0" 302 197 "-" ""
209.237.238.229 - - [20/Apr/2006:23:09:05 -0700] "GET /BAD HTTP/1.0" 302 197 "-" ""
209.237.238.229 - - [20/Apr/2006:23:09:43 -0700] "GET /badlink.php?x=BAD2 HTTP/1.0" 302 205 "-" ""
209.237.238.229 - - [20/Apr/2006:23:09:44 -0700] "GET /badlink.php?x=BAD3 HTTP/1.0" 302 206 "-" ""
209.237.238.229 - - [20/Apr/2006:23:09:50 -0700] "GET /badlink.php?x=BAD4 HTTP/1.0" 302 206 "-" ""
209.237.238.229 - - [20/Apr/2006:23:10:02 -0700] "GET /badlink.php?x=BAD5 HTTP/1.0" 302 207 "-" ""
Even if I sent 403s instead of custom error-related 302s, Alexa still isn't following its own rules. It's not identifying itself at all. Even an alexa.com host name instead of a bare IP would be better than nothing.
Plus, Alexa now says its results are "Powered by Google," so your guess is as good as mine as to why it's doing this almost covert spidering.
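For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch. It assumes Apache "combined" log format like the lines above; the names `LOG_RE` and `blank_ua_hits` are made up for illustration, and the sample lines are copied from this post.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def blank_ua_hits(lines):
    """Return parsed entries whose User-Agent field is empty or just "-"."""
    hits = []
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and match.group("agent").strip() in ("", "-"):
            hits.append(match.groupdict())
    return hits

# Three lines taken from the excerpts above: one with a UA, two without.
sample = [
    '209.237.238.224 - - [29/Mar/2006:19:16:02 -0800] "GET /robots.txt HTTP/1.0" 200 7641 "-" "ia_archiver"',
    '209.237.238.224 - - [18/Apr/2006:21:52:52 -0700] "GET /robots.txt HTTP/1.0" 302 211 "-" ""',
    '209.237.238.229 - - [20/Apr/2006:23:08:50 -0700] "GET /badlink.php?x=BAD1 HTTP/1.0" 302 206 "-" ""',
]

hits = blank_ua_hits(sample)

# Count blank-UA requests per client host or IP.
per_host = Counter(hit["host"] for hit in hits)
for host, count in per_host.items():
    print(host, count)
```

Feeding it a whole access log (one line per list item) gives a quick per-IP tally of blank-UA requests. It only reports what is in the log and makes no claim about who actually owns the IPs.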
---
3.) Bottom Line:
After exposing a client's private WHOIS info (and then getting the correction wrong), plus outrageously and offensively showing a porn site instead of my personal, definitely un-porn one, and now playing games with UAs on a client's two distinct sites -- it'll be a cold day before I trust them again.
/rant
Still no UA (boohiss) but a switch to a host name in recent days:
vm01-staging.alexa.com - - [23/Apr/2006:01:43:01 -0700] "GET /robots.txt HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:01:43:05 -0700] "GET / HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:07:55:42 -0700] "GET /dir/file.html HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:07:55:57 -0700] "GET / HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:07:56:24 -0700] "GET /dir/file1.html HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:07:56:51 -0700] "GET /dir/file2.html HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:07:57:22 -0700] "GET /dir/file3.html HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:07:58:38 -0700] "GET /dir/file4.html HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:08:00:15 -0700] "GET /dir/file5.html HTTP/1.0" 302 211 "-" ""
vm01-staging.alexa.com - - [23/Apr/2006:08:00:35 -0700] "GET /dir/file6.html HTTP/1.0" 302 211 "-" ""
(LOL. Looks like I'm talking to myself in this thread. Oh, well:)
I originally banned Alexa because they kept trying to follow JavaScript links, but badly, so all I saw was a load of 404s. Sounds to me like they have a couple of script kiddies in as programmers. I wonder how they got the thumbnail of my site, since I ban them in my robots.txt. Maybe a DMCA notice would work.
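One thing worth keeping in mind on the robots.txt side: the rules are keyed on the user-agent token. Below is a small sketch using Python's standard `urllib.robotparser` to show how a ban that names only ia_archiver is matched. The robots.txt content is hypothetical (not anyone's real file from this thread), and the helper name `allowed` is invented for the example. It only models how one parser matches tokens; what a crawler actually does with your robots.txt is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans only the ia_archiver token.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Ask the stdlib parser whether user_agent may fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Compare a request identified as ia_archiver with one sending no UA token.
print("ia_archiver:", allowed("ia_archiver", "/dir/file.html"))
print("blank UA:   ", allowed("", "/dir/file.html"))
```

Adding a `User-agent: *` section is the usual way to cover agents that are not named explicitly, though it still relies on the crawler choosing to obey it.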