Forum Moderators: bakedjake


Alexa's user-agent?

They follow dynamic links...

         

mivox

7:36 pm on Apr 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got duplicates of most of my site linked through dynamic links from our web store (long story...), and I don't want pesky dynamic-link-savvy robots thinking I'm trying to spam them with duplicate pages.

I've got Google blocked from the directory in question (but not before their last crawl... yipe!), but Alexa's browser/OS comes up as 'undefined' in my stat program. Anyone know what to use for Alexa's U/A for robots.txt? Would just plain "User-agent: Alexa" work?

jeremy goodrich

7:46 pm on Apr 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think I'm doing it by IP, which wouldn't work for your purposes. Anybody else? If you want the ip's, I'll sticky them to you, but it will take some digging.

mivox

7:51 pm on Apr 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, if nobody can come up with a U/A for Alexa, I'll just ban everyone from the directory in question... no reason to have it available to any of them pesky crawlers.

theperlyking

9:56 pm on Apr 16, 2001 (gmt 0)

10+ Year Member



These may be of some help, taken from the logs of one of my sites:

host | IP | UA

arc18-public.alexa.com | 209.247.40.168 | ia_archiver
arc20-public.alexa.com | 209.247.40.170 | ia_archiver
arc21-public.alexa.com | 209.247.40.171 | ia_archiver
arc22-public.alexa.com | 209.247.40.172 | ia_archiver
arc24-public.alexa.com | 209.247.40.174 | ia_archiver

mivox

10:02 pm on Apr 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks much! I'll try using ia_archiver then...
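Given that user-agent, the robots.txt entry would be a sketch like this ("/store/" is a placeholder for the actual duplicate directory, which the thread doesn't name):

```
# Keep Alexa's crawler out of the duplicate-content directory
# (assumes ia_archiver honors the Robots Exclusion Protocol)
User-agent: ia_archiver
Disallow: /store/
```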

Son_House

8:44 am on Apr 19, 2001 (gmt 0)

10+ Year Member



I added ia_archiver to our robots.txt a month ago, but they still come to our site and take whatever they want. The reason? They still haven't even requested robots.txt. Maybe they only do it once a year.

Mivox, I suggest doing it by IP, like Jeremy Goodrich is doing.

mivox

7:27 pm on Apr 27, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I added them to robots.txt... and they came back last night...

So Jeremy, if you're willing to send some IPs my way, I'd like to ban them via htaccess this time. Or would it work to ban them via "crawl3-public.alexa.com", etc., instead of actual IP number?

jeremy goodrich

8:03 pm on Apr 27, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



theperlyking gave a good start on it. It would take me a while to look through my stuff to find the IPs (I'm not too organized :)).

Use those IPs posted, and I'll try to add more when I find them.

msgraph

8:20 pm on Apr 27, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>The question is, who is Alexa really working for? I think it is nefarious black-ops-like people (seriously).

I've been doing some research, and although I have no solid evidence, I have found some clues.

First clue:

Notice Alexa listed

The Internet Archive [archive.org]

Second clue:

Take a look at the fantastic paper NFFC posted a while back in one of the forums:

[research.compaq.com...]

Their spidering activity fits this quote:

"The Internet Archive also uses multiple machines to crawl the web [6,14]. Each crawler process is assigned up to 64 sites to crawl, and no site is assigned to more than one crawler. Each single-threaded crawler process reads a list of seed URLs for its assigned sites from disk into per-site queues, and then uses asynchronous I/O to fetch pages from these queues in parallel. Once a page is downloaded, the crawler extracts the links contained in it. If a link refers to the site of the page it was contained in, it is added to the appropriate site queue; otherwise it is logged to disk. Periodically, a batch process merges these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process."

Third clue:

ia_archiver = InternetArchive_Archiver?

What do you think?
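The scheme in that quote, per-site queues with cross-site links logged for a later batch merge, can be sketched roughly like this (illustrative Python, not the Archive's actual code; class and attribute names are invented):

```python
from collections import deque
from urllib.parse import urlparse

class SiteCrawler:
    """One crawler process assigned a fixed set of sites."""

    def __init__(self, seed_urls):
        # One FIFO queue per assigned site (keyed by hostname)
        self.queues = {}
        # Links pointing outside the assigned sites, to be merged
        # into other crawlers' seed sets by a periodic batch process
        self.cross_site_log = []
        for url in seed_urls:
            host = urlparse(url).netloc
            self.queues.setdefault(host, deque()).append(url)

    def handle_links(self, links):
        """Route extracted links: same assigned site -> that site's
        queue; anything else -> the cross-site log."""
        for link in links:
            host = urlparse(link).netloc
            if host in self.queues:
                self.queues[host].append(link)
            else:
                self.cross_site_log.append(link)

crawler = SiteCrawler(["http://example.com/", "http://example.org/"])
crawler.handle_links(["http://example.com/a", "http://other.net/b"])
```

The key design point from the paper is that no site is shared between crawlers, so each process can fetch from its own queues without any cross-process coordination.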

mivox

8:41 pm on Apr 27, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would use the IPs given above, but I'm not getting hit by the "arcXX-public" spiders, I'm getting hit by the "crawlXX-public" spiders. I'm trying to ban them by machine name for now...

theperlyking

10:25 pm on Apr 27, 2001 (gmt 0)

10+ Year Member



OK, pulled these from another site; they match the "crawlXX-public" type you mention.

crawl1-public.alexa.com | 209.247.40.104
crawl2-public.alexa.com | 209.247.40.105
crawl3-public.alexa.com | 209.247.40.106
crawl4-public.alexa.com | 209.247.40.107
crawl5-public.alexa.com | 209.247.40.108
crawl6-public.alexa.com | 209.247.40.109
crawl7-public.alexa.com | 209.247.40.98
crawl8-public.alexa.com | 209.247.40.99

There doesn't appear to be a crawl9 or above.
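An .htaccess ban using those addresses would be a sketch along these lines (assuming Apache's mod_access Order/Allow/Deny syntax):

```
# Block Alexa's crawlXX-public machines by IP
Order Allow,Deny
Allow from all
Deny from 209.247.40.98
Deny from 209.247.40.99
Deny from 209.247.40.104
Deny from 209.247.40.105
Deny from 209.247.40.106
Deny from 209.247.40.107
Deny from 209.247.40.108
Deny from 209.247.40.109
```

Apache also accepts partial addresses, so a single `Deny from 209.247.40.` would cover the whole block, including the arcXX-public machines listed earlier, at the risk of also catching anything else in that range.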

mivox

10:34 pm on Apr 27, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, perly! That's great... Once my logs resolve the IP to a machine name, I don't seem to have a way to get the IPs back.

Gorufu

11:45 am on Apr 29, 2001 (gmt 0)

10+ Year Member



Hi Mivox,

If your logs resolve IPs to machine names, you can deny by domain name instead of IPs or machine names in .htaccess:

deny from .alexa.com

should block anything.alexa.com.