Forum Moderators: open
They have data about several sites I am involved with... but I can't immediately see the UA of their bot in the stats.
I have seen a variety of UAs appearing at sites I work with, starting somewhere around August 20th; many of these have since been banned because their intent was deemed shady or unknown:
Java/1.6.0_07
Java/1.6.0_05
Java/1.6.0_04
Java/1.5.0_10
Java/1.4.1_04
libwww-perl/5.79
Mozilla/4.0
SimpleHttpClient/1.0
Is the SEOmoz bot using one of those, or is it some other?
[edited by: incrediBILL at 10:30 pm (utc) on Oct. 7, 2008]
[edit reason] added missing last line at OPs request [/edit]
I'm checking accesses from HopOne.net which is where their servers are hosted and don't see a smoking gun yet.
I'm thinking they might be using someone else's data that already has a big index available but that's purely speculation at the moment.
In July and August, almost simultaneous requests from two different IP addresses, and then (in August only) one further request a few minutes later:
2008-July-xx - Almost Simultaneous Requests:
209.40.100.248 - 209.40.100.248 - HopOne Internet Corporation
209.160.24.62 - seomoz.org - HopOne Internet Corporation
2008-August-xx - Almost Simultaneous Requests:
209.103.165.xx - client.covesoft.net - HopOne Internet Corporation
209.160.24.62 - seomoz.org - HopOne Internet Corporation
2008-August-xx - A few minutes after the previous request:
209.40.112.202 - 209.40.112.202 - HopOne Internet Corporation
These logs don't record the UA that was presented.
No crawling found in June or July... and I didn't look any earlier than that.
I didn't find anything from 209.nnn.nnn.nnn in September, and it is too early to be thinking about October.
It is possible that not all of those are SEOmoz; some are unidentified.
These might also be from other tools on their site and nothing to do with the new data.
[edited by: incrediBILL at 11:01 pm (utc) on Oct. 6, 2008]
After a more intensive search I found a bot that regularly visits all the domains, on average once a month. It never requests robots.txt, and it has no robot agent string, but presents IE6 (from 10/8/2007) or IE7 (from 12/13/2007) as its agent string:
HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
Notice that real IE makes only HTTP/1.1 requests; only bots claiming to be "compatible" still speak HTTP/1.0.
Hosts:
ec2-xx-xx-xx-xx.compute-1.amazonaws.com
IP ranges:
72.44.32.0 - 72.44.63.255 AMAZON-EC2-2
67.202.0.0 - 67.202.63.255 AMAZON-EC2-3
75.101.128.0 - 75.101.255.255 AMAZON-EC2-4
174.129.0.0 - 174.129.255.255 AMAZON-EC2-5
Some IPs used from February 2008 to June 2008:
75.101.251.51 ec2-75-101-251-51.compute-1.amazonaws.com
67.202.44.145 ec2-67-202-44-145.compute-1.amazonaws.com
67.202.50.193 ec2-67-202-50-193.compute-1.amazonaws.com
67.202.55.177 ec2-67-202-55-177.compute-1.amazonaws.com
72.44.41.164 ec2-72-44-41-164.z-2.compute-1.amazonaws.com
All: HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
And some IPs used from December 2007 to January 2008:
72.44.56.224 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
67.202.4.63 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
67.202.27.159 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
67.202.24.230 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
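The HTTP/1.0-plus-MSIE heuristic described above can be sketched as a simple log filter. This is my own illustrative code, not anything from the thread; it assumes Apache combined-log-format lines, so adjust the regex for your server's actual format:

```python
import re

# Flag log entries that claim to be MSIE but speak HTTP/1.0 --
# real IE 6/7 always uses HTTP/1.1, so these are likely bots.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/(?P<ver>[\d.]+)" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def suspicious(line):
    """True if the request is HTTP/1.0 with an MSIE user-agent string."""
    m = LOG_RE.match(line)
    if not m:
        return False
    return m.group("ver") == "1.0" and "MSIE" in m.group("ua")

sample = ('67.202.44.145 - - [12/Feb/2008:10:00:00 +0000] '
          '"GET / HTTP/1.0" 200 1234 "-" '
          '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"')
print(suspicious(sample))  # True
```

Combined with a check that the IP never fetched robots.txt, this catches exactly the pattern described above.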
For people who want to block these IP ranges: be aware there are also bots that do request robots.txt and do use a robot agent string but also run on the Amazon EC2 service, such as:
d1g, find mobi, netseer, accelobot, archive, alexa, enabal, etc.
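If you do decide to filter on those ranges, a membership check like the following may help (a sketch of my own; the CIDR blocks are my conversion of the ranges listed earlier, and blocking them will also hit the well-behaved EC2 bots just mentioned):

```python
import ipaddress

# The EC2 ranges from above, converted to CIDR notation.
EC2_RANGES = [ipaddress.ip_network(n) for n in (
    "72.44.32.0/19",    # 72.44.32.0   - 72.44.63.255   AMAZON-EC2-2
    "67.202.0.0/18",    # 67.202.0.0   - 67.202.63.255  AMAZON-EC2-3
    "75.101.128.0/17",  # 75.101.128.0 - 75.101.255.255 AMAZON-EC2-4
    "174.129.0.0/16",   # 174.129.0.0  - 174.129.255.255 AMAZON-EC2-5
)]

def from_ec2(ip):
    """True if the visiting IP falls inside any of the listed EC2 ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in EC2_RANGES)

print(from_ec2("67.202.44.145"))   # True
print(from_ec2("208.115.111.248")) # False
```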
[edited by: Statsfreak at 2:20 am (utc) on Oct. 7, 2008]
[seomoz.org...]
As with any crawler you should check out when bots visit your site. If you don't like what they're doing you should certainly check out the Robots Exclusion Policy. Robots.txt is a great way to limit what (good behaving) robots crawl.
That statement has the implication that their crawler is a visible bot, not stealth, and will honor robots.txt.
Some of my sites cloak tracking codes into content, including titles, URLs, and anchor text, which is all that Linkscape uses.
Those tracking codes indicate the following crawler collected LinkScape's data:
208.115.111.* -> crawl*.dotnetdotcom.org "Mozilla/5.0 (compatible; DotBot/1.1; [dotnetdotcom.org...] crawler@dotnetdotcom.org)"
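The cloaked-tracking-code trick works roughly like this (an illustrative sketch of my own, not the poster's actual implementation): serve each crawler a page whose title carries a token derived from its IP and user-agent, then look for that token in third-party indexes later.

```python
import hashlib

def tracking_token(ip, user_agent):
    """Derive a short, deterministic token unique to one crawler."""
    return hashlib.md5(f"{ip}|{user_agent}".encode()).hexdigest()[:8]

def cloaked_title(base_title, ip, user_agent):
    """Embed the token in the page title served to that crawler."""
    return f"{base_title} [{tracking_token(ip, user_agent)}]"

# If this token later surfaces in someone's index, you know it was
# DotBot's fetch (names and values here are hypothetical).
title = cloaked_title("Widgets Home", "208.115.111.248",
                      "Mozilla/5.0 (compatible; DotBot/1.1; ...)")
print(title)
```

Seeing such a token in the Linkscape data is what ties their index to DotBot's crawl.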
Now the question becomes:
"How did SEOMoz get my content indicating that DotBot downloaded the pages?"
Considering dotnetdotcom.org has no pages to index other than the home page, their site wasn't crawled for the content.
However, dotnetdotcom has a large 22GB (zipped) torrent file on their home page that you could theoretically download and import to kick-start your own index. Since it's a torrent download, they couldn't simply have crawled and indexed it; they would have had to download it, unzip it, and then import it.
Additionally, I checked about 10 search engines to see if someone had used the content of the DotBot index torrent file and got it indexed somewhere and couldn't find any references whatsoever.
This leads to two possible conclusions:
a) Linkscape uses dotnetdotcom.org's DotBot as their crawler
or
b) SEOMoz downloaded and seeded their index with the initial DotBot index torrent file.
It's also obvious DotBot is getting data from some other source, because it knew about pages it was not allowed to access, pages which aren't directly linked from other sites and appear only in search engine indexes. The implication here is that they have potentially been mining search engines for pages on sites.
That's all I have for now.
If I find out anything more I'll post it later.
That statement has the implication that their crawler is a visible bot, not stealth, and will honor robots.txt
Except that they conspicuously ignore repeated requests to identify the bot's user-agent.
I counted about ten on the linked page, and none were answered.
These people make themselves very easy to dislike.
...
This is the first and only visit seen from that bot on a two year old site:
2008-10-03 -- 208.115.111.248 -- Mozilla/5.0_(compatible;_DotBot/1.1;_http://www.dotnetdotcom.org/,_crawler@dotnetdotcom.org)
Was that early enough (4 days ago) to still be able to get into the SEOmoz index (i.e. less than 72 hours before launch)?
[edited by: g1smd at 8:03 pm (utc) on Oct. 7, 2008]
It's very strange to publicise that you're crawling the web and then refuse to say which UA.
SEOMoz downloaded and seeded their index with the initial DotBot index torrent file.
I assume you've checked, but did you find any tracking codes within the SEOmoz index that are not within the DotBot torrent?
I assume you've checked, but did you find any tracking codes within the SEOmoz index that are not within the DotBot torrent?
Working on it.
I must sheepishly admit I didn't have enough free disk space readily available to download and decompress a 22GB zip file. OK, I had the space to download the file, but the decompressed version had no place to go.
However, it's being downloaded and processed elsewhere thanks to a good bot blocking friend. ;)
[edited by: incrediBILL at 7:50 pm (utc) on Oct. 7, 2008]
User-agent: dotbot
Disallow: /

Born 2008-06-10
Previous "short" discussion in 2008 July...
DotBot
[webmasterworld.com...]
What does the 22GB decompress to? I couldn't find that information on the dotnetdotcom site.
I do know the decompression tool claimed it would take about 4 hours to complete and we stopped it at 2 hours when it was obvious there was insufficient space.
It's in their "Download our index" section and I have no clue how big the file will be because so far none of us have had enough space to decompress it.
Short of heading out to buy a 1TB drive, I may never know.
<rant>
I must say I prefer WebmasterWorld's approach to moderation: delete when necessary, then private message those involved to let them know what's going on. The whole "delete and act like it never happened" approach is rather unsettling.
</rant>
Excluding Your Data From Linkscape on Specific Webpages
The best way to restrict data from all of Linkscape's data sources is with the Robots <META> tag. Linkscape obeys either "ROBOTS" or "SEOMOZ" in the meta tag's "name" attribute. For example:
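The example itself didn't survive in the quote above; based on the standard robots META syntax the quoted docs describe, it would presumably look something like this (my reconstruction, not SEOmoz's exact markup):

```html
<!-- Standard form, honored by compliant robots generally: -->
<meta name="ROBOTS" content="noindex, nofollow">

<!-- Linkscape-specific form, per the quoted docs: -->
<meta name="SEOMOZ" content="noindex">
```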
Really? SEOmoz gets its own metadata attribute? I'm impressed. Maybe this will spread as well as the revisit-after metadata did?
They want to maintain their air of mystery, it seems ;)