Forum Moderators: open
They have data about several sites I am involved with... but I can't immediately see the UA of their bot in the stats.
I have seen a variety of UAs appearing at sites I work with, starting somewhere around August 20th; many of these have since been banned because their intent was deemed shady or unknown:
Java/1.6.0_07
Java/1.6.0_05
Java/1.6.0_04
Java/1.5.0_10
Java/1.4.1_04
libwww-perl/5.79
Mozilla/4.0
SimpleHttpClient/1.0
Is the SEOmoz bot using one of those, or is it some other?
[edited by: incrediBILL at 10:30 pm (utc) on Oct. 7, 2008]
[edit reason] added missing last line at OPs request [/edit]
I'm checking accesses from HopOne.net which is where their servers are hosted and don't see a smoking gun yet.
I'm thinking they might be using someone else's data that already has a big index available but that's purely speculation at the moment.
In July and August, almost simultaneous requests from two different IP addresses, and then (in August only) one further request a few minutes later:
2008-July-xx - Almost Simultaneous Requests:
209.40.100.248 - 209.40.100.248 - HopOne Internet Corporation
209.160.24.62 - seomoz.org - HopOne Internet Corporation
2008-August-xx - Almost Simultaneous Requests:
209.103.165.xx - client.covesoft.net - HopOne Internet Corporation
209.160.24.62 - seomoz.org - HopOne Internet Corporation
2008-August-xx - A few minutes after the previous request:
209.40.112.202 - 209.40.112.202 - HopOne Internet Corporation
These logs don't record the UA that was presented.
No crawling found in June or July... and I didn't look any earlier than that.
I didn't find anything from 209.nnn.nnn.nnn in September, and it is too early to be thinking about October.
It is possible that not all of those are SEOmoz; some are unidentified.
These might also be from other tools on their site and nothing to do with the new data.
[edited by: incrediBILL at 11:01 pm (utc) on Oct. 6, 2008]
After a more intensive search I found a bot that regularly visits all the domains, on average once a month. It never requests robots.txt, and it has no robot agent string, but presents IE6 (from 10/8/2007) or IE7 (from 12/13/2007) as its agent string:
HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
Notice that real IE makes only HTTP/1.1 requests; only bots claiming to be "compatible" still speak HTTP/1.0.
Hosts:
ec2-xx-xx-xx-xx.compute-1.amazonaws.com
IP ranges:
72.44.32.0 - 72.44.63.255 AMAZON-EC2-2
67.202.0.0 - 67.202.63.255 AMAZON-EC2-3
75.101.128.0 - 75.101.255.255 AMAZON-EC2-4
174.129.0.0 - 174.129.255.255 AMAZON-EC2-5
Some IPs used from February 2008 to June 2008:
75.101.251.51 ec2-75-101-251-51.compute-1.amazonaws.com
67.202.44.145 ec2-67-202-44-145.compute-1.amazonaws.com
67.202.50.193 ec2-67-202-50-193.compute-1.amazonaws.com
67.202.55.177 ec2-67-202-55-177.compute-1.amazonaws.com
72.44.41.164 ec2-72-44-41-164.z-2.compute-1.amazonaws.com
All: HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
And some IPs used from December 2007 to January 2008:
72.44.56.224 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
67.202.4.63 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
67.202.27.159 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
67.202.24.230 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1)
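The HTTP/1.0-plus-MSIE heuristic described above can be sketched as a simple log filter. This is my own illustrative code, not anything from the thread; it assumes Apache combined-log-format lines, so adjust the regex for your server's actual format:

```python
import re

# Flag log entries that claim to be MSIE but speak HTTP/1.0 --
# real IE 6/7 always uses HTTP/1.1, so these are likely bots.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/(?P<ver>[\d.]+)" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def suspicious(line):
    """True if the request is HTTP/1.0 with an MSIE user-agent string."""
    m = LOG_RE.match(line)
    if not m:
        return False
    return m.group("ver") == "1.0" and "MSIE" in m.group("ua")

sample = ('67.202.44.145 - - [12/Feb/2008:10:00:00 +0000] '
          '"GET / HTTP/1.0" 200 1234 "-" '
          '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"')
print(suspicious(sample))  # True
```

Combined with a check that the IP never fetched robots.txt, this catches exactly the pattern described above.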
For people who want to block these IP ranges: be aware there are also bots that do request robots.txt and do use a robot agent string but also run on the Amazon EC2 service, such as:
d1g, find mobi, netseer, accelobot, archive, alexa, enabal, etc.
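If you do decide to filter on those ranges, a membership check like the following may help (a sketch of my own; the CIDR blocks are my conversion of the ranges listed earlier, and blocking them will also hit the well-behaved EC2 bots just mentioned):

```python
import ipaddress

# The EC2 ranges from above, converted to CIDR notation.
EC2_RANGES = [ipaddress.ip_network(n) for n in (
    "72.44.32.0/19",    # 72.44.32.0   - 72.44.63.255   AMAZON-EC2-2
    "67.202.0.0/18",    # 67.202.0.0   - 67.202.63.255  AMAZON-EC2-3
    "75.101.128.0/17",  # 75.101.128.0 - 75.101.255.255 AMAZON-EC2-4
    "174.129.0.0/16",   # 174.129.0.0  - 174.129.255.255 AMAZON-EC2-5
)]

def from_ec2(ip):
    """True if the visiting IP falls inside any of the listed EC2 ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in EC2_RANGES)

print(from_ec2("67.202.44.145"))   # True
print(from_ec2("208.115.111.248")) # False
```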
[edited by: Statsfreak at 2:20 am (utc) on Oct. 7, 2008]
[seomoz.org...]
As with any crawler you should check out when bots visit your site. If you don't like what they're doing you should certainly check out the Robots Exclusion Policy. Robots.txt is a great way to limit what (good behaving) robots crawl.
That statement has the implication that their crawler is a visible bot, not stealth, and will honor robots.txt.
Some of my sites cloak tracking codes into content, including titles, URLs, and anchor text, which is all that Linkscape uses.
Those tracking codes indicate the following crawler collected LinkScape's data:
208.115.111.* -> crawl*.dotnetdotcom.org "Mozilla/5.0 (compatible; DotBot/1.1; [dotnetdotcom.org...] crawler@dotnetdotcom.org)"
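The cloaked-tracking-code trick works roughly like this (an illustrative sketch of my own, not the poster's actual implementation): serve each crawler a page whose title carries a token derived from its IP and user-agent, then look for that token in third-party indexes later.

```python
import hashlib

def tracking_token(ip, user_agent):
    """Derive a short, deterministic token unique to one crawler."""
    return hashlib.md5(f"{ip}|{user_agent}".encode()).hexdigest()[:8]

def cloaked_title(base_title, ip, user_agent):
    """Embed the token in the page title served to that crawler."""
    return f"{base_title} [{tracking_token(ip, user_agent)}]"

# If this token later surfaces in someone's index, you know it was
# DotBot's fetch (names and values here are hypothetical).
title = cloaked_title("Widgets Home", "208.115.111.248",
                      "Mozilla/5.0 (compatible; DotBot/1.1; ...)")
print(title)
```

Seeing such a token in the Linkscape data is what ties their index to DotBot's crawl.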
Now the question becomes:
"How did SEOMoz get my content indicating that DotBot downloaded the pages?"
Considering dotnetdotcom.org has no pages to index other than the home page, their site wasn't crawled for the content.
However, dotnetdotcom has a large 22GB (zipped) torrent file on their home page that you could theoretically download and import to kick-start your own index. Since it's a torrent download, they couldn't simply have crawled and indexed it; they would have had to download it, unzip it, and then import it.
Additionally, I checked about 10 search engines to see if someone had used the content of the DotBot index torrent file and got it indexed somewhere and couldn't find any references whatsoever.
This leads to two possible conclusions:
a) Linkscape uses dotnetdotcom.org's DotBot as their crawler
or
b) SEOMoz downloaded and seeded their index with the initial DotBot index torrent file.
It's also obvious DotBot is getting data from some other source, because it knew about pages it was not allowed to access, pages which aren't directly linked from other sites and appear only in search engine indexes. The implication here is that they have potentially been mining search engines for pages on sites.
That's all I have for now.
If I find out anything more I'll post it later.
That statement has the implication that their crawler is a visible bot, not stealth, and will honor robots.txt
Except that they conspicuously ignore repeated requests to identify the bot's user-agent.
I counted about ten on the linked page, and none were answered.
These people make themselves very easy to dislike.
...
This is the first and only visit seen from that bot on a two year old site:
2008-10-03 -- 208.115.111.248 -- Mozilla/5.0_(compatible;_DotBot/1.1;_http://www.dotnetdotcom.org/,_crawler@dotnetdotcom.org)
Was that early enough (4 days ago) to still be able to get into the SEOmoz index (i.e. less than 72 hours before launch)?
[edited by: g1smd at 8:03 pm (utc) on Oct. 7, 2008]
It's very strange to publicise that you're crawling the web and then refuse to say which UA.
SEOMoz downloaded and seeded their index with the initial DotBot index torrent file.
I assume you've checked, but did you find any tracking codes within the SEOmoz index that are not within the DotBot torrent?
I assume you've checked, but did you find any tracking codes within the SEOmoz index that are not within the DotBot torrent?
Working on it.
I must sheepishly admit I didn't have enough free disk space readily available to download and decompress a 22GB zip file. OK, I had the space to download the file, but the decompressed version had no place to go.
However, it's being downloaded and processed elsewhere thanks to a good bot blocking friend. ;)
[edited by: incrediBILL at 7:50 pm (utc) on Oct. 7, 2008]
User-agent: dotbot
Disallow: /

Born 2008-06-10
Previous "short" discussion in 2008 July...
DotBot
[webmasterworld.com...]
What does the 22GB decompress to? I couldn't find that information on the dotnetdotcom site.
I do know the decompression tool claimed it would take about 4 hours to complete and we stopped it at 2 hours when it was obvious there was insufficient space.
It's in their "Download our index" section and I have no clue how big the file will be because so far none of us have had enough space to decompress it.
Short of heading out to buy a 1TB drive, I may never know.
<rant>
I must say I prefer WebmasterWorld's approach to moderation: delete when necessary, then private message those involved to let them know what's going on. The whole "delete and act like it never happened" approach is rather unsettling.
</rant>
Excluding Your Data From Linkscape on Specific Webpages
The best way to restrict data from all of Linkscape's data sources is with the Robots <META> tag. Linkscape obeys either "ROBOTS" or "SEOMOZ" in the meta tag's "name" attribute. For example:
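The example itself didn't survive in the quote above; based on the standard robots META syntax the quoted docs describe, it would presumably look something like this (my reconstruction, not SEOmoz's exact markup):

```html
<!-- Standard form, honored by compliant robots generally: -->
<meta name="ROBOTS" content="noindex, nofollow">

<!-- Linkscape-specific form, per the quoted docs: -->
<meta name="SEOMOZ" content="noindex">
```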
Really? SEOmoz gets its own metadata attribute? I'm impressed. Maybe this will spread as well as the revisit-after metadata did?
They want to maintain their air of mystery, it seems ;)