Forum Moderators: Ocean10000 & incrediBILL


SEOmoz Crawler/Bot

   
1:46 pm on Oct 6, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



SEOmoz has been building an index of several billion URLs over the last few months, for a link analysis tool.

They have data about several sites I am involved with... but I can't immediately see the UA of their bot in the stats.

I have seen a variety of UAs appearing at sites I work with, starting somewhere around August 20th - many of which have subsequently been banned because their intent was deemed shady or unknown:

Java/1.6.0_07
Java/1.6.0_05
Java/1.6.0_04
Java/1.5.0_10
Java/1.4.1_04
libwww-perl/5.79
Mozilla/4.0
SimpleHttpClient/1.0

Is the SEOmoz bot using one of those, or is it some other?
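
For reference, the shady ones were banned with a simple user-agent block at the server. A rough .htaccess sketch of that kind of rule (Apache 2.2 syntax; the patterns are illustrative examples, not a complete list):

# deny requests whose User-Agent matches the junk strings listed above
SetEnvIfNoCase User-Agent "^Java/" bad_bot
SetEnvIfNoCase User-Agent "^libwww-perl" bad_bot
SetEnvIfNoCase User-Agent "^SimpleHttpClient" bad_bot
# a bare "Mozilla/4.0" with nothing after it is almost never a real browser
SetEnvIfNoCase User-Agent "^Mozilla/4\.0$" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

Anything matching one of those patterns gets a 403; everything else passes through untouched.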

[edited by: incrediBILL at 10:30 pm (utc) on Oct. 7, 2008]
[edit reason] added missing last line at OPs request [/edit]

7:08 pm on Oct 17, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



According to SEOmoz, all of their data sources supply meta tag data, so in theory, once the "seomoz" tag is on the page and all of those sources have recrawled it, the page will drop out of their tool.
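
The exact markup isn't spelled out in this thread, but assuming the "seomoz" tag follows the usual robots-meta convention, it would presumably look something like:

<!-- assumed form of the per-page opt-out; the name and content values are not confirmed here -->
<meta name="seomoz" content="noindex">

Check SEOmoz's own instructions for the real syntax before relying on it.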
7:09 pm on Oct 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep, they got the official smackdown today for that sly move they pulled. (Search for smackdown blogsblogsblogs to find what I mean.) I generally don't like to take sides when there's some kind of industry tension going on, but there's just no denying that this was all a little too "slick". First the lies about having their own crawler, then the hedging, and finally the pretending to be open by giving info that is mostly useless... There's just no way to take any other position that I can see. Slick salesmanship, sure, but slick is icky.
7:17 pm on Oct 17, 2008 (gmt 0)

5+ Year Member



However, that is a per-page disallow. What about robots.txt? Does it cater for
User-agent: seomoz
Disallow: /
in the same way?

g1smd, they don't have their own crawler; they were lying about that. Since robots.txt files generally aren't included in those other crawlers' indexes, moz would never even see a directive like that. They couldn't obey it even if they wanted to.

-Michael

7:20 pm on Oct 17, 2008 (gmt 0)



I think the meta element idea is that they will remove the page after the event - i.e. once they get the crawl data from the various sources.
7:21 pm on Oct 17, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Just for giggles I looked at their list to see which items I already block vs. what was allowed to crawl.

DENY * Dotnetdotcom.org
DENY * Grub.org/Wikia
DENY * Page-Store.com
DENY * Amazon/Alexa’s crawl and internet archive resources
DENY * Exalead’s commercially available data
ALLOW * Gigablast’s commercially available data
ALLOW * Yahoo!’s BOSS API and other data sources
ALLOW * Microsoft’s Live API and other data sources
ALLOW * Google’s API and other data sources
ALLOW * Ask.com’s API and other data sources
DENY * Additional crawls from open source, commercial and academic projects

Page-Store caught my attention as I didn't remember seeing it before but the bot blocker had been stopping it reliably since 7/30/07.

The evolution of the Page-Store UAs was interesting:

"Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)"

"Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]"

"Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]"

"Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com"

It's been operating out of compute-1.amazonaws.com, which is already blocked for good reasons, and heritrix would've been blocked as well - batting 100% ;)
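
For anyone wanting to do the same, both can be caught in .htaccess with a UA match plus a hostname-based deny. A sketch only - the hostname rule makes Apache do a reverse DNS lookup on every request, so weigh the cost before copying it:

# block heritrix UAs and anything resolving to EC2 (compute-1.amazonaws.com)
SetEnvIfNoCase User-Agent "heritrix" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from .compute-1.amazonaws.com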

[edited by: incrediBILL at 7:25 pm (utc) on Oct. 17, 2008]

7:41 pm on Oct 17, 2008 (gmt 0)

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I think the meta element idea is that they will remove the page after the event - i.e. once they get the crawl data from the various sources.

Yeah, that's the way I read it; it's the only way it makes sense, since their "own crawler" is made up of so many different third-party sources.
6:01 pm on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't think it really matters if the commercial bots are blocked on your site if other sites are not blocking them. The whole idea was to get an idea of what is linking where, using content that someone else had already scraped.

I agree 100% with dazzlindonna on 'slick salesmanship'.

11:00 pm on Oct 18, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Heck, I'm still just pleased that my honeypot code was right about DotBot (Dotnetdotcom.org), which is at the top of their list. ;)

BTW, has anyone considered that maybe they don't use all those sources, and it's just a big fat smokescreen for the fact that they're much easier to disable than you think?

11:41 pm on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If I had to bet my 2 cents on it, I'd say the bots are coming from Amazon's compute cloud (amazonaws.com). Bill, I have that list banned and allowed in the same order, and the most recent scrape activity points to Amazon AWS.
12:45 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I know some of what they have listed comes from Amazon's compute cloud - see the Page-Store comments above. I feed tracking data to almost everything (except the Great Firewall of China, which is a real firewall) just to see where all that junk goes, but only DotBot showed up on my radar.

However, I'm still watching them ;)

2:40 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I normally wouldn't link to an article on this site but Rand's comment needs to be shared:

[sphinn.com...]

I wish I could comment, but the decisions made internally are firm. The information that's been released so far is the extent of what we're offering right now.

Which is basically a complete slap in the face to webmasters everywhere because we 100% deserve to know everything being done at our expense.

Linkscape itself was built because we wanted more disclosure on what the engines did, and felt that as a community, we deserved to have a resource from which to obtain it.

So which "community" are we talking about here?

I'm thinking it's the competitive spying community because some of the rest of us, the other "community", want nothing to do with it.

Now it's almost like they're taking Google's stance:
"we know what's best for you, shut up and sit down"

Sorry, you don't DESERVE anything that we the webmasters don't want you to have.

I have to stop now before the top of my head blows off and I say some stuff I shouldn't.

3:05 am on Oct 19, 2008 (gmt 0)

5+ Year Member



See, personally I love the "I wish I could comment" followed by 5 paragraphs. :)

[edited by: incrediBILL at 7:12 pm (utc) on Oct. 21, 2008]
[edit reason] see TOS #4 [/edit]

6:11 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



[sphinn.com...]

On the accusation that we don't have our own web crawl - patently false. Go use the tool, note how the information is different than any other publicly available source. Linkscape runs on a completely unique index we built from scratch.

OH GIVE ME A BREAK!

Has anyone here seen anything called SEOMOZ crawling your site, other than the one-hit accesses from his older tools, which already used the UA of SEOMOZ?

So what's the crawler, DotBot?

It's also based out of Seattle, hint hint, and run by an ex-Microsofter - is that it?

[edited by: incrediBILL at 6:13 am (utc) on Oct. 19, 2008]

6:36 am on Oct 19, 2008 (gmt 0)

5+ Year Member



It's not even that, Bill... now he's trying to claim he thought the word "crawl" was fully interchangeable with the word "index", so he can say, "Oh, but we just meant index all along!"

-Michael

[edited by: incrediBILL at 7:11 pm (utc) on Oct. 21, 2008]
[edit reason] see TOS #4 [/edit]

11:28 pm on Oct 21, 2008 (gmt 0)

WebmasterWorld Senior Member whoisgregg is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I have to stop now before the top of my head blows off and I say some stuff I shouldn't.

I've started at least half a dozen forum replies (on various sites) and a couple of blog posts about this subject, and ended up closing the window before hitting "submit" every time, for this very reason. :/

12:56 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wonder if sending SEOMoz an email asking for your domain to be removed from the data they are selling will work? Anyone tried it? Don't know if they broke any laws, but I imagine that doing things on such a vast scale is opening the door to a lot of blowback.

[edited by: incrediBILL at 7:20 pm (utc) on Oct. 29, 2008]
[edit reason] See TOS#26 [/edit]

7:18 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I wonder if sending SEOMoz an email asking for your domain to be removed from the data they are selling will work?

They should remove your domains and the data on your domains, but the backlinks don't belong to you - those are found on other domains - and you have no right to ask them to restrict information that doesn't belong to you.

Best I can tell a properly whitelisted robots.txt file would've stopped your domain from being indexed there in the first place.

For all those people that continue blacklisting, you never learn...
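
For anyone who hasn't built one, a whitelist robots.txt names the handful of bots you actually want and shuts out everything else by default. A minimal sketch (the allowed crawlers here are just examples - pick your own):

# whitelist robots.txt - only the named crawlers may fetch anything
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

# everyone else is disallowed site-wide
User-agent: *
Disallow: /

An empty Disallow line means "allow everything" for that bot; the catch-all block at the end turns away anyone not named. It only helps against bots that actually read and obey robots.txt, of course.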

7:21 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Best I can tell a properly whitelisted robots.txt file would've stopped your domain from being indexed there in the first place.

Gotta love that concept! I'm now using it for all new site launches, and I'm going back and retrofitting all sites on our network with a whitelist robots.txt file.

9:20 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Best I can tell a properly whitelisted robots.txt file would've stopped your domain from being indexed there in the first place.

Wish I could do this, but I rely on that info to make sure my files are up to date. I do stop all but about five bots after I capture their header data for both Owen (Ocean) and me.