They have data about several sites I am involved with... but I can't immediately see the UA of their bot in the stats.
I have seen a bunch of various UAs appearing at various sites I work with, starting somewhere around August 20th - many of which have subsequently been banned as their intent was deemed shady or unknown:
Java/1.6.0_07
Java/1.6.0_05
Java/1.6.0_04
Java/1.5.0_10
Java/1.4.1_04
libwww-perl/5.79
Mozilla/4.0
SimpleHttpClient/1.0
Is the SEOmoz bot using one of those, or is it some other?
[edited by: incrediBILL at 10:30 pm (utc) on Oct. 7, 2008]
[edit reason] added missing last line at OPs request [/edit]
However, that is a per-page disallow. What about robots.txt? Does it cater for
User-agent: seomoz
Disallow: /
in the same way?
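For anyone who wants to see how a rule like that is interpreted, here is a minimal sketch using Python's standard urllib.robotparser. The "seomoz" token comes from the question above; the example.com URL and the "Googlebot" comparison agent are only illustrative assumptions. It shows how a compliant parser reads the rule, not what any particular crawler actually does, and it only matters if a crawler identifying itself with that token fetches robots.txt in the first place.

```python
from urllib.robotparser import RobotFileParser

# The rule from the question above, exactly as it would sit in robots.txt.
ROBOTS_TXT = """\
User-agent: seomoz
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # hypothetical URL
    print("seomoz   :", is_allowed(ROBOTS_TXT, "seomoz", page))
    print("Googlebot:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

A parser like this blocks only the agent whose name matches the token and leaves every other agent alone, so the rule is useless against a bot that never announces itself under that name.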
g1smd, they don't have their own crawler, they were lying about that. Since robots.txt files generally aren't included as part of those other crawlers' indexes, moz would never even see a directive like that. They couldn't obey it even if they wanted to.
-Michael
DENY * Dotnetdotcom.org
DENY * Grub.org/Wikia
DENY * Page-Store.com
DENY * Amazon/Alexa’s crawl and internet archive resources
DENY * Exalead’s commercially available data
ALLOW * Gigablast’s commercially available data
ALLOW * Yahoo!’s BOSS API and other data sources
ALLOW * Microsoft’s Live API and other data sources
ALLOW * Google’s API and other data sources
ALLOW * Ask.com’s API and other data sources
DENY * Additional crawls from open source, commercial and academic projects
Page-Store caught my attention as I didn't remember seeing it before but the bot blocker had been stopping it reliably since 7/30/07.
The evolution of the Page-Store UAs was interesting:
"Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)"
"Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]"
"Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]"
"Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com"
It's been operating out of compute-1.amazonaws.com, which is already blocked for good reasons, and heritrix would've been blocked too, so batting 100% ;)
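For illustration, the kind of check described here (block on UA substrings, and separately on the reverse-DNS host) could be sketched roughly as below. This is not the actual bot blocker; the token list, the function name should_block, and the idea of passing in an already-resolved hostname are all assumptions made for the example.

```python
# Hypothetical blocklists built from the UAs and host quoted in this thread.
BLOCKED_UA_SUBSTRINGS = (
    "heritrix",
    "page-store",
    "java/",
    "libwww-perl",
    "simplehttpclient",
)
BLOCKED_HOST_SUFFIXES = ("compute-1.amazonaws.com",)


def should_block(user_agent, reverse_dns_host=""):
    """Return True if the UA or the (already resolved) hostname is blocklisted."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_SUBSTRINGS):
        return True
    # Compare on label boundaries so "notcompute-1.amazonaws.com" does not match.
    host = reverse_dns_host.lower().rstrip(".")
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in BLOCKED_HOST_SUFFIXES
    )


if __name__ == "__main__":
    samples = [
        ("Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)", ""),
        ("Java/1.6.0_07", ""),
        ("SomeBot/1.0", "ec2-1-2-3-4.compute-1.amazonaws.com"),
        ("Mozilla/5.0 (compatible; Googlebot/2.1)", ""),
    ]
    for ua, host in samples:
        print(should_block(ua, host), ua, host)
```

Substring matching is why every variant of the Page-Store UA gets caught: each one carries either "heritrix" or "page-store" somewhere in the string. A bare "Mozilla/4.0" would need an exact-match rule instead, since a substring match on it would hit far too many legitimate browsers.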
[edited by: incrediBILL at 7:25 pm (utc) on Oct. 17, 2008]
BTW, has anyone considered that maybe they don't use all those sources and it's just a big fat smokescreen for the fact that they're much easier to disable than you think?
However, I'm still watching them ;)
[sphinn.com...]
I wish I could comment, but the decisions made internally are firm. The information that's been released so far is the extent of what we're offering right now.
Which is basically a complete slap in the face to webmasters everywhere because we 100% deserve to know everything being done at our expense.
Linkscape itself was built because we wanted more disclosure on what the engines did, and felt that as a community, we deserved to have a resource from which to obtain it.
So which "community" are we talking about here?
I'm thinking it's the competitive spying community because some of the rest of us, the other "community", want nothing to do with it.
Now it's almost like they're taking Google's stance:
"we know what's best for you, shut up and sit down"
Sorry, you don't DESERVE anything that we the webmasters don't want you to have.
I have to stop now before the top of my head blows off and I say some stuff I shouldn't.
On the accusation that we don't have our own web crawl - patently false. Go use the tool, note how the information is different than any other publicly available source. Linkscape runs on a completely unique index we built from scratch.
OH GIVE ME A BREAK!
Has anyone here seen anything called SEOMOZ crawling your site, other than the one-hit accesses from his other, older tools which already used the UA of SEOMOZ?
So what's the crawler, DotBot?
It's also based out of Seattle, hint hint, run by an ex-Microsofter, is that it?
[edited by: incrediBILL at 6:13 am (utc) on Oct. 19, 2008]
[edited by: incrediBILL at 7:20 pm (utc) on Oct. 29, 2008]
[edit reason] See TOS#26 [/edit]
I wonder if sending SEOMoz an email asking for your domain to be removed from the data they are selling will work?
They should remove your domains and the data on your domains but the backlinks don't belong to you, those are found on other domains, and you have no right to ask them to restrict information that doesn't belong to you.
Best I can tell a properly whitelisted robots.txt file would've stopped your domain from being indexed there in the first place.
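To make "whitelisted" concrete: the idea is to name the few bots you want and disallow everyone else, rather than chasing bad bots one at a time. A rough sketch follows, assuming a hypothetical allow list (Googlebot, Slurp, msnbot) and a helper named build_whitelist_robots; neither comes from this thread, so adjust to taste. It generates the file and then checks it with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def build_whitelist_robots(allowed_agents):
    """Build a robots.txt that admits only the named agents."""
    blocks = []
    for agent in allowed_agents:
        # An empty Disallow line means "nothing is off limits" for this agent.
        blocks.append(f"User-agent: {agent}\nDisallow:\n")
    # Everyone not named above falls through to the catch-all and is shut out.
    blocks.append("User-agent: *\nDisallow: /\n")
    return "\n".join(blocks)


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical allow list; pick the bots you actually want.
    robots = build_whitelist_robots(["Googlebot", "Slurp", "msnbot"])
    print(robots)
    page = "http://example.com/page.html"  # hypothetical URL
    for ua in ("Googlebot", "DotBot", "Java/1.6.0_07"):
        print(ua, "->", is_allowed(robots, ua, page))
```

Note that this only stops crawlers that read and honor robots.txt. Anything that ignores the file still has to be dealt with at the server level.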
For all those people that continue blacklisting, you never learn...