"semanticdiscovery" revisited...

sleddogcafe was (sort of) right all along! :)

         

balam

3:27 pm on Jul 8, 2003 (gmt 0)

10+ Year Member



In early March a new bot hit the scene - "semanticdiscovery". The original thread about this is here [webmasterworld.com].

It dropped by last night, the first time I've really noticed it. But, looking back specifically in my early March logs, I see it did drop by for a couple of pages. I missed that - and the original thread - because I was on an extended vacation at the time, and there was just such a flood of *carp* when I got back that I couldn't catch up on everything...

Anyway, if you read the thread you'll see an important issue brought up:

sleddogcafe, who seems to represent Semantic Discovery, said at one point:

It's strange that we have included an email address in our UA since day one, wonder why you didn't see it before.

jdMorgan, gentleman representative of our community, responded:

Well, my logs show what they show. These are raw logs, so I suspect something went wrong on your end[...]

As it turns out, it can be argued both are right... (to some extent; I'm just presenting facts, not looking for trouble. :) )

An email address has been supplied all along, only it's in 'HTTP_FROM'!

:)

wilderness

11:32 pm on Jul 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



balam
I'm still hung up on their third party references to "customers."
http ://www.semanticdiscovery.com/sd/

And yet they don't outline a plan for sharing the profit with the webmasters they are mining from?

Don

jdMorgan

1:05 am on Jul 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



balam,

Well, I've been called a lot of things in my time, but...

I'm glad you got to the bottom of this, but the contents of the FROM header are invisible to the great number of webmasters whose server access logs are in Common Log Format or NCSA extended/combined log format.
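
To make the point about log formats concrete, here is a minimal Python sketch that splits an NCSA combined-format line into its fields. The regular expression and the `parse_combined` helper are illustrative assumptions, not anything from a particular log analyzer; the sample line is the Googlebot entry quoted further down in this post. The combined format records the referer and User-agent, but it has no slot for the From header.

```python
import re

# NCSA combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
# (hypothetical pattern written for this illustration)
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return the named fields of a combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


line = (
    '64.68.82.28 - - [08/Jul/2003:07:29:38 -0400] "GET /robots.txt HTTP/1.0" '
    '200 2646 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"'
)
fields = parse_combined(line)

# Every field the format carries; note that nothing here holds a From header.
print(sorted(fields))
print(fields["agent"])
```

Anything a spider wants the average webmaster to see therefore has to travel in one of these fields, and the User-agent is the only practical candidate.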

Therefore, harumph, ahem...

An open letter to all Web spider authors and users:


Dear spider authors/users,

Please include a spider information page URL and an e-mail contact address in your User-agent string. Here are some outstanding examples:

64.68.82.28 - - [08/Jul/2003:07:29:38 -0400] "GET /robots.txt HTTP/1.0" 200 2646 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
66.196.65.39 - - [08/Jul/2003:10:54:39 -0400] "GET /robots.txt HTTP/1.0" 200 2646 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.77.73.88 - - [08/Jul/2003:11:26:07 -0400] "GET /robots.txt HTTP/1.0" 200 2646 "-" "FAST-WebCrawler/3.7/FirstPage (atw-crawler at fast dot no;http://fast.no/support/crawler.asp)"

The spider information page should clearly identify the organization operating the spider, state the robot's purpose, and give the exact User-agent string that webmasters should enter in their robots.txt files to control indexing by your spider. As a courtesy, it would be nice if you also provided a link to the original Standard for Robots Exclusion and the newer RFC version.

Thank you,
jdMorgan


(It's simple, it's easy, and my real pages are much more tasty as spider-food than those nasty old 403 responses.)

Jim

balam

1:16 pm on Jul 10, 2003 (gmt 0)

10+ Year Member



wilderness, all I had to do was read the thread I earlier referenced to decide to add SD to my robots.txt. I expect the ban will be respected...
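
For anyone wanting to sanity-check a ban like this before relying on it, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `www.example.com` URL, and the assumption that the bot matches on the token `semanticdiscovery` are all illustrative; the token a given spider actually honors should come from its own information page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one bot entirely, leave everyone else alone.
# The token "semanticdiscovery" is an assumption based on the bot's name.
ROBOTS_TXT = """\
User-agent: semanticdiscovery
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what a well-behaved robot would conclude for each agent.
sd_allowed = parser.can_fetch("semanticdiscovery", "http://www.example.com/index.html")
other_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
print(sd_allowed, other_allowed)
```

This only shows how a compliant parser reads the file; whether the ban is respected is still up to the bot.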

jdMorgan, would "polite" be a more tasteful word? ;)

I'm glad you got to the bottom of this, but the contents of the FROM header are invisible to the great number of webmasters whose server access logs are in Common Log Format or NCSA extended/combined log format.

Indeed - My logs are in combined format and I only caught the email address due to some custom logging I do myself. And it's with that thought in mind that I add my signature to your open letter.
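
The custom logging itself is not described here, so the following is only a guess at the general idea, sketched in Python. In a CGI-style environment, request headers are exposed as `HTTP_*` variables, which is why the From header shows up as `HTTP_FROM`. The `describe_visitor` function, the sample address, and the sample header values are all hypothetical.

```python
def describe_visitor(environ):
    """Build a one-line summary of a request, including the From header if present."""
    addr = environ.get("REMOTE_ADDR", "-")
    agent = environ.get("HTTP_USER_AGENT", "-")
    sender = environ.get("HTTP_FROM", "-")  # absent for most browsers
    return f'{addr} "{agent}" from="{sender}"'


# Made-up request environment for illustration only.
sample = {
    "REMOTE_ADDR": "192.0.2.10",
    "HTTP_USER_AGENT": "examplebot/1.0",
    "HTTP_FROM": "bot-admin@example.com",
}
summary = describe_visitor(sample)
print(summary)
```

A line like this, written alongside the regular access log, is enough to surface an email address that the combined format would otherwise drop.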