

semanticdiscovery/0.1

A new information harvester


jdMorgan

5:05 pm on Mar 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Found this one a while ago:
69.3.78.160 - - [07/Mar/2003:10:00:55 -0500] "GET /robots.txt HTTP/1.1" 200 2116 "-" "semanticdiscovery/0.1" 
69.3.78.160 - - [07/Mar/2003:10:00:59 -0500] "GET / HTTP/1.1" 200 34118 "-" "semanticdiscovery/0.1"

It did check robots.txt. However, since I've never seen it before, I had no Disallow for it. I have put a Disallow in place in case it comes back. If that doesn't work, there are stronger fortifications beyond the wire.
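For reference, a minimal robots.txt entry for this would look like the sketch below (the user-agent token is taken from the log lines above; whether the spider actually honors it is exactly what a follow-up visit would test):

```
User-agent: semanticdiscovery
Disallow: /
```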

I'd like to see a Webmaster info page on their site, and a URL in the robot's user-agent string. From what I saw on their web site (found via Google), it looks like just another bandwidth drain. :(

If it comes back, I'll post a follow-up on robots.txt compliance.

Jim

carfac

11:53 pm on Mar 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim;

Hey! Just ran a quick grep - didn't see this in my logs, so nothing to offer you...
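That kind of quick log check can be sketched as a grep one-liner. Everything below is illustrative: it builds a throwaway sample log (mirroring the lines posted above) instead of reading a real one such as /var/log/apache/access_log.

```shell
# Build a tiny sample access log, then count requests from the crawler's UA.
# In practice you would point grep at your real access log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
69.3.78.160 - - [07/Mar/2003:10:00:55 -0500] "GET /robots.txt HTTP/1.1" 200 2116 "-" "semanticdiscovery/0.1"
192.0.2.10 - - [07/Mar/2003:10:01:00 -0500] "GET / HTTP/1.1" 200 512 "-" "Mozilla/4.0"
EOF
grep -c "semanticdiscovery" "$log"   # prints 1 (one matching request)
```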

dave

daroz

8:49 pm on Mar 8, 2003 (gmt 0)

10+ Year Member



I think I found their Website... (Thank you new Google Index)

If this is against the ToS I apologize but I'll mask the URL to prevent a direct link. (I'm not affiliated with them in any way.. etc etc)

http*//www.semanticdiscovery.com/sd/products.html

wilderness

1:05 pm on Mar 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Had a visit from these folks this AM.
This explains a lot of previously unidentified traffic from both Utah and Boulder, CO.

I've denied their entire domain.
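Denying a whole domain in Apache can be sketched like this (illustrative .htaccess syntax for Apache 1.3/2.0; note that `Deny from` a hostname makes Apache do a double-reverse DNS lookup on each request, so some prefer denying by IP range or user-agent instead):

```
# Refuse any request whose client reverse-resolves under semanticdiscovery.com
Order Allow,Deny
Allow from all
Deny from .semanticdiscovery.com
```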

Thanks for the link Daroz.

cchooper

4:05 am on Mar 16, 2003 (gmt 0)

10+ Year Member



wow, a spam spider! O_o potential customers, my ass =] thanks for this info, i'll add it before i even get hit ;)

wilderness

5:05 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>O_o potential customers</snip>

Suggest you read "Conditions of Service"
[webmasterworld.com...]

sleddogcafe

5:23 pm on Mar 17, 2003 (gmt 0)

10+ Year Member



This is a real spider, not a spam spider (whatever that is, we are not harvesting email addresses for any reason). Our company is building a new search engine. We do check the robots.txt file, and believe we comply with robot rules.

Our user-agent does supply an email address for you to contact us; if our spiders are bothering you, just put a robots.txt file out there (a proper one!) or email the address we provide.

Nancy

weesnich

5:44 pm on Mar 17, 2003 (gmt 0)

10+ Year Member



I suggest you add a URL to your spider's UA describing:

- who you (or better, your company) are
- why you spider (building a search engine)
- what other purposes the collected data will be used for (if any; whether it is sold to third parties, etc.)
- that your robot is friendly and obeys robots.txt (and which name to use for it in robots.txt)
- how to contact you

This makes the decision whether to block a robot much easier for us. Many (most) respected search engines have such a page.

jdMorgan

6:16 pm on Mar 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nancy,

Welcome to WebmasterWorld [webmasterworld.com]!

Our user-agent does supply an email address for you to contact us

I presume that this e-mail address was added after the date in the log entry I posted, since the posted log entry shows the entire user-agent provided at that time. The inclusion of a URL, as suggested by weesnich, is an even better idea; otherwise, you will have to handle that e-mail, as opposed to just letting webmasters check a web page for information on your spider.

Here are two examples, one from FAST, which provides e-mail and an informational web page, and the other from Google, which provides a web page.

66.77.73.88 - - [17/Mar/2003:03:03:32 -0500] "GET / HTTP/1.0" 200 36799 "-" "FAST-WebCrawler/3.7/FirstPage (atw-crawler at fast dot no;http://fast.no/support/crawler.asp)"

64.68.82.38 - - [17/Mar/2003:00:05:24 -0500] "GET / HTTP/1.0" 200 37104 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

As Webmasters and Web site operators, we pay for the bandwidth used on our sites. It is not unreasonable to ask that unknown high-bandwidth visitors identify themselves before being allowed to consume our resources.

Please understand that many of our sites are subject to a veritable onslaught of e-mail address harvesters, marketing awareness 'bots, and others which can consume 50% of our bandwidth if left unchecked. Often, the only way to tell the good from the bad is whether the user-agent obeys robots.txt and identifies itself clearly. It is also important that it identifies itself in a manner that is convenient for us to check; I simply do not have time to send e-mails and wait for replies. If I cannot find out in one click who it is that just downloaded half my site, and why, that user-agent or IP gets blocked. I'm sorry, but I can't afford the bandwidth for unknown 'bots - it comes off my bottom line.

I have modified my user-agent check to allow the symanticdiscovery user-agent if it includes either an e-mail address or a link to a spider information page. When I see this information, I will check it so as to make a more informed decision about the use of my server resources.
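That kind of conditional allow can be sketched with Apache mod_rewrite (illustrative only, not Jim's actual filter code): forbid requests from the bare user-agent unless the UA string also carries contact info.

```
RewriteEngine On
# Block semanticdiscovery unless its UA also contains an e-mail "@"
# or an "http" URL pointing at a spider info page.
RewriteCond %{HTTP_USER_AGENT} semanticdiscovery [NC]
RewriteCond %{HTTP_USER_AGENT} !(@|http) [NC]
RewriteRule .* - [F]
```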

Thanks,
Jim

sleddogcafe

6:49 pm on Mar 17, 2003 (gmt 0)

10+ Year Member



Jim, thanks for the pointers -- we will add an informational page for webmasters, like several of you guys have described. It's strange, though: we have included an email address in our UA since day one, so I wonder why you didn't see it before.

Without spelling the company name out :-) I wanted to let you know that you enabled symanticd... not semanticd... in your robots file.

Thanks for all the pointers everyone.
Nancy

jdMorgan

7:36 pm on Mar 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nancy,

Well, my logs show what they show. These are raw logs, so I suspect something went wrong on your end, and this merits checking.

As to symantic vs. semantic, I guess I've just typed "Symantec/Norton Anti-Virus" too many times... The error is only here, not in my site access filter code.

Best,
Jim

weesnich

9:14 pm on Mar 17, 2003 (gmt 0)

10+ Year Member



Hi Nancy,

I got the same UA as Jim from the semanticdiscovery-bot a few days later (xxx-ed some numbers for privacy).

69.3.78.160 - - [11/Mar/2003:XX:XX:XX +0100] "GET /robots.txt HTTP/1.1" 200 XXX "-" "semanticdiscovery/0.1"
69.3.78.160 - - [11/Mar/2003:XX:XX:XX +0100] "GET / HTTP/1.1" 200 XXX "-" "semanticdiscovery/0.1"

Regards, Weesnich

sleddogcafe

9:20 pm on Mar 17, 2003 (gmt 0)

10+ Year Member



Hmmm, well I guess it's good that I changed the new UA to include a URL that points to robot info :). I thought that the UA would correctly append the email address I specified in the robot's configuration, but I guess not. The next time your sites hear from us, our sig will look like this:

semanticdiscovery/0.2(http://www.semanticdiscovery.com/sd/robot.html)

Thanks for all the sanity checking and tips!
Nancy

cchooper

9:04 pm on Mar 18, 2003 (gmt 0)

10+ Year Member



Well then, let me start by apologizing for jumping to conclusions about the spider. Normally, when I read about organizations collecting my (albeit public) information, and see the word "customer" anywhere around, I get a little nervous.

Now let me also say that I never doubted you guys from the beginning *whistles innocently as he removes the entry from robots.txt*

good luck crawling ^_^

wilderness

10:38 pm on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>he removes the entry from</snip>

A bit hasty for me.
From what I read on their product page, isn't semantic still in the business of gathering data from your/my websites to provide to its customers?

BTW, I happened to be cleaning up some bookmarked older searches yesterday and stumbled across the old Oingo (sp?) SE, which semantics-something-or-other had taken over and was developing.
I don't recall ever using that particular SE or when I had it bookmarked. Shame I didn't make a note of the name before I deleted it :(

jdMorgan

11:30 pm on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oingo -> AppliedSemantics dot com

Jim

fiestagirl

8:48 pm on Mar 25, 2003 (gmt 0)

10+ Year Member



sleddogcafe,

Does this mean that you are planning on giving up using the rico/0.1 user agent also?

sleddogcafe

3:47 pm on Mar 26, 2003 (gmt 0)

10+ Year Member



Yes, rico was our first version; it has been replaced with semanticdiscovery (first 0.1, then 0.2, which has the pointer to a URL with the robot info y'all were interested in seeing).

Nancy