69.3.78.160 - - [07/Mar/2003:10:00:55 -0500] "GET /robots.txt HTTP/1.1" 200 2116 "-" "semanticdiscovery/0.1"
69.3.78.160 - - [07/Mar/2003:10:00:59 -0500] "GET / HTTP/1.1" 200 34118 "-" "semanticdiscovery/0.1"
It did check robots.txt. However, since I've never seen it before, I had no Disallow for it. I have put a Disallow in place in case it comes back. If that doesn't work, there are stronger fortifications beyond the wire.
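For reference, the entry I added looks roughly like this (assuming the crawler uses its UA token, "semanticdiscovery", as its robots.txt name — that name is my guess from the log lines above):

```
# Hypothetical robots.txt entry; the "semanticdiscovery" token is
# a guess based on the user-agent string shown in the logs.
User-agent: semanticdiscovery
Disallow: /
```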
I'd like to see a Webmaster info page on their site, and a URL in the robot's user-agent string. From what I saw on their web site (found via Google), it looks like just another bandwidth drain. :(
If it comes back, I'll post a follow-up on robots.txt compliance.
Jim
Suggest you read "Conditions of Service"
[webmasterworld.com...]
Our user-agent does supply an email address for you to contact us; if our spiders are bothering you, just put a robots.txt file out there (a proper one!) or email the address we provide.
Nancy
I'd suggest putting up a page on your site that states:
-who you (or, better, your company) are
-why you spider
(e.g. building a search engine)
-what other purposes the collected data will be used for
(if any; whether it is sold to third parties, etc.)
-that your robot is friendly and obeys robots.txt
(and which name to use for it in robots.txt)
-how to contact you
This makes the decision whether or not to block a robot much easier for us. Many (most) respected search engines have such a page.
Welcome to WebmasterWorld [webmasterworld.com]!
Our user-agent does supply an email address for you to contact us
I presume that this e-mail address was added after the date in the log entry I posted, since the posted log entry shows the entire user-agent provided at that time. The inclusion of a URL, as suggested by weesnich, is an even better idea; otherwise, you will have to handle that e-mail, instead of just letting webmasters check a web page for information about your spider.
Here are two examples, one from FAST, which provides e-mail and an informational web page, and the other from Google, which provides a web page.
66.77.73.88 - - [17/Mar/2003:03:03:32 -0500] "GET / HTTP/1.0" 200 36799 "-" "FAST-WebCrawler/3.7/FirstPage (atw-crawler at fast dot no;http://fast.no/support/crawler.asp)"
64.68.82.38 - - [17/Mar/2003:00:05:24 -0500] "GET / HTTP/1.0" 200 37104 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
As Webmasters and Web site operators, we pay for the bandwidth used on our sites. It is not unreasonable to ask that unknown high-bandwidth visitors identify themselves before being allowed to consume our resources.
Please understand that many of our sites are subject to a veritable onslaught of e-mail address harvesters, marketing awareness 'bots, and others which can consume 50% of our bandwidth if left unchecked. Often, the only way to tell the good from the bad is whether the user-agent obeys robots.txt and identifies itself clearly. It is also important that it identifies itself in a manner that is convenient for us to check; I simply do not have time to send e-mails and wait for replies. If I cannot find out in one click who it is that just downloaded half my site, and why, that user-agent or IP gets blocked. I'm sorry, but I can't afford the bandwidth for unknown 'bots - it comes off my bottom line.
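For what it's worth, that kind of block can be done at the server level. Here's a rough .htaccess sketch for Apache using mod_setenvif (the UA substring is just a placeholder, not any bot named in this thread):

```
# Rough sketch, not my exact rules: deny requests whose User-Agent
# contains a given substring, via Apache's mod_setenvif.
SetEnvIfNoCase User-Agent "unwanted-bot" block_this_bot
Order Allow,Deny
Allow from all
Deny from env=block_this_bot
```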
I have modified my user-agent check to allow the symanticdiscovery user-agent if it includes either an e-mail address or a link to a spider information page. When I see this information, I will check it so as to make a more informed decision about the use of my server resources.
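A sketch of that sort of check in Python (my own illustration, not my actual filter code; the regexes are assumptions about what counts as contact info):

```python
import re

# Allow an unknown bot only if its UA string carries contact info:
# a plain e-mail address, an obfuscated one ("user at host dot com",
# as FAST-WebCrawler writes it), or an http URL.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
OBFUSCATED_RE = re.compile(r'\S+ at \S+ dot \S+', re.IGNORECASE)
URL_RE = re.compile(r'https?://\S+')

def has_contact_info(user_agent):
    """Return True if the UA string identifies a contact point."""
    return any(rx.search(user_agent)
               for rx in (EMAIL_RE, OBFUSCATED_RE, URL_RE))
```

Against the UA strings quoted in this thread, this would pass the FAST and Google examples and reject the bare "semanticdiscovery/0.1" string.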
Thanks,
Jim
Without spelling the company name out :-) I wanted to let you know that you enabled symanticd... not semanticd... in your robots file.
Thanks for all the pointers everyone.
Nancy
Well, my logs show what they show. These are raw logs, so I suspect something went wrong on your end, and this merits checking.
As to symantic vs. semantic, I guess I've just typed "Symantec/Norton Anti-Virus" too many times... The error is only here, not in my site access filter code.
Best,
Jim
I got the same UA as Jim from the semanticdiscovery-bot a few days later (xxx-ed some numbers for privacy).
69.3.78.160 - - [11/Mar/2003:XX:XX:XX +0100] "GET /robots.txt HTTP/1.1" 200 XXX "-" "semanticdiscovery/0.1"
69.3.78.160 - - [11/Mar/2003:XX:XX:XX +0100] "GET / HTTP/1.1" 200 XXX "-" "semanticdiscovery/0.1"
Regards, Weesnich
semanticdiscovery/0.2(http://www.semanticdiscovery.com/sd/robot.html)
Thanks for all the sanity checking and tips!
Nancy
Now let me also say that I never doubted you guys from the beginning *whistles innocently as he removes the entry from robots.txt*
good luck crawling ^_^
A bit hasty for me.
From what I read on their product page, isn't Semantic Discovery still in the business of gathering data from your/my websites to provide to its customers?
BTW, I happened to be cleaning up some bookmarked older searches yesterday and stumbled across the old Oingo (sp?) SE, which some "semantics" something-or-other company had taken over and was developing.
I don't recall ever using that particular SE, or when I had bookmarked it. Shame I didn't make a note of the name before I deleted it :(