Forum Moderators: open
User Agent: Project Kolinka Forum Search (www.kolinka.com)
IP: 67.102.63.82 // h-67-102-63-82.phlapafg.covad.net
There is not much info given on their site aside from the fact that they are a project that is "developing a new way to search community driven web forums and message boards".
According to its domain registration information, it is owned by eCatcher, Inc. / dealcatcher.com, which run online coupon/discount sites.
I plan on sending them an email in a few moments, and I'll update the thread with their response.
This bot is aimed at creating a search engine for JUST forums. A nifty idea indeed, but yes...it is the most active bot in the past 9 months or so that has never once asked for robots.txt.
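For anyone curious what "asking for robots.txt" should look like, here's a minimal sketch of what a well-behaved crawler does before fetching anything, using Python's standard `urllib.robotparser`. The Disallow path and bot name below are hypothetical, just for illustration:

```python
from urllib import robotparser

# A polite bot fetches /robots.txt first and honors its rules.
# Here we parse an example rule set directly instead of fetching,
# to show the check itself (the rules are made up).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /printthread.php",
])

# Every URL gets checked against the parsed rules before a request is made.
print(rp.can_fetch("KolinkaBot", "/showthread.php?t=123"))   # True (allowed)
print(rp.can_fetch("KolinkaBot", "/printthread.php?t=123"))  # False (disallowed)
```

A bot that skips this step entirely, like the one being discussed, will happily crawl pages the site owner has explicitly put off limits.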
I'm not sure if he has it automated or still runs manual crawls...from what I remember, I think it was manual.
For a little background, we are writing a search engine for all the forums out there. We don't feel that Google crawls them deeply enough, and Google can also index extraneous pages. Our search engine will only index the content of your posts.
If there's anything else the crawler is doing that is not playing well with anyone's servers, we do want to know immediately so we can fix it.
I'd also like to ask the opinion of the members here.

Many pieces of forum software block robots from accessing the printable version of their forums. This is because of duplicate-content penalties and such in search engines like Google. Our search engine was designed to crawl the printable pages (because they are easier to parse and use less bandwidth than the post pages) but link back to the regular post pages (so your banner ads, navigation, etc. are still seen). We did this to make crawling easier and to use less of your bandwidth.

We would like opinions on the best way to resolve this: we don't want to ignore the robots.txt ban on visiting printable pages, but we want to be able to include content that other search engines can index. The printable pages were only banned because other bots don't know how to handle forum software, whereas our bot was written with forums in mind.
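One possible middle ground, if forum admins wanted to opt in, is a per-bot exception in robots.txt. This is a hypothetical sketch (the bot name is illustrative, and note that `Allow` is an extension honored by major crawlers rather than part of the original robots.txt convention):

```
# Let a forum-aware bot read printable pages while
# keeping generic crawlers out of them.
User-agent: KolinkaBot
Allow: /printthread.php

User-agent: *
Disallow: /printthread.php
```

This keeps the duplicate-content block in place for everyone else, but it only works if admins actively add the rule, which most never will.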
I can understand what you are saying; unfortunately, I don't see a clear way to get around this fundamental problem with your current spidering technique. The reason is the same one you mentioned: our use of the robots.txt file to block duplicate content in effect renders your bot incapable of spidering (once it obeys the file).
While I understand it is undesirable for you, the only way I can see to solve this dilemma and retain good-bot status is if you spidered the actual threads instead. I noticed while watching your spider that it works perfectly fine when it grabs all the thread IDs from the forum display, so with some slight modifications you should be able to fetch the posts almost as well.
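The suggested approach above is straightforward to sketch: scrape thread IDs from a forum index page and build the regular (non-printable) thread URLs from them. The URL pattern here is hypothetical, in the style of vBulletin-like software; real forums vary:

```python
import re

def extract_thread_ids(forum_html):
    """Pull unique thread IDs out of a forum display page.

    Assumes thread links look like showthread.php?t=12345,
    which is a made-up but typical pattern.
    """
    ids = set(re.findall(r'showthread\.php\?t=(\d+)', forum_html))
    return sorted(ids, key=int)

# Example forum-index snippet with two thread links.
html = '<a href="showthread.php?t=42">Topic A</a> <a href="showthread.php?t=7">Topic B</a>'
print(extract_thread_ids(html))  # ['7', '42']
```

The crawler would then fetch `showthread.php?t=<id>` for each ID, which is exactly the page robots.txt allows, instead of the blocked printable version.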