
Project Kolinka Forum Search

Another bot that ignores robots.txt

         

Moparx

6:35 pm on Sep 14, 2005 (gmt 0)

10+ Year Member



Here is another bot that completely ignores robots.txt and nails your site (in this case forums) with requests.
As of 20 minutes ago it has used 31MB of bandwidth today (99% of it from a section I had blocked via robots.txt).

User Agent: Project Kolinka Forum Search (www.kolinka.com)
IP: 67.102.63.82 // h-67-102-63-82.phlapafg.covad.net

There is not much info given on their site aside from the fact that they are a project that is "developing a new way to search community driven web forums and message boards".

According to its domain information it is owned by eCatcher, Inc. / dealcatcher.com, which run online coupon/discount sites.

I plan on sending them an email in a few moments and I'll update the thread with their response.

Moparx

2:53 am on Sep 16, 2005 (gmt 0)

10+ Year Member



Well, it's been well over a day with no response from them, so my ban on their bot is going to be permanent.
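
For anyone who wants to do the same, here is a minimal .htaccess sketch. This assumes Apache with mod_rewrite enabled; the user-agent string and IP are the ones from my first post, so adjust them if the bot changes hosts.

```apache
# Deny Project Kolinka by user-agent or by its reported source IP
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Project Kolinka" [NC,OR]
RewriteCond %{REMOTE_ADDR} ^67\.102\.63\.82$
RewriteRule .* - [F,L]
```

The [F] flag returns a 403 Forbidden, which is cheaper than serving the pages it was grabbing.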

GaryK

3:05 am on Sep 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the update. I'll be banning them too. I can't stand crawlers that disrespect robots.txt.

JAB Creations

9:45 pm on Sep 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've talked to the guy who runs this. I've asked him about the bot and robots.txt and never really got anything out of him on that subject.

This bot is aimed at creating a search engine for JUST forums. A nifty idea indeed but yes...it is the most active bot in the past 9 months or so that has failed to ask for robots.txt...ever.

I'm not sure if he has it automated or still runs manual crawls...I think it was manual from what I remember.

guinsu

10:17 pm on Sep 19, 2005 (gmt 0)



Hi, this is Tim Patton, I'm the one working on Kolinka. First I'd like to apologize for hitting your forums and ignoring robots.txt. There was a glitch on our side and we are going to correct that. Until then we will suspend all crawls. We are in the pre-alpha phase and I am still writing a lot of code. Like I said, we will be correcting it ASAP and will not be crawling again until it is fixed.

For a little background, we are writing a search engine for all the forums out there. We don't feel Google crawls them deeply enough, and Google can also index extraneous pages. Our search engine will only index the content of your posts.

If there's anything else the crawler is doing that is not playing well with anyone's servers, we do want to know immediately so we can fix it.

I'd also like to ask the opinion of the members here. Many pieces of forum software block robots from accessing the printable version of their forums because of duplicate-content penalties in search engines like Google. Our search engine was designed to crawl the printable pages (because they are easier to parse and use less bandwidth than the post pages) but link back to the regular post pages (so your banner ads, navigation, etc. are still seen). We did this to make crawling easier and use less of your bandwidth.

We would like opinions on the best way to resolve this: we don't want to ignore a robots.txt ban on visiting printable pages, but we want to be able to include content that other search engines can index. The printable pages were only banned because other bots don't know how to handle forum software, whereas our bot was written with forums in mind.
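
For what it's worth, the robots.txt check that was being skipped is only a few lines with Python's standard-library robot parser. This is just a sketch of the general technique, not Kolinka's actual code; the user-agent string is the one reported earlier in this thread, and the URLs are examples.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, url: str,
              agent: str = "Project Kolinka Forum Search") -> bool:
    """Check a fetched robots.txt before requesting a URL --
    the step the crawler was skipping."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Example: a forum that blocks its printable pages for every bot.
rules = "User-agent: *\nDisallow: /printthread.php"
print(can_crawl(rules, "http://example.com/printthread.php?t=42"))  # False
print(can_crawl(rules, "http://example.com/showthread.php?t=42"))   # True
```

Running this check before every request (and re-fetching robots.txt periodically) is what keeps a crawler on webmasters' good side.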

Moparx

1:39 am on Sep 20, 2005 (gmt 0)

10+ Year Member



Hi Tim. Thanks for taking the time to clear some things up regarding your bot and your project.
I've decided to give you a chance and allow your bot access to my forums.

Many pieces of forum software block robots from accessing the printable version of their forums because of duplicate-content penalties in search engines like Google. Our search engine was designed to crawl the printable pages (because they are easier to parse and use less bandwidth than the post pages) but link back to the regular post pages (so your banner ads, navigation, etc. are still seen). We did this to make crawling easier and use less of your bandwidth.

We would like opinions on the best way to resolve this: we don't want to ignore a robots.txt ban on visiting printable pages, but we want to be able to include content that other search engines can index. The printable pages were only banned because other bots don't know how to handle forum software, whereas our bot was written with forums in mind.

I can understand what you are saying; unfortunately, I don't see a clear way around this fundamental problem with your current spidering technique. The reason is the one you mentioned: we use robots.txt to block duplicate content, which in effect renders your bot incapable of spidering once it obeys the file.

While I understand it is undesirable for you, the only way I can see to solve this dilemma while retaining good-bot status is to spider the actual threads instead. I noticed while watching your spider that it works perfectly well when it grabs all the thread IDs from the forum display, so with some slight modifications you should be able to fetch the posts almost as efficiently.
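
To illustrate the approach: once the spider has a forum listing page, collecting the thread IDs and building the regular post-page URLs is straightforward. The vBulletin-style URL pattern below is a hypothetical example, not Kolinka's code; real forum packages vary, so the pattern would need adjusting per package.

```python
import re

# Hypothetical vBulletin-style thread links; adjust per forum package.
THREAD_LINK = re.compile(r'showthread\.php\?t=(\d+)')

def thread_urls(forumdisplay_html: str,
                base: str = "http://example.com/forum/") -> list[str]:
    """Pull thread IDs out of a forum-listing page and build the
    regular post-page URLs, instead of hitting the printable pages
    that robots.txt blocks."""
    ids = sorted(set(THREAD_LINK.findall(forumdisplay_html)), key=int)
    return [f"{base}showthread.php?t={tid}" for tid in ids]

listing = ('<a href="showthread.php?t=101">Topic A</a> '
           '<a href="showthread.php?t=99">Topic B</a>')
print(thread_urls(listing))
```

The dedupe-and-sort step matters on real listings, where the same thread link appears several times (title, "last post" link, pagination).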

volatilegx

2:35 am on Sep 20, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi guinsu and welcome to WebmasterWorld. Well, I feel there certainly is a need for a search engine that indexes forums better than the ones that are already available. I don't have a specific answer to your question (I don't write spiders, I just track 'em), but I wish you well in your endeavor!

GaryK

4:18 am on Sep 20, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Perhaps one thing that could help would be something like those new AdSense section tags that tell AdSense what to pay attention to and what to ignore.

I'm already using them in my forums to try and get more relevant ads.
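
For anyone who hasn't seen them, the section tags are plain HTML comments wrapped around the parts of the page you want emphasized or ignored; something similar could work for a forum-specific crawler. A rough example of how they're placed in a forum template:

```html
<!-- google_ad_section_start(weight=ignore) -->
<div class="forum-nav">Breadcrumbs, banners, navigation...</div>
<!-- google_ad_section_end -->

<!-- google_ad_section_start -->
<div class="post-body">The actual post content goes here.</div>
<!-- google_ad_section_end -->
```

Because they're comments, browsers and other bots just skip over them.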