
Forum Moderators: Ocean10000 & incrediBILL


Who is Kosmix?

New crawler sighted in server logs.


grandma genie

4:12 pm on Aug 16, 2010 (gmt 0)

5+ Year Member

I found this new (to me anyway) crawler in my server logs. They just checked out robots.txt. User agent was voyager/2.0. IP is from www.kosmix.com/crawler.html. IP is Does anyone have any comments about this crawler? Should they be allowed? I have a terrible time with bots coming to my image-rich site and taking not only the images but also the content. I find my content in the strangest places.
Grandma-genie (Jeannie)


9:31 pm on Aug 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

I have the whole IP range starting with 38. banned for many years, nothing good ever comes from there.
Voyager is banned as well and for so long that I can't remember specifically anymore what it did wrong at the time ;o)
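
For anyone following along, a ban like the one described above might look roughly like this in .htaccess. This is only a sketch, assuming Apache 2.2-style access control with mod_setenvif available; the variable name bad_bot is made up for the example, and the exact rules used on the poster's server are not shown in the thread.

```apache
# Sketch only: assumes Apache 2.2-style access control and mod_setenvif.
# "bad_bot" is a hypothetical variable name chosen for this example.

# Flag any request whose User-Agent contains "voyager" (case-insensitive).
SetEnvIfNoCase User-Agent "voyager" bad_bot

# Allow everyone by default, then deny the listed exceptions.
Order Allow,Deny
Allow from all

# Every address that starts with 38. (the whole 38.0.0.0/8 block).
Deny from 38.0.0.0/8

# Any request flagged by the User-Agent test above.
Deny from env=bad_bot
```

Blocking a whole /8 is a blunt instrument: it also shuts out any ordinary visitors who happen to sit in that range.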


11:47 pm on Aug 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Visited my site as well
Fetched robots.txt and the root page.
The IP I have is
rDNS crawl0.kosmix.com

Page in the UA describes the bot and claims it obeys robots.txt.
Main page of the site says "The best of social media - filtered and organized by topic"

Seems well behaved so far.


4:24 am on Aug 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

@grandma genie: Use the "search" link atop every page to look for Kosmix in this forum's posts and you'll see it's been around/implicated a while, and that it typically hails from because that's its home base.

@Staffa: Ditto. And ditto:)

@FWIW: Their "The web organized for you" site reminds me of DMOZ. (ugh)


11:23 pm on Aug 17, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member

I've had the kosmix bot white-listed for years. Seems well behaved.

grandma genie

11:56 pm on Aug 17, 2010 (gmt 0)

5+ Year Member

Hmmmm... To block or not to block, that is the question... I'll keep an eye on it and if it does anything suspicious, out it goes. The websense bot was on my site, too. I don't think I'd like a censorship bot telling adults what to do. Maybe children if their parents say so, but not corporate spies. However, I would love a gizmo that bans whole countries, like China, Russia, Romania, Iran. You just have a list of countries and you click on the one you don't want, and away it goes. After all, I never visit their sites, so why are they always visiting mine? At the moment trying to ban all those IPs is too overwhelming.


1:47 am on Aug 18, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

websense: not only does it try to parse every pos.. page and get caught in the bot trap, it also changes its User Agent to random strings that most of the time don't make sense.

so I have a routine that checks:

if: IP contains '208.80.194.' or '208.80.195.' or '208.80.193.' or '208.80.192.'
then: serve status code 200, DISPLAY "Hello World", abort.

There might be more ranges, but that does it for me.
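
The routine above is pseudocode, and the thread doesn't say what it is actually written in. One way to get a similar effect with plain mod_rewrite in .htaccess is sketched below. It assumes mod_rewrite is enabled and that a small static file called hello.txt (a hypothetical name) containing "Hello World" sits in the site root.

```apache
# Sketch only: assumes mod_rewrite is enabled and that /hello.txt
# (hypothetical file containing "Hello World") exists in the site root.
RewriteEngine On

# Match the four ranges from the routine: 208.80.192. through 208.80.195.
RewriteCond %{REMOTE_ADDR} ^208\.80\.19[2-5]\.

# Skip the decoy file itself so the rewrite cannot loop.
RewriteCond %{REQUEST_URI} !^/hello\.txt$

# Serve the decoy for every other URL; the response is a normal 200.
RewriteRule .* /hello.txt [L]
```

Serving a harmless 200 instead of a 403 means the bot gets no obvious signal that it has been singled out.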


6:23 am on Aug 18, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

@grandma genie: have a look at maxmind.com and their geoip to country database; it's what you would love :o)


9:24 pm on Aug 18, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member

Websense IPs that I know of - ie they have plagued me: - (UK) - (UK) - (china) - (USA) - (USA)

All blocked.

grandma genie

11:35 pm on Aug 18, 2010 (gmt 0)

5+ Year Member

maxmind.com sounds helpful, but I'd have to have my host install it and at this point I don't trust my host to do anything right. He charges too much, too. So, I guess I will just block IPs. Thanks dstiles for the IPs you have blocked. It looks like Websense is all over the place. Lately my server logs are chock full of bots and visitors from everyplace but America. Lots come from Google Images and Yahoo Images - most foreign visitors, whom I do not sell to. I have just banned Googlebot-Image and Yahoo-MMCrawler in my htaccess file. Everyone seems to be making a fine living off my pictures. I sell stuffed animals and take my own pictures of them. All the image sites (Google images, Yahoo images, MSN, etc.) take them off my site freely. I find them everywhere. One visitor was from a nasty site (referrer in server logs) and ended up on my site looking for only God knows what (naked stuffed animals?). Very irritating. Maybe the course to take is to block everyone and only allow American visitors. What do you think of that?


1:22 am on Aug 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

You can't just allow American visitors because Internet addressing isn't country-consecutive. But you can block in other ways...

When necessary, I weed out loads and loads and LOADS of 'unwanted' visitors by rewriting piped top-level domains (TLDs). For example, from a U.S. medical site that has zero to offer anyone beyond its zip code (and re which Google's geo-specific setting does zilch):

RewriteCond %{REMOTE_HOST} \.(adsl|int|de|pl|hr|ro|ru|su|hk|ar|arpa|bg|br|bf|cz|tw|cn|fr|be|ua|bw)$ [OR]
RewriteCond %{REMOTE_HOST} \.(mx|hu|he|nl|se|za|sk|sa|ba|cl|ch|ba|co|cu|gh|id|ir|uk|gh|yu|bd|kz)$ [OR]
RewriteCond %{REMOTE_HOST} \.(lt|in|yt|cm|ma|ni|au|it|sg|my|pk|pt|ma|mobi|mu|np|si|tr|vn|tv|py|il|gy)$ [OR]
RewriteCond %{REMOTE_HOST} \.(lu|hm|hn|be|th|uk|cc|es|nu|jo|ca|dk|eu|md|ne|ug|is|md|ne|np|ve|ie|zm|zw|tt)$ [OR]

[Note: That example will not work as-is.]

Waaay back when, I kept blocked TLDs in alphabetical order; now I just add on whenever I see another new/odd TLD in the log. If you don't use mod_rewrite, you could adapt to "SetEnvIfNoCase Remote_Host". Or use a long list of "deny from" lines.
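
The excerpt above stops before any RewriteRule and leaves a trailing [OR]. For readers who want to try the idea, here is a minimal sketch of what a complete version could look like, together with the SetEnvIfNoCase alternative mentioned above. The TLD lists are shortened for illustration, the variable name blocked_tld is made up, and Apache 2.2-era syntax is assumed; this is not the poster's actual rule set.

```apache
# Sketch only: shortened TLD lists, Apache 2.2-era syntax assumed.

# Option 1: mod_rewrite
RewriteEngine On
RewriteCond %{REMOTE_HOST} \.(ru|su|cn|ro)$ [NC,OR]
# The last condition carries no [OR].
RewriteCond %{REMOTE_HOST} \.(ir|kz|vn)$ [NC]
# "-" leaves the URL untouched; [F] answers with 403 Forbidden.
RewriteRule .* - [F]

# Option 2: mod_setenvif plus access control
# "blocked_tld" is a hypothetical variable name chosen for this example.
SetEnvIfNoCase Remote_Host "\.(ru|su|cn|ro|ir|kz|vn)$" blocked_tld
Order Allow,Deny
Allow from all
Deny from env=blocked_tld
```

Both options match on the visitor's hostname, so they only catch visitors whose IP resolves to a name ending in one of the listed TLDs. Anything without reverse DNS slips through, which is one reason to pair this with IP-based blocking.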

Additionally, vis-a-vis IPs:

On other sites, I block my most troublesome countries' IPs using the "Country IP Blocks" site's FREE lists. Don't hold your breath for super-frequent updates but no matter. What they have is grrrrreat. I copy-paste the following format into .htaccess --


-- you might prefer the others:


Aside/FWIW: Speaking of unwanted visitors... Another of my sites has a page with "bomb" in the title (about a movie). It's more than a little disconcerting how many people from all over the place search for: "how to make a bomb". (shudders)


9:59 pm on Aug 19, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member

Grandma genie - you seem to have the opposite of my own experience. Here in the UK, over 40% of bad bots come from the US: 5120 out of 12643 (40.5%) for this month at time of writing, and that's not including Amazon, which is now banned in the IIS firewall but last month accounted for 500 hits on its own.

Last month websense accounted for 2123 hits. So far this month there have been 1217, so even worse than Amazon. :(

grandma genie

11:33 pm on Aug 20, 2010 (gmt 0)

5+ Year Member

Is there an easy way to differentiate between a real human visitor and a bot or a scraper? Since I have an e-commerce site, if people don't put anything in the shopping cart and just seem to be accessing lots of pages, I assume they are not there to buy anything. You have to put a product in the cart to determine shipping. Can you tell by the IP? The UA? Something else? Would love to get rid of all the clutter in my server logs.

As for foreign visitors, most of mine are from China, Russia, Romania, Brazil and I get lots from the UK and Australia, but they seem to just be looking.

Thanks for all the help and suggestions.



10:02 pm on Aug 21, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member

People are reluctant to disclose this kind of information as the bad guys may be watching.

Basically, kill server farms and clouds (by IP), allowing only known good bots in. Use a country-based IP database to ban countries. Look carefully at user-agents and headers to anticipate new problems.

And read the back numbers of this forum. :)

As I noted in another topic, in the UK we get FAR more "illegal" accesses from the USA than from our own country. Which is logical given the larger numbers involved. Proportionally there are potentially a lot more USA users with contaminated computers and the country also has a lot of server farms / clouds. Add the willingness of some net block operators to hire out to the bad boys. Which isn't to say we don't have that problem in the UK, just that the problem is smaller due to lower population and resources.

If they are real people looking at your site then I wouldn't worry.

If you're not trading in the countries you mention then block them: in particular a lot of bad stuff comes from China, Korea, Vietnam, Indonesia, Russia, Ukraine, Romania... It's a long list. :(

As to Kosmix, if you're trading in USA then allow it. I haven't had any bad experiences with it here (UK) and it's not a frequent visitor (to me) in any case (11 hits so far this month across a few dozen sites). If you're not in the USA then block it.
