
Search Engine Spider and User Agent Identification Forum

    
Who is Kosmix?
New crawler sighted in server logs.
grandma genie



 
Msg#: 4188052 posted 4:12 pm on Aug 16, 2010 (gmt 0)

Hi,
I found this new (to me, anyway) crawler in my server logs. They just checked out robots.txt. User agent was voyager/2.0. IP is from www.kosmix.com/crawler.html. IP is 38.11.234.180. Does anyone have any comments about this crawler? Should they be allowed? I have a terrible time with bots coming to my image-rich site and not only taking the images, but also the content. I find my content in the strangest places.
Grandma-genie (Jeannie)

 

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4188052 posted 9:31 pm on Aug 16, 2010 (gmt 0)

I have the whole IP range starting with 38. banned for many years, nothing good ever comes from there.
Voyager is banned as well and for so long that I can't remember specifically anymore what it did wrong at the time ;o)

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4188052 posted 11:47 pm on Aug 16, 2010 (gmt 0)

Visited my site as well
Fetched robots.txt and the root page.
The IP I have is 38.113.234.180
rDNS crawl0.kosmix.com

Page in the UA describes the bot and claims it obeys robots.txt.
Main page of the site says "The best of social media - filtered and organized by topic"

Seems well behaved so far.
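
For anyone who wants to script the rDNS check above: a minimal sketch of a forward-confirmed reverse DNS test, assuming the crawler's hostnames sit under kosmix.com as the crawl0.kosmix.com result suggests. The function names are made up for illustration, and the resolvers are passed in as parameters so the logic can be tried without touching the network.

```python
import socket

def reverse_dns(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def forward_dns(hostname):
    """Return the IPv4 addresses hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []

def is_confirmed_crawler(ip, domain, rdns=reverse_dns, fdns=forward_dns):
    """True if ip reverse-resolves into domain AND that name resolves back to ip."""
    host = rdns(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    # Require the exact domain or a real subdomain, not just a matching substring.
    if host != domain and not host.endswith("." + domain):
        return False
    # The forward lookup guards against a faked PTR record.
    return ip in fdns(host)
```

Calling `is_confirmed_crawler("38.113.234.180", "kosmix.com")` would do the live lookups; whether they still succeed depends on the current DNS records.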

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4188052 posted 4:24 am on Aug 17, 2010 (gmt 0)

@grandma genie: Use the "search" link atop every page to look for Kosmix in this forum's posts and you'll see it's been around/implicated a while, and that it typically hails from 38.0.0.0/8 because that's its home base.

@Staffa: Ditto. And ditto:)

@FWIW: Their "The web organized for you" site reminds me of DMOZ. (ugh)

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member



 
Msg#: 4188052 posted 11:23 pm on Aug 17, 2010 (gmt 0)

I've had the kosmix bot white-listed for years. Seems well behaved.

grandma genie



 
Msg#: 4188052 posted 11:56 pm on Aug 17, 2010 (gmt 0)

Hmmmm... To block or not to block, that is the question... I'll keep an eye on it and if it does anything suspicious, out it goes. The websense bot was on my site, too. I don't think I'd like a censorship bot telling adults what to do. Maybe children if their parents say so, but not corporate spies. However, I would love a gizmo that bans whole countries, like China, Russia, Romania, Iran. You just have a list of countries and you click on the one you don't want, and away it goes. After all, I never visit their sites, so why are they always visiting mine? At the moment trying to ban all those IPs is too overwhelming.

blend27

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4188052 posted 1:47 am on Aug 18, 2010 (gmt 0)

websense: not only does it try to parse every pos.. page and get into the bot trap, it also changes its User Agent to random strings that most of the time don't make sense.

so I have a routine that checks:

if: IP contains '208.80.194.' or '208.80.195.' or '208.80.193.' or '208.80.192.'
then: serve status code 200, DISPLAY "Hello World", abort.

There might be more ranges, but that does it for me.
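
The routine above could be sketched like this in Python. It is only an illustration: the handler shape (`respond` returning a status and body, or `None` to carry on as normal) is an assumption, not the poster's actual code. The four prefixes 208.80.192. through 208.80.195. together make up a single /22, so one network test covers them.

```python
import ipaddress

# 208.80.192.x through 208.80.195.x, the four prefixes from the post,
# are together exactly one /22 network.
WEBSENSE_NET = ipaddress.ip_network("208.80.192.0/22")

def respond(remote_ip):
    """Return (status, body) for matching IPs, else None (hypothetical handler)."""
    if ipaddress.ip_address(remote_ip) in WEBSENSE_NET:
        # Serve a harmless 200 page and stop, as in the routine above.
        return 200, "Hello World"
    return None  # not in the range: continue with normal request handling
```

Comparing parsed addresses instead of strings means the check does not depend on how the IP happens to be formatted.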

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4188052 posted 6:23 am on Aug 18, 2010 (gmt 0)

@grandma genie: have a look at maxmind.com and their geoip to country database; it's what you would love :o)

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member



 
Msg#: 4188052 posted 9:24 pm on Aug 18, 2010 (gmt 0)

Websense IPs that I know of, i.e. the ones that have plagued me:

85.115.48.0 - 85.115.63.255 (UK)
91.194.158.0 - 91.194.159.255 (UK)
114.255.30.192 - 114.255.30.223 (China)
204.15.64.0 - 204.15.71.255 (USA)
208.80.192.0 - 208.80.199.255 (USA)

All blocked.
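
Most firewalls and .htaccess rules want CIDR notation rather than start/end pairs. A small sketch using Python's standard `ipaddress` module to convert the ranges listed above; the helper names `to_cidrs` and `all_cidrs` are made up for this example.

```python
import ipaddress

# Start/end pairs copied from the list above.
RANGES = [
    ("85.115.48.0", "85.115.63.255"),
    ("91.194.158.0", "91.194.159.255"),
    ("114.255.30.192", "114.255.30.223"),
    ("204.15.64.0", "204.15.71.255"),
    ("208.80.192.0", "208.80.199.255"),
]

def to_cidrs(first, last):
    """Collapse one inclusive IP range into the CIDR blocks that cover it exactly."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in nets]

def all_cidrs(ranges=RANGES):
    """Convert every range and return one flat list of CIDR strings."""
    result = []
    for first, last in ranges:
        result.extend(to_cidrs(first, last))
    return result

if __name__ == "__main__":
    for cidr in all_cidrs():
        print(cidr)
```

Each of the five ranges happens to align to a single block: 85.115.48.0/20, 91.194.158.0/23, 114.255.30.192/27, 204.15.64.0/21 and 208.80.192.0/21.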

grandma genie



 
Msg#: 4188052 posted 11:35 pm on Aug 18, 2010 (gmt 0)

maxmind.com sounds helpful, but I'd have to have my host install it and at this point I don't trust my host to do anything right. He charges too much, too. So, I guess I will just block IPs. Thanks dstiles for the IPs you have blocked. It looks like Websense is all over the place. Lately my server logs are chock full of bots and visitors from everyplace but America. Lots come from Google Images and Yahoo Images, mostly foreign visitors, whom I do not sell to. I have just banned Googlebot-Image and Yahoo-MMCrawler in my htaccess file. Everyone seems to be making a fine living off my pictures. I sell stuffed animals and take my own pictures of them. All the image sites (Google images, Yahoo images, MSN, etc.) take them off my site freely. I find them everywhere. One visitor was from a nasty site (referrer in server logs) and ended up on my site looking for only God knows what (naked stuffed animals?). Very irritating. Maybe the course to take is to block everyone and only allow American visitors. What do you think of that?

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4188052 posted 1:22 am on Aug 19, 2010 (gmt 0)

You can't just allow American visitors because Internet addressing isn't country-consecutive. But you can block in other ways...

When necessary, I weed out loads and loads and LOADS of 'unwanted' visitors by rewriting piped top-level domains (TLDs). For example, from a U.S. medical site that has zero to offer anyone beyond its zip code (and re which Google's geo-specific setting does zilch):

RewriteCond %{REMOTE_HOST} \.(adsl|int|de|pl|hr|ro|ru|su|hk|ar|arpa|bg|br|bf|cz|tw|cn|fr|be|ua|bw)$ [OR]
RewriteCond %{REMOTE_HOST} \.(mx|hu|he|nl|se|za|sk|sa|ba|cl|ch|ba|co|cu|gh|id|ir|uk|gh|yu|bd|kz)$ [OR]
RewriteCond %{REMOTE_HOST} \.(lt|in|yt|cm|ma|ni|au|it|sg|my|pk|pt|ma|mobi|mu|np|si|tr|vn|tv|py|il|gy)$ [OR]
RewriteCond %{REMOTE_HOST} \.(lu|hm|hn|be|th|uk|cc|es|nu|jo|ca|dk|eu|md|ne|ug|is|md|ne|np|ve|ie|zm|zw|tt)$ [OR]
(etc.)

[Note: That example will not work as-is.]

Waaay back when, I kept blocked TLDs in alphabetical order; now I just add on whenever I see another new/odd TLD in the log. If you don't use mod_rewrite, you could adapt to "SetEnvIfNoCase Remote_Host". Or use a long list of "deny from" lines.
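
To show what the piped-TLD pattern above actually matches, here is a rough Python equivalent. It is a sketch only: the list is cut down to the first few TLDs from the first RewriteCond line, and `is_blocked_host` is a made-up name. Like the pattern in the conditions, it anchors on the end of the hostname.

```python
import re

# First few TLDs from the first RewriteCond line above (shortened for the example).
BLOCKED_TLDS = ("adsl", "int", "de", "pl", "hr", "ro", "ru", "su", "hk")

# Same shape as the mod_rewrite pattern: a dot, one of the TLDs, end of string.
BLOCKED_HOST = re.compile(
    r"\.(%s)$" % "|".join(re.escape(tld) for tld in BLOCKED_TLDS),
    re.IGNORECASE,
)

def is_blocked_host(remote_host):
    """True if the visitor's resolved hostname ends in one of the blocked TLDs."""
    return bool(BLOCKED_HOST.search(remote_host))
```

One thing this makes visible: a visitor whose IP has no reverse DNS shows up as a bare IP address, which never ends in a TLD, so it passes this kind of test and needs an IP-based block instead.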

Additionally, vis-a-vis IPs:

On other sites, I block my most troublesome countries' IPs using the "Country IP Blocks" site's FREE lists. Don't hold your breath for super-frequent updates but no matter. What they have is grrrrreat. I copy-paste the following format into .htaccess --

[countryipblocks.net...]

-- you might prefer the others:

[countryipblocks.net...]

.....
Aside/FWIW: Speaking of unwanted visitors... Another of my sites has a page with "bomb" in the title (about a movie). It's more than a little disconcerting how many people from all over the place search for: "how to make a bomb". (shudders)

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member



 
Msg#: 4188052 posted 9:59 pm on Aug 19, 2010 (gmt 0)

Grandma genie - you seem to have the opposite of my own experience. Here in the UK, over 40% of my bad bots come from the US: 5120 out of 12643 (40.5%) for this month at time of writing, and that's not including Amazon, which is now banned in the IIS firewall but last month accounted for 500 hits on its own.

Last month websense accounted for 2123 hits. So far this month there have been 1217 so even worse than Amazon. :(

grandma genie



 
Msg#: 4188052 posted 11:33 pm on Aug 20, 2010 (gmt 0)

Is there an easy way to differentiate between a real human visitor and a bot or a scraper? Since I have an e-commerce site, if people don't put anything in the shopping cart and just seem to be accessing lots of pages, I assume they are not there to buy anything. You have to put a product in the cart to determine shipping. Can you tell by the IP? The UA? Something else? Would love to get rid of all the clutter in my server logs.

As for foreign visitors, most of mine are from China, Russia, Romania, Brazil and I get lots from the UK and Australia, but they seem to just be looking.

Thanks for all the help and suggestions.

Jeannie

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member



 
Msg#: 4188052 posted 10:02 pm on Aug 21, 2010 (gmt 0)

People are reluctant to disclose this kind of information as the bad guys may be watching.

Basically, kill server farms and clouds (by IP), allowing only known good bots in. Use a country-based IP database to ban countries. Look carefully at user-agents and headers to anticipate new problems.
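
The order of checks described above could be sketched roughly as follows. Everything specific here is a placeholder: the networks are reserved documentation ranges rather than real server farms or crawlers, the country codes are just examples, and `country_of` stands in for whatever country-based IP database lookup you have. The empty user-agent test is only one example of a header check.

```python
import ipaddress

# Placeholder ranges (reserved documentation networks), NOT real bot or cloud IPs.
GOOD_BOT_NETS = [ipaddress.ip_network("192.0.2.0/24")]
SERVER_FARM_NETS = [ipaddress.ip_network("198.51.100.0/24"),
                    ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_COUNTRIES = {"CN", "RU"}  # example codes only

def triage(ip, user_agent, country_of):
    """Return 'allow', 'block' or 'review' for one request.

    country_of is a hypothetical callable mapping an IP string to a
    two-letter country code (for example, backed by a geo IP database).
    """
    addr = ipaddress.ip_address(ip)
    # Known good bots are the exception to every block that follows.
    if any(addr in net for net in GOOD_BOT_NETS):
        return "allow"
    # Server farms and clouds are blocked by IP.
    if any(addr in net for net in SERVER_FARM_NETS):
        return "block"
    # Country-level ban via the lookup supplied by the caller.
    if country_of(ip) in BLOCKED_COUNTRIES:
        return "block"
    # A missing user-agent is suspicious enough to flag for a closer look.
    if not user_agent.strip():
        return "review"
    return "allow"
```

Putting the good-bot test first is a design choice: it keeps a wanted crawler working even if it happens to be hosted inside a range that is otherwise blocked.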

And read the back numbers of this forum. :)

As I noted in another topic, in the UK we get FAR more "illegal" accesses from the USA than from our own country. Which is logical given the larger numbers involved. Proportionally, there are potentially a lot more USA users with contaminated computers, and the country also has a lot of server farms / clouds. Add the willingness of some net block operators to hire out to the bad boys. Which isn't to say we don't have that problem in the UK, just that the problem is smaller due to lower population and resources.

If they are real people looking at your site then I wouldn't worry.

If you're not trading in the countries you mention then block them: in particular a lot of bad stuff comes from China, Korea, Vietnam, Indonesia, Russia, Ukraine, Romania... It's a long list. :(

As to Kosmix, if you're trading in USA then allow it. I haven't had any bad experiences with it here (UK) and it's not a frequent visitor (to me) in any case (11 hits so far this month across a few dozen sites). If you're not in the USA then block it.
