

Search Engine Spider and User Agent Identification Forum

    
Fake Google or code.google.com ?
89.164.163.72
Bewenched




msg:4521479
 8:26 am on Nov 22, 2012 (gmt 0)

GET /robots.txt -

crawler4j+(http://code.google.com/p/crawler4j/)

89.164.163.72 | HR | Zagreb, Grad Zagreb, Croatia | 45.8, 16 | ISKON INTERNET d.d. za informatiku i telekomunikac | Iskon Internet d.d. | iskon.hr

 

dstiles




msg:4521987
 8:44 pm on Nov 23, 2012 (gmt 0)

DSL line in Croatia? Hmm. Probably not G. :)

keyplyr




msg:4522060
 12:44 am on Nov 24, 2012 (gmt 0)

Not Googlebot, but could very well be a private customer developing something on the Google platform. The URL matches.

wilderness




msg:4522062
 12:56 am on Nov 24, 2012 (gmt 0)

crawler

Bewenched




msg:4522100
 5:12 am on Nov 24, 2012 (gmt 0)

Yeah, but if they're developing something on the Google platform, what exactly does that mean? A crawler for their own benefit?

Can we block [code.google.com...] ?

keyplyr




msg:4522104
 5:49 am on Nov 24, 2012 (gmt 0)

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!

So you can download it, and use it to crawl web documents. This does nothing in itself. The data you retrieve still needs to be processed. This tool does not do that for you.

I block many terms found in UAs including: spider, crawler, scrape, download, etc. But there are some beneficial actors that may also include some of these terms, so you need to allow the ones you like.
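A rough sketch of that kind of term filter, in Python for illustration. The blocked terms are the ones listed above; the `ALLOWED_AGENTS` entries and the function name `is_blocked` are assumptions made up for this example, not anyone's actual rule set. User agent strings are easy to spoof, so treat the allow list as a convenience and not as proof of identity.

```python
# Terms named in the post above; matching is case-insensitive.
BLOCKED_TERMS = ("spider", "crawler", "scrape", "download")

# Hypothetical allow list: substitute the agents you actually want to let in.
ALLOWED_AGENTS = ("googlebot", "bingbot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the user agent contains a blocked term and is not allowed."""
    ua = user_agent.lower()
    # Allowed agents are checked first so they bypass the term filter.
    if any(name in ua for name in ALLOWED_AGENTS):
        return False
    return any(term in ua for term in BLOCKED_TERMS)


if __name__ == "__main__":
    print(is_blocked("crawler4j (http://code.google.com/p/crawler4j/)"))
    print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

Checking the allow list before the blocked terms is what lets a welcome bot through even if its name happens to include "spider" or "crawler".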

Bewenched




msg:4522249
 9:56 pm on Nov 24, 2012 (gmt 0)

Yeah, I didn't like them snooping on our ecommerce site. So many competitors use this type of tool to grab our pricing and then beat us by a penny that it's not even funny.

incrediBILL




msg:4522255
 10:35 pm on Nov 24, 2012 (gmt 0)

Just block "code.google.com" found in any user agent and you'll solve this problem once and for all.

I actually block anything with "http" or "www" in the user agent (applied after the initial whitelist check, of course), which stops just about every bot that actually advertises who it is.
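The ordering described here (whitelist first, then reject any user agent that advertises a URL) might look something like the Python sketch below. The `WHITELIST` entries and the name `advertises_itself` are hypothetical, chosen only for the example. The "code.google.com" marker is listed for clarity even though the "http" rule already catches the crawler4j user agent from this thread.

```python
# Hypothetical whitelist of agents that are let through before any blocking.
WHITELIST = ("googlebot", "bingbot")

# Markers from the post: a URL-like string in the user agent.
URL_MARKERS = ("code.google.com", "http", "www")


def advertises_itself(user_agent: str) -> bool:
    """Return True if a non-whitelisted user agent contains a URL-like marker."""
    ua = user_agent.lower()
    # Step 1: the whitelist is consulted first.
    if any(name in ua for name in WHITELIST):
        return False
    # Step 2: everything left that names a site in its user agent is blocked.
    return any(marker in ua for marker in URL_MARKERS)


if __name__ == "__main__":
    # User agent from the first post, with the log's "+" shown as a space.
    print(advertises_itself("crawler4j (http://code.google.com/p/crawler4j/)"))
```

Ordinary browser user agents normally carry no URL, so they pass through both steps untouched.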
