Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Yasni
incrediBILL

WebmasterWorld Administrator Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 8:29 am on Dec 2, 2012 (gmt 0)

I'm getting hammered by a search engine of sorts claiming to be Yasni.de that thinks it's crawling long and hard yet it's getting nowhere as it couldn't answer the captcha on page 40 ;)

USER AGENT: Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0

184.22.211.146 abcd-burst2.yasni.de.
184.22.183.114 184-22-183-114.static.hostnoc.net. N
184.22.211.146 abcd-burst2.yasni.de
94.23.220.161 abcd-ovh2.yasni.de.

 

lucy24

WebmasterWorld Senior Member Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4524284 posted 12:02 pm on Dec 2, 2012 (gmt 0)

ovh? Isn't that one of those "You'll be blocking them sooner or later so why not now" places?

Matter of fact, detour to notes tells me I've got both 184.22. (NOC) and 94.23 (OVH) blocked :) Don't know how long ago, or what the trigger was. Generalized robotitude, looks like.

not2easy

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 3:10 pm on Dec 2, 2012 (gmt 0)

94.23.0.0/16 OVH is listed on RIPE as an ISP in Paris. Just a note: people who rely on visitors from that area may want to block user agents rather than IP ranges.

keyplyr

WebmasterWorld Senior Member Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 7:51 pm on Dec 2, 2012 (gmt 0)

Had them blocked for years.

HostNoc Virtual Servers
184.22.0.0 - 184.22.255.255
184.22.0.0/16

OVH Dedicated Servers
94.23.0.0 - 94.23.63.255
94.23.0.0/18

(The /16 includes the ISP)
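For anyone who wants to test addresses against these ranges in a script rather than in server config, here is a minimal sketch using Python's standard `ipaddress` module. The range labels and the `blocked_by` helper are made up for illustration; only the CIDR blocks come from the post above.

```python
import ipaddress

# Ranges quoted in this thread; the labels are only for readability.
BLOCKED_RANGES = {
    "HostNoc": ipaddress.ip_network("184.22.0.0/16"),
    "OVH dedicated": ipaddress.ip_network("94.23.0.0/18"),
}


def blocked_by(ip_text):
    """Return the label of the first range containing ip_text, or None."""
    ip = ipaddress.ip_address(ip_text)
    for label, network in BLOCKED_RANGES.items():
        if ip in network:
            return label
    return None


# Show the first and last address of the /18 to confirm the range bounds.
ovh = BLOCKED_RANGES["OVH dedicated"]
print(ovh[0], "-", ovh[-1])

# The two Yasni addresses from the first post.
print(blocked_by("184.22.211.146"))
print(blocked_by("94.23.220.161"))
```

Worth noticing: 94.23.220.161 from the first post sits inside 94.23.0.0/16 but outside the /18, so the narrower range alone would not catch it.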

blend27

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4524284 posted 9:04 pm on Dec 2, 2012 (gmt 0)

Over 1200 scrape attempts from 227 distinct IPs from all over 94.23.0.0/16, dating back as far as 2009-01-14 18:52:07

Including 1 bumb-*** competitor who tried to scrape product descriptions from an IP that is translated to a hyphenated .com & .net version of our domain registered to an actual B&M Store in Paris. Nailed by DMCA to all SEs.

The sad part is they actually had a pretty good inventory of widgets, quality stuff too.
The fun part is that everybody from that range has been getting a generic version of the "90% off - Going Out of Business!" message since 2009.

Ha.., just checked that site: "Site actuellement indisponible!" (site currently unavailable)

incrediBILL

WebmasterWorld Administrator Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 2:41 am on Dec 3, 2012 (gmt 0)

Come on guys, admit that a site protecting itself by identifying, bagging and tagging scrapers automatically is cool. I only block ranges to be preemptive, because so far the technology stops almost all of it cold. However, the really good ones get a few free pages before they get blocked, which is where blocking ranges helps prevent any leakage.
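The post doesn't say how the automatic detection works, so the following is only a rough sketch of one common ingredient: a sliding-window page counter that lets a visitor have "a few free pages" and then blocks the IP. The class name, threshold and window are all hypothetical, and a real system would look at far more signals than request rate.

```python
import time
from collections import defaultdict, deque

MAX_PAGES = 5        # hypothetical: pages allowed per window
WINDOW_SECONDS = 60  # hypothetical: window length in seconds


class PageRateLimiter:
    """Block an IP once it requests too many pages inside a time window."""

    def __init__(self, max_pages=MAX_PAGES, window=WINDOW_SECONDS):
        self.max_pages = max_pages
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now=None):
        """Record one page request; return False once the IP is blocked."""
        if ip in self.blocked:
            return False
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_pages:
            self.blocked.add(ip)
            return False
        return True
```

With `max_pages=3`, a client asking for five pages in five seconds gets three and is then blocked for good, while a visitor reading one page every twenty seconds never trips it.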

scrape product descriptions from an IP that is translated


Exactly why I put tracker bugs in my text, another reason anyway, because the trackers don't translate so codes like XXYYZZ-3287520629 (code plus long IP) make it thru unscathed into auto-translated text, scrambled text, etc. and I can easily find them in Google, Bing, etc.

I'd recommend everyone do it but I'm afraid the scrapers would figure out how to filter them out if I put out some tracker bug module.
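To make the idea concrete without giving away anyone's real module, here is a minimal sketch. It assumes "long IP" means the 32-bit integer form of an IPv4 address; `XXYYZZ` is the placeholder prefix from the post, and the function names and the search pattern are invented for illustration.

```python
import ipaddress
import re

SITE_CODE = "XXYYZZ"  # placeholder prefix, as in the post


def make_tracker(visitor_ip, site_code=SITE_CODE):
    """Build a tracker string: site code plus the IPv4 address as an integer."""
    return f"{site_code}-{int(ipaddress.IPv4Address(visitor_ip))}"


# Six capital letters, a hyphen, then up to ten digits (assumed format).
TRACKER_RE = re.compile(r"\b([A-Z]{6})-(\d{1,10})\b")


def find_trackers(text):
    """Return (code, dotted IP) pairs for every tracker found in text."""
    found = []
    for code, number in TRACKER_RE.findall(text):
        value = int(number)
        if value <= 0xFFFFFFFF:  # ignore numbers too large for IPv4
            found.append((code, str(ipaddress.IPv4Address(value))))
    return found
```

Embedding `make_tracker(request_ip)` somewhere in the page text means a scraped copy carries the address that fetched it, and `find_trackers` turns whatever a search turns up back into dotted form.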

keyplyr

WebmasterWorld Senior Member Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 4:11 am on Dec 3, 2012 (gmt 0)

...or you could just block all translators like I do. Why let some translator service scrape your content, replace your ads with their own, and publish it from servers with none of your blocking techniques?

incrediBILL

WebmasterWorld Administrator Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 6:06 am on Dec 3, 2012 (gmt 0)

...or you could just block all translators like I do.


Insufficient.

Content can be scraped first and then translated, or translated from cache if you don't use NOARCHIVE, or from the Internet Archive if the site allows it to crawl.

I allow translators because I run a worldwide site but the user agent must be valid and they can't take too many pages or they get squished. I also check the forwarded IP for validity.
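What "check the forwarded IP for validity" involves isn't spelled out here. One plausible reading, sketched below, is to take the first entry of the `X-Forwarded-For` header a translator proxy sends and reject it unless it parses as a publicly routable address. The function name is hypothetical, and the header can be forged, so treat this as one filter among several rather than proof of anything.

```python
import ipaddress


def forwarded_client_ip(header_value):
    """Return the first X-Forwarded-For entry if it is a public IP, else None."""
    if not header_value:
        return None
    # The header is a comma-separated list; the first entry is the claimed client.
    first = header_value.split(",")[0].strip()
    try:
        ip = ipaddress.ip_address(first)
    except ValueError:
        return None  # not an IP address at all, e.g. "unknown"
    # Private, loopback, link-local and reserved addresses are not global.
    return ip if ip.is_global else None
```

A request arriving through a translator with a missing, private or garbage forwarded address would then be treated like any other unverified bot.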

keyplyr

WebmasterWorld Senior Member Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 7:24 am on Dec 3, 2012 (gmt 0)

Not "insufficient". Why would you say that?

I do not allow caching and block it in several ways, and I have never allowed IA to copy my property. I was one of the very first to bring a suit against them. Because of it, they were forced to start removing anyone's intellectual property if asked by the owner.

Webmasters who allow their content to be scraped by translator services are exposing everything without all the protections they have on their own server. I don't understand why they even block any IPs or UAs or whitelist if they are going to let a translator scrape their content and put it unprotected on another server.

If your business depends on alternative language support, install those translated pages on your own server where you can protect them.

wilderness

WebmasterWorld Senior Member Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4524284 posted 3:03 pm on Dec 3, 2012 (gmt 0)

I do not allow caching and block it in several ways


Ditto.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved