
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum


 8:29 am on Dec 2, 2012 (gmt 0)

I'm getting hammered by a search engine of sorts claiming to be Yasni.de. It thinks it's crawling long and hard, yet it's getting nowhere: it couldn't answer the captcha on page 40 ;)

USER AGENT: Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0
Resolved hostnames: abcd-burst2.yasni.de, 184-22-183-114.static.hostnoc.net, abcd-ovh2.yasni.de



 12:02 pm on Dec 2, 2012 (gmt 0)

ovh? Isn't that one of those "You'll be blocking them sooner or later so why not now" places?

Matter of fact, a detour to my notes tells me I've got both 184.22. (HostNoc) and 92.23 (OVH) blocked :) Don't know how long ago, or what the trigger was. Generalized robotitude, looks like.
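The range-blocking approach above can be sketched with the stdlib `ipaddress` module. A minimal illustration only: the /16 prefix lengths here are assumptions, since the posts only give the first two octets.

```python
import ipaddress

# Illustrative only: the thread mentions blocking "184.22." (HostNoc)
# and "92.23" (OVH); the /16 widths are assumptions, not vetted ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("184.22.0.0/16"),
    ipaddress.ip_network("92.23.0.0/16"),
]

def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

In practice you'd do this at the server level (firewall or access rules) rather than in application code, but the membership test is the same idea.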


 3:10 pm on Dec 3, 2012 (gmt 0)

OVH is listed on RIPE as an ISP in Paris. Just a note for people who might rely on visitors from that area: consider blocking user agents rather than IPs.


 7:51 pm on Dec 2, 2012 (gmt 0)

Had them blocked for years.

HostNoc Virtual Servers -

OVH Dedicated Servers -

(The /16 includes the ISP)


 9:04 pm on Dec 2, 2012 (gmt 0)

Over 1200 scrape attempts from 227 distinct IPs from all over, dating back as far as 2009-01-14 18:52:07.

Including one bumb-*** competitor who tried to scrape product descriptions from an IP that is translated to a hyphenated .com & .net version of our domain, registered to an actual B&M store in Paris. Nailed by DMCA complaints to all the SEs.

The sad part is they actually had a pretty good inventory of widgets, quality stuff too.
The fun part is that everybody from that range has gotten a generic "90% off - Going Out Of Business!" message since 2009.

Ha, just checked that site: "Site actuellement indisponible!" ("Site currently unavailable!")


 2:41 am on Dec 3, 2012 (gmt 0)

Come on guys, admit that a site protecting itself by identifying, bagging, and tagging scrapers automatically is cool. I only block ranges to be preemptive, because so far the technology stops almost all of it cold. However, if they're really good, some of them get a few free pages before they get blocked, which is where blocking ranges helps prevent any leakage.

scrape product descriptions from an IP that is translated

Exactly why I put tracker bugs in my text (one reason, anyway): the trackers don't translate, so codes like XXYYZZ-3287520629 (code plus long IP) make it through unscathed into auto-translated text, scrambled text, etc., and I can easily find them in Google, Bing, etc.

I'd recommend everyone do it, but I'm afraid the scrapers would figure out how to filter them out if I put out some tracker-bug module.
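A "code plus long IP" token as described above can be sketched as follows. The "long IP" is the dotted-quad address as a 32-bit integer; `XXYYZZ` stands in for whatever site code the poster uses, and the exact scheme here is an assumption based on the example token.

```python
import ipaddress

def ip_to_long(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer ("long") form."""
    return int(ipaddress.ip_address(ip))

def tracker_bug(site_code: str, visitor_ip: str) -> str:
    """Build a tracker token like XXYYZZ-3287520629 (site code plus long IP).

    The token looks like an opaque identifier, so machine translation and
    text scrambling leave it intact, and it can later be searched verbatim.
    """
    return f"{site_code}-{ip_to_long(visitor_ip)}"
```

Embedding such a token per visitor lets you search engines later and tie a scraped copy back to the IP that fetched it.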


 4:11 am on Dec 3, 2012 (gmt 0)

...or you could just block all translators like I do. Why let some translator service scrape your content, replace your ads with their own, and publish it from servers with none of your blocking techniques?
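Blocking translators by user agent might look like the sketch below. The marker substrings are purely hypothetical examples, not a vetted list of real translator services.

```python
# Illustrative only: hypothetical translator/proxy markers, not a real list.
TRANSLATOR_UA_MARKERS = ("translate", "translator", "babelfish")

def is_translator(user_agent: str) -> bool:
    """Return True if the user agent contains a known translator marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in TRANSLATOR_UA_MARKERS)
```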


 6:06 am on Dec 3, 2012 (gmt 0)

...or you could just block all translators like I do.


Content can be scraped first and then translated, or translated from the cache if you don't use NOARCHIVE, or from the Internet Archive if someone allows them to crawl.
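For reference, NOARCHIVE is a per-page robots directive that asks engines not to keep a cached copy:

```html
<!-- In the page head: ask search engines not to serve a cached copy -->
<meta name="robots" content="noarchive">
```

It can also be sent as an HTTP header (`X-Robots-Tag: noarchive`) for non-HTML resources.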

I allow translators because I run a worldwide site, but the user agent must be valid and they can't take too many pages or they get squished. I also check the forwarded IP for validity.


 7:24 am on Dec 3, 2012 (gmt 0)

Not "insufficient", why would you say that?

I do not allow caching and block it in several ways, and I have never allowed IA to copy my property. I was one of the very first to bring a suit against them. Because of it, they were forced to start removing anyone's intellectual property if asked by the owner.

Webmasters who allow their content to be scraped by translator services are exposing everything without all the protections they have on their own server. I don't understand why they even block any IPs or UAs or whitelist if they are going to let a translator scrape their content and put it unprotected on another server.

If your business depends on alternative language support, install those translated pages on your own server where you can protect it.


 3:03 pm on Dec 3, 2012 (gmt 0)

I do not allow caching and block it in several ways

