PageGlimpse

Another blatant scraper

blend27

4:30 pm on May 7, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Seems to hail from a Comcast Business IP: 173.164.136.238

First hit: I missed it and it got through :(


Referrer: http://www.pageglimpse.com/domain.tld
UA: Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36

Then, a couple of days ago, it pretended to be Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) from the same IP.
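
A Googlebot user agent arriving from a Comcast address can be checked with a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, require a googlebot.com or google.com hostname, then resolve that hostname back and confirm it returns the same IP. Below is a minimal sketch of that check, not a production filter. The function name and the two lookup hooks are illustrative assumptions (the hooks exist only so the logic can be exercised without live DNS), and the forward lookup shown handles IPv4 only.

```python
import socket

# Hostname suffixes accepted for a genuine Googlebot PTR record.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot.

    reverse_lookup and forward_lookup are hypothetical hooks for offline
    testing; when omitted, live DNS is used (IPv4 only).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record, so the claim cannot be confirmed

    # The leading dot in each suffix rejects look-alikes such as evilgooglebot.com.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # The hostname must resolve back to the address that made the request.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

With stub lookups, `is_real_googlebot("173.164.136.238", lambda a: "host.example.net", lambda h: ["173.164.136.238"])` returns `False` because the (made-up) hostname is not under a Google domain. DNS lookups are slow, so results would normally be cached per IP rather than checked on every request.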



GRRRRR...

Also: [projecthoneypot.org...] and [google.com...]

The site lists the entire contents of the domain.tld root page, including the keywords and description tags as well as the page's text content. A plain scraper.

From About Us page:


About PageGlimpse
PageGlimpse.com was established in 2009 to provide information about any website or domain name. We have developed sophisticated algorithms and methods to effectively calculate the rank and information about any website.

PageGlimpse retrieves all its information from the public domain and is protected by Fair-Use clause of the Copyright Act of 1976, 17 U.S.C. § 107.

dstiles

6:57 pm on May 8, 2015 (gmt 0)

I've had this blocked for about a month. From DNS...

173.164.136.232 - 173.164.136.239
173.164.136.232/29
PRINTHERODOTCOM
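
For anyone who wants to test addresses against that /29, here is a small sketch using Python's standard `ipaddress` module. The names `BLOCKED_NETWORKS` and `is_blocked` are illustrative, not taken from any server config.

```python
import ipaddress

# The range quoted above; more networks can be appended to the list.
BLOCKED_NETWORKS = [ipaddress.ip_network("173.164.136.232/29")]


def is_blocked(ip, networks=BLOCKED_NETWORKS):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


net = BLOCKED_NETWORKS[0]
print(net[0], "-", net[-1])           # 173.164.136.232 - 173.164.136.239
print(is_blocked("173.164.136.238"))  # True
print(is_blocked("173.164.136.240"))  # False
```

The first and last addresses printed match the range given above, which confirms that the /29 notation and the dotted range describe the same eight addresses.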

lucy24

9:47 pm on May 8, 2015 (gmt 0)

is protected by Fair-Use clause

This line is closely analogous to the "authorized by federal law suchandsuch" that used to be included in every chain letter (snailmail) ever. Calling something Fair Use doesn't make it fair use.

Samizdata

10:07 pm on May 8, 2015 (gmt 0)

PageGlimpse retrieves all its information from the public domain

The content of most websites is not "public domain" at all.

...

keyplyr

12:06 am on May 9, 2015 (gmt 0)

I've always blocked "print" in the UA string, but thanks for the Comcast range.
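
A substring rule like that can be sketched as follows. This is only an illustration: the case-insensitive match and the names `BLOCKED_UA_TOKENS` and `ua_is_blocked` are assumptions, and the "PrintHeroBot" string is a made-up user agent, not one reported in this thread.

```python
# Tokens that trigger a block when found anywhere in the User-Agent header.
BLOCKED_UA_TOKENS = ("print",)


def ua_is_blocked(user_agent, tokens=BLOCKED_UA_TOKENS):
    """Return True if any blocked token appears in the UA (case-insensitive)."""
    ua = (user_agent or "").lower()  # treat a missing header as an empty string
    return any(token in ua for token in tokens)


# UA reported in the first post of this thread.
PAGEGLIMPSE_UA = (
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36"
)

print(ua_is_blocked("PrintHeroBot/1.0"))  # True (hypothetical UA)
print(ua_is_blocked(PAGEGLIMPSE_UA))      # False
```

Note that the browser-like UA from the first post does not contain "print", so a UA rule alone would not have stopped that hit; that is where the IP range helps.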

Pfui

2:41 am on May 9, 2015 (gmt 0)

Mothership PrintHero.com's related to -- wait for it -- amazonAWS:

54.225.171.150
ec2-54-225-171-150.compute-1.amazonaws.com

Source: [myip.ms...]

More spawn using its IP now, courtesy of that source:

pageinsider.com
siteglimpse.com
ranksphere.com
irannegah.com
activedots.com
rankglimpse.com
rankdirection.com
pagedirection.com
socialplex.com
mainevents.org

For your pestilence-killing convenience:

54.224.0.0 - 54.239.255.255
54.224.0.0/12

keyplyr

3:30 am on May 9, 2015 (gmt 0)

Lots of good traffic coming through there. I certainly wouldn't block it using /12. I take a more surgical approach, letting through the humans on proxies & mobile apps, allowing the beneficial app bots (making it possible to get the humans coming from the apps) and blocking the undesirable bots. YMMV.
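
To put numbers on how wide a /12 is compared with a narrower block around the one address mentioned earlier in the thread, here is a rough sketch with Python's `ipaddress` module. The helper `covering_block` is illustrative, and the /24 is only a hypothetical middle ground; nothing in this thread says the neighbouring addresses misbehave.

```python
import ipaddress

wide_block = ipaddress.ip_network("54.224.0.0/12")
offender = ipaddress.ip_address("54.225.171.150")


def covering_block(ip, prefix_len):
    """Return the network of the given prefix length that contains ip."""
    # strict=False masks off the host bits instead of raising ValueError.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


print(wide_block.num_addresses)      # 1048576
print(offender in wide_block)        # True
print(covering_block(offender, 32))  # 54.225.171.150/32
print(covering_block(offender, 24))  # 54.225.171.0/24
```

Blocking the /12 shuts out over a million addresses to stop one host, which is the trade-off being weighed here.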

keyplyr

10:59 am on May 9, 2015 (gmt 0)

Just another observation about AWS in general. Many of these cloud range assignments are not static. A customer coming from one range today *may* be dynamically coming from another AWS range tomorrow.

blend27

2:56 pm on May 9, 2015 (gmt 0)

Would a DMCA notice to the host (Amazon) do the trick?

They got my content on at least 3 sites already :(

keyplyr

11:43 pm on May 9, 2015 (gmt 0)

Would a DMCA notice to the host (Amazon) do the trick?

Never did me any good. In my experience Amazon would just give me lip service and then ignore the issue, or maybe they did ask the account holder but never enforced it. Either way I got no love.

Pfui

4:31 pm on May 11, 2015 (gmt 0)

Might not do any good but going on record defending your copyrights is never a bad thing.

Note Amazon's/AmazonAWS's picky requirements for abuse-related complaints (see myip.ms link above) so I'd be sure to dot your Is and cross your Ts in a takedown request.

Here are legal specifics: [amazon.com...] (AWS is hosting the offender(s) so a case could be made that the company's sanctioning the infringement.) Good luck!

blend27

5:27 pm on May 11, 2015 (gmt 0)

There are 3 other things that could be considered.

They run AdSense on all these pages (report it to the Google AdSense team as a scraper $$)
They use Amazon (hire a couple of botnet runners to send their bandwidth through the roof $$)
Wait until they re-crawl the page and serve some mumbo jumbo

hmmm...

blend27

5:35 pm on Jun 4, 2015 (gmt 0)

For anyone interested, here is Google's transparency report: [google.com...]