GlueText Spider/Crawler Located

         

incrediBILL

12:54 am on Jan 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Had to do a bit of old-school bot hunting today to figure out GlueText.com after I discovered scraped content on their site.

Apparently they've been flying under the radar for quite some time, because nobody had any data on them whatsoever. So I decided it was time to prove I still had mad skillz and went out bot hunting.

User Agent: "Mozilla/4.76 [en] (Win98; U)"

IPs from cloud-ips.com:

173.203.210.51
173.203.210.95
173.203.215.230
173.203.241.192

Other IPs involved (may be residential or office):

76.65.207.*
99.231.78.*

No robots.txt

Site claims it has patent-pending technology.

I didn't know you could patent BAD BOT behavior ;)
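For anyone who wants to grep their own logs for this, here is a rough sketch of matching a request against the user agent and addresses listed above. The function names are made up for illustration, and reading the `*` ranges as /24 networks is an assumption; what you do with a hit (block, log, challenge) is up to you.

```python
import ipaddress

# User agent string exactly as reported above.
REPORTED_UA = "Mozilla/4.76 [en] (Win98; U)"

# The four cloud addresses reported above.
REPORTED_IPS = {
    ipaddress.ip_address(ip)
    for ip in (
        "173.203.210.51",
        "173.203.210.95",
        "173.203.215.230",
        "173.203.241.192",
    )
}

# Assumption: "76.65.207.*" and "99.231.78.*" mean the whole last octet (/24).
REPORTED_NETS = [
    ipaddress.ip_network("76.65.207.0/24"),
    ipaddress.ip_network("99.231.78.0/24"),
]


def ip_is_listed(remote_addr: str) -> bool:
    """Return True if the address is one of the reported IPs or ranges."""
    addr = ipaddress.ip_address(remote_addr)
    return addr in REPORTED_IPS or any(addr in net for net in REPORTED_NETS)


def ua_is_listed(user_agent: str) -> bool:
    """Return True if the user agent is exactly the reported string."""
    return user_agent.strip() == REPORTED_UA
```

The two checks are kept separate on purpose: the user agent is a generic old browser string, so matching on it alone is a broader policy than matching on the listed addresses.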

Mokita

7:59 am on Jan 29, 2011 (gmt 0)

10+ Year Member



Just did a search on Glue Text . com for a couple of keywords significant for our biggest site.

Fortunately nothing of ours came up, but I found an extra reason (in addition to all of incrediBILL's, referred to again here recently) to continue blocking the Alexa/Internet Archive bot: two of the results for our pair of keywords were mined from the Wayback Machine.

So it seems that if they are blocked from your main site, they go straight to the Wayback Machine and bypass your site altogether.

Mokita

10:44 am on Jan 29, 2011 (gmt 0)

10+ Year Member



Also of interest: they heavily reference items, localities, people, happenings, etc. that have current pages in Wikipedia, NOT by linking directly to Wikipedia (who surely must be blocking them) but by linking to a by-blow called freebase . com (spaces intentional).

tangor

10:53 am on Jan 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Win98

Anyone still allowing this?

Mokita

10:59 am on Jan 29, 2011 (gmt 0)

10+ Year Member



tangor wrote:
Anyone still allowing this?


Definitely NOT me, don't know about anyone else.

See my earlier post here:
[webmasterworld.com...]

blend27

2:39 pm on Jan 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



173.203.nnn.nnn anything is a RACKSPACE/SLICEHOST(some) and is banned on my sites as it is.

keyplyr

5:29 pm on Jan 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




blend27 wrote:
173.203.nnn.nnn anything is a RACKSPACE/SLICEHOST(some) and is banned on my sites as it is.

Yup, I've blocked that range for a very long time.

deny from 173.203.0.0/16
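A quick sanity check of what that /16 actually covers, as a sketch using Python's standard `ipaddress` module. The helper name `is_blocked` is just for illustration; the addresses are the ones from the first post.

```python
import ipaddress

# The range from the deny line above.
BLOCKED = ipaddress.ip_network("173.203.0.0/16")

# The four cloud addresses reported in the first post.
REPORTED = [
    "173.203.210.51",
    "173.203.210.95",
    "173.203.215.230",
    "173.203.241.192",
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside the denied /16."""
    return ipaddress.ip_address(ip) in BLOCKED


# True only if every reported cloud address is inside the denied range.
all_covered = all(is_blocked(ip) for ip in REPORTED)
```

All four cloud addresses fall inside the range, but the 76.65.207.* and 99.231.78.* addresses from the first post do not, so they would need their own entries if you want them blocked too.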