
Forum Moderators: Ocean10000 & incrediBILL


-very- much the same robot

   
9:48 pm on Sep 21, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Heads-up for people who block by name rather than IP:

Couple of days ago I started getting brief visits from something calling itself discoverybot. Sounds familiar, doesn't it? Detour to raw logs tells me that up until June I occasionally saw a "discobot" coming from 38.101.148.126. (Robots aren't entitled to privacy are they?) And that's the name that comes up in WebmasterWorld searches.

This one's based at 148.120 but is beyond question the same robot. www page coded by the same color-blind 19-year-old and everything.

I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range (PSInet/Cogent). I can't, because a surprising number of Canadian schools live there. So I'll continue watching the robot. So far it hasn't done anything to cause serious annoyance.
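
For anyone blocking by name, here is a minimal sketch of a User-Agent check that catches both spellings mentioned above. It assumes the robot keeps identifying itself with either "discobot" or "discoverybot" somewhere in the User-Agent string; the function name and the sample strings are hypothetical, not taken from real logs.

```python
import re

# Assumption: the two names seen so far are "discobot" and "discoverybot".
# One pattern covers both; case is ignored in case the capitalisation changes.
BOT_NAME = re.compile(r"disco(?:very)?bot", re.IGNORECASE)


def is_disco_robot(user_agent: str) -> bool:
    """Return True if the User-Agent mentions either name (hypothetical helper)."""
    return BOT_NAME.search(user_agent) is not None


# Hypothetical User-Agent strings, for illustration only.
print(is_disco_robot("Mozilla/5.0 (compatible; discobot/1.0)"))
print(is_disco_robot("Mozilla/5.0 (compatible; discoverybot/2.0)"))
print(is_disco_robot("Mozilla/5.0 (Windows NT 6.1) Firefox/15.0"))
```

A rule keyed only to the old name would miss the new one, which is the point of the heads-up.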
5:46 pm on Sep 22, 2012 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



It may not have "done anything to cause serious annoyance" in your logs, but what is it doing with the data it mines from your site? And what do the guys that purchase your data do with it?
11:29 pm on Sep 22, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



"My data" may be putting it strongly ;) As of yesterday they've requested:

-- two copies of robots.txt
-- two directories in the form /name/index.html with no follow-up of resulting 301
-- two .sit (StuffIt) files of Mac games, datestamped 2004 but really at least 5 years older
-- two ditto, only these are patches for game files that they haven't got
-- one homemade MiSTing of similar vintage
-- two further pages also dating from around 2005
-- one random gallery page
-- one full-size jpg linked from a different gallery page

:: further detour to previous batch of visits in June ::

-- three requests for robots.txt
-- one for front page
-- two for one of the same directories as above-- only this time called correctly /name/ even though, ahem, I wasn't redirecting "index.html" at the time
-- three requests for different directory, three of them stopping short at directory-slash redirect for form /name
-- three for a different MiSTing, probably left over from when I had a very large file with this name

Before that, an even longer gap. Patterns like this make me think they've got to have collaborators. Other robots with different UAs operating from different IPs (I checked both ways) who tell them what files to ask for. The alternative is that they're working through shopping lists from 2007.

:: insert "noidea" emoticon here ::

I can block 38.something, but not the whole aaa.
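
A rough sketch of the difference between the two approaches, using Python's standard `ipaddress` module. The /24 below is purely an assumption drawn around the one address quoted earlier (38.101.148.126); the robot's real allocation may be larger or smaller, and the other sample address is made up.

```python
import ipaddress

# The whole PSInet/Cogent block that some people lock out.
WHOLE_RANGE = ipaddress.ip_network("38.0.0.0/8")

# Assumption: a narrow /24 around the address seen in the logs.
ROBOT_RANGE = ipaddress.ip_network("38.101.148.0/24")


def should_block(ip: str) -> bool:
    """Block only the narrow range, not the whole /8 (hypothetical policy)."""
    return ipaddress.ip_address(ip) in ROBOT_RANGE


# The robot's old address is caught by the narrow rule.
print(should_block("38.101.148.126"))

# A made-up address elsewhere in 38.0.0.0/8 (say, a school) is in the
# big range but is not blocked by the narrow rule.
other = "38.20.1.1"
print(ipaddress.ip_address(other) in WHOLE_RANGE, should_block(other))
```

The /8 holds about 16.7 million addresses and the /24 only 256, so the narrow rule leaves nearly all of the range reachable.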
11:34 pm on Sep 22, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I wish I could be as motivated to look at logs in that much detail; they just get a cursory scan from time to time here. :)
4:00 am on Sep 23, 2012 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range

I was frustrated enough to block the entire range some time ago, then quickly realized that was a big mistake :) Cogent pretty much includes a huge part of the northeastern US and Canada. Lots of server farms, but also municipal agencies, schools, small ISPs, private companies...
8:05 am on Sep 23, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



I wish I could be as motivated to look at logs in that much detail

The Regular Expression Is Your Friend

:)

Spotlight to bring up which log files might contain what I'm looking for; quick RegEx search within the relevant files to pull out the desired lines; a bit of cut & paste and global replaces to bring everything into focus.
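
The same idea can be sketched in a few lines of Python rather than an editor's Find All. This assumes logs in the common Apache "combined" format; the pattern, the function name and the sample lines are illustrative assumptions, not anyone's actual setup.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def requests_by_agent(log_text: str, agent_fragment: str) -> Counter:
    """Count (path, status) pairs requested by agents containing agent_fragment."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m and agent_fragment.lower() in m.group("agent").lower():
            counts[(m.group("path"), m.group("status"))] += 1
    return counts


# Hypothetical log lines, for illustration only.
SAMPLE = (
    '38.101.148.126 - - [21/Sep/2012:21:40:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "discoverybot/2.0"\n'
    '38.101.148.126 - - [21/Sep/2012:21:41:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "discoverybot/2.0"\n'
    '38.101.148.126 - - [21/Sep/2012:21:42:00 +0000] "GET /name/index.html HTTP/1.1" 301 230 "-" "discoverybot/2.0"\n'
    '192.0.2.7 - - [21/Sep/2012:21:43:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"\n'
)

for (path, status), n in requests_by_agent(SAMPLE, "discoverybot").items():
    print(n, status, path)
```

Grouping by path and status gives the kind of summary listed earlier in the thread (so many copies of robots.txt, so many redirects not followed up) without reading every line.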

I think the single happiest discovery I ever made about SubEthaEdit was that the content of its Find All window can itself be selected and pasted. Invaluable when making links in HTML versions of EETS publications with Glossarial Index at the end. Other uses revealed themselves later.

Helps when you're so small that you can process your logs in javascript. Throw in some color coding, and any unexpected slabs of robotic green are bound to catch my attention sooner or later. It's been a quiet year overall; haven't met anything truly outrageous since February.

But, ahem, the line
three requests for different directory, three of them {etc}

would probably have made more sense if my fingers had typed "two of them" as my brain clearly told them to do.

but also municipal agencies, schools, small ISPs, private companies

Yes, it's the Canadian schools that I keep getting.


* www variant of the old joke formula, "Yo momma so fat, she..." {etc.} "Your web site's so small, it..."
9:54 pm on Feb 4, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I knew this would byte me in the @$$ sooner or later. I have been blocking it since 03/2007. The whole /8 range.

Today I had a phone interview with a guy. After we finished, he asked me to send some URLs (of work I've done in the past) so he could forward them to the Boss & Team that will be doing the second-round interview.

And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke. :)
10:03 pm on Feb 4, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke.


What's the big deal?
Modify the range to allow him access and explain it was a server configuration error.
10:10 pm on Feb 4, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did that on the spot, he was very impressed, everything is peachy.

BTW, here is a thread from 2011 with some ranges/htaccess by caribguy:

[webmasterworld.com...]
10:48 pm on Feb 4, 2013 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Don't recall the range, however Jim was emphatic about leaving a portion of this open for a specific bot.