Forum Moderators: Ocean10000 & incrediBILL & keyplyr

-very- much the same robot

     
9:48 pm on Sep 21, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13545
votes: 405


Heads-up for people who block by name rather than IP:

Couple of days ago I started getting brief visits from something calling itself discoverybot. Sounds familiar, doesn't it? Detour to raw logs tells me that up until June I occasionally saw a "discobot" coming from 38.101.148.126. (Robots aren't entitled to privacy are they?) And that's the name that comes up in WebmasterWorld searches.

This one's based at 148.120 but is beyond question the same robot. www page coded by the same color-blind 19-year-old and everything.

I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range (PSInet/Cogent). I can't, because a surprising number of Canadian schools live there. So I'll continue watching the robot. So far it hasn't done anything to cause serious annoyance.
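A minimal Python sketch of the two approaches contrasted above, blocking by name versus blocking by IP. It assumes the new address is 38.101.148.120 (the post gives only "148.120"). The pattern, the /24 range, and the function name `is_discobot` are illustrative guesses, not a tested rule set.

```python
import ipaddress
import re

# One pattern for both names seen in the logs: "discobot" and "discoverybot".
BOT_NAME = re.compile(r"disco(?:very)?bot", re.IGNORECASE)

# Assumption: both addresses sit in the same /24.
# This is far narrower than locking out all of 38.0.0.0/8.
BOT_RANGE = ipaddress.ip_network("38.101.148.0/24")


def is_discobot(user_agent: str, remote_ip: str) -> bool:
    """Return True if either the UA name or the source range matches."""
    by_name = BOT_NAME.search(user_agent) is not None
    by_ip = ipaddress.ip_address(remote_ip) in BOT_RANGE
    return by_name or by_ip
```

Matching on either signal means a rename alone (discobot to discoverybot) or a move within the same small range alone would not slip past the check.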
5:46 pm on Sept 22, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:7770
votes: 266


It may not have "done anything to cause serious annoyance" in your logs, but what is it doing with the data it mines from your site? And what do the guys who purchase your data do with it?
11:29 pm on Sept 22, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13545
votes: 405


"My data" may be putting it strongly ;) As of yesterday they've requested:

-- two copies of robots.txt
-- two directories in the form /name/index.html with no follow-up of resulting 301
-- two .sit (StuffIt) files of Mac games, datestamped 2004 but really at least 5 years older
-- two ditto, only these are patches for game files that they haven't got
-- one homemade MiSTing of similar vintage
-- two further pages also dating from around 2005
-- one random gallery page
-- one full-size jpg linked from a different gallery page

:: further detour to previous batch of visits in June ::

-- three requests for robots.txt
-- one for front page
-- two for one of the same directories as above-- only this time called correctly /name/ even though, ahem, I wasn't redirecting "index.html" at the time
-- three requests for different directory, three of them stopping short at directory-slash redirect for form /name
-- three for a different MiSTing, probably left over from when I had a very large file with this name

Before that, an even longer gap. Patterns like this make me think they've got to have collaborators. Other robots with different UAs operating from different IPs (I checked both ways) who tell them what files to ask for. The alternative is that they're working through shopping lists from 2007.

:: insert "noidea" emoticon here ::

I can block 38.something, but not the whole aaa.
11:34 pm on Sept 22, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I wish I could be as motivated to look at logs in that much detail; they just get a cursory scan from time to time here. :)
4:00 am on Sept 23, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:7770
votes: 266



I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range

I was frustrated enough to block the entire range some time ago, then quickly realized that was a big mistake :) Cogent pretty much includes a huge part of the northeastern US and Canada. Lots of server farms, but also municipal agencies, schools, small ISPs, private companies...
8:05 am on Sept 23, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13545
votes: 405


I wish I could be as motivated to look at logs in that much detail

The Regular Expression Is Your Friend

:)

Spotlight to bring up which log files might contain what I'm looking for; quick RegEx search within the relevant files to pull out the desired lines; a bit of cut & paste and global replaces to bring everything into focus.
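A rough sketch of the "quick RegEx search" step, written in Python. It assumes the logs are in Apache's combined format, which the thread does not confirm, and the function name `requests_by_agent` is made up for illustration.

```python
import re

# Apache "combined" format (assumed):
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_by_agent(lines, agent_pattern):
    """Yield (time, status, request) for lines whose user-agent matches."""
    wanted = re.compile(agent_pattern, re.IGNORECASE)
    for line in lines:
        m = LOG_LINE.match(line)
        if m and wanted.search(m.group("agent")):
            yield m.group("time"), m.group("status"), m.group("request")
```

Feeding it the lines of a log file with a pattern such as `r"disco(very)?bot"` would pull out just that robot's requests, ready for the cut-and-paste tidying described above.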

I think the single happiest discovery I ever made about SubEthaEdit was that the content of its Find All window can itself be selected and pasted. Invaluable when making links in HTML versions of EETS publications with Glossarial Index at the end. Other uses revealed themselves later.

Helps when you're so small that you can process your logs in javascript. Throw in some color coding, and any unexpected slabs of robotic green are bound to catch my attention sooner or later. It's been a quiet year overall; haven't met anything truly outrageous since February.

But, ahem, the line
three requests for different directory, three of them {etc}

would probably have made more sense if my fingers had typed "two of them" as my brain clearly told them to do.

but also municipal agencies, schools, small ISPs, private companies

Yes, it's the Canadian schools that I keep getting.


* www variant of the old joke formula, "Yo momma so fat, she..." {etc.} "Your web site's so small, it..."
9:54 pm on Feb 4, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1889
votes: 56


I knew this would byte me in the @$$ sooner or later. I have been blocking it since 03/2007. The whole /8 range.

Today I had a phone interview with a guy. After we finished, he asked me to send some URLs (of work I've done in the past) so he could forward them to the Boss & Team that will be doing the second-round interview.

And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke. :)
10:03 pm on Feb 4, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke.


What's the big deal?
Modify the range to allow him access and explain it was a server configuration error.
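As a sketch of what "modify the range" could look like in logic terms, here is a Python version of a blocked /8 with a carved-out exception. The allowed range 38.64.10.0/24 is purely hypothetical, a stand-in for wherever the Boss actually sits. On a real server this would be done in the server configuration, not in Python.

```python
import ipaddress

# The whole PSInet/Cogent /8, as blocked in the post above.
BLOCKED = [ipaddress.ip_network("38.0.0.0/8")]

# Hypothetical exception; the real range is not given in the thread.
ALLOWED = [ipaddress.ip_network("38.64.10.0/24")]


def is_blocked(remote_ip: str) -> bool:
    """Allow-list entries win over the broader block list."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in ALLOWED):
        return False
    return any(addr in net for net in BLOCKED)
```

Checking the exceptions first keeps the broad block in place while letting known-good visitors through.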
10:10 pm on Feb 4, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1889
votes: 56


Did that on the spot, he was very impressed, everything is peachy.

BTW, here is a thread from 2011 with some ranges/htaccess by caribguy:

[webmasterworld.com...]
10:48 pm on Feb 4, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Don't recall the range, however Jim was emphatic about leaving a portion of this open for a specific bot.
 
