
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
-very- much the same robot
lucy24




msg:4498455
 9:48 pm on Sep 21, 2012 (gmt 0)

Heads-up for people who block by name rather than IP:

Couple of days ago I started getting brief visits from something calling itself discoverybot. Sounds familiar, doesn't it? Detour to raw logs tells me that up until June I occasionally saw a "discobot" coming from 38.101.148.126. (Robots aren't entitled to privacy are they?) And that's the name that comes up in WebmasterWorld searches.

This one's based at 148.120 but is beyond question the same robot. www page coded by the same color-blind 19-year-old and everything.

I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range (PSInet/Cogent). I can't, because a surprising number of Canadian schools live there. So I'll continue watching the robot. So far it hasn't done anything to cause serious annoyance.
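For anyone weighing name-based against range-based blocking, here is a rough Python sketch of the distinction described above. The two robot names and the 38.0.0.0/8 range come from this thread; the regular expression, the function name `classify`, and the sample user-agent string in the usage note are illustrative assumptions, not anything the robot is confirmed to send.

```python
import ipaddress
import re

# Matches both names mentioned in the thread: "discobot" and "discoverybot".
# The pattern is an assumption about how one might match them.
BOT_UA = re.compile(r"disco(?:very)?bot", re.IGNORECASE)

# PSInet/Cogent range discussed in the thread.
COGENT_RANGE = ipaddress.ip_network("38.0.0.0/8")


def classify(ip, user_agent):
    """Return 'bot' on a name match, 'cogent' for other 38/8 traffic, else 'other'."""
    if BOT_UA.search(user_agent):
        return "bot"
    if ipaddress.ip_address(ip) in COGENT_RANGE:
        return "cogent"
    return "other"
```

Calling `classify("38.101.148.126", "discoverybot/2.0")` flags the robot by name, while an ordinary browser from elsewhere in 38/8 only comes back as `"cogent"`, so the schools in that range need not be locked out.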

 

keyplyr




msg:4498636
 5:46 pm on Sep 22, 2012 (gmt 0)

It may not have "done anything to cause serious annoyance" in your logs, but what is it doing with the data it mines from your site? And what do the guys who purchase your data do with it?

lucy24




msg:4498692
 11:29 pm on Sep 22, 2012 (gmt 0)

"My data" may be putting it strongly ;) As of yesterday they've requested:

-- two copies of robots.txt
-- two directories in the form /name/index.html with no follow-up of resulting 301
-- two .sit (StuffIt) files of Mac games, datestamped 2004 but really at least 5 years older
-- two ditto, only these are patches for game files that they haven't got
-- one homemade MiSTing of similar vintage
-- two further pages also dating from around 2005
-- one random gallery page
-- one full-size jpg linked from a different gallery page

:: further detour to previous batch of visits in June ::

-- three requests for robots.txt
-- one for front page
-- two for one of the same directories as above-- only this time called correctly /name/ even though, ahem, I wasn't redirecting "index.html" at the time
-- three requests for different directory, three of them stopping short at directory-slash redirect for form /name
-- three for a different MiSTing, probably left over from when I had a very large file with this name

Before that, an even longer gap. Patterns like this make me think they've got to have collaborators. Other robots with different UAs operating from different IPs (I checked both ways) who tell them what files to ask for. The alternative is that they're working through shopping lists from 2007.

:: insert "noidea" emoticon here ::

I can block 38.something, but not the whole aaa.

g1smd




msg:4498693
 11:34 pm on Sep 22, 2012 (gmt 0)

I wish I could be as motivated to look at logs in that much detail; they just get a cursory scan from time to time here. :)

keyplyr




msg:4498731
 4:00 am on Sep 23, 2012 (gmt 0)


I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range

I was frustrated enough to block the entire range some time ago, then quickly realized that was a big mistake :) Cogent pretty much includes a huge part of the Northeast US and Canada. Lots of server farms, but also municipal agencies, schools, small ISPs, private companies...

lucy24




msg:4498754
 8:05 am on Sep 23, 2012 (gmt 0)

I wish I could be as motivated to look at logs in that much detail

The Regular Expression Is Your Friend

:)

Spotlight to bring up which log files might contain what I'm looking for; quick RegEx search within the relevant files to pull out the desired lines; a bit of cut & paste and global replaces to bring everything into focus.
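The "quick RegEx search" step might look something like the following in Python. This is only a sketch: it assumes logs in the Apache combined format, and the names `LINE` and `requests_by` are made up for illustration. The poster's actual workflow uses a text editor, not a script.

```python
import re

# Apache combined log format (assumed): the user agent is the last quoted field.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def requests_by(ua_pattern, lines):
    """Return (ip, status, request) for every line whose user agent matches."""
    wanted = re.compile(ua_pattern, re.IGNORECASE)
    hits = []
    for line in lines:
        m = LINE.match(line)
        if m and wanted.search(m.group("ua")):
            hits.append((m.group("ip"), m.group("status"), m.group("request")))
    return hits
```

Feeding it the lines of a log file, e.g. `requests_by(r"disco(?:very)?bot", open("access.log"))` (the file name is a placeholder), pulls out just the robot's requests so they can be compared across visits.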

I think the single happiest discovery I ever made about SubEthaEdit was that the content of its Find All window can itself be selected and pasted. Invaluable when making links in HTML versions of EETS publications with Glossarial Index at the end. Other uses revealed themselves later.

Helps when you're so small* that you can process your logs in javascript. Throw in some color coding, and any unexpected slabs of robotic green are bound to catch my attention sooner or later. It's been a quiet year overall; haven't met anything truly outrageous since February.

But, ahem, the line
three requests for different directory, three of them {etc}

would probably have made more sense if my fingers had typed "two of them" as my brain clearly told them to do.

but also municipal agencies, schools, small ISPs, private companies

Yes, it's the Canadian schools that I keep getting.


* www variant of the old joke formula, "Yo momma so fat, she..." {etc.} "Your web site's so small, it..."

blend27




msg:4542355
 9:54 pm on Feb 4, 2013 (gmt 0)

I knew this would byte me in the @$$ sooner or later. I have been blocking it since 03/2007. The whole /8 range.

Today I had a phone interview with the guy. After I finished, he asked me to send some URLs (of work I've done in the past) so he could forward them to the Boss & Team that will be doing the second-round interview.

And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke. :)

wilderness




msg:4542358
 10:03 pm on Feb 4, 2013 (gmt 0)

And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke.


What's the big deal?
Modify the range to allow him access and explain it was a server configuration error.
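One way to work out what "modify the range" leaves blocked is Python's `ipaddress` module, sketched below. The 38.0.0.0/8 block is from this thread; the exempted /24 is a purely hypothetical placeholder, not the interviewer's actual range, and `still_blocked` is an invented helper name.

```python
import ipaddress

blocked = ipaddress.ip_network("38.0.0.0/8")

# Hypothetical range to let through; NOT a real allocation named in this thread.
allowed = ipaddress.ip_network("38.0.0.0/24")


def still_blocked(block, exception):
    """Split `block` into the CIDR ranges that remain once `exception` is carved out."""
    # address_exclude raises ValueError if `exception` is not inside `block`.
    return sorted(block.address_exclude(exception))


remaining = still_blocked(blocked, allowed)
```

Each entry in `remaining` is a CIDR range that can go back into the deny list in place of the single /8. In practice an allow rule evaluated ahead of the deny rule achieves the same thing with less typing; the split is mainly useful for seeing exactly what is still covered.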

blend27




msg:4542359
 10:10 pm on Feb 4, 2013 (gmt 0)

Did that on the spot, he was very impressed, everything is peachy.

BTW, here is a thread from 2011 with some ranges/htaccess by caribguy:

[webmasterworld.com...]

wilderness




msg:4542369
 10:48 pm on Feb 4, 2013 (gmt 0)

Don't recall the range, however Jim was emphatic about leaving a portion of this open for a specific bot.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved