

Search Engine Spider and User Agent Identification Forum

    
-very- much the same robot
lucy24

Msg#: 4498453 posted 9:48 pm on Sep 21, 2012 (gmt 0)

Heads-up for people who block by name rather than IP:

A couple of days ago I started getting brief visits from something calling itself discoverybot. Sounds familiar, doesn't it? A detour to the raw logs tells me that up until June I occasionally saw a "discobot" coming from 38.101.148.126. (Robots aren't entitled to privacy, are they?) And that's the name that comes up in WebmasterWorld searches.

This one's based at 148.120 but is beyond question the same robot. www page coded by the same color-blind 19-year-old and everything.

I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range (PSInet/Cogent). I can't, because a surprising number of Canadian schools live there. So I'll continue watching the robot. So far it hasn't done anything to cause serious annoyance.
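For anyone who blocks by name, here is a minimal sketch of a pattern that would catch both spellings. Only the two robot names come from the post above; the function name and the sample User-Agent strings are made up for illustration, and the real UA strings may look different.

```python
import re

# Matches "discobot" and "discoverybot" anywhere in a User-Agent string.
# Only the two names come from this thread; the rest is illustrative.
DISCO_RE = re.compile(r"disco(?:very)?bot", re.IGNORECASE)

def is_disco_family(user_agent: str) -> bool:
    """Return True if the UA string names either variant of the robot."""
    return DISCO_RE.search(user_agent) is not None

# Hypothetical UA strings, not copied from real logs.
samples = [
    "Mozilla/5.0 (compatible; discobot/1.0)",
    "Mozilla/5.0 (compatible; discoverybot/2.0)",
    "Mozilla/5.0 (Windows NT 6.1) Firefox/15.0",
]
for ua in samples:
    print(is_disco_family(ua), ua)
```

A name match like this only helps as long as the robot keeps announcing itself honestly, which is the usual trade-off against blocking by IP.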

 

keyplyr

Msg#: 4498453 posted 5:46 pm on Sep 22, 2012 (gmt 0)

It may not have "done anything to cause serious annoyance" in your logs, but what is it doing with the data it mines from your site? And what do the guys that purchase your data do with it?

lucy24

Msg#: 4498453 posted 11:29 pm on Sep 22, 2012 (gmt 0)

"My data" may be putting it strongly ;) As of yesterday they've requested:

-- two copies of robots.txt
-- two directories in the form /name/index.html with no follow-up of resulting 301
-- two .sit (StuffIt) files of Mac games, datestamped 2004 but really at least 5 years older
-- two ditto, only these are patches for game files that they haven't got
-- one homemade MiSTing of similar vintage
-- two further pages also dating from around 2005
-- one random gallery page
-- one full-size jpg linked from a different gallery page

:: further detour to previous batch of visits in June ::

-- three requests for robots.txt
-- one for front page
-- two for one of the same directories as above-- only this time called correctly /name/ even though, ahem, I wasn't redirecting "index.html" at the time
-- three requests for different directory, three of them stopping short at directory-slash redirect for form /name
-- three for a different MiSTing, probably left over from when I had a very large file with this name

Before that, an even longer gap. Patterns like this make me think they've got to have collaborators. Other robots with different UAs operating from different IPs (I checked both ways) who tell them what files to ask for. The alternative is that they're working through shopping lists from 2007.

:: insert "noidea" emoticon here ::

I can block 38.something, but not the whole aaa.

g1smd

Msg#: 4498453 posted 11:34 pm on Sep 22, 2012 (gmt 0)

I wish I could be as motivated to look at logs in that much detail; they just get a cursory scan from time to time here. :)

keyplyr

Msg#: 4498453 posted 4:00 am on Sep 23, 2012 (gmt 0)


I realize a lot of people avoid the issue by locking out the whole 38.0.0.0/8 range

I was frustrated enough to block the entire range some time ago, then quickly realized that was a big mistake :) Cogent pretty much includes a huge part of the northeastern US and Canada. Lots of server farms, but also municipal agencies, schools, small ISPs, private companies...

lucy24

Msg#: 4498453 posted 8:05 am on Sep 23, 2012 (gmt 0)

I wish I could be as motivated to look at logs in that much detail

The Regular Expression Is Your Friend

:)

Spotlight to bring up which log files might contain what I'm looking for; quick RegEx search within the relevant files to pull out the desired lines; a bit of cut & paste and global replaces to bring everything into focus.
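The "quick RegEx search" step could look something like the sketch below. It assumes the logs are in Apache's combined log format; the sample lines, the IPs (taken from a documentation range) and the function name are invented for illustration and are not from anyone's real logs.

```python
import re

# Combined Log Format:
# ip identity user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"$'
)

def requests_by_agent(lines, ua_pattern):
    """Return (time, request, status) for each line whose UA matches ua_pattern."""
    wanted = re.compile(ua_pattern, re.IGNORECASE)
    hits = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m and wanted.search(m.group("ua")):
            hits.append((m.group("time"), m.group("request"), int(m.group("status"))))
    return hits

# Made-up log lines for illustration only.
sample_log = [
    '192.0.2.10 - - [21/Sep/2012:21:40:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; discoverybot/2.0)"',
    '192.0.2.55 - - [21/Sep/2012:21:41:17 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/15.0"',
    '192.0.2.10 - - [21/Sep/2012:21:42:09 +0000] "GET /name/index.html HTTP/1.1" 301 230 "-" "Mozilla/5.0 (compatible; discoverybot/2.0)"',
]

for hit in requests_by_agent(sample_log, r"disco(?:very)?bot"):
    print(hit)
```

Pulling out only the time, request and status is what makes lists like the ones earlier in the thread quick to put together.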

I think the single happiest discovery I ever made about SubEthaEdit was that the content of its Find All window can itself be selected and pasted. Invaluable when making links in HTML versions of EETS publications with Glossarial Index at the end. Other uses revealed themselves later.

Helps when you're so small that you can process your logs in JavaScript.* Throw in some color coding, and any unexpected slabs of robotic green are bound to catch my attention sooner or later. It's been a quiet year overall; haven't met anything truly outrageous since February.

But, ahem, the line
three requests for different directory, three of them {etc}

would probably have made more sense if my fingers had typed "two of them" as my brain clearly told them to do.

but also municipal agencies, schools, small ISPs, private companies

Yes, it's the Canadian schools that I keep getting.


* www variant of the old joke formula, "Yo momma so fat, she..." {etc.} "Your web site's so small, it..."

blend27

Msg#: 4498453 posted 9:54 pm on Feb 4, 2013 (gmt 0)

I knew this would byte me in the @$$ sooner or later. I have been blocking it since 03/2007. The whole /8 range.

Today I had a phone interview with a guy. After I finished, he asked me to send some URLs (of work I've done in the past) so he could forward them to the Boss & Team that will be doing the second-round interview.

And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke. :)

wilderness

Msg#: 4498453 posted 10:03 pm on Feb 4, 2013 (gmt 0)

And BAM, the Boss is on PSInet/Cogent Range, got 403'd with the nice message displayed: Sorry You are not on the list! on 3 URLs that I sent. Got a call from the guy asking if this was a practical joke.


What's the big deal?
Modify the range to allow him access and explain it was a server configuration error.
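The logic of "block the range, but let this one through" can be sketched as below. The /8 is the range named in this thread; the allowed sub-range is a placeholder picked for illustration, not anyone's real address and not a recommendation, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# The /8 is the range discussed in this thread. The exception is a
# made-up placeholder, not a real or recommended range.
BLOCKED = [ipaddress.ip_network("38.0.0.0/8")]
ALLOWED = [ipaddress.ip_network("38.99.0.0/16")]  # hypothetical exception

def is_blocked(ip: str) -> bool:
    """Exceptions are checked first, so they override the wider block."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in ALLOWED):
        return False
    return any(addr in net for net in BLOCKED)

print(is_blocked("38.101.148.126"))  # inside the /8, no exception
print(is_blocked("38.99.12.34"))     # inside the hypothetical exception
print(is_blocked("192.0.2.1"))       # outside the /8 entirely
```

Checking the exceptions before the block list is the design choice that matters here: the narrow allow always wins over the broad deny, whatever order the ranges were added in.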

blend27

Msg#: 4498453 posted 10:10 pm on Feb 4, 2013 (gmt 0)

Did that on the spot, he was very impressed, everything is peachy.

BTW, here is a thread from 2011 with some ranges/htaccess by caribguy:

[webmasterworld.com...]

wilderness

Msg#: 4498453 posted 10:48 pm on Feb 4, 2013 (gmt 0)

Don't recall the range; however, Jim was emphatic about leaving a portion of this open for a specific bot.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved