
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Hits from unidentified blog trackers
where are these spiders from?
abates




msg:3097219
 9:51 am on Sep 26, 2006 (gmt 0)

I have my weblogging software set up to automatically ping three major blog-tracking sites whenever I update. I've noticed that whenever I ping these services, a flock of blog-tracking spiders will descend on my RSS feeds, including Google, Yahoo and MSN. Most of these are good enough to identify themselves in the user-agent.

Some of them just use a generic Java version string, like this one which tried to grab my index page (no robots.txt):
66.96.216.### "Java/1.5.0_06"

Anyone know who this spider belongs to?

 

wilderness




msg:3097647
 3:24 pm on Sep 26, 2006 (gmt 0)

NOC is a co-locator that offers rack space.
Could be from anybody including the many cohosts that use NOC.

Regarding the UA, most everybody has it denied.

GaryK




msg:3097723
 3:58 pm on Sep 26, 2006 (gmt 0)

Yup. It's a decision each of us has to make, but I for one block everything that has "java" anywhere in the UA. They're usually nothing but trouble.

Whitelisting is the key to the future. Start thinking about things you can do to only let in what you want to come in instead of trying to keep things out. :)
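
To make the whitelisting idea concrete, here is a minimal sketch in Python. The pattern list and the `is_allowed` name are assumptions made up for illustration, not anything posted in this thread; each site would pick its own list. User-agent strings are trivially spoofed, so this only filters agents that identify themselves honestly.

```python
import re

# Hypothetical allow-list: patterns for user agents this site chooses to admit.
# These entries are illustrative assumptions, not a recommended list.
ALLOWED_UA_PATTERNS = [
    re.compile(r"Googlebot", re.IGNORECASE),
    re.compile(r"Yahoo! Slurp", re.IGNORECASE),
    re.compile(r"msnbot", re.IGNORECASE),
    re.compile(r"^Mozilla/", re.IGNORECASE),  # ordinary browsers
]


def is_allowed(user_agent: str) -> bool:
    """Whitelist check: admit only agents that match an approved pattern."""
    if not user_agent:
        # A missing or empty user agent is never on the list.
        return False
    return any(p.search(user_agent) for p in ALLOWED_UA_PATTERNS)


if __name__ == "__main__":
    for ua in ["Java/1.5.0_06", "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"]:
        print(ua, "->", "allow" if is_allowed(ua) else "deny")
```

With a list like this, a generic "Java/1.5.0_06" agent is denied simply because nobody put it on the list; no separate block rule is needed.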

incrediBILL




msg:3098709
 8:19 am on Sep 27, 2006 (gmt 0)

I honestly don't care what blog trackers I block because they can't figure out how to set a simple user agent string, OH WAHHHHH!

If anyone is seriously upset I'm assuming they'll contact me and tell me their blog reader won't work and then I can contact whoever wrote it and tell them what a useless pound of programming flesh they are along with instructions to fix it.

Until then, "Java/anything" goes BOING! BOING! BOING!

GaryK




msg:3099199
 4:04 pm on Sep 27, 2006 (gmt 0)

BOING! BOING! BOING!
That's the name of the new BoingBoing podcast. :)

Your rants are always the best Bill!

On a serious note, I don't even bother with the slash like you did. I suppose if a crawler came around like, for example, Conjavabot, which I just made up, they'd be banned under my rules. So perhaps for those webmasters who prefer a more conservative approach your pattern is a better example than mine.

Then again mine will catch these nasty and persistent little user agents so maybe it's not so bad after all:

Java(TM) 2 Runtime Environment
Java1.1.7
JPluck/2.0.9 (Java 1.4.2_03; Windows XP)
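
The difference between the two rules can be sketched in Python. This assumes the slash version means "starts with Java/" and the broader version means "java anywhere, case-insensitive"; neither poster gave their actual rule, so `STRICT`, `BROAD`, and `blocked_by` are hypothetical names for illustration only.

```python
import re

# Assumed reading of the "Java/anything" rule: UA must start with "Java/".
STRICT = re.compile(r"^Java/", re.IGNORECASE)
# Assumed reading of the broader rule: "java" anywhere in the UA.
BROAD = re.compile(r"java", re.IGNORECASE)

# User agents quoted in this thread, plus the made-up Conjavabot.
SAMPLES = [
    "Java/1.5.0_06",
    "Java(TM) 2 Runtime Environment",
    "Java1.1.7",
    "JPluck/2.0.9 (Java 1.4.2_03; Windows XP)",
    "Conjavabot",
]


def blocked_by(pattern: re.Pattern, user_agent: str) -> bool:
    """Return True if the given block pattern matches the user agent."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    for ua in SAMPLES:
        print(f"{ua!r}: strict={blocked_by(STRICT, ua)} broad={blocked_by(BROAD, ua)}")
```

Under these assumptions the strict pattern blocks only the first sample, while the broad one blocks all five, including the hypothetical Conjavabot.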


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved