

Search Engine Spider and User Agent Identification Forum

    
WASALive
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4372326 posted 10:45 pm on Oct 8, 2011 (gmt 0)

Hit two different sites and never requested robots.txt. Note space before middle semi-colon:

dev.wasalive.com
Mozilla/5.0 (compatible; WASALive Bot ; http://blog.wasalive.com/wasalive-bots/)

robots.txt? NO

dev.wasalive.com = 94.23.239.127 = OVH France

Apparently they have different bots/UAs for different purposes. Have only seen this one. Anyone else?
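For anyone who wants to run the same "robots.txt? NO" check against their own logs, here is a minimal sketch. It assumes Apache combined log format; the function name `robots_txt_report` and the sample log lines (timestamps, paths, sizes, and the `ExampleBot` entry) are made up for illustration and are not taken from real logs.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"$'
)

def robots_txt_report(lines, ua_substring):
    """Map each host whose UA contains ua_substring to True/False:
    did that host ever request /robots.txt?"""
    report = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or ua_substring not in m.group("ua"):
            continue
        host = m.group("host")
        # Ignore any query string when comparing the path.
        asked = m.group("path").split("?", 1)[0] == "/robots.txt"
        report[host] = report.get(host, False) or asked
    return report

# Hypothetical log lines: only the IP and UA string come from the post above.
WASA_UA = ("Mozilla/5.0 (compatible; WASALive Bot ; "
           "http://blog.wasalive.com/wasalive-bots/)")
SAMPLE = [
    '94.23.239.127 - - [08/Oct/2011:22:40:01 +0000] "GET / HTTP/1.1" '
    '200 5120 "-" "' + WASA_UA + '"',
    '94.23.239.127 - - [08/Oct/2011:22:40:05 +0000] "GET /about.html HTTP/1.1" '
    '200 2048 "-" "' + WASA_UA + '"',
    '192.0.2.10 - - [08/Oct/2011:22:41:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    print(robots_txt_report(SAMPLE, "WASALive"))
```

A host that maps to False made requests under that UA without ever fetching robots.txt in the lines scanned. It says nothing about requests that fall outside the log window.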

 

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4372326 posted 2:26 am on Oct 9, 2011 (gmt 0)

Quoting Pfui: "Note space before middle semi-colon:"

I'd always assumed in a vague sort of way that spurious space = useless robot. Shove in a BrowserMatch looking for space followed by [;:,)] et cetera and you can forget about it. But after applying some brute force and a Regular Expression* I've had to conclude that 'tain't necessarily so.**

SV1) ;
Configuration/CLDC-1.1 )
U; ;

all appear to be legitimate. (The third one shows up in some rare ex-Soviet-bloc UAs, but seems to be human.)

On the other hand, most of the rest are no-brainers:

"GeoHasher/Nutch-1.0 (GeoHasher Web Search Engine; geohasher.gotdns.org; geo_hasher at yahoo * com)"
(This only turned up because there was no reason to exclude asterisk from the search)

"Mozilla/5.0 (compatible; spbot/3.0; +http://www.seoprofiler.com/bot )"
(Really, I don't think we need the extra space to give us any information here!)

"^Mozilla/4.0 \\(compatible; MSIE 8.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727\\)$"
(I'm not kidding. That's from raw logs, not from an .htaccess file. Maybe they pasted it in from someone else's htaccess. Or even their own.)

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts; .NET CLR 1.1.4322; &id;)"
(As above: didn't exclude &. What is &id; anyway? It's not an HTML entity. No! BAD smiley! Get out of there!)

"Lotus-Notes/4.5 ( Windows-NT )"
(Really? You think it might be a robot?)

Phooey. Haha. Another good idea down the drain.


* [\p{Punct}&&[^-/.{(\[quote]] (with leading space) applied to raw log files.
** Like those scientific surveys where they investigate something everyone already knows. Huge waste of money if it turns out everyone was right all along-- but infuriating when it turns out that "common knowledge" is wrong.
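The "space followed by [;:,)]" idea from this post can be tried out in a few lines. This is only a sketch of the narrow version of the test (it is not the broader \p{Punct} expression from the footnote), and the names `SPURIOUS_SPACE` and `has_spurious_space` are invented here. The strings are the ones quoted in this thread.

```python
import re

# A space immediately before ; : , or ) is the "spurious space" pattern.
SPURIOUS_SPACE = re.compile(r" [;:,)]")

def has_spurious_space(user_agent):
    """True if the UA string contains a space right before ; : , or )."""
    return SPURIOUS_SPACE.search(user_agent) is not None

# UAs from this thread that the heuristic flags, as hoped.
FLAGGED_BOTS = [
    "Mozilla/5.0 (compatible; WASALive Bot ; http://blog.wasalive.com/wasalive-bots/)",
    "Mozilla/5.0 (compatible; spbot/3.0; +http://www.seoprofiler.com/bot )",
    "Lotus-Notes/4.5 ( Windows-NT )",
]

# Fragments reported above as appearing in legitimate UAs.
# The heuristic flags these too, which is the whole problem.
LEGIT_FRAGMENTS = [
    "SV1) ;",
    "Configuration/CLDC-1.1 )",
    "U; ;",
]

# A UA from the thread with no space-before-punctuation at all.
NOT_FLAGGED = ("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
               "FunWebProducts; .NET CLR 1.1.4322; &id;)")

if __name__ == "__main__":
    for ua in FLAGGED_BOTS + LEGIT_FRAGMENTS + [NOT_FLAGGED]:
        print(has_spurious_space(ua), ua)
```

The legitimate fragments match just as readily as the bots do, so a blanket block on this pattern alone would catch some human visitors.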

Pfui




 
Msg#: 4372326 posted 5:19 am on Oct 9, 2011 (gmt 0)

Erm... Seen any WASALive bots, lucy? :)

Pfui




 
Msg#: 4372326 posted 6:33 pm on Nov 16, 2011 (gmt 0)

Just noting another hostname:

bot.45.wasalive.com
Mozilla/5.0 (compatible; WASALive Bot ; http://blog.wasalive.com/wasalive-bots/)

robots.txt? NO

bot.45.wasalive.com = 94.23.251.171 = OVH France
94.23.192.0 - 94.23.255.255 = 94.23.192.0/18 (part of 94.23.0.0/16)
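A quick way to double-check which CIDR block a start-end range corresponds to, and whether the two IPs seen in this thread fall inside it, is Python's standard `ipaddress` module. This is a small sketch; the helper name `range_to_cidrs` is made up here.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the minimal list of CIDR blocks covering first..last inclusive."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in nets]

# The range and the two bot IPs are the ones reported in this thread.
CIDRS = range_to_cidrs("94.23.192.0", "94.23.255.255")
BLOCK = ipaddress.ip_network(CIDRS[0])
SEEN = ["94.23.239.127", "94.23.251.171"]

if __name__ == "__main__":
    print(CIDRS)
    for ip in SEEN:
        print(ip, ipaddress.ip_address(ip) in BLOCK)
    # The smaller block sits inside the wider /16.
    print(BLOCK.subnet_of(ipaddress.ip_network("94.23.0.0/16")))
```

The range covers 16,384 addresses, a single /18. The /16 is four times that size, so which one to put in a deny rule depends on how much of the host's space you want to cover.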
