Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Amazon
wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3229434 posted 11:56 pm on Jan 23, 2007 (gmt 0)

There's an old thread on this, now closed.
[webmasterworld.com...]

Today
216.182.238.102 - - [23/Jan/2007:14:54:56 -0800] "GET /robots.txt HTTP/1.1" 403 - "-" "O#*$!earch/1.x (www.o#*$!earch.com)"

 

thetrasher

5+ Year Member



 
Msg#: 3229434 posted 2:37 pm on Jan 24, 2007 (gmt 0)

SMBot?! ([webmasterworld.com ])

Advertised website openisearch.com is hosted by "specificmedia".

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3229434 posted 7:51 pm on Jan 24, 2007 (gmt 0)

O#*$!earch/1.x (www.o#*$!earch.com)

Are the odd characters part of the UA or yet another bug in the software?

For people using PHP's get_browser() function, adding this to their browscap.ini file without quotation marks will cause errors. That would make this a malicious bot. Would Amazon do something like that?

Am I missing anything Don? Thanks.

thetrasher

5+ Year Member



 
Msg#: 3229434 posted 8:35 pm on Jan 24, 2007 (gmt 0)

Gary, it's censorship by WebmasterWorld.

o#*$!earch is openisearch, but there is a "bad" word between O and E. I think Specificmedia knows about WebmasterWorld's censorship.

Amazon is not running bots from 216.182.224.0/20! They sell computer power and bandwidth to anyone. It's like a temporary virtual server. See here: [webmasterworld.com...]
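For anyone who wants to check an address from their own logs against that block, here is a minimal sketch using Python's standard `ipaddress` module. The names `AMAZON_BLOCK` and `in_block` are made up for illustration, and 216.182.240.1 is just a hypothetical address one step outside the range.

```python
import ipaddress

# The /20 named in the post above.
AMAZON_BLOCK = ipaddress.ip_network("216.182.224.0/20")

def in_block(addr: str) -> bool:
    """Return True if addr falls inside AMAZON_BLOCK."""
    return ipaddress.ip_address(addr) in AMAZON_BLOCK

# First and last addresses covered by the /20.
first, last = AMAZON_BLOCK[0], AMAZON_BLOCK[-1]
print(first, last)

# The address from the log line at the top of the thread.
print(in_block("216.182.238.102"))

# Hypothetical address just past the end of the block.
print(in_block("216.182.240.1"))
```

A /20 here spans third octets 224 through 239, so a single rule on that block covers every address reported in this thread.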

wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3229434 posted 8:43 pm on Jan 24, 2007 (gmt 0)

Gary and trasher,
I read a recent announcement (I believe in the IAR [renamed] releases) where Amazon, eBay and another were partnering in a venture that was SE related.

The bot name is "open I search", all one word.

I believe the forum censor is screening the alternative word for phallus.

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3229434 posted 5:28 am on Jan 25, 2007 (gmt 0)

I always thought the dirty words filter used asterisks. Oh well.

wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3229434 posted 5:49 am on Jan 27, 2007 (gmt 0)

Just a heads up.

216.182.233.215 - - [26/Jan/2007:20:37:04 -0800] "GET /robots.txt HTTP/1.0" 403 - "-" "complex_network_group/Nutch-0.9-dev (discovering the structure of the world-wide-web; [cantor.ee.ucla.edu...] nimakhaj@gmail.com)"

hybrid6studios

5+ Year Member



 
Msg#: 3229434 posted 9:48 am on Feb 7, 2007 (gmt 0)

I'm pretty sure this is either the little brother of SMBot or its replacement. Can anyone else confirm that this is run by Specific Media? Sure smells like it. Since we discussed it and I started banning it, SMBot completely quit hitting my sites and OpenISearch picked up where it left off, slamming my sites even worse than SMBot did.

Here are some interesting similarities with SMBot:

1) OpenISearch has the same format for the User-Agent:
- OpenISearch User-Agent: "OpenISearch/1.x (www.openisearch.com)"
- SMBot User-Agent: "SMBot/1.1 (www.specificmedia.com)"
2) The web sites are a very similar design style.
3) OpenISearch and SMBot both come from the same IP block (216.182.236.*, 216.182.237.*, 216.182.238.*) and server at Amazon Web Services (compute.amazonaws.com).
4) Both domains are registered to "Domains by Proxy".
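The shared User-Agent format from point 1 can be sketched as a regular expression. This is only an illustration: `UA_SHAPE` and `parse_ua` are hypothetical names, and matching the `Name/version (site)` shape is weak evidence on its own, since any bot author can copy it.

```python
import re

# "Name/version (site)": the shape both User-Agent strings share.
# The site part is restricted to hostname-like characters, so longer
# browser-style comments (with spaces or semicolons) do not match.
UA_SHAPE = re.compile(
    r"^(?P<name>[^/\s]+)/(?P<version>\S+) \((?P<site>[\w.-]+)\)$"
)

def parse_ua(ua: str):
    """Return name/version/site as a dict, or None if the shape differs."""
    m = UA_SHAPE.match(ua)
    return m.groupdict() if m else None

for ua in (
    "OpenISearch/1.x (www.openisearch.com)",
    "SMBot/1.1 (www.specificmedia.com)",
    "Mozilla/5.0 (compatible; MSIE 6.0)",  # hypothetical non-matching UA
):
    print(ua, "->", parse_ua(ua))
```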

Went to the site listed in the User-Agent, www.OpenISearch.com, and it's a front. Claims to be "The Ultimate Search Engine" that will have "more results than all other search engines combined". They're planning to overtake Google, Yahoo, and MSN? Have fun with that.

None of the links on the page are even working...it claims to be "Coming Soon". Hmmm...

Anyone else have info on OpenISearch/SMBot? Please contribute.

hybrid6studios

5+ Year Member



 
Msg#: 3229434 posted 10:11 am on Feb 7, 2007 (gmt 0)

I went through my logs again and found more IP blocks that these bots have in common. Here's my complete list:

216.182.225.*
216.182.228.*
216.182.230.*
216.182.231.*
216.182.233.*
216.182.236.*
216.182.237.*
216.182.238.*
216.182.239.*
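As a quick cross-check, the blocks listed above can be compared with the 216.182.224.0/20 range mentioned earlier in the thread. A rough sketch with Python's `ipaddress` module follows; the variable names are illustrative only, and `subnet_of` needs Python 3.7 or later.

```python
import ipaddress

# The /20 mentioned earlier in the thread.
BLOCK = ipaddress.ip_network("216.182.224.0/20")

# Third octets of the blocks listed above.
SEEN_THIRD_OCTETS = [225, 228, 230, 231, 233, 236, 237, 238, 239]
seen = [ipaddress.ip_network(f"216.182.{o}.0/24") for o in SEEN_THIRD_OCTETS]

# True if every listed block sits inside the /20.
all_inside = all(n.subnet_of(BLOCK) for n in seen)

# /24s inside the /20 that have not shown up in the logs so far.
unseen = [str(n) for n in BLOCK.subnets(new_prefix=24) if n not in seen]

print(all_inside)
print(unseen)
```

All nine listed blocks fall inside the /20, which is why blocking the whole range rather than each block separately is a reasonable choice here.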

wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3229434 posted 3:11 pm on Feb 7, 2007 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^216\.182\.2(2[4-9]|3[0-9])\. [OR]
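To see which addresses that condition covers, the pattern can be tried outside Apache. The sketch below uses Python's `re` module, which handles this simple pattern the same way, and reads the pattern with `|` as the alternation operator. `third_octets_matched` is a hypothetical helper name.

```python
import re

# Same pattern as the RewriteCond, with "|" as the alternation operator.
PATTERN = re.compile(r"^216\.182\.2(2[4-9]|3[0-9])\.")

def third_octets_matched():
    """List every third octet N for which 216.182.N.1 matches the pattern."""
    return [o for o in range(256) if PATTERN.match(f"216.182.{o}.1")]

matched = third_octets_matched()
print(matched[0], matched[-1], len(matched))
```

The matched octets run from 224 to 239, sixteen in all, which is the same span as 216.182.224.0/20.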

hybrid6studios

5+ Year Member



 
Msg#: 3229434 posted 9:07 am on Feb 8, 2007 (gmt 0)

Thanks for the info wilderness. I'm guessing you've had it hit a few of your sites?

wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3229434 posted 5:41 pm on Feb 8, 2007 (gmt 0)

I'm guessing you've had it hit a few of your sites?

In early December I added the range as a result of threads referenced in this thread.
OpenI has been relentlessly eating 403s from the IP range denial.
OpenI also catches a SetEnvIf for "Open".

In addition, I'm getting some slight traffic from the following. Of course, the request below trips three rules: one for the IP range (same Class C as OpenI), one for "Nutch", and one for "crawl".

216.182.236.zz - - [05/Feb/2007:18:47:20 -0800] "GET /robots.txt HTTP/1.0" 403 - "-" "complex_network_group/Nutch-0.9-dev (discovering the structure of the world-wide-web; [cantor.ee.ucla.edu...] nimakhaj@gmail.com)"

As a result of the four rules implemented in SetEnvIf (my SetEnvIf and deny from's are not configured to allow the reading of robots.txt, whereas my Rewrites for specific IP ranges are allowed access to robots.txt), neither is able to read robots.txt and is stuck in a 403 loop.
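The interaction described above can be modelled roughly in a few lines. This is a simplified sketch and not the actual configuration: it assumes the User-Agent rules are plain case-sensitive substring matches on "Open", "Nutch" and "crawl", that the IP rule covers 216.182.224.0/20, and that only the IP rule exempts robots.txt. `status_for` and "SomeBrowser/1.0" are hypothetical.

```python
import ipaddress

# Assumed IP range for the rewrite-based block.
BLOCKED_NET = ipaddress.ip_network("216.182.224.0/20")

# Substrings named in the post; treated here as case-sensitive matches.
UA_SUBSTRINGS = ("Open", "Nutch", "crawl")

def status_for(ip: str, path: str, user_agent: str) -> int:
    """Return the HTTP status this simplified model would send."""
    ip_hit = ipaddress.ip_address(ip) in BLOCKED_NET
    ua_hit = any(s in user_agent for s in UA_SUBSTRINGS)
    if ua_hit:
        return 403  # User-Agent deny: no exemption for robots.txt
    if ip_hit and path != "/robots.txt":
        return 403  # IP-range block: robots.txt stays readable
    return 200

# A bot caught by a User-Agent rule cannot even fetch robots.txt.
print(status_for("216.182.238.102", "/robots.txt",
                 "OpenISearch/1.x (www.openisearch.com)"))

# The same address with an unlisted User-Agent can still read robots.txt.
print(status_for("216.182.238.102", "/robots.txt", "SomeBrowser/1.0"))
```

Under these assumptions, a bot that matches a User-Agent rule never gets to see a Disallow line, so it keeps retrying robots.txt and collecting 403s.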

hybrid6studios

5+ Year Member



 
Msg#: 3229434 posted 10:43 am on Feb 11, 2007 (gmt 0)

Thanks, wilderness. So the updated range is 216.182.224.* - 216.182.239.* (for newbies not familiar with regex).


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved