
Search Engine Spider and User Agent Identification Forum

    
Charlotte the spider
anyone seen this one yet?
innocbystr

5+ Year Member



 
Msg#: 3252 posted 5:22 pm on May 17, 2006 (gmt 0)

209.249.xx.x - - [17/May/2006:03:24:23 -0700] "GET /robots.txt HTTP/1.0" 200 163 "-" "Mozilla/5.0 (compatible; Charlotte/1.0b; charlotte@beta.spider.com)"

All I got for search results were the classic children's book and an ancient browser. Clever name.

 

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3252 posted 8:59 pm on May 17, 2006 (gmt 0)

As a general rule, any UA that contains the words Web, spider, or crawl turns out to be a pest.

In more than five years I've only seen a few exceptions.

Don
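
The rule of thumb above can be tried against a log file with a few lines of Python. This is only a minimal sketch: it assumes the server writes the combined log format shown in the first post, and the names `LOG_RE`, `SUSPECT_TOKENS`, `parse_line` and `suspect_tokens` are made up for illustration. A plain substring test for "web" also matches "AppleWebKit" in present-day browsers, so the token list would need tuning before being used for blocking.

```python
import re

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Tokens taken from the rule of thumb in the post above (hypothetical name).
SUSPECT_TOKENS = ("web", "spider", "crawl")


def parse_line(line):
    """Parse one combined-format log line into a dict, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def suspect_tokens(user_agent):
    """Return the suspect tokens found in a User-Agent string, ignoring case."""
    lowered = user_agent.lower()
    return [token for token in SUSPECT_TOKENS if token in lowered]


# The log line from the first post in this thread.
line = (
    '209.249.xx.x - - [17/May/2006:03:24:23 -0700] "GET /robots.txt HTTP/1.0" '
    '200 163 "-" "Mozilla/5.0 (compatible; Charlotte/1.0b; charlotte@beta.spider.com)"'
)
entry = parse_line(line)
print(entry["agent"])
print(suspect_tokens(entry["agent"]))  # ['spider'], from the e-mail address in the UA
```

Here the match comes from the contact address inside the UA string, not from the bot name itself.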

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3252 posted 9:02 pm on May 17, 2006 (gmt 0)

I saw the same one today. The IP has visited since late April but without a UA and was banned.
Today the UA is like you mentioned and googling for Ch.../1.0b has no results.
Visiting the site named in the email address gives a redirect to another site that has nothing to do with an SE but something to do with spidering technology.

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3252 posted 11:20 pm on May 17, 2006 (gmt 0)

I saw this one, too. It read robots.txt, and then ignored it. I hope it likes 403s.

Jim
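
One way to confirm that a bot read robots.txt and then ignored it is to replay its requested paths against the rules. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt rules and the list of paths are hypothetical, since the real file and the bot's later requests are not shown in the thread, and `disallowed_requests` is an illustrative name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; substitute the site's real rules.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]


def disallowed_requests(robots_lines, user_agent, requested_paths):
    """Return the requested paths that robots.txt disallows for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [path for path in requested_paths
            if not parser.can_fetch(user_agent, path)]


# Hypothetical paths pulled from the access log for one bot.
paths = ["/robots.txt", "/index.html", "/private/report.html"]
print(disallowed_requests(ROBOTS_LINES, "Charlotte", paths))
# ['/private/report.html']
```

Any path in the result is a request the bot made despite being asked not to.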

vortech

5+ Year Member



 
Msg#: 3252 posted 11:36 pm on May 20, 2006 (gmt 0)

Came in early this morning and hit my trap.

Netblock listed as: Kavam MFN-T595-209-249-86-0-24

Could it be kavam crawler related?
Google kavam and you'll see a startup out of California from a former AltaVista exec.
Kavam out of our old friend Hurricane Electric.

BaseVinyl

10+ Year Member



 
Msg#: 3252 posted 1:36 pm on May 21, 2006 (gmt 0)

Well...how do I block it? I am trying to block the IP in .htaccess but it still keeps sucking pages...

Here is my .htaccess code

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from 209.249.86.4

Is that not enough?
:(

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3252 posted 2:39 pm on May 21, 2006 (gmt 0)

Options -Indexes
<Limit GET>
SetEnvIf User-Agent Charlotte keep_out
order allow,deny
deny from 209.249.86.
allow from all
deny from env=keep_out
</Limit>
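
For anyone unsure what the rules above are meant to catch, the decision can be sketched in Python. This is an illustration, not a replacement for the .htaccess code. It assumes the partial address `209.249.86.` is intended to cover the whole 209.249.86.0/24 range, which is why it catches more than the single IP in the earlier attempt. It also ignores the `<Limit GET>` method scoping. The names `BLOCKED_NET`, `BLOCKED_UA_SUBSTRING` and `is_denied` are made up.

```python
import ipaddress

# Assumption: "deny from 209.249.86." is meant to cover this whole /24.
BLOCKED_NET = ipaddress.ip_network("209.249.86.0/24")

# SetEnvIf patterns are case-sensitive, so the check here is too.
BLOCKED_UA_SUBSTRING = "Charlotte"


def is_denied(remote_ip, user_agent):
    """Return True if the request matches either deny rule (IP range or User-Agent)."""
    in_blocked_range = ipaddress.ip_address(remote_ip) in BLOCKED_NET
    has_blocked_agent = BLOCKED_UA_SUBSTRING in user_agent
    return in_blocked_range or has_blocked_agent


charlotte_ua = "Mozilla/5.0 (compatible; Charlotte/1.0b; charlotte@beta.spider.com)"
print(is_denied("209.249.86.4", ""))             # True: inside the blocked range
print(is_denied("192.0.2.10", charlotte_ua))     # True: User-Agent matches
print(is_denied("192.0.2.10", "Mozilla/4.0"))    # False: neither rule matches
```

Blocking on both the range and the UA means the bot is still refused if it moves to a neighbouring address or keeps its name on a new one.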

BaseVinyl

10+ Year Member



 
Msg#: 3252 posted 2:56 pm on May 21, 2006 (gmt 0)

Thanks wilderness!

starhugger

5+ Year Member



 
Msg#: 3252 posted 4:31 pm on May 22, 2006 (gmt 0)

It seems I've posted a thread here about this spider too, but just found out it's the same one. Royal pain in the web! Is anyone else getting requests for compound URLs that are composed of real directory and filenames but don't collectively add up to a real URL? I describe this in more detail here:

[webmasterworld.com...]

What are these guys looking for by deliberately requesting URLs that don't exist? I find this a bit disturbing. Is it perhaps a hack attempt?

So, are we to add this code as-is to our .htaccess file? I've worked with redirects in this file but not blocking someone.

Options -Indexes
<Limit GET>
SetEnvIf User-Agent Charlotte keep_out
order allow,deny
deny from 209.249.86.
allow from all
deny from env=keep_out
</Limit>

Thanks,

Starhugger

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3252 posted 11:45 pm on May 22, 2006 (gmt 0)

What are these guys looking for by deliberately requesting URLs that don't exist?

Seems to be a new scraping technique, at least in part. I am seeing this from several different known scraper "directories."

starhugger

5+ Year Member



 
Msg#: 3252 posted 12:08 am on May 23, 2006 (gmt 0)

Keyplyr, thanks for your reply. But why would this scraper be using these weird appended/compound URLs? It doesn't make sense. Either this bot is incredibly incompetent, or else it's deliberately trying to generate 40x errors. What would it be trying to get from that? Any ideas?

Starhugger

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3252 posted 7:52 am on May 23, 2006 (gmt 0)


I haven't a clue as to why they do it, just a suspicion that it plays into the scraper scenario. Example:

63.148.99.*** - - [22/May/2006:15:18:26 -0700] "GET /%26y%3D029700D88CA97D40%26i%3D41%26c%3D2699%26q%3D02%5ESSHPM
[L7kwz?wvlkpmf?py?u~ee?rjlv¦6&e=utf8&r=2&d=www-en-us&n=89045H1
EOPTK1IPG&s=175&t=&m=407BD303&x=0164DE6A9CD3A7EA HTTP/1.1" 404 1980 "http://www.some-scraper-directory.com" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)"

Got a few of these today. I've seen several variations, but on each occasion when I back-track to either the referrer URL or the IP address, there's one of these scraper directories at the other end.
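
Decoding the percent-escapes makes a request like the one above easier to read. The sketch below uses only the first line of the logged path (the remainder is left exactly as logged and is not decoded here) and Python's standard `urllib.parse`. Once decoded, the path appears to be a string of query-style parameters that was escaped and appended to the site root, which would explain the 404.

```python
from urllib.parse import parse_qsl, unquote

# First line of the request path from the log entry above.
raw_path = "/%26y%3D029700D88CA97D40%26i%3D41%26c%3D2699%26q%3D02%5ESSHPM"

# %26 is "&", %3D is "=", %5E is "^".
decoded = unquote(raw_path)
print(decoded)
# /&y=029700D88CA97D40&i=41&c=2699&q=02^SSHPM

# Treat what follows the leading "/&" as a query string to list its parameters.
params = parse_qsl(decoded.lstrip("/&"))
print(params)
# [('y', '029700D88CA97D40'), ('i', '41'), ('c', '2699'), ('q', '02^SSHPM')]
```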

victor

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3252 posted 7:23 pm on May 24, 2006 (gmt 0)

It's a serious pest that has been attacking a site of mine for days... It treats 403s as a neon sign saying "welcome"

Mokita

5+ Year Member



 
Msg#: 3252 posted 5:37 am on Jun 8, 2006 (gmt 0)

Just found this URI:

[betaspider.com...]

We are a stealth-mode startup that is indexing the web for a novel application.

Would someone explain what a "stealth-mode startup" is please? ;)

innocbystr

5+ Year Member



 
Msg#: 3252 posted 5:55 am on Jun 8, 2006 (gmt 0)

Would someone explain what a "stealth-mode startup" is please? ;)

That means it's a hush-hush top secret project: They ain't tellin' anyone anything.

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 3252 posted 6:12 am on Jun 8, 2006 (gmt 0)

It just hit one of my sites asking for > 10K pages yesterday.

Got nothing but crap as it bounced off my bot blocker, but it kept asking anyway.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3252 posted 5:38 am on Jun 26, 2006 (gmt 0)

charlotte.betaspider.com
Mozilla/5.0 (compatible; Charlotte/1.0b; charlotte@betaspider.com)
06/25 22:25:44 /robots.txt

Same IP as mentioned previously.

Amazing how many bots are using 'does not exist' Hosts nowadays.

Courtesy of [dnsstuff.com...] --

IP address: 209.249.86.4
Reverse DNS: charlotte.betaspider.com
Reverse DNS authenticity: [Could be forged: hostname charlotte.betaspider.com does not exist]
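
The "could be forged" verdict comes from a forward-confirmed reverse DNS check: look up the hostname for the IP, then look up that hostname and see whether it resolves back to the same IP. Below is a minimal sketch with Python's standard `socket` module. The function names and the result strings are illustrative and are not DNS Stuff's own. The two lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket


def reverse_lookup(ip):
    """Return the PTR hostname for an IP address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses a hostname resolves to, or [] if it does not resolve."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def check_rdns(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Classify an IP by whether its reverse DNS name resolves back to it."""
    host = reverse(ip)
    if host is None:
        return "no reverse DNS"
    if ip in forward(host):
        return "confirmed: " + host
    return "could be forged: " + host


# Stub lookups reproducing the case reported above: the PTR record exists,
# but the hostname itself does not resolve.
result = check_rdns(
    "209.249.86.4",
    reverse=lambda ip: "charlotte.betaspider.com",
    forward=lambda host: [],
)
print(result)  # could be forged: charlotte.betaspider.com
```

Calling `check_rdns("209.249.86.4")` with no stubs performs the live lookups instead.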

starhugger

5+ Year Member



 
Msg#: 3252 posted 11:28 am on Jun 26, 2006 (gmt 0)

Pfui wrote: "Amazing how many bots are using 'does not exist' Hosts nowadays."

Greeeeat... Do you think this kind of thing is a serious enough problem that web hosts will create new security filters to shut out these little creeps? Or, since we're paying for the bandwidth that they chew up, do you think it will be regarded as our problem? I'm just thinking how ISPs and email providers seem to have been forced to invest in anti-spam and anti-virus filters, if only to protect themselves from their service being chewed up by those critters. Just wondering if something like that might happen with creepy-crawly things too. Any thoughts?

Starhugger

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3252 posted 6:27 pm on Jun 26, 2006 (gmt 0)

1.) Re "Could be forged" hosts/IPs...

I didn't really even know about or notice that aspect until DNS Stuff [dnsstuff.com] started adding "IP Information" as a quick lookup option. Ah, ignorance was bliss:)

And please know that I'm not a network person -- my SysAdmin tends to that and the servers and I do the Web (a good bifurcated arrangement when one's SysAdmin is also one's spouse:) But our entire Class C block (0-255 IP addresses), all of our network addresses, have a nameserver assigned to them. Meaning when you do a lookup even for our unassigned IPs, they exist, they're legit.

So when it comes to what these other guys are doing, beats me. Way I look at it, Bad Guys and/or bot-runners and their ilk know it's one way to hide. Or, giving some the benefit of the doubt, they don't know how nameservice and such works. Either way, I don't want or need that kind of 'visitor' crawling through my stuff.

2.) Re an increasing number of apparent/actual 'forgeries'...

Were I an ISP, I probably wouldn't think twice because the scope of my operations would be such that any of these guys is a flea on the rump of an elephant. And I'd already have loads of security and efficiency kinds of things in place, both software and hardware, so drop-kicking one more iffy IP block into the bit bucket would be more annoyance than anything else.

However, as a site owner, that means an increasing number of visitors (good & bad) won't be able to see my site(s) because more countries and IPs = ISPs, just as good and bad 'users' are increasingly blocked on the e-mail side. To the average site owner, the end result will probably be negligible, if even noticeable. Alternatively, if controlling site/server access matters a whole lot to, say, a large commercial enterprise, they're already solely or co-located and continuously monitoring/controlling who comes and goes, and who doesn't.

3.) Long, iffily OT (sorry) muse short...

I'm not going to lose any sleep over the 'forgeries,' at least not yet:) Because for myself, the more I pay attention to the jerks, the less I spend time doing that which is remunerative(!) and also which I truly enjoy -- working with clients, building their sites, seeing their plans and dreams and goals realized, hearing them exclaim, "I love it!"

Geeking is like gardening: You're never going to kill every weed, let alone prevent any from coming back. But if all you focus on are the weeds, you lose sight of what's growing right.

And now, I really must scoot and tend to some flowers (& yank a few weeds:)

innocbystr

5+ Year Member



 
Msg#: 3252 posted 6:34 pm on Jun 26, 2006 (gmt 0)

Amazing how many bots are using 'does not exist' Hosts nowadays.

I looked it up on a different IP "looker upper" and came up with "Private Block Address"....not much better.

Was using a really good IP whois site but it went down a few weeks ago. Don't really trust the one that pfui uses as it turns out bad results on geo lookup, i.e. Glasgow, Scotland, Norway (and other erroneous results).

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 3252 posted 7:34 pm on Jun 26, 2006 (gmt 0)

Don't really trust the one that pfui uses as it turns out bad results on geo lookup

george's canufly was the BEST.
Not only do we all miss george today, we assuredly will in the future as well.
george had built some sub-delegation databases based on local airport names that many of the BIG IP's use. george understood the three-letter coding, which to the rest of us looks random.

Thanks to George for your long effort and dedication.
Hope your rest is an enjoyable change of routine.

Don

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3252 posted 8:08 pm on Jun 26, 2006 (gmt 0)

How often do you guys check geolocation [dnsstuff.com]? And how does geo info help you?

When I'm researching a bot or a Host/IP and/or to get a look at an iffy site without going there, I use a mix of DNS Stuff, Google, dig, traceroute, and Name Intelligence's Domain Tools [domaintools.com] (formerly Whois Source; whois-dot-sc). They're the ones running:

www.[whois].sc <=brackets added or else BestBBS obfuscates
SurveyBot/2.3 (Whois Source)

Between G and DNS Stuff and DT, I usually get more than enough info to either confirm or allay my suspicions.

(Hmm... Someone should start a thread about bot-lookup techniques. ~Oh, Dan...~ :)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved