
Search Engine Spider and User Agent Identification Forum

    
Charlotte/0.05
jimji

10+ Year Member



 
Msg#: 3865123 posted 4:20 pm on Mar 7, 2009 (gmt 0)


I have been seeing this "something" on my site for about 48 hours. It seems to use 9 IP addresses that show up as proxies out of Texas in the United States. All I can see is that the "something" is viewing topics or the index page.

Anybody know anything about this?

Thank you.

 

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3865123 posted 4:34 pm on Mar 7, 2009 (gmt 0)

Charlotte [webmasterworld.com] is a search engine spider/bot that has a reputation for being very badly behaved.

jimji

10+ Year Member



 
Msg#: 3865123 posted 4:48 pm on Mar 7, 2009 (gmt 0)


What I am seeing is this:

Guest IP: 00.00.000.00 » Whois
Charlotte/0.05 Index page Sat Mar 07, 2009 5:48 am

I'll leave the IP off for now. You see, this looks different from what's discussed in the thread you referred me to.

Just a new face of the same crawler, maybe?

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3865123 posted 4:52 pm on Mar 7, 2009 (gmt 0)

I see now that the version seems to have gone backwards. So maybe it's someone spoofing Charlotte. Or it could be an entirely different user agent.

Now that a mod has moved this to the right forum I'll bet you get some better input. :)

jimji

10+ Year Member



 
Msg#: 3865123 posted 6:04 pm on Mar 7, 2009 (gmt 0)


It is odd -- Charlotte/1.0b in '06 and Charlotte/0.05 in '09.

I'm going to ban the IPs and see what happens.

See if it comes back at me with new ones.

Anyway, can't be up to any good, right?

I can always unban the IPs if I find out it's a charity organization and all sweet stuff.

But I do appreciate the help on an identification. Nobody else seems to know anything about this.

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3865123 posted 6:24 pm on Mar 7, 2009 (gmt 0)

It seems reasonable to ban a user agent/search engine you're not getting traffic from.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3865123 posted 6:35 pm on Mar 7, 2009 (gmt 0)

If it's bot info you want, you've come to the right place. You're surrounded by members of the just-named KBO, a.k.a. Keep the Bots (or Bums; or B*stards) Out. Welcome!

: )

[edited by: incrediBILL at 10:46 pm (utc) on Mar. 7, 2009]
[edit reason] removed off topic links [/edit]

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 10:51 pm on Mar 7, 2009 (gmt 0)

It's the crawler for searchme.com. It's been documented a few times, but people spoof it as well.

If you run a whois on the IP and it's not in a block owned by Kavam, it's probably a fake.

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3865123 posted 12:00 am on Mar 8, 2009 (gmt 0)

Thanks, Bill.

dstiles

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 3865123 posted 2:14 am on Mar 8, 2009 (gmt 0)

Searchme, Inc on IP range 204.62.52.0 - 204.62.55.255.

It never seemed to manage proper headers so it gradually blocked itself.
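
For anyone who would rather test an address against that range in a script than by eye, here is a minimal sketch using Python's standard ipaddress module. The function names are made up for illustration, and the range is simply the one quoted in this post, so treat it as a 2009 snapshot rather than current data.

```python
import ipaddress

# Range quoted in the post above (a 2009 snapshot, not verified current data).
SEARCHME_FIRST = ipaddress.ip_address("204.62.52.0")
SEARCHME_LAST = ipaddress.ip_address("204.62.55.255")


def in_searchme_range(ip: str) -> bool:
    """Return True if ip falls inside the quoted Searchme range."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IP address (for example a masked "71.170.242.nnn").
        return False
    return SEARCHME_FIRST <= addr <= SEARCHME_LAST


def range_as_cidr() -> list:
    """Express the first/last pair as CIDR blocks, handy for firewall rules."""
    networks = ipaddress.summarize_address_range(SEARCHME_FIRST, SEARCHME_LAST)
    return [str(net) for net in networks]


if __name__ == "__main__":
    print(in_searchme_range("204.62.53.36"))
    print(range_as_cidr())
```

A masked address such as 71.170.242.nnn is not parseable, so the sketch reports it as outside the range rather than raising an error.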

jimji

10+ Year Member



 
Msg#: 3865123 posted 3:40 am on Mar 9, 2009 (gmt 0)


What has me stumped is this:

71.170.242.nnn
99.6.235.nnn
65.67.112.nnn
216.84.45.nnn
99.186.215.nnn
66.25.28.nnn
74.81.199.nnn
66.25.8.nnn
65.69.153.nnn

Admittedly, I'm no expert, but I don't ever recall seeing a "user agent" farming a website using assorted ISPs. Off and on we had 8 of those on at the same time. That would either be one very sophisticated bot program, or very low tech manual work. No?

What concerns me is whether this was specifically targeted at only my site. If so, then I won't be getting feedback from anyone else having seen this happening on their site.

Why would somebody target me specifically? Good question. Maybe it's love.

But it seems odd, and when it comes to security "odd" worries me.

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 7:01 am on Mar 9, 2009 (gmt 0)

It's not odd at all.

Some of the scrapers or spam email harvesters use a wide range of user agents and rotate them randomly, attempting to find one that gets past your security.

However, you could also be seeing a very new crawler that's too naive to know any better than to crawl through proxies all over the place. That technique is known as proxy hijacking, where you claim another site's content by redirection.

jimji

10+ Year Member



 
Msg#: 3865123 posted 12:13 am on Mar 11, 2009 (gmt 0)

Is it possible for all nine IP addresses to be listed as "Known Proxy? No" or "Proxy: None detected"?

That's from two sites that I checked these addresses on. I'll go further, if needed, but I'd like to know if "proxy hijacking" can be accomplished from addresses that are not proxies?

I can provide my full list of the results from the two sites for each address, if the rules here allow it.

By the way, <incrediBILL>, I've been studying your posts from back in 2007 and since and some other sites, and the more I read the more confused I get about the actual definition of "proxy" in the expression "proxy hijacking" so please allow me to apologize for not yet getting the picture clear in my mind and asking what may seem to you to be stupid questions. I am trying.

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 12:47 am on Mar 11, 2009 (gmt 0)

Proxy hijacking is simple, let me draw you a picture.

Here's a crawl example:

Googlebot -> exampleproxy.com -> examplesite.com

Googlebot asks for a page through exampleproxy.com, using some link like:

exampleproxy.com/blah/examplesite.com/index.html

So Googlebot has now crawled a page from examplesite.com using exampleproxy.com to deliver that page, and now exampleproxy.com *MAY* be credited as the source of the page instead of its rightful owner, examplesite.com.

Does that make sense?

It doesn't happen often anymore, at least not that I'm seeing in Google or Live.
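
To make the picture a bit more concrete, here is a rough sketch that pulls the example link apart. It assumes the proxy embeds the target site in the path exactly the way the example link does (proxy host, one prefix segment, target host, then the page); real proxy sites use all sorts of URL layouts, and the function name is invented for illustration.

```python
def split_proxy_url(url: str) -> dict:
    """Split 'proxyhost/prefix/targethost/page' into its parts.

    Assumes the layout of the example link in the post; real proxies vary.
    """
    # Drop a scheme such as "http://" if one is present.
    if "://" in url:
        url = url.split("://", 1)[1]
    proxy_host, _, rest = url.partition("/")
    prefix, _, rest = rest.partition("/")
    target_host, _, target_path = rest.partition("/")
    return {
        "proxy_host": proxy_host,          # the host the crawler actually talks to
        "prefix": prefix,                  # proxy-specific path segment
        "target_host": target_host,        # the site whose content is delivered
        "target_path": "/" + target_path,  # the page requested on that site
    }


if __name__ == "__main__":
    print(split_proxy_url("exampleproxy.com/blah/examplesite.com/index.html"))
```

The point of the split: the crawler only ever connects to the proxy host, and the request that reaches the target site comes from the proxy's IP, not from the crawler's.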

jimji

10+ Year Member



 
Msg#: 3865123 posted 12:59 am on Mar 11, 2009 (gmt 0)

Let me offer an example from just a bit ago on my site, please.

Google [Bot] IP: 66.249.66.106 » Whois
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Reading topic in Politics Wed Mar 11, 2009 12:39 am

I am afraid I don't understand where the "exampleproxy.com" is located in that information above.

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 1:01 am on Mar 11, 2009 (gmt 0)

The IP you're seeing is a real Googlebot IP, not a problem.
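
If you want to check this sort of thing from a script rather than by eye, one common approach is forward-confirmed reverse DNS: look up the hostname for the IP, check that it ends in a domain the crawler's operator uses, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below is only an illustration. The function name and the injectable lookup parameters are invented here so the logic can be tried without network access, and the domain suffixes you pass in are something to confirm with the crawler's operator.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check (illustrative sketch).

    reverse_lookup(ip) -> hostname and forward_lookup(hostname) -> list of IPs
    default to the standard socket calls; pass fakes to test offline.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS record at all

    # The hostname must be the allowed domain itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        return ip in forward_lookup(host)
    except OSError:
        return False  # hostname does not resolve forward
```

Called as is_verified_crawler("66.249.66.106", ["googlebot.com", "google.com"]) it performs live DNS lookups, so the answer depends on what DNS returns at the time you run it.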

jimji

10+ Year Member



 
Msg#: 3865123 posted 1:16 am on Mar 11, 2009 (gmt 0)


Okay, I got that, but then I get back to my question about the nine IP addresses of Charlotte/0.05. They are also listed as not being proxies.

In fact, since we started keeping records a month or so ago I think my people have identified about 15 "user agents" that are listed on <honeypot> and <stopforumspam> as being problems.

I was just trying to match that list with the one you have as a sticky above and it seems we are doing something wrong, or something different.

But, back on-topic, it's that Charlotte bla.bla we're focusing on here. If the addresses aren't proxies, then is it still proxy hijacking? If not, what might we be seeing?

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 1:20 am on Mar 11, 2009 (gmt 0)

You got me. Even residential IPs can be used for a proxy site, which some of those appear to be, and residential IPs can also scrape. I'd have to watch the activity to figure it out; I just can't conjecture on a limited number of facts.

jimji

10+ Year Member



 
Msg#: 3865123 posted 1:37 am on Mar 11, 2009 (gmt 0)

Well, I could give you a rather detailed picture of when the "something" showed up, how often, what it was viewing, etc. but it may still not help much. I haven't been into my cPanel yet to gather more info, but we banned all nine IPs and I was sort of waiting to see if it showed up with a new one, but it hasn't.

I was just wondering if anyone else had seen this so I could assure myself that it was not targeted only at my site.

I'm afraid I picked up a few "enemies" over the years while running another rather large site and that's why I wondered if somebody was up to more than just no good -- like real bad stuff.

I've got a post about this over on phpBB, so between here and there and keeping an eye out on some other known "we ID baddies" sites I'll see in a week or so if anyone else has seen this mystery gal named Charlotte. Hope I don't get in trouble before then.

My Moderation Team Leader informed me yesterday my site has been listed on some sort of blacklist, so I've got to figure that out next.

But like I indicated above, the more I seem to learn, the more questions seem to come up. It wouldn't be so bad if I didn't have brick-and-mortar work to deal with, as well.

But if you are really <incrediBILL> with super incredible powers, Bill, you could slow down the rotation of this planet and get us all a few more hours per day, right? Then you would be famous off the Net, as well.

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 1:44 am on Mar 11, 2009 (gmt 0)

But if you are really <incrediBILL> with super incredible powers, Bill, you could slow down the rotation of this planet and get us all a few more hours per day, right? Then you would be famous off the Net, as well.

LOL - My powers are the ability to work at home and avoid a corporate job, which already gives me the few extra hours per day that everyone else wastes getting ready for and going to/from that job, so technically I have a few more hours in my day.

About the crawler, I wouldn't be too concerned. I get literally hundreds of things like this daily, and if I stopped to worry about them all I'd never get anything done. I have firewalled my site to the best of my ability, and if anything does get through, they're more determined than I am at that point.

FWIW, there is a very big internet "black market" that makes money leeching the content of others so if you're being targeted it means you're successful and probably a leader in your field as these people don't waste their time scraping losers.

jimji

10+ Year Member



 
Msg#: 3865123 posted 1:53 am on Mar 11, 2009 (gmt 0)


FWIW, there is a very big internet "black market" that makes money leeching the content of others so if you're being targeted it means you're successful and probably a leader in your field as these people don't waste their time scraping losers.

In that case, there are some really stupid people out there, because I doubt my site is the leading anything. Down the road, maybe, but not now.

Anyway, I appreciate you taking the time to help me out here.

dstiles

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 3865123 posted 4:12 am on Mar 11, 2009 (gmt 0)

Most days now I get a raft of IPs within a few minutes, every IP different (or perhaps two or three hits per IP) all aimed at a single site, and then nothing for a while. Next day-ish they're aimed at a different site but with a similar pattern. One time the UA may be MSIE, another time Opera. For the most part they can be detected by unusual header parameters.

I am fairly certain these are coming from botnets. Most are from domestic IP blocks, some of which are running a primitive server of some kind, suggesting a proxy has been installed or (more likely?) someone was careless installing their OS.

Purpose: No idea. Possibly scraping content for use in hyping their trojanned sites, but at least several of the targeted sites use querystrings, so possibly the bots are looking for vulnerable servers.
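
As a rough illustration of the header check: the post does not spell out which header parameters count as unusual, so the rule below (a request claiming to be MSIE or Opera but missing headers that mainstream browsers normally send) is purely an assumption, and the names are made up for the sketch.

```python
# Assumption: headers a mainstream browser would normally send with a page request.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")


def header_red_flags(headers: dict) -> list:
    """Return a list of reasons a request's headers look un-browser-like."""
    # Header names are case-insensitive, so normalise them first.
    lower = {name.lower(): value for name, value in headers.items()}
    ua = lower.get("user-agent", "")
    flags = []

    if not ua:
        flags.append("empty user-agent")

    # Only apply the browser expectations to UAs that claim to be a browser.
    claims_browser = "MSIE" in ua or "Opera" in ua
    if claims_browser:
        for name in EXPECTED_BROWSER_HEADERS:
            if name not in lower:
                flags.append("missing " + name)
    return flags
```

A non-empty result is a hint to look closer, not proof of a bot; some legitimate proxies and privacy tools strip headers too.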

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 5:51 am on Mar 11, 2009 (gmt 0)

By leading, I mean showing up in the top 10 SERPs somewhere.

If you hit the top 10, you're a target.

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3865123 posted 6:17 pm on Mar 15, 2009 (gmt 0)

Both the above UA and this one, just plain Charlotte, visited me last week. Both from the same IP Address so I know they're related. It was a Verizon Internet Services IP range.

[edited by: GaryK at 6:19 pm (utc) on Mar. 15, 2009]

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 7:00 pm on Mar 15, 2009 (gmt 0)

The real Charlotte/Searchme.com crawler has the following characteristics:

Full reverse DNS that identifies the crawler like the big SEs:
crawl2.nat.svl.searchme.com.

The following UA:
"Mozilla/5.0 (compatible; Charlotte/1.1; [searchme.com ])"

Operating from the following location:
network:IP-Network:208.111.154.0/24
network:Auth-Area:208.111.128.0/18
network:Org-Name:Kavam, Inc.

Anything else should be treated as a spoof, fake, whatever you want to call it.
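
Those three checks could be combined in a script along the following lines. This is a sketch under assumptions: the function name is invented, the hostname is taken as an input that you have already resolved (for example from your server log), and only the start of the UA is compared because the link portion of the UA string above was mangled by the forum software.

```python
import ipaddress

# Network and UA prefix as listed in the post above (a 2009 snapshot).
KAVAM_NETWORK = ipaddress.ip_network("208.111.154.0/24")
CHARLOTTE_UA_PREFIX = "Mozilla/5.0 (compatible; Charlotte/"
RDNS_DOMAIN = "searchme.com"


def looks_like_real_charlotte(ip: str, user_agent: str, rdns_host) -> bool:
    """True only if IP, user agent, and reverse DNS hostname all match."""
    try:
        in_network = ipaddress.ip_address(ip) in KAVAM_NETWORK
    except ValueError:
        return False  # unparseable address

    host = (rdns_host or "").rstrip(".").lower()
    host_ok = host == RDNS_DOMAIN or host.endswith("." + RDNS_DOMAIN)
    ua_ok = user_agent.startswith(CHARLOTTE_UA_PREFIX)
    return in_network and host_ok and ua_ok
```

Under these rules a bare "Charlotte/0.05" from a residential address fails on all three counts. The hostname argument should itself come from a trustworthy reverse lookup, since a log field can be spoofed as easily as a UA.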

[edited by: incrediBILL at 11:30 pm (utc) on Mar. 15, 2009]

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3865123 posted 9:22 pm on Mar 15, 2009 (gmt 0)

Variations on the Will-the-real-Charlotte-please-stand-up? theme -- one repeat + two more IPs connected to KAVAM/Searchme. (Note also three bot versions.):

208.111.154.249
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.11) Gecko/20080109 (Charlotte/0.9t; [searchme.com ])
robots.txt? NO

-----
204.62.53.36
Mozilla/5.0 (compatible; Charlotte/1.0t; [searchme.com ])
robots.txt? NO

OrgName: Searchme, Inc.
OrgID: KAVAM
Address: 800 W El Camino Real, Suite 100, Mountain View, CA 94040
NetRange: 204.62.52.0 - 204.62.55.255 
CIDR: 204.62.52.0/22

-----
209.249.86.17
Mozilla/5.0 (compatible; Charlotte/1.0b; [searchme.com...]
robots.txt? YES

209.249.86.210
Mozilla/5.0 (compatible; Charlotte/1.0t; [searchme.com...]
robots.txt? NO

OrgName: Abovenet Communications, Inc
CustName: Kavam
Address: 1735 Lundy Ave, San Jose, CA 95131
NetRange: 209.249.86.0 - 209.249.86.255 
CIDR: 209.249.86.0/24 

[edited by: incrediBILL at 11:32 pm (utc) on Mar. 15, 2009]
[edit reason] fixed urls [/edit]

incrediBILL

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 3865123 posted 9:56 pm on Mar 15, 2009 (gmt 0)

Just because one IP doesn't read the robots.txt doesn't mean the other IP didn't read it. A crawler only needs to read it once, regardless of which IP it's crawling from.

I'm wondering why some of the IPs you referenced aren't set up for reverse DNS yet?

Are those older IPs no longer in use? All the current ones hitting my server provide reverse DNS, as I showed above.
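
To illustrate the robots.txt point, a log check would group hits by crawler rather than by IP. The sketch below assumes you have already reduced your access log to (ip, crawler name, path) tuples; that input shape and the function name are assumptions made for the example.

```python
from collections import defaultdict


def robots_fetch_by_crawler(hits):
    """Summarise, per crawler, which IPs it used and whether any fetched robots.txt.

    hits: iterable of (ip, crawler_name, path) tuples taken from an access log.
    """
    summary = defaultdict(lambda: {"ips": set(), "fetched_robots": False})
    for ip, crawler, path in hits:
        entry = summary[crawler]
        entry["ips"].add(ip)
        # One fetch from any IP counts for the crawler as a whole.
        if path == "/robots.txt":
            entry["fetched_robots"] = True
    return dict(summary)
```

This only tells you whether the file was requested within the log window you feed it; whether the crawler then honoured the rules is a separate question.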

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3865123 posted 11:04 pm on Mar 15, 2009 (gmt 0)

1.) The hits I get from Charlotte (from all sources) are infrequent enough that I doubt my robots.txt is cached and that's why it's unrequested. And ignored.

2.) I can't really discuss the reverse DNS thing, sorry -- I'm a Web geek, not a Network geek:)

But I can tell you that on 02-16-09, the second IP I mentioned, 204.62.53.36, hit as a bare address. No hostname. The other address-only hits date back to 09-08 and I got their what-where data from WHO*S today. In between, I've seen host hits akin to your mention:

crawl1.nat.svl.searchme.com
crawl2.nat.svl.searchme.com

3.) Hmm. I wonder if the address-only hosts are sandboxes vis-a-vis the 1.0b and 1.0t versions? At this end, a quick skim suggests that the searchme.com host hits used 1.1 exclusively. (Haven't grepped for 0.05.)
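
For anyone wanting to do that skim by script instead, here is a minimal sketch that tallies Charlotte versions by whether the hit arrived with a searchme.com hostname or as a bare address. It assumes the log has already been reduced to (host or IP, user agent) pairs; the function names and that input shape are invented for the example.

```python
import re
from collections import Counter, defaultdict

# Matches version strings such as 0.05, 0.9t, 1.0b, 1.0t, 1.1.
VERSION_RE = re.compile(r"Charlotte/([0-9][0-9A-Za-z.]*)")


def charlotte_version(user_agent: str):
    """Return the Charlotte version in a UA string, or None if absent."""
    match = VERSION_RE.search(user_agent)
    return match.group(1) if match else None


def tally_versions(hits):
    """Count versions, split by hostname hits versus bare-address hits.

    hits: iterable of (host_or_ip, user_agent) pairs.
    """
    tally = defaultdict(Counter)
    for host, user_agent in hits:
        version = charlotte_version(user_agent)
        if version is None:
            continue  # not a Charlotte hit
        is_hostname = host.rstrip(".").lower().endswith(".searchme.com")
        tally["hostname" if is_hostname else "bare address"][version] += 1
    return {kind: dict(counts) for kind, counts in tally.items()}
```

Feeding it a real log would show at a glance whether the 1.0b and 1.0t versions only ever appear on address-only hits.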


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved