Charlotte [webmasterworld.com] is a search engine spider/bot that has a reputation for being very badly behaved.
What I am seeing is this:
Guest IP: 00.00.000.00 » Whois
Charlotte/0.05 Index page Sat Mar 07, 2009 5:48 am
I'll leave the IP off for now. You see, this looks different from what's discussed in the thread you referred me to.
Just a new face of the same crawler, maybe?
I see now that the version seems to have gone backwards. So maybe it's someone spoofing Charlotte. Or it could be an entirely different user agent.
Now that a mod has moved this to the right forum I'll bet you get some better input. :)
It is odd -- Charlotte/1.0b in '06 and Charlotte/0.05 in '09.
I'm going to ban the IPs and see what happens.
See if it comes back at me with new ones.
Anyway, can't be up to any good, right?
I can always unban the IPs if I find out it's a charity organization and all sweet stuff.
But I do appreciate the help on an identification. Nobody else seems to know anything about this.
It seems reasonable to ban a user agent/search engine you're not getting traffic from.
If it's bot info you want, you've come to the right place. You're surrounded by members of the just-named KBO, a.k.a. Keep the Bots (or Bums; or B*stards) Out. Welcome!
[edited by: incrediBILL at 10:46 pm (utc) on Mar. 7, 2009]
[edit reason] removed off topic links [/edit]
It's the crawler for searchme.com. It's been documented a few times, but people spoof it as well.
If you run a whois on the IP and it's not in a block owned by Kavam, it's probably a fake.
Searchme, Inc on IP range 22.214.171.124 - 126.96.36.199.
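For anyone who wants to script that whois comparison, here is a minimal Python sketch. The address ranges in it are placeholders taken from the reserved documentation blocks, not Searchme's real netblock, and the function names are made up for illustration. Plug in whatever first/last addresses your own whois lookup returns.

```python
import ipaddress

# Hypothetical netblocks for illustration only (RFC 5737 documentation ranges).
# Replace them with the first/last addresses your own whois lookup reports.
TRUSTED_RANGES = [
    ("192.0.2.0", "192.0.2.255"),
    ("198.51.100.16", "198.51.100.31"),
]


def build_networks(ranges):
    """Turn (first, last) address pairs, as whois prints them, into networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.ip_address(first), ipaddress.ip_address(last)
            )
        )
    return networks


def ip_in_ranges(ip, networks):
    """Return True if ip falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


NETWORKS = build_networks(TRUSTED_RANGES)
```

A visitor whose IP fails `ip_in_ranges(visitor_ip, NETWORKS)` while claiming the crawler's user agent would be a candidate spoof, though a range check alone says nothing about who is actually behind an address.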
It never seemed to manage proper headers so it gradually blocked itself.
What has me stumped is this:
Admittedly, I'm no expert, but I don't ever recall seeing a "user agent" farming a website using assorted ISPs. Off and on we had 8 of those on at the same time. That would either be one very sophisticated bot program, or very low tech manual work. No?
What concerns me is whether this was specifically targeted at only my site. If so, then I won't be getting feedback from anyone else having seen this happening on their site.
Why would somebody target me specifically? Good question. Maybe it's love.
But it seems odd, and when it comes to security "odd" worries me.
It's not odd at all.
Some scrapers and spam email harvesters use a wide range of user agents and rotate them randomly, trying to find one that gets past your security.
However, you could also be seeing a very new crawler that's too naive to know any better than to crawl through proxies all over the place. That technique is known as proxy hijacking, where you claim another site's content by redirection.
Is it possible for all nine IP addresses to be listed as "Known Proxy? No" or "Proxy: None detected"?
That's from two sites that I checked these addresses on. I'll go further, if needed, but I'd like to know if "proxy hijacking" can be accomplished from addresses that are not proxies?
I can provide my full list of the results from the two sites for each address, if the rules here allow it.
By the way, <incrediBILL>, I've been studying your posts from back in 2007 and since and some other sites, and the more I read the more confused I get about the actual definition of "proxy" in the expression "proxy hijacking" so please allow me to apologize for not yet getting the picture clear in my mind and asking what may seem to you to be stupid questions. I am trying.
Proxy hijacking is simple, let me draw you a picture.
Here's a crawl example:
Googlebot -> exampleproxy.com -> examplesite.com
Googlebot asks for a page via exampleproxy.com via some link like:
So Googlebot has now crawled a page from examplesite.com using exampleproxy.com to deliver that page, and now exampleproxy.com *MAY* be credited as the source of the page instead of its rightful owner, examplesite.com.
Does that make sense?
It doesn't happen often anymore, at least not that I'm seeing in Google or Live.
Let me offer an example from just a bit ago on my site, please.
Google [Bot] IP: 188.8.131.52 » Whois
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Reading topic in Politics Wed Mar 11, 2009 12:39 am
I am afraid I don't understand where the "exampleproxy.com" is located in that information above.
The IP you're seeing is a real Googlebot IP, not a problem.
Okay, I got that, but then I get back to my question about the nine IP addresses of Charlotte/0.05. They are also listed as not being proxies.
In fact, since we started keeping records a month or so ago I think my people have identified about 15 "user agents" that are listed on <honeypot> and <stopforumspam> as being problems.
I was just trying to match that list with the one you have as a sticky above and it seems we are doing something wrong. or something different.
But, back on-topic, it's that Charlotte bla.bla we're focusing on here. If the addresses aren't proxies, then is it still proxy hijacking? If not, what might we be seeing?
You got me. Even residential IPs can be used for a proxy site, which some of those appear to be, and residential IPs can also scrape. I'd have to watch the activity to figure it out; I just can't conjecture on a limited number of facts.
Well, I could give you a rather detailed picture of when the "something" showed up, how often, what it was viewing, etc. but it may still not help much. I haven't been into my cPanel yet to gather more info, but we banned all nine IPs and I was sort of waiting to see if it showed up with a new one, but it hasn't.
I was just wondering if anyone else had seen this so I could assure myself that it was not targeted only at my site.
I'm afraid I picked up a few "enemies" over the years while running another rather large site and that's why I wondered if somebody was up to more than just no good -- like real bad stuff.
I've got a post about this over on phpBB, so between here and there and keeping an eye out on some other known "we ID baddies" sites I'll see in a week or so if anyone else has seen this mystery gal named Charlotte. Hope I don't get in trouble before then.
My Moderation Team Leader informed me yesterday my site has been listed on some sort of blacklist, so I've got to figure that out next.
But like I indicated above, the more I seem to learn, the more questions seem to come up. It wouldn't be so bad if I didn't have brick-and-mortar work to deal with, as well.
But if you are really <incrediBILL> with super incredible powers, Bill, you could slow down the rotation of this planet and get us all a few more hours per day, right? Then you would be famous off the Net, as well.
|But if you are really <incrediBILL> with super incredible powers, Bill, you could slow down the rotation of this planet and get us all a few more hours per day, right? Then you would be famous off the Net, as well. |
LOL - My powers are the ability to work at home and avoid a corporate job which already gives me a few extra hours per day everyone else wastes getting ready for and going to/from that job, so technically I have a few more hours in my day.
About the crawler, I wouldn't be too concerned. I get literally hundreds of things like this daily, and if I stopped to worry about them all I'd never get anything done whatsoever. So I have firewalled my site to the best of my ability, and if anything does get through, they're more determined than I am at that point.
FWIW, there is a very big internet "black market" that makes money leeching the content of others so if you're being targeted it means you're successful and probably a leader in your field as these people don't waste their time scraping losers.
|FWIW, there is a very big internet "black market" that makes money leeching the content of others so if you're being targeted it means you're successful and probably a leader in your field as these people don't waste their time scraping losers. |
In that case, there are some really stupid people out there, because I doubt my site is the leading anything. Down the road, maybe, but not now.
Anyway, I appreciate you taking the time to help me out here.
Most days now I get a raft of IPs within a few minutes, every IP different (or perhaps two or three hits per IP) all aimed at a single site, and then nothing for a while. Next day-ish they're aimed at a different site but with a similar pattern. One time the UA may be MSIE, another time Opera. For the most part they can be detected by unusual header parameters.
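To give a rough idea of what a header check could look like, here is a minimal Python sketch. It covers only one kind of "unusual": a request that claims to be a mainstream browser but leaves out headers such browsers normally send. The header list, the browser markers, and the function names are all assumptions for illustration, not the actual rules I use.

```python
# Hypothetical rule set: headers that mainstream browsers normally send.
# Which headers matter for a given site is an assumption to tune locally.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")
BROWSER_MARKERS = ("msie", "opera", "firefox")


def claims_browser(user_agent):
    """Return True if the UA string names one of the listed browsers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BROWSER_MARKERS)


def missing_browser_headers(headers):
    """List expected headers absent from a request that claims to be a browser.

    Requests that do not claim to be a browser get an empty list, because
    this sketch has no expectations about them.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    if not claims_browser(normalized.get("user-agent", "")):
        return []
    return [h for h in EXPECTED_BROWSER_HEADERS if not normalized.get(h)]
```

A non-empty result is only a hint. Some privacy tools and corporate proxies strip headers from real browsers too, so it is safer to score such requests than to block on this alone.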
I am fairly certain these are coming from botnets. Most are from domestic IP blocks, some of which are running a primitive server of some kind, suggesting a proxy has been installed or (more likely?) someone was careless installing their OS.
Purpose: No idea. Possibly scraping content for use in hyping their trojanned sites, but at least several of the targeted sites use querystrings, so possibly the bots are looking for vulnerable servers.
By leading, I mean showing up in the top 10 SERPs somewhere.
If you hit the top 10, you're a target.
Both the above UA and this one, just plain Charlotte, visited me last week. Both from the same IP Address so I know they're related. It was a Verizon Internet Services IP range.
[edited by: GaryK at 6:19 pm (utc) on Mar. 15, 2009]
The real Charlotte/Searchme.com crawler has the following characteristics:
Full reverse DNS that identifies the crawler like the big SEs:
The following UA:
"Mozilla/5.0 (compatible; Charlotte/1.1; [searchme.com ])"
Operating from the following location:
Anything else should be treated as a spoof, fake, whatever you want to call it.
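The reverse DNS and UA checks above can be sketched in Python roughly like this. It is a forward-confirmed reverse DNS test, the same idea used to verify the big search engine crawlers. The `.searchme.com` hostname suffix, the version pattern, and the function names are assumptions made for illustration, so confirm them against what you actually see in your own logs.

```python
import re
import socket

# Assumptions for illustration: the genuine crawler's hostnames end in
# ".searchme.com", and its UA carries a "Charlotte/<version>" token.
EXPECTED_SUFFIX = ".searchme.com"
CHARLOTTE_TOKEN = re.compile(r"Charlotte/(\d+\.\d+[a-z]?)")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of IPv4 addresses that hostname resolves to."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except OSError:
        return set()


def looks_like_real_charlotte(ip, user_agent,
                              reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check plus a UA token check."""
    if not CHARLOTTE_TOKEN.search(user_agent):
        return False
    hostname = reverse(ip)
    if not hostname or not hostname.lower().endswith(EXPECTED_SUFFIX):
        return False
    # The hostname must resolve back to the same IP; a PTR record alone
    # can be set to anything by whoever controls the address block.
    return ip in forward(hostname)
```

The two lookup functions are passed in as parameters so the logic can be tried offline with stand-ins before it is pointed at live DNS.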
[edited by: incrediBILL at 11:30 pm (utc) on Mar. 15, 2009]
Variations on the Will-the-real-Charlotte-please-stand-up? theme -- one repeat + two more IPs connected to KAVAM/Searchme. (Note also three bot versions.):
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:184.108.40.206) Gecko/20080109 (Charlotte/0.9t; [searchme.com ])
Mozilla/5.0 (compatible; Charlotte/1.0t; [searchme.com ])
OrgName: Searchme, Inc.
Address: 800 W El Camino Real, Suite 100, Mountain View, CA 94040
NetRange: 220.127.116.11 - 18.104.22.168
Mozilla/5.0 (compatible; Charlotte/1.0b; [searchme.com...]
Mozilla/5.0 (compatible; Charlotte/1.0t; [searchme.com...]
OrgName: Abovenet Communications, Inc
Address: 1735 Lundy Ave, San Jose, CA 95131
NetRange: 22.214.171.124 - 126.96.36.199
[edited by: incrediBILL at 11:32 pm (utc) on Mar. 15, 2009]
[edit reason] fixed urls [/edit]
Just because one IP doesn't read the robots.txt doesn't mean another IP didn't read it. A crawler only needs to read it once, regardless of which IP it's crawling from.
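To illustrate the idea, here is a minimal Python sketch of a crawler-side cache that fetches robots.txt once per host and shares the parsed rules with every worker, whatever IP each worker crawls from. This is not Searchme's actual code; the `RobotsCache` class and its `fetch` callable are hypothetical.

```python
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Fetch robots.txt once per host and share the rules between workers."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable: host -> robots.txt text.
        self._fetch = fetch
        self._rules = {}

    def allowed(self, host, user_agent, path):
        """Check path against the cached rules, fetching them on first use."""
        if host not in self._rules:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._rules[host] = parser
        return self._rules[host].can_fetch(user_agent, path)
```

From the site owner's side that looks like a single robots.txt request from one IP, followed by page requests from others that never ask for it.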
I'm wondering why some of the IPs you referenced aren't set up for reverse DNS yet?
Are those older IPs no longer in use? All the current ones hitting my server provide reverse DNS, as I showed above.
1.) The hits I get from Charlotte (from all sources) are infrequent enough that I doubt my robots.txt is cached and that's why it's unrequested. And ignored.
2.) I can't really discuss the reverse DNS thing, sorry -- I'm a Web geek, not a Network geek:)
But I can tell you that on 02-16-09, the second IP I mentioned, 188.8.131.52, hit as a bare address. No hostname. The other address-only hits date back to 09-08 and I got their what-where data from WHO*S today. In between, I've seen host hits akin to your mention:
3.) Hmm. I wonder if the address-only hosts are sandboxes vis-a-vis the 1.0b and 1.0t versions? At this end, a quick skim suggests that the searchme.com host hits used 1.1 exclusively. (Haven't grepped for 0.05.)
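For the grepping part, a short Python sketch like the one below could tally which Charlotte versions came from which IPs. It assumes the access log is in the common "combined" format, with the user agent as the last quoted field. The function name and the version pattern are my own guesses, so adjust them to your logs.

```python
import re
from collections import defaultdict

# Assumes Apache-style "combined" log lines, where the user agent is the
# last quoted field; adjust the pattern for other log formats.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .* "(?P<ua>[^"]*)"$')
CHARLOTTE_VERSION = re.compile(r"Charlotte/(\d+\.\d+[a-z]?)")


def tally_charlotte(lines):
    """Map each Charlotte version string to the set of IPs that sent it."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        version = CHARLOTTE_VERSION.search(match.group("ua"))
        if version:
            seen[version.group(1)].add(match.group("ip"))
    return dict(seen)
```

Calling `tally_charlotte(open("access.log"))` (the file name is a placeholder) would give one entry per version, which makes it easy to see whether 0.05, 1.0b, 1.0t and 1.1 ever share an address.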