- requests are truncated to 19 characters, thus in my case almost all are generating 404s
- both HTTP 1.0 and 1.1
- numerous different browser IDs
- numerous (thousands of) different IP addresses
- I have determined that JavaScript is NOT enabled (thus not an AdSense attack, for example)
What is going on? Has anyone else seen this? Such vastly distributed spidering makes me think that the IP is being spoofed, or that these are zombie PCs -- but why?
Starhugger
Vortech
I run a Linux box which is CaSe SeNsItIve and I'm assuming the scraper is a Windoze person because every time they scrape I get tons of 404 errors.
Why?
They converted all my FileNames to filenames, that's right, everything comes back as lower case.
How dumb do you have to be to do that?
"They converted all my FileNames to filenames, that's right, everything comes back as lower case."
Makes me think even more that this is some kind of bot-network... everything that's been witnessed and stated within this thread would certainly lead me to believe these are not actual people surfing.
Actually, I thought I had a real, live one for a minute and set up a special rewrite so that I could ask them about their wonky hits, but they got away.
AAAAARRRGH!
Anyone have any new news?
Hey, Key_Master, you tease [webmasterworld.com]:) Anything you can report?
Actually, until last week I hadn't seen a solitary one. Guess it's possible a few came and I missed them.
I am seeing the 47-character chop off.
Last Wednesday (6/07) I had just three.
With the first being from the following IP:
http ://www.warp9inc.com/
The next two were from entirely different IPs, and all three had different UAs.
The only thing that these three have in common for me?
The first two had two successive page requests.
Don
My theory is that the source perhaps can't be stopped, but I can keep these errors from being duplicated by correcting anything else that encounters them, and hopefully nip it in the bud before it turns into an epidemic.
That looked really promising but seemed to dead-end as we ran out of additional data. Hmm... Maybe GaryK's Yahoo pal [webmasterworld.com] could check out this thread, too?
Oh, also, you're only naive when it comes to thinking the kinds of hosts we've been seeing -- huge governmental entities, universities, cablecos -- ever answer e-mail. Even when you're under attack by some jerk, it's the very rare large entity that responds to a complaint about a user with other than an e-mailed receipt. Unless it's Cc'd to legal@, that is:)
.
P.S.
Don et al, a quick way I spot these is to grep my error logs for File, case-sensitive (as in the server note "File does not exist"). The results show more than just the partial URL problems, but those will stand out at a glance. I write the results to a file so I can have a copy handy. E.g.:
grep File /path/to/site/error_log > 2006_File_06-13-err.txt
Notes: Command-line access required. Not recommended for hugely trafficked sites.
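For anyone without grep handy, the same check can be sketched in Python. This is only a rough sketch, not the poster's method: it assumes Apache-style error-log lines that contain the phrase "File does not exist: " followed by the file path, and the function name missing_file_counts is made up for illustration. Counting each missing path makes the repeated partial URLs stand out.

```python
from collections import Counter

# Assumption: the server writes "File does not exist: <path>" for each 404.
MARKER = "File does not exist: "


def missing_file_counts(log_lines):
    """Count how often each missing file path appears in error-log lines."""
    counts = Counter()
    for line in log_lines:
        _, found, rest = line.partition(MARKER)
        if not found:
            continue  # not a "File does not exist" line
        # Some servers append ", referer: ..." after the path; drop it.
        path = rest.strip().split(", referer:")[0]
        counts[path] += 1
    return counts


# Example use (the log path is a placeholder):
# with open("/path/to/site/error_log") as log:
#     for path, hits in missing_file_counts(log).most_common(20):
#         print(hits, path)
```

As with the grep version, reading a whole log this way is not recommended for hugely trafficked sites.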
2006-06-16 02:27:10 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=2&t=& 302 0 0 246 16 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:10 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:17 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=2&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:17 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 16 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:24 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:24 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=2&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:24 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
Look at the rate per second of the GET requests.
This is since I blocked them.
Vortech
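The per-second rate in log lines like those can be tallied with a short script. This is a sketch under one assumption: the first three space-separated fields are date, time and client IP, as in the lines quoted above. The name hits_per_second is hypothetical.

```python
from collections import Counter


def hits_per_second(log_lines):
    """Count requests per (client IP, timestamp) pair, to one-second resolution."""
    counts = Counter()
    for line in log_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and "#Fields:"-style header lines
        fields = line.split()
        if len(fields) < 3:
            continue  # too short to hold date, time and client IP
        date, time, client_ip = fields[0], fields[1], fields[2]
        counts[(client_ip, f"{date} {time}")] += 1
    return counts
```

For the seven lines quoted above this gives two hits at 02:27:10, two at 02:27:17 and three at 02:27:24, all from the same address.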
However, what you're reporting is not what we're describing, sorry. You'll see the many dissimilarities when you read the original thread [webmasterworld.com].
Why don't you start a new thread in this same forum? That way, what you're describing will get fresh eyes and specific attention.
Vortech:
"Now, tell me this is not a bot.."
It's a play on words, of course it's a bot.
If you had read my message #32 in this thread you would see that I AM getting these same truncated requests that are being discussed in this thread. The correct URL request doesn't end with an "&"; it's being truncated.
Vortech
The following occur ~100% of the time --
1.) The truncated URLs are 40 characters long, a.k.a. 47 characters from [www...] to the end. (URL-path lengths that come in at 51 still end up at 47 minus the www. prefix. Confused? Sorry! See Jim's info in the original thread [webmasterworld.com].) This means that sites with longer domain, directory and/or file names see the partial or truncated URLs the most. Sites with shorter names may not see them at all.
2.) The User-agent begins with "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1" and does not contain search-related info (see original thread). For example:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705)
3.) Attempts to contact the visitor Hosts/IPs requesting the truncated URLs are fruitless. (Efforts include a 302 redirect to a special, off-site page [webmasterworld.com] with contact info and an address .jpg; the graphic is retrieved, but the contact-me message goes unheeded.)
The following occur ~95-99% of the time --
4.) The truncated-URL file is the only file requested.
5.) The correct file is not looked for after the truncated-URL file is not found, regardless of server error code.
6.) There are no concurrent file requests of any kind, rapid-fire or otherwise.
7.) The exact same Host/IP (e.g.: www.abcdefg.com; ###.###.###.###) has never been to the site before, nor does it return.
8.) The Hosts/IPs are U.S.-based and big (telecoms, cablecos, governmental agencies), but not search entities per se.
9.) There are no referers for the initial file requests.
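Items 1, 2 and 9 in the list above are simple enough to test per log entry. The sketch below is one possible reading, not a definitive filter: it assumes the 47 characters are counted as host name plus path with no http:// scheme, and it treats an empty or "-" referer field as "no referer". The function name and its parameters are made up for illustration, and the length is left adjustable because the thread counts it two ways (40 and 47).

```python
# Common start of the User-agents reported in item 2.
UA_PREFIX = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1"


def matches_truncation_pattern(host, path, user_agent, referer, total_length=47):
    """Return True when a hit fits items 1, 2 and 9 of the list."""
    # Assumption: length is measured over host + path, e.g. "www.example.com/dir/pa".
    right_length = len(host + path) == total_length
    right_agent = user_agent.startswith(UA_PREFIX)
    no_referer = referer in (None, "", "-")
    return right_length and right_agent and no_referer
```

Hits that pass all three checks would still need eyeballing against items 4-8, which depend on what else the same Host/IP requested.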
Conclusions/Confusions --
- The 'visitors' are not real people in real time. It's a bot/crawler/whatever. But...
- Original source/creator =?
- Commonality amongst visitors =?
- Commonality amongst sites visited =?
- Commonality amongst platforms visited =?
- Yahoo and/or Viewpoint et al SERPs-connected?
Searching Viewpoint [search.viewpoint.com] for typically long-URL'd blogspot.com-related pages [search.viewpoint.com] (mods: link goes to SERPs not blogs) shows correct link URLs (in blue) AND their truncated, unlinked URLs (in green). Thus could a bot/whatever be scraping an engine's SERPs?
Well, that's all I know. And don't:)
Got data?
If you're seeing #1-3 ALL the time AND #4-9 ALMOST all the time, feel free to chime in with your specifics and observations! If you're seeing absolutely anything else, please start a new thread in this forum, thanks!
I've already solved this problem as it pertains to my sites, so don't have a worry.
"Got data?
If you're seeing #1-3 ALL the time AND #4-9 ALMOST all the time, feel free to chime in with your specifics and observations! If you're seeing absolutely anything else, please start a new thread in this forum, thanks!"
Sorry to post on YOUR thread, but until you become a moderator I'll post where I think it's appropriate, thanks!
vortech
Such as this:
"GET /reallylongpage ... withellipsis.html"
I'm also seeing Google Media Bot bombing out on pages with SGML/HTML errors, so I'm wondering why the media bot is requesting things like this:
"GET /mypage.html%3Fpage%3D5" "Mediapartners-Google/2.1"
Are they getting the bad page formatted data from somewhere other than my site?
Very odd.
%3E%3Cimg%20src=
-- after the .html, a tip off to me that someone's got my stuff on their site. Never been able to find out who-what-where. At least not yet.
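Percent-decoding the two fragments quoted above shows what the requester was really holding. A minimal sketch using Python's standard urllib.parse.unquote; the wrapper name decode_request_path is hypothetical.

```python
from urllib.parse import unquote


def decode_request_path(raw_path):
    """Turn percent-escapes such as %3F and %3D back into literal characters."""
    return unquote(raw_path)


print(decode_request_path("/mypage.html%3Fpage%3D5"))  # /mypage.html?page=5
print(decode_request_path("%3E%3Cimg%20src="))         # ><img src=
```

So the first is a query string whose "?" and "=" were escaped into the file name, and the second is a scrap of HTML markup that ran on after the link.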
The truncated URL-seekers never ask for any file names other than too-short ones. Nothing's ever added, just lopped off.
Noteworthy, and a bit unsettling, is how many others are encountering the truncated URL thing elsewhere. A quick G for "truncated URLs [google.com]" resulted in a couple of representative threads from other sites, both circa April when we first started talking about this, too:
Truncated SEO Friendly URLs? [forums.oscommerce.com]
Yahoo created 404's (truncated URL's) [html.com]
"The truncated URL-seekers never ask for any file names other than too-short ones. Nothing's ever added, just lopped off."
That's all I've seen here... nothing added, just the 'correct' URL snipped a few characters or so.
IP is a company in Trenton, Ontario, not seen in the site previously and with no referrer. One page was first requested successfully, complete with all graphics and stylesheet - URL is less than 40/47 characters. Next it attempted to get a truncated URL exactly 40/47 characters long. There is a link to that page on the first requested one, but I checked, and that link is complete and exactly as it should be. There was no referrer shown for the second page request either.
The interesting thing is that when it received the 404 for the truncated request, there were no concurrent 304s for the stylesheet or sitewide graphics that are used on our people-friendly 404 page. It just vanished.
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
I found the source of my Google Mediabot funky file names!
This search engine is causing the problem:
[tiscali.co.uk...]
I found referrals from them every time there was a mediabot 404 issue.
Someone clicks the listing that looks like this:
[mydomain.com...]
The browser doesn't fix up the path and sends it along "as-is" and VOILA! the mediabot indexes the same page name in real time. Now that I know exactly what is causing this problem and the scope of the bug I can even translate the botched path and redirect it to the proper location.
At least one of these 404 issues is behind me so now I have to write to them and tell them their search engine is busted.
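Translating a botched path into a redirect target might look something like the sketch below. It is a guess at the idea, not the poster's actual code: it assumes the botched requests look like the earlier "/mypage.html%3Fpage%3D5" example, with the "?" escaped into the path, and the name redirect_target is made up.

```python
from urllib.parse import unquote, urlsplit


def redirect_target(raw_path):
    """Return the corrected location for a botched path, or None if it looks fine."""
    parts = urlsplit(raw_path)
    # A real query string, or no escaped "?" at all, means nothing to fix.
    if parts.query or "%3f" not in parts.path.lower():
        return None
    # Assumption: decoding the whole path restores the intended URL.
    return unquote(parts.path)
```

A server-side handler could send a 301 to the returned value whenever it is not None. Note that this decodes every escape in the path, so pages that legitimately contain escaped characters would need a narrower rule.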
... but what's this finding have to do with truncated page requests?
(which is what I thought this thread was about)
One down, one to go...
If you can solve the other one... I'll give you a gold star ;)
Let's do a little deductive reasoning on this problem:
1) The truncated URLs are displayed on Viewpoint's search page and not hyperlinked
2) The full URL is actually hyperlinked correctly to the page title
3) A browser clicking on the actual link on the page wouldn't get a 404 from a truncated URL
4) A browser couldn't cut and paste the truncated URL and still have a referrer from that site
My theorem based on pondering the possibilities of the facts above:
The truncated links that refer to Viewpoint must therefore be created by an automated crawler. This crawler detects what appears to be a link (truncated) on the page in the text and also crawls this link. The crawler also sends the referrer of where it came from, in this case Viewpoint, as some of them do that.
Since Viewpoint appears to be only a search engine, not a directory, some search crawler must be scraping the results from a list of keyword searches performed on Viewpoint, which is probably why you only see a certain group of truncated page names. All I'm seeing is the same bunch of page names truncated over and over, most likely the results of the same specific keywords being searched on Viewpoint.
Other instances of this same problem MAY BE caused by, but are not limited to, other search scrapers accessing Viewpoint or crawling an already damaged list of results on another scraper's site.
That's my theory to date.
<where's my gold star?>
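The theory is easy to picture with a toy crawler. The sketch below is purely illustrative: the HTML snippet, the example.com URLs and the class names are invented, and nothing here is known about how any real crawler works. It simply shows that a scraper which follows URL-looking text as well as real links would pick up both the full link and the truncated display copy.

```python
import re
from html.parser import HTMLParser

# Anything in visible text that starts with "www." is treated as a URL.
URL_IN_TEXT = re.compile(r"\bwww\.[^\s<>\"]+")


class LinkAndTextCollector(HTMLParser):
    """Collect href values from <a> tags and all visible text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def candidate_urls(html):
    """Return (real link targets, URL-looking strings found in the text)."""
    collector = LinkAndTextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.text_parts)
    return collector.hrefs, URL_IN_TEXT.findall(text)


# Hypothetical search-result markup: full href, truncated display URL.
SAMPLE = (
    '<a href="http://www.example.com/articles/a-really-long-page-name.html">'
    "Page title</a><br>"
    '<span class="url">www.example.com/articles/a-really-long-pa</span>'
)
```

Running candidate_urls(SAMPLE) yields the complete href plus the chopped "www.example.com/articles/a-really-long-pa", and a crawler that requested the second one would earn a 404.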
I agree Viewpoint definitely looks like the most likely 'enabling engine' right now, if only because we can literally see truncated URLs in its SERPs.
Btw, Key_Master was the first to note the connection [webmasterworld.com]. In a thousand or so truncated URL hits to date (about as many as jomaxx [webmasterworld.com] sees by lunch:) I've actually only found a single referer to Viewpoint. But that's one more than ALL the others.
Key_Master also noted a Yahoo-related connection, re font size, but I've not seen any connection to Yahoo, at least not overtly.
(He also said he'd "have answer to this puzzle sometime next week" -- on May 26. Drat. The suspense is killing me!)
However none of the truncated requests I've seen have had a referrer, and I'm sure that there were various different user agents and not just Firefox.
"...if only because we can literally see truncated URLs in its SERPs."
And referrers, which aren't possible without a crawler involved, unless someone has the aforementioned Firefox plugin which doesn't work in the cases I cited that had no Firefox agent but referrals from Viewpoint.
You have to solve the mystery to get that shiny adornment, not just posit a theory
Funny, people have gotten Nobel prizes based on theories alone.
I'm gonna stick with my theory until it's overturned, just like those that believed the earth was flat and the sun revolved around the earth ;)
Just adding (before this thread locks:) that I'm still seeing the truncated URLs every day.
Same here, though it does seem to be reducing in numbers - at least for me.
Oh well... guess we'll really never know what the cause was. :(
I mean Google supplies all sorts of search engines, so the addresses could easily be from any number of different sources.
Shrug ... just thinking out loud.