| This 82 message thread spans 3 pages: < < 82 ( 1  3 ) > > || |
|000s of Truncated Page Requests from Many IPs|
Okay, for a few weeks I have been inundated with thousands of requests in the form
- requests are truncated to 19 characters, thus in my case almost all are generating 404s
- both HTTP 1.0 and 1.1
- numerous different browser IDs
- numerous (thousands of) different IP addresses
What is going on? Has anyone else seen this? Such vastly distributed spidering makes me think that the IP is being spoofed, or that these are zombie PCs -- but why?
According to my logs, I am seeing a lot of different sources for truncated URLs. I think it makes more sense that maybe these source sites are using a common search engine that has recently put a limit on the number of characters it can handle, and depending on the using site, it might wind up cutting off characters in the destination address. That's what previous posts seem to be suggesting, if I understand them correctly. I wish I had time to track the sources down and test my theory, but I don't right now. I'm hoping someone will report some new researched information about this.
I've been following this thread as one of my sites is seeing this same "traffic". Here's my 2 cents on this. My site is an ASP driven site that "gasp" still uses query strings. So in the case of a URL that ends with an "&" it throws a 500 server error, not just a 404. I'm pretty sure this is a bot. When it first started it would throw these 500 errors almost as if it was in a loop. My log file was a mess! So, we wrote some code. All GET requests ending in an "&" get a 302 to 127.0.0.1 YES! OK, problem solved, Not. Then what I saw was the original malformed URL get the 302 then still continue to request pages with the correct URL, but the order of the pages requested was not humanly possible. No link from here to there. Bot. We wrote more code. Get the 302 and get your IP address added to the banned list automatically. Done. My theory is what we are seeing are infected computers with malware/spyware that is running through these machines and reporting home. BTW, I too saw 1 referrer from viewpoint and some IPs were to .edu and one was traceable back to a .com which sold insurance. I would like to know myself what the real story might be.
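The two-step rule described in this post (answer any GET whose query string ends in "&" with a 302 to 127.0.0.1, then ban the IP) could be sketched roughly as follows. The poster's site runs ASP and the actual code was not shown, so this Python version is only an illustration; `handle_get` and `banned_ips` are hypothetical names.

```python
# Hypothetical sketch of the rule described above; not the poster's ASP code.
banned_ips = set()

def handle_get(ip, path, query):
    """Return (status, redirect_location) for a GET request."""
    if ip in banned_ips:
        # Already caught once: refuse outright.
        return 403, None
    if query.endswith("&"):
        # A query string ending in "&" is treated as a truncated request:
        # send the client to localhost and remember its IP.
        banned_ips.add(ip)
        return 302, "http://127.0.0.1/"
    return 200, None
```

Banning on the first hit is aggressive. A real deployment would probably want an expiry on the ban list, since many of the IPs reported in this thread belong to large shared networks.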
I have a slight variation on this theme which is quite hysterical.
I run a Linux box which is CaSe SeNsItIve and I'm assuming the scraper is a Windoze person because every time they scrape I get tons of 404 errors.
They converted all my FileNames to filenames, that's right, everything comes back as lower case.
How dumb do you have to be to do that?
|They converted all my FileNames to filenames, that's right, everything comes back as lower case. |
Makes me think even more that this is some kind of bot-network... everything that's been witnessed and stated within this thread would certainly lead me to believe these are not actual people surfing.
I hoped we'd have a clue by now but shoot, no. And today the partials/whatevers hit with more vengeance than usual, still exactly as first detailed [webmasterworld.com] almost two months ago (oy!), still case sensitive, still 47-chars from http:// to the end of every truncated URL.
Actually, I thought I had a real, live one for a minute and set-up a special rewrite so that I could ask them about their wonky hits, but they got away.
Anyone have any new news?
Hey, Key_Master, you tease [webmasterworld.com]:) Anything you can report?
Um, has anyone tried contacting any of the source sites or IPs of these truncated page requests? Maybe they could shed some light on this. Or is that too naive of me to think they are just white-hats having honest problems?
I've not been seeing the rash of these that you folks have.
Actually, until last week I hadn't seen a solitary one. Guess it's possible a few came and I missed them.
I am seeing the 47-character chop off.
Last Wednesday (6/07) I had just three.
With the first being from the following IP:
The next two were different IP's entirely and all three had different UA's.
The only thing that these three have in common for me?
The first two had two successive page requests.
This thread inspired me to clean up some 404s coming from places that are obviously mucking things up, by adding redirects to the correct information. If I can't get rid of the 404s then I can at least give visitors, googlebot, etc. the location of what they should really be seeing as it's not their fault the bogus data exists.
My theory is perhaps the source can't be stopped but I can eliminate duplication of these errors by correcting anything else that encounters them and hopefully nip it in the bud before it turns into an epidemic.
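One way to redirect a truncated request to the page it was probably meant for is prefix matching: if exactly one real page starts with the requested path, redirect there, otherwise fall through to the normal 404. This is a minimal sketch under that assumption; `resolve_truncated` and the list of known paths are hypothetical and not taken from the poster's setup.

```python
def resolve_truncated(requested_path, known_paths):
    """Return the single known path that begins with requested_path, else None."""
    matches = [
        p for p in known_paths
        if p.startswith(requested_path) and p != requested_path
    ]
    # Only redirect when the match is unambiguous.
    return matches[0] if len(matches) == 1 else None
```

Refusing to guess when several pages share the same prefix keeps the redirect from sending visitors or crawlers to the wrong page.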
Starhugger, I don't think these visitors are necessarily black-hats -- but I also don't think they're real, either. (Details, details [webmasterworld.com]...) Which leads me to also think there's a common source for the widespread, common symptoms, and we've had some reports that Yahoo's SERPs, through Viewpoint, may be involved.
That looked really promising but seemed to dead-end as we ran out of additional data. Hmm... Maybe GaryK's Yahoo pal [webmasterworld.com] could check out this thread, too?
Oh, also, you're only naive when it comes to thinking the kinds of hosts we've been seeing -- huge governmental entities, universities, cablecos -- ever answer e-mail. Even when you're under attack by some jerk, it's the very rare large entity that responds to a complaint about a user with other than an e-mailed receipt. Unless it's Cc'd to legal@, that is:)
Don et al, a quick way I spot these is to grep my error logs for File, case-sensitive (as in the server note "File does not exist"). The results show more than just the partial URL problems, but those will stand out at a glance. I write the results to a file so I can have a copy handy. E.g.:
grep File /path/to/site/error_log > 2006_File_06-13-err.txt
Notes: Command-line access required. Not recommended for hugely trafficked sites.
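For anyone without command-line access, the same idea can be sketched in Python: pull out every "File does not exist" entry and count how often each missing path shows up. This assumes an Apache-style error log where the message reads "File does not exist: " followed by the path; `count_missing_files` is a hypothetical helper name.

```python
from collections import Counter

MARKER = "File does not exist: "

def count_missing_files(lines):
    """Count occurrences of each path reported as missing in the error log."""
    counts = Counter()
    for line in lines:
        _, found, rest = line.partition(MARKER)
        if not found:
            continue
        # Some Apache versions append ", referer: ..." after the path.
        path = rest.split(", referer: ")[0].strip()
        counts[path] += 1
    return counts
```

Sorting the result with `counts.most_common()` puts the most frequently requested truncated paths at the top.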
Now, tell me this is not a bot........
2006-06-16 02:27:10 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=2&t=& 302 0 0 246 16 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:10 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:17 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=2&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:17 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 16 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:24 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:24 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=2&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
2006-06-16 02:27:24 24.243.185.aaa - W3SVC181 edit- 80 GET /folder/pageA.asp a=3&t=& 302 0 0 301 0 HTTP/1.1 www.my.com Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)
Look at the rate per second for the get request.
This is since I blocked them.
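The request rate in an excerpt like the one above can be tallied with a few lines of Python. The sketch assumes the field order seen in those log lines (date, time, then client IP as the first three space-separated fields); `requests_per_second` is a hypothetical name.

```python
from collections import Counter

def requests_per_second(lines):
    """Count requests per (client IP, timestamp) from space-separated log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        date, time, ip = fields[0], fields[1], fields[2]
        counts[(ip, f"{date} {time}")] += 1
    return counts
```

Any IP showing several requests within the same second, repeated at regular intervals, is a reasonable candidate for automated traffic.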
vortech, no one ever said what we're seeing isn't a bot or the result of something automated. We just don't know what's going on. Yet.
However, what you're reporting is not what we're describing, sorry. You'll see the many dissimilarities when you read the original thread [webmasterworld.com].
Why don't you start a new thread in this same forum? That way, what you're describing will get fresh eyes and specific attention.
"Now, tell me this is not a bot.."
It's a play on words, of course it's a bot.
If you had read my message #32 in this thread you would see that I AM getting these same truncated requests that are being discussed in this thread. The correct URL request doesn't end with an "&" it's being truncated.
Looks like it's time for a recap of the common oddities we're still seeing and have been since the first week of April (4th; 5th):
The following occur ~100% of the time --
1.) The truncated URLs are 40 characters long, a.k.a. 47 characters from [www...] to the end. (URL-path lengths that come in at 51 still end up at 47 minus the www. prefix. Confused? Sorry! See Jim's info in the original thread [webmasterworld.com].) This means that sites with longer domain, directory and/or file names see the partial or truncated URLs the most. Sites with shorter names may not see them at all.
2.) The User-agent begins with "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1" and does not contain search-related info (see original thread). For example:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705)
3.) Attempts to contact the visitor Hosts/IPs requesting the truncated URLs are fruitless. (Efforts include a 302 redirect to a special, off-site page [webmasterworld.com] with contact info and an address .jpg; the graphic is retrieved, but the contact-me message unheeded.)
The following occur ~95-99% of the time --
4.) The truncated-URL file is the only file requested.
5.) The correct file is not looked for after the truncated-URL file is not found, regardless of server error code.
6.) There are no concurrent file requests of any kind, rapid-fire or otherwise.
7.) The exact same Host/IP (e.g.: www.abcdefg.com; ###.###.###.###) has never been to the site before, nor does it return.
8.) The Hosts/IPs are U.S.-based and big (telecoms, cablecos, governmental agencies), but not search entities per se.
9.) There are no referers for the initial file requests.
- The 'visitors' are not real people in real time. It's a bot/crawler/whatever. But...
- Original source/creator =?
- Commonality amongst visitors =?
- Commonality amongst sites visited =?
- Commonality amongst platforms visited =?
- Yahoo and/or Viewpoint et al SERPs-connected?
Searching Viewpoint [search.viewpoint.com] for typically long-URL'd blogspot.com-related pages [search.viewpoint.com] (mods: link goes to SERPs not blogs) shows correct links URLs (in blue) AND their truncated, unlinked URLs (in green). Thus could a bot/whatever be scraping an engine's SERPs?
Well, that's all I know. And don't:)
If you're seeing #1-3 ALL the time AND #4-9 ALMOST all the time, feel free to chime in with your specifics and observations! If you're seeing absolutely anything else, please start a new thread in this forum, thanks!
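The machine-checkable parts of the recap (item 1, item 2 and item 9) can be combined into a rough filter for log entries. This is only a sketch: it assumes the 47-character count includes the leading "http://", as reported earlier in the thread, and `looks_like_truncated_hit` is a hypothetical name.

```python
UA_PREFIX = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1"
TRUNCATED_LENGTH = 47  # counted from "http://" to the end of the URL (item 1)

def looks_like_truncated_hit(host, path, user_agent, referer):
    """Return True if a request matches items 1, 2 and 9 of the recap."""
    full_url = "http://" + host + path
    has_referer = bool(referer) and referer != "-"
    return (
        len(full_url) == TRUNCATED_LENGTH
        and user_agent.startswith(UA_PREFIX)
        and not has_referer
    )
```

A legitimate page whose full URL happens to be exactly 47 characters long would also match, so the filter is best applied only to requests that ended in a 404.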
It seems that the only static fact in this is that the truncated URL is 47 characters in length. This is what I'm seeing, which is why I posted. To limit the criteria to these other factors is not going to help you solve anything. What if they change UA? It seems logical to look for the anomaly rather than the norm and hope they slip up and you can nail it down.
I've already solved this problem as it pertains to my sites, so don't have a worry.
"If you're seeing #1-3 ALL the time AND #4-9 ALMOST all the time, feel free to chime in with your specifics and observations! If you're seeing absolutely anything else, please start a new thread in this forum, thanks!"
Sorry to post on YOUR thread, but until you become a moderator I'll post where I think it's appropriate, thanks!
New twists showing up on this theme make me think these junk URLs are mostly coming from some web pages.
Such as this:
"GET /reallylongpage ... withellipsis.html"
I'm also seeing Google Media Bot bombing out on pages with SGML/HTML errors so I'm wondering why the media bot is requesting things like this:
"GET /mypage.html%3Fpage%3D5" "Mediapartners-Google/2.1"
Are they getting the bad page formatted data from somewhere other than my site?
Bill, FWIW, all too often Googlebot looks for three to five different filenames on my main site with --
-- after the .html, a tip off to me that someone's got my stuff on their site. Never been able to find out who-what-where. At least not yet.
The truncated URL-seekers never ask for any file names other than too-short ones. Nothing's ever added, just lopped off.
Noteworthy, and a bit unsettling, is how many others are encountering the truncated URL thing elsewhere. A quick G for "truncated URLs [google.com]" resulted in a couple of representative threads from other sites, both circa April when we first started talking about this, too:
Truncated SEO Friendly URLs? [forums.oscommerce.com]
Yahoo created 404's (truncated URL's) [html.com]
|The truncated URL-seekers never ask for any file names other than too-short ones. Nothing's ever added, just lopped off. |
That's all I've seen here... nothing added, just the 'correct' URL snipped a few characters or so.
I've just seen this in our logs for the first time.
IP is a company in Trenton, Ontario, not seen in the site previously and with no referrer. One page was first requested successfully, complete with all graphics and stylesheet - URL is less than 40/47 characters. Next it attempted to get a truncated URL exactly 40/47 characters long. There is a link to that page on the first requested one, but I checked, and that link is complete and exactly as it should be. There was no referrer shown for the second page request either.
The interesting thing is, that when it received the 404 for the truncated request, there were no concurrent 304s for the stylesheet or sitewide graphics that are used on our people-friendly 404 page. It just vanished.
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
I found the source of my Google Mediabot funky file names!
This search engine is causing the problem:
I found referrals from them every time there was a mediabot 404 issue.
Someone clicks the listing that looks like this:
The browser doesn't fix up the path and sends it along "as-is" and VOILA! the mediabot indexes the same page name in real time. Now that I know exactly what is causing this problem and the scope of the bug I can even translate the botched path and redirect it to the proper location.
At least one of these 404 issues is behind me so now I have to write to them and tell them their search engine is busted.
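Translating a botched path back to the real one could look something like the sketch below. It assumes the botched form is the percent-encoded one quoted earlier in the thread ("/mypage.html%3Fpage%3D5"), where the "?" and "=" of the query string arrive encoded as part of the file name; the poster's actual fix was not shown, and `repair_encoded_path` is a hypothetical name.

```python
from urllib.parse import unquote

def repair_encoded_path(raw_path):
    """Return a redirect target if the path hides an encoded "?", else None."""
    if "%3F" not in raw_path.upper():
        return None  # nothing to repair
    # Decode the percent-escapes so "?page=5" becomes a real query string.
    return unquote(raw_path)
```

For the example quoted in the thread, `repair_encoded_path("/mypage.html%3Fpage%3D5")` gives `"/mypage.html?page=5"`, which can then be used as the redirect location. Note that `unquote` decodes every escape in the path, not only `%3F` and `%3D`.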
Good for you Bill!
... but what's this finding have to do with truncated page requests?
(which is what I thought this thread was about)
Thanks Bill, I'm getting these too. I'll be interested to hear what these guys have to say about this.
|... but what's this finding have to do with truncated page requests? |
They're both bogus page requests and I'd mentioned this along with the truncated pages in my logs earlier in the thread.
One down, one to go...
If you can solve the other one... I'll give you a gold star ;)
Back to truncated URLS, most if not all of mine appear to originate from where Pfui mentioned at [search.viewpoint.com...]
Let's do a little deductive reasoning on this problem:
1) The truncated URLs are displayed on Viewpoints search page and not hyperlinked
2) The full URL is actually hyperlinked correctly to the page title
3) A browser clicking on the actual link on the page wouldn't get a 404 from a truncated URL
4) A browser couldn't cut and paste the truncated URL and still have a referrer from that site
My theorem based on pondering the possibilities of the facts above:
The truncated links that refer to Viewpoint must therefore be created by an automated crawler. This crawler detects what appears to be a link (truncated) on the page in the text and also crawls this link. The crawler also sends the referrer of where it came from, in this case Viewpoint, as some of them do that.
Since Viewpoint only appears to be a search, not a directory, then some search crawler is scraping the results from a list of keyword searches performed on Viewpoint which is probably why you only see a certain group of truncated page names. All I'm seeing is the same bunch of page names truncated over and over, which were most likely the result from the same specific keywords being searched on Viewpoint.
Other instances of this same problem MAY BE, but not limited to, other search scrapers accessing Viewpoint or crawling an already damaged list of results on another scrapers site.
That's my theory to date.
<where's my gold star?>
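The theory above can be illustrated with a toy harvester that collects both real hyperlinks and anything in the visible text that merely looks like a URL. This is purely a sketch of the hypothesised behaviour, not a reconstruction of any actual crawler; the class name, the regular expression and the sample markup are all assumptions.

```python
import re
from html.parser import HTMLParser

# Naive pattern: any run of visible text starting with "www." is taken for a URL.
URL_TEXT = re.compile(r"\bwww\.[^\s<>\"]+")

class NaiveLinkHarvester(HTMLParser):
    """Collects href targets and URL-looking strings found in page text."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.append(value)

    def handle_data(self, data):
        # Displayed (possibly truncated) URLs get queued as if they were links.
        for match in URL_TEXT.findall(data):
            self.targets.append("http://" + match)

page = (
    '<a href="http://www.example.com/a-very-long-page-name.html">Title</a>'
    "<span>www.example.com/a-very-lon</span>"
)
harvester = NaiveLinkHarvester()
harvester.feed(page)
```

After feeding the sample page, `harvester.targets` holds the correct linked URL and the truncated displayed one, so a crawler built this way would request both and get a 404 on the second.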
Not so fast, starry-eyed BILL. You have to solve the mystery to get that shiny adornment, not just posit a theory:)
I agree Viewpoint definitely looks like the most likely 'enabling engine' right now, if only because we can literally see truncated URLs in its SERPs.
Btw, Key_Master was the first to note the connection [webmasterworld.com]. In a thousand or so truncated URL hits to date (about as many as jomaxx [webmasterworld.com] sees by lunch:) I've actually only found a single referer to Viewpoint. But that's one more than ALL the others.
Key_Master also noted a Yahoo-related connection, re font size, but I've not seen any connection to Yahoo, at least not overtly.
(He also said he'd "have answer to this puzzle sometime next week" -- on May 26. Drat. The suspense is killing me!)
There's an extension for Firefox which turns raw URLs on web pages into links if they aren't links already. That could explain how people could click on a truncated link.
However none of the truncated requests I've seen have had a referrer, and I'm sure that there were various different user agents and not just Firefox.
Haven't even seen Firefox involved in this on my server.
|if only because we can literally see truncated URLs in its SERPs. |
And referrers, which aren't possible without a crawler involved unless someone has the aforementioned Firefox plugin, and that doesn't explain the cases I cited, which had referrals from Viewpoint but no Firefox user agent.
|You have to solve the mystery to get that shiny adornment, not just posit a theory |
Funny, people have gotten Nobel prizes based on theories alone.
I'm gonna stick with my theory until it's overturned, just like those that believed the earth was flat and the sun revolved around the earth ;)
Just adding (before this thread locks:) that I'm still seeing the truncated URLs every day.
|Just adding (before this thread locks:) that I'm still seeing the truncated URLs every day. |
Same here, though it does seem to be reducing in numbers - at least for me.
Oh well... guess we'll really never know what the cause was. :(
OK ... this is just a shot in the dark, but does anyone think this "might" be caused by people accessing the cached page on a search engine instead of the real page?
I mean Google supplies all sorts of search engines, so the addresses could easily be from any number of different sources.
Shrug ... just thinking out loud.
1.) Not sure about everyone else but all of my pages are NOINDEX and have been for months and months. Alas, and despite METAs and robots.txt, there are still cached copies, on MSN, and on Amazon.com-owned Alexa. The latter shows msnscache.com-based SERPs, not Amazon-owned A9's -- which, curiously, is also showing "Web Results by Windows Live."
(In fact, MSN is still caching in full violation of everything: "This is a version of [URL] as it looked when our crawler examined the site on 6/18/2006." GRRRRRR)
Thing is, the cached pages I've seen, however properly or improperly obtained/cached, at least show the full URL atop the cached page in the link(s).
2.) Thus far, Viewpoint is the only SE found to be clearly and consistently showing truncated URLs in its SERPs. They show the correctly linked Title (in blue), and they also show an unlinked, truncated URL in green. Try it:
Try a search for "blogspot.com" (because of the long URLs). Pick any result and then click SEARCH THIS SITE to really see the truncated, "green URLs."
3.) Two of my most often-truncated URLs appear in Viewpoint's initial results, each correctly linked and incorrectly truncated:
RIGHT: Viewpoint SERPs Title (linked; blue)
WRONG: Viewpoint SERPs URL (unlinked; green)
The latter is 40 characters on the nose.
And then when I click "SEARCH THIS SITE," there are 11 pages of results, the longest "green URLs" of which all truncate at 40. And there are a LOT of them because my site's name is 10 characters long.
4.) Okay. Okay. Seeing as how this partial/truncated URL mystery has been bugging me for months, I should at least shoot Viewpoint an e-mail. Will do!