- requests are truncated to 19 characters, thus in my case almost all are generating 404s
- both HTTP 1.0 and 1.1
- numerous different browser IDs
- numerous (thousands of) different IP addresses
- I have determined that Javascript is NOT enabled (thus not an AdSense attack, for example)
What is going on? Has anyone else seen this? Such vastly distributed spidering makes me think that the IP is being spoofed, or that these are zombie PCs -- but why?
Forgot to mention it yesterday, but none of these requests come with a referring page.
The list of IPs was extremely diverse, but I did a reverse IP lookup on the top 15. Four of them resolved to universities/colleges around the US. The rest were US networks, and it was unclear who the end user was. Actually, one of the networks was non-US: it was in Australia.
I have a longish domain name and directory/file naming standard, which is probably why I'm seeing almost all 404's. I bet there are tons of sites getting this same treatment that aren't seeing errors and thus have no idea how much traffic is coming from this one source, or that it's due to spidering at all.
Suddenly, scads of hits to "partial" filenames. [webmasterworld.com]
Okay, sleuths. Who -- or what -- goes there?
-- and now I can't because it's closed? Drat.
So I just asked "Receptional" to re-open it (they're the Moderator of the "Website Analytics - Tracking and Logging" Forum). Here's hoping they can/do so we can keep all of our observations and results in one place.
I've never heard of the company, never had a referrer from it by name (and testing on myself, it does show referers), never authorized any bot from there, have no clue where they get their data.
So how do I make sure that Viewpoint [viewpoint.com] never again darkens my doors? (And I would dearly love to send them a bill for the time I've spent dealing with the messes they've made, and are still making apparently.)
Interestingly, their site's IP (for .com and .net) places them squarely in IBM's NetRange:
170.224.0.0 - 170.227.255.255
Hmm. "bluebird" and "blueice" have been around a lot lately...
Also, Viewpoint's DNS servers are SAVVIS and I've seen that name in my WHOIS checks too many times because of chronic HEAD-GET-HEAD-GET reqs, all sans graphics.
Anyway, and again, thank you. I'm glad to know who's/what's behind the partial URLs. Now to figure out how to get them to clean up their act on their end, and stay the heck away from mine.
Viewpoint Toolbar FAQ [search.viewpoint.com]
(Of course, I'm still sticking to my 'mobile device' theory:)
Furthermore, the only thing that's truncated is the visible URL, while the actual link is correct and not 'chopped'.
Guess I'll install this toolbar on a machine just to see what happens.
I've also noticed Yahoo trying to spider these truncated URL's - argh! :(
The only thing I can think of outside of it being a bot, is that every person that goes to the 'bad' url is copying and pasting the display url - which could be possible, but still suspect.
Also of note... when using the toolbar for search, it does not give a referrer.
Have you searched Yahoo for the truncated URLs themselves? ...Might give a clue as to where the 'truncated URL source' is.
Musing on tactics, here....
I haven't seen this one, but if I did, I'd do something like this on my Apache hosts:
# If requested URI is 18+1 characters long (leading slash plus 18 characters).
# [Adjust this number to suit your site:
# (40 characters) - (length of your domain name) - 1 for the leading slash]
RewriteCond %{REQUEST_URI} ^/.{18}$
# and If requested resource does not exist as a file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
#
# Then ... (use only one of the following rules)
#
# Return the usual 403-Forbidden response
RewriteRule . - [F]
#
# Pass request to key_master's famous bad-bot IP banning script
RewriteRule . /bad-bot.pl [L]
#
# Pass request to a script that logs IP address, X_FORWARDED_FOR, and other headers
RewriteRule . /log_x_forwarded_for.pl [L]
#
# Pass request to access-denied subdir with 0-byte custom 403 error document
RewriteRule . /path_to_denied_dir [L]
I've used the last technique several times. I have a subdirectory called /pests. In that subdirectory, the .htaccess file contains code that defines a custom error document for 403 errors, and code that denies access to all files in that subdirectory -- except for that custom 403 error document. The 403 errordocument is actually the only file in that subdirectory, and it is a blank (0-byte) file.
So any request that gets rewritten to that subdirectory results in a 403-Forbidden response with a zero-byte content-body. This saves some bandwidth on servers that get attacked. It's nowhere near as good as blocking them at the firewall, but it's a useable option if firewall control is not available... such as on shared hosting.
Jim
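For anyone wanting to try the /pests approach described above, here is a minimal sketch of what that subdirectory's .htaccess might contain. It is not Jim's actual file: the error-document name 403.html is an assumption for illustration, and the access directives use the older Order/Allow/Deny syntax (Apache 2.2 and earlier, or 2.4 with mod_access_compat loaded).

```apache
# /pests/.htaccess -- sketch only; "403.html" is a hypothetical file name.
# 403.html is a blank (0-byte) file and the only file in this directory.

# Serve the empty file as the body of every 403 response
ErrorDocument 403 /pests/403.html

# Deny access to everything in this subdirectory...
Order Deny,Allow
Deny from all

# ...except the error document itself, so Apache can actually serve it
<Files "403.html">
    Order Allow,Deny
    Allow from all
</Files>

# On Apache 2.4 without mod_access_compat, the rough equivalent would be
# "Require all denied" at the top level and "Require all granted"
# inside the <Files> section.
```

With this in place, any request rewritten into /pests is denied, and the 403 response carries the zero-byte document as its body. The <Files> exception matters: without it, the error document itself would be forbidden and Apache would fall back to its default 403 page.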
I know they're real IPs because I'm able to gather quite a bit of information from each hit. And sometimes, they come through on another search and leave a referrer.
Ever since Yahoo changed the way they display their serps, the number of truncated 404's has gone way down on my sites.
And as mentioned, none ever gave up any referers. So a couple of weeks ago I started redirecting them from a custom error page on-site (call it Page A; named host) to a special page off-site (Page B; numeric host) for extra logging info and to say, 'Hey! E-me!' (or words to that effect:)
Here's where it gets interesting -- referers were NOT disabled at the visitor end: Almost all showed Page A referring to Page B. AND they all retrieved a graphic (e-address as .jpg).
Also, a handful have come back again after a few days, but only to the named site, not the numeric. And as with all first visits, no referers again. Then from Page A to Page B, referers and JPGs.
Go figure.
FWIW, I came this close to calling up one of the schools and talking to their CS department (I saw more than the usual number of cs-related .edu hosts) but the few I looked into didn't have toll-free numbers and I wasn't curious enough to troubleshoot their snafu on my dime. But I'm gettin' there!
However, I'm also seeing other 404 errors that I haven't seen described here. I'm seeing weird compound addresses that are composed of real directories and filenames in my site, but which don't add up to a real address in my site. For example, say I have: "dir1/subdir1a/file1.htm" and "dir2/file2.htm". Both run off the home directory. I'll see requests for something like "dir1/dir2/file2.htm" or "dir1/dir1a/file1.htm/file2.htm" and other weird combinations like this.
Usually there is no referring source given in my stats (Awstats). However, sometimes I'll see a referring address that is another one of these compound addresses, or it might be a legitimate address in my site.
I suspect this compound stuff has the same cause or source as the truncated addresses, which I am also getting. But this is clearly more than just a truncation issue and looks much more deliberate.
I have 12 screens of this stuff in my 404 stats so far this month! I can't tell if I have any real 404 errors because all this garbage is in the way. I've wondered if it might be hack attempts, like someone trying to get a directory listing by deliberately requesting a path that doesn't exist.
I'll be very interested to watch what information comes out of this. Whether it's a legitimate error or someone's robot run amok or a virus making the rounds, it's a royal pain and I'll be very glad when it finally gets resolved.
Starhugger
HOST: Comcast (.hsd1.wa.comcast.net)
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
FYI:
Full Filename: www.domainname.com/examples/FILENAME.abcde.html
40-c Filename: www.domainname.com/examples/FILENAME.abc
LOG:
Date            Page                      Status  Referer
05/14 10:42:19  /dir/FILENAME.abc         302     -
05/14 10:42:19  /404.html                 200     -
05/14 10:42:20  /dir2/email.jpg           200     /404.html
05/14 10:45:22  /dir/FILENAME.abcde.html  200     http: //search.viewpoint.com/pl/websearch?k=[see below]
05/14 10:45:32  /dir/FILENAME.abc         302     -
05/14 10:45:32  /404.html                 200     -
05/14 10:45:32  /dir2/email.jpg           304     /404.html
KEY:
1.) First hit to truncated, 40-character file; no referer
2.) Rewritten to custom 404.html; no referer
3.) Retrieves custom 404.html graphic; referer
4.) 3 minutes later, hits full file; referer: search.viewpoint.com (ironically, a whopping 304 characters!)
Full Referer (breaks prevent side-scroll):
http:// search.viewpoint.com/pl/websearch?k=FILENAME&tn=0&type=rel35&
opt=web&iss=d%2Dwww%2Den%2Dus%5Fi%2D8Q1K5H2ECLJK8UJB%5Fs%2D5&fmt=&tab
=Web&vb=1&n=10&st=B&ps=10&xargs=12KPjg1tpSrIGmmvmnCOObHb%5F%2Dvj0Zlpi
3g5UzTYR6a9RL8nR2OdBELPDUmLF4WO5hm0aBnrYhyfZPHvTg4MsuJjaHUFGPW7Khh5nH
uc8OLYeQaoAUkrBYxsvZrg%2E%2E
Try It: [tinyurl.com...]
5.) 10 seconds later, lather, rinse, repeat (#1, #2, #3, above):
6.) Second hit to truncated, 40-character file; no referer
7.) Rewritten to custom 404.html; no referer
8.) Retrieves custom 404.html graphic; referer
9.) Gone.
Beats heck outta me. HTH somebody!
Nope.
Not real.
Over the course of ~100 hits from real people visiting in real time ending up at my custom 302 'e-me for help' page (usually due to temp blacklisting), the average follow-through rate is ~80%.
Of ~100 hits from "partials" to truncated URLs ending up at that same custom 302 page, the total follow-through rate is --
Nada.
Zilch. Zero.
It's as if they're zombie machines running some sort of distributed search. And but for one single 'visitor' with a Viewpoint referer, I don't know where any of them come from, they never visit any other pages, and they rarely come back (and the few that do just repeat the same errors).
For the past six weeks I've waited and hoped -- nay, expected -- to hear from at least ONE partial person, then I could ask them about where they came from, if they had any recent installs, whatever.
I'm still waiting.
(And I'm still very much looking forward to someone solving this mystery!)
Just saw that someone/thing did a search on Yahoo (with the usual search string referrer)... came to the 'correct' page, even clicked/followed a link - then immediately requested the 'truncated' version of the originating page... with no referrer.
This is odd...
I'm seeing more and more of those weird compound URLs, and the referring URL is often another compound URL or a real page in my site. This has got to be bot work. It seems the truncated stuff may be too.
The clues just ain't the same.
Maybe if you started a new thread in this forum and provided more details from the get-go about what you've been/are seeing, your "compound URLs" problem will get specific troubleshooting attention, and you'll get a solution!
The reason I mention scrapers is they like to harvest URI's from search engines to avoid being seen reading robots.txt when in stealth mode.
I see the following:
My Page Title <hyperlinked correctly>
.... snippet...
www.mysite.com/mypage.ht <incorrectly truncated>