Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 82-message thread spans 3 pages; this is page 1.
000s of Truncated Page Requests from Many IPs

 8:26 pm on May 8, 2006 (gmt 0)

Okay, for a few weeks I have been inundated with thousands of requests in the form
"GET /directory/sample.ht"

- requests are truncated to 19 characters, thus in my case almost all are generating 404s
- both HTTP 1.0 and 1.1
- numerous different browser IDs
- numerous (thousands of) different IP addresses
- I have determined that Javascript is NOT enabled (thus not an AdSense attack, for example)

What is going on? Has anyone else seen this? Such vastly distributed spidering makes me think that the IP is being spoofed, or that these are zombie PCs -- but why?



 5:05 pm on May 9, 2006 (gmt 0)

FWIW, yesterday I had approximately 2,800 requests of this type from 1,400 unique IP addresses and 200+ unique browser identifiers.

Forgot to mention it yesterday, but none of these requests come with a referring page.

The list of IPs was extremely diverse, but I did a reverse IP lookup on the top 15. 4 of them resolved to universities/colleges around the US. The rest were US networks and it was unclear who the end user was. Actually one of the networks was non-US -- in Australia.
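
For anyone wanting to produce the same kind of tally from their own logs, here is a minimal Python sketch. It assumes an Apache "combined" log format and a truncated path of exactly 19 characters; the function name `summarize_partials` and the `path_length` default are illustrative, not something the posters above used.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_partials(lines, path_length=19):
    """Tally 404 hits with no referer whose path is exactly path_length long.

    path_length counts the leading slash and is an assumption; adjust it
    to whatever length the truncated requests have on your own site.
    """
    ips, agents = Counter(), Counter()
    hits = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        parts = m.group("request").split()
        if len(parts) < 2:
            continue  # malformed request line
        path = parts[1]
        if (m.group("status") == "404"
                and m.group("referer") in ("", "-")
                and len(path) == path_length):
            hits += 1
            ips[m.group("ip")] += 1
            agents[m.group("agent")] += 1
    return hits, ips, agents
```

Calling `summarize_partials(open("access.log"))` (the file name is a placeholder) returns the hit count plus per-IP and per-browser counters, so `len(ips)` and `len(agents)` give the unique totals, and `ips.most_common(15)` gives a short list for reverse lookups.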


 8:05 pm on May 9, 2006 (gmt 0)


Let us know if you figure it out.


 9:01 pm on May 9, 2006 (gmt 0)

Thanks. It looks like the exact same thing, down to the 40-character length of the full URL.

I have a longish domain name and directory/file naming standard, which is probably why I'm seeing almost all 404's. I bet there are tons of sites getting this same treatment that aren't seeing errors and thus have no idea how much traffic is coming from this one source, or that it's due to spidering at all.


 12:24 am on May 10, 2006 (gmt 0)

I've been waiting until I gathered more info about the "partials" before posting an update to the original thread --

Suddenly, scads of hits to "partial" filenames. [webmasterworld.com]
Okay, sleuths. Who -- or what -- goes there?

-- and now I can't because it's closed? Drat.

So I just asked "Receptional" to re-open it (they're the Moderator of the "Website Analytics - Tracking and Logging" Forum). Here's hoping they can/do so we can keep all of our observations and results in one place.


 7:18 pm on May 10, 2006 (gmt 0)

Seeing the same here... have tried for days to track the source down, with no luck thus far :(


 12:16 am on May 11, 2006 (gmt 0)

Receptional and engine fixed the original thread (thanks, gents!) so you'll find new info, my Top Ten and a maybe-musing therein. Sorry the entries are dry as dust but there ya go [webmasterworld.com]...


 12:23 am on May 11, 2006 (gmt 0)

Hey guys, I had the same problem. Traced the truncated urls to Yahoo serps. Not the urls themselves but the display text of the url found after the snippet.

My thought is this is a scraper too- possibly a distributed type from compromised computers.


 1:39 am on May 11, 2006 (gmt 0)

I just wanted to add, Yahoo switched to a larger url text during the last update but you can still find truncated url text in the serps at search.viewpoint.com


 2:26 am on May 11, 2006 (gmt 0)

Thanks for the pointer because I didn't see the problem in Yahoo's SERPs. I do in Viewpoint's. Dammit.

I've never heard of the company, never had a referrer from it by name (and testing on myself, it does show referers), never authorized any bot from there, have no clue where they get their data.

So how do I make sure that Viewpoint [viewpoint.com] never again darkens my doors? (And I would dearly love to send them a bill for the time I've spent dealing with the messes they've made, and are still making apparently.)

Interestingly, their site's IP (for .com and .net) places them squarely in IBM's NetRange: -

Hmm. "bluebird" and "blueice" have been around a lot lately...

Also, Viewpoint's DNS servers are SAVVIS and I've seen that name in my WHOIS checks too many times because of chronic HEAD-GET-HEAD-GET reqs, all sans graphics.

Anyway, and again, thank you. I'm glad to know who's/what's behind the partial URLs. Now to figure out how to get them to clean up their act on their end, and stay the heck away from mine.


 2:35 am on May 11, 2006 (gmt 0)

I meant to say Yahoo switched to a smaller url size so now the whole url is displayed.

Yahoo feeds viewpoint's serps. The IPs are from real people. The mystery is what causes these broken urls to be fetched.

Another possibility I'm looking into is that a toolbar is causing these errors in some way.


 2:59 am on May 11, 2006 (gmt 0)

You probably already know this but FWIW, Viewpoint's got one --

Viewpoint Toolbar FAQ [search.viewpoint.com]

(Of course, I'm still sticking to my 'mobile device' theory:)


 3:06 am on May 11, 2006 (gmt 0)

NT 5.1 is a Windows XP machine, which is a clue in this mystery.

That's why I'm leaning towards a toolbar as the cause of these problems. And yes, Viewpoint's toolbar requires XP or W2000.


 6:31 pm on May 11, 2006 (gmt 0)

I have a sneaking suspicion that these are not 'humans' visiting. I've even tried delivering 301's to the correct page, and in all cases... they never go further than the page requested. Seems a little odd that every single 'person' would follow the exact same pattern 100%.

Furthermore, the only thing that's truncated is the visible URL, while the actual link is correct and not 'chopped'.

Guess I'll install this toolbar on a machine just to see what happens.

I've also noticed Yahoo trying to spider these truncated URL's - argh! :(


 9:17 pm on May 12, 2006 (gmt 0)

After installing the toolbar, I get the same as if I used search.viewpoint.com... the only thing truncated is the visible URL, the actual link is correct and not 'chopped'.

The only thing I can think of outside of it being a bot, is that every person that goes to the 'bad' url is copying and pasting the display url - which could be possible, but still suspect.

Also of note... when using the toolbar for search, it does not give a referrer.


 4:53 am on May 13, 2006 (gmt 0)

> I've also noticed Yahoo trying to spider these truncated URL's - argh!

Have you searched Yahoo for the truncated URLs themselves? ...Might give a clue as to where the 'truncated URL source' is.

Musing on tactics, here....

I haven't seen this one, but if I did, I'd do something like this on my Apache hosts:

# Enable the rewrite engine (skip this line if it is already enabled above)
RewriteEngine On
# If requested URI is 18+1 characters long. Adjust this number to suit your site:
# (40 characters) - (length of your domain name) - 1 for the leading slash
RewriteCond %{REQUEST_URI} ^/.{18}$
# and if requested resource does not exist as a file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Then ... (use only one of the following rules; the RewriteConds above apply only
# to the first RewriteRule after them, so keep the alternatives commented out)
# Option 1: Return the usual 403-Forbidden response
RewriteRule . - [F]
# Option 2: Pass request to key_master's famous bad-bot IP banning script
#RewriteRule . /bad-bot.pl [L]
# Option 3: Pass request to a script that logs IP address, X_FORWARDED_FOR, and other headers
#RewriteRule . /log_x_forwarded_for.pl [L]
# Option 4: Pass request to access-denied subdir with 0-byte custom 403 error document
#RewriteRule . /path_to_denied_dir [L]

The last two rule options might need some explanation: I'm proposing to log the HTTP_X_FORWARDED_FOR header to see if the original requestor IP address is reported (I assume that these can't all be anonymous proxies in use here). Other HTTP headers could also be logged if present. This method won't provide any useful information if this is a bot-net in use, though.

I've used the last technique several times. I have a subdirectory called /pests. In that subdirectory, the .htaccess file contains code that defines a custom error document for 403 errors, and code that denies access to all files in that subdirectory -- except for that custom 403 error document. The 403 errordocument is actually the only file in that subdirectory, and it is a blank (0-byte) file.

So any request that gets rewritten to that subdirectory results in a 403-Forbidden response with a zero-byte content-body. This saves some bandwidth on servers that get attacked. It's nowhere near as good as blocking them at the firewall, but it's a useable option if firewall control is not available... such as on shared hosting.
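
The "adjust this number" arithmetic in the RewriteCond comment can be sketched in Python as follows. This assumes the display URL is clipped to 40 characters including the host name, as reported in this thread; the function names and the `display_limit` parameter are illustrative only.

```python
import re

DISPLAY_LIMIT = 40  # clipped display-URL length reported in this thread (assumption elsewhere)


def truncated_path_pattern(host, display_limit=DISPLAY_LIMIT):
    """Return a regex matching paths cut off when host + path is clipped."""
    path_length = display_limit - len(host)  # includes the leading slash
    if path_length < 2:
        raise ValueError("host name leaves no room for a truncated path")
    # One leading slash, then the remaining characters.
    return r"^/.{%d}$" % (path_length - 1)


def is_truncated_hit(path, host, display_limit=DISPLAY_LIMIT):
    """True if path has exactly the length a clipped display URL would give."""
    return re.match(truncated_path_pattern(host, display_limit), path) is not None
```

For a 21-character host name this yields `^/.{18}$`, the same pattern used in the RewriteCond above; a shorter or longer host name changes the number accordingly.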



 4:06 pm on May 13, 2006 (gmt 0)

It could also be a screen reader for the blind.

I know they're real IPs because I'm able to gather quite a bit of information from each hit. And sometimes, they come through on another search and leave a referrer.

Ever since Yahoo changed the way they display their serps, the number of truncated 404's has gone way down on my sites.


 8:23 pm on May 13, 2006 (gmt 0)

I posted a specific list of my Usual Suspects in the original thread [webmasterworld.com] (message #8). Of the 100 or so IPs and hosts I double-checked, all were legit.

And as mentioned, none ever gave up any referers. So a couple of weeks ago I started redirecting them from a custom error page on-site (call it Page A; named host) to a special page off-site (Page B; numeric host) for extra logging info and to say, 'Hey! E-me!' (or words to that effect:)

Here's where it gets interesting -- referers were NOT disabled at the visitor end: Almost all showed Page A referring to Page B. AND they all retrieved a graphic (e-address as .jpg).

Also, a handful have come back again after a few days, but only to the named site, not the numeric. And as with all first visits, no referers again. Then from Page A to Page B, referers and JPGs.

Go figure.

FWIW, I came this close to calling up one of the schools and talking to their CS department (I saw more than the usual number of cs-related .edu hosts) but the few I looked into didn't have toll-free numbers and I wasn't curious enough to troubleshoot their snafu on my dime. But I'm gettin' there!


 4:02 am on May 15, 2006 (gmt 0)

I am so relieved to see that others are having the same problems. This problem has been driving me nuts for weeks!

However, I'm also seeing other 404 errors that I haven't seen described here. I'm seeing weird compound addresses that are composed of real directories and filenames in my site, but which don't add up to a real address in my site. For example, say I have: "dir1/subdir1a/file1.htm" and "dir2/file2.htm". Both run off the home directory. I'll see requests for something like "dir1/dir2/file2.htm" or "dir1/dir1a/file1.htm/file2.htm" and other weird combinations like this.

Usually there is no referring source given in my stats (Awstats). However, sometimes I'll see a referring address that is another one of these compound addresses, or it might be a legitimate address in my site.

I suspect this compound stuff has the same cause or source as the truncated addresses, which I am also getting. But this is clearly more than just a truncation issue and looks much more deliberate.

I have 12 screens of this stuff in my 404 stats so far this month! I can't tell if I have any real 404 errors because all this garbage is in the way. I've wondered if it might be hack attempts, like someone trying to get a directory listing by deliberately requesting a path that doesn't exist.

I'll be very interested to watch what information comes out of this. Whether it's a legitimate error or someone's robot run amok or a virus making the rounds, it's a royal pain and I'll be very glad when it finally gets resolved.



 12:19 am on May 16, 2006 (gmt 0)

Finally. Finally! After five weeks, a partial hit with a referer, the only one I've seen. And it has a (more or less) direct connection to -- wait for it -- Viewpoint [viewpoint.com]!

HOST: Comcast (.hsd1.wa.comcast.net)
UA: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Full Filename: www.domainname.com/examples/FILENAME.abcde.html
40-c Filename: www.domainname.com/examples/FILENAME.abc

Date/Time       Page                      Status  Referer
05/14 10:42:19  /dir/FILENAME.abc         302     -
05/14 10:42:19  /404.html                 200     -
05/14 10:42:20  /dir2/email.jpg           200     /404.html
05/14 10:45:22  /dir/FILENAME.abcde.html  200     http: //search.viewpoint.com/pl/websearch?k=[see below]
05/14 10:45:32  /dir/FILENAME.abc         302     -
05/14 10:45:32  /404.html                 200     -
05/14 10:45:32  /dir2/email.jpg           304     /404.html

1.) First hit to truncated, 40-character file; no referer
2.) Rewritten to custom 404.html; no referer
3.) Retrieves custom 404.html graphic; referer
4.) 3 minutes later, hits full file; referer: search.viewpoint.com (ironically, a whopping 304 characters!)

Full Referer (breaks prevent side-scroll):

http:// search.viewpoint.com/pl/websearch?k=FILENAME&tn=0&type=rel35&

Try It: [tinyurl.com...]

5.) 10 seconds later, lather, rinse, repeat (#1, #2, #3, above):

6.) Second hit to truncated, 40-character file; no referer
7.) Rewritten to custom 404.html; no referer
8.) Retrieves custom 404.html graphic; referer
9.) Gone.

Beats heck outta me. HTH somebody!
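
The full-vs-40-character filename pair above suggests a way to map each partial request back to the page it was clipped from, for example to redirect it or just to label it in the logs. Here is a rough Python sketch, assuming the display URL is simply the host plus path cut at 40 characters; the function names are made up for illustration.

```python
def truncate_display_url(url, limit=40):
    """Clip a scheme-less display URL to limit characters."""
    return url[:limit]


def build_partial_map(host, paths, limit=40):
    """Map each truncated path to the full path it was clipped from.

    host  -- host name as shown in the display URL, e.g. "www.domainname.com"
    paths -- iterable of real paths on the site, each with a leading slash
    """
    partials = {}
    for path in paths:
        display = truncate_display_url(host + path, limit)
        clipped = display[len(host):]
        # Skip URLs short enough to be shown whole, and hosts longer than limit.
        if clipped and clipped != path:
            # If two pages share a clipped prefix, keep the first one seen.
            partials.setdefault(clipped, path)
    return partials
```

With the anonymized names from the log above, `build_partial_map("www.domainname.com", ["/examples/FILENAME.abcde.html"])` maps `/examples/FILENAME.abc` to the full path, so a 404 handler could look up the requested path in that dictionary before giving up.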


 9:46 pm on May 17, 2006 (gmt 0)

Guess the main question is... legitimate traffic or not?


 12:19 am on May 18, 2006 (gmt 0)

If by "legitimate traffic" you mean a real, live person at the other end, in real time, actually looking for what they're hitting --


Not real.

Over the course of ~100 hits from real people visiting in real time ending up at my custom 302 'e-me for help' page (usually due to temp blacklisting), the average follow-through rate is ~80%.

Of ~100 hits from "partials" to truncated URLs ending up at that same custom 302 page, the total follow-through rate is --


Zilch. Zero.

It's as if they're zombie machines running some sort of distributed search. And but for one single 'visitor' with a Viewpoint referer, I don't know where any of them come from, they never visit any other pages, and they rarely come back (and the few that do just repeat the same errors).

For the past six weeks I've waited and hoped -- nay, expected -- to hear from at least ONE partial person, then I could ask them about where they came from, if they had any recent installs, whatever.

I'm still waiting.

(And I'm still very much looking forward to someone solving this mystery!)


 7:48 pm on May 18, 2006 (gmt 0)

To make matters more interesting...

Just saw that someone/thing did a search on Yahoo (with the usual search string referrer)... came to the 'correct' page, even clicked/followed a link - then immediately requested the 'truncated' version of the originating page... with no referrer.

This is odd...


 3:42 am on May 19, 2006 (gmt 0)

bobothecat wrote:
"To make matters more interesting...
Just saw that someone/thing did a search on Yahoo (with the usual search string referrer)... came to the 'correct' page, even clicked/followed a link - then immediately requested the 'truncated' version of the originating page... with no referrer.

This is odd.."

I'm seeing more and more of those weird compound URLs, and the referring URL is often another compound URL or a real page in my site. This has got to be bot work. It seems the truncated stuff may be too.


 5:19 am on May 19, 2006 (gmt 0)

starhugger, I'm not sure about the "compound URLs" you've mentioned x2 but bot or not, you're describing something I'm not seeing in conjunction with the truncated or partial URLs mystery we've been trying to solve for weeks and weeks.

The clues just ain't the same.

Maybe if you started a new thread in this forum and provided more details from the get-go about what you've been/are seeing, your "compound URLs" problem will get specific troubleshooting attention, and you'll get a solution!


 4:52 pm on May 20, 2006 (gmt 0)

Thanks Pfui, I was starting to think the same thing. ;-)



 4:18 am on May 27, 2006 (gmt 0)

I'll have an answer to this puzzle sometime next week. Interestingly enough, my attorney, in preparation for a lawsuit against a copyright infringer, is occasionally triggering the same truncated page requests from one of my sites.


 5:53 am on Jun 3, 2006 (gmt 0)

Key_Master, the suspense is killing me! Pray tell, have you solved this ongoing mystery? (Was it the Butler in the Library with the Candlestick?)


 8:30 pm on Jun 5, 2006 (gmt 0)

Perhaps the Butler took care of Key_Master before he could say :)


 1:36 am on Jun 8, 2006 (gmt 0)

I found ONE source and I think it's either scrapers or users not willing to click on the hyperlink and trying to cut/paste the displayed URL under the snippet. The one I traced it back to was a search engine called ViewPoint, same as mentioned above.

The reason I mention scrapers is they like to harvest URI's from search engines to avoid being seen reading robots.txt when in stealth mode.

I see the following:

My Page Title <hyperlinked correctly>
.... snippet...
www.mysite.com/mypage.ht <incorrectly truncated>


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved