Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 82 message thread spans 3 pages; this is page 3.
000s of Truncated Page Requests from Many IPs
jomaxx




msg:400815
 8:26 pm on May 8, 2006 (gmt 0)

Okay, for a few weeks I have been inundated with thousands of requests in the form
"GET /directory/sample.ht"

- requests are truncated to 19 characters, thus in my case almost all are generating 404s
- both HTTP 1.0 and 1.1
- numerous different browser IDs
- numerous (thousands of) different IP addresses
- I have determined that Javascript is NOT enabled (thus not an AdSense attack, for example)

What is going on? Has anyone else seen this? Such vastly distributed spidering makes me think that the IP is being spoofed, or that these are zombie PCs -- but why?
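
One way to confirm this pattern in a log is to count the 404s whose path is a cut-off prefix of a page that really exists. The following is a minimal sketch, assuming Common/Combined Log Format lines; `KNOWN_PATHS` and both function names are hypothetical and stand in for a real list of the site's pages.

```python
import re
from collections import Counter

# Hypothetical list of real paths on the site; in practice this might come
# from a sitemap or a walk of the document root.
KNOWN_PATHS = ["/directory/sample.html", "/directory/other-page.html"]

# Matches the request and status fields of a Common/Combined Log Format line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+)(?: HTTP/[\d.]+)?" (\d{3})')


def is_truncated(path, known_paths=KNOWN_PATHS):
    """True if path is a proper prefix of a known path and not a known path itself."""
    if path in known_paths or path.endswith("/"):
        # Exact hits and directory requests are not treated as truncations.
        return False
    return any(known.startswith(path) for known in known_paths)


def count_truncated_404s(log_lines, known_paths=KNOWN_PATHS):
    """Tally 404 requests whose path looks like a cut-off version of a real page."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "404" and is_truncated(match.group(1), known_paths):
            hits[match.group(1)] += 1
    return hits
```

Feeding the access log to `count_truncated_404s` line by line gives a per-path tally, which makes it easier to see whether the cut always falls at the same length.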

 

Pfui




msg:400875
 3:41 am on Jul 5, 2006 (gmt 0)

1.) Not sure about everyone else but all of my pages are NOINDEX and have been for months and months. Alas, and despite METAs and robots.txt, there are still cached copies, on MSN, and on Amazon.com-owned Alexa. The latter shows msnscache.com-based SERPs, not Amazon-owned A9's -- which, curiously, is also showing "Web Results by Windows Live."

(In fact, MSN is still caching in full violation of everything: "This is a version of [URL] as it looked when our crawler examined the site on 6/18/2006." GRRRRRR)

Thing is, the cached pages I've seen, however properly or improperly obtained/cached, at least show the full URL atop the cached page in the link(s).

2.) Thus far, Viewpoint is the only SE found to be clearly and consistently showing truncated URLs in its SERPs. They show the correctly linked Title (in blue), and they also show an unlinked, truncated URL in green. Try it:

[search.viewpoint.com...]

Try a search for "blogspot.com" (because of the long URLs). Pick any result and then click SEARCH THIS SITE to really see the truncated, "green URLs."

3.) Two of my most often-truncated URLs appear in Viewpoint's initial results, each correctly linked and incorrectly truncated:

RIGHT: Viewpoint SERPs Title (linked; blue)
www.example.com/dirname1/dirname2/filename.html

WRONG: Viewpoint SERPs URL (unlinked; green)
www.example.com/dirname1/dirname2/filena

The latter is 40 characters on the nose.

And then when I click "SEARCH THIS SITE," there are 11 pages of results, the longest "green URLs" of which all truncate at 40. And there are a LOT of them because my site's name is 10 characters long.

4.) Okay. Okay. Seeing as how this partial/truncated URL mystery has been bugging me for months, I should at least shoot Viewpoint an e-mail. Will do!
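
The 40-character cut described in point 3 can be reproduced with a few lines. This is only a sketch of the observed behavior, not Viewpoint's code: the limit of 40 is the figure reported in this thread, and the function names are made up for illustration.

```python
DISPLAY_LIMIT = 40  # cut length reported in this thread, not a documented setting


def display_url(url, limit=DISPLAY_LIMIT):
    """Return the scheme-less URL cut to `limit` characters, like the green text."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return url[:limit]


def truncated_request_path(url, limit=DISPLAY_LIMIT):
    """Path that would be requested if a client used the truncated display text."""
    shown = display_url(url, limit)
    _host, slash, rest = shown.partition("/")
    return slash + rest


full = "http://www.example.com/dirname1/dirname2/filename.html"
print(display_url(full))             # www.example.com/dirname1/dirname2/filena
print(truncated_request_path(full))  # /dirname1/dirname2/filena
```

Because the host name counts toward the limit, the length of the truncated path in a log depends on the length of the site's name, which is worth keeping in mind when comparing logs from different sites.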

incrediBILL




msg:400876
 5:09 am on Jul 5, 2006 (gmt 0)

NOARCHIVE blocks cached copies, not NOINDEX, but NOINDEX should obviously have stopped the pages from being indexed.

Pfui




msg:400877
 5:18 am on Jul 5, 2006 (gmt 0)

Thanks. My goof. I've got <META NAME="ROBOTS" CONTENT="NOARCHIVE">, etc.

And when I really go nuts, in addition to robots.txt and .htaccess and traps --

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW, NOARCHIVE, NOSNIPPET, NONE">

: )
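
Since it is easy to mix up which directive does what, it can help to check what a page actually serves. Below is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and it only reads the robots META tag, not robots.txt or .htaccess.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of lowercase robots directives found in the page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<head><META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE"></head>'
print("noarchive" in robots_directives(page))  # True
```

A page that returns `noindex` but not `noarchive` asks engines not to list it while saying nothing about cached copies, which is the distinction made above.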

abates




msg:400878
 11:02 pm on Jul 5, 2006 (gmt 0)

Hmm, those look like the truncated URLs I've been getting in my logs all right. :P

Pfui




msg:400879
 11:31 pm on Jul 5, 2006 (gmt 0)

Welcome to the Mystery Club, abates!:)

I was just looking around Viewpoint [viewpoint.com]'s site for a toll-free number* and they're heavily promoting the "ONE Toolbar" (in conjunction with ONE.org):

[one.viewpoint.com...]

Back in mid-May, Key_Master mentioned the possibility of Viewpoint's toolbar being involved (message #11 [webmasterworld.com]), and bobothecat gave it a go and in message #13 reported that the results were accurate. But the ONE-'branded' version looks like it may be more feature-rich [one.viewpoint.com]?

Ergo, does anyone have Viewpoint's ONE toolbar installed? I'm a Mac person and it's only for Windows 2000 or XP and Internet Explorer 5.0+, so I can't check it out. If anyone's game, here you go:

[one.viewpoint.com...]

And here's more search-specific info:

[one.viewpoint.com...]

.
*I found an East Coast-based TF number for support, <snip>, so I'll give 'em a buzz tomorrow.

[edited by: volatilegx at 6:28 pm (utc) on July 6, 2006]
[edit reason] let's keep phone numbers out of this [/edit]

incrediBILL




msg:400880
 8:36 pm on Jul 7, 2006 (gmt 0)

FYI, here's an update in my efforts to help visitors get past the broken 404s to the pages they needed.

So far this month only 1,938 404s slipped thru the cracks vs. 53,026 302 redirects to the proper pages in the first 7 days alone.

I can't stop 'em, but it looks like I can almost beat 'em ;)
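
The post does not say how those redirects are produced, so the following is only one possible approach, sketched under the assumption that the server has a list of its real paths. A request that would otherwise 404 is redirected only when it is an unambiguous prefix of exactly one known page. The function names and the example paths are hypothetical.

```python
def resolve_truncated(path, known_paths):
    """Return the single known path that `path` is a proper prefix of, else None."""
    if path in known_paths or path.endswith("/"):
        return None
    matches = [known for known in known_paths if known.startswith(path)]
    # Redirect only when there is exactly one candidate; guessing between
    # several pages could send a visitor to the wrong one.
    return matches[0] if len(matches) == 1 else None


def handle_missing(path, known_paths):
    """Decide the response for a request that matched no file: (status, location)."""
    target = resolve_truncated(path, known_paths)
    if target is not None:
        return 302, target
    return 404, None


known = ["/dir/filename.html", "/dir/filter.html", "/dir/other.html"]
print(handle_missing("/dir/filena", known))  # (302, '/dir/filename.html')
print(handle_missing("/dir/fil", known))     # (404, None)
```

Requests that match more than one page still fall through to a 404, which is consistent with some errors slipping through the cracks.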

Pfui




msg:400881
 9:51 pm on Jul 7, 2006 (gmt 0)

A bit ago I spoke* with a nice Tech Support fellow at Viewpoint [viewpoint.com]. Seeing as how it was late Friday afternoon in New York City and this poor guy gets a call from some woman in Seattle going on about seeing "green URLs," he was actually uncommonly polite:)

He looked up a TLD I gave him and saw its truncated "green URLs" in Viewpoint's SERPs so he's going to check around and get back to me via phone or e-mail. I subsequently e-mailed URLs to two of this thread's messages -- the general recap [webmasterworld.com] (#43) and the Viewpoint-specific [webmasterworld.com] post (#61).

We shall see!

.
*If anyone needs it, the phone number I used appears on the download page linked above:)

.
P.S.
The two link-to-messages URLs didn't seem to work, so I made two TinyURLs but those got redirected to WW's front door so it's back to the to-message links. Whew!

Pfui




msg:3004678
 9:22 pm on Jul 12, 2006 (gmt 0)

I got a very nice callback from Viewpoint today. (Their Tech Support / Engineering people are nicer than most Customer Service people!)

Stay tuned...

:)

bobothecat




msg:3006210
 10:27 pm on Jul 13, 2006 (gmt 0)

I got a very nice callback from Viewpoint today. (Their Tech Support / Engineering people are nicer than most Customer Service people!)
Stay tuned...

Do let us know what the nice folks said... would be great if we could put an end to this problem... assuming it's theirs.

viewpoint




msg:3008144
 9:02 pm on Jul 14, 2006 (gmt 0)

Thank you so much for alerting Viewpoint to this issue.

We believe that we have identified a potential cause for the truncated requests that may be appearing in your error logs. We hope that this issue has now been resolved.

A visual search feature made available through some Viewpoint toolbars may have caused the reported anomalies. Following a recent redesign of Viewpoint Search, some of our toolbars began requesting truncated text URLs rather than complete site hyperlinks when generating visual site previews for web searches.

The problem was caused by an HTML coding error on Viewpoint’s search results page. At no time were any searchers ever directed to broken links on any site. URLs are truncated during the SERP rendering process, which only impacts the text appearing below search results, not the actual hyperlinks.

Viewpoint does not operate any search spiders.

Annie alerted us to this issue last Friday. Her comprehensive summary of reported anomalies as well as the detailed contributions of everyone on this thread helped us to identify a possible cause of this issue and resolve this problem within a few days. The fix took effect this afternoon.

Again, we sincerely regret the confusion and headaches that this may have caused for some of you over the past few weeks. Additionally, we cannot overstate our gratitude to all of you for helping us to detect and resolve this problem.

Best regards,

Rick, Product Manager
Viewpoint Corp

bobothecat




msg:3009334
 12:45 pm on Jul 15, 2006 (gmt 0)

At no time were any searchers ever directed to broken links on any site.

Hmm... that's not what my logs show. So unless all of these truncated URLs were from type-in traffic (which I doubt), that's either an incorrect statement, or we're back to square one.

thetrasher




msg:3009342
 1:04 pm on Jul 15, 2006 (gmt 0)

(...) some of our toolbars began requesting truncated text URLs (...) when generating visual site previews for web searches.

Am I the only one who has problems with automated requests using a browser UA? How can I recognize the Viewpoint toolbar and stop that preview bandwidth waste? Does it send any "preview" or "prefetch" header?

Liane




msg:3011041
 3:38 am on Jul 17, 2006 (gmt 0)

Well I'm still seeing truncated file requests in my logs, so these requests are still originating from somewhere!

Pfui




msg:3011178
 7:00 am on Jul 17, 2006 (gmt 0)

bobothecat, my understanding is that the toolbar was making thumbnail calls ("visual site previews"), and not real people per se. But I don't know what triggered those in the first place.

thetrasher, agreed. It would be nice if, like other toolbars, there was some sort of info in the UA string. And definitely some way to say, "No, thanks" to thumbnail requests, whether they originate with Viewpoint's toolbar, or one of Yahoo's gazillion crawlers, or are connected to Yahoo's SERPs (which Viewpoint uses).

Liane, I'm sorry to hear you're still plagued. Because I'm pleased to report that for the first time in almost 3 1/2 months, I have no truncated file errors after the 13th. None. Nada. Zip!

07/01: 02
07/02: 05
07/03: 14
07/04: 03
07/05: 07
07/06: 07
07/07: 18
07/08: 05
07/09: 09
07/10: 07
07/11: 06
07/12: 12
07/13: 14
07/14: -
07/15: -
07/16: -

Right on cue, the day Viewpoint's fix kicked in, the problem stopped.

If viewpoint, a.k.a. Rick, is experiencing the same no-mail-notification problems I am following the board's upgrade, he may not realize that there are responses. I'll touch base with him tomorrow and ask him to revisit us and our Qs. -Annie

incrediBILL




msg:3011789
 4:28 pm on Jul 17, 2006 (gmt 0)

Yup, that did stop some of it but probably won't stop all of it as I was getting similar page requests from a couple of obvious spiders, not toolbars.

Viewpoint does not operate any search spiders.

You don't have to operate a spider to be the source of the problem, as truncated results show on your search site. If scrapers are scraping your search results and incorrectly extracting those truncated addresses shown in green, then we'll still see some of this.

Looks good so far today, not the flood of requests that it was.

[edited by: incrediBILL at 4:29 pm (utc) on July 17, 2006]

Liane




msg:3012852
 9:37 am on Jul 18, 2006 (gmt 0)

Well, I just found one of the culprits!

What started out as a "possible" attempt to hijack one of my most important pages a couple of years back using a rather sneaky redirect ... has been changed and now lists the URL without the .html file extension.

I mean, what is the point?

If they weren't trying to hijack the page in the beginning when they used the redirect, what were they trying to accomplish by leaving off the .html extension?

Perhaps it was an honest mistake or perhaps it was done on purpose. I will likely never know. Now to find the rest of them!

Pfui




msg:3013299
 4:41 pm on Jul 18, 2006 (gmt 0)

1.) Liane, so what you saw, and caught (good going!:) was a suffix-cropped URL, and not the Viewpoint-related 40/47 character length crop?

2.) Also and FWIW, I can go to any page on any of my sites, lop off the ".html" and still get where I want to go. For example, I can go to --

http://example.com/dir1/file2

-- and actually end up at --

http://example.com/dir1/file2.html

-- but my location bar, and logs, show my error-free path as sans suffix.

Now as to why anyone/thing would intentionally lop off a suffix -- I dunno. But the same thing happens with .txt, .gif -- the URLs work but look like directories minus the trailing slash. I tested this on my own co-located server and on a shared one owned by a major ISP -- same thing.

I guess those webservers (Apache) complete the URL internally. Or are misconfigured. Or something. Because when I tried lopping the .html off of --

pages.ebay.com/sitemap.html

-- I got a 404.

So anyway, have you tried truncating any of your own pages on purpose? What happens when you lop off the ".html"?

[edited by: volatilegx at 2:00 pm (utc) on July 19, 2006]
[edit reason] delinked ebay link [/edit]

Liane




msg:3014294
 11:35 am on Jul 19, 2006 (gmt 0)

and not the Viewpoint-related 40/47 character length crop?

Correct.

have you tried truncating any of your own pages on purpose? What happens when you lop off the ".html"?

Yeah ... all pages resolve without the .html ... which I find very annoying. Just like I find it annoying that a page resolves if using the /index.html even though that extension isn't used on most sites.

Why do pages resolve properly when using anything other than the actual URL? Is there something I can do to prevent this?

volatilegx




msg:3014456
 2:02 pm on Jul 19, 2006 (gmt 0)

Why do pages resolve properly when using anything other than the actual URL? Is there something I can do to prevent this?

That would be a great topic for the Apache forum [webmasterworld.com].

Liane




msg:3015098
 10:28 pm on Jul 19, 2006 (gmt 0)

volatilegx please check your sticky mail.

I found another one tonight. It is a keyword-driven scraper site (with Google ads, of course) and has truncated every URL on the page. Now I'm just getting ticked off. I have to figure out a way to fix this so that my pages won't resolve unless the URL is used properly.

bobothecat




msg:3015136
 10:42 pm on Jul 19, 2006 (gmt 0)

Looks good so far today, not the flood of requests that it was.

Continues to look good from this end... the 'truncated requests' have dropped to zilch, and have stayed that way for the past few days.

One still has to wonder where all of these requests came from.

viewpoint




msg:3016579
 10:59 pm on Jul 20, 2006 (gmt 0)

Indeed, our current implementation is terribly inefficient. Thumbnail calls are initiated by user searches. We are taking steps to reduce the volume of visual search calls. In the near future, we hope to implement a thumbnail database whenever appropriate and possible. Unfortunately, there's no unique parameter in the UA string that you can use to reject these calls in the interim. Your concerns have been noted and we will resolve this issue to the best extent that we can in the near future. If anyone is still receiving truncated URL requests that are 40 characters in length, please let us know and we can investigate further.

Rick, Product Manager
Viewpoint Corp

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved