Forum Moderators: Robert Charlton & goodroi


.NET Challenges - Duplicate URLs Getting Indexed via Forms and More

Whitey

2:08 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We're trying to head off a duplicate content problem occurring and ran into some trouble mentioned on a thread over here: [webmasterworld.com...]

I didn't think Google followed form posts - but it looks like it may.

Does anyone have experience with this?

tedster

2:47 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you asking if Google will try to spider a url that occurs only in a form element's action attribute? Yes, they will.

To discover urls, Google will try to spider any character string they find that looks like it might be a url - no matter where it appears, whether that's a form's action, a string in a javascript file, wherever. And if that url resolves with a 200 OK, then it may well end up in the index.
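
As a hypothetical illustration, consider a search form like this:

```html
<!-- the action url below exists as a plain character string in the
     page source, so a crawler can extract and fetch it on its own,
     even though it was only ever meant as a POST target -->
<form method="post" action="/search/default.aspx">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
```

If /search/default.aspx answers a plain GET request with a 200 OK, it can end up indexed as a page in its own right.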

Whitey

3:08 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that seems to be it exactly, and it does seem to be Google-specific as it's not occurring anywhere else yet. The first site was launched 16 Dec 07.

No default.aspx pages exist according to Xenu [ which is often used as a way to check and fix ], but of course they do exist according to Google.

Does anyone know how to fix this issue easily? I've not had much luck on the above thread, but thought some of our duplicate content experts might have an idea if they are familiar with .NET.

We used to use a Perl / Apache environment but have had to move to .NET. The problem is that the pages should end with "/", but they have now received duplicate indexing under /default.aspx as well.

[edited by: Whitey at 3:11 am (utc) on Jan. 19, 2008]

tedster

4:59 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The best way I know to "tame" the many IIS challenges is to use a third party utility called ISAPIrewrite. It's not free, but without it even simple fixes can be crazily problematic. And with it, your tech team can execute many of the fixes recommended here that otherwise seem to be only for Apache users.

In essence, ISAPIrewrite gives you something very much like the Apache context you are familiar with. With it you can do 301 redirects from default.aspx to the domain or folder root without creating nasty loops, enforce a single canonical case for your urls, and establish all manner of canonical fixes. Any tech who is familiar with regex can create the needed rules.

ISAPIrewrite is a whole lot more flexible than using the native .NET rewrite feature. Most of my .NET clients have installed it, and are extremely happy that they did.
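
To give a concrete idea, the default.aspx redirect might look something like this in ISAPI_Rewrite 2.0's httpd.ini - a hypothetical sketch, so substitute your own host handling and test before deploying:

```
[ISAPI_Rewrite]

# 301 any request ending in /default.aspx back to its folder root;
# [I] matches case-insensitively, [RP] makes the redirect permanent
RewriteCond Host: (.*)
RewriteRule (.*)/default\.aspx http\://$1$2/ [I,RP]
```

The Host: condition captures the domain so the same rule works unchanged across multiple sites on the server.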

Also, I'd suggest you move those form action attributes to different urls - don't let them stay pointed to a default.aspx url. Then the previous url will return a 404 and it will no longer have the potential to create problems in the Google index.

There was a good thread last year that featured many tips for people wrestling with .NET and IIS - especially those whose past was confined to Apache:

[webmasterworld.com...]

Whitey

6:30 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What's the best way to identify any other erroneous URLs?

As I said, we did an audit with Xenu and it didn't pick up these URLs. The site: tool is unreliable, and I didn't receive a duplicate content notification in Google Sitemaps.

However, I must point out that I did become aware of it via the site: tool when checking the filtered results. I just wasn't sure if there's a cleaner method.

btw - I just got the heads up on Xenu: it doesn't follow forms. They said it's a Google / server issue.

[edited by: Whitey at 6:37 am (utc) on Jan. 19, 2008]

tedster

6:56 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You cannot simply search your content or run some tools and uncover all the potential duplicate url errors. So I would suggest you read through all the duplicate threads in the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page. Become thoroughly versed in each kind of problem - canonical, query string issues, capitalization (watch out for this on IIS), https versus http -- check every single potential for a duplicate problem that our members have uncovered over the years.

Then rigorously test your server for each situation by playing with different url patterns in your browser's address bar. Kick the tires, in other words, and do it over and over. Pretend you are a hostile competitor who is looking for ways to sink your site by linking to duplicate urls.

Use the Firefox HTTP Headers extension to ensure that the status codes returned are accurate in each situation. It can be grueling, but it is well worth it. I always test every new site like this, and I always verify every change in a rewriting scheme as well.
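
As a concrete illustration, here is the kind of exchange you want to see when you request a duplicate-style url on a correctly configured server (hypothetical example.com paths):

```http
GET /widgets/Default.aspx HTTP/1.1
Host: www.example.com

HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/widgets/
```

A 200 OK on that request instead of a 301 means the duplicate is live and indexable.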

And I've never found a deployment that was correct in all areas as it was first configured. For this reason, it's a very good idea to have a test server so you can do the testing before deploying the new site on the web.

Whitey

8:43 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



capitalization (watch out for this on IIS)

Is this specific to IIS?

Jeepers creepers ...... I thought we'd done this with URLs correctly resolving back, and what did I find:

[mysite.com...] doesn't resolve to [mysite.com...] note "/"
[mysite.com...] should go to "/"
[mysite.com...] doesn't resolve to ditto

and so on.

Thanks for the reminder. We left the gate open!

Pretend you are a hostile competitor who is looking for ways to sink your site by linking to duplicate urls

Surely this leaves everyone open to abuse? How do you police this - just keep looking at the site: tool and checking for filtered results?

[edited by: Whitey at 8:51 am (utc) on Jan. 19, 2008]

tedster

8:53 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You've got all the tools right here, just be vigilant. And definitely watch your server logs as well as your Google Tools. Lock it down tight and maintain that condition. Yes, anyone can be vulnerable to either accidental or intentional "attacks". Protect yourself before they happen.

pageoneresults

9:15 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We used to use a Perl / Apache environment but have had to move to .NET. The problem is that the pages should end with "/", but they have now received duplicate indexing under /default.aspx as well.

Ah, I would probably have cautioned you against that move. Windows is not for everyone. A .NET site "out of the box" is going to fail miserably if certain things are not addressed - I know, I've experienced it. I started working with .NET a couple of years ago and have learned most of its flaws, and there are quite a few of them.

Let me give you a short list of things you'll be contending with out of the box...

  • VIEWSTATE
  • /default.aspx
  • Case Issues
  • https Issues
  • Postbacks

Those are just a "few" on my list. Almost everything can be handled using ISAPI_Rewrite. I say almost everything because there may be a few things to do within your .cs files to achieve the ultimate desired result.

VIEWSTATE is going to be your first project. Out of the box, most .NET developers are going to turn on all the controls. This makes for a very "heavy" VIEWSTATE which is sitting right after the <body> element on every page. And, depending on what is happening on that page, that damn VIEWSTATE can be over 3,000 characters.

/default.aspx is always going to be there unless you do a rewrite. The .NET application will "always" append that damn default.aspx, always!

Case issues have plagued Windows folks since the beginning. If you are using ISAPI_Rewrite, the issue is resolved.

https vs http issues are always present. With ISAPI_Rewrite, that issue is resolved too.

Postbacks are an SEO's Kiss of Death. Anything in a Postback is invisible to the bots, anything. When implementing a rewrite, most of those Postbacks are going to be changed to plain hrefs. Some will remain as they are a "natural" way to prevent duplication believe it or not. :)

There will be many other things you'll run into. You need to be sure that you have the right people behind the helm (Windows Server Administrator) who knows how to use ISAPI_Rewrite and can override the default .NET stuff. If you don't, I can tell you from experience that you may be chasing your tail for quite some time.

Whitey

11:30 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can you be specific here :

And definitely watch your server logs as well as your Google Tools

I'm just trying to clarify :

Logs: I figure that if you see erroneous URLs in your logs - say, mysite.com/correct-capitals/Default.aspx when it should be mysite.com/Correct-Capitals/ - there could be a problem.

Do you have any other tips? [ I thought I knew enough on this subject - now I realise I don't! ]

Google Tools: Did you mean the site: tool, or the tools in Google Sitemaps? [ I didn't find the duplicate content report picking up anything of this nature ]

[edited by: Whitey at 11:35 am (utc) on Jan. 19, 2008]

pageoneresults

11:42 am on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



example.com/correct-capitals/Default.aspx when it should be example.com/Correct-Capitals/ there could be a problem?

Yes. Windows is not case sensitive, unlike Apache. The search engines typically follow the protocol, which is case sensitive. So, in your example above, there are actually six possible starting points immediately...

example.com/correct-capitals/
example.com/correct-capitals/default.aspx
example.com/correct-capitals/Default.aspx

example.com/Correct-Capitals/
example.com/Correct-Capitals/default.aspx
example.com/Correct-Capitals/Default.aspx

And, if you've not addressed non www 301 to www, take the above and multiply it by two which now gives you twelve possible entry points.
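
For what it's worth, the non www to www half of that is a one-time fix with ISAPI_Rewrite 2.0. A hypothetical sketch using example.com - adjust and test for your own domain:

```
# permanently redirect bare-domain requests to the www host;
# [I] matches the host case-insensitively, [RP] sends a 301
RewriteCond Host: ^example\.com$
RewriteRule (.*) http\://www.example.com$1 [I,RP]
```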

Now, think like an unsavory competitor. Index your entire site, run a routine to regurgitate those indexed URIs in multiple forms using case. Put that page on a high PR site for a few weeks and let it get indexed. Take it down after the indexing occurs.

What just happened?

It is possible that there was just enough PR on the linking page to cause an indexing challenge. I don't know, I would never do anything like that. From my perspective, that is "absolute foul play" and something I would seek legal damages for if I could prove it and the offending party were within my jurisdiction. ;)

Whitey

1:20 pm on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And, if you've not addressed non www 301 to www, take the above and multiply it by two which now gives you twelve possible entry points.

Yep, this one has an open path as well. The site: tool only shows the www version, and both have PR.

Does Google recognise this? Or is it just that nobody has linked to us yet, or that we haven't been crawled deeply enough, before the big collapse with the duplicate content effects?

Now, think like an unsavory competitor. Index your entire site, run a routine to regurgitate those indexed URIs in multiple forms using case. Put that page on a high PR site for a few weeks and let it get indexed. Take it down after the indexing occurs.

I think Google could do more to share this problem. 99% of site owners, I'd estimate, are unaware of how to manage this or better secure their sites. Genuine site owners must be getting hit from all angles, or at least be very vulnerable. I'd say e-commerce and information on the internet are potentially vulnerable on a large scale if some unpleasant folks with disruptive abilities and desires wanted to get serious about it. And accessibility for ordinary site owners is limited - the expertise to deal with duplicate content alone is scarce.

But I guess complaining isn't going to help right now. Best to get on with the job :)

[edited by: Whitey at 1:33 pm (utc) on Jan. 19, 2008]

pageoneresults

1:50 pm on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think Google could do more to share this problem.

It is not their problem. And Google does its part right now by following the established protocol.

99% of site owners, I'd estimate, are unaware of how to manage this or better secure their sites.

That is a genuine challenge and one that needs to be addressed at the source. It is not Google's fault that there are flaws in the Windows platform that don't follow protocol, the issue with case sensitivity being one of them.

Personally, I prefer that there not be any case sensitivity. That way you can use camelCasing, PascalCasing, or whatever case preferences you desire. But, protocol does not allow you that luxury. Protocol forces case.

Genuine site owners must be getting hit from all angles or at least be very vulnerable.

I think the challenges are more on the Windows side of things. Apache by default has your back covered on most of the basics, I think. I know Windows doesn't. You need to cover your own back. :)

I'd say e-commerce and information via the internet is potentially vulnerable on a large scale if some unpleasant folks with disruptive abilities and desires wanted to get serious about it.

I'm a firm believer that it is being done regularly without the knowledge of who it is being done to. From a high tech perspective, those in the know could really do some severe damage to a site if they wanted to.

I've sat down and thought about those holes I've found over the years that still exist and WOW! is all I have to say. I think you could literally remove a site from the index in a short period of time using a few unsavory technical approaches.

And accessability for ordinary site owner folks is limited - the expertise to deal with duplicate content alone is limited.

I've learned over the years that it all comes back to the site owner and their assigns. If you do not know most of this, you will be fighting an uphill battle moving forward. You will continually have indexing challenges. You will find pages constantly moving in and out of the index and never really becoming seated. As soon as you bring the dynamics into the equation, the technical challenges grow exponentially. And, if you are Windows, yikes, watch out!

Whitey

9:12 pm on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So is there a best practice configuration that can accommodate .NET and use Apache to close down these holes, which could wreak havoc for a business?

What options are there for the development team to work in a .NET environment, yet still enjoy the relative security and ease of the Apache server?

Does it need to be one or the other, or can they coexist?

Is it necessary to be "concerned" at all ......

If you are using ISAPI_Rewrite, the issue is resolved

and

In essence, ISAPIrewrite gives you something very much like the Apache context you are familiar with. With it you can do 301 redirects from default.aspx to the domain or folder root without creating nasty loops, enforce a single canonical case for your urls, and establish all manner of canonical fixes. Any tech who is familiar with regex can create the needed rules.

ISAPIrewrite is a whole lot more flexible than using the native .NET rewrite feature. Most of my .NET clients have installed it, and are extremely happy that they did.

[edited by: Whitey at 9:20 pm (utc) on Jan. 19, 2008]

tedster

9:42 pm on Jan 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ISAPIrewrite is a helper application for the Windows IIS server. It is not an Apache application - it just gives the IIS user the kinds of functionality that the Apache user enjoys natively. If you're using the .NET environment, then you're on an IIS server.

pageoneresults

2:15 am on Jan 22, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is not an Apache application - it just gives the IIS user the kinds of functionality that the Apache user enjoys natively.

I'd like to point out that ISAPI_Rewrite 3.0 supports Apache-compatible .htaccess files. Migrating from Apache to Windows is a bit easier with 3.0.
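
So a rule like the default.aspx redirect can be written in familiar mod_rewrite style under 3.0 - a hypothetical sketch, worth checking against the 3.0 documentation before relying on it:

```
RewriteEngine on
# 301 any request for default.aspx back to its folder root
RewriteRule ^(.*/)default\.aspx$ /$1 [NC,R=301,L]
```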

For those of you using 2.0, there is no need to upgrade unless you want to use a .htaccess file above the root. I prefer 2.0 as it allows me to drop a .ini file into each site's root as opposed to going above and working with a .htaccess file. I can edit my .ini file from within FrontPage believe it or not. I do it every day. :)

Whitey

2:09 am on Jan 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How essential is using ISAPI_Rewrite?

My developers are confident they have now fixed the /Default.aspx, lower case, and www vs non-www issues without the need for it.

tedster

2:39 am on Jan 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I feel that ISAPI Rewrite is invaluable for the IIS server. It makes all kinds of very challenging changes easy. That saves time and also avoids creating hidden problems, now and in the future. Even for websites that are using ISAPI Rewrite, I still find the occasional problem in real world implementations - sites require monitoring every time the tech team implements anything new.

Now to be 100% fair, ISAPI Rewrite couldn't do what it does unless IIS allowed all those technical moves - so in theory your team could develop all their own code to get the same functionality. But I have never seen it done. And no disrespect to your team, but I'll bet there are still holes in whatever approach they have come up with.

IT people do not like to be questioned about their area of expertise and training - no one really does, after all. Resistance is something I run into 100% of the time with a tech team, at least until we build a strong and mutually respectful working relationship.

If you are not getting through on a small issue like using this relatively inexpensive module, then I think the tech team is being defensive. This happens because 1) they feel insulted or 2) they don't think you have the technical knowledge needed to tell them what to do with the server. Maybe it's both. This is a human, political issue most of all.

In your situation, I would thoroughly audit the results the team got for you - really check out the urls, the status codes, etc. You don't need to know IIS server admin inside and out to see if their implementation is getting exactly the proper end results, 100% of the time and right on the money.

If their results are excellent, then congratulate them and move on - and remember to check the end results anytime some new move is introduced. If not, show the team where the shortfall is. Maybe they can fix it the hard way again. And someday, maybe they will see the wisdom of spending a small amount of money for a widely used and well tested module that saves them much time and many headaches.

Also, installing ISAPI Rewrite protects your business against personnel changes in the future. You really can't afford to be completely dependent on any one person's "irreplaceable" knowledge - and a person who can reproduce all the functionality of this module would be a hard person to ever replace.

[edited by: tedster at 6:30 am (utc) on Jan. 23, 2008]

pageoneresults

2:50 am on Jan 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My developers are confident they have now fixed the /Default.aspx, lower case, and www vs non-www issues without the need for it.

Okay, how did they fix the /Default.aspx issue?

And, I know they didn't fix the case issue. If they did, I'd really like to know how. I'll assume they went in and used the .NET facilities for rewriting, which are fine for the basics. Once you get beyond the basics, .NET will fail miserably.

Whitey

6:02 am on Jan 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They closed off all internal avenues to /Default.aspx, but the page still exists, should someone take a copy of our site and link to every single page [ I suppose ]. They weren't too concerned about this. They said that once the pages drop out of the Google index it shouldn't matter.

Not sure about the upper and lower case issues, but I apologise, this is an area they are still looking into. This isn't fixed yet.

Once you get beyond the basics, .NET will fail miserably

Can you provide some specifics? Are you referring to examples like:


example.com/correct-capitals/
example.com/correct-capitals/default.aspx
example.com/correct-capitals/Default.aspx

example.com/Correct-Capitals/
example.com/Correct-Capitals/default.aspx
example.com/Correct-Capitals/Default.aspx

or beyond this?

[ btw - I appreciate the inputs, I'm just looking for some specifics, perhaps beyond the upper and lower case issues, although that may be enough ]

[edited by: Whitey at 6:09 am (utc) on Jan. 23, 2008]

pageoneresults

2:56 pm on Jan 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can you provide some specifics? Are you referring to examples like,

I could, but they probably wouldn't apply in your environment. Each person will have different requirements based on the architecture and taxonomy of their site.

We worked with the .NET rewrite facilities to see if they were a viable alternative to ISAPI_Rewrite. They are not. Not even close. The amount of work to do is trebled with .NET.

I'd suggest your programmers install ISAPI_Rewrite and stop trying to skirt the damn issue. Without it you're going to be limping along.

P.S. I don't think they are going to find a solution to force lower case via .NET. Remember, Windows is not case sensitive, so it is an area that is of little concern to Microsoft.

With ISAPI_Rewrite, these rules take care of the case issues. :)

# Convert all upper case characters in the requested URL to lower case.
# The first pair of lines matches any URL containing an upper case
# character and stores a lower-cased copy [CL] in a custom header.
RewriteCond URL ([^?]+[[:upper:]][^?]*).*
RewriteHeader X-LowerCase-URI: .* $1 [CL]

# The second block reads the host and the lower-cased copy back out
# and issues a permanent redirect [RP] to the lower case URL.
RewriteCond Host: (.+)
RewriteCond X-LowerCase-URI: (.+)
RewriteRule [^?]+(.*) http\://$1$2$3 [I,RP]

They said that once the pages drop out of the Google index it shouldn't matter.

I would have fired the person who said that. Right there on the spot with no questions asked.