Valid HTML and Search Engine Signals of Quality - General Search Engine Marketing Issues forum at WebmasterWorld - WebmasterWorld

Valid HTML and Search Engine Signals of Quality

Does Valid HTML Belong in the SEO Toolkit?

martinibuster

4:37 pm on Sep 7, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Creating Valid HTML is good practice: it future-proofs your website so that it displays well in the standards-compliant browsers of the future, and neat, tidy code is easier to revise and correct.

However, there seems to be disagreement on whether Valid HTML is necessary for Search Engine Optimization, that is, for promoting your site to the search engines.

On one side there is the belief that modern bots are engineered to wade through bad code to get to the content; otherwise vast amounts of quality content would be left unranked. As a consequence, because search engines do not validate websites, valid code does not send a positive signal, nor is it generally necessary for properly indexing a site. Yes, it is within the realm of possibility for absolutely horrid code to throw off a bot, but that generally isn't happening with today's smarter bots.

On the other side, some people state that valid code leads to better and smoother indexing and as a consequence, higher rankings.

What do you say?

victor

6:06 am on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Two thoughts:

1. some malformed HTML will cause some spiders to miss some content (or misinterpret some content, skip some links, etc).

Unless you know precisely what effect those malformed tags have on the spiders you care about, it is best not to deploy them.

2. indexing the content a spider has retrieved is a huge, high-speed operation. Broadly and crudely speaking, the stages are:

a. parse it to find text and links
b1. pass the links and anchor text to the spider for further retrieval
b2. pass the text to the indexer for indexing.

The first bottleneck is the parse. So there is likely to be more than one parser: a simple, high-speed one that rips out as much content and as many links as it can, then a slower, more precise one that handles the cases the first-pass parser rejects.

That way, the spiders and indexers are being fed as fast as possible.

But a minority of pages (those that trip the simple parse) get put on the back burner for later handling.
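To make the two-parser idea concrete, here is a minimal Python sketch. It is purely illustrative and not how any real search engine is known to work: the names fast_parse, slow_parse and parse_page, and the rules the fast pass uses to reject a page, are all assumptions made up for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical fast path: it only understands anchors whose href is double-quoted.
FAST_LINK = re.compile(r'<a\s+href="([^"]*)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
ANCHOR_OPEN = re.compile(r"<a\b", re.IGNORECASE)
TAG = re.compile(r"<[^<>]*>")


def fast_parse(html):
    """Cheap first pass. Returns (text, links), or None if the page trips it."""
    # Crude sanity check: unbalanced angle brackets mean we give up at once.
    if html.count("<") != html.count(">"):
        return None
    links = [(href, TAG.sub("", label).strip()) for href, label in FAST_LINK.findall(html)]
    # If some anchor did not match the simple pattern, reject the whole page.
    if len(links) != len(ANCHOR_OPEN.findall(html)):
        return None
    text = " ".join(TAG.sub(" ", html).split())
    return text, links


class TolerantParser(HTMLParser):
    """Slower second pass built on the standard library's forgiving parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._label).strip()))
            self._href = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._label.append(data)


def slow_parse(html):
    parser = TolerantParser()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())
    return text, parser.links


def parse_page(html):
    """Try the fast pass first; fall back to the slow one (the 'back burner')."""
    result = fast_parse(html)
    if result is not None:
        return "fast", result
    return "slow", slow_parse(html)


clean = '<p>Hello <a href="/about">About us</a></p>'
messy = "<p>Hello <a href=/about>About us</a>"  # unquoted href, unclosed <p>

print(parse_page(clean))
print(parse_page(messy))
```

Both pages yield the same text and the same link, but only the tidy one gets through the fast pass. In a real pipeline the slow pass would presumably be queued rather than run inline, which is where the delay would come from.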

So, in effect, some quality signals (here, markup that trips the simple parser) lead to slower indexing.

Do I know that for sure? NO.

But if I were building the backend to a search engine's indexer, it would be the approach I'd take. Most other approaches would slow things down too much.

Would I want to take the risk that malformed tags lead to slower indexing? NO.

Would you?

Mohamed_E

6:26 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Victor's model of two parsers complicates answering the original question. There are "errors" that validators complain about that no sane browser or spider will have any difficulty dealing with. The most frequent one (in my case) is "invalid" characters in a URL. Any browser or spider must be able to understand unescaped ampersands and spaces in a URL.
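The ampersand case is easy to check with a lenient parser. The sketch below uses Python's standard html.parser as a stand-in for a spider; the HrefCollector class, the extract_hrefs helper and the sample URL are made up for illustration, and the parameter names were picked so that they do not resemble any HTML entity name.

```python
from html import escape
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


raw_url = "/search?id=7&page=2"  # hypothetical URL with a bare ampersand

# What a validator complains about: the ampersand is left unescaped.
invalid = f'<a href="{raw_url}">results</a>'
# What the validator wants: the ampersand written as &amp;
valid = f'<a href="{escape(raw_url)}">results</a>'

print(extract_hrefs(invalid))
print(extract_hrefs(valid))
```

Both forms come out as the same URL here. The validator's complaint is not entirely theoretical, though: a parameter whose name happens to match an entity name can be ambiguous when left unescaped, so this result should not be read as covering every URL.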

To those who view "validity" as a theological issue any invalid construct is evil; for those whose approach is pragmatic I think it is essential to discuss which invalid constructs need to be fixed and which can safely be ignored.

For what it's worth, I attempt to write valid XHTML 1.0 Strict (but advertise it on the server as HTML 4.01 Strict). However, when I find a lot of pasted URLs with multiple ampersands in them I tend to leave them as is.

[edited by: Mohamed_E at 6:28 pm (utc) on Sep. 8, 2007]