Percentage of sites with Invalid HTML

Any stats?

         

AhmedF

7:09 pm on Oct 10, 2003 (gmt 0)

10+ Year Member



I've been looking for a while - does anyone have statistics on how many pages on the web right now would be 'invalid HTML/XHTML'?

I think I saw this somewhere before on the W3C validator site, but I can't find it now.

Any help? :)

Mohamed_E

7:40 pm on Oct 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From an old post by GoogleGuy [webmasterworld.com] (msg # 6):

Basically what these folks said. :) The only data point I'd add is Eric Brewer's '96 paper that mentioned 40% of pages have actual errors in the pages.

Note that he is referring to a '96 paper (last century ;) ).

I was unable to find that paper in Papers by Prof. Brewer [cs.berkeley.edu].

AhmedF

7:47 pm on Oct 10, 2003 (gmt 0)

10+ Year Member



Whoa -- 1996. That was my grandmother's era! :)

I would think that it's increased now with nested tables, etc. 1996 was the age of simple, heh

Mohamed_E

8:53 pm on Oct 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just got an email from Professor Brewer pointing to An Investigation of Documents from the World Wide Web [www5conf.inria.fr].

I find it a fascinating paper from the days when the web was young. The analysis of errors does not go very far.

Read it and enjoy the nostalgia :)

hartlandcat

7:25 am on Oct 11, 2003 (gmt 0)

10+ Year Member



Hmm, 40% seems very low for the number of pages on the internet that don't use valid code -- more like the percentage of pages that come out of validator.w3.org with errors. I'd say it would be more like 80-90% of pages on the whole internet whose code isn't valid.

zaptd

9:44 am on Oct 11, 2003 (gmt 0)

10+ Year Member



In 2001, Dagfinn Parnas evaluated 2.5 million sites listed in the Open Directory Project for his master’s thesis. 99.29% did not validate: [ub.uib.no...]

zaptd

9:49 am on Oct 11, 2003 (gmt 0)

10+ Year Member



Oops, that's 99.29% of 2.5 million pages, not sites :)

percentages

9:57 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>99.29% of 2.5 million pages

Holy bad typing Batman.....thank goodness for tolerant browsers;)

Scary to think that if MS launched a 100% compliant only browser that over 89% of the current web users wouldn't be able to surf anymore;)

MonkeeSage

11:01 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looking at section 5.2, figure 5.2 (p. 81, [PDF p. 87]) of Mr. Parnas' Thesis, the 99.29% stat. can be clarified a little bit...it's not as if 99.29% of the pages were just code soup with gaping errors. If the invalid page count excludes documents which omit a DTD, that lowers it to 97.42% [figure 5.3, caption]. Also, almost all of the errors were caused by one of the following: omission of a required attribute, a non-standard attribute, an unknown attribute value, or an attribute being specified twice [figure 5.4].

Mr. Parnas did not indicate the average number of errors per page that I can see from a brief reading of section 5. I would be interested to find that out.

I plan on reading through the whole thesis as I have the time. Looks like a very interesting piece. Nice resource zaptd. :)

Jordan

percentages

11:10 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jordan, I look forward to the actual numbers from your on-going research.....but, my money is on a high percentage.

Whichever way you look at it.....I suspect we would all be in trouble if the browsers were strict on us ;)

Mohamed_E

11:12 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great find, zaptd! Many thanks.

Just a warning for those who have a dial-up connection (as I do). The thesis is 125 pages of PDF :( I got Chapter 5, "Statistics on syntactical errors" over the connection slowly, well worth the wait.

Does anyone know whether a shorter version (such as a published paper) exists? A quick search failed to find one.

hutcheson

6:21 pm on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Scary to think that if MS launched a 100% compliant only browser that over 89% of the current web users wouldn't be able to surf anymore;)

Huh. Not worried at all about that. Those gonifs couldn't comply with an ANSI sheet-metal-screw standard even AFTER they stole a thread-cutting device.

While we're fantasizing, suppose M$ launched a 100% compliant only web page development tool! It would be hard on some people -- the page-view-based advertising revenue for forums such as this would dry up faster than a SCO executive in the witness box.

grandpa

5:00 am on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been looking for a while - does anyone have statistics on how many pages on the web right now would be 'invalid HTML/XHTML'?

It's sad to admit, but right now 100% of the pages on my site are invalid HTML. 99% of those errors are missing tags: <p> without a </p>. The rest are deprecated attributes, and there's not a single DTD. I'm fixing what I can as I work through my keyword list for those pages. Each page gets a full visual check and missing tags are replaced, including the ever-missing </html>. It's hard to believe we even get any orders.

grandpa

10:44 pm on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's sad to admit, but right now 100% of the pages on my site are invalid HTML.

Woohoo! I'm down to 98%. I just validated my first page and learned some CSS along the way.

So I show the boss, he smiles, then I ask him if Mr. Widget, the fellow who has maintained this site for the last 3 years - and is currently coding a new site for us - will bother to validate the new pages before they go online. No answer.

claus

11:03 pm on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> Mr. Widget

For a web designer/developer building sites (which is not the same as optimizing them), validation is the only accurate measure of how well that task is done: it's a simple yes/no question and you either pass or you don't, meaning you either do your job or you don't. It should really be adopted by more people in this industry/trade.

All other measures, such as "optimized for browser xyz", are never quite what they seem, because "browser xyz" is just not browser xyz when it's the Japanese, Greek, English, or German version, or build 1.1.1.2 vs. 1.1.1.3, or on a small screen vs. a large screen, or on one operating system vs. another, or with this or that plugin, or with whatever settings, or... <cut> this list could continue endlessly.

/claus

AhmedF

1:55 am on Oct 13, 2003 (gmt 0)

10+ Year Member



So any hard 'STATISTICAL' numbers that are recent? :(

claus

2:57 am on Oct 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The 99% invalid sites "guesstimate" on the W3C Quality Assurance site [w3.org] is well in line with the master's thesis [ub.uib.no] that zaptd posted a link to. The thesis is very recent (December 2001) and it is very comprehensive as well (spanning 2,398,226 documents, including 14,563 valid ones).
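For anyone who wants to check the arithmetic, here is a minimal sketch that turns the two counts quoted above into percentages. It only uses the numbers as they appear in this post; the thesis itself was not re-checked, and the variable names are made up for illustration.

```python
# Figures as quoted in the post above (not re-checked against the thesis).
total_documents = 2_398_226
valid_documents = 14_563

# Everything that is not valid counts as invalid in this simple split.
invalid_documents = total_documents - valid_documents

valid_pct = 100 * valid_documents / total_documents
invalid_pct = 100 * invalid_documents / total_documents

print(f"valid:   {valid_pct:.2f}%")
print(f"invalid: {invalid_pct:.2f}%")
```

With these two counts the invalid share works out to roughly 99.4%, a little above the 99.29% quoted earlier in the thread. The thread does not say where the gap comes from; the counts may simply be taken from different tables in the thesis.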

Knowing that the sample is from DMOZ listings, it would surprise me if a truly random sample drawn from the whole web would reach a higher percentage of valid documents. I have not been able to find more recent figures, but I suppose it is possible to replicate the study given sufficient programming skills, as both the data source and the validator are publicly available.
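As a rough idea of what the counting part of such a replication could look like, here is a small sketch. It is not how the thesis did it: `has_doctype` is a toy stand-in that only looks for a DOCTYPE declaration, and `invalid_share` and `sample_pages` are hypothetical names. A real run would pass in a function that calls an actual validator and would read pages from the directory listings instead of a hard-coded list.

```python
from typing import Callable, Iterable


def has_doctype(document: str) -> bool:
    """Toy stand-in for a real validator: only checks for a DOCTYPE declaration."""
    return document.lstrip().lower().startswith("<!doctype")


def invalid_share(documents: Iterable[str], is_valid: Callable[[str], bool]) -> float:
    """Return the percentage of documents that fail the supplied validity check."""
    total = 0
    invalid = 0
    for document in documents:
        total += 1
        if not is_valid(document):
            invalid += 1
    if total == 0:
        raise ValueError("no documents to evaluate")
    return 100.0 * invalid / total


# Hypothetical sample; a real study would fetch pages from the listed URLs.
sample_pages = [
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><html><head><title>a</title></head><body><p>ok</p></body></html>',
    "<html><body><p>no doctype</body>",
    "<p>fragment only",
    "<HTML><BODY>old style</BODY></HTML>",
]

share = invalid_share(sample_pages, has_doctype)
print(f"{share:.2f}% of sample pages fail the check")
```

Keeping the validity check as a parameter means the same tally can be rerun with a stricter or looser definition of "invalid", for example with or without counting a missing DTD.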

/claus

g1smd

9:16 pm on Oct 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> optimized for browser xyz <<

... usually means that the code is still non-valid tag soup.

>> Knowing that the sample is from DMOZ listings <<

If a site fails to work in the reviewing editor's browser, then it gets left in unreviewed with a note. If a sufficient number of people cannot access the site, then it gets deleted, so the ODP may have a very, very small bias away from sites with very bad coding.