Forum Moderators: open
Enter MAMA—the "Metadata Analysis and Mining Application". MAMA is a structural Web-page search engine—it trawls Web pages and returns results detailing page structure, including what HTML, CSS, and script are used on them, as well as whether the HTML validates.
MAMA: What is the Web made of? [dev.opera.com]
This is a fascinating glimpse of web page coding based on a sample of 3.5 million pages. From the key findings [dev.opera.com] and other pages linked from the main page above:
I'm surprised that even 4% of the pages in the URL set validate. I would have guessed much lower, but that percentage may be a side effect of how they chose URLs and not a true representation of the entire web.
[edited by: tedster at 5:42 pm (utc) on Oct. 16, 2008]
I wonder how many of the three-quarters of web pages that use scripting degrade nicely. I have to say I'm surprised by how high that number is, and a little surprised by how low the XMLRPC number is.
I'd also like to see numbers like that on a per site basis, because of course sites like MySpace, Facebook and such carry more weight than perhaps they should when you do counts on a per page basis.
4.13% of web pages validate
HTML doctypes outnumbered XHTML doctypes by about 2 to 1
Flash detection... Usage of Flash was determined by looking for any of the following items: a PARAM or EMBED element, or any scripting content with the substring "flash" or ".swf".
I wonder if this picked up those of us using swfObject as an external file? Makes for clean source documents . . .
MAMA's URL selection policy did not respect any robots.txt or other spidering methodologies.
Charming.
MAMA identified itself as Opera 9.10 for its User-Agent string in order to experience the Web the way Opera's browser would.
Make that "in order to trample on the wishes of website owners who might not have wanted to take part".
As it says in the DevOpera logo: "Follow the standards - Break the rules".
...
I'm surprised that even 4% of the pages in the URL set validate. I would have guessed much lower, but that percentage may be a side effect of how they chose URLs and not a true representation of the entire web.
Okay, we know Opera has a small user base. 4% validate? They must have used the bulk of my properties in the test set. :)
Just under half of all pages displaying validation icons actually validate.
Ya, and I wouldn't be surprised if some of them actually come from the W3. ;)
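Checking whether a badge-displaying page actually validates is easy to do mechanically. A minimal sketch, assuming the JSON output shape used by the W3C Nu validator (findings listed under a "messages" key, with errors carrying "type": "error"); the exact schema should be confirmed against the validator's own documentation:

```python
def passes_validation(report):
    """Given a validator report in the Nu validator's out=json shape
    (a dict with a "messages" list, where errors have "type": "error"),
    treat the document as valid when no message is an error.
    Warnings and info messages don't count against validity."""
    return not any(msg.get("type") == "error"
                   for msg in report.get("messages", []))
```

Run over every page that displays a validation icon, a check like this would reproduce the "just under half actually validate" comparison above.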
There is a lot of MAMA coverage elsewhere about the DMoz URL set and the decision to use it as the basis of MAMA's research. MAMA did not analyze ALL of the DMoz URLs, though. Transient network issues, dead URLs, and other problems inevitably whittled the set down to its final total of about 3.5 million.
The data set is biased. I don't think it is really representative of the whole, only of a very small part of it. And it's a human-based bias: they got to start off with a quality data set, for the most part. How about a random sampling of the top 100 properties across a variety of highly competitive keyword phrases? You know, that 90/10 rule thing? ;)
I was surprised about the validation though. It's not like unmaintained sites can all of a sudden become invalid, right? (Just the fact that the validation verification wasn't maintained.)
A bar graph of "last updated" pages would be nice. Only problem is with all the dynamic pages... you can't really measure that, right?
[edited by: Receptional_Andy at 12:05 am (utc) on Oct. 17, 2008]
Of the MAMA findings, I find the 4.13% figure for valid pages surprisingly high; if it is anywhere near the truth, it represents a huge increase in valid documents on the web. It is true, however, that many tools on the market have, in the last few years, been developed or improved to generate valid markup by default. The high percentage of documents declaring themselves as XHTML is a surprise too, but could be due to the same reason: tools such as WordPress and other CMS scripts generating XHTML syntax by default.
This comment got me all defensive for a minute, because I really worried about this issue at the time I started that big MAMA study. I agree that robots.txt should be honored where appropriate. I defend the strategy MAMA employed though.
The first excuse I have is that, at around the time of MAMA's gather that is highlighted in this study, I was rather ignorant of robots.txt and spiders in general. A co-worker of mine pointed out that this would be a BAD thing and I angsted about it quite a bit, did some research into the issues that could be caused, bought a book on spidering and vowed to make the necessary changes. The problem was that my primary crawl was just about to begin. I decided against honoring robots.txt during that crawl for several reasons:
- robots.txt is for spiders/crawlers/wanderers that "traverse many pages in the World Wide Web by recursively retrieving linked pages". [http://www.robotstxt.org/orig.html] MAMA's URL set for this study was fixed, and no crawling outside its set would happen. It was a fixed URL set. MAMA *is* automated, but it didn't crawl links.
- Previous studies that used DMoz as a source did not respect robots.txt (that I ever saw mention of), and to adhere as closely to previous study methodologies as possible it seemed best for MAMA to do the same
- MAMA's domain capping and list randomization strategy meant that the maximum number of times a domain would be hit by MAMA was 30...over a few weeks. This seemed to honor the spirit of robots.txt by respecting a server's right to live and not be hassled to its knees by rapid-fire accesses.
Having said that, as MAMA's URL set grows and becomes less and less a DMoz stepchild...including spidering...robots.txt *will* be respected.
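For what it's worth, honoring robots.txt doesn't require a crawler at all; even a fixed-URL fetcher can check the file before each request. A minimal sketch using Python's standard library (the "MAMA" user-agent token here is illustrative, not the string the study actually sent):

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt, user_agent, url):
    """Decide whether one fixed URL may be fetched, given the raw
    robots.txt text already retrieved from that host.  An empty or
    missing robots.txt permits everything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Gating each fetch on a check like this would honor the letter of the protocol as well as its spirit, with or without link-following.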
Flash detection... Usage of Flash was determined by looking for any of the following items: a PARAM or EMBED element, or any scripting content with the substring "flash" or ".swf".
I wonder if this picked up those of us using swfObject as an external file? Makes for clean source documents . . .
Someone else brought this up and I had to do some digging to find out if this impacted things. The answer is, it doesn't seem to. Some of the search factors MAMA looked for were arrived at through a lot of evidential research into what authors were actually using to embed Flash. Every DOM reference to swfObject in script was noted separately by MAMA as well, but not counted towards the "uses flash" total. To see if this made a difference, I ran some more queries, and of the roughly 50,000 URLs that used the DOM swfObject reference, only 9 *didn't* fall into MAMA's "uses flash" bucket. Now that I have seen reference to swfObject on this issue twice, I'll definitely consider adding it to MAMA.
(if you meant swfobject as an external script file name, that was used 43,751 times in MAMA's set, and of those, 43,555 were judged as "using flash" by MAMA's other means. So there *is* an opportunity for some fine tuning here. Thanks!)
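The heuristic quoted above can be sketched roughly as follows. This is a guess at the logic from the description, not MAMA's actual code; note how a bare external swfobject include slips past the substring checks, which is exactly the fine-tuning gap discussed here:

```python
import re

def uses_flash(html):
    """Rough sketch of the described Flash-detection heuristic: flag a
    page containing a PARAM or EMBED tag, or any script content with
    the substring "flash" or ".swf" (case-insensitive)."""
    lowered = html.lower()
    if re.search(r"<\s*(param|embed)\b", lowered):
        return True
    # Check each <script> element: opening-tag attributes (e.g. src=)
    # plus the script body, up to the closing tag.
    for m in re.finditer(r"<script\b[^>]*>", lowered):
        tag = m.group(0)
        end = lowered.find("</script", m.end())
        body = lowered[m.end():end] if end != -1 else ""
        if "flash" in tag or ".swf" in tag or "flash" in body or ".swf" in body:
            return True
    return False
```

For example, `<script src="swfobject.js"></script>` on its own is not flagged (no "flash" or ".swf" substring), while an inline `swfobject.embedSWF("movie.swf", ...)` call is caught via the ".swf" argument.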
[edited by: tedster at 7:01 am (utc) on Oct. 18, 2008]
[edit reason] add quote boxes [/edit]
the spirit of robots.txt
Welcome to WebmasterWorld Blooberry.
As I made the original comment I want to stress that it benefits us all to have you engaging with the webmaster community, and that I really appreciate you coming here and responding to our comments.
As you can see from the other posts, many webmasters are interested in your research and consider it valuable - and you will find a lot of them (including me) strongly support the Opera browser for its innovation and support for standards, which puts many other browser makers to shame.
Webmasters here are well aware that honouring the robots protocol is effectively voluntary, and relies entirely upon the goodwill of reputable companies. I hope, however, that you will come to understand why some might consider your deliberate decision to misrepresent your robot's identity and ignore clear instructions from website owners as an act of deception that is anything but reputable.
I agree that robots.txt should be honored where appropriate
The vast majority of websites do not even have a robots.txt file. Those that do have clearly stated their wishes, and I for one cannot envisage any circumstances where it would be "appropriate" for you or anyone else to unilaterally decide to disrespect them.
It would not have diminished your research one iota to have confined yourself to sites that did not have a robots.txt file in place, or had a liberal one that allowed your robot access. From my point of view you have single-handedly made your company look arrogant and dishonest.
Please continue to support the standards.
But please stop breaking the rules.
...
However, if I understand this right, they simply downloaded and analysed most of the pages listed in the ODP - all of which are meant to be viewed by a human - and this was a study to see what humans see.
So, if their system only looks at those URLs and doesn't wander off round the rest of the site, then I have no issues with it whatsoever.
The main reason to add a doctype is to trigger "standards" mode in IE6 and 7. I've always assumed the other browsers act similarly, I've never had a problem.
This result just doesn't make sense to me.
Any input?
I'm trying to reconcile in my head the 51% doctype number with the 85% quirks mode rendering number.
The main reason to add a doctype is to trigger "standards" mode in IE6 and 7.
...
This result just doesn't make sense to me.
I'd agree that it doesn't make a lot of sense from the view of triggering standards/quirks mode. Something I'm starting to glean though from all the data that MAMA gathers is that maybe the Web isn't exactly the way I always think it is. I keep finding surprises.
Perhaps many authors are *trying* to trigger quirks mode instead of standards? Perhaps tools are the ones doing all the doctype inserting, and are just using quirks doctypes? The truth is, I'm not really sure. MAMA is able to provide a lot of archeological evidence, but then the dots must be connected manually. The more dots, the better. =)
Perhaps many authors are *trying* to trigger quirks mode instead of standards
Some certainly do, but I would have guessed the number to be insignificant.
What boggled my mind was the suggestion that two-thirds of webpages in China have Flash on them.
I am no expert on statistics or the Chinese web, but Adobe must love those dots.
...