Forum Moderators: open
Enter MAMA—the "Metadata Analysis and Mining Application". MAMA is a structural Web-page search engine—it trawls Web pages and returns results detailing page structure, including what HTML, CSS, and script are used on them, as well as whether the HTML validates.
MAMA: What is the Web made of? [dev.opera.com]
This is a fascinating glimpse of web page coding based on a sample of 3.5 million pages. From the key findings [dev.opera.com] and other pages linked from the main page above:
I'm surprised that even 4% of the pages in the URL set validate. I would have guessed much lower, but that percentage may be a side effect of how they chose URLs and not a true representation of the entire web.
[edited by: tedster at 5:42 pm (utc) on Oct. 16, 2008]
I wonder how many of the three-quarters of web pages that use scripting degrade nicely. I have to say I'm surprised by how high that number is, and a little surprised by how low the XMLRPC number is.
I'd also like to see numbers like that on a per site basis, because of course sites like MySpace, Facebook and such carry more weight than perhaps they should when you do counts on a per page basis.
4.13% of web pages validate
HTML doctypes outnumbered XHTML doctypes by about 2 to 1
Flash detection... Usage of Flash was determined by looking for any of the following items: a PARAM or EMBED element, or any scripting content with the substring "flash" or ".swf".
I wonder if this picked up those of us using swfObject as an external file? Makes for clean source documents . . .
MAMA's URL selection policy did not respect any robots.txt or other spidering methodologies.
Charming.
MAMA identified itself as Opera 9.10 for its User-Agent string in order to experience the Web the way Opera's browser would.
Make that "in order to trample on the wishes of website owners who might not have wanted to take part".
As it says in the DevOpera logo: "Follow the standards - Break the rules".
...
I'm surprised that even 4% of the pages in the URL set validate. I would have guessed much lower, but that percentage may be a side effect of how they chose URLs and not a true representation of the entire web.
Okay, we know Opera has a small user base. 4% validate? They must have used the bulk of my properties in the test set. :)
Just under half of all pages displaying validation icons actually validate.
Ya, and I wouldn't be surprised if some of them actually come from the W3. ;)
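Checking whether a badge-displaying page actually validates is easy to do mechanically. A minimal sketch, assuming the JSON output shape used by the W3C Nu validator (findings listed under a "messages" key, with errors carrying "type": "error"); the exact schema should be confirmed against the validator's own documentation:

```python
def passes_validation(report):
    """Given a validator report in the Nu validator's out=json shape
    (a dict with a "messages" list, where errors have "type": "error"),
    treat the document as valid when no message is an error.
    Warnings and info messages don't count against validity."""
    return not any(msg.get("type") == "error"
                   for msg in report.get("messages", []))
```

Run over every page that displays a validation icon, a check like this would reproduce the "just under half actually validate" comparison above.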
There is a lot of MAMA coverage elsewhere about the DMoz URL set and the decision to use it as the basis of MAMA's research. MAMA did not analyze ALL of the DMoz URLs, though. Transient network issues, dead URLs, and other problems inevitably whittled the set down to its final total of about 3.5 million.
The data set is biased. I don't think it is really representative of the whole, only of a very small part of it. And it's a human-based bias: they got to start off with a quality data set, for the most part. How about a random sampling of the top 100 properties across a variety of highly competitive keyword phrases? You know, that 90/10 rule thing? ;)
I was surprised about the validation though. It's not like unmaintained sites can all of a sudden become invalid, right? (Just the fact that the validation verification wasn't maintained.)
A bar graph of "last updated" pages would be nice. Only problem is with all the dynamic pages... you can't really measure that, right?
[edited by: Receptional_Andy at 12:05 am (utc) on Oct. 17, 2008]
Of the MAMA findings, I find the 4.13% figure for valid pages surprisingly high; if it is anywhere near the truth, it represents a huge increase in valid documents on the web. It is true, however, that many tools on the market have, in the last few years, been developed or improved to generate valid markup by default. The high percentage of documents declaring themselves as XHTML is a surprise too, but could be due to the same reason: tools such as WordPress and other CMS scripts generating XHTML syntax by default.
This comment got me all defensive for a minute, because I really worried about this issue at the time I started that big MAMA study. I agree that robots.txt should be honored where appropriate. I defend the strategy MAMA employed though.
The first excuse I have is that, at around the time of MAMA's gather that is highlighted in this study, I was rather ignorant of robots.txt and spiders in general. A co-worker of mine pointed out that this would be a BAD thing and I angsted about it quite a bit, did some research into the issues that could be caused, bought a book on spidering and vowed to make the necessary changes. The problem was that my primary crawl was just about to begin. I decided against honoring robots.txt during that crawl for several reasons:
- robots.txt is for spiders/crawlers/wanderers that "traverse many pages in the World Wide Web by recursively retrieving linked pages". [http://www.robotstxt.org/orig.html] MAMA's URL set for this study was fixed, and no crawling outside its set would happen. It was a fixed URL set. MAMA *is* automated, but it didn't crawl links.
- Previous studies that used DMoz as a source did not respect robots.txt (that I ever saw mention of), and to adhere as closely to previous study methodologies as possible it seemed best for MAMA to do the same
- MAMA's domain capping and list randomization strategy meant that the maximum number of times a domain would be hit by MAMA was 30...over a few weeks. This seemed to honor the spirit of robots.txt by respecting a server's right to live and not be hassled to its knees by rapid-fire accesses.
Having said that, as MAMA's URL set grows and becomes less and less a DMoz stepchild...including spidering...robots.txt *will* be respected.
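For what it's worth, honoring robots.txt doesn't require a crawler at all; even a fixed-URL fetcher can check the file before each request. A minimal sketch using Python's standard library (the "MAMA" user-agent token here is illustrative, not the string the study actually sent):

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt, user_agent, url):
    """Decide whether one fixed URL may be fetched, given the raw
    robots.txt text already retrieved from that host.  An empty or
    missing robots.txt permits everything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Gating each fetch on a check like this would honor the letter of the protocol as well as its spirit, with or without link-following.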
Flash detection... Usage of Flash was determined by looking for any of the following items: a PARAM or EMBED element, or any scripting content with the substring "flash" or ".swf".
I wonder if this picked up those of us using swfObject as an external file? Makes for clean source documents . . .
Someone else brought this up and I had to do some digging to find out if this impacted things. The answer is, it doesn't seem to. Some of the search factors MAMA looked for were arrived at through a lot of evidential research into what authors were actually using to embed Flash. Every DOM reference to swfObject in script was noted separately by MAMA as well, but not counted towards the "uses flash" total. To see if this made a difference, I ran some more queries, and of the roughly 50,000 URLs that used the DOM swfObject reference, only 9 *didn't* fall into MAMA's "uses flash" bucket. Now that I have seen reference to swfObject on this issue twice, I'll definitely consider adding it to MAMA.
(if you meant swfobject as an external script file name, that was used 43,751 times in MAMA's set, and of those, 43,555 were judged as "using flash" by MAMA's other means. So there *is* an opportunity for some fine tuning here. Thanks!)
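The heuristic quoted above can be sketched roughly as follows. This is a guess at the logic from the description, not MAMA's actual code; note how a bare external swfobject include slips past the substring checks, which is exactly the fine-tuning gap discussed here:

```python
import re

def uses_flash(html):
    """Rough sketch of the described Flash-detection heuristic: flag a
    page containing a PARAM or EMBED tag, or any script content with
    the substring "flash" or ".swf" (case-insensitive)."""
    lowered = html.lower()
    if re.search(r"<\s*(param|embed)\b", lowered):
        return True
    # Check each <script> element: opening-tag attributes (e.g. src=)
    # plus the script body, up to the closing tag.
    for m in re.finditer(r"<script\b[^>]*>", lowered):
        tag = m.group(0)
        end = lowered.find("</script", m.end())
        body = lowered[m.end():end] if end != -1 else ""
        if "flash" in tag or ".swf" in tag or "flash" in body or ".swf" in body:
            return True
    return False
```

For example, `<script src="swfobject.js"></script>` on its own is not flagged (no "flash" or ".swf" substring), while an inline `swfobject.embedSWF("movie.swf", ...)` call is caught via the ".swf" argument.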
[edited by: tedster at 7:01 am (utc) on Oct. 18, 2008]
[edit reason] add quote boxes [/edit]
the spirit of robots.txt
Welcome to WebmasterWorld Blooberry.
As I made the original comment I want to stress that it benefits us all to have you engaging with the webmaster community, and that I really appreciate you coming here and responding to our comments.
As you can see from the other posts, many webmasters are interested in your research and consider it valuable - and you will find a lot of them (including me) strongly support the Opera browser for its innovation and support for standards, which puts many other browser makers to shame.
Webmasters here are well aware that honouring the robots protocol is effectively voluntary, and relies entirely upon the goodwill of reputable companies. I hope, however, that you will come to understand why some might consider your deliberate decision to misrepresent your robot's identity and ignore clear instructions from website owners as an act of deception that is anything but reputable.
I agree that robots.txt should be honored where appropriate
The vast majority of websites do not even have a robots.txt file. Those that do have clearly stated their wishes, and I for one cannot envisage any circumstances where it would be "appropriate" for you or anyone else to unilaterally decide to disrespect them.
It would not have diminished your research one iota to have confined yourself to sites that did not have a robots.txt file in place, or had a liberal one that allowed your robot access. From my point of view you have single-handedly made your company look arrogant and dishonest.
Please continue to support the standards.
But please stop breaking the rules.
...
However, if I understand this right, they simply downloaded and analysed most of the pages listed in the ODP - all of which are meant to be viewed by a human - and this was a study to see what humans see.
So, if their system only looks at those URLs and doesn't wander off round the rest of the site, then I have no issues with it whatsoever.
The main reason to add a doctype is to trigger "standards" mode in IE6 and 7. I've always assumed the other browsers act similarly, I've never had a problem.
This result just doesn't make sense to me.
Any input?
I'm trying to reconcile in my head the 51% doctype number with the 85% quirks mode rendering number.
The main reason to add a doctype is to trigger "standards" mode in IE6 and 7.
...
This result just doesn't make sense to me.
I'd agree that it doesn't make a lot of sense from the view of triggering standards/quirks mode. Something I'm starting to glean though from all the data that MAMA gathers is that maybe the Web isn't exactly the way I always think it is. I keep finding surprises.
Perhaps many authors are *trying* to trigger quirks mode instead of standards? Perhaps tools are the ones doing all the doctype inserting, and are just using quirks doctypes? The truth is, I'm not really sure. MAMA is able to provide a lot of archeological evidence, but then the dots must be connected manually. The more dots, the better. =)
Perhaps many authors are *trying* to trigger quirks mode instead of standards
Some certainly do, but I would have guessed the number to be insignificant.
What boggled my mind was the suggestion that two-thirds of webpages in China have Flash on them.
I am no expert on statistics or the Chinese web, but Adobe must love those dots.
...