This tool, powered by an XSLT stylesheet, tries to extract information from a semantically rich HTML document. It relies only on information made available through proper use of the semantics defined in HTML.
In the past month, I've probably run well over a few hundred pages through that tool. I want to see how many of the semantic elements I can target on one page. I've extracted all the data it looks for, and this is the list you end up with. I'm now using this list as a general guideline for page development. Depending on the content of the page, I want to make sure I've covered my bases in these areas...
* Explicit language annotations within the document
* Table of Contents
* Defined Terms (terms defined in the page)
* Abbreviations and Acronyms (and what they stand for)
* Citations and Quotes (with their sources and references)
* Document Outline
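To make that checklist concrete, here's a rough sketch of my own (not part of the tool) that feeds a made-up sample page through Python's stdlib html.parser and tallies the elements each area cares about:

```python
from html.parser import HTMLParser

# A hypothetical sample page exercising the checklist above: a lang attribute
# on the root, headings for the outline, dfn, abbr with a title, and a
# blockquote with a cite source. The tag names are real HTML; the page
# content itself is made up.
SAMPLE = """
<html lang="en">
<body>
<h1>Semantics</h1>
<h2>Terms</h2>
<p><dfn>Markup</dfn> is text that annotates other text.</p>
<p><abbr title="HyperText Markup Language">HTML</abbr> is one such markup.</p>
<blockquote cite="http://example.com/quote">A quoted passage.</blockquote>
<p lang="fr">Un paragraphe en fran&ccedil;ais.</p>
</body>
</html>
"""

SEMANTIC = {"h1", "h2", "h3", "h4", "h5", "h6",
            "dfn", "abbr", "blockquote", "q", "cite"}

class Tally(HTMLParser):
    """Count semantic elements and explicit lang annotations."""
    def __init__(self):
        super().__init__()
        self.counts = {}
        self.lang_annotations = 0

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC:
            self.counts[tag] = self.counts.get(tag, 0) + 1
        if any(name == "lang" for name, _ in attrs):
            self.lang_annotations += 1

tally = Tally()
tally.feed(SAMPLE)
print(tally.counts)            # each semantic tag seen, with a count
print(tally.lang_annotations)  # 2: the <html> and the French <p>
```

A real extractor obviously does more (nesting, outline depth, attribute values), but even this much shows how far plain stdlib parsing gets you when the markup is semantically honest.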
How do your pages display semantically? What do you see when you turn styles off? Or images? Or both? If you run your pages through the Semantic Data Extractor, how many of the above areas are being extracted from your documents?
Based on my document testing so far, a large percentage of websites fail miserably when it comes to extracting semantics from their pages. That can't be a good sign, can it?
Msg#: 3837274 posted 6:02 pm on Feb 2, 2009 (gmt 0)
I heard the whistling sound right after I pressed the Submit Button. I should have known better. That whistling sound comes from a topic that gets posted and then sinks to the depths of WebmasterWorld never to be seen again. Tis a shame too, this "could have" been a good one. ;)
Msg#: 3837274 posted 10:05 pm on Feb 17, 2009 (gmt 0)
I have been experimenting with this for several days and think it's worthy of adding to the toolbox. At least until I create a better one. The frigging error codes don't help much when all you want is semantic information.
Msg#: 3837274 posted 10:49 pm on Feb 17, 2009 (gmt 0)
One thing I don't understand is the error messages. I can't tell whether they are reporting problems within the document or with the network.
The frigging error codes don't help much when all you want is semantic information.
Would that be this particular error?
Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException: Content is not allowed in prolog.
org.xml.sax.SAXParseException: Content is not allowed in prolog.
That is because you are invoking the tool a second time and sending an encoded URI on that second trip. You may have to enter the URI into the field again and clear up the encoding issues. Even then, there appears to be a caching mechanism at play. I've had to restart my session to get that thing to extract the latest document changes. ;)
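For what it's worth, "Content is not allowed in prolog" is the XML parser complaining about bytes showing up before the <?xml ...?> declaration. A quick sketch with Python's stdlib parser reproduces the failure mode (the wording differs from the Xerces message, but the cause is the same):

```python
import xml.etree.ElementTree as ET

good = "<?xml version='1.0'?><root/>"
bad = "oops<?xml version='1.0'?><root/>"  # stray text before the prolog

ET.fromstring(good)  # parses cleanly

try:
    ET.fromstring(bad)
except ET.ParseError as err:
    # expat phrases it differently than Xerces ("syntax error" vs.
    # "Content is not allowed in prolog"), but the cause is identical:
    # non-markup content ahead of the XML declaration.
    print("parse failed:", err)
```

So the tool is almost certainly fetching something that isn't well-formed XML on that second, re-encoded request, and the parser bails at the very first byte.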
Pssst, we already built a better one. It just needs to be converted to jQuery and is in the pipeline for production. That tool gave me all sorts of ideas back in the day. :)
Msg#: 3837274 posted 11:22 pm on Feb 18, 2009 (gmt 0)
I'm guessing it's hard to understand the semantics of an invalid document, which might give you a hint about the importance of validation. If your competitor is "doing well" on that page, it's probably not the semantics.
Msg#: 3837274 posted 3:43 pm on Feb 19, 2009 (gmt 0)
I haven't seen too much evidence that validation makes a large impact on search ranking. I am not discounting that it can have an effect - just that it's often not large enough to see by itself.
Anyhow, I already rolled my own semantic analyzer. I use a lot of "small ball" tactics when it comes to my flavor of SEO and these semantic elements have the look of something that can make a difference -- when combined with other techniques.
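If anyone wants to roll their own like I did, here's a bare-bones sketch of the idea -- the area-to-tag mapping is my own guess at what satisfies each report area, not the extractor's actual rules:

```python
from html.parser import HTMLParser

# Rough mapping from the extractor's report areas to the tags that would
# satisfy them. This is my own checklist, not the tool's real logic.
AREAS = {
    "defined terms": {"dfn"},
    "abbreviations/acronyms": {"abbr", "acronym"},
    "citations and quotes": {"q", "blockquote", "cite"},
    "document outline": {"h1", "h2", "h3", "h4", "h5", "h6"},
}

class TagCollector(HTMLParser):
    """Record every distinct tag that opens in the document."""
    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)

def covered_areas(html):
    """Return which checklist areas a page hits at least once."""
    parser = TagCollector()
    parser.feed(html)
    return {area for area, tags in AREAS.items() if tags & parser.seen}

page = "<h1>Title</h1><p><abbr title='Cascading Style Sheets'>CSS</abbr></p>"
print(sorted(covered_areas(page)))
# -> ['abbreviations/acronyms', 'document outline']
```

Run that over a batch of pages and you get a quick coverage report per page -- exactly the sort of "small ball" check that tells you which areas still need markup before combining it with other techniques.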