The Ultimate SEO Guide for 2009
[webmasterworld.com...]
Once you become hooked on the above, you then get hooked by all the other references, which seem infinite at times. Actually, many of them loop back to themselves. ;)
There is a plethora of tools buried at various authoritative resources, one of those being the W3C. One of those tools that doesn't get much play is the...
Semantic Data Extractor
[w3.org...]
This tool, geared by an XSLT stylesheet, tries to extract some information from a HTML semantic rich document. It only uses information available through a good usage of the semantics defined in HTML.
In the past month, I've probably run well over a few hundred pages through that tool. I want to see how many of the semantic elements I can target on one page. I've extracted all the data that it looks for, and this is the list you end up with. I am now using this list as a general guideline for page development. Depending on the content of the page, I want to make sure that I've covered my bases in these areas...
Extracted Data
Generic Metadata
- Title
- Author
- Description
- Contact Information
- Language Code
- Explicit language annotations within the document
- HTML Profile
Related Resources
- Translations
- Alternate Formats
- Starting Page
- Next Page
- Previous Page
- Table of Contents
- Index
- Glossary
- Copyright
- Chapters
- Sections
- Subsections
- Appendix
- Help
- Bookmarkable Points
Defined Terms
The following terms are defined in the given HTML page:
Abbreviations and Acronyms
The following abbreviations and/or acronyms are used in the given HTML page:
standing for ""
Citations and Quotes
There are some quotes and citations in this page:
* [source]
References were found to the following sources:
*
Document Outline
*
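For anyone who wants a rough local check against the list above, here is a minimal sketch in Python using only the standard library. It is not how the W3C tool works (that one runs an XSLT stylesheet); the class name SemanticChecklist, the helper extract() and the keys of the result dictionary are all made up for illustration, and it covers only part of the list: title, language code, named meta elements, link rel values, abbreviations, cited sources and the heading outline.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SemanticChecklist(HTMLParser):
    """Hypothetical helper: collects a few semantic hooks from an HTML page.

    Known limitation: an <abbr> nested inside a heading interrupts the
    capture of that heading's text, so the heading is skipped.
    """

    def __init__(self):
        super().__init__()
        self.found = {
            "title": "",    # text of <title>
            "lang": None,   # lang attribute on <html>
            "meta": {},     # <meta name=...> -> content
            "links": {},    # <link rel=...> -> list of hrefs
            "abbr": {},     # abbreviation text -> title attribute
            "cites": [],    # cite attributes on <blockquote> and <q>
            "outline": [],  # (heading level, heading text)
        }
        self._capture = None  # tag whose text is being collected
        self._buf = []
        self._attrs = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html":
            self.found["lang"] = a.get("lang")
        elif tag == "meta" and a.get("name"):
            self.found["meta"][a["name"].lower()] = a.get("content") or ""
        elif tag == "link" and a.get("rel") and a.get("href"):
            # rel may hold several space-separated link types
            for rel in a["rel"].lower().split():
                self.found["links"].setdefault(rel, []).append(a["href"])
        elif tag in ("blockquote", "q") and a.get("cite"):
            self.found["cites"].append(a["cite"])
        if tag == "title" or tag in ("abbr", "acronym") or tag in HEADINGS:
            self._capture, self._buf, self._attrs = tag, [], a

    def handle_data(self, data):
        if self._capture:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "title":
            self.found["title"] = text
        elif tag in ("abbr", "acronym"):
            if self._attrs.get("title"):
                self.found["abbr"][text] = self._attrs["title"]
        else:
            self.found["outline"].append((int(tag[1]), text))
        self._capture = None


def extract(html):
    """Parse an HTML string and return the collected dictionary."""
    parser = SemanticChecklist()
    parser.feed(html)
    parser.close()
    return parser.found
```

Calling extract() on a page's source and looking for empty entries gives a quick idea of which areas of the list the page leaves uncovered. Rel values such as next, prev, contents, index, glossary, copyright, chapter, section, subsection, appendix, help and bookmark would show up under "links" when the page declares them with link elements.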
How do your pages display semantically? What do you see when you turn styles off? Or images? Or both? If you run your pages through the Semantic Data Extractor, how many of the above areas are being extracted from your documents?
^ Based on my document testing so far, a large percentage of websites fail miserably when it comes to extracting semantics from their pages. That can't be a good sign, can it?
One thing I don't understand is the error messages. I don't know if they are reporting problems within the document or the network.
The frigging error codes don't help much when all you want is semantic information.
Would that be this particular error?
Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException: Content is not allowed in prolog.
org.xml.sax.SAXParseException: Content is not allowed in prolog.
That is because you are invoking the tool a second time and sending an encoded URI on that second trip. You may have to enter the URI into the field again and clear up the encoding issues. Even then, there appears to be a caching mechanism at play. I've had to restart my session to get that thing to extract the latest document changes. ;)
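To illustrate the encoded URI point, here is a small sketch in Python of what a doubly encoded address looks like and how to get back to the plain one before pasting it into the field again. I don't know what the tool does internally, so this only shows the general effect; fully_decode() and the example.com address are made up for the example.

```python
from urllib.parse import quote, unquote


def fully_decode(uri, max_rounds=5):
    """Hypothetical helper: unquote repeatedly until the string stops changing.

    Caveat: a URI that legitimately contains a literal "%25" would be
    decoded one step too far.
    """
    for _ in range(max_rounds):
        decoded = unquote(uri)
        if decoded == uri:
            break
        uri = decoded
    return uri


original = "http://example.com/page.html?id=1"
once = quote(original, safe="")   # what a form submission might send
twice = quote(once, safe="")      # the same value encoded a second time

print(once)
print(twice)
print(fully_decode(twice))
```

In the doubly encoded string every "%" from the first pass has itself become "%25", so a service that decodes it only once ends up asking for an address that does not exist as written.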
Pssst, we already built a better one. It just needs to be converted to jQuery and is in the pipeline for production. That tool gave me all sorts of ideas back in the day. :)
Anyhow, I already rolled my own semantic analyzer. I use a lot of "small ball" tactics when it comes to my flavor of SEO and these semantic elements have the look of something that can make a difference -- when combined with other techniques.