Forum Library, Charter, Moderators: ergophobe

Accessibility and Usability Forum

Semantic Data Extractor

 11:44 am on Jan 29, 2009 (gmt 0)

I've been doing quite a bit of research lately into this whole semantic web thing and I've found myself entrenched in the new WCAG 2.0 documents.

The Ultimate SEO Guide for 2009

Once you become hooked on the above, you then get hooked by all the other references, which seem to be infinite at times. Actually, many of them loop back to themselves. ;)

There is a plethora of tools buried at various authoritative resources, one of those being the W3C. One tool that doesn't get much play is the...

Semantic Data Extractor

This tool, geared by an XSLT stylesheet, tries to extract some information from a HTML semantic rich document. It only uses information available through a good usage of the semantics defined in HTML.

In the past month, I've probably run a few hundred pages through that tool. I want to see how many of the semantic elements I can target on one page. I've extracted all the data that it looks for, and this is the list you end up with. I am now using this list as a general guideline for page development. Depending on the content of the page, I want to make sure that I've covered my bases in these areas...

Extracted Data

Generic Metadata

  • Title
  • Author
  • Description
  • Contact Information
  • Language Code
  • Explicit language annotations within the document
  • HTML Profile

Related Resources

  • Translations
  • Alternate Formats
  • Starting Page
  • Next Page
  • Previous Page
  • Table of Contents
  • Index
  • Glossary
  • Copyright
  • Chapters
  • Sections
  • Subsections
  • Appendix
  • Help
  • Bookmarkable Points
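To make the two lists above a bit more concrete, here is a rough sketch of how that kind of data could be pulled from a page with Python's standard html.parser. This is not how the W3C tool works (that one is driven by an XSLT stylesheet). The class name MetadataExtractor, the function extract_metadata, and the sample page are all made up for illustration. The rel values in the sample (start, next, prev, contents) are HTML 4 link types.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Hypothetical helper: collects generic metadata and <link rel> resources."""

    def __init__(self):
        super().__init__()
        self.meta = {}    # title, author, description, lang, profile, ...
        self.links = []   # (rel, href) pairs such as ("next", "ch3.html")
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html" and a.get("lang"):
            self.meta["lang"] = a["lang"]            # Language Code
        elif tag == "head" and a.get("profile"):
            self.meta["profile"] = a["profile"]      # HTML Profile
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") and a.get("content"):
            self.meta[a["name"].lower()] = a["content"]
        elif tag == "link" and a.get("rel") and a.get("href"):
            # rel may hold several space-separated link types
            for rel in a["rel"].lower().split():
                self.links.append((rel, a["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data


def extract_metadata(html):
    """Return (meta dict, list of (rel, href)) for an HTML string."""
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.links


# Made-up sample page, for illustration only.
SAMPLE = """<html lang="en"><head profile="http://example.com/profile">
<title>Widgets, Chapter 2</title>
<meta name="author" content="Jane Doe">
<meta name="description" content="All about widgets.">
<link rel="start" href="/">
<link rel="next" href="ch3.html">
<link rel="prev" href="ch1.html">
<link rel="contents" href="toc.html">
</head><body></body></html>"""

meta, links = extract_metadata(SAMPLE)
print(meta)
print(links)
```

A page with no meta or link elements simply comes back with an empty dict and an empty list, which is a quick way to spot which of the areas above a document is missing.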

Defined Terms
The following terms are defined in the given HTML page:

Abbreviations and Acronyms
The following abbreviations and/or acronyms are used in the given HTML page:
standing for ""

Citations and Quotes
There are some quotes and citations in this page:
* [source]
References were found to the following sources:

Document Outline
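
The in-body items above (defined terms, abbreviations, quote sources, and the outline) can be sketched the same way. Again, this is only an illustration under my own assumptions, not the W3C tool's logic: SemanticsExtractor and the sample markup are hypothetical, and to keep it short the sketch ignores nesting (an abbr inside a heading is counted as heading text only).

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
CAPTURED = HEADINGS | {"dfn", "abbr", "acronym"}


class SemanticsExtractor(HTMLParser):
    """Hypothetical helper: collects dfn, abbr/acronym, quote sources, headings."""

    def __init__(self):
        super().__init__()
        self.terms = []          # text of each <dfn>
        self.abbrs = []          # (text, title) for <abbr> / <acronym>
        self.quote_sources = []  # cite attribute of <q> / <blockquote>
        self.outline = []        # (level, heading text)
        self._open = None        # (tag, title) of the element being captured
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("q", "blockquote") and a.get("cite"):
            self.quote_sources.append(a["cite"])
        if tag in CAPTURED and self._open is None:
            self._open = (tag, a.get("title"))
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is None or tag != self._open[0]:
            return
        text = " ".join("".join(self._text).split())  # collapse whitespace
        title = self._open[1]
        if tag == "dfn":
            self.terms.append(text)
        elif tag in ("abbr", "acronym"):
            self.abbrs.append((text, title))
        else:
            self.outline.append((int(tag[1]), text))
        self._open = None


# Made-up sample markup, for illustration only.
SAMPLE = """<h1>Widgets</h1>
<p><dfn>Widget</dfn>: a small gadget. See the
<abbr title="World Wide Web Consortium">W3C</abbr>.</p>
<h2>History</h2>
<blockquote cite="http://example.com/source"><p>Quoted text.</p></blockquote>
<h2>Usage</h2>
<h3>Examples</h3>"""

extractor = SemanticsExtractor()
extractor.feed(SAMPLE)
extractor.close()

# Print the outline indented by heading level.
for level, text in extractor.outline:
    print("  " * (level - 1) + text)
for text, title in extractor.abbrs:
    print(f'{text} standing for "{title}"')
```

If the outline prints nothing, the page has no h1-h6 elements at all, which is roughly what you see on pages that lean on styled divs instead of headings.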

How do your pages display semantically? What do you see when you turn styles off? Or images? Or both? If you run your pages through the Semantic Data Extractor, how many of the above areas are being extracted from your documents?

^ Based on my document testing so far, a large percentage of websites fail miserably when it comes to extracting semantics from their pages. That can't be a good sign, can it?



 6:02 pm on Feb 2, 2009 (gmt 0)

I heard the whistling sound right after I pressed the Submit button. I should have known better. That whistling sound comes from a topic that gets posted and then sinks to the depths of WebmasterWorld, never to be seen again. 'Tis a shame too, this "could have" been a good one. ;)


 9:34 am on Feb 16, 2009 (gmt 0)

One thing I don't understand is the error messages.
I don't know if they are reporting problems within the document or the network.


 10:05 pm on Feb 17, 2009 (gmt 0)

I have been experimenting with this for several days and think it's worthy of adding to the toolbox. At least until I create a better one. The frigging error codes don't help much when all you want is semantic information.

Thanks for the link.


 10:41 pm on Feb 17, 2009 (gmt 0)

which error codes are you seeing?
are you validating your document first?


 10:49 pm on Feb 17, 2009 (gmt 0)

One thing I don't understand is the error messages. I don't know if they are reporting problems within the document or the network.

The frigging error codes don't help much when all you want is semantic information.

Would that be this particular error?

Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException: Content is not allowed in prolog.
org.xml.sax.SAXParseException: Content is not allowed in prolog.

That is because you are invoking the tool a second time and sending an encoded URI on that second trip. You may have to enter the URI into the field again and clear up the encoding issues. Even then, there appears to be a caching mechanism at play. I've had to restart my session to get that thing to extract the latest document changes. ;)
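
For anyone wondering what "sending an encoded URI on that second trip" looks like, here is a small sketch using Python's urllib.parse. I don't know how the tool handles the URI internally, so this only illustrates double encoding in general. The example URI and the helper names encode_once and fully_decode are made up.

```python
from urllib.parse import quote, unquote


def encode_once(uri):
    """Percent-encode every reserved character, as a form submission might."""
    return quote(uri, safe="")


def fully_decode(uri, max_rounds=5):
    """Undo repeated percent-encoding by decoding until nothing changes.

    Caution: a URI that legitimately contains a literal %25 would be
    over-decoded, so treat this as a diagnostic aid, not a general fix.
    """
    for _ in range(max_rounds):
        decoded = unquote(uri)
        if decoded == uri:
            break
        uri = decoded
    return uri


original = "http://example.com/page.html?id=1"  # hypothetical page
once = encode_once(original)    # what a first submission might send
twice = encode_once(once)       # an already-encoded URI encoded again

print(once)
print(twice)
print(unquote(twice) == original)       # one decode is no longer enough
print(fully_decode(twice) == original)  # repeated decoding recovers it
```

If the receiving end decodes only once, it ends up requesting a string that still contains percent escapes instead of the real address, so whatever comes back is unlikely to be the document you meant. Re-entering the plain URI in the field sidesteps that.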

Pssst, we already built a better one. It just needs to be converted to jQuery and is in the pipeline for production. That tool gave me all sorts of ideas back in the day. :)


 3:25 pm on Feb 18, 2009 (gmt 0)

I am getting all kinds of errors. Typically, they are validation errors. For my pages I can control that, but for competitive analysis... I just want the semantic data - not a lesson in validation.


 11:22 pm on Feb 18, 2009 (gmt 0)

i'm guessing it's hard to understand the semantics of an invalid document, which might give you a hint about the importance of validation.
if your competitor is "doing well" on that page, it's probably not the semantics.


 3:43 pm on Feb 19, 2009 (gmt 0)

I haven't seen too much evidence that validation makes a large impact on search ranking. I am not discounting that it can have an effect - just that it's often not large enough to see by itself.

Anyhow, I already rolled my own semantic analyzer. I use a lot of "small ball" tactics when it comes to my flavor of SEO and these semantic elements have the look of something that can make a difference -- when combined with other techniques.


 3:08 pm on Feb 25, 2009 (gmt 0)

It's a nice tool for seeing the important parts of your pages, in the "Outline of the document" bit.

As to validation not helping with SERPs... It doesn't, but it sure is nice to see that green bar :)

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved