Wednesday, October 12, 2011

Is Schema.org the right way to go?

[ Originally written for Kuliza Technologies on June 14th, 2011 ]
Did the three big companies make the right decision in introducing Schema.org?

The Semantic Web is a web of information marked up with machine-understandable metadata in addition to the human-readable web content. Recently, Google, Yahoo and Microsoft collaborated on Schema.org, a semantic mark-up vocabulary that they host privately.
This introduction has been a hot topic in the Semantic Web community, mainly because of the syntax the three companies chose for the vocabulary. The main complaint is that the terms in Schema.org are expressed in Microdata syntax rather than in the currently popular RDFa serialization of RDF. I am currently contributing open-source code to the Semantic Web community through my project, an RDF vocabulary publishing platform, so I may appear a bit biased towards RDFa over Microdata here.

A Bit of History -
RDF is a knowledge representation framework that encodes data as subject-predicate-object triples. When you combine triples, they form graphs. Initially, the RDF/XML serialization format was used for semantic mark-up, and it kept the semantic mark-up separate from the HTML content. Over time, the Microformat syntax emerged, in which the semantic metadata is integrated into the HTML itself. RDFa is another serialization of RDF that follows the same approach as Microformat, i.e., integrating the HTML content and the metadata. Microdata is a set of attributes, introduced with HTML5, which claims to improve upon RDFa.
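To make the idea of triples combining into a graph concrete, here is a minimal sketch in Python. The `ex:` identifiers and the `build_graph` helper are made up for illustration and do not come from any real vocabulary or library.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object). All identifiers are hypothetical.
triples = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:acme", "ex:locatedIn", "ex:bangalore"),
    ("ex:alice", "ex:name", "Alice"),
]

def build_graph(triples):
    """Group triples by subject, giving a simple adjacency view of the graph."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

graph = build_graph(triples)
```

Because `ex:acme` is the object of one triple and the subject of another, the two statements link up into a path from `ex:alice` to `ex:bangalore`, which is what turns a list of triples into a graph.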
An important thing to note here is that RDFa and Microdata are both syntaxes. Both are Entity-Attribute-Value models that support using URIs as universal identifiers. There is also an algorithm for converting Microdata to RDF. Schema.org, on the other hand, is a vocabulary. A vocabulary has terms, which can be specified in any syntax. The Schema.org terms were originally specified in Microdata syntax.
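The difference between a syntax and a vocabulary can be sketched in a few lines of Python. The two snippets below describe the same made-up person with Schema.org terms, once in Microdata and once in RDFa (using the RDFa 1.1 style `vocab` attribute). The `FlatItemParser` class is a toy written for this post, not a conforming Microdata or RDFa parser and not the official Microdata-to-RDF algorithm: it assumes a single, flat item and derives property URIs by simple string concatenation.

```python
from html.parser import HTMLParser

MICRODATA = '''
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Alice</span>
  <span itemprop="jobTitle">Engineer</span>
</div>
'''

RDFA = '''
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Alice</span>
  <span property="jobTitle">Engineer</span>
</div>
'''

class FlatItemParser(HTMLParser):
    """Toy extractor for ONE flat item (assumption: no nesting, text values only)."""

    def __init__(self, syntax):
        super().__init__()
        self.syntax = syntax      # "microdata" or "rdfa"
        self.vocab = ""           # vocabulary base URI
        self.pending = None       # property name waiting for its text value
        self.triples = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self.syntax == "microdata":
            type_uri, prop = a.get("itemtype"), a.get("itemprop")
            if type_uri:
                # Simplification: treat everything up to the last "/" as the vocabulary.
                self.vocab = type_uri.rsplit("/", 1)[0] + "/"
        else:
            self.vocab = a.get("vocab", self.vocab)
            type_uri = self.vocab + a["typeof"] if "typeof" in a else None
            prop = a.get("property")
        if type_uri:
            self.triples.add(("_:item", "rdf:type", type_uri))
        self.pending = prop

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.add(("_:item", self.vocab + self.pending, data.strip()))
            self.pending = None

def extract(html, syntax):
    parser = FlatItemParser(syntax)
    parser.feed(html)
    parser.close()
    return parser.triples

same = extract(MICRODATA, "microdata") == extract(RDFA, "rdfa")
```

Under these simplifying assumptions, both snippets yield the same set of triples: the vocabulary (Schema.org) stays the same, and only the attribute names used to carry it differ.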

Can’t we just specify all the terms in RDFa syntax and continue using them?
The answer is yes, and as a matter of fact, the work is already in progress as I write this post. Two people from the RDFa community, Richard Cyganiak (my Google Summer of Code 2011 mentor) and Michael Hausenblas, have developed an RDFS definition for the terms of Schema.org and hosted it at http://schema.rdfs.org/.

So what is the issue here?
Google has asked the web community to use either Microdata or RDFa, but not both on the same page, since mixing the syntaxes confuses its parsers.

“While it’s OK to use the new schema.org mark-up or continue to use existing Microformat or RDFa mark-up, you should avoid mixing the formats together on the same web page, as this can confuse our parsers.” … “If you have already done mark-up and it is already being used by Google, Microsoft, or Yahoo!, the mark-up format will continue to be supported. Changing to the new mark-up format could be helpful over time because you will be switching to a standard that is accepted across all three companies, but you don’t have to do it.”

And then it adds:
“We will also be monitoring the web for RDFa and Microformat adoption and if they pick up, we will look into supporting these syntaxes.”
This sounds as if Google is pushing developers who care about SEO towards Microdata, a syntax that is not in much use yet, by giving it a sort of priority in its parsing algorithms. This takes away developers’ freedom to choose whatever syntax works best for them. Although RDFa is a bit more complex than Microdata, it covers more use cases, and some developers might be more comfortable using it.
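For developers who want to check whether a page mixes formats in the way the guidance quoted above warns against, a rough heuristic can be sketched in Python. The attribute lists and the `syntaxes_used` helper are my own assumptions for illustration; this says nothing about how Google’s parsers actually decide, and it does not detect Microformat, which relies on class names rather than dedicated attributes.

```python
from html.parser import HTMLParser

# Assumed marker attributes for each syntax (a heuristic, not an exhaustive list).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "about", "vocab"}

class SyntaxSniffer(HTMLParser):
    """Records which mark-up syntaxes appear anywhere on a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.found.add("microdata")
        if names & RDFA_ATTRS:
            self.found.add("rdfa")

def syntaxes_used(html):
    sniffer = SyntaxSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.found

# A made-up page fragment that mixes both syntaxes.
mixed_page = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span property="name">Alice</span></div>'
)
mixed = len(syntaxes_used(mixed_page)) > 1
```

A page for which `syntaxes_used` returns more than one entry is a candidate for the kind of mixing the search engines advise against.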

A few years ago, the web developer community was reluctant to semantically mark up its web content. The Semantic Web community worked hard to make web developers understand the future benefits of having linked data all over the web. As a result, many developers slowly started using RDFa and Microformat, and a recent survey showed that 4% of websites used RDFa, which is more than any other. See http://tripletalk.files.wordpress.com/2011/01/rdfa-deployment.png for the comparison.

RDFa is used by Drupal 7, Facebook OGP, Best Buy, all e-commerce sites that use the GoodRelations vocabulary, and many more major deployments globally.
And now Schema.org asks them to learn yet another syntax. Let’s face it: if Google, Microsoft and Yahoo declare that they will support only Microdata for parsing content on the web, most web developers, who mainly care about SEO, will definitely follow. This would adversely affect the growth of RDFa deployments.

Thus, a large portion of the Semantic Web community is not happy with the decision. Some believe that the vocabularies provided by Schema.org won’t suffice for complex domains, since they are not extensible.

Another concern is that the W3C apparently was not consulted at all while Schema.org was developed. Commercialization of standards is never a good thing, and that’s what Schema.org does. In fact, Manu Sporny, chair of the RDFa group at the W3C, has been very aggressive in opposing Schema.org, going to the extent of saying that he would soon start a revolution against “The false choice” of using Microdata in Schema.org. I have been following him on Twitter, where he has been gathering support to put pressure on the three big companies. He also believes that “Microdata doesn’t scale as easily as RDFa – early successes will be followed by stagnation and vocabulary lock-in.”

The solutions-
The most obvious solution to this problem is for Google, Bing and Yahoo to announce that they will treat RDFa and Microdata with equal priority in their parsing algorithms.
Bing has already stated that it can parse a page that includes multiple syntaxes. Google’s parsers cannot do this yet, and Google needs to add this capability as soon as possible.

However…
Schema.org does seem to have created a lot of negative buzz, but let’s not forget that some kind of RDF vocabulary standardization like this was long overdue. Currently, for lack of a definite standard, it is difficult for developers to decide which vocabulary to use for mark-up. Schema.org does solve this problem and makes life easier for developers as well as for search engines. As Google states:

“Creating a schema supported by all the major search engines makes it easier for webmasters to add mark-up, which makes it easier for search engines to create rich search features for users.”