Semantics in HTML Part III – Towards a semantic web

  1. Part I – Traditional HTML Semantics
  2. Part II – Standardizing Vocabularies
  3. Part III – Directions in HTML Semantics (this article)

Fundamental to any science or engineering discipline [is] a common vocabulary for expressing its concepts, and a language for relating them together

Brad Appleton

The World Wide Web is a simple thing really. It is, at the bottom, HTTP and HTML. Throw in some image formats (delivered via HTTP), CSS for styling (delivered via HTTP), JavaScript for interaction design, and that’s more or less it.

The jewel in the crown, at least from the perspective of content developers, is HTML. HTML is used to mark up, and so convey, the vast majority of the web’s information.

Today’s web is a web of HTML, and it is difficult to imagine a web much different any time soon. An almost unimaginable amount of time and money has been invested by individuals, organizations and companies in acquiring the skills and technology to publish and consume HTML. Even incremental changes to this landscape, such as the development of XHTML (a semantic and syntactic near equivalent of HTML, grounded in XML rather than SGML) and of CSS for developing web page presentation, have taken on the order of a decade to gain any kind of significant adoption, even among professional web developers.

In this context, while not impossible (disruptive technologies do replace well entrenched, perfectly serviceable ones – think of air traffic over rail or road for long distance travel, CDs over LPs, digital music formats and players over CDs, and so on), it is difficult to imagine within any reasonable time frame the web of HTML being replaced by “a better web” (this alone should make Adobe pause in considering the role Apollo will play). And this goes equally well for “The Semantic Web”, an ambitious project to recast the web as a web of data, largely for machine consumption, using new markup formats centered on the Resource Description Framework (RDF), “a common framework for expressing … information so it can be exchanged between applications without loss of meaning.”

Whether or not such a project comes to pass, the web of HTML will be with us for a long time yet. Various projects are under way to further mature HTML, both within the W3C (which in December 2006 announced a new HTML working group, chartered in March of 2007, to oversee the future development of HTML) and through the work of ad hoc groups like the WhatWG, whose work on HTML5 maps out an alternative, though not necessarily competing, vision for the future of HTML.

Anyone with more than a passing interest in the future of HTML (and indeed a great many professional developers today) devotes considerable attention to the question of semantics in HTML. In the first of these articles, I paid attention to the nature of the semantics in today’s HTML (HTML 4.01, XHTML 1+) – the kinds of “built in” semantics HTML provides, through elements and attributes. In the second, I looked at current processes and mechanisms for extending the semantics of today’s HTML – in particular, the one truly successful project for doing so to date, microformats – and foreshadowed the subject of this, the final installment in the series: future developments in the semantics of HTML.

Future Semantics

So far we’ve seen that there are three sources of semantics in HTML:

  • The built in semantics of HTML itself – its elements and attributes
  • The ad hoc semantics of developers inventing their own vocabularies, which are typically “injected” into HTML largely using the class and id attributes of HTML
  • Semi structured approaches to developing richer semantics, in particular the microformats project

It would make sense that future semantic developments of HTML would come from these or similar sources or approaches. In this article I want to focus on each in turn, and consider the benefits and shortcomings of each approach to developing richer semantics for HTML.

I’ll begin with the second approach, “bottom up” semantics, which I considered in the first article, and have paid no small amount of attention to with previous research. In short, despite the success of bottom up ontologies, what Thomas Vander Wal terms “folksonomies”, where common vocabularies for describing things emerge through ad hoc usage (well known examples are Flickr’s tags, and Del.icio.us), vocabularies for describing common data on the web simply haven’t emerged. This is not just an assertion, as my previous research indicates. It should in fact not come as a surprise, because class values, for example, are “hidden”, while tags at Del.icio.us or Flickr, by comparison, are visible, giving rise to a positive feedback loop – when I as a user see a tag for a particular kind of thing, I am more likely to use it myself for similar kinds of things. Over time, particular terms appear to “win”, and become the conventionally accepted tag for that kind of thing. With class and id values, on the other hand, we simply don’t get the network effect to anoint particular words as the names of things.
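To make the point about class values concrete, here is a minimal sketch of how one might tally the class names used across a set of pages. It is not the method used in the research mentioned above; the three pages and the names `ClassCounter` and `tally_classes` are invented purely for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Tally every name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute holds a space-separated list of names.
                self.counts.update(value.lower().split())


# Hypothetical pages: three authors marking up the same two constructs
# (a page header and site navigation) with names of their own choosing.
PAGES = [
    '<div class="header"><ul class="nav"><li>Home</li></ul></div>',
    '<div class="masthead"><div class="menu main">Home</div></div>',
    '<div class="banner"><ul class="navigation"><li>Home</li></ul></div>',
]


def tally_classes(pages):
    """Return a Counter of class names found across all the given pages."""
    counter = ClassCounter()
    for page in pages:
        counter.feed(page)
    return counter.counts


for name, count in tally_classes(PAGES).most_common():
    print(name, count)
```

In this made-up sample every name occurs exactly once: three authors, three names for the header, three for the navigation, and nothing in the pages themselves that would nudge them towards a shared term.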

In short, I’d argue that looking to emergent semantics for future vocabularies for HTML is a futile exercise – at least without sensible scaffolding to help the process. I’ll turn to just what that scaffolding might be shortly.

The primary source of semantics in HTML is of course that built into HTML itself – the elements and attributes of HTML. It would seem to be a reasonable argument then that the best source of developments in HTML semantics should come through the language itself. If observation leads us to the conclusion that certain constructs are particularly common on the web – whether they be structural ones like document sections, data constructs like addresses, or rhetorical constructs like irony, then it makes sense to include these very common constructs in HTML using the raw materials of HTML itself.

This is certainly one approach HTML5 takes to developing the semantics of that language. HTML5 proposes new elements such as section, nav and footer. HTML5 also proposes other mechanisms for extending the semantics of HTML, in particular a mechanism for anointing particular values for the class attribute. I’ll turn to that in the final section of this article.

It’s not clear at this stage whether this is a direction the new HTML working group at the W3C will take in developing future versions of HTML. It may be particularly challenging in the context of backwards compatibility. But certainly the XHTML2 project took this approach as one mechanism for extending HTML’s semantics (however, it may well be argued that in fact XHTML2 is no more HTML in any meaningful sense than, say, DocBook is). Similarly to HTML5, XHTML2 introduces new elements, such as section, and new attributes, in particular the role attribute.

Interestingly, XHTML effectively introduces no new elements or attributes to the language (with the exception of Ruby in XHTML 1.1). We’ll return to why that might be in a moment.

Given that two major projects to overhaul HTML, both within and without the W3C, have embraced the mechanism of adding elements and attributes to the language in order to extend HTML’s semantics, the initial conclusion one might draw is that this would be the best possible model for extending the semantics of HTML. After all, some very smart, hardworking people with a lot of theoretical and practical experience in markup languages, parsers, browsers, and so on have come to this conclusion.

Respectfully, I’d argue that it is a really bad mechanism for doing so.

It goes both too far and not nearly far enough: it breaks hundreds of millions of installed browsers, many of which will simply never be upgraded, while not actually solving the problem at all. The problem is not that the semantics of HTML are impoverished and need enriching (though that is certainly a significant problem); the problem is that there is no mechanism for enriching the semantics of HTML without redefining the language, with all the attendant problems of backwards compatibility for sites and software. If we are going to go to the trouble of making such high impact changes (and I’m not sure it’s necessary), then by not fixing this problem, we simply ensure that it will recur in perpetuity, each time we feel that the semantics of HTML are insufficiently rich. Which of course they will always be.

Whatever choices even the wisest, most diligent custodians of HTML make in terms of adding elements and attributes to the language, there will always be the need for more. The proliferation of XML based languages, or the very broad use of class and id values as evidenced by my research of late 2005, demonstrates that there is far too wide a possible set of vocabularies for a single language to embody using the built in mechanisms of elements and attributes alone. What HTML needs is a mechanism for extending the semantics of the language without changing the language itself. In the last part of this article, I want to consider what these mechanisms may be.

Language is a virus

HTML is a language, but not quite in the sense that English or French or Swahili are. Languages whose primary purpose is to communicate with machines – which I’d argue HTML is – require stability, to the point of rigidity, unambiguity (because software is very bad at deducing meaning from language), and constrained semantics. Programming languages, like C, or Fortran, tend to have very long half lives, their syntax and semantics changing very slowly, if at all, and almost always with a strong emphasis on backwards compatibility. HTML, though a language for communicating primarily with machines, is a curious hybrid – because the purpose of this human to machine communication is, at least in part, for those machines then to communicate to other humans. As a consequence, natural language semantics are embedded far deeper into HTML than they typically are into what were once termed 3GLs (or third generation programming languages). The problem is that natural language semantics are far more fluid (terms emerge rapidly, over the period of years, or even shorter), and far broader than the semantics required in order to communicate directly with hardware.

This presents the particular problem I outlined a moment ago, of the project to enrich HTML by innovations within the language being both too broad and too narrow. There’s little doubt that the further development of the semantic capacity of HTML is fundamentally important, but the real challenge is how to do it.

Perhaps the use of the term “semantic capacity” in the previous paragraph, as well as of course my significant investment in microformats, and the previous articles in this series, all point in the direction of how I would suggest projects to extend the semantics of HTML should go.

Just as CSS separated the presentation of a web page from its structure, content or semantics (various people will argue which of those three is the more accurate), HTML needs a mechanism to separate the semantics of a document from its structure. In fact, it largely already has several such mechanisms, which is not to say that no innovation is required, but rather, more attention needs to be paid to these existing, widely used, and more than a little successful mechanisms.

Before I continue, I need to make clear that I don’t think this is an argument in favor of RDF, inline RDF, or a non HTML mechanism. In this sense, it’s different to CSS, where a new language was developed to separate out the appearance of a page. Rather, the argument is for making HTML essentially semantically neutral, much like div and span elements (while not simply abandoning the existing built in semantics of the language), and providing a mechanism within HTML to enable it to be “semanticked” (sorry) in the same way that it can be styled.

Which all sounds nice in principle, but what about in practice? How might that work? I’d argue HTML needs both a mechanism that is part of the language, and processes to help guide emergent vocabularies, in order to both enable and develop richer semantics.

As we’ve seen, both in this article and in the previous one in the series, HTML itself (through the class, id, rel and rev attributes in particular), HTML5 (the class attribute, with a less anarchic approach to allowing values than HTML, where within syntactic constraints anything goes) and XHTML2 (through the class and role attributes) all provide mechanisms to extend the semantics of the language.
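As a rough illustration of how these attributes act as semantic “hooks” that software can read without any change to the language, the sketch below pulls every class, id, rel, rev and role value out of a fragment of markup. The fragment, and the names `HookCollector` and `collect_hooks`, are assumptions made up for this example rather than anything taken from the specifications.

```python
from html.parser import HTMLParser

# The attributes named above as HTML's existing extension points.
SEMANTIC_ATTRS = ("class", "id", "rel", "rev", "role")


class HookCollector(HTMLParser):
    """Collect (element, attribute, value) triples for semantic attributes."""

    def __init__(self):
        super().__init__()
        self.hooks = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in SEMANTIC_ATTRS and value:
                # class, rel and rev may hold several space-separated values.
                for token in value.split():
                    self.hooks.append((tag, name, token))


# Hypothetical fragment, invented for illustration.
SAMPLE = (
    '<div id="sidebar" class="nav" role="navigation">'
    '<a href="/tags/html" rel="tag">HTML</a>'
    "</div>"
)


def collect_hooks(markup):
    """Return the semantic hooks found in markup, in document order."""
    collector = HookCollector()
    collector.feed(markup)
    return collector.hooks


for hook in collect_hooks(SAMPLE):
    print(hook)
```

The elements here are the semantically neutral div and the plain a; all of the extra meaning is carried by attribute values, which is the kind of separation argued for above.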

It remains an interesting open question as to whether the class, id, rel, rev and possibly other existing attributes of HTML suffice to enable the adequate extension of HTML, without the need for further innovation within the language, and I hope that the new W3C HTML working group pays attention to that issue (and that the WhatWG might do so as well). At present, one very common use case that has been discovered by those working with microformats is the need for a mechanism for marking up content that is both unambiguous and standardized (for the benefit of machine communication) and also “human friendly”. A good, but far from unique, example is dates. ISO 8601, adopted by the W3C as the standard format for dates on the web, is an unambiguous but far from human friendly date format. Humans write dates in all manner of ways, at times even ambiguous ones – for example, it is unclear whether 5.12.2007 is May 12th or December 5th. The microformats project has developed the abbr design pattern for marking up such data. One problem is that it arguably stretches the semantics of the abbr element considerably. It may be that what is required is an innovation within HTML (perhaps as “simple” as explicitly enabling the use of the abbr element for precisely this kind of purpose), the development of a new attribute which may be used with some or all HTML elements, or even a whole new HTML element to solve this markup problem.
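The date problem can be sketched in a few lines. The first function below shows that 5.12.2007 has two valid readings; the second reads the unambiguous ISO 8601 value that the abbr design pattern places in the title attribute. The markup uses the dtstart class name from hCalendar, but the sample itself and the helper names `readings`, `AbbrDates` and `machine_dates` are assumptions made for this illustration.

```python
from datetime import date, datetime
from html.parser import HTMLParser


def readings(text):
    """Return every date that a day.month.year or month.day.year string could mean."""
    found = set()
    for fmt in ("%d.%m.%Y", "%m.%d.%Y"):
        try:
            found.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # This reading is not a valid calendar date.
    return sorted(found)


class AbbrDates(HTMLParser):
    """Collect ISO 8601 dates from abbr elements carrying the dtstart class."""

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "abbr" and "dtstart" in classes and attributes.get("title"):
            # The human friendly text is the element content; the
            # machine readable value lives in the title attribute.
            self.dates.append(date.fromisoformat(attributes["title"]))


# Hypothetical markup following the abbr design pattern.
MARKUP = '<abbr class="dtstart" title="2007-12-05">5.12.2007</abbr>'


def machine_dates(markup):
    """Return the dates a parser can read unambiguously from markup."""
    parser = AbbrDates()
    parser.feed(markup)
    return parser.dates


print(readings("5.12.2007"))   # two candidate dates
print(machine_dates(MARKUP))   # exactly one date
```

A human reader sees only “5.12.2007” and resolves it from context; software ignores the visible text and takes the title value, so neither audience has to compromise.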

But in addition to mechanisms, the failure of bottom up evolution to develop meaningful, widespread vocabularies for even very common markup constructs such as page headers, or navigation, indicates that processes are required to enable the development of the actual semantic content that these mechanisms of HTML will enable. In the second article in this series, I looked in considerable detail at the microformats project, in part because I have quite a bit of interest in it myself, but in particular because it is the project which has been by far the most successful innovator of semantics for the web. On the order of tens of millions of pages are now published on the web using one or more microformats. The lesson I drew from that article is that microformats embrace a number of principles and practices which seem to have underpinned their success.

Microformats

  • solve a specific problem
  • start as simple as possible
  • design for humans first, machines second
  • reuse building blocks from widely adopted standards
  • modularity / embeddability
  • enable and encourage decentralized development, content, services

An important question which remains is, is the microformats project alone sufficient to provide all the required semantics for the web? Partly this is a question of whether the processes scale. But it is also a sociological question as to whether there is “one true” process which will give rise to all possible outcomes within a space in the most efficient manner. I also argued in the previous article that microformats to date have concerned themselves with only two of at least four distinct categories of semantics. Of these four categories I identified by analyzing HTML in the first two articles, structural, content/data, rhetorical, and relational, microformats concern themselves with content/data and relational semantics. This may simply be accidental, or reflect the interests of the people who initially worked on microformats, as well as those who were then attracted to the project.

All of these questions lead me to believe that the microformats project alone will not provide a platform, organization, or specific process for developing all the required semantics of the HTML web. Nor indeed would those associated with that project even argue that this is a goal or ambition of the project. One of the successful features of the project is its focus, as clearly stated: “microformats are not … a panacea for all taxonomies, ontologies, and other such abstractions”.

But as a successful project focussed on extending the semantics of HTML, it serves as an excellent model for the kind of process which can be successful.

So what are the key lessons any project, from one internal to an organization that helps standardize internally used semantics, through loose affiliations of interested people within a domain, to much more structured organizations like the WhatWG, and all the way to the gold standard, the W3C itself, can learn from the success (and teething pains) of the microformats project, when it comes to the process of developing new or richer semantics for the web?

It seems to me that three aspects of the microformats process are central.

  • They solve focussed, existing, real world problems, not theoretical ones, by building on existing work, whether on the web, or in other related fields.
  • They are open, and take on board the input of anyone who wishes to get involved in a specific format, with the minimum of fuss – it’s as simple as joining a mailing list.
  • They are iterative – and so help enable emergent consensus, not unlike the IETF’s “rough consensus and running code” model.

Where else?

So if the microformats project is unlikely to tackle in particular structural and rhetorical semantics, how else might the semantics which fall into these categories be developed and enhanced? In some ways, these two areas of semantics fall at two ends of a spectrum. Rhetoric has a very long scholarly and philosophical tradition, while, outside the reasonably narrow constraints of traditional publishing (from which ironically HTML has taken very little by way of its technical vocabulary), the conventions of web design are new, and emerging, and have much less by way of consensus when it comes to naming constructs. As such, in some ways the project of creating a vocabulary of common rhetorical conventions should pose no great challenge, though having it adopted may prove much more of a challenge. A far greater challenge, and one for which there would be much more immediate benefit, is in developing a vocabulary to describe even the very common structural and user interface features of web pages and sites.

This is in fact something which I have been pondering, and writing on for some time. I also know through conversations with many of our peers that it is a challenge that a great many of them (you) think it would be important to solve. For some time, I have been of the strong opinion that design patterns, and pattern languages more generally, as commonly applied to architecture, object oriented analysis and design, and other areas of computer science provide a framework for developing what Brad Appleton argues is “[f]undamental to any science or engineering discipline … a common vocabulary for expressing its concepts, and a language for relating them together”. To that end I started a project almost 18 months ago to develop a pattern language approach to solving the problem of richer semantics for HTML when it comes to web page and site structure and architecture. My involvement with microformats has meant that I’ve learned a lot about how such a project might work, and I hope to reanimate the project (at least the public aspects of it), beginning with a presentation on my ideas to the Information Architecture Summit, who have graciously extended an invitation to speak on this subject.

An earlier article of mine goes into this whole area in considerable detail and so rather than repeat it here, I’ll simply link to it. If these ideas interest you, I hope you might like to help work on that project, or if you’ve not already done so, investigate microformats.

I began this long article with the quote from Brad Appleton, and finish with it, because to me it captures the whole tangled issue of semantics on the web, that I’ve been trying to unravel a little with these articles, succinctly, and clearly. Semantics is about language. As a maturing profession, to move forward, we who develop [for] the web quite simply need this “common vocabulary for expressing its concepts, and a language for relating them together”.

{ 2 } Comments

  1. emilk | September 5, 2007 at 3:41 am | Permalink

    Though you put them into opposition, the two ways how to improve HTML are both valuable.

    First is the structure of the HTML document. Your survey of real world semantics shows that there clearly are structural elements that were omitted (header, footer, menu), but your research shows that it is a limited set after all. That means one round of additions. These structures are so common and STRUCTURAL that they deserve to be elements.

    On the other way is the content, the real sea of meanings. Nobody (I hope) proposes unique element to markup a meaning from this set. If we want to canonize the syntax for a unique meaning from this group, a microformats way is a suitable tool here.
    I think.

  2. john | September 5, 2007 at 3:50 am | Permalink

    Emil,

    I do agree that the two approaches aren’t necessarily in opposition. But the approach of adding to HTML itself is a far more complex and fraught one than developing a mechanism for extending semantics in a more limited, ad hoc and yet still meaningful way.

{ 5 } Trackbacks

  1. [...] Part III – Directions in HTML Semantics [...]

  2. [...] Part III – Directions in HTML Semantics [...]

  3. [...] John Allsopp has an interesting set of articles about semantic markup on Microformatique. This last installment, the future of semantic markup is well worth reading. Future Semantics [...]

  4. [...] I was doing a little reading up on microformats when I found my way to the future service Resolio (or is it Resiolio as in the page title of this sample resume? There’s something magical about funprofessional young startups). It’s refreshing to see another great microformats implementation materializing while the semantic web is still nowhere in site. Tim O’Reilly’s even starting to revise his vision of the semantic web, but curiously, he’s left out microformats, even though they are here and now. [...]

  5. Chris Norton | September 10, 2007 at 11:51 am | Permalink

    Arguing Semantics: HTML Revisited…

    A series of posts about semantics in HTML.
    ……
