Semantics in HTML Part II – standardizing vocabularies

  1. Part I – Traditional HTML Semantics
  2. Part II – Standardizing Vocabularies (this article)
  3. Part III – Directions in HTML Semantics

The aim of these articles is to delve more deeply into the nature of “real world” HTML semantics – to investigate where the semantics of HTML comes from, and in future may come from, and to categorize the different kinds of semantics in HTML.

In the first part, we looked into the “built in” semantics of HTML – the elements and attributes of HTML that have semantics defined for them.

In this part, we’ll look at other sources of semantics from outside HTML itself, including the mechanisms used to “inject” these semantics.

In the final part, I’ll turn to ways in which HTML may acquire richer semantics, whether through changes to the language itself (as currently under development through the HTML5 project and XHTML2, as well as the W3C’s recently formed HTML Working Group), through projects similar to microformats (to develop controlled vocabularies, and mechanisms/patterns for appropriately using them with HTML), or through other mechanisms, such as using other XML languages embedded in XHTML. In particular, I’ll pay attention to the use of pattern languages, and to “WebPatterns”, a pattern language I have been developing for some time.

Emergent Semantics

Language is a function of human brain physiology – as a species, our use, and even creation, of language appears to be essentially spontaneous, or instinctive. To the best of my knowledge, no group of people has ever been found without language. When groups with divergent mother languages become a single group (as for example with African slaves in the New World) pidgin languages develop very quickly, and within a generation, the children of the original group members are speaking a new mother language, a creole.

In short, we are very good at using, and creating language.

So, you might conjecture that on the web, as thousands, indeed millions, of developers around the world adopt a “semantic” approach to marking up content, some kind of common language might emerge, at least for very common pieces of information or constructs. In late 2005 I set out to test this conjecture. Having observed that the use of class and id had become increasingly common over the last 5 years or so, I wanted to determine whether there was some kind of consensus emerging in this use. Were we developing a “pidgin” – a simplified “language” common to web developers, for describing the kinds of common constructs, such as page architecture elements, that developers use a lot? It seemed like a fair surmise, but overwhelmingly the result was that there is no such emergent language. Across the more than 1000 sites I surveyed, more than 4000 distinct class values and id values were found. That is an average of about four distinct class and id values per site surveyed, with very little sharing between sites. Indeed, the most frequently used values were used fewer than 100 times across all sites, that is in fewer than 2.5% of all sites. (By the way, the sites were chosen on the basis that they were more likely to use HTML in this semantic way – you can read about the project, and its methodology, as well as get the results in the original article.)
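To make the kind of tally described above concrete, here is a minimal sketch in Python. It is not the tooling used for the survey: the sample pages and the names NameCollector, names_in and tally are all invented for illustration, and a real survey would fetch live pages rather than use inline strings.

```python
from collections import Counter
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collects the distinct class and id values used in one page."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            if not value:
                continue  # attribute present but empty
            if attr == "class":
                # class holds a space-separated list of values
                self.names.update(value.split())
            elif attr == "id" and value.strip():
                self.names.add(value.strip())


def names_in(page):
    """Returns the set of class and id values found in one page."""
    collector = NameCollector()
    collector.feed(page)
    return collector.names


def tally(pages):
    """Counts, for each value, how many pages use it at least once."""
    usage = Counter()
    for page in pages:
        usage.update(names_in(page))
    return usage


# Hypothetical sample pages, invented for this sketch.
pages = [
    '<div id="header"><ul class="nav main"><li>Home</li></ul></div>',
    '<div id="masthead"><ul class="navigation"><li>Home</li></ul></div>',
    '<div id="header"><p class="strapline">Welcome</p></div>',
]

usage = tally(pages)
for name, count in usage.most_common():
    print(f"{name}: used on {count} of {len(pages)} pages")
```

Even in this toy sample, three pages produce six distinct values, and only one of them (header) is shared, which is the shape of result the survey found at a much larger scale.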

In short, waiting around for a shared vocabulary to emerge, even for the most commonly used terms, is going to be a very long and almost certainly fruitless wait. That means specific projects to develop these vocabularies are going to be required if we are ever going to see them. (This does of course leave aside the whole issue of whether or not they are even required, but I think that has been covered sufficiently, in the first of these articles and in many other things I and other smarter people have written, to be taken as a given.)

In the remainder of this article, I’ll look at the current projects of this nature, their mechanisms, and their focus. In the final article I’ll consider future projects for extending the semantics of HTML.

I know not seems

The truth is that as yet, there is very little in the way of projects or commonly adopted approaches to developing richer semantics for HTML, despite the considerable interest, and number of words written on the subject. Even very useful “semantic” features of HTML, such as definition lists, are very underutilized (definition lists are also poorly named, as they are not, according to the HTML specification, simply for definitions, and many developers are not aware that each term may have more than one description). It’s also interesting, given the enormous effort that has gone into schemas, whether for data interchange (such as vCard and iCalendar) or for XML or SGML based languages, that little if any of the semantics from these efforts has in any way influenced the way web designers and developers mark up their code. This is probably a function of the relative isolation of web developers from developments in those fields, and also the reverse. In short, there seems to be little dialogue between web practitioners, and practitioners in related but only slightly different fields such as these, to everyone’s detriment.

So, what projects or approaches exist to bridge these gaps, or to otherwise help develop richer semantics for HTML? I’m really only aware of one that has made any significant headway – microformats. Certainly a good many people have talked largely informally of similar projects, but microformats is the only one to have delivered widely used tools for extending the semantics of HTML. There are many places to learn more about what microformats are, and how to use them, starting with the home of microformats, and including my soon to be released book on the subject. In this article I am more interested in considering why microformats have been successful, what general mechanisms (or “design patterns”) they use, which may be useful in similar projects, and the scope of semantics they focus on. All these will be very useful in thinking about parallel projects to similarly augment the semantics of HTML, in areas that microformats do not reach.


An approach to solving a certain kind of problem

I think there are several factors in the success of microformats. Some of them are certainly personal – highly respected and connected “founders” like Tantek Çelik, Eric Meyer and Matt Mullenweg, and close associations with a search engine company that could immediately bring “network effect” payoffs by implementing search services based on microformats like rel-tag. But leaving those aside, the project did and does a great deal right.

Microformats:

  • restrict themselves in their focus – only looking to solve actual problems, rather than theoretical problems – so they are useful for a large number of developers right away
  • look to solve the smallest possible subset of a problem that is meaningfully useful, rather than an entire problem space from the get-go
  • reuse existing schemas, and other standards, both formal and informal, within and without the web space, leveraging other people’s hard work

In short, they help developers do what they are already doing, in a slightly more standardized way.

Categories of semantics with Microformats

In the first part of these articles, I attempted to categorize the kinds of semantics that are built into HTML. I argued that there are three kinds of semantics in HTML:

  • document structure – for example headings and paragraphs
  • content (or data) – for example code or address
  • rhetorical – for example strong and em

What kind of semantics do microformats concern themselves with? All or some of these? None of these? New categories of semantics?

In my opinion, and I’ll happily be corrected on this, none of the current microformats concern themselves with document structure. Nor could any current microformats be considered “rhetorical”. Certainly, most microformats fall into the category of content semantics – formats like hCard, hCalendar, hResume and hReview provide us with a vocabulary for marking up specific kinds of content more consistently, and with more meaning than straight HTML – for example contact details, events, resumes and reviews.

But what about formats like rel-tag and rel-license? It’s a truism that they mark up content, but the nature of the content they mark up is interesting. These microformats allow us to express the relationships between documents, or what these documents represent. rel-tag captures the idea that the page being linked to is a tag for the page containing the link, while rel-license captures the idea that the linked page is a license for the page containing the link. Other rel (and rev) based microformats like Votelinks and XFN also capture similar information. This category of semantics might be termed “relational semantics”. In fact, HTML does have built in relational semantics, which I neglected in the first article. HTML provides mechanisms for expressing these semantics, through the link element, and the rel attribute, but for the most part it leaves the provision of actual vocabularies (with the exception of a small vocabulary of values for the rel and rev attributes) as “an exercise for the reader.” Microformats take this mechanism, and use it to provide much richer, open ended semantics for expressing the relationships between documents, and the things they represent (for example people, via XFN).
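As a rough illustration of how software might read these relational semantics, the following Python sketch collects the rel values found on links in a fragment of HTML. The markup, the example.com and example.org URLs, and the RelCollector class are all made up for this example; it is one possible way to read rel values, not how any particular microformats tool works.

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    """Collects (rel value, href) pairs from a and link elements."""

    def __init__(self):
        super().__init__()
        self.relations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel")
        if not href or not rel:
            return
        # rel holds a space-separated list of relationship names
        for value in rel.split():
            self.relations.append((value.lower(), href))


# Hypothetical page fragment, invented for this sketch.
page = """
<p>Filed under
  <a href="http://example.com/tags/semantics" rel="tag">semantics</a>.
  Published under a
  <a href="http://example.com/licenses/by" rel="license">open license</a>.
  Written with <a href="http://example.org/jane" rel="friend met">Jane</a>.
</p>
"""

collector = RelCollector()
collector.feed(page)
relations = collector.relations

for rel, href in relations:
    print(f"this page --{rel}--> {href}")
```

Each pair says something about how the linked document relates to the current one: the first link is a tag for this page, the second is its license, and the third (using two XFN values at once) describes the author's relationship to a person.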

Given that microformats have reached a fair degree of stability, both in terms of specific formats, and also in terms of principles, and processes, it’s perhaps possible to conjecture that as a function of these principles and processes, microformats focus on these two categories of semantics – relational, and content. Indeed, it may be that they focus on a subset of one or both of these. This, by the way, is far from a criticism of the project, in fact, it is very much in keeping with its “stay focussed” philosophy, and is quite possibly another factor in the success of the project overall.

But, it leaves open the question, how might we extend the structural and rhetorical semantics of HTML, if these are not a focus of the microformats project? I would argue that as yet there really aren’t parallel projects in these areas, but there are however efforts underway which may ultimately enrich the semantics of HTML in these categories. These projects will be the focus of the third and final article in the series.

Design Patterns

Before leaving the subject of microformats, I’d like to take a look at some of the fundamental aspects of that project which I think will be very valuable for any similar projects. As the efforts to develop microformats have matured, a number of techniques for solving elemental problems which seem to recur frequently when trying to create richer semantics for HTML have emerged and been codified. Termed “design patterns”, these aren’t in themselves microformats, but are used frequently as building blocks for microformats. Two which I think will be extremely useful in similar projects to augment the semantics of HTML are the “class design pattern” (and by analogy, a possible novel pattern the “id class pattern”), and the “ABBR design pattern”.

Class Design Pattern

The class design pattern is a standard way to add semantics to an element in HTML. It simply codifies, and names, a long standing existing pattern of developer behavior. For example, hCard uses this design pattern to mark up some text as the country name in an address, like this:

<span class="country-name">Australia</span>

(the value “country-name” comes from the vCard schema)
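To show why agreed class values matter to software, here is a small Python sketch that pulls text out of markup by class name. The contact details and the ClassTextCollector class are invented for illustration, and the sketch assumes reasonably well-formed markup; the class values themselves (vcard, fn, adr, locality, country-name) are taken from hCard.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked.
VOID_ELEMENTS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextCollector(HTMLParser):
    """Maps each class value to the text of the elements that carry it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one (class values, text chunks) entry per open element
        self.found = {}  # class value -> list of text values

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((classes, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _classes, chunks in self.stack:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or not self.stack:
            return
        classes, chunks = self.stack.pop()
        text = " ".join("".join(chunks).split())  # collapse whitespace
        for name in classes:
            self.found.setdefault(name, []).append(text)


# Hypothetical contact details, invented for this sketch.
card = """
<div class="vcard">
  <span class="fn">Jane Citizen</span>
  <div class="adr">
    <span class="locality">Sydney</span>,
    <span class="country-name">Australia</span>
  </div>
</div>
"""

collector = ClassTextCollector()
collector.feed(card)
found = collector.found

print(found["fn"])
print(found["country-name"])
```

Because the vocabulary is shared, the same few lines work on any page that follows it; with the ad hoc class values found in the survey, each site would need its own extraction rules.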

Whatever other projects spring up for augmenting the rhetorical, structural, or other categories of semantics in HTML, this design pattern is likely to be very important. In the final article, we’ll look at how in particular the HTML5 project looks to move this process of augmenting HTML semantics into the language itself, by creating a number of new elements such as <footer>. I’d argue that this is a very bad idea and that the class design pattern is clearly a much better solution. But that’s a subject for the next article.

ABBR Design Pattern

The ABBR Design Pattern has evolved to enable the markup of information in a form that is simultaneously human and machine friendly. Dates, for example, are notoriously ambiguous and difficult for software to parse, given that we humans write them in so many different ways. May 11 2007, 5.11.2007, and 11.5.2007 may all be the same day, or different days, and that’s in English alone. Now, there are standardized ways of expressing dates and times, in particular the ISO 8601 date time format, adopted by the W3C for the web. It’s unambiguous – for example the date above is written 20070511 [1] – but hardly human friendly. Rather than force software to try to understand any number of date time formats and human languages, or force all people to adopt a format for the benefit of software, the ABBR design pattern is designed to allow both to be encoded at once. It uses the abbr element of HTML, and puts the human readable version of the content as the content of the element, and the machine readable version as the value of the title attribute.
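The ambiguity is easy to demonstrate. In this short Python sketch (the variable names are mine, chosen for illustration), the same written date parses to two different days depending on which convention the software assumes, while the ISO 8601 form has only one reading.

```python
from datetime import datetime

written = "5.11.2007"

# The same text read under two different human conventions.
month_first = datetime.strptime(written, "%m.%d.%Y").date()
day_first = datetime.strptime(written, "%d.%m.%Y").date()

# The ISO 8601 basic format leaves no room for interpretation.
iso = datetime.strptime("20070511", "%Y%m%d").date()

print(month_first.isoformat())
print(day_first.isoformat())
print(iso.isoformat())
```

Software has no way of knowing which of the first two readings the author intended without extra information, which is exactly the gap the machine readable title value fills.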

For example, the date we just saw could be marked up as

<abbr title="20070511" lang="en">May 11th 2007</abbr>

Again, for a certain kind of semantics, this time most likely content semantics, the ABBR design pattern provides a mechanism for augmenting HTML’s existing semantics.
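Here is a minimal Python sketch of how software might consume markup written with this pattern, reading the machine readable date from the title attribute while keeping the human readable text alongside it. The AbbrDateCollector class and the sample sentence are hypothetical, and the sketch assumes dates are written in the same compact year, month, day form used above.

```python
from datetime import date, datetime
from html.parser import HTMLParser


class AbbrDateCollector(HTMLParser):
    """Pairs the human readable text of each abbr with the date in its title."""

    def __init__(self):
        super().__init__()
        self.current_title = None  # title of the abbr being read, if any
        self.chunks = []
        self.dates = []  # list of (human readable text, date) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self.current_title = dict(attrs).get("title")
            self.chunks = []

    def handle_data(self, data):
        if self.current_title is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "abbr" or self.current_title is None:
            return
        try:
            machine = datetime.strptime(self.current_title, "%Y%m%d").date()
        except ValueError:
            machine = None  # an ordinary abbreviation, not a date
        if machine is not None:
            self.dates.append(("".join(self.chunks).strip(), machine))
        self.current_title = None


# Hypothetical sentence, invented for this sketch.
page = (
    '<p>Posted on <abbr title="20070511" lang="en">May 11th 2007</abbr> '
    'to the <abbr title="World Wide Web Consortium">W3C</abbr> list.</p>'
)

collector = AbbrDateCollector()
collector.feed(page)
dates = collector.dates

for human, machine in dates:
    print(f"{human!r} means {machine.isoformat()}")
```

Note that the ordinary abbreviation is skipped, because its title does not parse as a date. A reader sees “May 11th 2007”, and software gets an unambiguous date, from the same element.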

The wrap

In this article, I’ve been concerned with how the semantics of HTML are currently being augmented, or enriched, both in ad hoc ways, through the anarchic practice of developers adding class values to their HTML markup with little if any conformity, and in more standardized ways, in particular the microformats approach. Microformats, as we saw, focus on two particular categories of semantics, content and relational. But in this and the previous article, I argued that there are other categories of semantics in HTML – structural, and rhetorical. In the next and final article in the series, I’ll look into a number of projects to augment the semantics of HTML at a language level, and also outline a project similar to microformats, but focussed on these other semantic categories.


1: Yes, I know that in the U.S. and other places May 11 is 5.11, while in the U.K. and elsewhere it is 11.5, but I am sure you get my point.
