- Part I – Traditional HTML Semantics
- Part II – Standardizing Vocabularies
- Part III – Directions in HTML Semantics
This is the first in a series of articles that surveys the issue of semantics in current web design and development (for the HTML based web, not the “Semantic Web”). The goal is to better understand how semantics are added to today’s web, either “formally” (through the use of “semantic” HTML, or microformats) or less formally, through commonly used design patterns. In this article, I focus a little on precisely what the term “semantics” means, and then turn my attention to the nature of the “built in” semantics in HTML. Later articles will look at “data” semantics, such as those formalized by the microformats project, and at how pattern languages can help formalize the ad hoc semantics you find by analyzing page and site architecture, which I also spent a lot of time investigating about 12 months ago.
In one sense, these articles extend the work I began on patterns in 2005, but they also represent a good deal of rethinking about where patterns fit in. Rather than seeing pattern languages as a relatively distinct aspect of thinking about designing and developing for the web, in this series I’ll bring a number of different approaches to building a better web under the big tent of semantics.
The increasing focus on semantics in web design and development over the last four or five years, while encouraging, is essentially trivial in the context of the rich, centuries-long tradition in philosophy and linguistics of studying how meaning is constructed through language and other codes. While it is not necessary to have a full grasp of the issues and concepts in those fields in order to create better semantic content for the web, some further understanding may help clarify an area that is usually rather poorly defined.
Before we get to semantics on the web, let’s take a brief look at what the study of semantics is in the field of linguistics.
Just what is semantics anyway?
Semantics is the
In the field of linguistics, semantics is most commonly juxtaposed with “syntax” –
Within linguistics and philosophy, semantics has a rich tradition, going back literally millennia to Aristotle. Like any scientific or philosophical area of research, it is not necessarily easy to get a good high level understanding of, nor are papers and other publications in the field necessarily straightforward to read. Let me say that I am very far from an expert in this area, though it has been an interest of mine on and off for over two decades. Ideas from the study of semantics informed both SGML and HTML, as we’ll now see.
Semantics in SGML and HTML
SGML, as most readers will be aware, is a markup language for creating other markup languages (and so is sometimes referred to as a “meta markup language”). HTML is (to cut a long story short, and leave out a number of areas of debate) one of the very many markup languages created with SGML, and by far the most widely used. SGML itself has a syntax, which all applications of it (that is, all languages created with it) must follow, but in itself (to gloss over a couple of issues that aren’t relevant to this discussion) it has no semantics. The semantics of SGML based languages come at the application level, that is, they are defined for each language specifically. This is where HTML gets (most of) its semantics from (we’ll see why I say “most of” a little later).
So, where do we find the semantics of HTML, in other words, where can we learn what the constructs of the HTML language “mean”? Here we must turn to the HTML Specification, which defines both the set of rules that a valid HTML document must follow (its syntax) and the meaning of elements and attributes (its semantics).
Here’s a quick example of how the syntax is defined, in this case for headings (the intricacies of the syntax of DTDs and how to read them are beyond the scope of this article, but there is a good article at the W3C about this).
    <!ENTITY % heading "H1|H2|H3|H4|H5|H6">
    <!--
      There are six levels of headings from H1 (the most important)
      to H6 (the least important).
    -->
    <!ELEMENT (%heading;) - - (%inline;)* -- heading -->
    <!ATTLIST (%heading;)
      %attrs;    -- %coreattrs, %i18n, %events --
      >
The semantics are defined in the specification after the syntax from the DTD is listed. In the case of headings, we read “A heading element briefly describes the topic of the section it introduces”; in the case of paragraphs, “The P element represents a paragraph”.
What should be immediately obvious is that while we can quickly understand what heading or paragraph elements are meant for, they are semantically very loosely defined (as opposed to syntactically: the syntax of HTML is very strictly defined). This reflects natural language, where meaning is much more fluid than syntax. It also gives us a lot of semantic flexibility (after all, the Wikipedia entry for “paragraph” runs to nearly 1000 words). But such looseness can give rise to issues of interpretation and appropriate use, which Dan Cederholm’s SimpleQuiz (referred to below) illustrates very well. Another example of the issues which can arise from this looseness of definition can be found in this discussion of the appropriate use of the rel attribute of links and anchors.
These “built in” semantics of HTML elements (and, to a lesser extent, attributes) have been the main source of what we might call “traditional” semantic HTML. But HTML also provides a small number of mechanisms for extending its “built in” semantics, which we’ll turn to after we take a look at “traditional web semantics”, and then in detail in subsequent articles.
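To make the contrast between strict syntax and loose semantics concrete, here is a minimal sketch of an HTML 4.01 Strict document. The text content, the title, and the class name section-title are my own illustrative assumptions, not taken from the specification. Both halves of the body satisfy the DTD, yet only the first uses an element whose defined meaning is “heading”.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Headings: syntax versus semantics</title>
</head>
<body>
  <!-- Semantically appropriate: H2 is defined as a heading. -->
  <h2>Opening hours</h2>
  <p>We are open from nine to five on weekdays.</p>

  <!-- Equally valid syntax, but the markup says "paragraph".
       The class name is a hypothetical styling hook; the
       specification assigns it no meaning. -->
  <p class="section-title">Opening hours</p>
  <p>We are open from nine to five on weekdays.</p>
</body>
</html>
```

A validator should accept both versions. Only a human reader, or software that relies on heading elements to build an outline, would notice the difference, which is why “appropriate use” is a matter of interpretation rather than of rules.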
Traditional web semantics
Although HTML stems from the SGML markup tradition, the semantics of HTML were, and remain, lightweight. That said, the small number of mechanisms for extending the semantics of HTML documents, such as the class attribute and attributes such as rel, have been increasingly used over the last half a decade or so to increase the semantic nature of web content, and extend HTML’s reasonably limited built in semantics.
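As a rough illustration of these two extension mechanisms, the sketch below uses rel and class in an otherwise ordinary document. The link types next and copyright are among those listed in the HTML 4.01 specification; the file names, the person’s name, and the class names byline and author are hypothetical, chosen only for this example.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Extending semantics with class and rel</title>
  <!-- rel on LINK: "next" and "copyright" are link types
       listed in the HTML 4.01 specification. -->
  <link rel="next" href="part-2.html">
  <link rel="copyright" href="copyright.html">
</head>
<body>
  <!-- class: "byline" and "author" are hypothetical names an
       author or community might agree on; HTML itself gives
       them no meaning. -->
  <p class="byline">Written by <span class="author">A. N. Author</span></p>

  <!-- rel on A describes the relationship of the linked
       document to this one, just as it does on LINK. -->
  <p><a rel="next" href="part-2.html">Continue to Part II</a></p>
</body>
</html>
```

The point of the sketch is that the extra meaning lives in attribute values rather than in new elements, so it only becomes useful when authors and software agree on what those values mean.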
Indeed the whole area, despite considerable interest among web developers, has seen very little theoretical work, even of a general kind. This is evidenced by searches on the term “web semantics” turning up almost exclusively results associated with the “Semantic Web” project. You’ll also find little of any depth when searching for “semantic HTML”. Perhaps the area in which the most attention has been paid to the semantics of HTML has been the microformats project, which we’ll turn to in the next article. This lack of general semantic theory probably ultimately stems from the fact that web developers are an eminently practical bunch, and from the sense that this discussion is “theoretical” with no great practical value. I’ll argue at the conclusion of this article that there is in fact considerable practical value in understanding web semantics in more depth.
Some of the best known efforts to address the area of “traditional web semantics” in a more detailed way include
- Jason Kottke’s “Standards don’t necessarily have anything to do with being semantically correct” (which possibly introduced the idea of “semantic correctness”, a commonly heard, instructive but somewhat misleading concept)
- Douglas Bowman’s “on standards and semantics”, in response to Kottke’s article
- Dave Shea’s “semantics and bad code”, also in response to Kottke’s piece
Perhaps the most significant body of work to emerge on the issue of semantics on the web is Dan Cederholm’s “SimpleQuiz”, a series of questions and associated discussions which attempted to elicit readers’ thoughts as to semantic best practices with the use of HTML.
What’s interesting is that all these focus specifically on the semantics of HTML – that is, the use of the “built in” semantic aspects of HTML. It’s also interesting that they are well over three years old now.
Of course, work has been done since, and I will definitely be turning to that later, but for now, let’s look a little deeper into this aspect of semantics on the web.
The first question which comes to mind is, precisely what aspects of HTML might be properly termed “semantic”? Despite considerable effort, I’ve yet to find any kind of detailed survey of the built in semantics of HTML, so, as usual because I have little better to do, I’ve spent far too long going over the HTML specification looking for elements and attributes which might properly be termed “semantic”. This related document lists these “semantic” elements and attributes of HTML (with some exceptions I’ve outlined there).
My interest isn’t simply in cataloguing these features, although I hope that might be useful. What I hope to do over this series of articles is to ask whether there are different categories of semantics when it comes to the web, regardless of the source of these semantics. For example, in the case of the built in semantics of HTML, some elements, like headings and paragraphs, are associated with what you might call “document” or “structural” semantics. They play a role in the document independent of the content they mark up. It doesn’t matter what the content of a paragraph is; what matters is, in the words of Wikipedia, that the element is a “self-contained unit of a discourse in a written text dealing with a particular point or idea, or the words of a speaker”. But other elements, for example address, apply only to particular kinds of information: they are content specific semantic elements. A third possible category presents itself, but I’m not entirely sure whether it is distinct from the first. Elements like strong are in a sense rhetorical, rather than structural or content markup. What do I mean by that? Just as with a paragraph, it’s not the content marked up as strong which determines that using this markup is appropriate (as would be the case for, say, address). But clearly, to me at least, there is a difference between structural pieces of a document like a paragraph, and emphasized text. So for now, I’ll propose a third classification of semantic markup, rhetorical, for this kind of element or attribute.
To recap, for elements, there appear to be three semantic classifications an element may belong to (and elements may arguably belong to more than one of these categories at the same time).
- Structural: the semantics of these elements are a function of the role they play in a document’s structure. Examples would be headings and paragraphs.
- Content: the semantics of these elements are a function of the type of content they mark up, rather than the role these elements play in a document’s structure. An example would be address.
- Rhetorical: the semantics of these elements are a function of the author’s rhetorical devices, rather than the role the elements play in a document, or the content they mark up in particular. An example is strong.
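The three classifications can be seen side by side in a short sketch like the one below. The wording of the page, the team name, and the example.com address are placeholders of my own; the comments show how I would assign each element to a category, which is a judgment call rather than anything the HTML specification states.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Three kinds of built in semantics</title>
</head>
<body>
  <!-- Structural: H1 and P describe a role in the document,
       whatever the content happens to be. -->
  <h1>Contacting the team</h1>
  <p>
    Please write to us
    <!-- Rhetorical: STRONG marks the author's emphasis. -->
    <strong>before</strong> the end of the month.
  </p>

  <!-- Content: ADDRESS is only appropriate when the content
       is contact information. -->
  <address>
    Example Web Team,
    <a href="mailto:team@example.com">team@example.com</a>
  </address>
</body>
</html>
```

Swapping the paragraph text for any other sentence leaves the h1, p and strong markup just as appropriate, whereas address stops being appropriate the moment its content is no longer contact information.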
My full categorization is available here. In it I also attempt to classify semantic HTML attributes. Attributes seem to fall into two categories: those associated with content semantics, and those whose role is to extend HTML’s built in semantics.
But is this really useful?
You might argue that this is simply philosophizing for its own sake, but I think that understanding the way in which semantics works, both the semantics built into HTML and other ways of adding richer semantics to the web (such as through standardized data formats like microformats), helps developers design and implement better sites. I’ve outlined some of the main benefits in a related article on pattern languages, so I won’t repeat them in detail here. But in short: developers benefit, users benefit, and software based services (like search engines, aggregators and so on) benefit.
For developers, understanding semantics
- helps solve complex design problems in better ways, more efficiently, while avoiding potential pitfalls, based on our own and others’ previous work
- helps to facilitate larger development teams, which will become more common as the practice of designing and developing for the web continues to mature, by providing a common vocabulary and language to aid communication
Users benefit from richer, more consistent use of semantics because sites become more consistent, and so more learnable, while content becomes more readily findable. Think of the way any book you read follows structural conventions for chapters, and pages are consistently laid out, with page numbers, often section or chapter names visible on each page and so on.
Software discovery and use of information on the web will benefit dramatically from richer semantic markup, as evidenced by the already valuable services built around microformats, one aspect of web semantics we’ll look at in more detail in an upcoming article in this series.
In the next article in the series, we’ll look at two more sources of web semantics, microformats and “HTML compounds”, and consider the semantic classifications these belong to, or whether they belong to new and distinct classifications.
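As a small preview of what such software looks for, here is a sketch of contact details marked up with the hCard microformat. The class names vcard, fn, org and url come from hCard; the person, organization and URL are placeholders I have made up for illustration.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Contact details as an hCard</title>
</head>
<body>
  <!-- "vcard" marks the root of the hCard; the nested class
       names label the formatted name, organization and URL. -->
  <div class="vcard">
    <span class="fn">A. N. Author</span>,
    <span class="org">Example Web Team</span>,
    <a class="url" href="http://www.example.com/">www.example.com</a>
  </div>
</body>
</html>
```

To a browser this is just a div with some spans, but a tool that knows the hCard conventions could, in principle, pull out a name, an organization and a web address without any guesswork.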
- What we say here about SGML and HTML applies pretty much equally well to XML and XHTML respectively.
- “Correct” is typically used in a binary sense: either something is correct or it is incorrect. This makes sense in the context of syntax, with its restrictive, exact rules, but far less so with semantics. I’d suggest semantically “appropriate” as a more useful term.