Semantics in HTML part I – Traditional HTML Semantics

  1. Part I – Traditional HTML Semantics
  2. Part II – Standardizing Vocabularies
  3. Part III – Directions in HTML Semantics

This is the first in a series of articles that aims to survey the issue of semantics in current web design and development (for the HTML-based web, not the “Semantic Web”). The goal is to better understand how semantics are currently added to today’s web, either “formally” (through the use of “semantic” HTML, or microformats) or less formally, through the use of common design patterns. In this article, I focus a little on precisely what the term “semantics” means, and then turn my attention to the nature of the “built in” semantics in HTML. Later articles will look at “data” semantics, such as those formalized by the microformats project, and at how pattern languages can help formalize the ad hoc semantics you find by analyzing page and site architecture, which I also spent a lot of time investigating about 12 months ago. In one sense, these articles extend the work I began on patterns in 2005, but they also represent a good deal of rethinking about where patterns fit in. Rather than seeing pattern languages as a relatively distinct aspect of thinking about designing and developing for the web, in this series I’ll bring a number of different approaches to building a better web under the big tent of semantics.

The increasing focus on semantics in web design and development over the last 4 or 5 years, while encouraging, is essentially trivial in the context of the rich, centuries-long tradition, in the fields of philosophy and linguistics, of studying how meaning is constructed through the use of language and other codes. While it is not necessary to have a full grasp of the issues and concepts in those fields in order to create better semantic content for the web, some further understanding may help clarify an area that is usually rather poorly defined.

Before we get to semantics on the web, let’s take a brief look at what the study of semantics is in the field of linguistics.

Just what is semantics anyway?

Semantics is the study of how language produces meaning, or more generally how meaning arises in a system of signs. The term also refers to the content carried by words or signs … when communicating through language. When applied to the web, and in particular HTML markup, typically semantics is used in the second sense – the content or meaning carried or conveyed by the markup itself (rather than the words marked up).

In the field of linguistics, semantics is most commonly juxtaposed with “syntax” – the study of the rules, or “patterned relations”, that govern the way words combine to form phrases and phrases combine to form sentences. In the context of programming and markup languages, syntax is essentially the set of rules which govern how valid expressions can be created. Taken together, syntax defines the rules that must be followed in order to create a valid document, while semantics defines the meaning associated with these syntactical constructions.
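
A small sketch may make the distinction concrete. The two fragments below are illustrative only (they are not taken from any specification). Both are syntactically valid inside the body of an HTML 4.01 document, but only the second uses an element whose defined meaning is "heading":

```html
<!-- Valid syntax, but the markup only says "a paragraph containing bold text" -->
<p><b>Traditional web semantics</b></p>

<!-- Equally valid syntax, and the element itself carries the meaning "second-level heading" -->
<h2>Traditional web semantics</h2>
```

A validator would accept both fragments, since validation checks syntax. Whether the markup conveys the intended meaning is a separate question, and it is the one this series is concerned with.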

Within linguistics and philosophy, semantics has a rich tradition, going back literally millennia to Aristotle. Like any scientific or philosophical area of research, it is not necessarily easy to gain a good high-level understanding of, nor are papers and other publications in the field necessarily straightforward to read. Let me say that I am very far from an expert in this area, though it has been an interest of mine, on and off, for over two decades. Ideas from the study of semantics informed both SGML and HTML, as we’ll now see.

Semantics in SGML and HTML

SGML, as most readers will be aware, is a markup language for creating other markup languages (and so is sometimes referred to as a “meta markup language”). HTML is (to cut a long story short, and leave out a number of areas of debate) one of the very many markup languages created with SGML, and by far the most widely used. SGML itself has a syntax, which all applications of it (that is, all languages created with it) must follow, but in itself (to gloss over a couple of issues that aren’t relevant to this discussion) it has no semantics. The semantics of SGML-based languages come at the application level – that is, they are defined for those languages specifically. This is where HTML gets (most of) its semantics from (we’ll see why I say “most of” a little later).

So, where do we find the semantics of HTML, in other words, where can we learn what the constructs of the HTML language “mean”? Here we must turn to the HTML Specification, which defines both the set of rules that a valid HTML document must follow (its syntax) and the meaning of elements and attributes (its semantics).

Here’s a quick example of how the syntax is defined, in this case for headings (the intricacies of the syntax of DTDs and how to read them are beyond the scope of this article, but there is a good article at the W3C about this).

<!ENTITY % heading "H1|H2|H3|H4|H5|H6">
<!--
There are six levels of headings from H1 (the most important)
to H6 (the least important).
-->

<!ELEMENT (%heading;)  - - (%inline;)* -- heading -->
<!ATTLIST (%heading;)
%attrs;                              -- %coreattrs, %i18n, %events -->

The semantics are defined in the specification after the syntax from the DTD is listed. In the case of headings, we read “A heading element briefly describes the topic of the section it introduces”, and in the case of paragraphs, “The P element represents a paragraph”.

What should be immediately obvious is that while we can quickly understand what heading or paragraph elements are meant for, they are semantically very loosely defined (unlike the syntax of HTML, which is very strictly defined). This reflects natural language, where meaning is much more fluid than syntax. It also gives us a lot of semantic flexibility (after all, the Wikipedia entry for “paragraph” runs to nearly 1000 words). But such looseness can give rise to issues of interpretation and appropriate use, which Dan Cederholm’s SimpleQuiz (referred to below) illustrates very well. Another example of the issues which can arise from this looseness of definition can be found in this discussion of the appropriate use of the rel attribute of links and anchors.

These “built in” semantics of HTML elements (and, to a lesser extent, attributes), as I refer to them, have been the main source of what we might call “traditional” semantic HTML. But HTML provides a small number of mechanisms for extending its “built in” semantics, which we’ll turn to after we take a look at “traditional web semantics”, and then in detail in subsequent articles.

Traditional web semantics

Although HTML stems from the SGML markup tradition, the semantics of HTML were, and remain, lightweight. That said, the small number of mechanisms for extending the semantics of HTML documents, such as the class attribute and attributes such as rel, have been increasingly used over the last half a decade or so to enrich the semantics of web content and extend HTML’s reasonably limited built in semantics.
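
As a rough sketch of these two extension mechanisms: in the fragment below, the file name, the class name and the quoted sentence are all hypothetical, chosen only for illustration. The value "next" for rel, on the other hand, is one of the link types listed in the HTML 4.01 specification.

```html
<!-- rel="next" uses a link type listed in the HTML 4.01 specification;
     the href is a made-up file name -->
<a href="part-2.html" rel="next">Part II – Standardizing Vocabularies</a>

<!-- class="pull-quote" is a made-up name: HTML assigns it no meaning,
     so any meaning comes from a convention the author (or a community) adopts -->
<p class="pull-quote">Syntax is strictly defined; meaning is much more fluid.</p>
```

The difference between the two matters: a rel value can draw on a vocabulary the specification already names, while a class value means something only to those who share the convention behind it.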

Indeed the whole area, despite considerable interest among web developers, has seen very little theoretical work, even of a general kind, as evidenced by searches on the term “web semantics” turning up almost exclusively results associated with the “Semantic Web” project. You’ll find little of any depth when searching for “semantic HTML” either. Perhaps the area in which the most attention has been paid to the semantics of HTML has been the microformats project, which we’ll turn to in the next article. This lack of general semantic theory probably stems ultimately from the fact that web developers are an eminently practical bunch, and from the sense that this discussion is “theoretical”, with no great practical value. I’ll argue at the conclusion of this article that there is in fact considerable practical value in understanding web semantics in more depth.

Some of the best known efforts to address the area of “traditional web semantics” in a more detailed way include:
Jason Kottke’s “Standards don’t necessarily have anything to do with being semantically correct” (which possibly introduced the idea of “semantic correctness”, a commonly heard, instructive but somewhat misleading concept)
Douglas Bowman’s “on standards and semantics”, in response to Kottke’s article
Dave Shea’s “semantics and bad code”, also in response to Kottke’s piece

Perhaps the most significant body of work to emerge on the issue of semantics on the web is Dan Cederholm’s “SimpleQuiz”, a series of questions and associated discussions which attempted to elicit readers’ thoughts on semantic best practices in the use of HTML.

What’s interesting is that all these focus specifically on the semantics of HTML – that is, the use of the “built in” semantic aspects of HTML. It’s also interesting that they are well over three years old now.

Of course, work has been done since, and I will definitely be turning to that later, but for now, let’s look a little deeper into this aspect of semantics on the web.

The first question which comes to mind is: precisely what aspects of HTML might properly be termed “semantic”? Despite considerable effort, I’ve yet to find any kind of detailed survey of the built in semantics of HTML, so, as usual because I have little better to do, I’ve spent far too long going over the HTML specification looking for elements and attributes which might properly be termed “semantic”. This related document lists these “semantic” elements and attributes of HTML (with some exceptions I’ve outlined there).

My interest isn’t simply in cataloguing these features, although I hope that might be useful. What I hope to do over this series of articles is to ask whether there are different categories of semantics when it comes to the web, regardless of the source of these semantics. For example, in the case of the built in semantics of HTML, some elements, like headings and paragraphs, are associated with what you might call “document” or “structural” semantics. They play a role in the document independent of the content they mark up. It doesn’t matter what the content of a paragraph is; what matters is, in the words of Wikipedia, that the element is a “self-contained unit of a discourse in a written text dealing with a particular point or idea, or the words of a speaker”. But other elements, for example address, apply only to particular kinds of information – they are content specific semantic elements.

A third possible category presents itself, but I’m not entirely sure whether it is distinct from the first. Elements like em and strong are in a sense rhetorical, rather than structural or content markup. What do I mean by that? Just as with a paragraph, it’s not the content marked up as em or strong which determines that using this markup is appropriate (as would be the case for, say, cite or address). But clearly, to me at least, there is a difference between structural pieces of a document, like a paragraph, and emphasized text. So for now, I’ll propose a third classification of semantic markup, rhetorical, for this kind of element or attribute.

To recap, for elements, there appear to be three semantic classifications an element may belong to (and elements may arguably belong to more than one of these categories at the same time).

structural
The semantics of these elements are a function of the role they play in a document’s structure. Examples would be div, span, headings, lists, paragraphs
content
The semantics of these elements are a function of the type of content they mark up, rather than the role these elements play in a document’s structure. Examples would be abbr, address, code
rhetorical
The semantics of these elements are a function of the author’s rhetorical devices, rather than the role the elements play in a document, or the content they mark up in particular. Examples are em and strong
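
The three classifications can be sketched in a single fragment. The text, name and email address below are placeholders invented for illustration; the comments apply the categories proposed above.

```html
<!-- structural: the element's meaning comes from its role in the document -->
<h2>Contact</h2>
<p>
  <!-- rhetorical: marks the author's emphasis, whatever the words happen to be -->
  Please <em>do not</em> send attachments.
</p>

<!-- content: appropriate only because of what the text is (contact information) -->
<address>Example Author, author@example.com</address>
```

Swapping the words inside the h2, p or em would not make those elements any less appropriate, whereas address only makes sense if its content really is contact information.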

My full categorization is available here. In it I also look at, and attempt to classify, semantic HTML attributes. Attributes seem to fall into two categories – those associated with content semantics, and those whose role is to extend HTML’s built in semantics.

But is this really useful?

You might argue that this is simply philosophizing for its own sake, but I think that understanding the way in which semantics work, both the semantics built into HTML and other ways of adding richer semantics to the web (such as through standardized data formats like microformats), helps developers design and implement better sites. I’ve outlined some of the main benefits in a related article on pattern languages, so I won’t repeat them in detail here. But in short – developers benefit, users benefit, and software based services (like search engines, aggregators and so on) benefit.

For developers, understanding semantics:
helps solve complex design problems in better ways, more efficiently, while avoiding potential pitfalls, based on our own and others’ previous work
helps to facilitate larger development teams, which will become more common as the practice of designing and developing for the web continues to mature, by providing a common vocabulary and language to aid communication

Users benefit from richer, more consistent use of semantics because sites become more consistent, and so more learnable, while content becomes more readily findable. Think of the way any book you read follows structural conventions for chapters, and pages are consistently laid out, with page numbers, often section or chapter names visible on each page and so on.

Software discovery and use of information on the web will benefit dramatically from richer semantic markup, as evidenced by the already valuable services built around microformats, one aspect of web semantics we’ll look at in more detail in an upcoming article in this series.

In the next article in the series, we’ll look at two more sources of web semantics, microformats and “HTML compounds”, and consider the semantic classifications these belong to, or whether they belong to new and distinct classifications.

{ 5 } Comments

  1. milan | January 9, 2007 at 5:56 pm | Permalink

    Just a few thoughts about semantics and the web from a philosopher’s point of view. Semantics for me is a discipline which studies the relation between signs and their meanings. To say that something is semantic then means that it has something to do with the relation between signs and their meanings (or what they refer to). To say that HTML should be semantic is a bit weird, then. HTML is and always was semantic, in the sense that there were some relations between the signs it consists of and the things these signs should stand for. All HTML signs were made to carry some meaning, which means they’ve always been semantic in this sense. The problem is that these semantic relations have been ignored for a long time, which means the signs (markup) have been (and still are) used meaninglessly – e.g. the sign which should denote a table (…) often isn’t denoting any table at all. This is just the first level of HTML semantics though – the level which governs the relation between HTML and the document it describes, and which has been the main topic of the web standards movement (and which was connected with the fight for correct HTML syntax as well – “All tags have to be closed!”). However, this is not a problem of HTML but of the people who misuse it. To use semantically and syntactically correct HTML in this sense means to use it meaningfully.
    The problem with HTML is that it is a very primitive language – all it can talk about are documents. It has to be primitive, as it has to be understood by machines. On the other hand, documents are written in natural language, which is almost always about something much more complicated and diverse than just documents with their headings, paragraphs, etc. (in other words, it has much richer semantics than HTML). The downside of natural language is that it is so complicated that no machine can understand it. Now the problem is how to connect these two worlds – the world of simple documents and the real world of people, cars, unicorns and all those things we can talk and write about. The internet is about this rich world, not just about some documents, after all.
    There are two possible solutions to this problem – to create more intelligent machines, or to create a more appropriate language for describing internet content. The second option then consists of two suboptions – to create a new internet language or to improve the current one. The last way is obviously the easiest one to follow. If you improve HTML and you don’t break any of its old semantic and syntactic rules (you can do this by using some of its generic features, such as the “class” attribute), nobody will tell you that you are doing something wrong. That’s why microformats are being adopted so fast now. It’s not because they make HTML semantic, but because they add some more semantic rules to it; they make it more diverse, able to talk about more things than just documents. In other words, they make it more meaningful – to us and to machines too.
    I know, I didn’t say anything new and I didn’t mean to do so. What I’d like to suggest at the end is to begin to talk about “more meaningful” HTML (HTML which conveys more meaning) instead of “semantic” HTML because, as I said, HTML has been semantic from its beginning, but what it really lacks is the power to express more meanings.

  2. R. Brown | January 21, 2007 at 1:31 am | Permalink

    The semantic features of HTML don’t really seem like they have much to offer the Semantic Web. Microformats are a nice step towards a smarter web, and RSS has given us a glimpse of how many great applications there are for well-described data published in a common format. But it seems like things won’t really get rolling until more sites start publishing straight XML and delivering their XHTML pages through XSL transforms. For people who want to separate their content from style, this is the holy grail. Particletree did a good article on how XSLT fits into things here:

    http://particletree.com/features/4-layers-of-separation/

    The focus on the semantics of HTML is useful now chiefly in that it makes development cleaner and gives developers awareness of and desire for more robust semantic data that XML can provide. But I’m glad this wasn’t another article talking about how you need to be sure to use li and p everywhere — li ain’t going to make the World Brain.

  3. ezgi | May 18, 2007 at 1:28 pm | Permalink

    I am not sure we can speak of the “semantics” of HTML. In the literal sense, semantics is the arbitrary relation between the markup signs and objects. When we consider “HTML”, it is only the markup side of the concept of semantics.

  4. john | May 18, 2007 at 5:36 pm | Permalink

    @ezgi,

    conventionally, the term “semantics” in relation to HTML is very widely used, and in a way that I think is consistent with your definition. In this sense it refers to the association between the syntactic structures of HTML, and the meaning units that these structures model or convey.

    thanks

    john

  5. mattur | September 1, 2007 at 6:23 pm | Permalink

    What milan/ezgi said: There’s a difference between semantics as in syntax and semantics as in meaning. As you say, the term semantics is widely used in the web world – wrongly.

    s/semantic/structural
