Two languages are commonly used to express data, XML and JSON, but they are very different, and each is better suited to some applications than to others. Some folks get emotional about one or the other, and insist that one should be banned and that the other solves all problems. This is exactly like saying that a screwdriver is better than a hammer, or vice versa. It’s really a matter of using the right tool for the right job. So, let’s take a quick look at these two data languages, and understand what they’re meant for.
The eXtensible Markup Language (XML) is used for exchanging human-readable documents in electronic form. We learn right from the name that XML is a markup language, which means that it provides tools for inserting annotations, called markup, into ordinary human-readable text. The annotations “mark” boundaries within the text, and tell us things about the text between the boundaries that might not be obvious—at least, not obvious to a machine. Another markup language in wide use is the HyperText Markup Language (HTML), where markup is used primarily to indicate how text should be rendered on a Web page for a human reader. In contrast, the proper use of XML is to indicate what the text means, so that separate specifications can indicate how text should be rendered based on what it means. This enables one set of text to be rendered differently for different audiences and viewing devices.
Figure 1 shows a snippet of an XML document. The names enclosed in angle brackets are called tags, and constitute the markup of what is otherwise plain text. Most tags come in pairs with text between the start tag and end tag, and the whole construction is called an element. For example, in Figure 1 the plain text Chapter 1 is surrounded by the start tag <title> and the end tag </title>. Elements can nest. For example, the Chapter 1 title is nested inside a <chapter> element. The same <chapter> element also contains two <para> elements. The <chapter> element is in turn nested inside the <book> element.
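To make the nesting concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The document in it is hypothetical: it follows the shape described above (a book containing a chapter with a title and two paragraphs), but the paragraph text and the describe helper are invented for illustration and are not the original figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the one described in the text;
# the paragraph wording is invented for illustration.
BOOK_XML = """<book>
  <chapter>
    <title>Chapter 1</title>
    <para>First paragraph of the chapter.</para>
    <para>Second paragraph of the chapter.</para>
  </chapter>
</book>"""


def describe(element, depth=0):
    """Return one indented line per element, showing how elements nest."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"<{element.tag}> {text}".rstrip()]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)
outline = describe(root)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of element nesting.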
JavaScript Object Notation (JSON) is a language for expressing data rather than for marking up text. See below for an example of a JSON text. Text enclosed in curly braces expresses the value of an object, while text enclosed in square brackets expresses the value of an array. Within an object, each component has a name, followed by a colon, followed by a value. Within an array, nameless values follow each other in a list. Those nameless values may themselves be objects, however. For instance, within the phoneNumbers array, there are two nameless objects, each of which consists of two name/value pairs.
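As a rough sketch of how a program sees such a text, the Python below parses a hypothetical JSON text with the standard json module. Only the phoneNumbers array, with its two objects of two name/value pairs each, comes from the description above; the member names and values are assumptions made up for illustration.

```python
import json

# Hypothetical JSON text. Only the phoneNumbers array holding two objects
# with two name/value pairs each is taken from the description; the names
# and values themselves are invented.
PERSON_JSON = """
{
  "name": "Pat Example",
  "phoneNumbers": [
    {"type": "home", "number": "555-0100"},
    {"type": "mobile", "number": "555-0101"}
  ]
}
"""

person = json.loads(PERSON_JSON)

# A JSON object becomes a dict: values are looked up by name.
phone_numbers = person["phoneNumbers"]

# A JSON array becomes a list: values are looked up by position.
first_phone = phone_numbers[0]

print(type(person).__name__, type(phone_numbers).__name__)
for entry in phone_numbers:
    print(entry["type"], entry["number"])
```

The named components of an object are reached by name, and the nameless values of an array by position.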
JSON is often compared to XML as a more efficient language with the same expressive power. This is not accurate. The confusion has arisen because, before JSON was available, XML was used heavily as a data interchange language, even though its original design intent was that it be used as a markup language. As a data language, where an annotation’s position within human-readable text is irrelevant, XML is horribly inefficient because of all those end tags, where the element name is repeated with a slash in front of it. The result can be an XML document many times larger than the data it is carrying.
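One way to see the overhead is to serialize the same data both ways and count characters. The sketch below does that in Python for a couple of hypothetical records; the records and the to_xml and to_json helpers are invented for illustration, and the exact ratio will vary with the data and with how the XML is laid out.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration.
records = [
    {"id": "1", "city": "Oslo"},
    {"id": "2", "city": "Lima"},
]


def to_xml(rows):
    """Serialize rows as nested elements, one child element per value."""
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for name, value in row.items():
            ET.SubElement(rec, name).text = value
    return ET.tostring(root, encoding="unicode")


def to_json(rows):
    """Serialize rows as compact JSON with no optional whitespace."""
    return json.dumps(rows, separators=(",", ":"))


xml_text = to_xml(records)
json_text = to_json(records)

# Characters of actual data carried, ignoring all names and punctuation.
payload_chars = sum(len(value) for row in records for value in row.values())

print("payload:", payload_chars)
print("XML:    ", len(xml_text))
print("JSON:   ", len(json_text))
```

With short values like these, the repeated start and end tags make up most of the XML. JSON repeats member names too, but only once per value and without a closing tag.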
A JSON text might include human-readable text as data, but not marked-up text in the same sense as XML. It would be an odious task to adapt JSON for marking up text, because JSON does not preserve the order of name/value pairs in an object. In case you haven’t noticed, important order is in natural language. Just ask Yoda.
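A small Python sketch of the ordering point, using only the standard library. The JSON specification describes an object as an unordered collection of name/value pairs; Python's parser happens to keep the input order, but comparisons ignore it. The sample values and the emph tag name are assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Two JSON texts that differ only in the order of their name/value pairs.
a = json.loads('{"first": "Luke", "last": "Skywalker"}')
b = json.loads('{"last": "Skywalker", "first": "Luke"}')
same_object = (a == b)  # member order carries no meaning

# Array order, by contrast, is significant.
same_array = (json.loads("[1, 2]") == json.loads("[2, 1]"))

# Hypothetical marked-up sentence: the <emph> element sits inside running
# text, so the position of the markup matters.
sentence = ET.fromstring("<para>Important <emph>order</emph> is.</para>")

# itertext() yields the text fragments in document order.
fragments = list(sentence.itertext())

print(same_object, same_array)
print(fragments)
```

The two objects compare equal and the two arrays do not, while the XML fragments come back in exactly the order in which they were written.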
Both XML and JSON perpetuate terminological confusion by their use of the terms attribute (XML) and object (JSON).
In XML, an attribute is a compact way of associating a simple string value with an element, without that value being considered part of the element itself. But from a data-theoretic point of view, an element’s value is just as much a data attribute of the element it’s nested within, as an attribute is a data attribute of the element to which it applies. We would have preferred some other term than attribute.
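The point can be sketched in a few lines of Python with xml.etree.ElementTree. The chapter element and its number attribute are hypothetical, invented for illustration. XML distinguishes the attribute from the nested element syntactically, yet seen as data both are simply named values that belong to the chapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical element carrying one value as an XML attribute and another
# as a nested element.
chapter = ET.fromstring(
    '<chapter number="1"><title>Chapter 1</title></chapter>'
)

number = chapter.get("number")       # value of the XML attribute
title = chapter.find("title").text   # text of the nested element

# Viewed as data, both are attributes of the chapter in the general sense.
as_data = {"number": number, "title": title}
print(as_data)
```

Note that the attribute value is always a simple string, whereas the nested element could hold further elements of its own.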
What JSON calls an object is really a data structure. Properly speaking, an object is material, and occupies space. A computer object occupies space in a computer’s memory or storage. In contrast, a JSON object expresses a value that can be represented by the state of an object in a computer, or just by ink on paper.
These two overloads of the terms attribute and object help keep the entire computer industry from breaking down the barriers between data, semantics, and software. My book, NoSQL and SQL Data Modeling, explains how to escape this confusion.
Which Should I Use?
It really gets quite simple. For marking up human-readable text so a machine has access to fragments of meaning, and to prepare for rendering text in a variety of contexts, use XML. For exchanging data that is not to be embedded in human-readable text, use JSON.
You will find that there are robust ecosystems built around XML as a markup language and JSON as a data language. There are tools, analyzers, schemas for validation, user groups, and all sorts of resources to help you use each language in the best way possible. It’s not about which tool is better; it’s about which tool is fit for the purpose at hand.
This monthly blog talks about data architecture and data modeling topics, focusing especially, though not exclusively, on the non-traditional modeling needs of NoSQL databases. The modeling notation I use is the Concept and Object Modeling Notation, or COMN (pronounced “common”), and is fully described in my book, NoSQL and SQL Data Modeling (Technics Publications, 2016). See http://comn.dataversity.net/ for more information.
Copyright © 2017, Ted Hills