codedread

The Road To XML is Paved with Good Intentions

For those of you who don't know what XML is, I'll start by saying XML is possibly one of the worst acronyms ever. It stands for "eXtensible Markup Language" and not only does this acronym illegally steal the ever-cool letter 'X' for its own deviant purposes, it turns out that XML is not even a language, it's a syntax.

A syntax is a set of rules that describe how components in a language are to be assembled. But a language is a particular set of rules (involving a syntax, a grammar and more) that allow someone to convey/understand meaning or context or ideas.

I'll try to give an example. The syntax of all Western languages is:

- a document consists of a series of one or more paragraphs
- a paragraph consists of a series of one or more sentences
- a paragraph ends with a line break
- a sentence consists of a series of one or more words
- a sentence ends with a period, exclamation point or question mark
- within a sentence, words are separated by whitespace or punctuation
- a word consists of a series of one or more letters or digits
- a letter is one of the following values: {a, b, c, ...., x, y, z}
- a digit is one of the following values: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

I've probably made some gross generalizations by including "all Western languages" here, so if I have, please forgive me, but many languages follow the above syntax (including English, French, German, Latin, etc). However, the syntax itself is not enough to convey meaning or ideas because the meaning comes from lower-level rules such as the individual meaning attached to each word, the different types of words (verbs, nouns, adjectives, etc) and the grammar (i.e. the order in which words of different types are arranged and the effect that order has on the meaning).
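To make the point concrete, here is a minimal sketch in Python that checks a document against a simplified version of the rules above (ASCII letters only, with a small set of punctuation that I picked for illustration). The names WORD, SENTENCE, PARAGRAPH and is_syntactically_valid are made up for this example. Note that it happily accepts text that follows the syntax but means nothing.

```python
import re

# One word: a series of one or more letters or digits.
WORD = r"[A-Za-z0-9]+"
# One sentence: words separated by whitespace or punctuation, ending in . ! or ?
SENTENCE = rf"{WORD}(?:[\s,;:'-]+{WORD})*[.!?]"
# One paragraph: one or more sentences separated by spaces.
PARAGRAPH = re.compile(rf"{SENTENCE}(?: +{SENTENCE})*")


def is_syntactically_valid(document: str) -> bool:
    """Return True if every paragraph (one per line) follows the rules."""
    paragraphs = [p.strip() for p in document.split("\n") if p.strip()]
    return bool(paragraphs) and all(PARAGRAPH.fullmatch(p) for p in paragraphs)


# Valid syntax, no meaning: the checker cannot tell the difference.
nonsense_ok = is_syntactically_valid("Zxq blorf 42 wug! Frim fram?")
# Meaningful words, but no sentence terminator: rejected on syntax alone.
unterminated_ok = is_syntactically_valid("this has no terminator")
```

The checker only ever answers "is this well-formed?", which is exactly the kind of question a syntax can answer and a language has to go beyond.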

So to start with, XML is not a language. If you still don't believe me, go read this which calls XML a "meta-language" or read it from the horse's mouth which calls XML a "text format". They should have called it "EMS" for "Extensible Markup Syntax" but I suppose that wasn't cool enough and I suppose there are enough EMS acronyms out there already (ElectroMagnetic Spectrum, Expanded Memory System).

Ok, but so what, right? Aren't I just playing with semantics here because I want to nitpick on a technology I "just don't get"?

Yeah, a little. But not exactly. Now I'll spend a little time getting into what XML is and can do. In future entries, I'll try to get into what XML has done and what it will do and try to pimp my own text format just for shiggles.

XML is a syntax that allows us to define new languages. That's it. It does this by laying out several very well-defined rules that all the new languages must follow. I personally think this is a great idea because standardization will usually improve interoperability and help the industry tackle more complex problems.

Now, I'm not an expert in the definitions within the XML spec so I'll just write about them as I call them:

- an XML document consists of an XML Declaration followed by an XML Document Root
- an XML Declaration describes the XML version (and optionally the character encoding)
- an XML Document Root is an XML Element that sits at the top-level of the document
- an XML Element consists of an opening Tag, the Element contents and a closing Tag
- an Opening Tag consists of a '<', the Element name followed by zero or more Attributes, followed by a '>'
- an Attribute consists of the Attribute Name, followed by '=', followed by double-quotes ("), followed by the Attribute Value, followed by double-quotes
- Element names and Attributes are separated by whitespace (spaces, carriage returns, etc)
- the Element Contents consists of zero or more XML Elements and zero or one chunk of regular text
- a Closing Tag consists of a '<', a '/', the Element name which it is closing, followed by a '>'

Whew, that's a lot of rules, but an example will help demonstrate the above rules:

<?xml version="1.0" encoding="ISO-8859-1"?>

<employees>
  <employee id="1" type="fulltime" >
    <name>Jeff Schiller</name>
    <salary>$5000000.00</salary>
  </employee>
  <employee id="2" type="parttime" >
    <name>Rob Russell</name>
    <salary>$2.00</salary>
  </employee>
</employees>

I've left out some optional things you can put into a document between the Declaration and the Root, but you get the general idea.
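For a feel of how such a document is consumed, here is a small sketch using Python's standard xml.etree.ElementTree module. The function name read_employees is just for illustration. Notice that every value, attribute or element text alike, comes back as a string.

```python
import xml.etree.ElementTree as ET

# The example document from above, as bytes so the declared encoding applies.
XML_BYTES = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<employees>
  <employee id="1" type="fulltime" >
    <name>Jeff Schiller</name>
    <salary>$5000000.00</salary>
  </employee>
  <employee id="2" type="parttime" >
    <name>Rob Russell</name>
    <salary>$2.00</salary>
  </employee>
</employees>
"""


def read_employees(root):
    """Collect attributes and child-element text for each <employee>."""
    records = []
    for employee in root.findall("employee"):
        records.append({
            "id": employee.get("id"),              # attribute value
            "type": employee.get("type"),          # attribute value
            "name": employee.findtext("name"),     # element contents
            "salary": employee.findtext("salary"), # element contents
        })
    return records


root = ET.fromstring(XML_BYTES)   # root is the <employees> element
records = read_employees(root)
```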

I guess my main beef is that, to me, the above text format is both ugly to the human eyes and inefficient in terms of storage (i.e. wastes space). There are additional rules that I didn't put in because it would make things even muddier; for instance, you cannot use certain characters (like the opening angle-bracket or the ampersand) directly within your chunks of text either, so they have to be escaped.
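The restriction on characters can be seen with a short Python sketch. It uses escape and unescape from the standard xml.sax.saxutils module; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw = "if (a < b && b > c)"
escaped = escape(raw)  # '&', '<' and '>' become entity references

raw_accepted = is_well_formed(f"<note>{raw}</note>")          # raw '<' and '&'
escaped_accepted = is_well_formed(f"<note>{escaped}</note>")  # escaped text
round_trip = unescape(escaped)
```

The escaped form parses, but it is longer and harder on the eyes than the raw text, which feeds straight back into the complaint above.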

XML zealots will proclaim that XML does not need to be pretty to the human eyes, only easily readable by machines and that tools can be built to make XML editing nice and clean to the human eyes. There is also no requirement of XML to be efficient in terms of its utilization of bytes/characters.

I say why not have the best of both worlds? Before I learned a bit about XML, I was working on a couple of game development projects and realized I needed to store data about the game in an easily editable format (i.e. configuration files). Without first looking into XML, I thought I would come up with my own configuration file format which I could later expand into a scripting language. I used the project to learn about language grammars (i.e. BNF) and used boost::spirit (a C++ parser-building library that I highly recommend) to come up with ConfigParser (a C++ module that can be used to read in configuration files in my format). It was a great learning experience.

The rules of my config file syntax are:

- A config file contains zero or more elements
- An element is either a simple variable declaration or a composite object declaration
- A composite object consists of the variable name, an opening brace, zero or more elements and a closing brace
- A simple variable is either an integer, a floating point number, a boolean value or a string
- A simple variable declaration consists of the variable name, an equals sign, the variable value and a semicolon
- A string value is zero or more characters surrounded by double-quotes
- An integer value consists of an optional negative sign and then one or more digits
- A boolean value is one of {true, false}
- A floating point value consists of an optional negative sign, zero or more digits, a decimal point, and then one or more digits

To compare XML to my format I'll post the same content in that format:

/* This is a file that will be used to store personnel
    records for my company */
Employees {
  Employee {
    id = "1";
    type = "fulltime";
    name = "Jeff Schiller";
    salary = 5000000.00;
  }
  Employee {
    id = "2";
    type = "parttime";
    name = "Rob Russell";
    salary = 2.00; // C++-style comments allowed too
  }
}

The whitespace is all optional above, in fact it could all be crammed onto one line if you really wanted to and you happen to be insane.
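The rules above are small enough that a reader can be sketched in a few dozen lines. What follows is NOT the original ConfigParser (that was C++ built on boost::spirit); it is a rough Python approximation of the same rules, with names such as TOKEN, tokenize and parse_config invented for the example. Because a composite object can hold several elements with the same name, it returns lists of (name, value) pairs rather than dictionaries.

```python
import re

# Token patterns, tried in order (an assumption of this sketch, not the
# original grammar). Floats are tried before integers so "2.00" stays whole.
TOKEN = re.compile(r'''
    (?P<skip>\s+|/\*.*?\*/|//[^\n]*)    # whitespace, C-style and C++-style comments
  | (?P<string>"[^"]*")                 # zero or more characters in double quotes
  | (?P<float>-?\d*\.\d+)               # optional sign, digits, point, digits
  | (?P<int>-?\d+)                      # optional sign, one or more digits
  | (?P<bool>\b(?:true|false)\b)        # true or false
  | (?P<name>[A-Za-z_]\w*)              # variable or object name
  | (?P<punct>[{}=;])                   # braces, equals sign, semicolon
''', re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Split the text into (kind, text) tokens, dropping whitespace/comments."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def convert(token):
    """Turn a value token into a typed Python value."""
    kind, text = token
    if kind == "string":
        return text[1:-1]            # strip the surrounding quotes
    if kind == "float":
        return float(text)
    if kind == "int":
        return int(text)
    if kind == "bool":
        return text == "true"
    raise SyntaxError(f"expected a value, got {text!r}")


def parse_elements(tokens, pos, closing=None):
    """Parse elements until `closing` or end of input; return (pairs, pos)."""
    pairs = []
    while pos < len(tokens) and tokens[pos][1] != closing:
        kind, name = tokens[pos]
        if kind != "name":
            raise SyntaxError(f"expected a name, got {name!r}")
        following = tokens[pos + 1][1] if pos + 1 < len(tokens) else None
        if following == "{":                      # composite object
            value, pos = parse_elements(tokens, pos + 2, closing="}")
            if pos >= len(tokens):
                raise SyntaxError(f"missing closing brace for {name!r}")
            pos += 1                              # consume the "}"
        elif following == "=":                    # simple variable
            if pos + 3 >= len(tokens) or tokens[pos + 3][1] != ";":
                raise SyntaxError(f"missing semicolon after {name!r}")
            value = convert(tokens[pos + 2])
            pos += 4
        else:
            raise SyntaxError(f"expected '{{' or '=' after {name!r}")
        pairs.append((name, value))
    return pairs, pos


def parse_config(text):
    """Parse a whole config file into nested lists of (name, value) pairs."""
    pairs, _ = parse_elements(tokenize(text), 0)
    return pairs


SAMPLE = '''/* personnel records */
Employees {
  Employee { id = "1"; name = "Jeff Schiller"; salary = 5000000.00; }
  Employee { id = "2"; name = "Rob Russell"; salary = 2.00; // comment
  }
}'''
config = parse_config(SAMPLE)
```

The salary values come out of the parser as floating point numbers rather than text, with no second pass needed.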

Now, being partial to C/C++/Java, and having constructed the grammar, parser and reader/writer myself, obviously I'm partial to my format, but I will state that the only real difference between the two is that XML separates the attributes that describe the data (i.e. meta-data about the element's contents) from the element's data itself (i.e. the nested elements or text of the element). However, the difference between attributes and element data can often be confusing.

On the other hand, what I like about my format is that there is a distinction between types: integers, floating point numbers, strings, objects. Everything in XML is an element which consists of zero or more nested elements and optionally a chunk of text. This makes parsing rather trivial, yet to get anything meaningful you will always need to have a higher-level parser to take the contents and format them into something that your software understands (e.g. take the text "$5000000.00" and convert it into a floating point number 5000000.00).
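That extra "higher-level" step might look something like this in Python. The converter table and the names parse_salary and typed_employee are assumptions made for the sketch; the point is only that the application has to know, field by field, what each chunk of text is supposed to become.

```python
import xml.etree.ElementTree as ET


def parse_salary(text: str) -> float:
    """Convert text such as "$5000000.00" into the float 5000000.0."""
    return float(text.strip().lstrip("$"))


# Per-field converters the application has to supply itself (hypothetical).
CONVERTERS = {"name": str, "salary": parse_salary}


def typed_employee(element):
    """Build a typed record from an <employee> element."""
    record = {"id": int(element.get("id")), "type": element.get("type")}
    for child in element:
        convert = CONVERTERS.get(child.tag, str)
        record[child.tag] = convert(child.text or "")
    return record


employee = ET.fromstring(
    '<employee id="1" type="fulltime">'
    "<name>Jeff Schiller</name><salary>$5000000.00</salary>"
    "</employee>"
)
record = typed_employee(employee)
```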

The one thing that I think both XML and my format lack is an efficient means of representing simple arrays of data (like integers). My proposal as an extension to my format is currently:

MyLuckyNumbers[] = { 3, 5, 8, 15, 19, 28 };

which mirrors the C/C++/Java format; rules could also be built out for arrays of arrays or multi-dimensional arrays. Such rules can be built into the parser to return the delimited data in an array format that is directly usable by the client software. Eventually I plan to update my ConfigParser to support this.
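As a sketch of what the parser side of the array proposal could do (in Python, and only for a flat array of integers on a single declaration), the delimited values can be handed back as a ready-to-use list. ARRAY_DECL and parse_int_array are illustrative names, not part of ConfigParser.

```python
import re

# name [] = { comma-separated items } ;   (flat integer arrays only)
ARRAY_DECL = re.compile(r"\s*([A-Za-z_]\w*)\s*\[\]\s*=\s*\{([^{}]*)\}\s*;\s*")


def parse_int_array(declaration: str):
    """Return (name, list of ints) for a declaration like the one above."""
    match = ARRAY_DECL.fullmatch(declaration)
    if match is None:
        raise SyntaxError(f"not an array declaration: {declaration!r}")
    name, body = match.groups()
    items = body.split(",") if body.strip() else []
    return name, [int(item) for item in items]  # int() ignores padding spaces


name, numbers = parse_int_array("MyLuckyNumbers[] = { 3, 5, 8, 15, 19, 28 };")
```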

However, for something similar XML would need to do one of the following:

a)

Which requires additional parsing of the text into a series of numbers.

OR

b)

Which is lengthy and still requires additional parsing of each element (i.e. the text "19" into an integer 19).
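To illustrate the trade-off between options (a) and (b), here is one way the two shapes might look, together with the extra parsing each one needs. The element names MyLuckyNumbers and Number are hypothetical, chosen only for this Python sketch.

```python
import xml.etree.ElementTree as ET

# Option (a): one element holding delimited text (hypothetical element name).
OPTION_A = "<MyLuckyNumbers>3, 5, 8, 15, 19, 28</MyLuckyNumbers>"

# Option (b): one child element per number (hypothetical element names).
OPTION_B = """<MyLuckyNumbers>
  <Number>3</Number>
  <Number>5</Number>
  <Number>8</Number>
  <Number>15</Number>
  <Number>19</Number>
  <Number>28</Number>
</MyLuckyNumbers>"""

# (a) needs the text split on commas, then each piece converted.
numbers_a = [int(piece) for piece in ET.fromstring(OPTION_A).text.split(",")]

# (b) needs no splitting, but each element's text still has to be converted.
numbers_b = [int(item.text) for item in ET.fromstring(OPTION_B).findall("Number")]
```

Either way the application ends up doing its own string-to-integer conversion after the XML parser has finished.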

I'd be interested to hear comments from XML fans (even flames) at this point. I'm sure I'll get the "Why bother making your own configuration file format and parser when XML already exists?" but to this I will only pose the question: "Do you use lightbulbs or candles"?

§21 · January 21, 2005 · Software, Technology, Web, XML
