So I want a structured text format usable for configuration files and data interchange. My key requirements can be boiled down to:
- Syntax allows concise expression (this alone rules out XML)
- Simple to parse (this also rules out XML)
- Suitable for human “consumption” (reading, editing). To some degree, this rules out XML.
As you can see, XML is definitely out. But oddly enough, I’m struggling to find anything I’m really happy with.
Oh yeah, I know, I know, JSON, right? I have these main problems with JSON:
1. Excessive quotation marks required
So for a simple set of key-values I need something like:
"key1" : "overquoted",
This isn’t crazy bad, but why are the quotes even necessary? Wouldn’t it be nice if I could instead write:
key1 : overquoted,
Given that, at the point where a key such as key1 appears, an alphabetic character cannot otherwise legitimately be present, what would be the harm in allowing this? Sure, if you want spaces or punctuation in your key then you should need to quote it, but not otherwise. The same goes for values: it would be nice if those unnecessary quotes weren’t required.
2. No comments
This one really irks me. For a configuration file, comments are pretty much mandatory. Douglas Crockford gives an explanation for why there are no comments in JSON (why he removed comments from the spec, in fact), and it sucks. Basically: people weren’t using comments the way I wanted them to, so I removed comments. Yeah, I think we just hit ludicrous speed. There are so many things wrong with this argument I barely know where to begin. For a start, anyone using comments as parsing directives was going to need a custom parser anyway, because what they were dealing with wasn’t plain JSON. So changing JSON does not affect those people; they will continue to use their custom parsers. All you do by removing comments is make the standard less useful generally. The follow-up:
Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.
… is equally ridiculous. Sure, I could strip comments before handing the file off to the parser, but then my original data isn’t actually JSON, is it? And so interoperability is destroyed anyway, because I can no longer use any standard JSON tools on my configuration files.
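To make the interoperability point concrete, here is a minimal sketch using Python's standard `json` module. The configuration text and the helper name `parses_as_json` are made up for illustration; the only claim is that a stock parser rejects the commented file until the comment is removed.

```python
import json

# Hypothetical configuration text: valid JSON apart from the comment line.
CONFIG_WITH_COMMENT = """
{
    // where the service listens
    "port": 8080
}
"""


def parses_as_json(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(CONFIG_WITH_COMMENT))  # False
print(parses_as_json('{"port": 8080}'))     # True
```

Any other standard JSON tool pointed at the commented file would be expected to fail in the same way, which is exactly the problem.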
3. Unclear semantics
The current RFC for JSON has, in my opinion, a lot of guff about implementations that just doesn’t belong in a specification. Problematically, this discussion could be seen to legitimise limitations of implementations. Take for example:
An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings. When the names within an object are not unique, the behavior of software that receives such an object is unpredictable.
What does that mean, exactly? That it’s allowed that objects with non-unique names cause software to behave “unpredictably”? This is meant to be a specification imposing requirements on implementations, but it’s hard to glean from this text precisely what the requirements are, in particular because (just above that):
The names within an object SHOULD be unique.
That’s SHOULD, not MUST (see RFC 2119), which implies that non-unique names are in fact permissible and must be handled by an implementation. I wonder how many JSON libraries will properly represent such an object… not many, I’d guess.
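As one data point, here is what Python's standard `json` module does with repeated names. This is a small sketch, and the document text is invented for the example. By default the result is a dict, so only the last value for a name survives; the raw pairs are only available if you ask for them through `object_pairs_hook`.

```python
import json

# Hypothetical input: one object in which the same name appears twice.
DUPLICATE_NAMES = '{"key": "first", "key": "second"}'

# Default behaviour: the result is a dict, so only one value per name survives.
as_dict = json.loads(DUPLICATE_NAMES)

# Asking for the raw pairs keeps both members, in document order.
as_pairs = json.loads(DUPLICATE_NAMES, object_pairs_hook=list)

print(as_dict)   # {'key': 'second'}
print(as_pairs)  # [('key', 'first'), ('key', 'second')]
```

So the information is recoverable, but only by stepping outside the usual "object becomes a map" representation.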
So How About YAML?
YAML does solve the problems that I identified with JSON above. It doesn’t require superfluous quotation marks, it’s clear about map semantics, and it allows comments. On the other hand, its specification is quite large. Part of the complexity comes from the concept of tags which are a way of identifying types. While the YAML core specification (“failsafe schema”) deals only with maps, sequences and strings, it allows for explicitly tagging any value as a particular type identified by a “tag”. The schema notionally defines how tags are mapped to actual types and allows for specifying rules for determining the type of otherwise untagged ‘plain scalar’ (roughly: unquoted string) values. So for instance the JSON schema – which makes YAML a superset of JSON – maps sequences of digits to an integer type rather than the string type. The fact that different schemas yield different semantics, and that arbitrary types (which a given implementation may not know how to handle) can be assigned to values, in my opinion reduces YAML’s value as an interchange format.
(For instance, a tool which merges two YAML maps needs to know whether 123 and “123” are the same or not. If using the failsafe schema, they are strings and are the same; if using the JSON schema, one is a number and they are not the same).
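The 123 versus “123” case can be sketched without a YAML library at all. The code below is a deliberately simplified model, not a YAML implementation: the `Scalar` type, the `resolve` function and the schema names as strings are all hypothetical, and only integers are handled. It assumes that quoted scalars always resolve to strings and that plain scalars are matched against an integer pattern under the JSON schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    """A scalar as it appears in the document (hypothetical helper type)."""
    text: str
    quoted: bool


# Integer pattern in the style of the YAML 1.2 JSON schema.
INT_PATTERN = re.compile(r"-?(0|[1-9][0-9]*)")


def resolve(scalar: Scalar, schema: str):
    """Map a scalar to a native value under the named schema (simplified)."""
    if schema == "failsafe" or scalar.quoted:
        return scalar.text  # always a string
    if schema == "json" and INT_PATTERN.fullmatch(scalar.text):
        return int(scalar.text)
    # Simplification: anything unrecognised stays a string.
    return scalar.text


plain = Scalar("123", quoted=False)
quoted = Scalar("123", quoted=True)

print(resolve(plain, "failsafe") == resolve(quoted, "failsafe"))  # True
print(resolve(plain, "json") == resolve(quoted, "json"))          # False
```

The same two scalars compare equal or unequal depending purely on which schema the tool happened to pick.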
In fact, the whole notion of schemas leads to the question of whether it is really up to the text format to decide what type plain nodes really are. Certainly, maps and sequences have a distinct type and are usually unambiguous – even YAML doesn’t allow a schema to re-define those – and are enough to represent any data structure (in fact, just sequences would be enough for this). I also think it’s worthwhile having a standard quoting mechanism for strings, and this is necessary to be able to disambiguate scalar values from structures in some cases. But beyond that, to me it seems best to treat scalars as plain strings for the purposes of document structure, and to let the application determine how to interpret each one (it can potentially use regular expressions for this, as YAML schemas do). This is essentially what the YAML failsafe schema does (and it even allows disambiguating quoted strings from unquoted strings, since the latter will be tagged with the ‘?’ non-specific tag).
It’s worth noting that YAML can handle recursive structures – sequences or maps that contain themselves as members either directly or indirectly. This isn’t useful for my needs but it could be for some applications. On the other hand, I imagine that it greatly complicates implementation of parsers, and could be used for attacks on poorly coded applications (it could be used to create unbounded recursion leading to stack overflow).
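Here is a rough sketch of the hazard and one way to guard against it, written against plain Python lists and dicts rather than any particular YAML library. The `depth` function is hypothetical. It tracks the containers on the current path by identity, so a structure that contains itself is reported as an error instead of recursing until the stack runs out, while a node that is merely shared between two places is still accepted.

```python
def depth(node, _active=None):
    """Nesting depth of lists/dicts; raises ValueError on a cycle."""
    if not isinstance(node, (list, dict)):
        return 0
    active = set() if _active is None else _active
    if id(node) in active:
        raise ValueError("structure contains itself")
    active.add(id(node))
    children = node.values() if isinstance(node, dict) else node
    try:
        return 1 + max((depth(child, active) for child in children), default=0)
    finally:
        # Leaving this container: it is no longer on the current path.
        active.discard(id(node))


# Roughly the shape of the YAML document "&a [*a]": a sequence whose only
# member is itself.
recursive = []
recursive.append(recursive)

print(depth([1, [2, [3]]]))  # 3
try:
    depth(recursive)
except ValueError as err:
    print(err)               # structure contains itself
```

Without the `_active` check, the second call would recurse until Python raised `RecursionError`; in a language without that safety net it would be a genuine stack overflow.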
TOML is a relative newcomer on the scene of simple structured text formats. It has a fixed set of supported types rather than allowing schemas as YAML does, and generally aims to be a simpler format; on the other hand it is much closer in syntax to YAML than to JSON, and so is much easier to read and edit by hand.
Among the supported types are the standard map / sequence / string, but also integer, float, boolean and date-time. This seems fine, but again I’m uncertain that having more than just the basic “string” scalar type is really necessary. On the other hand, having these types properly standardised is unlikely to cause any harm.
I think the one downside to TOML is the ungainly syntax for sequences of maps – it requires double-square brackets with the name of the sequence repeated for each element:
key1 = value1
key2 = value2
key1 = value1
key2 = value2
Nested maps are also a bit verbose, requiring the parent map name to be given as a prefix to the child map name:
key1 = value1 # this key and all following keys are in the child map
The top level node of a TOML structure, if I understand correctly, must always be a map, since you specify key-value pairs. This is probably not a huge concern for my purposes but is certainly a limitation of the format. Once you’ve opened a map (“table” in TOML parlance) there’s also no way to close it, it seems, other than by opening another table.
I think the occasional ugliness of the syntax, together with the immaturity of the format, are deal breakers.
And so the winner…
… Is probably YAML, at this stage, with the failsafe schema, although the potential for recursive structures makes me a little uneasy and it’d be nicer if I didn’t have to explicitly choose a schema. It’s also a shame that the spec is so wordy and complex, but the syntax itself is nice enough I think and seems like a better fit than either JSON or TOML.