Why is there no decent, simple, structured text format?

So I want a structured text format usable for configuration files and data interchange. My key requirements can be boiled down to:

  • Syntax allows concise expression (this alone rules out XML)
  • Simple to parse (this also rules out XML)
  • Suitable for human “consumption” (reading, editing). To some degree, this rules out XML.

As you can see, XML is definitely out. But oddly enough, I’m struggling to find anything I’m really happy with.

Oh yeah, I know, I know, JSON, right? I have these main problems with JSON:

1. Excessive quotation marks required

So for a simple set of key-values I need something like:

{
    "key1" : "overquoted",
    "key2" : "the perils of javascript"
}

I mean this isn’t crazy bad, but why are the quotes even necessary? I mean wouldn’t it be nice if I could instead write:

{
    key1 : overquoted,
    key2 : "the perils of javascript"
}

Given that, at the point key1 and key2 appear, an alphabetical character may not otherwise legitimately be present, what would be the harm in allowing this? (sure, if you want spaces or punctuation in your key, then you should need to quote it, but not otherwise). Similarly for values. It would be nice if those unnecessary quotes weren’t actually required.

2. No comments

This one really irks me. For a configuration file, comments are pretty much mandatory. Douglas Crockford gives an explanation for why there are no comments in JSON (why he removed comments from the spec, in fact), and it sucks. Basically: people weren’t using comments the way I wanted them to, so I removed comments. Yeah, I think we just hit ludicrous speed. There are so many things wrong with this argument I barely know where to begin. At the very outset, anyone using comments as parsing directives was going to need a custom parser anyway – what they were dealing with wasn’t plain JSON. So actually changing JSON does not affect those people; they will continue to use their custom parsers. In fact all you do by removing comments is make the standard less useful generally. The follow up:

Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.

… is equally ridiculous. I mean sure I could strip comments before handing off the the parser, but then my original data isn’t actually JSON, is it? And so interoperability is destroyed anyway, because I can no longer use any standard-JSON tools on my configuration files.

3. Unclear semantics

The current RFC for JSON has, in my opinion, a lot of guff about implementations that just doesn’t belong in a specification. Problematically, this discussion could be seen to legitimise limitations of implementations. Take for example:

An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings. When the names within an object are not unique, the behavior of software that receives such an object is unpredictable.

What does that mean, exactly? That it’s allowed that objects with non-unique names cause software to behave “unpredictably”? This is meant to be a specification imposing requirements on implementations, but it’s hard to glean from this text precisely what the requirements are, in particular because (just above that):

The names within an object SHOULD be unique.

That’s SHOULD, not MUST (see RFC 2119); which implies that non-unique names are in fact permissible and must be handled by an implementation. I wonder how many JSON libraries will properly represent such an object… not many, I’d guess.

So How About YAML?

YAML does solve the problems that I identified with JSON above. It doesn’t require superfluous quotation marks, it’s clear about map semantics, and it allows comments. On the other hand, its specification is quite large. Part of the complexity comes from the concept of tags which are a way of identifying types. While the YAML core specification (“failsafe schema”) deals only with maps, sequences and strings, it allows for explicitly tagging any value as a particular type identified by a “tag”. The schema notionally defines how tags are mapped to actual types and allows for specifying rules for determining the type of otherwise untagged ‘plain scalar’ (roughly: unquoted string) values. So for instance the JSON schema – which makes YAML a superset of JSON – maps sequences of digits to an integer type rather than the string type. The fact that different schemas yield different semantics, and that arbitrary types (which a given implementation may not know how to handle) can be assigned to values, in my opinion reduces YAML’s value as an interchange format.

(For instance, a tool which merges two YAML maps needs to know whether 123 and “123” are the same or not. If using the failsafe schema, they are strings and are the same; if using the JSON schema, one is a number and they are not the same).

In fact, the whole notion of schemas leads to the question of whether it is really up to the text format to decide what type plain nodes really are. Certainly, maps and sequences have a distinct type and are usually unambiguous – even YAML doesn’t allow a schema to re-define those – and are enough to represent any data structure (in fact, just sequences would be enough for this). I also think it’s worth while having a standard quoting mechanism for strings, and this is necessary to be able to disambiguate scalar values from structures in some cases. But beyond that, to me it seems best just to let the application determine how to interpret each scalar string (and it can potentially use regular expressions for this, as YAML schemas do), but that for purposes of document structure scalars are always just strings. This is essentially what the YAML failsafe scheme does (and it even allows disambiguating quoted strings from unquoted strings, since the latter will be tagged with the ‘?’ unknown type).

It’s worth noting that YAML can handle recursive structures – sequences or maps that contain themselves as members either directly or indirectly. This isn’t useful for my needs but it could be for some applications. On the other hand, I imagine that it greatly complicates implementation of parsers, and could be used for attacks on poorly coded applications (it could be used to create unbounded recursion leading to stack overflow).

Or TOML?

TOML is a relative newcomer on the scene of simple structured text formats. It has a fixed set of supported types rather than allowing schemas as YAML does, and generally aims to be a simpler format; on the other hand it is much closer to YAML in syntax than JSON and so is much easier to read and edit by hand.

Among the supported types are the standard map / sequence / string, but also integer, float, boolean and date-time. This seems fine, but again I’m uncertain that having more just than the basic “string” scalar type is really necessary. On the other hand having these types properly standardised is unlikely to cause any harm.

I think the one downside to TOML is the ungainly syntax for sequences of maps – it requires double-square brackets with the name of the sequence repeated for each element:

[[sequencename]]
key1 = value1
key2 = value2
[[sequencename]]
key1 = value1
key2 = value2

Nested maps are also a bit verbose, requiring the parent map name to be given as a prefix to the child map name:

[parent]
[parent.child]
key1 = value1  # this key and all following keys are in the child map

The top level node of a TOML structure, if I understand correctly, must always be a map, since you specify key-value pairs. This is probably not a huge concern for my purposes but is certainly a limitation of the format. Once you’ve opened a map (“table” in TOML parlance) there’s also no way to close it, it seems, other than by opening another table.

I think the occasional ugliness of the syntax, together with the immaturity of the format, are deal breakers.

And so the winner…

… Is probably YAML, at this stage, with the failsafe schema, although the potential for recursive structures makes me a little uneasy and it’d be nicer if I didn’t have to explicitly choose a schema. It’s also a shame that the spec is so wordy and complex, but the syntax itself is nice enough I think and seems like a better fit than either JSON or TOML.

Advertisements

4 thoughts on “Why is there no decent, simple, structured text format?

    1. I hadn’t seen that before. I like the simplicity, but it suffers one of the same problems as TOML: nested objects are not convenient, you have to specify the full “path” of the object {a.b.c}. Apart from the extra noise in the text, this redundancy also makes editing more painful: if you want to change the name of an object, you have to change that name in all its sub-objects as well. Also, that “re-declaring” an object actually opens it so further elements can be inserted seems like it could cause problems. I’d rather get an error, or at least be able to detect the condition in the internal representation after parse. On the other hand, it looks like it’s pretty good for what it was designed for (representing structured text).

  1. I faced the same problem a while ago and came up with RON: https://github.com/kvark/ron
    It doesn’t have a working implementation yet, but hey! you didn’t mention it in the requirements anyway 🙂
    An example:

    Scene( // class name is optional
        materials: { // this is a map
            "metal": (
                reflectivity: 1.0,
            ),
            "plastic": (
                reflectivity: 0.5,
            ),
        },
        entities: [ // this is an array
            (
                name: "hero",
                material: "metal",
            ),
            (
                name: "monster",
                material: "plastic",
            ),
        ],
    )
    
    1. Ok, interesting. It took me a few seconds what you’re doing there – most of the structured formats represent an object as a hash, whereas you’re choosing to distinguish between the two. I can see this might have some benefits, even though it makes any in-memory representation slightly more complex (since there is another type, now); in particular the ability to tag objects with their type, which would otherwise have to be a record within the object, and perhaps better error detection/diagnosis when the input structure isn’t correct.

      I think this is pretty good, actually. About the only thing I don’t like is the apparent mandatory quoting for strings; I think perhaps simple string values shouldn’t need that. But that’s not necessarily a deal breaker.

      On the other hand, no implementation is a problem 🙂 Also, I have resolved the requirement that prompted me to write this blog post by implementing a very straightforward custom format which is easy to parse and which met my needs quite well, with the only downside being that it’s a unique format and probably won’t be applicable to other projects, and there are no general-purpose tools available for manipulating it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s