Sunday, September 13, 2009

bXML

Lately, I've been feeling undecided. There are so many great things I want to write about, but I fear that there are fewer and fewer people who want to read them. I'm torn between not caring and just doing my own thing, and trying harder to pick more accessible topics to gain acceptance.

We're drifting farther and farther away from wanting to know -- really know -- about stuff, and more into just having quick but meaningless details at our fingertips. It seems to be a dangerous side-effect of the Information Age. Still, most people will eventually come to realize that even if you have the facts, without any context you can't put them into perspective. They are useless on their own.


bXML

For this post, I figured I'd just dump out an old idea that I had a long time ago for some simple development. The need, requirements and drivers for the project are all long gone, ancient history now, leaving only this little orphaned idea.

It's really just a simple spin on XML, one that compacts the format and allows it to contain binary data. It can be indexed as well, using a small amount of extra space to allow for faster internal access to large files. Split into two related files, it could even anchor some novel type of database.

The acronym I like is bXML, but the full name is probably better described as "indexed binary XML", since the need for and usefulness of the index mechanism is crucial to this being more than just XML.

My inspiration comes from several of the better bits of the PDF format, which are quite intelligent and elegant, all things considered. PDFs marched out of nowhere to become a major technology in a huge number of fields, including both the Internet and the commercial printing industry. Internally they are complex, yet they do not require the demented rocket-science understanding necessary for really bad spaghetti-data formats such as Microsoft Word files. They are easy to learn, and easy to generate.


ON LARGE

I do like XML; it has that great quality where it is mostly simple, yet it contains a significant amount of power in its expressiveness. Most of the best technologies have that property.

Still, there is a necessary redundancy to it, driven by its roots in SGML, a text-based format. The all-text aspect is fine, but sometimes we want to store a larger amount of data in a more condensed form, and even a 20% overhead in size is too significant a price to pay.

No matter how much faster they are getting, machines are always way too limited to compute a lot of what we know would be interesting. Moore's law may help with video games, but it cannot keep up with our expectations for data processing. There are just too many things to capture, and too many ways to analyze the results. We're a long way away from gaining our freedom from the hardware, if it's even possible.

As such, the first and most important point of bXML is to find the smallest possible representation that still matches the power and expressiveness of XML, while also maintaining that degree of simplicity.

An intensely powerful representation that is also very obfuscated would not be particularly useful. Simplicity, especially initially, is an all-important aspect for this to work.

To compact XML data we need two things: cheaply parsable data, and a compact format for the file structure.


PARSABLE DATA

The first thing we need is to make sure the resulting format is "reasonably" parsable. Since this format can also contain binary data, variable length elements become an interesting issue.

The programming language Pascal used variable-length strings that were preceded by a count of the characters in the string. The C language took the opposite approach: it put an "in-band signal" -- the null character -- into the string as a terminator. These two languages represent the two main ways of managing variability within a computer: an explicit size, or a terminator.

In-band signals are easier and often more flexible, but with binary data there is always the chance that the terminator itself can appear in the data.

A way around this could be to use a custom variable-length terminator, one that is explicitly picked each time because it does not appear in the current stream of binary data.

While it is a neat idea, its complexity does seem to be higher than just specifying the size at the beginning. So we should probably take the simple size road for all variable data in the file. It's slightly more expensive in size, but much less so in CPU, and it is simple.
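To make the size road concrete, here is a minimal sketch in Python -- my own illustration, assuming the 4-byte network-encoded lengths used later in this post -- of reading one variable-length field:

import struct

def read_sized(stream):
    # Read the 4-byte network-order (big-endian) size prefix.
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("truncated size prefix")
    (size,) = struct.unpack(">I", header)
    # Then read exactly that many bytes of payload.
    data = stream.read(size)
    if len(data) < size:
        raise EOFError("truncated data")
    return data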

For now, along with variable-length data, we can also have a list of well-known types with explicit sizes and interpretations: things such as shorts, integers, dates and floating-point numbers.

That type of list tends to grow, getting longer and longer as the specification ages. Still, well-defined, compact types represent an easy and reliable way to tightly pack data.
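As a sketch of how such a list might start out -- the type names and struct formats here are my own picks, not part of any spec -- it is just a mapping from type names to fixed sizes and decoders:

import struct

# Hypothetical starter set of well-known, fixed-size types.
WELL_KNOWN_TYPES = {
    "short":  (2, lambda b: struct.unpack(">h", b)[0]),
    "int":    (4, lambda b: struct.unpack(">i", b)[0]),
    "double": (8, lambda b: struct.unpack(">d", b)[0]),
    "year":   (2, lambda b: struct.unpack(">H", b)[0]),  # e.g. 1511
}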


STRUCTURE

The second thing we need is to somehow pack the tags into an arrangement that isn't highly replicated throughout the entire document.

If there are a lot of small elements with long names in a large XML file, the structural burden of the document can easily exceed that of the content. For small data this difference is not significant, but for big things it is a real show-stopper.

The best way to handle this is by stealing a few tricks from PDFs. I've always rather admired how they managed to create a binary format that was still mostly readable in an editor. It can be very useful in many operational circumstances.

They use a reference table, often stored at the end of the document, to list out the position (in characters) of each of the internal objects in the file. While each object is delineated by itself -- you don't need the reference table to parse -- it can be quickly accessed from the table. The whole underlying scheme has a quality to it that we can use.


THE DOCUMENT

The overall document should be pretty simple, with a header, footer, reference table and of course, the data itself.

For the header and footer we can use some syntax similar to PDF's. PDF syntax was really based around PostScript syntax, but for this format, the similarity is just for show.

We can start by defining the fact that the first line of a bXML file references its type and version, and that the second line is just a set of four binary values (like PDF) to ensure that the data-type of the document is correctly understood.

%bXML-1.0
%âãÏÓ

Would do nicely for a header.

The next thing we want is for some of the last bytes in the file (the footer) to be an integer that lists out the starting position of the reference table. Since we can use a network-encoded 4-byte integer directly, we can finish the file with a consistent 11 bytes:

BBBB\n
%%EOF\n

Where BBBB is the four-byte integer character offset of the start of the reference table.
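Here is a minimal sketch of writing and reading those fixed pieces in Python; the four marker bytes are the Latin-1 values of âãÏÓ, and everything else follows the layout above:

import struct

HEADER = b"%bXML-1.0\n%\xe2\xe3\xcf\xd3\n"

def write_footer(f, table_offset):
    # BBBB\n%%EOF\n -- a consistent 11 bytes ending the file.
    f.write(struct.pack(">I", table_offset) + b"\n%%EOF\n")

def read_table_offset(f):
    f.seek(-11, 2)                  # 11 bytes back from the end
    (offset,) = struct.unpack(">I", f.read(4))
    return offset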

In between the header and the footer come first the data, and then the reference table that allows the data to be interpreted correctly.


THE REFERENCE TABLE

The reference table consists of one line for each unique element encoded in the data. Element attributes would be dropped into full tags, as part of normalizing the data, so the following:
<painting>
<img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
<caption>This is Raphael's "Foligno" Madonna, painted in
<date>1511</date>-<date>1512</date>.</caption>
</painting>

Would have a list of element names:

painting
img
src
alt
caption
date

Now, since we want to be able to identify these tags in the data as we encounter them, we can implicitly associate each tag with an index into the reference table. Since the number of different tag types could be small or it could be large, we want a mildly sophisticated method for encoding them.

In the data, each tag is one or more bytes. If the 8th bit is set on a byte, then the following byte is also part of the tag. In that case the first 7 bits allow us 128 unique tags, the next 14 allow 16,384 tags, and so on. In this scheme there is no limit to the number of tags, but each later tag starts to eat up more space.

The algorithm for finding the index is fairly simple. Keep reading bytes until one doesn't have its 8th bit set. Pack all of the 7-bit pieces together, and then interpret the result as the index number. Since the indexes are always positive integers, this is a straight binary conversion.
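In Python the whole scheme is only a few lines. This is a sketch, and it fixes one detail the description leaves open: the 7-bit groups accumulate most-significant first.

def encode_tag(index):
    # Split the index into 7-bit groups, most significant first.
    groups = []
    while True:
        groups.insert(0, index & 0x7F)
        index >>= 7
        if index == 0:
            break
    # Set the 8th bit on every byte except the last.
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def decode_tag(stream):
    # Keep reading bytes until one doesn't have its 8th bit set.
    index = 0
    while True:
        byte = stream.read(1)[0]
        index = (index << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return index

For example, index 200 encodes as the two bytes 0x81 0x48, and decoding them packs the pieces 1 and 72 back into 200.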

Now of course, some of the element names are duplicated, over and over again. It is up to the creator of the file as to whether these repeated elements have separate tags or not. Usually 'not' would be better as it is more space efficient.

This leaves us with a reference table that would look something like:

crossref
painting
img
src
alt
caption
date
end-crossref

Where each entry is separated for convenience by a newline (just to make it partially readable).


MORE PACKING

So we have a reference table that relates a variable string back to a byte tag embedded in the data.

In many XML files, there are often larger hierarchies of tags that refer to the same structure of elements, over and over again. If we know the underlying element types, we can parse them out individually without having to have a separate tag for each sub-element. In complex, highly repetitive data that is a considerable savings.

To accomplish this, in the reference table, we can just add a value that says our "depth" in a structural tree.

If our depth is not listed or is 0, the element is parsed as a top-level (explicit) tag. If the depth is higher, then it is an implicit tag packed in sequence behind its parent tag (which may itself be implicit). In that case the two elements of data appear side by side without any intervening tag (and empty elements still need a zero-size indicator).

Another modification we can make is to first assume that all data in the file is variable length. In that case, each element looks like:

BBBBdddddddddddddddddddddddddddd

Where BBBB is a four-byte network-encoded integer, followed by a variable array of data.

Now for some data types, we may know right away that they have a fixed size and a well-known interpretation. As such, while the default in a reference entry is variable binary data, other types can be specified too, such as a network-encoded integer, short or double.

Any type really, so long as the length is fixed and there is some reference tag for that type added to the bXML definition.
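Reading a single value then becomes a two-way branch, sketched here using the hypothetical read_sized and WELL_KNOWN_TYPES helpers from earlier:

def read_value(stream, type_name=None):
    # Well-known fixed types carry no size prefix; everything
    # else is variable and starts with a 4-byte size.
    if type_name in WELL_KNOWN_TYPES:
        size, decode = WELL_KNOWN_TYPES[type_name]
        return decode(stream.read(size))
    return read_sized(stream)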

All of this fiddling with extra information gives us a reference table of rather arbitrary length, where each entry contains a variable-length string, and possibly some other values.

Since each table entry is already variable, we might as well accept that. A separator character like # can be used to delineate the fields within an entry, and a starting and ending string help to delineate the boundaries of the table itself; "crossref" and "end-crossref" are easily readable.

Also, since all of the entries and parameters in the table are explicitly separated, we can use the ASCII representation for numbers, instead of something less readable like network-encoded floats. The small extra size is a reasonable trade-off for easy readability.

Another thing that would be nice is a set of index references for the main tags: we'd like to know where they are in the file. Since this is a one-to-many relationship, we can just use square braces and commas to delineate a list of character positions for each tag.

So a more complete reference file might look something like:

crossref
painting##empty#[0]
img#1#empty#[1]
src#2##[1]
alt#2##[16]
caption#1##[46,103,113]
date#2#year#[98,108]
end-crossref

Given that, we can then encode the file as:

{1}BBBBmadonna.jpgBBBBFoligno Madonna, by RaphaelBBBBThis is Raphael's "Foligno" Madonna, painted in Y1511BBBB-Y1512BBBB.

Where I've used {N} to represent the byte(s) needed to encode the tag index, Y1511 to represent a "year" data-type and BBBB to represent a 4 byte size.

Notice that we don't need the quotes, and that the caption is broken up into three parts.

All in all, it is a pretty compact representation. Of course, the reference table overhead is fairly large for small data, so in a simple example with one entry, XML is a far better representation. Still, it makes for a nearly readable file.
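For concreteness, here is a sketch of parsing that table. The field order (name, depth, type, positions) follows the example above; anything beyond that is my own assumption:

def parse_reference_table(text):
    lines = text.splitlines()
    assert lines[0] == "crossref" and lines[-1] == "end-crossref"
    entries = []
    for line in lines[1:-1]:
        name, depth, type_name, positions = line.split("#")
        entries.append({
            "name": name,
            "depth": int(depth) if depth else 0,
            "type": type_name or None,
            # "[46,103,113]" -> [46, 103, 113]
            "positions": [int(p) for p in positions[1:-1].split(",") if p],
        })
    return entries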

With the exception perhaps of pure binary data (such as an embedded JPEG), this format is entirely translatable into standard XML, and any standard XML is translatable into this format.


CONTAINING THINGS

Still, the really clever reader will no doubt have noticed a fatal flaw in the scheme as described so far.

We can delineate variable elements with a size that tells us where the data ends. We can specify tags, but the current mechanism doesn't really support any type of reasonable containment.

Composite tags do have structure (because of the depth indication), but even that does not help with the positioning and embedding of the data.

This layout has no real way of embedding an arbitrary depth of explicit tags within tags. That type of nesting isn't included yet, but it is crucial to making XML useful, so we will have to add it.

To define a container scheme (particularly a recursive one), we generally need two explicit tokens in the data: a start and an end. It is tempting to use the element tag itself as the start token, but that assumes that most things will be containers.

For this scheme, since we are using it to deal with massive amounts of data, it is best to assume that most tags are not actually composite. That is, most of them have a one-to-one relationship with some discrete piece of data. If the structure outweighs the data, then pure XML is a better choice of representation.

Now, to keep this simple, we'd really only like to introduce the most minimal syntax to accommodate the container start and ends.

One easy and inexpensive way is to embed them directly into the existing tag mechanics. We simply create a one-byte tag for each of them, drawing very little from the reference table. Thus for the original example we could have a table like:

crossref
__BEGIN__###[]
__END__###[]
painting##empty#[]
img##empty#[]
src###[]
alt###[]
caption###[]
date##year#[]
end-crossref

Where we've picked up the convention of wrapping system tokens in leading and trailing double underscores. Then BEGIN and END just delineate the boundaries of any tag container.

This can encode our earlier example as:

{3}{1}{4}{1}{5}BBBBmadonna.jpg{6}BBBBFoligno Madonna, by Raphael{2}{7}{1}BBBBThis is Raphael's "Foligno" Madonna, painted in {8}Y1511BBBB-{8}Y1512BBBB.{2}{2}

And of course the first and second tags are only a byte each.

Now of course BEGIN and END can float to other locations in the table. Any interpreter must just bind the __BEGIN__ token to the operation of pushing a new context onto the stack, and __END__ to popping it. Un-tagged data is then just matched to the current context. Tags, values, data, etc. are all converted back to XML based exactly on their explicit positions in the data.
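Here is a sketch of that interpretation loop, producing XML text again. It builds on the hypothetical decode_tag and read_value helpers from earlier, assumes tag indexes 1 and 2 are __BEGIN__ and __END__ as in the table above, and side-steps un-tagged runs of mixed content (which would need some extra signal to distinguish them from tags):

def decode_to_xml(stream, entries, begin=1, end=2):
    out, stack = [], []
    while True:
        pos = stream.tell()
        if not stream.read(1):
            break                        # end of data
        stream.seek(pos)                 # un-read the probe byte
        index = decode_tag(stream)
        if index == end:
            out.append("</%s>" % stack.pop())
            continue
        entry = entries[index - 1]
        # A __BEGIN__ right after an element tag opens a container;
        # otherwise the element's value follows directly.
        pos = stream.tell()
        if stream.read(1) == bytes([begin]):
            stack.append(entry["name"])
            out.append("<%s>" % entry["name"])
        else:
            stream.seek(pos)
            value = read_value(stream, entry["type"])
            if isinstance(value, bytes):
                value = value.decode("utf-8")  # assume text for the sketch
            out.append("<%s>%s</%s>" % (entry["name"], value, entry["name"]))
    return "".join(out)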


SUMMARY

With the base information and the reference table we can encode all of the information from an XML file into several different, tighter representations, while still allowing some minimal aspects of readability. PDFs too can be quickly examined, even though chunks of them are quite unreadable to the human eye.

There are a whole lot of other games we could play to further bring down the data size, but the most important quality of this format is that it stays mostly readable.

Convenience in an operational environment, for quickly viewing files and diagnosing problems, should not be underestimated in value. Our systems are so flaky that anything we can do to make it even the tiniest bit easier to find problems is going to help ensure that these technologies don't end up in the bit bucket with the rest of the might-have-beens.

When I was first considering this representation, my ultimate goal was to essentially replace "table" in a relational database with an "XML tree". To create a new type of database.

SQL crafts set-oriented arrangements that return tables while operating on tables. We could keep the same "set" mechanics, but return a tree, while operating on forests (sets of trees).

We could also consider an XML tree to be a table with an infinite number of columns and some inter-column relationship.

I'm not sure how these ideas would work in practice, but by keeping the data and index separate in "forest" files, and then using bXML as the basis for the storage mechanism, the actual coding of this esoteric type of database should be fairly well understood.

Another point is that it should be fairly trivial to create some bXML -> XML tools, and vice versa. That way, if this format is being used to move around massive data, it won't require new special parts to get it embedded into the infrastructure. It's almost the equivalent of just zipping up the file, except that it still maintains some readability while in flow.