I'm not exactly sure how many of the pieces of software I use frequently nag me excessively about upgrading to their latest version. It's a lot. Way too many.
I guess many programmers find it really neat that their code can call home, check the version number and then automatically inform me that my version is no longer up-to-date. It is a neat trick; the only problem is, I find it exceptionally irritating.
First of all, for most releases of most of the software out there, upgrading just doesn't provide enough value to offset the risk. Risk? Yes, for a large part of our industry, new versions are just as likely to contain a really bad new bug as they are to work properly.
Upgrading has always been a risky proposition. Even major vendors of seemingly ancient (and stable) products let clunkers get out the door far too often.
I've been caught enough times, spending hours upgrading, only to find out the new version is so broken that I have no choice but to downgrade to get things working again.
Over the years I've just stopped wanting to be up-to-date. They've beaten it out of me. I'd prefer to let someone else waste their life first.
Of course, it is completely different if the new version fixes something critical (that broke in the last release), or provides an all-important feature that I just can't live without. But over the last few years, almost none of my upgrades have held any such promise or potential.
Most are relatively minor enhancements or cosmetic releases. They may make the programmers feel good about getting their code out there, but they offer little real improvement. And even when they do offer more -- a major leap, or some big new feature -- they are mostly met with disappointment, not satisfaction. I've come to dread new releases, especially those from large, open projects. They often have the worst track records.
The other big thing I find with new releases is that while most programmers can write an algorithm to call home, they utterly fail in writing something that can properly update from one version to another. It's a much more complex problem that is frequently ignored.
Most often, the result is that all of those neat little customizations that we spend hours and hours tinkering with get lost. Flushed away. All of the effort to change and adapt the software, to fit it in properly, is wasted.
Because of that, it is no surprise that I rarely customize my software anymore. Pretty much everything is straight out of the box. The default set-up.
In the olden days, when we were actually allowed to use the same version for a while, we'd spend huge efforts getting it set up just right. That amount of tailoring meant that our environments were more adapted to our efforts, that the functions we used frequently were more easily accessible. We sharpened our tools, and made sure they were ready.
Why, I often wonder, do they go through all of the hassle of providing us with a billion little customization options if, for the most part, changing any of them makes their code unstable, and the change isn't even going to be preserved from version to version? Why waste all of that time providing flexibility that we can't use? Why waste my time having to guess which options really do work properly?
I've been burned so often that I'm always shocked when some software product actually manages to get it right. It is so rare.
Nothing is more insulting than being tempted with neat options that don't actually work. It's like standing too long in front of a bakery with no money in your pocket. You can look at all of the delicious treats, but you're powerless to do anything about them.
The core of what we do as software developers is to make tools for people. Tools that aren't trustworthy, or waste time, or are just hard to use are not good tools.
It doesn't matter what's inside the code, how fancy the architecture, or how clean the algorithm, or even if the code is elegant. If the outside is dysfunctional, the system is dysfunctional. Users shouldn't have to put up with quirky behavior, or jump through hoops just to do simple things. Ultimately they have a problem which they need addressed. The solution either addresses that, or it falls short.
BATCH ME
It reminds me of a time, long ago, when one of my fellow programmers (and a close friend) was exceptionally proud of the simplicity of one of his algorithms. It was a clean and simple way to take user input, and then do the processing later in the evenings. It could allow a data entry clerk to spend five minutes queuing up an hour or so of update work. A huge time saver.
In those days, CPU was always tight, and there was a fixed window each evening for batch processing. Either the work got done in that window, or it started to interfere with the daytime computing needs.
After many months of my friend's code being in production, proving itself in a real environment, the company hired a new data entry clerk. It seems as if the old one had fallen a bit behind on her duties, so the new replacement -- looking to prove herself -- stayed late one night, entering changes. She stayed really late. Really late.
Of course, the nice, new, clean version of the code featured a 12:1 multiplier in converting entry effort into batch processing: five minutes of typing became an hour of overnight work. Not a problem for the old clerk, but a major disaster for the new one. After spending hours and hours catching up on all of the old updates, the new clerk had unknowingly doomed the system for the next day.
My friend hit the roof. He blamed the incompetence of the clerk, and of management (for not stopping her) for "ruining his clean code". He felt that the algorithm was fine, it was a failure in operations that caused the problem. Ultimately, to stop the batch processing we had to erase all of the pending updates. They got lost, became wasted effort.
Once I calmed him down, after a bit of pushing, I finally got him to realize that his "nice, clean code", as nice and clean as it was, was just plain wrong! The problem, after all, wasn't solved with nice, clean code. The problem was solved by something that could handle batching updates over a longer period. The reign of the old data clerk was the anomaly, and he owed the new one a really big apology. Whatever his code was, it wasn't solving the problem he was claiming it solved.
Code -- any of it -- whether something simple that batches up work, or something that informs us that there are new potential updates we may (or may not) be interested in, is only really good and useful if it does good and useful things for the user. If it doesn't meet the needs of the users, then it is bad code.
SELF CENTERED
Too often, we see programmers optimizing their code, or their algorithms, or even their interfaces to be what is best and easiest for themselves. Too often we see them subverting the covenant between coder and user in order to "do something" they think the user will want, but is in fact totally wrong. Or annoying, or silly, or just plain old horrible.
Programming for the sake of programming is a hobby. One that people can (and should) keep to themselves. If you are writing code to make other people's lives easier, then if the right thing is hard, messy and ugly, that doesn't make it any less right.
As for updates, sure if there is some huge, critical, emergency change, then maybe that justifies bugging me. Maybe.
If I want to go find another version, then that should be my choice. I should initiate it. If I never change versions, then the coders should be happy for me (I've found contentment).
Advice like "release early, release often" is too programmer-centric and not in the best interests of the users. It may make you feel good about getting things out there, but it's bound to cause a lot of anger and apathy in the users.
Computers are tools we use to do things, not mediums through which programmers should nag us about keeping up with their minor upgrades. That flips the relationship on its head, and in a very bad way.
So, if I'm in the middle of a task and I see a stupid dialog pop up telling me I'm a few versions behind, I pretty much know that the programmers who built it were not really thinking about my well-being. I see it as yet another bug in the system, one that helps me clearly gauge the real quality of the software.
Software is a static list of instructions, which we are constantly changing.
Sunday, September 13, 2009
bXML
Lately, I've been feeling undecided. There are so many great things I want to write about, but I fear that there are fewer and fewer people who want to read them. I'm split between not caring and just doing my own thing, or trying harder to pick more accessible topics to gain acceptance.
We're drifting farther and farther away from wanting to know -- really know -- about stuff, and getting more into just having quick but meaningless details at our fingertips. It seems to be a dangerous side-effect of the Information Age. Still, most people will eventually come to realize that even if you have the facts, without any context you can't put them into perspective. They are useless on their own.
For this post, I figured I'd just dump out an old idea that I had a long time ago for some simple development. The need, requirements and drivers for the project are all long gone, ancient history now, leaving only this little orphaned idea.
It's really just a simple spin on XML, one that compacts the format and allows it to contain binary data. It can be indexed as well, using an extra small amount of space to allow for faster internal access for large files. Split into two related files, it could even anchor some novel type of database.
The acronym I like is bXML, but the full name is probably better described as "indexed binary XML", since the index mechanism is crucial to this being more than just XML.
My inspiration comes from several of the better bits of the PDF format, which are quite intelligent and elegant, all things considered. PDFs marched out of nowhere to become an important technology in a huge number of fields, including both the Internet and the commercial printing industry. Internally they are complex, yet they do not require the demented rocket-science understanding necessary for really bad spaghetti-data formats such as Microsoft Word files. They are easy to learn, and easy to generate.
ON LARGE
I do like XML; it has that great quality where it is mostly simple, yet it contains a significant amount of power in its expressiveness. Most of the best technologies have that property.
Still, there is a necessary redundancy to it, driven by its roots in the text-based format SGML. The all-text aspect is fine, but sometimes we want to hold a larger amount of data in a more condensed form, and even a 20% overhead in size is too significant a price to pay.
No matter how much faster they are getting, machines are always way too limited to compute a lot of what we know would be interesting. Moore's law may help with video games, but it cannot keep up with our expectations for data processing. There are just too many things to capture, and too many ways to analyze the results. We're a long way away from gaining our freedom from the hardware, if that's even possible.
As such, the first and most important point of bXML is to find the smallest possible representation that still matches the power and expressiveness of XML, while also maintaining its degree of simplicity.
An intensely powerful representation that is also very obfuscated would not be particularly useful. Simplicity, especially initially, is an all-important aspect for this to work.
To compact XML data we need two things: cheaply parsable data, and a compact format for the file structure.
PARSABLE DATA
The first thing we need is to make sure the resulting format is "reasonably" parsable. Since this format can also contain binary data, variable length elements become an interesting issue.
The programming language Pascal used variable-length strings that were preceded by a count of the characters in the string. The C language was the opposite; it put an "in-band signal" -- the null character -- into the string as a terminator. These two languages represent the two main ways of managing variability within a computer: an explicit size, or a terminator.
In-band signals are easier and often more flexible, but with binary data there is always the chance that the terminator itself can appear in the data.
A way around this could be to use a custom variable length terminator. One that is explicitly picked each time because it does not belong in the current stream of binary data.
While it is a neat idea, its complexity does seem to be higher than just specifying the size at the beginning. So we should probably take the simple size road for all variable data in the file. It's slightly more expensive in size, but much cheaper in CPU, and it is simple.
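To make the trade-off concrete, here is a minimal sketch in Python (the helper names are my own) of what writing and reading a size-prefixed field might look like, assuming the four-byte network-encoded (big-endian) size used later in the format:

import struct

def write_field(buf, data):
    # append one field: a 4-byte big-endian size, then the raw bytes
    buf.extend(struct.pack(">I", len(data)))
    buf.extend(data)

def read_field(buf, pos):
    # read one size-prefixed field at pos; return (data, next position)
    (size,) = struct.unpack_from(">I", buf, pos)
    start = pos + 4
    return buf[start:start + size], start + size

Note that read_field never has to inspect the bytes themselves, so embedded nulls (or any other binary values) are harmless.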
For now, along with variable-length data, we can also have a list of well-known types with explicit sizes and interpretations. Things such as shorts, integers, dates and floating point numbers.
That type of list tends to grow with the specification getting longer and longer as it ages. Still, well-defined, compact types represent an easy and reliable way to tightly pack data.
STRUCTURE
The second thing we need is to somehow pack the tags into an arrangement that isn't highly replicated throughout the entire document.
If there are a lot of small elements with long names in a large XML file, the structural burden of the document can easily exceed that of the content. For small data this difference is not significant, but it is a real show-stopper for big things.
The best way to handle this is by stealing a few tricks from PDFs. I've always rather admired how they managed to create a binary format that was still mostly readable in an editor. It can be very useful in many operational circumstances.
They use a reference table, often stored at the end of the document, to list out the position (in characters) of every internal object in the file. While each object is delineated by itself -- you don't need the reference table to parse -- it can be quickly accessed via the table. The whole underlying scheme has a quality to it that we can use.
THE DOCUMENT
The overall document should be pretty simple, with a header, footer, reference table and of course, the data itself.
For the header and footer we can use some syntax similar to PDFs. PDF syntax was really based around Postscript syntax, but for this format, the similarity is just for show.
We can start by defining that the first line of a bXML file gives its type and version, and that the second line is just a set of four binary values (like PDF) to ensure that the data-type of the document is correctly understood.
%bXML-1.0
%âãÏÓ
Would do nicely for a header.
The next thing we want is for some of the last bytes in the file (the footer) to be an integer that gives the starting position of the reference table. Since we can use network-encoded 4-byte integers directly, we can finish the file with a consistent 11 bytes:
BBBB\n
%%EOF\n
Where BBBB is the four-byte integer character offset of the start of the reference table.
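As a sketch of how little work this leaves a reader, the reference table can be located from the last 11 bytes alone (Python again, assuming exactly the layout above):

import struct

def read_crossref_offset(path):
    # the footer is always: BBBB \n %%EOF \n  -- 11 bytes in total
    with open(path, "rb") as f:
        f.seek(-11, 2)              # seek relative to the end of the file
        footer = f.read(11)
    if footer[4:] != b"\n%%EOF\n":
        raise ValueError("missing %%EOF marker; not a bXML file?")
    (offset,) = struct.unpack(">I", footer[:4])
    return offset                   # starting position of the reference table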
In between the header and the footer comes first the data, and then the reference table that allows the data to be interpreted correctly.
THE REFERENCE TABLE
The reference table consists of one line for each unique element encoded in the data. Element attributes would be promoted into full tags as part of normalizing the data, so the following:
<painting>
<img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
<caption>This is Raphael's "Foligno" Madonna, painted in
<date>1511</date>-<date>1512</date>.</caption>
</painting>
Would have a list of element names:
painting
img
src
alt
caption
date
Now, since we want to be able to identify these tags in the data as we encounter them, we can implicitly associate each tag with an index into the reference table. Since the number of different tag types could be small or could be large, we need to use a mildly sophisticated method for encoding them.
In the data, each tag is one or more bytes. If the 8th bit of a byte is set, then the following byte is also part of the tag. In this scheme, the first 7 bits allow for 128 unique tags, the next 14 allow for 16,384 tags, and so on. There is no limit to the number of tags, but each later tag starts to eat up more space.
The algorithm for finding the index is fairly simple: keep reading bytes until one doesn't have its eighth bit set, pack all of the 7-bit pieces together, and then convert them into the index number. Since indices are always positive integers, this is a straight binary conversion.
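This is the same trick used for variable-length integers in a number of other binary formats. A sketch of both directions in Python:

def read_tag_index(buf, pos):
    # consume bytes while the 8th (high) bit is set, packing
    # the low 7 bits of each byte together
    index = 0
    while True:
        byte = buf[pos]
        pos += 1
        index = (index << 7) | (byte & 0x7F)
        if not (byte & 0x80):
            return index, pos

def write_tag_index(index):
    # split the index into 7-bit groups, most significant group first
    groups = []
    while True:
        groups.append(index & 0x7F)
        index >>= 7
        if index == 0:
            break
    groups.reverse()
    # every byte except the last carries the continuation bit
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

So tags 1 through 127 cost a single byte, while, say, tag 200 becomes the two bytes 0x81 0x48.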
Now, of course, some of the element names are duplicated, over and over again. It is up to the creator of the file as to whether these repeated elements have separate tags or not. Usually 'not' would be better as it is more space-efficient.
This leaves us with a reference table that would look something like:
crossref
painting
img
src
alt
caption
date
end-crossref
Where each entry is separated for convenience by a newline (just to make it partially readable).
MORE PACKING
So we have a reference table that relates a variable string back to a byte tag embedded in the data.
In many XML files, there are often large hierarchies of tags that repeat the same structure of elements over and over again. If we know the underlying element types, we can parse them out individually without needing a separate tag for each sub-element. In complex, highly repetitive data this is a considerable saving.
To accomplish this, in the reference table, we can just add a value that says our "depth" in a structural tree.
If our depth is not listed, or is 0, the tag is parsed as a top-level (explicit) tag. If the depth is higher, then it is an implicit tag packed in sequence behind its parent tag (which may also be implicit). In that case, the two elements of data appear side-by-side without any intervening tag (and empty elements still need a zero size indicator).
Another modification we can do is to first assume that all data in the file is variable length. In that case, each element looks like:
BBBBdddddddddddddddddddddddddddd
Where BBBB is a four-byte network encoded integer, followed by a variable array of data.
Now, for some types we may know right away that they have a fixed size and a well-known interpretation. As such, in the reference entry, while the default is variable binary data, other types can be specified too, such as a network-encoded integer, short or double.
Any type really, so long as the length is fixed and there is some reference tag for that type added to the bXML definition.
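A sketch of how a reader might branch on this, where the mapping from type names to layouts is purely hypothetical (the real list would live in the bXML definition):

import struct

# hypothetical well-known types and their fixed-size layouts; anything
# not listed falls back to size-prefixed variable binary data
FIXED_TYPES = {"short": ">h", "integer": ">i", "double": ">d"}

def read_value(buf, pos, type_name=""):
    fmt = FIXED_TYPES.get(type_name)
    if fmt is not None:
        (value,) = struct.unpack_from(fmt, buf, pos)   # fixed size, no prefix
        return value, pos + struct.calcsize(fmt)
    (size,) = struct.unpack_from(">I", buf, pos)       # default: BBBB + bytes
    start = pos + 4
    return buf[start:start + size], start + size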
All of this fiddling with extra information gives us a rather arbitrary-length reference table, where each entry contains a variable-length string and possibly some other values.
Since each table entry is already variable, we might as well accept that. A separator character like # can be used to delineate these fields, and a starting and an ending string help to delineate the boundaries of the table; "crossref" and "end-crossref" are easily readable.
Also, since all of the entries and parameters in the table are explicitly separated, we can use the ASCII representation for numbers, instead of something less readable like network encoded floats. The small extra size is a reasonable trade-off for easy readability.
Another thing that would be nice would be a set of index references for the main tags; we'd like to know where they are in the file. Since this is a one-to-many relationship, we can just use square brackets and commas to delineate a list of character positions for each tag.
So a more complete reference file might look something like:
crossref
painting##empty#[0]
img#1#empty#[1]
src#2##[1]
alt#2##[16]
caption#1##[46,103,113]
date#2#year#[98,108]
end-crossref
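Since every field is explicitly separated, pulling an entry apart is trivial. A sketch, assuming the name#depth#type#[positions] layout used above:

def parse_crossref_entry(line):
    # e.g. "date#2#year#[98,108]"
    name, depth, type_name, positions = line.split("#")
    return {
        "name": name,
        "depth": int(depth) if depth else 0,   # missing depth means top-level
        "type": type_name,                     # empty type means variable data
        "positions": [int(p) for p in positions.strip("[]").split(",") if p],
    }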
Given that table, we can then encode the file as:
{1}BBBBmadonna.jpgBBBBFoligno Madonna, by RaphaelBBBBThis is Raphael's "Foligno" Madonna, painted in Y1511BBBB-Y1512BBBB.
Where I've used {N} to represent the byte(s) needed to encode the tag index, Y1511 to represent a "year" data-type and BBBB to represent a four-byte size.
Notice that we don't need the quotes and that the caption is broken up into three parts.
All in all, it is a pretty compact representation. Of course, the reference table overhead is fairly large for small data, so in a simple example with one entry, XML is a far better representation. Still, it makes for a nearly readable file.
With the exception perhaps of pure binary data (such as an embedded JPEG), this format is entirely translatable into standard XML, and any standard XML is translatable into this format.
CONTAINING THINGS
Still, the really clever reader will have no doubt noticed a fatal flaw in the currently described scheme.
We can delineate variable elements with a fixed size that tells us the end of the data. We can specify tags, but the current mechanism doesn't really support any type of reasonable containment.
Composite tags do have structure (because of the depth indication), but even there it does not help with the positioning and embedding of the data.
This layout has no real way of embedding an arbitrary depth of explicit tags within tags. That type of nesting isn't included, but it is crucial to making XML useful, so we will have to add it.
To define a container scheme (particularly a recursive one), we generally need two explicit tokens in the data: a start and an end. Although it is tempting to use the element tag itself as the start token, that assumes that most things will be containers.
For this scheme, since we are using it to deal with massive amounts of data, it is best to assume that most tags are not actually composite. That is, that most of them contain a one-to-one relationship with some discrete piece of data. If the structure outweighs the data, then pure XML is a better choice for representation.
Now, to keep this simple, we'd really only like to introduce the most minimal syntax to accommodate the container start and ends.
One easy and inexpensive way is to embed them directly into the existing tag mechanics. We simply create a one-byte tag for each of them, drawing very little from the reference table. Thus, for the original example, we could have a table like:
crossref
__BEGIN__###[]
__END__###[]
painting##empty#[]
img##empty#[]
src###[]
alt###[]
caption###[]
date##date#[]
end-crossref
Where we've picked up the convention of wrapping system tokens in leading and trailing double underscores. BEGIN and END then simply delineate the boundaries of any tag container.
This can encode our earlier example as:
{3}{1}{4}{1}{5}BBBBmadonna.jpg{6}BBBBFoligno Madonna, by Raphael{2}{7}{1}BBBBThis is Raphael's "Foligno" Madonna, painted in {8}Y1511BBBB-{8}Y1512BBBB.{2}{2}
And of course, the first and second tag are only a byte each.
Now, of course, BEGIN and END can float to other locations in the table. Any interpreter must simply bind the __BEGIN__ token to pushing a new context onto the stack, and __END__ to popping it. Un-tagged data is then just matched to the current context. Tags, values, data, etc. are all converted back to XML based exactly on their explicit positions in the data.
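To make the mechanics concrete, here is a sketch of an encoder that walks an ordinary XML tree and emits this layout. It reuses the write_tag_index helper sketched earlier, treats attributes as plain tags, and for simplicity size-prefixes every value rather than using the well-known types -- an illustration, not a full implementation:

import struct
import xml.etree.ElementTree as ET

def encode_element(el, tags, out):
    # emit the element's tag, then either a BEGIN...END container
    # or a single size-prefixed value
    out += write_tag_index(tags[el.tag])

    def emit_data(text):
        data = text.encode("utf-8")
        out.extend(struct.pack(">I", len(data)) + data)

    if len(el) or el.attrib:                       # a container
        out += write_tag_index(tags["__BEGIN__"])
        for name, value in el.attrib.items():      # attributes become plain tags
            out += write_tag_index(tags[name])
            emit_data(value)
        if el.text and el.text.strip():
            emit_data(el.text)                     # un-tagged data in this context
        for child in el:
            encode_element(child, tags, out)
            if child.tail and child.tail.strip():
                emit_data(child.tail)              # text between child elements
        out += write_tag_index(tags["__END__"])
    else:
        emit_data(el.text or "")                   # a simple leaf value

# tag indices matching the reference table above; painting_xml stands in
# for the example document from earlier
tags = {"__BEGIN__": 1, "__END__": 2, "painting": 3, "img": 4,
        "src": 5, "alt": 6, "caption": 7, "date": 8}
out = bytearray()
encode_element(ET.fromstring(painting_xml), tags, out)

Run on the painting example, this produces exactly the {3}{1}{4}{1}... stream shown above (with the dates size-prefixed instead of typed).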
SUMMARY
With the base information and the reference table, we can encode all of the information from an XML file into several tighter representations, while still allowing it some minimal aspects of readability. PDFs, too, can be quickly examined, even though chunks of them are quite unreadable to the human eye.
There are a whole lot of other games we could play to further bring down the data size, but the most important quality in this format is to still make it mostly readable.
Convenience in an operational environment, for quickly viewing files and diagnosing problems, should not be underestimated. Our systems are so flaky that anything we can do to make it even the tiniest bit easier to find problems is going to help ensure that these technologies don't end up in the bit bucket with the rest of the might-have-beens.
When I was first considering this representation, my ultimate goal was to essentially replace "table" in a relational database with an "XML tree". To create a new type of database.
SQL crafts set-oriented arrangements that return tables while operating on tables. We could keep the same "set" mechanics, but return a tree while operating on forests (sets of trees).
We could also consider an XML tree to be a table with an infinite number of columns and some inter-column relationship.
I'm not sure how these ideas would work in practice, but by keeping the data and index separate in "forest" files, and then using bXML as the basis for the storage mechanism, the actual coding of this esoteric type of database should be fairly well understood.
Another point is that it should be fairly trivial to create bXML -> XML tools and vice versa. That way, if this format is being used to move around massive amounts of data, it won't require special new parts to get it embedded into the infrastructure. It's almost the equivalent of just zipping up the file, except that it still maintains some readability in transit.