So this morning, waiting to start work, I cruised through the sites I can actually visit at work, and came upon this article on CNET.
The headline is alarming…ly funny. Back in 1998, this tiny company out of Toronto called i4i patented a most interesting method of data storage that, it seems, Microsoft uses (allegedly) in Word 2003 and later for its XML export.
The news reports I’ve seen so far make a big deal about this, but I must point out, this has nothing to do with XML. Pure and simple. It’s also nothing like XML. At all. Two separate concepts. XML is a language; this is a mechanism for more efficient parsing.
The patent describes a “method and system for manipulating the architecture and the content of a document separately from each other.”
XML (eXtensible Markup Language) describes the structure of a document. It’s a human-readable language for rendering text-based data. The actual tagging that describes this structure can be whatever you want, but it has to follow a few very strict but simple rules in order to be parsed (read and interpreted) correctly by a machine. It’s a subset of SGML (Standard Generalized Markup Language), which was developed years ago to provide a general way of noting structural data in the text stream of a document.
Which brings me to the need to explain why this patent is so full of awesome. To me, anyway. And maybe also to give an idea why I think Microsoft might have exploited this for more than just XML exporting, for which it’s perfect.
Consider an average document containing, say, colored text. What you see in the previous sentence is exactly what the colored words in the sentence describe: Text that is of a certain color. In order for that color information to survive quitting the program running the document, the information about the color has to be preserved somehow. How that happens in HTML is that there are tags placed before and after the words in question. So, <span style="color: #ff6600;">colored text.</span> is how the information is saved in an HTML-formatted text file.
When you want to open the document, the program doing the opening has to parse (read and interpret) the text in order to find any instances of, say, colored text, in order to render it properly. So, it has to literally pick up each letter and space in the document and check each letter and space to find tagging. In this case it’s looking for text that begins with a less-than character, <, and ends with a greater-than character, >. Whatever falls in between those characters is considered “tagging,” and is then dealt with by the program. Incidentally, if it so happens that you want to use a < or a > in your document, the program has to store those as &lt; or &gt;, so the heft of the file gets greater and greater. This is why plain-text is so much smaller than its styled counterparts: All the extra heft is to save this kind of extra information. It gets even more involved when you start considering styles.
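To make that chugging concrete, here's a minimal Python sketch of the character-by-character pass described above. The function name `scan` and its return shape are my own inventions for illustration; no real parser is this naive (it ignores comments, attributes containing >, and plenty else), but the basic grind is the same.

```python
from html import unescape


def scan(markup):
    """Walk the markup one character at a time, separating tags from text.

    Returns the plain text plus a list of (offset, tag) pairs, where offset
    is the position in the plain text at which the tag was found.
    """
    text = []  # plain characters, in order
    tags = []  # (offset into plain text, tag string)
    i = 0
    while i < len(markup):
        ch = markup[i]
        if ch == "<":
            # Everything up to the next ">" is tagging.
            end = markup.index(">", i)  # ValueError if the tag never closes
            tags.append((len(text), markup[i:end + 1]))
            i = end + 1
        elif ch == "&":
            # An escaped character such as &lt; turns back into one character.
            end = markup.index(";", i)
            text.append(unescape(markup[i:end + 1]))
            i = end + 1
        else:
            text.append(ch)
            i += 1
    return "".join(text), tags


text, tags = scan('<span style="color: #ff6600;">1 &lt; 2</span>')
print(text)  # 1 < 2
print(tags)
```

Every single character gets touched, and the escaped < costs four characters on disk to deliver one on screen.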
This is me getting real elementary, kids, so forgive me. I’m not sure how much you know.
Right now, we’re talking about a document with colored text. If that colored text is supposed to mean something–like it’s supposed to denote the beginning of a chapter–then other tagging exists that would need to be parsed for in order to denote structure. That’s why XML came into being: A simple, robust (mostly) way of rendering structure. But that’s all that XML does.
So what does all this have to do with the patent, and where pray tell is the awesome?
The bitch of opening a document with style and/or structure is the parsing. When you open a document, whatever the format, the parsing step happens first. For DOC files and other such proprietary formats, the mechanism and the file structure are optimized as much as possible (hence the proprietary part) to make that parse happen quick. For open standards like HTML or XML (or semi-open standards like Rich Text Format (RTF)), there are really two operations that happen. The first is to chug through the file character by character to find the tagging and the text, which takes forever from a programming standpoint. The second is to load that information into whatever internal model you’re using for navigation and manipulation.
This patent describes a way of structuring a file so that the tagging does not have to lie in the text stream with the text. Literally, instead of having to chug-chug-chug through a file to find the tags and then create your internal map to keep track of where the tagging is in the text, the map is kept in a separate location in the file from the text. For example, in a document that reads…
The text portion of the document would just be
A nine-character chunk of data like you’d find in a plain text file. The map would basically say that at character zero (before the first letter in this case) the tag <span style="color: #ff6600;"> would exist, and at character ten (after the exclamation point), the tag </span> would exist.
The beauty of this is that, without having to add any special characters to the text stream itself, or any extra characters at all, you can map out all kinds of descriptive information related to this chunk of plain text. You could, conceivably, put mapping in the document that tells us that at character zero would be a structure tag like <para> and at character ten another structure tag like </para>.
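Here's a rough Python sketch of that idea, and I stress it's my own toy model, not the format the patent or Word actually uses. `MappedDocument`, `tag_map`, and `render` are made-up names, and "Hi there!" is a made-up nine-character string. I'm counting offsets from zero as positions between characters, so the closing tags land at offset 9, which is just the length of the text.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class MappedDocument:
    text: str      # plain text, no tagging anywhere in it
    tag_map: list  # (offset, tag) pairs, kept apart from the text


def render(doc):
    """Weave the map back into the text to produce tagged output."""
    out = []
    pos = 0
    # sorted() is stable, so tags sharing an offset keep their listed order.
    for offset, tag in sorted(doc.tag_map, key=lambda entry: entry[0]):
        # Escape < and > only at export time; the stored text stays clean.
        out.append(escape(doc.text[pos:offset], quote=False))
        out.append(tag)
        pos = offset
    out.append(escape(doc.text[pos:], quote=False))
    return "".join(out)


doc = MappedDocument(
    text="Hi there!",
    tag_map=[
        (0, "<para>"),
        (0, '<span style="color: #ff6600;">'),
        (9, "</span>"),
        (9, "</para>"),
    ],
)
print(render(doc))
# <para><span style="color: #ff6600;">Hi there!</span></para>
```

Notice that style tags and structure tags live in the same map, and the text never grows by a single character no matter how much you pile on.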
That’s where the XML wrinkle comes in. What the suit alleges is that, when Microsoft (fucking finally) added an XML export to Word in 2003, they basically used either i4i’s engine or their concept in making that export process happen. As this cat states, this method of data storage is a “logical thing to do.”
Which brings me further into the awesome. This mapping method is pretty ingenious, and frankly, it’s how I thought Word’s file structure actually worked in the first place starting back in Word 2000. A lot of stuff changed at that time, you see, and the Document Object Model (part of the apparatus to programmatically navigate Word and its documents) changed especially. Scanning through a DOC file using a hex editor (a tool to actually look at a file as it exists as raw data), I remember noticing that the text part of the DOC seemed to be a block of essentially plain, unformatted text.
Now I remember thinking that it was odd that there should be this block of plain text in the middle of the DOC file. There’s really no good reason why such a thing should be there, because here I was thinking the text content of the DOC was embedded somewhere in all the other mess that is a DOC file in its true form.
Reading the patent presentation this morning, it clicked. That parsing mechanism would work nice for XML, sure. Or HTML or SGML or whatever, really. If I can tell you one type of tag is somewhere in a document, why not another type of tag? That’s nothing.
But why can’t I use that same mapping method to tell me where ActiveDocument.Paragraphs(1) begins? Or to otherwise make the job of populating the Word Document Object Model easier and faster?
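Purely as speculation on my part, here's what that lookup could look like in Python if a map like this sat next to the text. The `paragraph` function, the `<para>` tag names, and the sample text are all hypothetical, and this says nothing about how Word really does it. It also assumes paragraphs never nest.

```python
def paragraph(text, tag_map, n):
    """Return the text of the n-th paragraph (1-based, like Word's collections).

    Assumes non-nested <para>...</para> pairs recorded as (offset, tag) entries.
    """
    starts = [offset for offset, tag in tag_map if tag == "<para>"]
    ends = [offset for offset, tag in tag_map if tag == "</para>"]
    # No scanning of the text itself: two list lookups and one slice.
    return text[starts[n - 1]:ends[n - 1]]


text = "First one.Second one."
tag_map = [
    (0, "<para>"), (10, "</para>"),
    (10, "<para>"), (21, "</para>"),
]
print(paragraph(text, tag_map, 2))  # Second one.
```

The point is that finding a paragraph never touches the characters in between; the map already knows where everything is.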
Just a thought. One that for me contained a lot of awesome.
Note: Two relics of my past life still fascinate me to this day: XML (and relating structure in a document), and automating Photoshop. One day, maybe I’ll share more about the other one.