Wednesday, February 18, 2009

A database proposal

A sore point in the mailnews code is the message database. Most of the backend ends up being a single amorphous blob with fuzzy boundaries, amassing huge couplings between a server's implementation and the database. Add in the fact that the database (like most of mailnews, but worse) is often either poorly documented or documented in ways that are just plain wrong, and you get a recipe for disaster. There's also the issue, probably the most important one, that the database has grown past its original intent.

Originally, the message database was merely a cache of the information needed for display. Since it was only a cache, it didn't matter much if it was blown away and reparsed from the original source. Well, there's the little matter of the ability to set an arbitrary property that isn't reflected in the mbox source. This capability, among other features, has made the message database a ticking time bomb. And, in essence, the bomb recently exploded when I attempted to make it usable from JavaScript.

So, in the mid-to-long term, the database needs serious fixing, not the incremental band-aids that have been applied all over it. It needs a real design to fit its modern and future purposes. Naturally, the first question is what the database needs to do. The salient points follow:

The database is really multiple, distinct components.
One part of the database is a relational database: metadata for a message that is not reflected in the message itself. If an extension wants to keep information about certain message properties (like how junk-y a message is), it would stick that information in this relational database. The second part of the database is a combination of the message store and cache. This part is what the database used to be: a store of information easily recoverable from the message store. Note that this part of the database needs to be at least partially coupled with the message store; more on this later.
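
To make the split concrete, here is a minimal sketch of how the two components might be expressed. It is purely illustrative: MsgId, MsgPropertyDb, and MsgHeaderCache are invented names, not existing mailnews interfaces.

    // Hypothetical sketch only: MsgId, MsgPropertyDb, and MsgHeaderCache are
    // invented names to illustrate the split, not existing mailnews interfaces.
    #include <cstdint>
    #include <optional>
    #include <string>

    using MsgId = uint64_t;  // persistent, database-assigned message identifier

    // Relational side: arbitrary metadata that cannot be recovered from the
    // message itself (e.g. an extension's junk score). Must never be thrown away.
    class MsgPropertyDb {
     public:
      virtual ~MsgPropertyDb() = default;
      virtual void SetProperty(MsgId id, const std::string& name,
                               const std::string& value) = 0;
      virtual std::optional<std::string> GetProperty(
          MsgId id, const std::string& name) const = 0;
    };

    // Cache side: information parsed out of the message store (subject, author,
    // offsets, and so on). Safe to blow away and regenerate from the store.
    class MsgHeaderCache {
     public:
      virtual ~MsgHeaderCache() = default;
      virtual std::optional<std::string> GetSubject(MsgId id) const = 0;
      virtual std::optional<std::string> GetAuthor(MsgId id) const = 0;
      virtual void Invalidate() = 0;  // discard cached data and mark for rebuild
    };
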
The relational database is separate from the cache database.
The cache database exposes a unique, persistent identifier for messages.
While the cache database can, and probably will, be regenerated often, the relational database is permanent. Indeed, the cache database blowing itself away should not require the relational database to do anything. At present, the cache uses ephemeral IDs as unique identifiers: IMAP UIDs (invalidated if the server changes UIDVALIDITY), mbox file offsets (invalidated if the mbox changes), or NNTP article keys (a can of worms there [1]). In my proposal, the cache would map these IDs to more persistent ones. Yes, it makes reconstructing the database more difficult, but it makes everyone else's lives easier.
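
As a rough illustration of that mapping, here is a hypothetical sketch; StoreKey, IdMap, and MsgId are made-up names, not anything in the current code.

    // Hypothetical sketch, not real mailnews code: it shows how ephemeral
    // store keys could be mapped to persistent identifiers that the
    // relational database can rely on.
    #include <cstdint>
    #include <map>
    #include <tuple>

    using MsgId = uint64_t;

    // An ephemeral key as the store sees it: an IMAP UID (scoped by
    // UIDVALIDITY), an mbox byte offset, or an NNTP article number.
    struct StoreKey {
      uint32_t generation;  // e.g. IMAP UIDVALIDITY; 0 for mbox or NNTP
      uint64_t key;         // UID, byte offset, or article number
      bool operator<(const StoreKey& o) const {
        return std::tie(generation, key) < std::tie(o.generation, o.key);
      }
    };

    class IdMap {
     public:
      // Returns the persistent id for a store key, minting a new one if the
      // key has never been seen. Relational data hangs off MsgId, so a
      // UIDVALIDITY bump only forces this mapping to be rebuilt, not the
      // metadata itself.
      MsgId IdFor(const StoreKey& key) {
        auto it = mMap.find(key);
        if (it != mMap.end()) return it->second;
        MsgId id = mNextId++;
        mMap.emplace(key, id);
        return id;
      }

     private:
      std::map<StoreKey, MsgId> mMap;
      MsgId mNextId = 1;
    };
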
The cache database may be rebuilt if the store is newer.
The cache database rebuild should be incremental.
The relational database should never be automatically rebuilt.
One of the main problems with the database as it stands is the rebuild of the cache database. It has generally been assumed that rebuilding the database would never lose information, but the database has become the only store of some of that information. I am not certain of the technical feasibility, but there is in general no need to reparse a 2GB mbox file if you compact out a few messages. Even in an IMAP UIDVALIDITY event, I expect that not all of the UIDs would change. Incrementalism would make the database more usable during painful rebuilds, but, naturally, it would require more careful coding.
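
Here is a toy sketch of the incremental idea, with invented names and none of the real parsing: only messages whose store keys appeared or vanished get touched, so compacting a few messages out of a 2GB mbox does not force a full reparse.

    #include <cstdint>
    #include <iterator>
    #include <set>
    #include <vector>

    using StoreKey = uint64_t;  // stand-in for an mbox offset or IMAP UID

    struct CacheDb {
      std::set<StoreKey> known;  // keys already parsed into the cache
      void ParseAndAdd(StoreKey k) { known.insert(k); /* parse headers here */ }
    };

    // Bring the cache in line with the store after a compaction or a partial
    // UID change, without reparsing messages the cache already knows about.
    void IncrementalRebuild(CacheDb& cache, const std::vector<StoreKey>& inStore) {
      std::set<StoreKey> live(inStore.begin(), inStore.end());
      // Parse only messages the cache has never seen.
      for (StoreKey k : live)
        if (!cache.known.count(k)) cache.ParseAndAdd(k);
      // Drop entries whose messages no longer exist in the store.
      for (auto it = cache.known.begin(); it != cache.known.end();)
        it = live.count(*it) ? std::next(it) : cache.known.erase(it);
    }
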
The cache database's rebuild policy is caller-configurable.
What I mean by this is that the cache database will be accessible via one of three calls: an immediate call that gets the database, even if invalid; a call that gets the database but spawns an asynchronous rebuild event [2]; and a call that blocks until the database finishes rebuilding, if necessary. Having an asynchronous rebuild would require the database to be thread-safe, but I expect that the future of the database already includes asynchronous calls. At the very least, it might help in some cases where we've run into thread-safety issues in the past (such as import).
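
A sketch of what those three calls might look like follows; OpenPolicy and CacheDbService are invented names, and the std::future/atomic plumbing stands in for real thread-safety work that would be considerably more involved.

    #include <atomic>
    #include <future>
    #include <memory>

    class CacheDb { /* cached header data, omitted */ };

    enum class OpenPolicy {
      Immediate,      // return whatever exists, even if known to be invalid
      AsyncRebuild,   // return the current database, rebuild in the background
      BlockOnRebuild  // do not return until the database is valid
    };

    class CacheDbService {
     public:
      std::shared_ptr<CacheDb> Open(OpenPolicy policy) {
        switch (policy) {
          case OpenPolicy::Immediate:
            break;
          case OpenPolicy::AsyncRebuild:
            if (!mValid && !mRebuild.valid())
              mRebuild = std::async(std::launch::async, [this] { Rebuild(); });
            break;
          case OpenPolicy::BlockOnRebuild:
            if (mRebuild.valid()) mRebuild.wait();
            if (!mValid) Rebuild();
            break;
        }
        return mCurrent;
      }

     private:
      void Rebuild() { /* reparse the store */ mValid = true; }
      std::shared_ptr<CacheDb> mCurrent = std::make_shared<CacheDb>();
      std::future<void> mRebuild;
      std::atomic<bool> mValid{false};
    };
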
The cache database has access to the message store.
There are three types of store: local-only, local caching remote, and remote-only.
The folder can only access the store through the database.
These points are probably the ones I'm least comfortable with, but I think they're necessary. In the long term, pluggable message stores and the store-specific mechanisms of the database mean that the cache database needs intimate access to the store. Having explicit interfaces for the message store should allow us to avoid having to subclass nsIMsgDatabase for the different storage types. Limiting access via the folder should help cut down the bloat on nsIMsgFolder. On the other hand, it would probably make the code do a lot more round-tripping, which could lead to more leaks.
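
For illustration only, here is roughly how the folder-to-database-to-store layering might look; these classes are hypothetical sketches, not the real XPCOM interfaces (which would presumably still be expressed in IDL).

    #include <cstdint>
    #include <string>

    enum class StoreKind { LocalOnly, LocalCachingRemote, RemoteOnly };

    // The store knows how to fetch raw message data; the cache database holds
    // a reference to it so it can (re)parse messages on demand.
    class MsgStore {
     public:
      virtual ~MsgStore() = default;
      virtual StoreKind Kind() const = 0;
      virtual std::string FetchRawMessage(uint64_t storeKey) = 0;
    };

    // The folder never touches MsgStore directly; it goes through the
    // database, which keeps the folder interface smaller at the cost of some
    // extra round-tripping.
    class MsgCacheDb {
     public:
      explicit MsgCacheDb(MsgStore& store) : mStore(store) {}
      std::string RawMessage(uint64_t storeKey) {
        return mStore.FetchRawMessage(storeKey);
      }

     private:
      MsgStore& mStore;
    };
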
The cache database is per-account, not per-folder.
A cleverly-designed per-account store could alleviate some problems. It would make working with cross-posted messages easier, and could, in principle, use less disk space if you move messages between folders on the same local store or cache. Copied messages could point to the same underlying message (in the spirit of filesystem hard links), so long as we don't permit local message modification.
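
Here is a toy sketch of that hard-link-style sharing under a per-account store; AccountStore and Folder are invented names, and the immutable-body assumption is exactly the "no local message modification" caveat above.

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    using MsgId = uint64_t;

    class AccountStore {
     public:
      // Bodies are immutable once stored, which is what makes sharing safe.
      MsgId Add(std::string rawMessage) {
        MsgId id = mNext++;
        mBodies.emplace(id,
            std::make_shared<const std::string>(std::move(rawMessage)));
        return id;
      }
      std::shared_ptr<const std::string> Body(MsgId id) const {
        auto it = mBodies.find(id);
        if (it == mBodies.end()) return nullptr;
        return it->second;
      }

     private:
      std::map<MsgId, std::shared_ptr<const std::string>> mBodies;
      MsgId mNext = 1;
    };

    // A folder is just a list of ids; "copying" a message into another folder
    // adds the id there without duplicating the body.
    struct Folder { std::vector<MsgId> messages; };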

If I haven't missed anything, that is how I see a future database system. Obviously, implementation would not be easy; I expect it would take at least a year or even two of concentrated work to produce something close to these ideals. There are incremental steps toward this future, but in many cases they seem to me to be towering steps (for example, introducing the database to the store, or making it usable from different threads). In any case, I'm interested in hearing feedback on this proposal.

[1] In recent months, some of Giganews' binary newsgroups had begun to press distressingly close to the signed 32-bit limit, which raised the question of what to do. One proposal would have been to reset the IDs or perhaps wrap around. A news client should be able to handle this case if it's practical to do so, IMO.

[2] I expect that this method would use the invalid database, although it could be implemented by having the various method calls block until the database is valid. Since a caller could just as well use the blocking-get-database call for that, this approach makes significantly less sense to me.