Friday, May 21, 2010

Developing new account types, Part 3: Updating folders (part 2)

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to explain the necessary actions in creating new account types. I hope to add a new post once every two weeks (I cannot guarantee it, though).

This blog post is a continuation of my previous post, which is being broken up into multiple segments to lower the amount of text one has to read in a single sitting. The current step is to actually implement the folder update.

Folder updating

To actually achieve our goal of getting a correct message list, we are going to modify the implementation of updateFolder. This function is called whenever a folder is selected in the folder pane; conceptually, you can view the function as causing the cached database to be resynchronized with the actual folder. For example, this is where a local folder would actually reparse the mailbox if the database was incorrect or missing.

This function essentially consists of three steps: figure out new messages, process them (i.e., apply filters), and then announce to the world that they exist. Some account types (like IMAP) may need to do more involved message processing, but this is the general gist of what goes on [4]. I'll ignore the processing step until I start talking about filters.

Database Details Devil

To start with, I'll cover the last step. Announcing to the world that a message exists boils down to adding a new header to the database. So how do you add a new header to the database? It requires three easy steps: create the header, populate the fields, and then add it to the database. With the proper listener setup, all of the other notification is done for you automatically. But as they say, the devil is in the details.

Let me begin by explaining some things about messages. There are five different representations of the message: the message key, the message header, the message ID, the message URI, and the necko URL object. Siddharth Agarwal has a nice diagram that shows how to convert between these representations. The last two are more concerned with displaying messages; it is the first three that are interesting right now.

Message keys are the internal database key for a message; the tuple (folder, key) is guaranteed to be unique by the database. Message keys are unsigned 32-bit integers (with 0xFFFFFFFF, or -1 in 2's complement, reserved as the "no message here" key). In general, any time a property needs to refer to another message, the message key is used; as a consequence, it means that such properties cannot refer to stuff across folders.
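The "0xFFFFFFFF, or -1" remark can be checked in plain JavaScript. This is only a sketch: kNoMessageKey, toUint32, toInt32, and isValidKey are illustrative names invented here, not part of the mailnews API.

```javascript
// Illustrative names only; these are not part of the mailnews API.
const kNoMessageKey = 0xFFFFFFFF;

// Reinterpret a JS number as an unsigned 32-bit value.
function toUint32(n) {
  return n >>> 0;
}

// Reinterpret a JS number as a signed 32-bit value.
function toInt32(n) {
  return n | 0;
}

// A key is usable if it is an unsigned 32-bit integer other than the
// reserved "no message here" value.
function isValidKey(key) {
  return Number.isInteger(key) && key >= 0 && key < kNoMessageKey;
}

console.log(toInt32(kNoMessageKey));          // -1
console.log(toUint32(-1) === kNoMessageKey);  // true
console.log(isValidKey(kNoMessageKey));       // false
```

Whatever you pick as a message key therefore has to fit in 32 bits and must never collide with the reserved value.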

Message IDs are the RFC 5322 identifier for a message. These identifiers are supposed to be unique (for logical messages, not in a "the message at offset 0x234f3d in this file" sense). The most important use case for message IDs is that they are a critical component for threading.

The message header object is an object of type nsIMsgDBHdr. These objects are directly backed by the database. However, many of the properties do not notify the database of changes, so you generally do not want to actually set them. Like all generalities [5], there are exceptions to this rule. Right now, we want to manipulate headers before adding them to the database, and therefore we do not want to notify people of changes to not-yet-existing headers, so we want to actually use the fields of nsIMsgDBHdr.

So, the first thing you need to do is to decide what your message key is. Message keys are going to be used to get the message URI, so it should be a property that is easy to associate with methods. IMAP uses message UIDs, local folders the offset into the mbox [6], and NNTP uses the key numbers in the group. In my case, it appears that the forum assigns each post a unique number, so that is what I'll use.

After the message key, the most important properties are the major ones for display. The author attribute correlates to the "From" header, subject to the "Subject" header, and date to the "Date" header. All of these will be used to generate values in the thread pane columns; things would look strange without these.

The other major property in the display is flags. Flags, as the name implies, is an integer where each bit corresponds to a different flag. The most important of these are probably HasRe, Flagged, and New. Flags should be set with OrFlags and AndFlags instead of manipulating the value directly. And don't set these values with the mark* methods, as these cause notifications to be fired (remember that we haven't added the message to the database yet).
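To see what "or" and "and" do to the flag bits, here is a toy model in plain JavaScript. FakeHdr is a made-up stand-in for a header object, and the bit values are placeholders chosen for illustration rather than values copied from Ci.nsMsgMessageFlags; only the bitwise logic is the point.

```javascript
// Placeholder bit values; look up the real ones in nsMsgMessageFlags.
const Flags = { HasRe: 0x10, New: 0x10000 };

// A toy stand-in for a not-yet-added header; only the flag logic is modeled.
function FakeHdr() {
  this.flags = 0;
}
FakeHdr.prototype = {
  // Set every bit in mask, leaving the others alone; returns the new flags.
  OrFlags: function (mask) {
    this.flags = (this.flags | mask) >>> 0;
    return this.flags;
  },
  // Keep only the bits that are also in mask; returns the new flags.
  AndFlags: function (mask) {
    this.flags = (this.flags & mask) >>> 0;
    return this.flags;
  }
};

let hdr = new FakeHdr();
hdr.OrFlags(Flags.HasRe | Flags.New);
hdr.AndFlags(~Flags.New);                      // clear New, keep the rest
console.log((hdr.flags & Flags.HasRe) !== 0);  // true
console.log((hdr.flags & Flags.New) !== 0);    // false
```

Setting a flag is an "or" with its bit; clearing one is an "and" with the complement of its bit.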

If you want to do real threading, you will want to set message IDs and references [7]. The References header is a space-separated list of message ID tokens (wrapped in angle brackets), although the parser routine in the database does a pretty good job of ignoring any random crap. The list is in the reverse order of hierarchy, so the last element is the message's parent, second-to-last the grandparent, etc.
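A rough sketch of how a References value breaks down into ancestors, in plain JavaScript. This is not the database's parser; parseReferences and ancestorsNearestFirst are hypothetical helpers that only mimic the "take the bracketed tokens, ignore the rest" behavior described above.

```javascript
// Extract message-ID tokens from a References-style string, ignoring
// anything that is not wrapped in angle brackets.
function parseReferences(references) {
  let ids = [];
  let re = /<([^<>\s]+)>/g;
  let match;
  while ((match = re.exec(references)) !== null)
    ids.push(match[1]);
  return ids;
}

// Ancestors ordered nearest first: parent, grandparent, and so on.
function ancestorsNearestFirst(references) {
  return parseReferences(references).reverse();
}

let refs = "<root@example.invalid> junk <mid@example.invalid>\t<parent@example.invalid>";
console.log(ancestorsNearestFirst(refs)[0]);  // parent@example.invalid
console.log(parseReferences(refs).length);    // 3
```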

Threading is implemented in the following manner. First, the database attempts to find a message for each message ID in References, working from the last entry backwards. If it finds one, that message is made the parent header and the search stops. Otherwise, if correct threading is enabled, an attempt is made to find a thread which has that message ID. Otherwise, if strict threading is not enabled, a thread that has a message with the same subject (without Re) is used as the thread. If threading without Re is disabled, the message has to have the HasRe flag set to perform this last step. Finally, if a thread could not be found by this point, a new one is created.

To combine messages in a thread, then, the References field needs to be set for the messages. If people enable correct threading (this is done by default), you can use a simple trick: create a valid message ID for each thread and stuff that as the References header.
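The lookup order, and why the per-thread message ID trick works, can be modeled with a few Maps. Everything in this sketch is illustrative: the Map-based "database", the pref names (correctThreading, strictThreading, threadWithoutRe), and the functions are invented for the example and only mirror the order of checks described above, not the real implementation.

```javascript
// A toy model of the lookup order only.
function makeDb() {
  return {
    threadOfMessage: new Map(),   // message ID -> thread, for stored messages
    threadOfReference: new Map(), // referenced ID -> thread that mentioned it
    threadOfSubject: new Map(),   // subject (without Re) -> thread
    nextThreadId: 1
  };
}

function chooseThread(db, msg, prefs) {
  // Walk References from the last entry (the parent) backwards.
  let nearestFirst = msg.references.slice().reverse();
  for (let id of nearestFirst) {
    if (db.threadOfMessage.has(id))
      return { thread: db.threadOfMessage.get(id), parent: id, how: "parent" };
  }
  if (prefs.correctThreading) {
    for (let id of nearestFirst) {
      if (db.threadOfReference.has(id))
        return { thread: db.threadOfReference.get(id), parent: null, how: "reference" };
    }
  }
  let subjectAllowed = !prefs.strictThreading &&
                       (prefs.threadWithoutRe || msg.hasRe);
  if (subjectAllowed && db.threadOfSubject.has(msg.subject))
    return { thread: db.threadOfSubject.get(msg.subject), parent: null, how: "subject" };
  return { thread: db.nextThreadId++, parent: null, how: "new" };
}

function addMessage(db, msg, prefs) {
  let result = chooseThread(db, msg, prefs);
  db.threadOfMessage.set(msg.messageId, result.thread);
  for (let id of msg.references) {
    if (!db.threadOfReference.has(id))
      db.threadOfReference.set(id, result.thread);
  }
  if (!db.threadOfSubject.has(msg.subject))
    db.threadOfSubject.set(msg.subject, result.thread);
  return result;
}

// The trick: every post references a synthetic per-thread ID that is never
// stored as a message itself.
let prefs = { correctThreading: true, strictThreading: true, threadWithoutRe: false };
let db = makeDb();
let first = addMessage(db, { messageId: "1@forum", references: ["thread-7@forum"],
                             subject: "Hello", hasRe: false }, prefs);
let second = addMessage(db, { messageId: "2@forum", references: ["thread-7@forum"],
                              subject: "Hello", hasRe: true }, prefs);
console.log(first.how, second.how, first.thread === second.thread);
// new reference true
```

The second post lands in the first post's thread purely because both mention the same synthetic ID, even though no message with that ID exists.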

A practical example

In my case, I have an author (without email addresses), a subject (with possible non-ASCII text but without Re: stuff), a date in a standard format, as well as a simple per-thread unique identifier for message keys. I also want to make threads—although this will only be two-level threads. Ideally, I should also be flagging the sticky threads, but I'll leave that for a later version. So what does this code look like?

_loadThread: function (document, firstMsgId) {
  let database = this._folder.getDatabase();
  let conv = Cc['@mozilla.org/messenger/mimeconverter;1']
               .getService(Ci.nsIMimeConverter);
  let subject = /* one for the thread */;
  let hostname = this._folder.server.hostName;
  let charset = document.characterSet;
  /* for each new message */ {
    let postID = /* generate msg key */;
    let author = /* get author name */;
    let date = new Date(/* get text string*/);
    let msgHdr = database.CreateNewHdr(postID);
    // The | is to prevent accidental message delivery
    msgHdr.author = conv.encodeMimePartIIStr_UTF8(
      author + " <" + author + "@" + hostname + "|>", true, charset, 0, 72);
    msgHdr.subject = conv.encodeMimePartIIStr_UTF8(subject, false, charset,
      0, 72);
    // PRTime is in µs, JS date in ms
    msgHdr.date = date * 1000;
    msgHdr.Charset = charset;
    msgHdr.messageId = postID + "@" + document.documentURI;
    if (firstMsgId) {
      msgHdr.setReferences("<" + firstMsgId + ">");
      msgHdr.OrFlags(Ci.nsMsgMessageFlags.HasRe);
    } else {
      firstMsgId = msgHdr.messageId;
    }
    msgHdr.OrFlags(Ci.nsMsgMessageFlags.New);
    database.AddNewHdrToDB(msgHdr, true);
  }
}

First, we get a reference to the database. Remember we implemented this in our last step, so this shouldn't present any problems. We also get the things that are shared in this thread: the subject, hostname of the server, and the charset. For each of the posts, we collect the post ID, the author, and the date of the post as text strings, and then convert them into an integer, string, and a date respectively.

Using the CreateNewHdr function, we get a new message header that we can manipulate. Since I'm trying to be aware of non-ASCII text, I'm using the MIME encoding strings to prepare the author and subject. Remember that the MIME specifications want you to encode non-ASCII text in the headers; the function we use is the simplest way to do the encoding.
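For a feel of what an encoded header looks like, here is a minimal sketch of RFC 2047 "B" encoding in plain JavaScript. It is not what nsIMimeConverter does internally: it assumes a Node-style Buffer is available, always uses UTF-8, and ignores line folding and the length limit on encoded words. encodeHeaderWord and decodeHeaderWord are hypothetical helper names.

```javascript
// Encode a header value as one RFC 2047 "B" encoded-word if it contains
// non-ASCII characters; pure ASCII passes through untouched.
function encodeHeaderWord(text) {
  if (/^[\x00-\x7F]*$/.test(text))
    return text;
  return "=?UTF-8?B?" + Buffer.from(text, "utf8").toString("base64") + "?=";
}

// Reverse of the above for a single UTF-8 "B" encoded-word; anything else
// is returned unchanged.
function decodeHeaderWord(word) {
  let match = /^=\?UTF-8\?B\?([A-Za-z0-9+\/=]*)\?=$/i.exec(word);
  return match ? Buffer.from(match[1], "base64").toString("utf8") : word;
}

console.log(encodeHeaderWord("Re: hello"));  // Re: hello
console.log(encodeHeaderWord("\u00e9"));     // =?UTF-8?B?w6k=?=
```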

If you're not working with actual email, the from string can be contorted. What I did was to create a fictitious email address that could theoretically be tied back to the author in a systematic way (for a possible future compose code that does forum private messaging). The purpose of the pipe character at the end is to prevent accidental mail delivery; I also used the hostName and not the realHostName, so this email address would be traceable even if the user changes the host name on me.

The message date I have is a formatted string; the Date constructor is pretty handy at converting most forms of these strings into a usable JS Date object. A JS Date is measured in milliseconds, whereas the date attribute is a PRTime, which is measured in microseconds, so I need to multiply by 1000 to actually set the property. Ironically, the date is actually stored in seconds in the database and is converted to and from microseconds on the fly.
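The unit juggling is easy to get wrong, so here is the arithmetic spelled out in plain JavaScript. The helper names are made up for this sketch; the only facts used are the ones above (JS dates in milliseconds, PRTime in microseconds, storage in seconds).

```javascript
const kUsecPerMsec = 1000;
const kUsecPerSec = 1000000;

// JS Date (milliseconds since the epoch) -> PRTime (microseconds).
function jsDateToPRTime(date) {
  return date.getTime() * kUsecPerMsec;
}

// PRTime -> whole seconds, which drops any sub-second precision.
function prTimeToSeconds(prTime) {
  return Math.floor(prTime / kUsecPerSec);
}

// Whole seconds -> PRTime.
function secondsToPRTime(seconds) {
  return seconds * kUsecPerSec;
}

let posted = new Date(1500);          // 1.5 seconds after the epoch
let prTime = jsDateToPRTime(posted);
console.log(prTime);                                    // 1500000
console.log(secondsToPRTime(prTimeToSeconds(prTime)));  // 1000000
```

The round trip through seconds shows why a date read back from a header may differ from the one you set by up to a second.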

The Charset attribute, apparently only used for search right now, is derived from the character set as reported by the DOM. This means that it is the same character set as would be assumed by the layout engine, including character set overrides.

The message ID is simpler to generate: valid URIs are pretty much valid right-hand-sides of a message ID. A post is pretty much representable as a tuple of the thread page and the path to the post in the DOM, so this message ID is also an easy way to get to the message. References are also generated as I described above; in a later version, I may try to do sniffing to figure out from quoting who is replying to whom and recreate actual threads. Note that when setting the message ID, the outer angle brackets are optional.
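As a sketch of how such a message ID can be built and taken apart again, assuming the post ID is numeric (so the first "@" always separates it from the URI). makeMessageId and splitMessageId are hypothetical helpers, and the forum URL is a placeholder.

```javascript
// Build the right-hand side from the page URI, as in the code above.
function makeMessageId(postID, pageURI) {
  return postID + "@" + pageURI;
}

// Recover the post ID and page URI; outer angle brackets are tolerated.
function splitMessageId(messageId) {
  let id = messageId.replace(/^<|>$/g, "");
  let at = id.indexOf("@");
  return { postID: id.substring(0, at), pageURI: id.substring(at + 1) };
}

let id = makeMessageId(1234, "http://forum.example.invalid/viewtopic.php?t=42");
let parts = splitMessageId("<" + id + ">");
console.log(parts.postID);   // 1234
console.log(parts.pageURI);  // http://forum.example.invalid/viewtopic.php?t=42
```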

The last thing I set is the flags. A complete listing of flags can be found on MDC. In this case, the only flags I care about are HasRe (since I want to generate "Re:" headers) and New; most of the others will probably be set by the user in the UI.

Finally, we add the header to the database. The last parameter tells the database to tell anyone listening that we have a new message. After we have loaded all of the messages, we need to commit the database:

database.Commit(Ci.nsMsgDBCommitType.kLargeCommit);

A brief note to make here: it doesn't really matter whether you do a large or a session commit, since they both end up doing the same thing. Small commits end up doing nothing.

Notes

  4. Like most synchronization stuff, you theoretically also have to deal with deletion on the remote side as well as read changes, etc. The more I think about it, the more I'm torn on whether or not I should implement it. For now, I'll recommend that you weigh the cost of trying to determine deleted messages versus the commonality of deletion or other modification.
  5. Except, I am told, that all words that end in -tion in French are female.
  6. Incidentally, this is a major part of the reason why there is a 4 GiB limit on mailbox size in Thunderbird and SeaMonkey.
  7. What about In-Reply-To, you may ask. This information is pretty much redundant with References, so what happens is that, for the purposes of computing threading, this header is appended to the References header. And you do this before calling on the database header.

Wednesday, May 12, 2010

Developing new account types, Part 3: Updating folders (part 1)

In the previous blog post, I showed how to get an empty message list displayed in the folder pane. The next step is to actually implement the folder update. Since this task involves several subtasks, I will be breaking this step into multiple blog posts.

Getting a DOM for HTML

In terms of webscraping, I treat the first step as simply turning a URI into a DOM. The developer center actually has some good resources on this, if you have access to a document object. The issue, though, is getting a document object, since your code will likely be running from an XPCOM component [1]. What is needed, then, is a utility method for loading the DOM. This is the code I've been using:

function asyncLoadDom(uri, callback) {
  let doc = Cc['@mozilla.org/appshell/window-mediator;1']
              .getService(Ci.nsIWindowMediator)
              .getMostRecentWindow("mail:3pane").document;
  let frame = doc.createElement("iframe");
  frame.setAttribute("type", "content");
  frame.setAttribute("collapsed", "true");
  doc.documentElement.appendChild(frame);
  let ds = frame.webNavigation;
  ds.allowPlugins = ds.allowJavascript = ds.allowImages = false;
  ds.allowSubframes = false;
  ds.allowMetaRedirects = true;
  frame.addEventListener("load", function (event) {
    if (event.originalTarget.location.href == "about:blank") return;
    callback(frame.contentDocument);
    doc.documentElement.removeChild(frame);
  }, true);
  frame.contentDocument.location.href = uri;
}

The first argument is the URI to load, as a string, and the second argument is the function to be called back with the DOM document as its sole argument. An added benefit to this method is that it also uses an asynchronous callback method, so you're not blocking the UI while you wait for the page to download. This code will likely not be called except by the protocol object, though, since we probably want to throttle the number of pages loaded up at once.

The protocol object

Earlier, I mentioned that one of the implemented objects wasn't actually mandatory. This object was the protocol object. An instance of this object is meant to wrap around an actual connection to the server; where you don't need to connect to a server, this object might not be worth implementing. In reality, it is still a useful thing to have if you have a non-trivial account type—any time a task is more complicated than "load this thing and use it," a protocol object can help with managing multiple subtasks.

For a wire protocol, the implementation of this object should be straightforward. It would essentially be a state machine, with an idle state entered after setting up the connection during which the instance can accept tasks to do. A state machine could also be done for webscraping-based account types, but I am using a more queue-based approach due to how I have structured the web loads.

At a high level, server requests are chunked at two levels. On the higher level, the application makes calls to functions like updateFolder; these calls I have decided to term tasks. The lower level requests are the requests you communicate to the server; for lack of any better terminology, I will refer to these as states [2]. In my implementation, I keep two queues, one for each of these.

Managing the queue for tasks is best done at the server. The overall logic is actually rather simple:

const kMaxProtocols = 2;
wfServer.prototype = {
  /* Queued tasks to run on the next open protocol */
  _queuedTasks: [],
  _protocols: [],
  runTask: function (task) {
    if (this._protocols.length < kMaxProtocols) {
      let protocol = new wfProtocol(this);
      protocol.loadTask(task);
      this._protocols.push(protocol);
      return;
    }
    for (let i = 0; i < this._protocols.length; i++) {
      if (!this._protocols[i].isRunning) {
        this._protocols[i].loadTask(task);
        return;
      }
    }
    this._queuedTasks.push(task);
  },
  getNextTask: function (task) {
    if (this._queuedTasks.length > 0)
      return this._queuedTasks.shift();
    return null;
  },
};

The runTask method is designed to be called with a task object; for the core mailnews protocols, this is primarily being called by the service [3]. For now, I've made the value for the maximum number of protocol objects unchangeable, but it is probably better to allow this value to be configurable via a per-server preference.

The core implementation of the protocol running object for webscraping is not too difficult:

const kMaxLoads = 4;
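function wfProtocol(server) {
  this._server = server;
  // Give each protocol its own queue; an array literal on the prototype
  // would be shared by every wfProtocol instance.
  this._urls = [];
}
wfProtocol.prototype = {
  /// Queued URLs; first kMaxLoads are the currently running
  _urls: null,
  /// The current task
  _task: null,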
  /// Load the next URL; if all URLs are finished, finish the task
  onUrlLoaded: function (url) {
    if (this._urls.length > kMaxLoads)
      this._urls[kMaxLoads].runUrl();
    this._urls.shift();
    if (this._urls.length == 0)
      this.finishTask();
  },
  /**
   * Queue the next URL to load.
   * Any extra arguments will be passed to the callback method.
   * The callback is called with this protocol as the this object.
   */
  loadUrl: function (url, callback) {
    let closure = this;
    let task = new UrlRunner(url, this);
    let argcalls = [null];
    for (let i = 2; i < arguments.length; i++)
      argcalls.push(arguments[i]);
    task.onUrlLoad = function (dom) {
      argcalls[0] = dom;
      callback.apply(closure, argcalls);
    };
    this._urls.push(task);
    if (this._urls.length <= kMaxLoads)
      task.runUrl();
  },
  /// Run the task
  loadTask: function (task) {
    this._task = task;
    this._task.runTask(this);
  },
  /// Handle a completed task
  finishTask: function () {
    let task = this._server.getNextTask();
    this._task.onTaskCompleted(this);
    if (task)
      this.loadTask(task);
  }
};
/// An object that represents a URL to be run
function UrlRunner(url, protocol) {
  this._url = url;
  this._protocol = protocol;
}
UrlRunner.prototype = {
  runUrl: function () {
    let real = this;
    asyncLoadDom(this._url, function (dom) {
      real.onUrlLoad(dom);
      real._protocol.onUrlLoaded(real._url);
    });
  },
  onUrlLoad: function (dom) {}
};

The protocol is initialized by calling loadTask, which calls runTask on the task object. The task then makes some calls to loadUrl, each of which starts loading right away as long as fewer than kMaxLoads URLs are in flight. When a URL has loaded, via UrlRunner.runUrl, the callback function is called and then the onUrlLoaded function is called to remove the URL from the queue and start the next waiting one. When this function detects that no more URLs are queued (which is why the callback is called before this function is, since the callback may queue more URLs), finishTask is called, which notifies the task object through onTaskCompleted and then loads the next queued task, if there is one.

The working of loadUrl bears special mention. The first argument is the URL (as a string) to be loaded. The second argument is the method on wfProtocol to be called when the URL is loaded. This implies that the actual code for implementing tasks is mostly contained on wfProtocol as opposed to the task objects. All subsequent arguments are passed in as arguments to the callback function; the first argument to this function is the DOM document.
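The calling convention can be tried out without any of the mailnews machinery. In this sketch, fakeLoadDom is a made-up stand-in for asyncLoadDom that hands back a fake "document" immediately, and DemoProtocol and _onThreadPage are hypothetical names; only the argument forwarding and the this binding are meant to match loadUrl.

```javascript
// Hypothetical stand-in for asyncLoadDom so the sketch runs anywhere; it
// calls back at once with a fake "document" instead of loading a page.
function fakeLoadDom(uri, callback) {
  callback({ documentURI: uri });
}

function DemoProtocol() {
  this.seen = [];
}
DemoProtocol.prototype = {
  // Same calling convention as loadUrl: url, callback, then any extras.
  loadUrl: function (url, callback, ...extra) {
    fakeLoadDom(url, (dom) => callback.apply(this, [dom, ...extra]));
  },
  // A callback living on the protocol; the DOM always arrives first.
  _onThreadPage: function (dom, folderName, pageNumber) {
    this.seen.push(folderName + " page " + pageNumber + ": " + dom.documentURI);
  }
};

let protocol = new DemoProtocol();
protocol.loadUrl("http://forum.example.invalid/t/42", protocol._onThreadPage,
                 "General", 1);
console.log(protocol.seen[0]);
// General page 1: http://forum.example.invalid/t/42
```

Because the callback is applied with the protocol as this, the handler can keep per-task state on the protocol object without extra closures.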

Notes

  1. Well, there is an nsIDOMParser which can turn text into a DOM without needing a document object. Unfortunately, it only supports XML. There is a patch for making it parse HTML, but it has gotten no traction in recent months.
  2. Just to muddle it all up, the URL instances in most mailnews implementations are actually how the tasks are implemented, although I internally use a URL to represent a state (kind of). A potentially clarifying discussion can be found in mozilla.dev.apps.thunderbird.
  3. I am not totally happy with the current model of the protocol system in mailnews, particularly with the technique of crossing over to the service to make the calls to the protocol. In my implementation, I've made those functions static functions on the protocol object. Since this is somewhat different from the current implementations and I'm not sure I want to keep this, I've couched my statements of how things work.