Saturday, September 4, 2010

Usage share of newsreaders

I have noted before, by a nonscientific and utterly biased survey, that Thunderbird appeared to account for a significant share of the newsreader market (testing bug 16913 was what caused me to discover this fact). But actually finding any attempts to measure usage share of newsreaders via Google has actually been rather frustrating. You can easily find market shares of web browsers, desktop operating systems, server operating systems (though the numbers vary wildly?), and mobile platforms. But not things like email client shares or newsreader shares.

Okay, I am not about to find market shares of email clients. I have no access to anywhere near enough a representative sample that could work. But collecting newsreader market shares should not be that hard. After all, pretty much anyone can pick up a large, representative sample of news postings... connect to a NNTP server of your choice. So, seeing as how it's a three-day weekend, I thought I might as well collect the data myself. The other reason for my collecting this data was to demonstrate that a significant number of Thunderbird users are NNTP, so removing NNTP support would adversely affect the userbase.

Methodology

First off, I have to define what I mean by "usage share." Unlike other mediums, a relatively small numbers of users account for a relatively large share of NNTP postings. I've decided to measure it by the number of posts generated by each NNTP client, since it's easier to calculate, and I think it is more informative than the measuring by individual users.

I also have to pick the subset to log. For this set of data, I collected every single news article in the Big-8 newsgroups on my school's NNTP server (news.gatech.edu), which has a retention time of a month (30 days is the exact number, I think). I did not even attempt to filter out spam messages, and I did not account for cross-posting (my script managed to crash due to races a few times, so the totalized data which accounted for cross-posting was unreported).

Essentially, this is what I did. I ran a python which collected every group in the big 8 (determined by LIST ACTIVE wildmats) on the server. Then, it entered every group, performed an XOVER to find all messages, and then XHDR'd User-Agent, X-Mailer, and X-Newsreader to figure out what the user agent was. This script output, for every group, a total for each full string that represented the UA into a ~2MB csv file.

I collected all of the csv data into OpenOffice.org Calc and then ran a macro which attempted to collect the program number and the version from the UA strings. Unsurprisingly, I had to do some hacks to get it to recognize SeaMonkey and Thunderbird correctly (Mnenhy was not helping). I output tables that broke readers down by versions and by total program counts.

Results

It turns out that there is an incredibly long tail of newsreaders. There are about 250 different UA strings I found. Excluding one particularly prevalent bot and those postings for which I could not find a UA string, I found around 430,000 messages (there may be some other things dropped by copy-paste errors). Of these, the top 5 newsreaders account for just 79% of the total count. By contrast, the top 5 web browsers account for very nearly 100% of the total. Some interesting newsreaders I found in the long tail:

  • Mozilla 4.8 [en] (Windows NT 5.0; U)
  • Mozilla 3.04 (WinNT; U)
  • trn 4.0-test70 (17 January 1999)
  • Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/0.8.12
  • MyBB

Finally, here is the table of the top newsreaders:

NewsreaderTotalPercent
Google Groups18953643.94%
Thunderbird5225812.11%
Forte Agent4710010.92%
Microsoft Outlook Express410429.51%
Microsoft Windows Live Mail111962.60%
MT-NewsWatcher108722.52%
Other7935918.40%

That Google Groups has the highest market share is not surprising, but I was surprised by the strong showing of Forte Agent and the poor showing of traditional newsreaders (e.g., tin, rn-based newsreaders). I guess this goes to show you that Windows has a surprisingly large market share in the Big 8 newsgroups. For SeaMonkey enthusiasts, your newsreader has a mere 4,187 postings (with another ~5K provided by other Mozilla distributions, some of whom cannot be determined... Mnenhy made processing UA strings difficult).

In terms of individual versions, one of Outlook Express's versions clocks in #1 at 31,617 total posts, with Thunderbird 3.1.2 trailing at a "mere" 23,661. Thunderbird has around 14,000 on the 2.x branch, 11,000 on the 3.0.x branch, and 25,000 on the 3.1.x branch. There is apparently some spoofing going on for SeaMonkey users as well (I found a dozen or so Firefox entries, which I presumed is a SeaMonkey-spoofed UA string).

Another datum incidentally collected was the number of postings in each hierarchy. Here they are:

HierarchyCountLargest Newsgroup
comp.*64,360comp.soft-sys.matlab (8,399)
humanities.*2,460humanities.lit.authors.shakespeare (1,455)
misc.*28,518misc.test (8,796)
news.*31,635news.list.filters (26,238)
rec.*217,548rec.games.pinball (14,707)
sci.*47,948sci.electronics.design (6,076)
soc.*87,192soc.retirement (6,053)
talk.*12,513talk.origins (6,498)

Remind me again why we have the humanities hierarchy? Almost 60% of its messages come from a single newsgroup, and it has just 8 newsgroups.

Future Work

What could be done in the future is to expand this research into binary newsgroups. However, merely counting posts becomes a more inappropriate metric because binary newsgroups use a lot of multipart messages, so just because someone uploads a ginormous binary does not mean it should be counted 50 times. I also don't have access to any binary newsservers.

Another opportunity for fixing is to discount spam. As a brief test, I looked into only those newsgroups which had the name `moderated'--this resulted in a paltry sample of 3,272 messages. The statistics also appear to not change much, but the newsgroups are likely not a representative sample of the Big 8 anyways.

Finally, this needs to be broadened and run repeatedly so it can collect snapshots of the data across time. This metric suffers poorly at capturing historical data, but it could be an excellent way to get data every few months from the future, so long as someone collects all of the data in the future.