Thursday, January 23, 2014

Charsets and NNTP

Recently, the question of charsets came up in the context of which decoders Thunderbird needs to support. After much hemming and hawing about how to find this out (which included a plea to the IMAP-protocol list for data), I remembered that I actually had this data. Long-time readers of this blog may recall that I did a study several years ago on the usage share of newsreaders. After that, I was motivated to take my data collection to the extreme. Instead of considering only the "official" Big-8 newsgroups, I looked at all of the newsgroups on the news server I use (effectively, all but alt.binaries). Instead of pulling only the headers I needed from the server, I grabbed all of them: the script literally runs HEAD and saves the results in a database. And instead of a month of results, I grabbed the results for the entire year of 2011. And then I sat on the data.

After recalling Henri Sivonen's pesterings about data, I decided to see how suitable my dataset was for this task. For data management reasons, I only took the data from the second half of the year (about 10 million messages). I know from memory that the quality of Python's message parser (which was used to extract data in the first place) is surprisingly poor, which introduces a bias of unknown size into my data. Since I only extracted headers, I can't identify charsets for anything which was sent as, say, multipart/alternative (which is more common than you'd think), which introduces further systematic bias. The end result is approximately 9.6M messages that I could extract charsets from and use for further research.
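As a rough illustration of what header-only charset extraction looks like, here is a minimal sketch using Python's standard `email` package. The function name `charset_of` is hypothetical and this is not the script used for the study; it only shows why multipart messages drop out when the body is unavailable.

```python
from email.parser import HeaderParser


def charset_of(raw_headers):
    """Return the declared charset of a header block, lowercased, or None.

    Hypothetical helper for illustration; not the original study's code.
    """
    msg = HeaderParser().parsestr(raw_headers)
    # For multipart messages the charsets are declared per part, inside the
    # body, so nothing can be said from the top-level headers alone.
    if msg.get_content_maintype() == "multipart":
        return None
    # None when there is no Content-Type header or no charset parameter.
    return msg.get_content_charset()


sample = "User-Agent: ExampleReader/1.0\nContent-Type: text/plain; charset=ISO-8859-1\n\n"
print(charset_of(sample))
```

Messages that return `None` here are the ones that would fall out of a charset count, which is one source of the gap between the roughly 10 million messages and the 9.6M with usable charsets.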

Discussions revealed one particularly surprising tidbit of information. The most popular charset not accounted for by the Encoding specification was IBM437. Henri Sivonen speculated that the cause was some crufty old NNTP client on Windows using that encoding, so I endeavored to build a correlation database to check that assumption. Using the wonderful magic of d3, I produced a heatmap comparing distributions of charsets among various user agents. Details about the visualization may be found on that page, but it does refute Henri's hypothesis when you dig into the data: the IBM437 usage appears to be caused by specific BBS-to-news gateways, and is mostly localized in particular BBS newsgroups.
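The kind of cross-tabulation behind such a heatmap can be sketched in a few lines. This assumes the data has already been reduced to (user agent, charset) pairs; the function name and the sample values are made up for illustration and are not taken from the actual dataset.

```python
from collections import Counter, defaultdict


def charset_share_by_agent(records):
    """Map each user agent to the fraction of its messages using each charset.

    `records` is an iterable of (user_agent, charset) pairs (assumed shape).
    """
    counts = defaultdict(Counter)
    for agent, charset in records:
        counts[agent][charset] += 1
    shares = {}
    for agent, per_charset in counts.items():
        total = sum(per_charset.values())
        shares[agent] = {cs: n / total for cs, n in per_charset.items()}
    return shares


# Invented sample data, only to show the shape of the result.
records = [
    ("bbs-gateway", "ibm437"),
    ("bbs-gateway", "ibm437"),
    ("bbs-gateway", "utf-8"),
    ("ExampleReader", "utf-8"),
]
for agent, dist in charset_share_by_agent(records).items():
    print(agent, dist)
```

Each row of the resulting mapping corresponds to one row of a heatmap: a charset that is concentrated in a handful of agents shows up as a few hot cells rather than an even spread.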

Also found on that page are some fun discoveries of just what kind of crap people try to pass off as valid headers. Some of those User-Agents are clearly spoofs (Outlook Express and family used the X-Newsreader header, not the User-Agent header). There also appears to be a fair amount of mojibake in headers (one of them appeared to be venerable double mojibake). The charsets also have some interesting labels: the "big5\n" and the "(null)" illustrate that some people don't double-check their code very well, and not shown are the 5 examples of people who think charset names have spaces in them. A few people appear to have mixed up POSIX locales with charsets as well.
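The categories of bad labels mentioned above can be caught with a few simple checks. The following is a hedged sketch: the function name, the category strings, and the locale regular expression are all assumptions made for illustration, not rules taken from the dataset or from any specification.

```python
import re

# Heuristic for POSIX-locale-looking strings such as "en_US.UTF-8" (assumed pattern).
_LOCALE = re.compile(r"^[A-Za-z]{2,3}_[A-Za-z]{2}(\.[\w-]+)?(@\w+)?$")


def label_problem(label):
    """Return a short description of what is wrong with a charset label, or None."""
    if label != label.strip():
        return "stray whitespace"      # e.g. "big5\n"
    if label == "(null)":
        return "null placeholder"      # a null pointer was printed as a string
    if " " in label:
        return "embedded space"        # charset names do not contain spaces
    if _LOCALE.match(label):
        return "posix locale"          # locale name used where a charset belongs
    return None


for label in ["big5\n", "(null)", "iso 8859-1", "en_US.UTF-8", "utf-8"]:
    print(repr(label), "->", label_problem(label))
```

A check like this only flags suspicious labels; deciding what the sender actually meant still takes a look at the message itself.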
