Click above for a PDF of my slides (sorry, I used Keynote for several reasons, and its generated HTML is huge and not likely to work well with WP). Long-time readers may notice that I re-presented a few still-on-point slides from my TxJS 2011 talk. Video will come in a few weeks, the organizers say.
dotJS 2017 was a terrific TED-style conference with top speakers and a smart, friendly audience. Special shout-out to Christophe Porteneuve for a great intro and for MCing in style — and of course many thanks to Sylvain Zimmer and team for the whole event.
“… to render and remix simulated reality as effortlessly as the web did for text and digital media.”
Now, OTOY is building RNDR, its own GPU-cloud rendering token, to decentralize AR/VR, game, and movie rendering to over 7 million GPU owners. The advantages of RNDR align with those of the BAT:
Efficiency: Utility tokens unlock access to idle or mispriced resources, e.g., GPUs for RNDR, user attention for BAT.
Fraud resistance: Tokens act as a low-fraud unit of account with payment only after blockchain-attested verification of work.
Social credit: Pre-creating a pool of tokens not for sale endows users with tokens by fiat.
(A speculative technical note on efficiency: RNDR looks well-timed, in view of GPU service lifetimes, to improve efficiency further by replacing Ethereum’s Proof of Work mining jobs with rendering jobs, as I suggested on Twitter.)
Decentralized rendering requires verification of results. Tokens do not flow until the render-job’s author confirms quality of results via sampling and testing. Renderers gain and lose scored reputation based on degree and quality of jobs completed.
Decentralized rendering requires confidentiality. In one of those technological ironies for which I live, the same kind of security hardware (e.g., ARM TrustZone) created for DRM helps solve the confidentiality problem.
In my original blog posts about OTOY, I advocated watermarking as inevitable and superior to DRM. OTOY has been developing and deploying watermarking since 2009. Large shared AR/VR worlds cannot possibly “encrypt what you see” (as DRM for fixed media tries to do). Yet creators of models and art in such shared virtual worlds need effective and fair protection, as experience from Second Life shows.
Indelible watermarking is a key part of the solution. See the “Watermarking and Encrypted Escrow Transactions” section at RNDR: A Photon-Driven Economy for more details.
I’m honored to be an advisor to OTOY and RNDR. I’m thrilled by the prospects for RNDR, BAT, and other domain-specific tokens that will bind the Metaverse economy into a coherent and equitable, yet decentralized, whole. The future will be tokenized!
On Monday, I, along with several million other people, decided to view the Great American Eclipse. Since I presently live in Urbana, IL, that meant getting in my car and driving down I-57 towards Carbondale. This route is also what people from Chicago or Milwaukee would have taken, which means traffic was heavy. I ended up leaving around 5:45 AM, which put me among the last clutch of people leaving.
Our original destination was Goreville, IL (specifically, Ferne Clyffe State Park), but some people who arrived earlier were dissatisfied with the cloudy forecast, so we moved the destination to Cerulean, KY, which meant I ended up arriving around 11:00 AM, not long before the partial eclipse started.
Partial eclipses are neat, but they're very much a see-them-once affair. When the moon first enters the sun, you get a flurry of activity as everyone puts on the glasses, sees it, and then retreats back into the shade (it was 90°F, not at all comfortable in the sun). Then the temperature starts to drop—is that the eclipse, or this breeze that started up? As more and more gets covered, it starts to dim: I had the impression that a cloud had just passed in front of the sun, and I wanted to turn and look at that non-existent cloud. And as the sun really gets covered, trees start acting as pinhole cameras and the shadows take on a distinctive scalloped pattern.
A total eclipse though? Completely different. The immediate reaction of everyone in the group was to start planning to see the 2024 eclipse. For those of us who spent 10, 15, 20 hours trying to see 2-3 minutes of glory, the sentiment was not only that it was time well spent, but that it was worth doing again. If you missed the 2017 eclipse and are able to see the 2024 eclipse, I urge you to do so. Words and pictures simply do not do it justice.
What is the eclipse like? In the last seconds of partiality, everyone has their eyes, eclipse glasses on of course, staring at the sun. The thin crescent looks first like a side picture of an eyeball. As the time ticks by, the tendrils of orange slowly diminish until nothing can be seen—totality. Cries come out that it's safe to take the glasses off, but everyone is ripping them off anyways. Out come the camera phones, trying to capture that captivating image. That not-quite-perfect disk of black, floating in a sea of bright white wisps of the corona, not so much a circle as a stretched oval. For those who were quick enough, the Baily's beads can be seen. The photos, of course, are crap: the corona is still bright enough to blot out the dark disk of the moon.
Then, our attention is drawn away from the sun. It's cold. It's suddenly cold; the last moment of totality makes a huge difference. Probably something like 20°F off the normal high in that moment? Of course, it's dark. Not midnight, all-you-see-are-stars dark; it's more like a dusk dark. But unlike normal dusk, you can see the fringes of daylight in all directions. You can see some stars (or maybe that's just Venus; astronomy is not my strong suit), and of course a few planes are in the sky. One of them is just a moving, blinking light in the distance; another (chasing the eclipse?) is clearly visible with its contrail. And the silence. You don't notice the usual cacophony of sounds most of the time, but when everyone shushes for a moment, you hear the deafening silence of insects, of birds, of everything.
Naturally, we all point back to the total eclipse and stare at it for most of the short time. Everything else is just a distraction, after all. How long do we have? A minute. Still more time for staring. A running commentary on everything I've mentioned, all while that neck is craned skyward and away from the people you're talking to. When is it no longer safe to keep looking? Is it still safe—no orange in the eclipse glasses, should still be fine. How long do we need to look at the sun to damage our eyes? Have we done that already? Are the glasses themselves safe? As the moon moves off the sun, hold that stare until that last possible moment, catch the return of the Baily's beads. A bright spark of sun, the photosphere is made visible again, and then clamp the eyes shut as hard as possible while you fumble the glasses back on to confirm that orange is once again visible.
Finally, the rush out of town. There's a reason why everyone leaves after totality is over. Partial eclipses really aren't worth seeing twice, and we just saw one not five minutes ago. It's just the same thing in reverse. (And it's nice to get back in the car before the temperature gets warm again; my dark grey car was quite cool to the touch despite sitting in the sun for 2½ hours.) Forget trying to beat the traffic; you've got a 5-hour drive ahead of you anyways, and the traffic is going to keep pouring onto the roads over the next several hours (10 hours later, as I write this, the traffic is still bad on the eclipse exit routes). If you want to avoid it, you have to plan your route away from it instead.
I ended up using this route to get back, taking 5 hours 41 minutes and 51 seconds including a refueling stop and a bathroom break. So I don't know how bad I-57 was (I did hear there was a crash on I-57 pretty much just before I got on the road, but I didn't know that at the time), although I did see that I-69 was completely stopped when I crossed it. There were small slowdowns on the major Illinois state roads at every stop sign, which could have been mitigated by stationing police cars at those intersections and effectively signalizing them temporarily, but other than that, my trip home was free-flowing at the speed limit the entire route.
Some things I've learned:
It's useful to have a GPS that doesn't require cellphone coverage to figure out your route.
It's useful to have paper maps to help plan a trip that's taking you well off the beaten path.
It's even more useful to have paper maps of the states you're in when doing that.
The Ohio River is much prettier near Cairo, IL than it is near Wheeling, WV.
The Tennessee River dam is also pretty.
Driving directions need to make the "I'm trying to avoid anything that smells like a freeway because it's going to be completely packed and impassable" mode easier to access.
Passing a car by crossing the broken yellow median will never not be scary.
Being passed by a car crossing the broken yellow median is still scary.
Driving on obscure Kentucky state roads while you're playing music is oddly peaceful and relaxing.
The best test for road hypnosis is seeing how long you can drive a long, straight, flat, featureless road. You have not seen a long, straight, flat, featureless road until you've driven something like an obscure Illinois county road where the "long, straight" bit means "20 miles without even a hint of a curve" and the "featureless" means "you don't even see a house, shed, barn, or grain elevator to break up corn and soy fields." Interstates break up the straight bit a lot, and state highways tend to have lots of houses and small settlements on them that break up endless farm fields.
Police direction may not permit you to make your intended route work.
tl;dr I’m burying the lede with context and catch-up material first, so impatient or already-clued-in readers should skip to below the videos for today’s big news. Or just read Luke Wagner‘s blog post right now.
My Fluent 2015 “ECMAScript Harmony: Rise of the Compilers” talk given on April 21st:
Jeremy Ashkenas picked up this ball and ran into the next field’s end zone two days later in Brooklyn:
My slides for the talk I gave at ModernWeb.tw on May 15th:
Perhaps you detect a theme, beyond “JS turned 20”. Something about compilers rising, Java bytecode and asm.js, shared memory threads and SIMD. The usual retort, recurrent on Hacker News, looks like this:
In an HN sub-thread sparked by my ModernWeb.tw slides, I concurred with Patrick Walton on analogies between browsers and the Unix kernel, JS and C. One does not get to write kernel code in any language, at least not yet. Elsewhere in that sub-thread the analogue of JS was identified as architecture-specific assembly code, where the CPU architecture is fixed by physical reality, while software is written in virtual languages compiled to machine code, or run through interpreters.
In any event, whether analogous to C or assembly, JS has had a lock on the “kernel” of the client side. Web “userland code”, in contrast, can be written in a great many languages. Adding a second (bytecode) syntax has looked like a “now you have two problems” loss, against the alternative of one syntax. Until now.
It’s by now a cliché that JS has become the assembly language of the Web. Rather, JS is one syntax for a portable and safe machine language, let’s say. Today I’m pleased to announce that cross-browser work has begun on WebAssembly, a new intermediate representation for safe code on the Web.
What: WebAssembly, “wasm” for short, .wasm filename suffix, a new binary syntax for low-level safe code, initially co-expressive with asm.js, but in the long run able to diverge from JS’s semantics, in order to best serve as common object-level format for multiple source-level programming languages.
It’s crucial that wasm and asm stay equivalent for a decent interval, to support polyfilling of wasm support via JS. This remains crucial even as JS and asm.js evolve to sprout shared memory threads and SIMD support. Examples of possible longer-term divergence: zero-cost exceptions, dynamic linking, call/cc. Yes, we are aiming to develop the Web’s polyglot-programming-language object-file format.
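For readers who haven't written in the subset, here is a minimal asm.js module (the module and function names are illustrative, not from any shipping code) showing the kind of code wasm is initially co-expressive with. The `"use asm"` pragma and the `|0` coercions are type annotations that let a validating engine compile the whole module ahead of time, while the same source still runs as ordinary JS — which is exactly what makes polyfilling possible:

```javascript
function AsmModule(stdlib, foreign, heap) {
  "use asm"; // opts validating engines into ahead-of-time compilation

  // add(x, y): int32 addition; the |0 coercions are asm.js type annotations
  function add(x, y) {
    x = x | 0;
    y = y | 0;
    return (x + y) | 0;
  }

  return { add: add };
}

// In a non-validating engine this is just plain JavaScript:
var mod = AsmModule(this, {}, new ArrayBuffer(0x10000));
console.log(mod.add(2, 3)); // 5
```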
Why: asm.js is great, but once engines optimize for it, the parser becomes the hot spot — very hot on mobile devices. Transport compression is required and saves bandwidth, but decompression before parsing hurts. A secondary consideration: JS has a few awkward corners even in its asm.js subset. Finally, once browsers support the WebAssembly syntax natively, JS and wasm can diverge, without introducing unsafe or inappropriate features into JS just for use by compilers sourcing a few radically different programming languages.
See the FAQ for more nuance and detail. No, JS isn’t going away in any foreseeable future. Yes, wasm should relieve JS from having to serve two masters. This is a win-win plan.
These prototypes will evolve and track draft specification changes as WebAssembly matures and receives progressively more developer testing. Other compilers than Emscripten, and other compiler frameworks than LLVM, will join the mix. I expect all engines will support fast native decoders. All these parts are means to the end of a common WebAssembly standard.
Who: A W3C Community Group, the WebAssembly CG, open to all. As you can see from the github logs, WebAssembly has so far been a joint effort among Google, Microsoft, Mozilla, and a few other folks. I’m sorry the work was done via a private github account at first, but that was a temporary measure to help the several big companies reach consensus and buy into the long-term cooperative game that must be played to pull this off.
Here you can see JF Bastien of Google’s PNaCl team barely able to keep a lid on the secret. Good thing I replied with a comic book reference as plausible cover. Close one!
Having both the PNaCl team and the V8 team from Google, along with key people from Microsoft and the asm.js and Emscripten gurus from Mozilla, collaborating closely once everyone saw the light, has been inspiring. I’d like to single out for highest praise JF Bastien, K. Gadd, and Ben Titzer of Google; Dan Gohman of Mozilla; Abhijith Chatra and Michael Holman of Microsoft; Alon Zakai of asm.js & Emscripten fame; Filip Pizlo for JavaScriptCore/WebKit; and especially asm.js/OdinMonkey mastermind Luke Wagner.
The Big Picture
In the history of computing, the dream of a universal, language-neutral intermediate form goes back to well before Melvin Conway‘s UNCOL (1958, the same year LISP was born). I remember ANDF from the ’80s, and U-something from the ’70s. No one wants the “N x M” language sources vs. machine targets problem.
Although it won’t be the only compiler framework used to generate wasm, LLVM has been a boon to this project, as to wasm’s Emscripten and PNaCl progenitors. Kudos to Chris Lattner and team.
The PNaCl folks I know are good natured, but I think some are sore at me for playing Cassandra to their Troy over the years on HN. So I want to give them major credit for wringing much undefined behavior (UB) out of LLVM — really out of all levels of abstraction, from C/C++ specs down to hardware. That kind of thankless work is a big help for WebAssembly, which must be mostly deterministic and well-defined, for interoperation and security.
Bottom line: with co-evolution of JS and wasm, in a few years I believe all the top browsers will sport JS engines that have become truly polyglot virtual machines. I predict that JS over the same timespan will endure and evolve to absorb more APIs and hardware-based affordances — but not all, where wasm carries the weight.
Again we see how the Web favors a succession of “little bangs”:
to hold community consensus along the evolutionary path, by
testing each hop for fitness with developers and implementors, and
supporting usable polyfills for older deployed browsers.
I usually finish with a joke: “Always bet on JS”. I look forward to working “and wasm” into that line — no joke.
It was brought to my attention recently by reputable sources that the recent announcement of increased usage in recent years produced an internal firestorm within Mozilla. Key figures raised alarm that some of the tech press had interpreted the blog post as a sign that Thunderbird was not, in fact, dead. As a result, they asked Thunderbird community members to make corrections to emphasize that Mozilla was trying to kill Thunderbird.
The primary fear, it seems, is that knowledge that the largest open-source email client was still receiving regular updates would impel its userbase to agitate for increased funding and maintenance of the client to help forestall potential threats to the open nature of email as well as to innovate in the space of providing usable and private communication channels. Such funding, however, would be an unaffordable luxury and would only distract Mozilla from its central goal of building developer productivity tooling.
Persistent rumors that Mozilla would be willing to fund Thunderbird were it renamed Firefox Email were finally addressed with the comment, "such a renaming would violate our current policy that all projects be named Persona."
This post is part 8 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. Part 6 discusses how email security works in practice. Part 7 discusses the problem of trust. This part discusses why email security has largely failed.
At the end of the last part in this series, I posed the question, "Which email security protocol is most popular?" The answer to the question is actually neither S/MIME nor PGP, but a third protocol, DKIM. I haven't brought up DKIM until now because DKIM doesn't try to secure email in the same vein as S/MIME or PGP, but I still consider it relevant to discussing email security.
Unquestionably, DKIM is the only security protocol for email that can be considered successful. There are perhaps 4 billion active email addresses [1]. Of these, about 1-2 billion use DKIM. In contrast, S/MIME can count a few million users, and PGP at best a few hundred thousand. No other security protocols have really caught on past these three. Why did DKIM succeed where the others failed?
DKIM's success stems from its relatively narrow focus. It is nothing more than a cryptographic signature of the message body and a smattering of headers, and is itself stuck in the DKIM-Signature header. It is meant to be applied to messages only on outgoing servers and read and processed at the recipient mail server—it completely bypasses clients. That it bypasses clients allows it to solve the problem of key discovery and key management very easily (public keys are stored in DNS, which is already a key part of mail delivery), and its role in spam filtering is strong motivation to get it implemented quickly (it is 7 years old as of this writing). It's also simple: this one paragraph description is basically all you need to know [2].
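To make that one-paragraph description concrete, here is a representative DKIM-Signature header (illustrative, not taken from a real message). The signing domain (`d=`) and selector (`s=`) together tell the recipient server where to fetch the public key: the DNS TXT record at `selector._domainkey.example.com`. `bh=` is the hash of the body, `h=` lists the signed headers, and `b=` is the signature itself:

```
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=example.com; s=selector;
        h=from:to:subject:date;
        bh=Base64HashOfBody=;
        b=Base64SignatureOverHeaders=
```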
The failure of S/MIME and PGP to see large deployment is certainly a large topic of discussion on myriad cryptography-enthusiast mailing lists, which often like to propose new paradigms for end-to-end email encryption, such as the recent DIME proposal. Quite frankly, all of these solutions suffer broadly from the same 5 fundamental weaknesses, and I see it as unlikely that a protocol will come about that can fix these weaknesses well enough to become successful.
The first weakness, and one I've harped on many times already, is UI. Most email security UI is abysmal and generally at best usable only by enthusiasts. At least some of this is endemic to security: while it may seem obvious how to convey what an email signature or an encrypted email signifies, how do you convey the distinctions between sign-and-encrypt, encrypt-and-sign, or an S/MIME triple wrap? The Web of Trust model used by PGP (and many other proposals) is even worse, in that it inherently requires users to take other actions out-of-band of email to work properly.
Trust is the second weakness. Consider that, for all intents and purposes, the email address is the unique identifier on the Internet. By extension, that implies that a lot of services are ultimately predicated on the notion that the ability to receive and respond to an email is a sufficient means to identify an individual. However, the entire purpose of secure email, or at least of end-to-end encryption, is subtly premised on the idea that other people may in fact have access to your mailbox, thus destroying the most natural ways to build trust models on the Internet. The quest for anonymity or privacy also renders untenable many other plausible ways to establish trust (e.g., phone verification or government-issued ID cards).
Key discovery is another weakness, although it's arguably the easiest one to solve. If you try to keep discovery independent of trust, the problem of key discovery is merely picking a protocol to publish and another one to find keys. Some of these already exist: PGP key servers, for example, or using DANE to publish S/MIME or PGP keys.
Key management, on the other hand, is a more troubling weakness. S/MIME, for example, basically works without issue if you have a certificate, but managing to get an S/MIME certificate is a daunting task (necessitated, in part, by its trust model—see how these issues all intertwine?). This is also where it's easy to say that webmail is an unsolvable problem, but on further reflection, I'm not sure I agree with that statement anymore. One solution is just storing the private key with the webmail provider (you're trusting them as an email client, after all), but it's also not impossible to imagine using phones or flash drives as keystores. Other key management factors are more difficult to solve: people who lose their private keys or key rollover create thorny issues. There is also the difficulty of managing user expectations: if I forget my password to most sites (even my email provider), I can usually get it reset somehow, but when a private key is lost, the user is totally and completely out of luck.
Of course, there is one glaring and almost completely insurmountable problem. Encrypted email fundamentally precludes certain features that we have come to take for granted. The lesser-known one is server-side search and filtration. While there exist some mechanisms to do search on encrypted text, those mechanisms rely on the fact that you can manipulate the text to change the message, destroying the integrity feature of secure email. They also tend to be fairly expensive. It's easy to just say "who needs server-side stuff?", but the contingent of people who do email on smartphones would not be happy to have to pay the transfer rates to download all the messages in their folder just to find one little email, nor the energy costs of doing it on the phone. And those who have really large folders—Fastmail has a design point of 1,000,000 in a single folder—would still prefer to not have to transfer all their mail even on desktops.
The better-known feature that would disappear is spam filtration. Consider that 90% of all email is spam, and if you think your spam folder is too slim for that to be true, it's because your spam folder only contains messages that your email provider wasn't sure were spam. The loss of server-side spam filtering would dramatically increase the cost of spam (a 10% reduction in efficiency would double the amount of server storage, per my calculations), and client-side spam filtering is quite literally too slow [3] and too costly (remember smartphones? Imagine having your email take 10 times as much energy and bandwidth) to be a tenable option. And privacy or anonymity tends to be an invitation to abuse (cf. Tor and Wikipedia). Proposed solutions to the spam problem are so common that there is a checklist containing most of the objections.
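The storage claim above is easy to check on the back of an envelope, using the 90% spam rate and the 10% efficiency drop from the paragraph itself:

```python
total = 100.0           # arbitrary volume of incoming mail
spam = total * 0.90     # 90% of all email is spam
ham = total - spam      # 10 units of legitimate mail stored today

leaked = spam * 0.10    # a 10% drop in filter efficiency lets this through
stored = ham + leaked   # 10 + 9 = 19 units now sitting in mailboxes

print(stored / ham)     # 1.9: stored mail nearly doubles
```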
When you consider all of those weaknesses, it is easy to be pessimistic about the possibility of wide deployment of powerful email security solutions. The strongest future—all email is encrypted, including metadata—is probably impossible or at least woefully impractical. That said, if you weaken some of the assumptions (say, don't desire all or most traffic to be encrypted), then solutions seem possible if difficult.
This concludes my discussion of email security, at least until things change for the better. I don't have a topic for the next part in this series picked out (this part actually concludes the set I knew I wanted to discuss when I started), although OAuth and DMARC are two topics that have been bugging me enough recently to consider writing about. They also have the unfortunate side effect of being things likely to see changes in the near future, unlike most of the topics I've discussed so far. But rest assured that I will find more difficulties in the email infrastructure to write about before long!
[1] All of these numbers are crude estimates and are accurate to only an order of magnitude. To justify my choices: I assume 1 email address per Internet user (this overestimates the developing world and underestimates the developed world). The largest webmail providers have given numbers that claim to be 1 billion active accounts between them, and all of them use DKIM. S/MIME is guessed by assuming that any smartcard deployment supports S/MIME, and noting that the US Department of Defense and Estonia's digital ID project are both heavy users of such smartcards. PGP is estimated from the size of the strong set and old numbers on the reachable set from the core Web of Trust.
[2] Ever since last April, it's become impossible to mention DKIM without referring to DMARC, as a result of Yahoo's controversial DMARC policy. A proper discussion of DMARC (and why what Yahoo did was controversial) requires explaining the mail transmission architecture and spam, however, so I'll defer that to a later post. It's also possible that changes in this space could happen within the next year.
[3] According to a former GMail spam employee, if it takes you as long as three minutes to calculate reputation, the spammer wins.
Several years back, Ehsan and Jeff Muizelaar attempted to build a unified history of mozilla-central across the Mercurial era and the CVS era. Their result is now used in the gecko-dev repository. While being distracted on yet another side project, I thought that I might want to do the same for comm-central. It turns out that building a unified history for comm-central makes mozilla-central look easy: mozilla-central merely had one import from CVS. In contrast, comm-central imported twice from CVS (the calendar code came later), four times from mozilla-central (once with converted history), and imported twice from Instantbird's repository (once with converted history). Three of those conversions also involved moving paths. But I've worked through all of those issues to provide a nice snapshot of the repository [1]. And since I've been frustrated by failing to find good documentation on how this sort of process went for mozilla-central, I'll provide details on the process for comm-central.
The first step and probably the hardest is getting the CVS history in DVCS form (I use hg because I'm more comfortable with it, but there's effectively no difference between hg, git, or bzr here). There is a git version of mozilla's CVS tree available, but I've noticed after doing research that its last revision is about a month before the revision I need for Calendar's import. The documentation for how that repo was built is no longer on the web, although we eventually found a copy after I wrote this post on git.mozilla.org. I tried doing another conversion using hg convert to get CVS tags, but that rudely blew up in my face. For now, I've filed a bug on getting an official, branchy-and-tag-filled version of this repository, while using the current version, missing history and all, as a base. Calendar people will have to suffer missing a month of history.
CVS is famously hard to convert to more modern repositories, and, as I've done my research, Mozilla's CVS looks like it uses those features which make it difficult. In particular, both the calendar CVS import and the comm-central initial CVS import used a CVS tag HG_COMM_INITIAL_IMPORT. That tagging was done, on only a small portion of the tree, twice, about two months apart. Fortunately, mailnews code was never touched on CVS trunk after the import (there appears to be one commit on calendar after the tagging), so it is probably possible to salvage a repository-wide consistent tag.
The start of my script for conversion looks like this:
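The preamble itself isn't reproduced here; judging from the variables used in the later fragments, it presumably defined something like the following (only the variable names come from the script below — every path and revision value here is a placeholder, not the author's actual setting):

```shell
# Working directory and source repositories (paths are assumptions)
WORKDIR=/path/to/workdir
HGCVS=/path/to/mozilla-cvs-history    # hg conversion of the CVS tree
MC=/path/to/mozilla-central
OUTPUT=/path/to/comm-central-unified  # destination repository

# mozilla-central changesets to graft in (placeholders, not real hashes)
MC_MORK_IMPORT=...
MC_EDITOR_IMPORT=...
MC_SMIME_IMPORT=...
```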
Since I don't want to include the changesets useless to comm-central history, I trimmed the history by using hg convert to eliminate changesets that don't change the necessary files. Most of the file selections are simple directory-wide includes, but S/MIME only moved a few files over, so it requires a more complex way to grab the file list. In addition, I replaced the % in usernames with the @ they usually appear with in hg. The relevant code is here:
# Step 1: Trim mozilla CVS history to include only the files we are ultimately
# interested in.
cat >$WORKDIR/convert-filemap.txt <<EOF
# Revision e4f4569d451a
include directory/xpcom
include mail
include mailnews
include other-licenses/branding/thunderbird
include suite
# Revision 7c0bfdcda673
include calendar
include other-licenses/branding/sunbird
# Revision ee719a0502491fc663bda942dcfc52c0825938d3
include editor/ui
# Revision 52efa9789800829c6f0ee6a005f83ed45a250396
include db/mork/
include db/mdb/
EOF

# Add the S/MIME import files
hg -R $MC log -r "children($MC_SMIME_IMPORT)" \
--template "{file_dels % 'include {file}\n'}" >>$WORKDIR/convert-filemap.txt
if [ ! -e $WORKDIR/convert-authormap.txt ]; then
hg -R $HGCVS log --template "{email(author)}={sub('%', '@', email(author))}\n" \
| sort -u > $WORKDIR/convert-authormap.txt
fi

cd $WORKDIR
hg convert $HGCVS $OUTPUT --filemap convert-filemap.txt -A convert-authormap.txt
That last command provides us the subset of the CVS history that we need for unified history. Strictly speaking, I should be pulling a specific revision, but I happen to know that there's no need to (we're cloning the only head) in this case. At this point, we now need to pull in the mozilla-central changes before we pull in comm-central. Order is key; hg convert will only apply the graft points when converting the child changeset (which it does but once), and it needs the parents to exist before it can do that. We also need to ensure that the mozilla-central graft point is included before continuing, so we do that, and then pull mozilla-central:
CC_CVS_BASE=$(hg log -R $HGCVS -r 'tip' --template '{node}')
CC_CVS_BASE=$(grep $CC_CVS_BASE $OUTPUT/.hg/shamap | cut -d' ' -f2)
MC_CVS_BASE=$(hg log -R $HGCVS -r 'gitnode(215f52d06f4260fdcca797eebd78266524ea3d2c)' --template '{node}')
MC_CVS_BASE=$(grep $MC_CVS_BASE $OUTPUT/.hg/shamap | cut -d' ' -f2)
# Okay, now we need to build the map of revisions.
cat >$WORKDIR/convert-revmap.txt <<EOF
e4f4569d451a5e0d12a6aa33ebd916f979dd8faa $CC_CVS_BASE # Thunderbird / Suite
7c0bfdcda6731e77303f3c47b01736aaa93d5534 d4b728dc9da418f8d5601ed6735e9a00ac963c4e, $CC_CVS_BASE # Calendar
9b2a99adc05e53cd4010de512f50118594756650 $MC_CVS_BASE # Mozilla graft point
ee719a0502491fc663bda942dcfc52c0825938d3 78b3d6c649f71eff41fe3f486c6cc4f4b899fd35, $MC_EDITOR_IMPORT # Editor
8cdfed92867f885fda98664395236b7829947a1d 4b5da7e5d0680c6617ec743109e6efc88ca413da, e4e612fcae9d0e5181a5543ed17f705a83a3de71 # ChatEOF# Next, import mozilla-central revisionsfor rev in$MC_MORK_IMPORT $MC_EDITOR_IMPORT $MC_SMIME_IMPORT; do
hg convert $MC $OUTPUT -r $rev --splicemap $WORKDIR/convert-revmap.txt \
--filemap $WORKDIR/convert-filemap.txt
done
Some notes about all of the revision IDs in the script. The splicemap requires the full 40-character SHA-1 IDs; anything less and the tool complains. I also need to specify the parents of the revisions that deleted the code for the mozilla-central import, so if you go hunting for those revisions and are surprised that they don't remove the code in question, that's why.
I mentioned complications about the merges earlier. The Mork and S/MIME imports here moved files, so that what was db/mdb in mozilla-central became db/mork. There's no support for making the generated splice record these as moves, so I have to construct those renamings manually:
# We need to execute a few hg move commands due to renamings.
pushd $OUTPUT
hg update -r $(grep $MC_MORK_IMPORT .hg/shamap | cut -d' ' -f2)
(hg -R $MC log -r "children($MC_MORK_IMPORT)" \
--template "{file_dels % 'hg mv {file} {sub(\"db/mdb\", \"db/mork\", file)}\n'}") | bash
hg commit -m 'Pseudo-changeset to move Mork files' -d '2011-08-06 17:25:21 +0200'
MC_MORK_IMPORT=$(hg log -r tip --template '{node}')
hg update -r $(grep $MC_SMIME_IMPORT .hg/shamap | cut -d' ' -f2)
(hg -R $MC log -r "children($MC_SMIME_IMPORT)" \
--template "{file_dels % 'hg mv {file} {sub(\"security/manager/ssl\", \"mailnews/mime\", file)}\n'}") | bash
hg commit -m 'Pseudo-changeset to move S/MIME files' -d '2014-06-15 20:51:51 -0700'
MC_SMIME_IMPORT=$(hg log -r tip --template '{node}')
popd

# Echo the new move commands to the changeset conversion map.
cat >>$WORKDIR/convert-revmap.txt <<EOF
52efa9789800829c6f0ee6a005f83ed45a250396 abfd23d7c5042bc87502506c9f34c965fb9a09d1, $MC_MORK_IMPORT # Mork
50f5b5fc3f53c680dba4f237856e530e2097adfd 97253b3cca68f1c287eb5729647ba6f9a5dab08a, $MC_SMIME_IMPORT # S/MIME
EOF
Now that we have all of the graft points defined and all of the external code ready, we can pull comm-central and do the conversion. That's not quite it, though: when we graft the S/MIME history onto the original mozilla-central history, we leave behind a small segment of abandoned converted history. A call to hg strip removes it.
# Now, import comm-central revisions that we need
hg convert $CC $OUTPUT --splicemap $WORKDIR/convert-revmap.txt
hg strip 2f69e0a3a05a
[1] I left out one of the graft points because I just didn't want to deal with it. I'll leave it as an exercise to the reader to figure out which one it was. Hint: it's the only one I didn't know about before I searched for the archive points [2].
[2] Since I wasn't sure I knew all of the graft points, I decided to try to comb through all of the changesets to figure out who imported code. It turns out that hg log -r 'adds("**")' narrows it down nicely (1667 changesets to look at instead of 17547), and using the {file_adds} template helps winnow it down more easily.
This post is part 7 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. Part 6 discusses how email security works in practice. This part discusses the problem of trust.
At a technical level, S/MIME and PGP (or at least PGP/MIME) use cryptography essentially identically. Yet the two are treated as radically different models of email security because they diverge on the most important question of public key cryptography: how do you trust the identity of a public key? Trust is critical, as it is the only way to stop an active man-in-the-middle (MITM) attack. MITM attacks are actually easier to pull off in email, since all email messages effectively have to pass through both the sender's and the recipients' email servers [1], allowing attackers to pull off permanent, long-lasting MITM attacks [2].
S/MIME uses the same trust model that SSL uses, based on X.509 certificates and certificate authorities. X.509 certificates effectively work by providing a certificate, signed by another authority, that says who you are. In the original concept (as you might guess from the name "X.509"), the trusted authority was your telecom provider, and the certificates were furthermore intended to be part of the global X.500 directory—a natural extension of the OSI model of the internet. The OSI model never gained traction, and the trusted telecom providers were replaced with trusted root CAs.
PGP, by contrast, uses a trust model that's generally known as the Web of Trust. Every user has a PGP key (containing their identity and their public key), and users can sign others' public keys. Trust generally flows from these signatures: if you trust a user, you know the keys that they sign are correct. The name "Web of Trust" comes from the vision that trust flows along the paths of signatures, building a tight web of trust.
And now for the controversial part of the post: the comparisons and critiques of these trust models. A disclaimer: I am not a security expert, although I am a programmer who revels in dreaming up arcane edge cases. I also don't use PGP at all, and use S/MIME to a very limited extent for some Mozilla work [3], although I did make a few abortive attempts to dogfood it in the past. I've attempted to replace personal experience with comprehensive research [4], but most existing critiques and comparisons of these two trust models are about 10-15 years old and predate several changes to CA certificate practices.
A basic tenet of development that I have found is that the average user is fairly ignorant. At the same time, a lot of the defense of trust models, both CAs and the Web of Trust, tends to hinge on configurability. How many people, for example, know how to add or remove a CA root from Firefox, Windows, or Android? Even among the subgroup of Mozilla developers, I suspect the number of people who know how to do so is rather few. Or in the case of PGP, how many people know how to change the maximum path length? Or even understand the security implications of doing so?
Seen in the light of ignorant users, the Web of Trust is a UX disaster. Its entire security model is predicated on having users precisely specify how much they trust other people to trust others (ultimate, full, marginal, none, unknown) and also on having them continually do out-of-band verification procedures and publicly reporting those steps. In 1998, a seminal paper on the usability of a GUI for PGP encryption came to the conclusion that the UI was effectively unusable for users, to the point that only a third of the users were able to send an encrypted email (and even then, only with significant help from the test administrators), and a quarter managed to publicly announce their private keys at some point, which is pretty much the worst thing you can do. They also noted that the complex trust UI was never used by participants, although the failure of many users to get that far makes generalization dangerous [5]. While newer versions of security UI have undoubtedly fixed many of the original issues found (in no small part due to the paper, one of the first to argue that usability is integral, not orthogonal, to security), I have yet to find an actual study on the usability of the trust model itself.
The Web of Trust has other faults. The notion of "marginal" trust, it turns out, is rather broken: marginally trusting a user who has two keys, both of which signed another person's key, is the same as fully trusting a user with one key who signed that key. There are several proposals for different trust formulas [6], but to my knowledge none of them have caught on in practice.
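As a rough illustration of why key-counting breaks down, here is a toy model of the classic PGP validity formula (one fully trusted signer, or two marginally trusted signers, as in early PGP; exact thresholds vary by implementation, and the names here are illustrative, not GnuPG's API):

```python
COMPLETES_NEEDED = 1   # one signature from a fully trusted key suffices
MARGINALS_NEEDED = 2   # or two signatures from marginally trusted keys

def key_is_valid(signer_trust_levels):
    """signer_trust_levels: one entry ('full' or 'marginal') per signing key."""
    fulls = signer_trust_levels.count('full')
    marginals = signer_trust_levels.count('marginal')
    return fulls >= COMPLETES_NEEDED or marginals >= MARGINALS_NEEDED

# The flaw: Mallory controls two keys, both marginally trusted, and signs
# Bob's key with both. The formula counts keys, not people:
assert key_is_valid(['marginal', 'marginal'])   # two keys, but one person
assert key_is_valid(['full'])                   # one fully trusted signer
assert not key_is_valid(['marginal'])           # a single marginal is not enough
```

The first assertion is exactly the broken case: two marginal keys held by the same person carry as much weight as one fully trusted signer.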
A hidden fault is associated with its manner of presentation: in sharp contrast to CAs, the Web of Trust appears not to delegate trust, but any practical widespread deployment needs to solve the problem of contacting people with whom one has had no prior contact. Combined with the need to bootstrap new users, this implies that there need to be some keys that have signed a lot of other keys and are essentially default-trusted—in other words, a CA, a fact sometimes lost on advocates of the Web of Trust.
That said, a valid point in favor of the Web of Trust is that it more easily allows people to distrust CAs if they wish to. While I'm skeptical of its utility to a broader audience, the ability to do so is crucial for a not-insignificant portion of the population, and it's important enough to be explicitly called out.
X.509 certificates are most commonly discussed in the context of SSL/TLS connections, so I'll discuss them in that context as well, as the implications for S/MIME are mostly the same. Almost all criticism of this trust model essentially boils down to a single complaint: certificate authorities aren't trustworthy. A historical criticism is that the addition of CAs to the main root trust stores was ad hoc. Since then, however, the main oligopoly of these root stores (Microsoft, Apple, Google, and Mozilla) have made their policies public and clear [7]. The introduction of the CA/Browser Forum in 2005, with a collection of major CAs and the major browsers as members [8], helps in articulating common policies. These policies, simplified immensely, boil down to:
You must verify information (depending on certificate type). This information must be relatively recent.
You must not use weak algorithms in your certificates (e.g., no MD5).
You must not make certificates that are valid for too long.
You must maintain revocation checking services.
You must have fairly stringent physical and digital security practices and intrusion detection mechanisms.
You must be [externally] audited every year that you follow the above rules.
If you screw up, we can kick you out.
I'm not going to claim that this is necessarily the best policy or even that any policy can feasibly stop intrusions from happening. But it's a policy, so CAs must abide by some set of rules.
Another CA criticism is the fear that they may be suborned by national government spy agencies. I find this claim underwhelming, considering that the number of certificates acquired by intrusions that were used in the wild is larger than the number of certificates acquired by national governments that were used in the wild: 1 and 0, respectively. Yet no one complains about the untrustworthiness of CAs due to their ability to be hacked by outsiders. Another criticism is that CAs are controlled by profit-seeking corporations, which misses the point: the business of CAs is not selling certificates but selling their access to the root databases. As we will see shortly, jeopardizing that access is a great way for a CA to go out of business.
To understand issues involving CAs in greater detail, there are two CAs that are particularly useful to look at. The first is CACert. CACert is favored by many for its attempt to handle X.509 certificates in a Web of Trust model, so invariably every public discussion about CACert ends up devolving into an attack on other CAs for their perceived capture by national governments or corporate interests. Yet what many of the proponents for inclusion of CACert miss (or dismiss) is the fact that CACert actually failed the required audit and is unlikely to ever pass one. This shows a central failure of both CAs and the Web of Trust: different people have different definitions of "trust," and in the case of CACert, some people are favoring a subjective definition (I trust their owners because they're not evil) when an objective definition fails (in this case, that the root signing key is securely kept).
The other CA of note here is DigiNotar. In July 2011, some hackers managed to acquire a few fraudulent certificates by hacking into DigiNotar's systems. By late August, people had become aware of these certificates being used in practice [9] to intercept communications, mostly in Iran. The use appears to have been caught after Chromium updates failed due to invalid certificate fingerprints. After it became clear that the fraudulent certificates were not limited to a single fake Google certificate, and that DigiNotar had failed to notify potentially affected companies of its breach, DigiNotar was swiftly removed from all of the trust databases. It ended up declaring bankruptcy within two weeks.
DigiNotar indicates several things. One, SSL MITM attacks are not theoretical (I have seen at least two or three security experts advising pre-DigiNotar that SSL MITM attacks are "theoretical" and therefore the wrong target for security mechanisms). Two, keeping the trust of browsers is necessary for commercial operation of CAs. Three, the notion that a CA is "too big to fail" is false: DigiNotar played an important role in the Dutch community as a major CA and the operator of Staat der Nederlanden. Yet when DigiNotar screwed up and lost its trust, it was swiftly kicked out despite this role. I suspect that even Verisign could be kicked out if it manages to screw up badly enough.
This isn't to say that the CA model isn't problematic. But the source of its problems is that delegating trust isn't a feasible model in the first place, a problem it shares with the Web of Trust. Different notions of what "trust" actually means, and the uncertainty introduced as chains of trust grow longer, make delegated trust vulnerable to both social engineering and technical attacks. There appears to be an increasing consensus that the best way forward is some variant of key pinning, much akin to how SSH works: once you know someone's public key, you complain if that public key appears to change, even if it appears to be "trusted." This does leave people open to attacks on first use, and the question of what to do when you need to legitimately re-key is not easy to solve.
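The pinning behavior just described can be sketched in a few lines, in the style of SSH's known_hosts file (a minimal trust-on-first-use store; all names here are illustrative):

```python
import hashlib

class PinStore:
    """Trust-on-first-use: pin the first key seen per identity, flag changes."""
    def __init__(self):
        self.pins = {}  # identity -> fingerprint

    @staticmethod
    def fingerprint(public_key_bytes):
        return hashlib.sha256(public_key_bytes).hexdigest()

    def check(self, identity, public_key_bytes):
        fp = self.fingerprint(public_key_bytes)
        if identity not in self.pins:
            self.pins[identity] = fp   # first use: trust and pin
            return "pinned"
        if self.pins[identity] == fp:
            return "ok"
        return "MISMATCH"              # possible MITM -- or a legitimate re-key

store = PinStore()
assert store.check("alice@example.invalid", b"key-A") == "pinned"
assert store.check("alice@example.invalid", b"key-A") == "ok"
assert store.check("alice@example.invalid", b"key-B") == "MISMATCH"
```

Note the last line captures both open problems at once: the store cannot distinguish an attack from a legitimate re-key, and an attacker present at first use gets pinned.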
In short, both CAs and the Web of Trust have issues. Whether or not you should prefer S/MIME or PGP ultimately comes down to the very conscious question of how you want to deal with trust—a question without a clear, obvious answer. If I appear to be painting CAs and S/MIME in a positive light and the Web of Trust and PGP in a negative one in this post, it is more because I am trying to focus on the positions less commonly taken to balance perspective on the internet. In my next post, I'll round out the discussion on email security by explaining why email security has seen poor uptake and answering the question as to which email security protocol is most popular. The answer may surprise you!
[1] Strictly speaking, you can bypass the sender's SMTP server. In practice, this is considered a hole in the SMTP system that email providers are trying to plug.
[2] I've had 13 different connections to the internet in the time that I've had my main email address, not counting all the public wifis I have used. Whereas an attacker would find it extraordinarily difficult to intercept all of my SSH sessions for a MITM attack, intercepting all of my email sessions is clearly far easier if the attacker were my email provider.
[3] Before you read too much into this personal choice of S/MIME over PGP, it's entirely motivated by a simple concern: S/MIME is built into Thunderbird; PGP is not. As someone who does a lot of Thunderbird development work that could easily break the Enigmail extension locally, needing to use an extension would be disruptive to workflow.
[4] This is not to say that I don't heavily research many of my other posts, but I did go so far for this one as to actually start going through a lot of published journals in an attempt to find information.
[5] It's questionable how well the usability of a trust model UI can be measured in a lab setting, since the observer effect is particularly strong for all metrics of trust.
[6] The web of trust makes a nice graph, and graphs invite lots of interesting mathematical metrics. I've always been partial to eigenvectors of the graph, myself.
[7] Mozilla's policy for addition to NSS is basically the standard policy adopted by all open-source Linux or BSD distributions, seeing as OpenSSL never attempted to produce a root database.
[8] It looks to me that it's the browsers who are more in charge in this forum than the CAs.
[9] To my knowledge, this is the first—and so far only—attempt to actively MITM an SSL connection.
This post is part 6 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. Part 5 discusses the more general problem of email headers. This part discusses how email security works in practice.
Email security is a rather wide-ranging topic, and one that I've wanted to cover for some time, well before several recent events brought it into wider public awareness. There is no way I can hope to cover it in a single post (I think it would outpace even the length of my internationalization discussion), and there are definitely parts for which I am underqualified, as I am by no means an expert in cryptography. Instead, I will be discussing this over the course of several posts, of which this is but the first; to ease up on the amount of background explanation, I will assume passing familiarity with cryptographic concepts like public keys and hash functions, as well as knowing what SSL and SSH are (though not necessarily how they work). If you don't have that knowledge, ask Wikipedia.
Before discussing how email security works, it is first necessary to ask what email security actually means. Unfortunately, the layman's interpretation is likely to differ from the actual precise definition. Security is often treated by laymen as a boolean: something is either secure or insecure. The most prevalent model of security to people is SSL connections: these allow the establishment of a communication channel whose contents are secret to outside observers while also guaranteeing to the client the authenticity of the server. The server often then gets authenticity of the client via a more conventional authentication scheme (i.e., the client sends a username and password). Thus there is, at the end, a channel that has both secrecy and authenticity [1]: channels with both of these are considered secure, and channels without them are considered insecure [2].
In email, the situation becomes more difficult. Whereas an SSL connection is between a client and a server, the architecture of email is such that email providers must be considered as distinct entities from end users. In addition, messages can be sent from one person to multiple parties. Thus secure email is a more complex undertaking than just porting relevant details of SSL. There are two major cryptographic implementations of secure email [3]: S/MIME and PGP. In terms of implementation, they are basically the same [4], although PGP has an extra mode which wraps general ASCII (known as "ASCII-armor"), which I have been led to believe is less recommended these days. Since I know the S/MIME specifications better, I'll refer specifically to how S/MIME works.
S/MIME defines two main MIME types: multipart/signed, which contains the message text as a subpart followed by data indicating the cryptographic signature, and application/pkcs7-mime, which contains an encrypted MIME part. The important things to note about this delineation are that only the body data is encrypted [5], that it's theoretically possible to encrypt only part of a message's body, and that the signing and encryption constitute different steps. These factors combine to make for a potentially infuriating UI setup.
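The multipart/signed layout just described can be sketched with Python's stdlib email package; the signature bytes below are a placeholder rather than a real PKCS#7 blob:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

# First subpart: the body text that actually gets signed.
body = MIMEText("The body that actually gets signed.")
# Second subpart: the detached signature (placeholder bytes, not real PKCS#7).
signature = MIMEApplication(b"...placeholder signature bytes...",
                            _subtype="pkcs7-signature")

msg = MIMEMultipart(_subtype="signed",
                    protocol="application/pkcs7-signature",
                    micalg="sha-256")
msg.attach(body)
msg.attach(signature)

assert msg.get_content_type() == "multipart/signed"
assert [p.get_content_type() for p in msg.get_payload()] == \
    ["text/plain", "application/pkcs7-signature"]
```

The two-part structure is what makes "surreptitious signed parts" possible in the first place: a client has to decide what to display when the parts are rearranged or nested unexpectedly.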
How does S/MIME tackle the challenges of encrypting email? First, rather than encrypting using recipients' public keys, the message is encrypted with a symmetric key. This symmetric key is then encrypted with each of the recipients' keys and then attached to the message. Second, by only signing or encrypting the body of the message, the transit headers are kept intact for the mail system to retain its ability to route, process, and deliver the message. The body is supposed to be prepared in the "safest" form before transit to avoid intermediate routers munging the contents. Finally, to actually ascertain what the recipients' public keys are, clients typically passively pull the information from signed emails. LDAP, unsurprisingly, contains an entry for a user's public key certificate, which could be useful in large enterprise deployments. There is also work ongoing right now to publish keys via DNS and DANE.
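The key-wrapping scheme above (one symmetric key per message, wrapped once per recipient) can be illustrated with a toy sketch; the XOR "cipher" here is purely a stand-in to show the structure, while real S/MIME uses something like AES for the body and RSA or ECDH to wrap the session key:

```python
import os

def toy_encrypt(key, data):
    # Placeholder XOR cipher, NOT secure; it is its own inverse, which
    # keeps the sketch short.
    return bytes(d ^ key[i % len(key)] for i, d in enumerate(data))

def seal(body, recipient_keys):
    session_key = os.urandom(16)                    # one symmetric key per message
    wrapped = {name: toy_encrypt(k, session_key)    # wrapped once per recipient
               for name, k in recipient_keys.items()}
    return {"ciphertext": toy_encrypt(session_key, body),
            "wrapped_keys": wrapped}

def unseal(sealed, name, key):
    session_key = toy_encrypt(key, sealed["wrapped_keys"][name])
    return toy_encrypt(session_key, sealed["ciphertext"])

keys = {"alice": os.urandom(16), "bob": os.urandom(16)}
sealed = seal(b"meet at noon", keys)
assert unseal(sealed, "alice", keys["alice"]) == b"meet at noon"
assert unseal(sealed, "bob", keys["bob"]) == b"meet at noon"
```

The point of the structure is that the body is encrypted exactly once no matter how many recipients there are; only the small session key is encrypted per recipient.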
I mentioned before that S/MIME's use can present some interesting UI design decisions. I ended up actually testing some common email clients on how they handled S/MIME messages: Thunderbird, Apple Mail, Outlook [6], and Evolution. In my attempts to create a surreptitious signed part to confuse the UI, Outlook decided that the message had no body at all, and Thunderbird decided to ignore all indication of the existence of said part. Apple Mail managed to claim the message was signed in one of these scenarios, and Evolution took the cake by always agreeing that the message was signed [7]. It didn't even bother questioning the signature if the certificate's identity disagreed with the easily-spoofable From address. I was actually surprised by how well these clients did in my tests—I expected far more confusion, particularly since the will to maintain S/MIME has clearly been relatively low, judging by poor support for "new" features such as triple-wrapping or header protection.
Another fault of S/MIME's design is that it rests on the mistaken belief that composing a signing step and an encryption step is equivalent in strength to a simultaneous sign-and-encrypt. Another page describes this in far better detail than I have room to here; note that this flaw is fixed by triple-wrapping (which has relatively poor support). This adds yet more UI burden: how to adequately describe all the various minutiae of differing security guarantees. Considering that users already have a hard time understanding that just because a message says it's from example@isp.invalid doesn't mean it's actually from example@isp.invalid, developing UI that both adequately expresses the security issues and is understandable to end-users is an extreme challenge.
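The sign-then-encrypt flaw just mentioned is sometimes called surreptitious forwarding, and a toy model shows its shape (the dictionaries here are illustrative stand-ins for real CMS structures, not an actual implementation):

```python
def sign(author, body):
    # Placeholder signature: records only who signed what.
    return {"body": body, "signed_by": author}

def encrypt(recipient, signed_payload):
    # Placeholder encryption: records only the encryption recipient.
    return {"to": recipient, "payload": signed_payload}

# Alice signs a message, then encrypts it to Bob.
msg_to_bob = encrypt("bob", sign("alice", "I owe you $10"))

# Bob strips his encryption layer and re-encrypts Alice's still-signed
# body to Charlie. Nothing in the signed layer names the intended recipient.
msg_to_charlie = encrypt("charlie", msg_to_bob["payload"])

assert msg_to_charlie["to"] == "charlie"
assert msg_to_charlie["payload"]["signed_by"] == "alice"
```

Charlie sees a message encrypted to him and validly signed by Alice, and has no way to tell that Alice never sent it to him; binding the signature and encryption together (as triple-wrapping does) is what closes this gap.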
What we have in S/MIME (and PGP) is a system that allows for strong guarantees, if certain conditions are met, yet is also vulnerable to breaches of security if the message handling subsystems are poorly designed. Hopefully this is a sufficient guide to the technical impacts of secure email in the email world. My next post will discuss the most critical component of secure email: the trust model. After that, I will discuss why secure email has seen poor uptake and other relevant concerns on the future of email security.
[1] This is a bit of a lie: a channel that does secrecy and authentication at different times isn't as secure as one that does them at the same time.
[2] It is worth noting that authenticity is, in many respects, necessary to achieve secrecy.
[3] This, too, is a bit of a lie. More on this in a subsequent post.
[4] I'm very aware that S/MIME and PGP use radically different trust models. Trust models will be covered later.
[5] S/MIME 3.0 did add a provision stating that if the signed/encrypted part is a message/rfc822 part, the headers of that part should override the outer message's headers. However, I am not aware of a major email client that actually handles these kinds of messages gracefully.
[6] Actually, I tested Windows Live Mail instead of Outlook, but given the presence of an official MIME-to-Microsoft's-internal-message-format document which seems to agree with what Windows Live Mail was doing, I figure their output would be identical.
[7] On a more careful examination after the fact, it appears that Evolution may have tried to indicate signedness on a part-by-part basis, but the UI was sufficiently confusing that ordinary users would easily be misled.
Previously, I've been developing JSMime as a subdirectory within comm-central. However, after discussions with other developers, I have moved the official repository of record for JSMime to its own repository, now found on GitHub. The repository has been cleaned up and the feature set for version 0.2 has been selected, so that the current tip on JSMime (also the initial version) is version 0.2. This contains the feature set I imported into Thunderbird's source code last night, which is to say support for parsing MIME messages into the MIME tree, as well as support for parsing and encoding email address headers.
Thunderbird doesn't actually use the new code quite yet (as my current tree is stuck on a mozilla-central build error, so I haven't had time to run those patches through a last minute sanity check before requesting review), but the intent is to replace the current C++ implementations of nsIMsgHeaderParser and nsIMimeConverter with JSMime instead. Once those are done, I will be moving forward with my structured header plans which more or less ought to make those interfaces obsolete.
Within JSMime itself, the pieces which I will be working on next will be rounding out the implementation of header parsing and encoding support (I have prototypes for Date headers and the infernal RFC 2231 encoding that Content-Disposition needs), as well as support for building MIME messages from their constituent parts (a feature which would be greatly appreciated in the depths of compose and import in Thunderbird). I also want to implement full IDN and EAI support, but that's hampered by the lack of a JS implementation I can use for IDN (yes, there's punycode.js, but that doesn't do StringPrep). The important task of converting the MIME tree to a list of body parts and attachments is something I do want to work on as well, but I've vacillated on the implementation here several times and I'm not sure I've found one I like yet.
JSMime, as its name implies, tries to work in as pure JS as possible, augmented with several web APIs as necessary (such as TextDecoder for charset decoding). I'm using ES6 as the base here, because it gives me several features I consider invaluable for implementing this in JavaScript: Promises, Map, generators, let. This means it can run on an unprivileged web page—I test JSMime using Firefox nightlies and the Firefox debugger where necessary. Unfortunately, it only really works in Firefox at the moment because V8 doesn't support many ES6 features yet (such as destructuring, which is annoying but simple enough to work around, or Map iteration, which is completely necessary for the code). I'm not opposed to changing it to make it work on Node.js or Chrome, but I don't realistically have the time to spend doing it myself; if someone else has the time, please feel free to contact me or send patches.
My talk was really more about the “network problem” than the “protocol problem”. Networks breed first- and second-mover winners and other path-dependent powers, until the next disruption. Users, or rather their data, get captured.
Privacy is only one concern among several, including how to realize economic value for many-yet-individually-weak users, not just for data-store/service owners or third parties. Can we do better with client-side and private-cloud tiers, zero-knowledge proofs and protocols, or other ideas?
In the end, I asked these four questions:
Can a browser/OS “unionize its users” to gain bargaining power vs. net super-powers?
To create a data commons with “API to me” and aggregated/clustered economics?
Open the walled gardens to put users first?
Still be usable and private-enough for most?
I think the answer is yes, but I’m not sure who will do this work. It is vitally important.
I may get to it, but not working at Mozilla. I’ve resigned as CEO and I’m leaving Mozilla to take a rest, take some trips with my family, look at problems from other angles, and see if the “network problem” has a solution that doesn’t require scaling up to hundreds of millions of users and winning their trust while somehow covering costs. That’s a rare, hard thing, which I’m proud to have done with Firefox at Mozilla.
I encourage all Mozillians to keep going. Firefox OS is even more daunting, and more important. Thanks indeed to all who have supported me, and to all my colleagues over the years, at Mozilla, in standards bodies, and at conferences around the world. I will be less visible online, but still around.
…unless you're an expert at assembly, that is. The title of this post was obviously meant to be an attention-grabber, but it is much truer than you might think: poorly-written assembly code will probably be slower than an optimizing compiler on well-written code (note that you may need to help the compiler along for things like vectorization). Now why is this?
Modern microarchitectures are incredibly complex. A modern x86 processor will be superscalar, using some form of translation to microcode to get there. Desktop processors will undoubtedly have multiple instruction issue per cycle, forms of register renaming, branch predictors, etc. Minor changes—a misaligned instruction stream, a poor order of instructions, a bad instruction choice—could kill the ability to take advantage of these features. There are very few people who could accurately predict the performance of a given assembly stream (I myself wouldn't attempt it if the architecture can take advantage of ILP), and these people are disproportionately likely to be working on compiler optimizations. So unless you're knowledgeable enough about assembly to help work on a compiler, you probably shouldn't be hand-coding assembly to make code faster.
To give an example to elucidate this point (and the motivation for this blog post in the first place), I was given a link to an implementation of the N-queens problem in assembly. For various reasons, I decided to use this to start building a fine-grained performance measurement system. This system uses a high-resolution monotonic clock on Linux and runs the function 1000 times to warm up caches and counters and then runs the function 1000 more times, measuring each run independently and reporting the average runtime at the end. This is a single execution of the system; 20 executions of the system were used as the baseline for a t-test to determine statistical significance as well as visual estimation of normality of data. Since the runs observed about a constant 1-2 μs of noise, I ran all of my numbers on the 10-queens problem to better separate the data (total runtimes ended up being in the range of 200-300 μs at this level). When I say that some versions are faster, the p-values for individual measurements are on the order of 10⁻²⁰—meaning that there is a 1-in-100,000,000,000,000,000,000 chance that the observed speedups could be produced if the programs take the same amount of time to run.
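For illustration, the methodology of that harness (warmup runs, then independently timed runs against a monotonic high-resolution clock) might be sketched like this; the actual measurement system was not written in Python, so treat this purely as a sketch of the approach:

```python
import statistics
import time

def measure(fn, warmup=1000, runs=1000):
    """Time fn: warm up caches/predictors, then time each run independently."""
    for _ in range(warmup):
        fn()                          # warmup runs are discarded
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()   # monotonic, high-resolution clock
        fn()
        samples.append(time.perf_counter_ns() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

mean_ns, stdev_ns = measure(lambda: sum(range(100)))
assert mean_ns > 0 and stdev_ns >= 0
```

Recording each run separately (rather than dividing one total by the run count) is what makes the later per-measurement t-test possible.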
The initial assembly version of the program took about 288μs to run. The first C++ version I coded, originating from the same genesis algorithm that the author of the assembly version used, ran in 275μs. A recursive program beat out a hand-written assembly block of code... and when I manually converted the recursive program into a single loop, the runtime improved to 249μs. It wasn't until I got rid of all of the assembly in the original code that I could get the program to beat the derecursified code (at 244μs)—so it's not the vectorization that's causing the code to be slow. Intrigued, I started to analyze why the original assembly was so slow.
It turns out that there are three main things that I think cause the slow speed of the original code. The first one is alignment of branches: the assembly code contains no instructions to align basic blocks on particular branches, whereas gcc happily emits these for some basic blocks. I mention this first as it is mere conjecture; I never made an attempt to measure the effects for myself. The other two causes are directly measured from observing runtime changes as I slowly replaced the assembly with code. When I replaced the use of push and pop instructions with a global static array, the runtime improved dramatically. This suggests that the alignment of the stack could be to blame (although the stack is still 8-byte aligned when I checked via gdb), which just goes to show you how much alignments really do matter in code.
The final, and by far most dramatic, effect I saw involves the use of three assembly instructions: bsf (find the index of the lowest set bit), btc (toggle a specific bit, used here to clear a bit known to be set), and shl (left shift). When I replaced these instructions with the expressions int bit = x & -x and x = x - bit, the program's speed improved dramatically. The rationale for why the speed improved won't be found in latency tables, although those will tell you that bsf is not a 1-cycle operation. Rather, it's in minutiae that aren't immediately obvious.
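The replacement can be illustrated in isolation (a Python sketch; the same bit identities hold for the C++ code):

```python
x = 0b101100
bit = x & -x   # isolate the lowest set bit (what bsf + shl computed)
x = x - bit    # clear that bit (what btc did)
print(bin(bit))  # 0b100
print(bin(x))    # 0b101000
```

The two expressions never need the bit *index*, which is exactly why the flag-setting behavior of bsf stops mattering.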
The original program used the fact that bsf sets the zero flag if the input register is 0 as the condition to do the backtracking; the converted code just checked whether the value was 0 (using a simple test instruction). The compare and the jump instructions are effectively fused into a single instruction in the processor. In contrast, the bsf does not get to do this; combined with the intrinsically higher latency of the instruction, it means that empty loops take a lot longer to do nothing. The use of an 8-bit shift value is also interesting, as there is a rather severe penalty for using 8-bit partial registers on Intel processors, as far as I can see.
Now, this isn't to say that the compiler will always produce the best code by itself. My final code wasn't above using x86 intrinsics for the vector instructions. Replacing the _mm_andnot_si128 intrinsic with an actual and-not on vectors caused gcc to use other, slower instructions instead of the vmovq to move the result out of the SSE registers for reasons I don't particularly want to track down. The use of the _mm_blend_epi16 and _mm_srli_si128 intrinsics can probably be replaced with __builtin_shuffle instead for more portability, but I was under the misapprehension that this was a clang-only intrinsic when I first played with the code so I never bothered to try that, and this code has passed out of my memory long enough that I don't want to try to mess with it now.
In short, compilers know things about optimizing for modern architectures that many general programmers don't. Compilers may have issues with autovectorization, but the existence of vector intrinsics allows you to force compilers to use vectorization while still giving them leeway to make decisions about instruction scheduling or code alignment, which are easy to screw up in hand-written assembly. Also, compilers are liable to get better in the future, whereas hand-written assembly code is unlikely to get faster. So only write assembly code if you really know what you're doing and you know you're better than the compiler.
I am deeply honored and humbled by the CEO role. I’m also grateful for the messages of support. At the same time, I know there are concerns about my commitment to fostering equality and welcome for LGBT individuals at Mozilla. I hope to lay those concerns to rest, first by making a set of commitments to you. More important, I want to lay them to rest by actions and results.
A number of Mozillians, including LGBT individuals and allies, have stepped forward to offer guidance and assistance in this. I cannot thank you enough, and I ask for your ongoing help to make Mozilla a place of equality and welcome for all. Here are my commitments, and here’s what you can expect:
Active commitment to equality in everything we do, from employment to events to community-building.
Working with LGBT communities and allies, to listen and learn what does and doesn’t make Mozilla supportive and welcoming.
My personal commitment to work on new initiatives to reach out to those who feel excluded or who have been marginalized in ways that make their contributing to Mozilla and to open source difficult. More on this last item below.
I know some will be skeptical about this, and that words alone will not change anything. I can only ask for your support to have the time to “show, not tell”; and in the meantime express my sorrow at having caused pain.
Mozilla is a movement composed of different people around the world, working productively together on a common mission. This is important to our ability to work and grow around the world.
Many Mozillians and others know me as a colleague or a friend. They know that I take people as they come and work with anyone willing to contribute. At the same time, I don’t ask for trust free of context, or without a solid structure to support accountability. No leader or person who has a privileged position should. I want to be held accountable for what I do as CEO. I fully expect you all to do so.
I am committed to ensuring that Mozilla is, and will remain, a place that includes and supports everyone, regardless of sexual orientation, gender identity, age, race, ethnicity, economic status, or religion.
You will see exemplary behavior from me toward everyone in our community, no matter who they are; and the same toward all those who we hope will join, and toward those who use our products. Mozilla’s inclusive health benefits policies will not regress in any way. And I will not tolerate behavior among community members that violates our Community Participation Guidelines or (for employees) our inclusive and non-discriminatory employment policies.
You’ll also see more from Mozilla under my leadership in the way of efforts to include potential contributors, especially those who lack privilege. This entails several projects, starting with Project Ascend, which is being developed by Lukas Blakk. I intend to demonstrate with meaningful action my commitment to a Mozilla that lives up to its ideals, including that of being an open and inclusive community.
A quick note to update everyone on Mozilla news. Our Board of Directors has appointed me CEO of Mozilla, with immediate effect. I’m honored and humbled, and I promise to do everything I can to lead Mozilla to new heights in this role.
I would first like to thank Jay Sullivan for his contributions to Mozilla and to the Web. He has been a passionate force at Mozilla whose leadership, especially during the last year, has been important to our success, in particular with Firefox OS. Jay is helping with the CEO transition and will then leave to pursue new opportunities.
My co-founder and 15-year partner in Mozilla, Mitchell Baker, remains active as Executive Chairwoman of Mozilla. I could not do what I do for Mozilla without Mitchell, and I like to think she feels the same way about me ;-). We have worked together well since she took on management of the tiny mozilla.org staff fragment embedded in Netscape. At that time I was “acting manager” (more like method acting manager :-P). I’ve learned a lot from Mitchell and my other peers at Mozilla about management since then!
Mozilla is about people-power on the Web and Internet — putting individual users, who create as well as consume, above all other agendas. In this light, people-fu trumps my first love, which you might say is math-fu, code-fu or tech-fu (if I may appropriate the second syllable from kung fu). People around the world are our ultimate cause at Mozilla, as well as a source of inspiration and ongoing help in doing what we do.
Speaking of people a bit more, I’ll take this moment to introduce Li Gong as my incoming COO. Li set up Mozilla China and our Taipei office, and he has been a crucial partner in building up Firefox OS. If you don’t know him yet, you will probably get a chance if you pass through our headquarters, as Li will be moving back to the US to help manage here.
Mozilla remains a global public benefit organization, so I’m sure I will see all of you more as I travel: to all of our offices (I have not yet been to Beijing or Taipei), to the places where we are bringing Firefox OS and the $25 smartphone, and everywhere Mozillians, developers, and others are working to make the Web better for everyone.
Several years ago, I embarked on a project to collect the headers of all the messages I could reach on NNTP, with the original intent of studying the progression of the most common news clients. More recently, I used this dataset to attempt to discover the prevalence of charsets in email messages. In doing so, I identified a critical problem with the dataset: since it only contains headers, there is very little scope for actually understanding the full, sad story of charsets. So I've decided to rectify this problem.
This time, I modified my data-collection scripts to make it much easier to mass-download NNTP messages. The first script effectively lists all the newsgroups, and then all the message IDs in those newsgroups, stuffing the results in a set to remove duplicates (cross-posts). The second script uses Python's nntplib module to attempt to download all of those messages.
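The cross-post deduplication step amounts to unioning per-group message-ID lists into a set (a sketch with made-up group names and IDs; the real script fetched these lists over NNTP):

```python
# Hypothetical per-group message-ID listings; cross-posted IDs repeat
# across groups, so a set collapses them to one entry each.
group_ids = {
    "comp.lang.c": ["<a1@example>", "<b2@example>"],
    "comp.lang.c++": ["<b2@example>", "<c3@example>"],  # <b2@example> is cross-posted
}

unique_ids = set()
for ids in group_ids.values():
    unique_ids.update(ids)

print(len(unique_ids))  # 3
```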
Of the 32,598,261 messages identified by the first script, I succeeded in obtaining 1,025,586 messages in full or in part. Some messages failed to download because they crashed nntplib (which appears to be unable to handle messages of unbounded length), and I suspect my newsserver connections may have just timed out in the middle of the download at times. Others failed by expiring before I could download them. All in all, 19,288 messages were not downloaded.
Analysis of the contents of messages was hampered by a strong desire to find techniques that would mangle messages as little as possible. Prior experience with Python's message-parsing libraries leads me to believe that they are rather poor at handling some of the crap that comes into existence, and the errors in nntplib suggest they haven't fixed them yet. The only message-parsing framework I truly trust to give me the level of finesse I need is the JSMime that I'm writing, but that happens to be in the wrong language for this project. After reading some blog posts of Jeffrey Stedfast, though, I decided I would give GMime a try instead of trying to rewrite ad-hoc MIME parser #N.
Ultimately, I wrote a program to investigate the following questions on how messages operate in practice:
What charsets are used in practice? How are these charsets named?
For undeclared charsets, what are the correct charsets?
For charsets unknown to a decoder, how often would ASCII suffice?
How prevalent are malformed RFC 2047 encoded words?
When HTML and MIME are mixed, who wins?
What is the state of 8-bit headers?
While those were the questions I originally sought answers to, I did come up with others as I worked on my tool, some prompted by information I was already collecting anyway. The tool I wrote primarily uses GMime to convert the body parts to 8-bit text (no charset conversion) and to parse the Content-Type headers, which are really annoying to handle without writing a full parser. I used ICU to handle charset conversion and detection. RFC 2047 decoding is done largely by hand, since I needed very specific information that I couldn't convince GMime to give me. All code that I used is available upon request; the exact dataset is harder to transport, given that it is some 5.6GiB of data.
Other than GMime being built on GObject and exposing a C API, I can't complain much, although I didn't try to use it to do magic. Then again, in my experience (and as this post will probably convince you as well), you really want your MIME library to do charset magic for you, so in doing well for my needs, it's actually not doing well for a larger audience. ICU's C API similarly makes me want to complain. However, I'm now very suspicious of the quality of its charset detection code, which is the main reason I used it. Trying to figure out how to get it to handle charset decoding errors also proved far more annoying than it really should have.
Some final background concerns the biases I expect to crop up in the dataset. As the approximately 1 million messages were drawn from the Python set iterator, I suspect that there's no systematic bias towards or away from specific groups, except that the ~11K messages found in the eternal-september.* hierarchy are completely represented. The newsserver I used, Eternal September, has a respectably large set of newsgroups, although it is likely to be biased towards European languages and to under-represent East Asian ones. The less well-connected South America, Africa, and central Asia are going to be almost completely unrepresented. The download process will also be biased away from particularly heinous messages (such as those with exceedingly long lines), since nntplib itself fails on them.
This being news messages, I also expect that use of 8-bit will be far more common than would be the case in regular mail messages. On a related note, the use of 8-bit in headers would be commensurately elevated compared to normal email. What would be far less common is HTML. I also expect that undeclared charsets may be slightly higher.
Charsets
Charset data is mostly collected on the basis of individual body parts within messages; some messages have more than one part. Interestingly enough, the 1,025,587 messages yielded only 1,016,765 body parts with some text data, which indicates that some messages either had only headers on the server in the first place or the download process somehow managed to grab only the headers. There were also 393 messages that I identified as having parts with different charsets, which only further illustrates how annoying charsets are in messages.
The charset aliases are mostly unremarkable, except for the various labels used for US-ASCII (us - ascii, 646, and ANSI_X3.4-1968 are the less-well-known aliases), as well as the list of charset names that ICU was incapable of recognizing, given below. Unknown charsets are treated as equivalent to undeclared charsets in further processing, as there were too few to merit separate handling (45 in all).
x-windows-949
isolatin
ISO-IR-111
Default
ISO-8859-1 format=flowed
X-UNKNOWN
x-user-defined
windows-874
3D"us-ascii"
x-koi8-ru
windows-1252 (fuer gute Newsreader)
LATIN-1#2 iso-8859-1
For the next step, I used ICU to attempt to detect the actual charset of the body parts. ICU's charset detector doesn't support the full gamut of charsets, though, so charset names it doesn't claim to detect were instead processed by checking whether they decoded without error. Before using this detection, I check whether the text is pure ASCII (excluding control characters, so that charsets like ISO-2022-JP can still be caught, and excluding + when the charset being checked is UTF-7). ICU has a mode which ignores all text in things that look like HTML tags, and this mode is set for all HTML body parts.
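My reading of that pre-check, as a sketch (the exact whitespace and control-character handling here is an assumption, not the tool's actual code):

```python
def is_plain_ascii(data: bytes, declared_charset: str) -> bool:
    # Printable ASCII plus common whitespace only; other control bytes
    # disqualify the text, so ISO-2022-JP escape (ESC) sequences are not
    # mistaken for plain ASCII. '+' is excluded when checking UTF-7,
    # since it introduces base64 runs in that charset.
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    if declared_charset.lower() == "utf-7":
        allowed.discard(ord("+"))
    return all(b in allowed for b in data)

print(is_plain_ascii(b"plain text", "ISO-8859-1"))       # True
print(is_plain_ascii(b"\x1b$B...\x1b(B", "ISO-2022-JP")) # False (ESC byte)
print(is_plain_ascii(b"+AOk-", "UTF-7"))                 # False ('+' excluded)
```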
I don't quite believe ICU's charset detection results, so I've collapsed them into a simpler table to capture the most salient features. The Correct column indicates the cases where the detected result was the declared charset. The ASCII column captures the fraction which were pure ASCII. The UTF-8 column indicates that ICU reported the text as UTF-8 (it always seems to try this first). The Wrong C1 column refers to ISO-8859-1 text being detected as windows-1252 or vice versa, which ICU decides based on whether or not it sees an octet in the appropriate range. The Other column covers all remaining cases, including invalid results for charsets not supported by ICU.
| Declared | Correct | ASCII | UTF-8 | Wrong C1 | Other | Total |
|---|---:|---:|---:|---:|---:|---:|
| ISO-8859-1 | 230,526 | 225,667 | 883 | 8,119 | 1,035 | 466,230 |
| Undeclared | — | 148,054 | 1,116 | — | 37,626 | 186,796 |
| UTF-8 | 75,674 | 37,600 | — | — | 1,551 | 114,825 |
| US-ASCII | — | 98,238 | 0 | — | 304 | 98,542 |
| ISO-8859-15 | 67,529 | 18,527 | 0 | — | — | 86,056 |
| windows-1252 | 21,414 | 4,370 | 154 | 3,319 | 130 | 29,387 |
| ISO-8859-2 | 18,647 | 2,138 | 70 | 71 | 2,319 | 23,245 |
| KOI8-R | 4,616 | 424 | 2 | — | 1,112 | 6,154 |
| GB2312 | 1,307 | 59 | 0 | — | 112 | 1,478 |
| Big5 | 622 | 608 | 0 | — | 174 | 1,404 |
| windows-1256 | 343 | 10 | 0 | — | 45 | 398 |
| IBM437 | 84 | 257 | 0 | — | — | 341 |
| ISO-8859-13 | 311 | 6 | 0 | — | — | 317 |
| windows-1251 | 131 | 97 | 1 | — | 61 | 290 |
| windows-1250 | 69 | 69 | 0 | 14 | 101 | 253 |
| ISO-8859-7 | 26 | 26 | 0 | 0 | 131 | 183 |
| ISO-8859-9 | 127 | 11 | 0 | 0 | 17 | 155 |
| ISO-2022-JP | 76 | 69 | 0 | — | 3 | 148 |
| macintosh | 67 | 57 | 0 | — | — | 124 |
| ISO-8859-16 | 0 | 15 | — | — | 101 | 116 |
| UTF-7 | 51 | 4 | 0 | — | — | 55 |
| x-mac-croatian | 0 | 13 | — | — | 25 | 38 |
| KOI8-U | 28 | 2 | 0 | — | — | 30 |
| windows-1255 | 0 | 18 | 0 | 0 | 6 | 24 |
| ISO-8859-4 | 23 | 0 | 0 | — | — | 23 |
| EUC-KR | 0 | 3 | 0 | — | 16 | 19 |
| ISO-8859-14 | 14 | 4 | 0 | — | — | 18 |
| GB18030 | 14 | 3 | 0 | — | 0 | 17 |
| ISO-8859-8 | 0 | 0 | 0 | 0 | 16 | 16 |
| TIS-620 | 15 | 0 | 0 | — | — | 15 |
| Shift_JIS | 8 | 4 | 0 | — | 1 | 13 |
| ISO-8859-3 | 9 | 1 | — | — | 1 | 11 |
| ISO-8859-10 | 10 | 0 | 0 | — | — | 10 |
| KSC_5601 | 3 | 6 | 0 | — | — | 9 |
| GBK | 4 | 2 | 0 | — | — | 6 |
| windows-1253 | 0 | 3 | 0 | 0 | 2 | 5 |
| ISO-8859-5 | 1 | 0 | 0 | — | 3 | 4 |
| IBM850 | 0 | 4 | 0 | — | — | 4 |
| windows-1257 | 0 | 3 | 0 | — | — | 3 |
| ISO-2022-JP-2 | 2 | 0 | 0 | — | — | 2 |
| ISO-8859-6 | 0 | 1 | 0 | — | 0 | 1 |
| Total | 421,751 | 536,373 | 2,226 | 11,523 | 44,892 | 1,016,765 |
The most obvious thing shown in this table is that the most common charsets remain ISO-8859-1, Windows-1252, US-ASCII, UTF-8, and ISO-8859-15, which is to be expected given the presumed bias towards European languages in newsgroups. The low prevalence of ISO-2022-JP surprises me: it means a lower incidence of Japanese than I would have expected. Either that, or Japanese posters have switched to UTF-8 en masse, which I consider very unlikely given that Japanese users have tended to resist the trend towards UTF-8 the most.
Beyond that, this dataset has caused me to lose trust in the ICU charset detectors. KOI8-R is recorded as being 18% malformed text, most of which ICU believes to be ISO-8859-1 instead. Judging from the results, it appears that ICU has a bias towards guessing ISO-8859-1, which means I don't believe the numbers in the Other column to be accurate at all. For some reason, I don't appear to have decoders for ISO-8859-16 or x-mac-croatian on my local machine, but some tests run by hand appear to indicate that the text labeled with them is valid, not incorrect.
Somewhere between 0.1% and 1.0% of all messages are subject to mojibake, depending on how much you trust the charset detector. The cases of UTF-8 being misdetected as non-UTF-8 could potentially be explained by having very few non-ASCII sequences (ICU requires four valid sequences before it confidently declares text UTF-8); someone who writes a post in English but has a non-ASCII signature (such as myself) could easily fall into this category. Despite this, however, it does suggest that there is enough mojibake around that users need to be able to override charset decisions.
The undeclared charsets are described, in descending order of popularity, by ISO-8859-1, Windows-1252, KOI8-R, ISO-8859-2, and UTF-8, describing 99% of all non-ASCII undeclared data. ISO-8859-1 and Windows-1252 are probably over-counted here, but the interesting tidbit is that KOI8-R is used half as much undeclared as it is declared, and I suspect it may be undercounted. The practice of using locale-default fallbacks that Thunderbird has been using appears to be the best way forward for now, although UTF-8 is growing enough in popularity that using a specialized detector that decodes as UTF-8 if possible may be worth investigating (3% of all non-ASCII, undeclared messages are UTF-8).
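A fallback strategy along those lines might look like this (a minimal sketch, not Thunderbird's actual code; the fallback charset would in practice come from the locale):

```python
def decode_with_fallback(data, declared=None, fallback="windows-1252"):
    """Try the declared charset, then UTF-8, then a locale-default fallback."""
    for charset in (declared, "utf-8", fallback):
        if charset is None:
            continue
        try:
            return data.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    return data.decode("latin-1")  # last resort: maps every byte, never fails

# 0xEF is not valid UTF-8 here, so this falls back to windows-1252.
print(decode_with_fallback(b"na\xefve", declared="utf-8"))  # naïve
```

Trying UTF-8 before the locale fallback captures the "decode as UTF-8 if possible" idea: valid UTF-8 rarely occurs by accident in legacy-charset text.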
HTML
Unsurprisingly (considering I'm polling newsgroups), very few messages contained any HTML parts at all: there were only 1,032 such parts in the total sample, of which only 552 had non-ASCII characters and were therefore useful for the rest of this analysis. This means that I'm skeptical of generalizing these results to email at large, but I'll still summarize the findings.
HTML, unlike plain text, contains a mechanism to explicitly identify the charset of a message. The official algorithm for determining the charset of an HTML file can be described simply as "look for a <meta> tag in the first 1024 bytes. If one can be found, attempt to extract a charset using one of several different techniques depending on what's present or not." Since doing this fully properly is complicated in library-less C++ code, I opted to look first for a <meta[ \t\r\n\f] production, guess the extent of the tag, and try to find a charset= string somewhere in that tag. This appears to be an approach more reflective of how this parsing is actually done in email clients than the proper HTML algorithm. One difference is that my regular expressions also support the newer <meta charset="UTF-8"/> construct, although I don't see any use of it in the data.
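That sniffing approach can be sketched like so (my own regexes for illustration, not the tool's actual ones):

```python
import re

# Find a <meta ...> opening tag in the first 1024 bytes, then look for a
# charset= token anywhere inside the guessed extent of that tag.
META_RE = re.compile(rb"<meta[ \t\r\n\f][^>]*", re.I)
CHARSET_RE = re.compile(rb"charset=[\"']?([A-Za-z0-9_.:-]+)", re.I)

def sniff_html_charset(body):
    m = META_RE.search(body[:1024])
    if not m:
        return None
    c = CHARSET_RE.search(m.group(0))
    return c.group(1).decode("ascii") if c else None

html = b'<html><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(sniff_html_charset(html))                    # ISO-8859-1
print(sniff_html_charset(b'<meta charset="UTF-8"/>'))  # UTF-8
```

Guessing the tag's extent as "everything up to the next >" is exactly the kind of shortcut that works on real messages while falling short of the full HTML algorithm.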
I found only 332 parts where the HTML declared a charset. Only 22 parts had both a MIME charset and an HTML charset that disagreed with each other. I neglected to count how many messages had HTML charsets but no MIME charsets, but random sampling appeared to indicate that this is very rare in the data set (the same order of magnitude or less as those where they disagreed).
As for the question of who wins: of the 552 non-ASCII HTML parts, only 71 had a MIME charset that was not valid for the text. Then again, 71 did not have a valid HTML charset either, which strongly suggests that ICU was detecting the incorrect charset in those cases. Judging from manual inspection of such messages, it appears that the MIME charset ought to be preferred if it exists. There are also a large number of HTML charset specifications saying unicode, which ICU treats as UTF-16, which is most certainly wrong.
Headers
In the data set, 1,025,856 header blocks were processed for the following statistics. This is slightly more than the number of messages, since the headers of contained message/rfc822 parts were also processed. The good news is that 97% (996,103) of headers were completely ASCII. Of the remaining 29,753 headers, 3.6% (1,058) were UTF-8 and 43.6% (12,965) matched the declared charset of the first body part. This leaves 52.9% (15,730) that did not match that charset, however.
Now, NNTP messages can generally be expected to have a higher 8-bit header ratio, so this probably exaggerates the situation relative to most email. That said, the high incidence is definitely an indicator that even non-EAI-aware clients and servers cannot blindly presume that headers are 7-bit, nor can EAI-aware clients and servers presume that 8-bit headers are UTF-8. The high incidence of mismatch with the declared charset suggests that fallback-charset decoding of headers is a necessary step.
RFC 2047 encoded-words are also an interesting statistic to mine. I found 135,951 encoded-words in the data set, which is rather low, considering that messages can reasonably be expected to carry more than one encoded-word each. This is likely an artifact of NNTP's tendency towards 8-bit instead of 7-bit communication, and it understates their presence in regular email.
Counting encoded-words can be difficult, since there is a mechanism to let them continue across multiple pieces. For the purposes of this count, a sequence of such words counts as a single word, and the Continued column indicates how many sequences had more than one element. The 2047 Violation column counts the number of sequences where decoding words individually does not yield the same result as decoding them as a whole, in violation of RFC 2047. The Only ASCII column counts those words containing nothing but ASCII symbols, for which the encoding was thus (mostly) pointless. The Invalid column counts the number of sequences that produced a decoder error.
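The violation is easiest to see when a multibyte character is split across two encoded words (a contrived Python sketch; the payload extraction is deliberately crude and only handles this exact shape):

```python
import quopri

# Contrived example: the two UTF-8 bytes of 'é' (0xC3 0xA9) are split
# across two encoded words, which RFC 2047 forbids.
word1 = b"=?UTF-8?Q?caf=C3?="
word2 = b"=?UTF-8?Q?=A9?="

def payload(word):
    # Crude =?charset?Q?text?= splitter, enough for this illustration.
    return quopri.decodestring(word.split(b"?")[3])

# Decoding the concatenated payloads works...
print((payload(word1) + payload(word2)).decode("utf-8"))  # café
# ...but decoding each word independently does not.
try:
    payload(word1).decode("utf-8")
except UnicodeDecodeError:
    print("independent decode fails")
```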
| Charset | Count | Continued | 2047 Violation | Only ASCII | Invalid |
|---|---:|---:|---:|---:|---:|
| ISO-8859-1 | 56,355 | 15,610 | — | 499 | 0 |
| UTF-8 | 36,563 | 14,216 | 3,311 | 2,704 | 9,765 |
| ISO-8859-15 | 20,699 | 5,695 | — | 40 | 0 |
| ISO-8859-2 | 11,247 | 2,669 | — | 9 | 0 |
| windows-1252 | 5,174 | 3,075 | — | 26 | 0 |
| KOI8-R | 3,523 | 1,203 | — | 12 | 0 |
| windows-1256 | 765 | 568 | — | 0 | 0 |
| Big5 | 511 | 46 | 28 | 0 | 171 |
| ISO-8859-7 | 165 | 26 | — | 0 | 3 |
| windows-1251 | 157 | 30 | — | 2 | 0 |
| GB2312 | 126 | 35 | 6 | 0 | 51 |
| ISO-2022-JP | 102 | 8 | 5 | 0 | 49 |
| ISO-8859-13 | 78 | 45 | — | 0 | 0 |
| ISO-8859-9 | 76 | 21 | — | 0 | 0 |
| ISO-8859-4 | 71 | 2 | — | 0 | 0 |
| windows-1250 | 68 | 21 | — | 0 | 0 |
| ISO-8859-5 | 66 | 20 | — | 0 | 0 |
| US-ASCII | 38 | 10 | — | 38 | 0 |
| TIS-620 | 36 | 34 | — | 0 | 0 |
| KOI8-U | 25 | 11 | — | 0 | 0 |
| ISO-8859-16 | 22 | 1 | — | 0 | 22 |
| UTF-7 | 17 | 2 | 1 | 8 | 3 |
| EUC-KR | 17 | 4 | 4 | 0 | 9 |
| x-mac-croatian | 10 | 3 | — | 0 | 10 |
| Shift_JIS | 8 | 0 | 0 | 0 | 3 |
| Unknown | 7 | 2 | — | 0 | 7 |
| ISO-2022-KR | 7 | 0 | 0 | 0 | 0 |
| GB18030 | 6 | 1 | 0 | 0 | 1 |
| windows-1255 | 4 | 0 | — | 0 | 0 |
| ISO-8859-14 | 3 | 0 | — | 0 | 0 |
| ISO-8859-3 | 2 | 1 | — | 0 | 0 |
| GBK | 2 | 0 | 0 | 0 | 2 |
| ISO-8859-6 | 1 | 1 | — | 0 | 0 |
| Total | 135,951 | 43,360 | 3,361 | 3,338 | 10,096 |
This table somewhat mirrors the distribution of regular charsets, with one major class of differences: charsets that represent non-Latin scripts (particularly Asian scripts) appear to be overdistributed compared to their corresponding use in body parts. The exception to this rule is GB2312 which is far lower than relative rankings would presume—I attribute this to people using GB2312 being more likely to use 8-bit headers instead of RFC 2047 encoding, although I don't have direct evidence.
Clearly continuations are common, which is to be expected. The sad part is how few people bother to adhere to the specification here: out of 14,312 continuations in charsets that could violate the specification, 23.5% of them did so. The mode-shifting charsets (ISO-2022-JP and EUC-KR) are violated almost every time, which suggests that no one bothered to check whether their encoder "returns to ASCII" at the end of the word (I know Thunderbird's does, but the other ones I checked don't appear to).
The number of invalid UTF-8 decoded words, 26.7%, seems impossibly high to me. A brief check of my code indicates that it works incorrectly in the face of invalid continuations, which certainly exaggerates the effect but still leaves a value too high for my tastes. Of more note are the elevated counts for the East Asian charsets: Big5, GB2312, and ISO-2022-JP. I am not an expert in charsets, but I believe that Big5 and GB2312 in particular are each families of almost-but-not-quite-identical charsets, and it may be that ICU is choosing the wrong member of each family for these instances.
There is a surprisingly large number of encoded words that encode only ASCII. When searching specifically for the ones that use the US-ASCII charset, I found that these can be divided into three categories. One set comes from a few people who apparently have unsanitized whitespace (space and LF were the two I recall seeing) in the display name, producing encoded words like =?us-ascii?Q?=09Edward_Rosten?=. Blame 40tude Dialog here. Another set encodes some basic characters (most commonly = and ?, although a few other interpreted characters popped up). The final set of errors were double-encoded words, such as =?us-ascii?Q?=3D=3FUTF-8=3FQ=3Ff=3DC3=3DBCr=3F=3D?=, which appear to all be generated by an Emacs-based newsreader.
One interesting thing when sifting the results is finding the crap that people produce in their tools. By far the worst single instance of an RFC 2047 encoded-word that I found is this one: Subject: Re: [Kitchen Nightmares]
Meow! Gordon Ramsay Is =?ISO-8859-1?B?UEgR lqZ VuIEhlYWQgVH rbGeOIFNob BJc
RP2JzZXNzZW?= With My =?ISO-8859-1?B?SHVzYmFuZ JzX0JhbGxzL JfU2F5c19BbXiScw==?=
Baking Company Owner (complete with embedded spaces), discovered by crashing my ad-hoc base64 decoder (due to the spaces). The interesting thing is that even after investigating the output encoding, it doesn't look like the text is actually correct ISO-8859-1... or any obvious charset for that matter.
I looked at the unknown charsets by hand. Most of them were actually empty charset labels (which looked like =??B?Sy4gSC4gdm9uIFLDvGRlbg==?=), and all but one of the outright empty ones were generated by KNode and were really UTF-8. The remaining one was Windows-1252, generated by a minor newsreader.
Another important aspect of headers is how to handle 8-bit headers. RFC 5322 blindly hopes that headers are pure ASCII, while RFC 6532 dictates that they are UTF-8. Indeed, 97% of headers are ASCII, leaving just 29,753 headers that are not. Of these, only 1,058 (3.6%) are UTF-8 per RFC 6532. Deducing which charset they are is difficult because the large amount of English text for header names and the important control values will greatly skew any charset detector, and there is too little text to give a charset detector confidence. The only metric I could easily apply was testing Thunderbird's heuristic as "the header blocks are the same charset as the message contents"—which only worked 45.2% of the time.
Encodings
While developing an earlier version of my scanning program, I was intrigued to know how often various content transfer encodings were used. I found 1,028,971 parts in all (1,027,474 of which are text parts). The transfer encoding of binary did manage to sneak in, with 57 such parts. Using 8-bit text was very popular, at 381,223 samples, second only to 7-bit at 496,114 samples.
Quoted-printable had 144,932 samples and base64 only 6,640 samples. Extremely interesting is the presence of 4 illegal transfer encodings in 5 messages; two of them are obvious typos, and the others appear to be a client mangling header continuations into the transfer-encoding.
Conclusions
So, drawing from the body of this data, I would like to make the following conclusions as to using charsets in mail messages:
Have a fallback charset. Undeclared charsets are extremely common, and I'm skeptical that charset detectors are going to get this stuff right, particularly since email can more naturally combine multiple languages than other bodies of text (think signatures). Thunderbird currently uses a locale-dependent fallback charset, which roughly mirrors what Firefox and I think most web browsers do.
Let users override charsets when reading. On a similar token, mojibake text, while not particularly common, is common enough to make declared charsets sometimes unreliable. It's also possible that the fallback charset is wrong, so users may need to override the chosen charset.
Testing is mandatory. In this set of messages, I found base64 encoded words with spaces in them, encoded words without charsets (even UNKNOWN-8BIT), and clearly invalid Content-Transfer-Encodings. Real email messages that are flagrantly in violation of basic spec requirements exist, so you should make sure that your email parser and client can handle the weirdest edge cases.
Non-UTF-8, non-ASCII headers exist. EAI notwithstanding, 8-bit headers are a reality. Combined with a predilection for saying ASCII only when text really is ASCII, this means that there is often no good in-band information to tell you what charset is correct for headers, so you have to go back to a fallback charset.
US-ASCII really means ASCII. Email clients appear to do a very good job of only emitting US-ASCII as a charset label if it's US-ASCII. The sample size is too small for me to grasp what charset 8-bit characters should imply in US-ASCII.
Know your decoders. ISO-8859-1 actually means Windows-1252 in practice. Big5 and GB2312 are actually small families of charsets with slightly different meanings. ICU notably disagrees with some of these realities, so be sure to include various charset edge cases in your tests so you know that the decoders are correct.
UTF-7 is still relevant. Of the charsets I found that are not mentioned in the WHATWG encoding spec, IBM437 and x-mac-croatian are in use only due to specific circumstances that limit their generalizable presence, and IBM850 is too rare. UTF-7, however, is common enough that you need to actually worry about it, as abominable and evil a charset as it is.
HTML charsets may matter—but MIME matters more. I don't have enough data to say if charsets declared in HTML are needed to do proper decoding. I do have enough to say fairly conclusively that the MIME charset declaration is authoritative if HTML disagrees.
Charsets are not languages. The entire reason x-mac-croatian is used at all can be traced to Thunderbird displaying the charset as "Croatian," despite it being pretty clearly not a preferred charset. Similarly most charsets are often enough ASCII that, say, an instance of GB2312 is a poor indicator of whether or not the message is in English. Anyone trying to filter based on charsets is doing a really, really stupid thing.
RFCs reflect an ideal world, not reality. This is most notable in RFC 2047: the specification may state that encoded words are supposed to be independently decodable, but the evidence is pretty clear that more clients break this rule than uphold it.
Limit the charsets you support. Just because your library lets you emit a hundred charsets doesn't mean that you should let someone try to do it. You should emit US-ASCII or UTF-8 unless you have a really compelling reason not to, and those compelling reasons don't require obscure charsets. Some particularly annoying charsets should never be written: EBCDIC is already basically dead on the web, and I'd like to see UTF-7 die as well.
When I have time, I'm planning on taking some of the more egregious or interesting messages in my dataset and packaging them into a database of emails to help create test suites for handling messages properly.
The Web is a big deal (as is the Internet on which it is built), I don’t need to tell you! But I did have a few thoughts, solicited by a friend who asked “where [do] you think the future of the Internet will take us in the next 25 years?”
My answer: 25 years is a long time. I expect some big changes (computers inside us monitoring body functions), while other things stay remarkably unchanged (no flying cars).
Even now people remark on how much more personal or intimate a smartphone is than a PC (that image still makes me laugh). Think about this when the Internet includes not just your house and most physical artifacts worth hooking up, but yourself.
In such a world, open systems built on open standards and open source are even more important, for all of these reasons:
interoperation among implementations;
freedom to migrate among different vendors’ systems;
ability to mix-and-match, hyperlink/transclude, copy-learn-and-hack, and monitor/audit against mistakes, malware, and surveillance.
We announced the $25 Firefox OS smartphone with Spreadtrum Communications, targeting retail channels in emerging markets, and attracting operator interest to boot. This is an upgrade for those channels at about the same price as the feature phones selling there today. (Yes, $25 is the target end-user price.)
We showed the Firefox OS smartphone portfolio growing upward too, with more and higher-end devices from existing and new OEM partners. Peter Bright’s piece for Ars Technica is excellent and has nice pictures of all the new devices.
As I’ve noted before, our success in attracting partners is due in part to our ability to innovate and standardize the heretofore-missing APIs needed to build fully-capable smartphones and other devices purely from web standards. To uphold tradition, here is another update to my progress reports from last year and from 2012.
First, and not yet a historical curiosity: the still-open tracking bug asking for “New” Web APIs, filed at the dawn of B2G by Andreas Gal.
Next, links for “Really-New” APIs, most making progress in standards bodies:
This is how the web evolves: by implementors championing and testing extensions, with emerging consensus if at all possible, else in a pref-enabled or certified-app sandbox if there’s no better way. We thank colleagues at W3C and elsewhere who are collaborating with us to uplift the Web to include APIs for all the modern mobile device sensors and features. We invite all parties working on similar systems not yet aligned with the emerging standards to join us.
This post is part 5 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. Part 4 discusses email addresses. This post discusses the more general problem of email headers.
Back in my first post, Ludovic kindly posted, in a comment, a link to a talk of someone else's email rant. And the best place to start this post is with a quote from that talk: "If you want to see an email programmer's face turn red, ask him about CFWS." CFWS is an acronym that stands for "comments and folded whitespace," and I can attest that the mere mention of CFWS is enough for me to start ranting. Comments in email headers are spans of text wrapped in parentheses, and the folding of whitespace refers to the ability to continue headers on multiple lines by inserting a newline before (but not in lieu of) a space.
I'll start by pointing out that there is little advantage in adding free-form data to headers that are not going to be manually read in the vast majority of cases. In practice, I have seen comments used for only three headers on a reliable basis. One of these is the Date header, where a human-readable name of the timezone is sometimes included. The other two are the Received and Authentication-Results headers, where some debugging aids are thrown in. There would be no great loss in omitting any of this information; if information is really important, appending an X- header with that information is still a viable option (that's where most spam filtration notes get added, for example).
For a feature of such questionable utility in the first place, the impact it has on parsing message headers is enormous. RFC 822 is specified in a manner that is familiar to anyone who reads language specifications: there is a low-level lexical scanning phase which feeds tokens into a secondary parsing phase. Like programming languages, comments and white space are semantically meaningless [1]. Unlike programming languages, however, comments can be nested—and therefore lexing an email header is not regular [2]. The problems of folding (a necessary evil thanks to the line length limit I keep complaining about) pale in comparison to comments, but it's extra complexity that makes machine-readability more difficult.
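To make the non-regularity concrete: stripping comments needs a nesting counter, which no regular expression can supply. A deliberately simplified sketch (it ignores quoted strings and backslash escapes, which a real CFWS lexer must also handle):

```python
def strip_comments(value):
    """Remove (possibly nested) comments from a header value.

    Simplified: ignores quoted strings and backslash escapes.
    """
    out, depth = [], 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

# Nested comments collapse correctly:
assert strip_comments("+0100 (CET (Central European Time))") == "+0100 "
```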
Fortunately, RFC 2822 made a drastic change to the specification that greatly limited where CFWS could be inserted into headers. For example, in the Date header, comments are allowed only following the timezone offset (and whitespace in a few specific places); in addressing headers, CFWS is not allowed within the email address itself [3]. One unanticipated downside is that it makes reading the other RFCs that specify mail headers more difficult: any version that predates RFC 2822 uses the syntax assumptions of RFC 822 (in particular, CFWS may occur between any listed tokens), whereas RFC 2822 and its descendants all explicitly enumerate where CFWS may occur.
Beyond the issues with CFWS, though, syntax is still problematic. The separation of distinct lexing and parsing phases means that you occasionally glimpse what may be a hint of uniformity, which turns out to be an ephemeral illusion. For example, the header parameters defined in RFC 2045 for Content-Type and Content-Disposition set a tradition of ;-separated param=value attributes, which has been picked up by, say, the DKIM-Signature or Authentication-Results headers. Except a close look indicates that Authentication-Results allows two param=value pairs between semicolons. Another side effect was pointed out in my second post: you can't turn a generic 8-bit header into a 7-bit compatible header, since you can't tell without knowing the syntax of the header which parts can be specified as RFC 2047 encoded-words and which ones can't.
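The apparent ;-separated uniformity tempts you into a parser like the sketch below, which promptly breaks on quoted values containing ; and on Authentication-Results' doubled pairs — hence the illusion:

```python
def parse_params(value):
    # Naive split of a Content-Type-style header value. Real RFC 2045
    # parsing must honor quoted-string syntax, so this is illustrative only.
    main, _, rest = value.partition(";")
    params = {}
    for piece in rest.split(";"):
        if "=" in piece:
            key, _, val = piece.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return main.strip(), params

mtype, params = parse_params('text/plain; charset="us-ascii"; format=flowed')
# mtype == "text/plain"; params == {"charset": "us-ascii", "format": "flowed"}
```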
There's more to headers than their syntax, though. Email headers are structured as a somewhat-unordered list of headers; this genericity gives rise to a very large number of headers, and that's just the list of official headers. There are unofficial headers whose use is generally agreed upon, such as X-Face, X-No-Archive, or X-Priority; other unofficial headers are used for internal tracking such as Mailman's X-BeenThere or Mozilla's X-Mozilla-Status headers. Choosing how to semantically interpret these headers (or even which headers to interpret!) can therefore be extremely daunting.
Some of the headers are specified in ways that would seem surprising to most users. For example, the venerable From header can represent anywhere from 0 mailboxes [4] to an arbitrarily large number—but most clients assume that exactly one exists. It's also worth noting that the Sender header is (if present) a better indication of message origin as far as tracing is concerned [5], but its relative rarity likely results in filtering applications not taking it into account. The suite of Resent-* headers also experiences similar issues.
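Python's stdlib, for one, takes the multi-mailbox reading seriously; a From value with several mailboxes parses cleanly:

```python
from email.utils import getaddresses

# A single From header value may carry any number of mailboxes:
pairs = getaddresses(["Ann Author <ann@example.com>, Bob <bob@example.com>"])
# pairs == [("Ann Author", "ann@example.com"), ("Bob", "bob@example.com")]
```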
Another impact of email headers is the degree to which they can be trusted. RFC 5322 gives some nice-sounding platitudes to how headers are supposed to be defined, but many of those interpretations turn out to be difficult to verify in practice. For example, Message-IDs are supposed to be globally unique, but they turn out to be extremely lousy UUIDs for emails on a local system, even if you allow for minor differences like adding trace headers [6].
More serious are the spam, phishing, etc. messages that lie as much as possible so as to be seen by end-users. Assuming that a message is hostile, the only header that can be actually guaranteed to be correct is the first Received header, which is added by the final user's mailserver [7]. Every other header, including the Date and From headers most notably, can be a complete and total lie. There's no real way to authenticate the headers or hide them from snoopers—this has critical consequences for both spam detection and email security.
There's more I could say on this topic (especially CFWS), but I don't think it's worth dwelling on. This is more of a preparatory post for the next entry in the series than a full compilation of complaints. Speaking of my next post, I don't think I'll be able to keep up my entirely-unintentional rate of posting one entry in this series a month. I've exhausted the topics in email that I am intimately familiar with and thus have to move on to the ones I'm less familiar with.
[1] Some people attempt to be too zealous in following RFCs and ignore the distinction between syntax and semantics, as I complained about in part 4 when discussing the syntax of email addresses.
[2] I mean this in the theoretical sense of the definition. The proof that balanced parentheses is not a regular language is a standard exercise in use of the pumping lemma.
[3] Unless domain literals are involved. But domain literals are their own special category.
[4] Strictly speaking, the 0 value is intended to be used only when the email has been downgraded and the email address cannot be downgraded. Whether or not these will actually occur in practice is an unresolved question.
[5] Semantically speaking, Sender is the person who typed the message up and actually sent it out. From is the person who dictated the message. If the two headers would be the same, then Sender is omitted.
[6] Take a message that's cross-posted to two mailing lists. Each mailing list will generate copies of the message which end up being submitted back into the mail system and will typically avoid touching the Message-ID.
[7] Well, this assumes you trust your email provider. However, your email provider can do far worse to your messages than lie about the Received header…
Recently, the question of charsets came up within the context of necessary decoder support for Thunderbird. After much hemming and hawing about how to find this out (which included a plea to the IMAP-protocol list for data), I remembered that I actually had this data. Long-time readers of this blog may recall that I did a study several years ago on the usage share of newsreaders. After that, I was motivated to take my data collection to the extreme. Instead of considering only the "official" Big-8 newsgroups, I looked at all of them on the news server I use (effectively, all but alt.binaries). Instead of relying on pulling the data from the server for the headers I needed, I grabbed all of them—the script literally runs HEAD and saves the results in a database. And instead of a month of results, I grabbed the results for the entire year of 2011. And then I sat on the data.
After recalling Henri Sivonen's pesterings about data, I decided to see the suitability of my dataset for this task. For data management reasons, I only grabbed the data from the second half of the year (about 10 million messages). I know from memory that the quality of Python's message parser (which was used to extract data in the first place) is surprisingly poor, which introduces bias of unknown consequence to my data. Since I only extracted headers, I can't identify charsets for anything which was sent as, say, multipart/alternative (which is more common than you'd think), which introduces further systematic bias. The end result is approximately 9.6M messages that I could extract charsets from and thence do further research.
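The extraction step itself is nothing fancy; a sketch of the kind of tally involved (the sample header values and the regex here are my illustrative stand-ins, not the actual study script):

```python
import re
from collections import Counter

CHARSET_RE = re.compile(r'charset="?([^";\s]+)"?', re.IGNORECASE)

def tally_charsets(content_type_values):
    """Count charset labels across raw Content-Type header values."""
    counts = Counter()
    for value in content_type_values:
        m = CHARSET_RE.search(value)
        counts[m.group(1).lower() if m else "(none)"] += 1
    return counts

tally = tally_charsets([
    "text/plain; charset=UTF-8",
    'text/html; charset="iso-8859-1"',
    "text/plain",  # no charset declared
])
# tally counts: utf-8 -> 1, iso-8859-1 -> 1, (none) -> 1
```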
Discussions revealed one particularly surprising tidbit of information. The most popular charset not accounted for by the Encoding specification was IBM437. Henri Sivonen speculated that the cause was some crufty old NNTP client on Windows using that encoding, so I endeavored to build a correlation database to check that assumption. Using the wonderful magic of d3, I produced a heatmap comparing distributions of charsets among various user agents. Details about the visualization may be found on that page, but it does refute Henri's claim when you dig into the data (it appears to be caused by specific BBS-to-news gateways, and is mostly localized in particular BBS newsgroups).
Also found on that page are some fun discoveries of just what kind of crap people try to pass off as valid headers. Some of those User-Agents are clearly spoofs (Outlook Express and family used the X-Newsreader header, not the User-Agent header). There also appears to be a fair amount of mojibake in headers (one of them appeared to be venerable double mojibake). The charsets also have some interesting labels to them: the "big5\n" and the "(null)" illustrate that some people don't double check their code very well, and not shown are the 5 examples of people who think charset names have spaces in them. A few people appear to have mixed up POSIX locales with charsets as well.
It is becoming increasingly difficult to trust the privacy properties of software and services we rely on to use the Internet. Governments, companies, groups and individuals may be surveilling us without our knowledge. This is particularly troubling when such surveillance is done by governments under statutes that provide limited court oversight and almost no room for public scrutiny.
As a result of laws in the US and elsewhere, prudent users must interact with Internet services knowing that despite how much any cloud-service company wants to protect privacy, at the end of the day most big companies must comply with the law. The government can legally access user data in ways that might violate the privacy expectations of law-abiding users. Worse, the government may force service operators to enable surveillance (something that seems to have happened in the Lavabit case).
Worst of all, the government can do all of this without users ever finding out about it, due to gag orders.
Implications for Browsers
This creates a significant predicament for privacy and security on the Open Web. Every major browser today is distributed by an organization within reach of surveillance laws. As the Lavabit case suggests, the government may request that browser vendors secretly inject surveillance code into the browsers they distribute to users. We have no information that any browser vendor has ever received such a directive. However, if that were to happen, the public would likely not find out due to gag orders.
The unfortunate consequence is that software vendors — including browser vendors — must not be blindly trusted. Not because such vendors don’t want to protect user privacy. Rather, because a law might force vendors to secretly violate their own principles and do things they don’t want to do.
Why Mozilla is different
Mozilla has one critical advantage over all other browser vendors. Our products are truly open source. Internet Explorer is fully closed-source, and while the rendering engines WebKit and Blink (Chromium) are open-source, the Safari and Chrome browsers that use them are not fully open-source. Both contain significant fractions of closed-source code.
Mozilla Firefox in contrast is 100% open source[1]. As Anthony Jones from our New Zealand office pointed out the other month, security researchers can use this fact to verify the executable bits contained in the browsers Mozilla is distributing, by building Firefox from source and comparing the built bits with our official distribution.
This will be most effective on platforms where we already use open-source compilers to produce the executable, to avoid compiler-level attacks as shown in 1984 by Ken Thompson.
Call to Action
To ensure that no one can inject undetected surveillance code into Firefox, security researchers and organizations should:
regularly audit Mozilla source and verified builds by all effective means;
establish automated systems to verify official Mozilla builds from source; and
raise an alert if the verified bits differ from official bits.
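The final comparison step is trivially scriptable; a Python sketch (the hard part, of course, is achieving reproducible builds so that a bit-for-bit comparison is meaningful):

```python
import hashlib

def sha256_of(path):
    """Hash a file in chunks so large binaries need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def builds_match(official_path, rebuilt_path):
    # Raise the alarm when independently rebuilt bits differ from the
    # official distribution.
    return sha256_of(official_path) == sha256_of(rebuilt_path)
```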
In the best case, we will establish such a verification system at a global scale, with participants from many different geographic regions and political and strategic interests and affiliations.
Security is never “done” — it is a process, not a final rest-state. No silver bullets. All methods have limits. However, open-source auditability cleanly beats the lack of ability to audit source vs. binary.
Through international collaboration of independent entities we can give users the confidence that Firefox cannot be subverted without the world noticing, and offer a browser that verifiably meets users’ privacy expectations.
See bug 885777 to track our work on verifiable builds.
End-to-End Trust
Beyond this first step, can we use such audited browsers as trust anchors, to authenticate fully-audited open-source Internet services? This seems possible in theory. No one has built such a system to our knowledge, but we welcome precedent citations and experience reports, and encourage researchers to collaborate with us.
Brendan Eich, CTO and SVP Engineering, Mozilla
Andreas Gal, VP Mobile and R&D, Mozilla
[1] Firefox on Linux is the best case, because the C/C++ compiler, runtime libraries, and OS kernel are all free and open source software. Note that even on Linux, certain hardware-vendor-supplied system software, e.g., OpenGL drivers, may be closed source.
This post is part 4 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. Part 3 discusses MIME. This post discusses the problems with email addresses.
You might be surprised that I find email addresses difficult enough to warrant a post discussing only this single topic. However, this is a surprisingly complex topic, and one which is made much harder by the presence of a very large number of people purporting to know the answer who then proceed to do the wrong thing [1]. To understand why email addresses are complicated, and why people do the wrong thing, I pose the following challenge: write a regular expression that matches all valid email addresses and only valid email addresses. Go ahead, stop reading, and play with it for a few minutes, and then you can compare your answer with the correct answer.
Done yet? So, if you came up with a regular expression, you got the wrong answer. But that's because it's a trick question: I never defined what I meant by a valid email address. Still, if you're hoping for partial credit, you may be able to get some by correctly matching one of the purported definitions I give below.
The most obvious definition meant by "valid email address" is text that matches the addr-spec production of RFC 822. No regular expression can match this definition, though—and I am aware of the enormous regular expression that is often purported to solve this problem. This is because comments can be nested, which means you would need to solve the "balanced parentheses" language, which is easily provable to be non-regular [2].
Matching the addr-spec production, though, is the wrong thing to do: the production dictates the possible syntax forms an address may have, when you arguably want a more semantic interpretation. As a case in point, the two email addresses example@test.invalid and example @ test . invalid are both meant to refer to the same thing. When you ignore the actual full grammar of an email address and instead read the prose, particularly of RFC 5322 instead of RFC 822, you'll realize that matching comments and whitespace is entirely the wrong thing to do in the email address.
Here, though, we run into another problem. Email addresses are split into local-parts and the domain, the text before and after the @ character; the format of the local-part is basically either a quoted string (to escape otherwise illegal characters in a local-part), or an unquoted "dot-atom" production. The quoting is meant to be semantically invisible: "example"@test.invalid is the same email address as example@test.invalid. Normally, I would say that the use of quoted strings is an artifact of the encoding form, but given the strong appetite for aggressively "correct" email validators that attempt to blindly match the specification, it seems to me that it is better to keep the local-parts quoted if they need to be quoted. The dot-atom production matches a sequence of atoms (spans of text excluding several special characters like [ or .) separated by . characters, with no intervening spaces or comments allowed anywhere.
RFC 5322 only specifies how to unfold the syntax into a semantic value, and it does not explain how to semantically interpret the values of an email address. For that, we must turn to SMTP's definition in RFC 5321, whose semantic definition clearly imparts requirements on the format of an email address not found in RFC 5322. On domains, RFC 5321 explains that the domain is either a standard domain name [3], or it is a domain literal which is either an IPv4 or an IPv6 address. Examples of the latter two forms are test@[127.0.0.1] and test@[IPv6:::1]. But when it comes to the local-parts, RFC 5321 decides to just give up and admit no interpretation except at the final host, advising only that servers should avoid local-parts that need to be quoted. In the context of email specification, this kind of recommendation is effectively a requirement to not use such email addresses, and (by implication) most client code can avoid supporting these email addresses [4].
The prospect of internationalized domain names and email addresses throws a massive wrench into the state of affairs, however. I've talked at length in part 2 about the problems here; the lack of a definitive decision on Unicode normalization means that the future here is extremely uncertain, although RFC 6530 does implicitly advise that servers should accept that some (but not all) clients are going to do NFC or NFKC normalization on email addresses.
At this point, it should be clear that asking for a regular expression to validate email addresses is really asking the wrong question. I did it at the beginning of this post because that is how the question tends to be phrased. The real question that people should be asking is "what characters are valid in an email address?" (and more specifically, the left-hand side of the email address, since the right-hand side is obviously a domain name). The answer is simple: among the ASCII printable characters (Unicode is more difficult), all the characters but those in the following string: " \"\\[]();,@". Indeed, viewing an email address like this is exactly how HTML 5 specifies it in its definition of a format for <input type="email">
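That character-class view translates directly into a short pattern; the sketch below follows the shape of the HTML5 <input type="email"> definition (a pragmatic filter, deliberately not full RFC 5321 validation):

```python
import re

EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"                        # local-part
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"           # first label
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"    # more labels
)

def looks_like_email(s):
    return EMAIL_RE.match(s) is not None

assert looks_like_email("example@test.invalid")
assert not looks_like_email("example @ test . invalid")  # no CFWS welcome
```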
Another much easier and more obvious way to validate an email address relies on zero regular expressions and zero references to specifications. Just send an email to the purported address and ask the user to click on a unique link to complete registration. After all, the most common reason to request an email address is to be able to send messages to that email address, so if mail cannot be sent to it, the email address should be considered invalid, even if it is syntactically valid.
Unfortunately, people persist in trying to write buggy email validators. Some are too simple and ignore valid characters (or valid top-level domain names!). Others are so focused on trying to match the RFC addr-spec syntax that, while they will happily accept most or all addr-spec forms, they also accept email addresses which are very likely to wreak havoc if you pass them to another system to send email; cause various forms of SQL injection, XSS injection, or even shell injection attacks; and which are likely to confuse tools as to what the email address actually is. This can be ameliorated with complicated normalization functions for email addresses, but none of the email validators I've looked at actually do this (which, again, goes to show that they're missing the point).
Which brings me to a second quiz question: are email addresses case-insensitive? If you answered no, well, you're wrong. If you answered yes, you're also wrong. The local-part, as RFC 5321 emphasizes, is not to be interpreted by anyone but the final destination MTA server. A consequence is that it does not specify if they are case-sensitive or case-insensitive, which means that general code should not assume that it is case-insensitive. Domains, of course, are case-insensitive, unless you're talking about internationalized domain names [5]. In practice, though, RFC 5321 admits that servers should make the names case-insensitive. For everyone else who uses email addresses, the effective result of this admission is that email addresses should be stored in their original case but matched case-insensitively (effectively, code should be case-preserving).
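Case-preserving comparison is a one-liner in spirit; a sketch (with the caveat that strictly, only the destination MTA gets to decide local-part case rules, so folding both sides is the pragmatic choice, not the specified one):

```python
def addresses_equal(a, b):
    """Compare two email addresses case-insensitively.

    Case-preserving code should still *store* addresses exactly as typed;
    only the comparison folds case.
    """
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    return (local_a.lower() == local_b.lower()
            and domain_a.lower() == domain_b.lower())

assert addresses_equal("User@Example.COM", "user@example.com")
```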
Hopefully this gives you a sense of why email addresses are frustrating and much more complicated than they first appear. There are historical artifacts of email addresses I've decided not to address (the roles of ! and % in addresses), but since they only matter to some SMTP implementations, I'll discuss them when I pick up SMTP in a later part (if I ever do). I've avoided discussing some major issues with the specification here, because they are much better handled as part of the issues with email headers in general.
Oh, and if you were expecting regular expression answers to the challenge I gave at the beginning of the post, here are the answers I threw together for my various definitions of "valid email address." I didn't test or even try to compile any of these regular expressions (as you should have gathered, regular expressions are not what you should be using), so caveat emptor.
[^\x00-\x20\x80-\x9f()\[\]:;@\\,]+@[^\x00-\x20\x80-\x9f()\[\]:;@\\,]+, with the caveats that a dot does not begin or end the local-part, nor do two dots appear consecutively, the local-part is in NFC or NFKC form, and the domain is a valid domain name.
[1] If you're trying to find guides on valid email addresses, a useful way to eliminate incorrect answers is the following set of litmus tests. First, if the guide mentions an RFC, but does not mention RFC 5321 (or RFC 2821, in a pinch), you can generally ignore it. If, according to the guide, the email address test (not) @ example.com is valid, then the author has clearly not carefully read and understood the specifications. If the guide mentions RFC 5321, RFC 5322, RFC 6530, and IDN, then the author clearly has taken the time to actually understand the subject matter and their opinion can be trusted.
[2] I'm using "regular" here in the sense of theoretical regular languages. Perl-compatible regular expressions can match non-regular languages (because of backreferences), but even backreferences can't solve the problem here. It appears that newer versions support a construct which can match balanced parentheses, but I'm going to discount that because by the time you're going to start using that feature, you have at least two problems.
[3] Specifically, if you want to get really technical, the domain name is going to be routed via MX records in DNS.
[4] RFC 5321 is the specification for SMTP, and, therefore, it is only truly binding for things that talk SMTP; likewise, RFC 5322 is only binding on people who speak email headers. When I say that systems can pretend that email addresses with domain literals or quoted local-parts don't exist, I'm excluding mail clients and mail servers. If you're writing a website and you need an email address, there is no need to support email addresses which don't exist on the open, public Internet.
[5] My usual approach to seeing internationalization at this point (if you haven't gathered from the lengthy second post of this series) is to assume that the specifications assume magic where case insensitivity is desired.
This post is part 3 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure. Part 2 discusses internationalization. This post discusses MIME, the mechanism by which email evolves beyond plain text.
MIME, which stands for Multipurpose Internet Mail Extensions, is primarily dictated by a set of 5 RFCs: RFC 2045, RFC 2046, RFC 2047, RFC 2048, and RFC 2049, although RFC 2048 (which governs registration procedures for new MIME types) has since been replaced by newer versions. RFC 2045 covers the format of related headers, as well as the format of the encodings used to convert 8-bit data into 7-bit for transmission. RFC 2046 describes the basic set of MIME types, most importantly the format of multipart/ types. RFC 2047 was discussed in part 2 of this series, as it discusses encoding internationalized data in headers. RFC 2049 describes a set of guidelines for how to be conformant when processing MIME; as you might imagine, these are woefully inadequate for modern processing anyway. In practice, it is only the first three documents that matter for building an email client.
There are two main contributions of MIME, which actually makes it a bit hard to know what is meant when people refer to MIME in the abstract. The first contribution, which is of interest mostly to email, is the development of a tree-based representation of email which allows for the inclusion of non-textual parts in messages. This tree is ultimately how attachments and other features are incorporated. The other contribution is the development of a registry of MIME types for different types of file contents. MIME types have propagated far beyond just the email infrastructure: if you want to describe what kind of file a binary blob is, you can refer to it by either a magic header sequence, a file extension, or a MIME type. Searching for terms like MIME libraries will sometimes turn up libraries that actually handle the so-called MIME sniffing process (guessing a MIME type from a file extension or the contents of a file).
MIME types are decomposable into two parts, a media type and a subtype. The type text/plain has a media type of text and a subtype of plain, for example. IANA maintains an official repository of MIME types. There are very few media types, and I would argue that there ought to be fewer. In practice, degradation of unknown MIME types means that there are essentially three "fundamental" types: text/plain (which represents plain, unformatted text and to which unknown text/* types degrade), multipart/mixed (the "default" version of multipart messages; more on this later), and application/octet-stream (which represents unknown, arbitrary binary data). I can understand the separation of the message media type for things which generally follow the basic format of headers+body akin to message/rfc822, although the presence of types like message/partial that don't follow the headers+body format and the requirement to downgrade to application/octet-stream mars usability here. The distinction between image, audio, video and application is petty when you consider that in practice, the distinction isn't going to be able to make clients give better recommendations for how to handle these kinds of content (which really means deciding if it can be displayed inline or if it needs to be handed off to an external client).
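That degradation rule is simple enough to sketch (the set of "known" types here is an illustrative stand-in for whatever a real client actually renders):

```python
KNOWN_TYPES = {"text/plain", "text/html", "multipart/mixed",
               "image/png", "application/octet-stream"}  # illustrative set

def degrade(mime_type):
    """Fall back for unknown types: unknown text/* degrades to text/plain,
    everything else unknown degrades to application/octet-stream."""
    if mime_type in KNOWN_TYPES:
        return mime_type
    media, _, _sub = mime_type.partition("/")
    return "text/plain" if media == "text" else "application/octet-stream"

assert degrade("text/x-fancy-markup") == "text/plain"
assert degrade("application/x-mystery") == "application/octet-stream"
```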
Is there a better way to label content types than MIME types? Probably not. X.400 (remember that from my first post?) uses OIDs, in line with the rest of the OSI model, and my limited experience with other systems that use these OIDs is that they are obtuse, effectively opaque identifiers with no inherent semantic meaning. People use file extensions in practice to distinguish between different file types, but not all content types are stored in files (such as multipart/mixed), and MIME types offer a finer granularity of distinction when the type needs to be guessed from the start of a file. My only complaints about MIME types are petty and marginal, not about the idea itself.
No, the part of MIME that I have serious complaints with is the MIME tree structure. This allows you to represent emails in arbitrarily complex structures… onto which the standard view of email as a body with associated attachments maps poorly. The heart of this structure is the multipart media type, for which the most important subtypes are mixed, alternative, related, signed, and encrypted. The last two types are meant for cryptographic security definitions [1], and I won't cover them further here. All multipart types have a format where the body consists of parts (each with their own headers) separated by a boundary string. There is space before the first part and after the last part, consisting of semantically-meaningless text sometimes containing a message like "This is a MIME message." meant to be displayed to the now practically-non-existent crowd of people who use clients that don't support MIME.
The simplest type is multipart/mixed, which means that there is no inherent structure to the parts. Attachments to a message use this type: the type of the message is set to multipart/mixed, a body is added as (typically) the first part, and attachments are added as parts with types like image/png (for PNG images). It is also not uncommon to see multipart/mixed types that have a multipart/mixed part within them: some mailing list software attaches footers to messages by wrapping the original message inside a single part of a multipart/mixed message and then appending a text/plain footer.
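The body-plus-attachment structure described above is easy to see by building such a message; a minimal sketch using Python's standard email package (the attachment bytes and filename are made up for illustration):

```python
from email.message import EmailMessage

# Build a message with a text body plus a PNG attachment; the library
# promotes the top-level type to multipart/mixed automatically.
msg = EmailMessage()
msg["Subject"] = "Example"
msg.set_content("Hello, this is the body.")
msg.add_attachment(b"\x89PNG\r\n...", maintype="image", subtype="png",
                   filename="picture.png")

assert msg.get_content_type() == "multipart/mixed"
# Each part carries its own headers and type.
assert [p.get_content_type() for p in msg.iter_parts()] == \
    ["text/plain", "image/png"]
```

Serializing the message with str(msg) shows the boundary string separating the two parts, each preceded by its own Content-Type header.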
multipart/related is intended to refer to an HTML page [2] where all of its external resources are included as additional parts. Linking all of these parts together is done by use of a cid: URL scheme. Generating and displaying these messages requires tracking down all URL references in an HTML page, which of course means that email clients that want full support for this feature also need robust HTML (and CSS!) knowledge, and future-proofing is hard. Since the primary body of this type appears first in the tree, handling this datatype in a streaming manner is also difficult, as the values to which URLs will be rewritten are not known until after the entire body is parsed.
In contrast, multipart/alternative is used to satisfy the plain-text-or-HTML debate by allowing one to provide both a plain text and an HTML version of a message [3]. It is also the third-biggest failure of the entire email infrastructure, in my opinion. The natural expectation would be that the parts should be listed in decreasing order of preference, so that streaming clients can reject all the data after finding the part they will display. Instead, the parts are listed in increasing order of preference, which was done in order to make the plain text part come first in the list, which helps increase readability of MIME messages for those reading email without MIME-aware clients. As a result, streaming clients are unable to progressively display the contents of multipart/alternative until the entire message has been read.
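The increasing-order-of-preference rule is visible in what standard libraries emit; a sketch, again with Python's email package:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg.set_content("plain text version")                       # least preferred, listed first
msg.add_alternative("<p>HTML version</p>", subtype="html")  # most preferred, listed last

assert msg.get_content_type() == "multipart/alternative"
assert [p.get_content_type() for p in msg.iter_parts()] == \
    ["text/plain", "text/html"]
# get_body() honors the ordering and picks the HTML part by default:
assert msg.get_body().get_content_type() == "text/html"
```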
Although multipart/alternative states that all parts must contain the same contents (to varying degrees of degradation), you shouldn't be surprised to learn that this is not exactly the case. There was a period in time when spam filterers looked at only the text/plain side of things, so spammers took to putting "innocuous" messages in the text/plain half and displaying the real spam in the text/html half [4] (this technique appears to have died off a long time ago, though). In another interesting case, I received a bug report with a message containing an image/jpeg and a text/html part within a multipart/alternative [5].
To be fair, the current concept of emails as a body with a set of attachments did not exist when MIME was originally specified. The definition of multipart/parallel plays into this a lot (it means what you think it does: show all of the parts in parallel… somehow). Reading between the lines of the specification also indicates a desire to create interactive emails (via application/postscript, of course). Given that email clients have trouble even displaying HTML properly [6], and the fact that interactivity has the potential to be a walking security hole, it is not hard to see why this functionality fell by the wayside.
The final major challenge that MIME solved was how to fit arbitrary data into a 7-bit format safe for transit. The two encoding schemes they came up with were quoted-printable (which retains most printable characters, but emits non-printable characters in a =XX format, where the Xs are hex characters), and base64 which reencodes every 3 bytes into 4 ASCII characters. Non-encoded data is separated into three categories: 7-bit (which uses only ASCII characters except NUL and bare CR or LF characters), 8-bit (which uses any character but NUL, bare CR, and bare LF), and binary (where everything is possible). A further limitation is placed on all encodings but binary: every line is at most 998 bytes long, not including the terminating CRLF.
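Both encodings are available in Python's standard library, which makes the trade-off easy to see; a quick sketch:

```python
import base64
import quopri

octets = "café".encode("utf-8")        # b'caf\xc3\xa9'

# quoted-printable: printable ASCII passes through, the rest becomes =XX.
assert quopri.encodestring(octets) == b"caf=C3=A9"

# base64: every 3 input bytes become 4 ASCII characters (~33% overhead).
encoded = base64.b64encode(octets)
assert encoded == b"Y2Fmw6k="
assert base64.b64decode(encoded) == octets
```

Quoted-printable is the right choice when the data is mostly ASCII; base64 wins once non-ASCII octets dominate, since each of them costs three characters in quoted-printable.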
A side-effect of these requirements is that all attachments must be considered binary data, even if they are textual formats (like source code), as end-of-line autoconversion is now considered a major misfeature. To make matters even worse, body text written in scripts that don't use spaces (such as Japanese or Chinese) can sometimes be prohibited from using the 8-bit transfer format due to overly long lines: you can reach the end of a line in as few as 249 characters (with 4-byte UTF-8 sequences, i.e., non-BMP characters; Chinese and Japanese characters typically take three bytes each). So a single long paragraph can force a message to be entirely encoded in a format with 33% overhead. There have been suggestions for a binary-to-8-bit encoding in the past, but no standardization effort has been made for one [7].
The binary encoding has none of these problems, but no one claims to support it. However, I suspect that violating maximum line length, or adding 8-bit characters to a quoted-printable part, are likely to make it through the mail system, in part because not doing so either increases your security vulnerabilities or requires more implementation effort. Sending lone CR or LF characters is probably fine so long as one is careful to assume that they may be treated as line breaks. Sending a NUL character I suspect could cause some issues due to lack of testing (but it also leaves room for security vulnerabilities to ignore it). In other words, binary-encoded messages probably already work to a large degree in the mail system. Which makes it extremely tempting (even for me) to ignore the specification requirements when composing messages; small wonder then that blatant violations of specifications are common.
This concludes my discussion of MIME. There are certainly many more complaints I have, but this should be sufficient to lay out why building a generic MIME-aware library by itself is hard, and why you do not want to write such a parser yourself. Too bad Thunderbird has at least two different ad-hoc parsers (not libmime or JSMime) that I can think of off the top of my head, both of which are wrong.
[1] I will be covering this in a later post, but the way that signed and encrypted data is represented in MIME actually makes it really easy to introduce flaws in cryptographic code (and the last time I surveyed major email clients with support for cryptography, all of them had such flaws).
[2] Other types are of course possible in theory, but HTML is all anyone cares about in practice.
[3] There is also text/enriched, which was developed as a stopgap while HTML 3.2 was being developed. Its use in practice is exceedingly slim.
[4] This is one of the reasons I'm minded to make "prefer plain text" do degradation of natural HTML display instead of showing the plain text parts. Not that cleanly degrading HTML is easy.
[5] In the interests of full disclosure, the image/jpeg was actually a PNG image and the HTML claimed to be 7-bit UTF-8 but was actually 8-bit, and it contained a Unicode homograph attack.
[6] Of the major clients, Outlook uses Word's HTML rendering engine, which I recall once reading as being roughly equivalent to IE 5.5 in capability. Webmail is forced to do its own sanitization and sandboxing, and the output leaves something to be desired; Gmail is the worst offender here, stripping out all but inline style. Thunderbird and SeaMonkey are nearly alone in using a high-quality layout engine: you can even send a <video> in an email to Thunderbird and have it work properly. :-)
[7] There is yEnc. Its mere existence does contradict several claims (for example, that adding new transfer encodings is infeasible due to install base of software), but it was developed for a slightly different purpose. Some implementation details are hostile to MIME, and although it has been discussed to death on the relevant mailing list several times, no draft was ever made that would integrate it into MIME properly.
This post is part 2 of an intermittent series exploring the difficulties of writing an email client. Part 1 describes a brief history of the infrastructure, as well as the issues I have with it. This post discusses internationalization, specifically supporting non-ASCII characters in email.
Internationalization is not a simple task, even if the consideration is limited to "merely" the textual aspect [1]. Languages turn out to be incredibly diverse in their writing systems, so software that tries to support all writing systems equally well ends up running into several problems that admit no general solution. Unfortunately, I am ill-placed to offer personal experience with internationalization concerns [2], so some of the information I give may well be wrong.
A word of caution: this post is rather long, even by my standards, since the problems of internationalization are legion. To help keep this post from being even longer, I'm going to assume passing familiarity with terms like ASCII, Unicode, and UTF-8.
The first issue I'll talk about is Unicode normalization, and it's an issue caused largely by Unicode itself. Unicode has two ways of making accented characters: precomposed characters (such as U+00F1, ñ) or a character followed by a combining character (U+006E, n, followed by U+0303, ◌̃). The display of both is the same: ñ versus ñ (read the HTML), and no one would disagree that they share the same meaning. To let software detect that they are the same, Unicode prescribes four algorithms to normalize them. These four algorithms are defined on two axes: whether to prefer composed characters (like U+00F1) or decomposed characters (U+006E U+0303), and whether to normalize by canonical equivalence (noting that, for example, the U+212A Kelvin sign is equivalent to the Latin majuscule K) or by compatibility (e.g., superscript 2 to a regular 2).
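All four forms are exposed by Python's unicodedata module; a quick sketch of the two axes:

```python
import unicodedata

precomposed = "\u00f1"       # ñ as one code point
decomposed = "n\u0303"       # n followed by a combining tilde

assert precomposed != decomposed                                  # different code points
assert unicodedata.normalize("NFC", decomposed) == precomposed    # compose
assert unicodedata.normalize("NFD", precomposed) == decomposed    # decompose

# The K forms additionally apply compatibility mappings:
assert unicodedata.normalize("NFKC", "\u00b2") == "2"     # superscript two -> 2
assert unicodedata.normalize("NFC", "\u212a") == "K"      # Kelvin sign -> K (canonical)
```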
Another issue is one that mostly affects display. Western European languages all use a left-to-right, top-to-bottom writing order. This isn't universal: Semitic languages like Hebrew or Arabic use right-to-left, top-to-bottom; Japanese and Chinese prefer a top-to-bottom, right-to-left order (although they are sometimes written left-to-right, top-to-bottom). It thus becomes a question how to order the stored text for languages with different writing orders, although I believe the practice of always storing text in "start-to-finish" order, and reversing it for display, is nearly universal.
Now, both of those issues mentioned so far are minor in the grand scheme of things, in that you can ignore them and they will still probably work properly almost all of the time. Most text that is exposed to the web is already normalized to the same format, and web browsers have gotten away with not normalizing CSS or HTML identifiers with only theoretical objections raised. All of the other issues I'm going to discuss are things that cause problems and illustrate why properly internationalizing email is hard.
Another historical mistake of Unicode is one that we will likely be stuck with for decades, and I need to go into some history first. The first Unicode standard dates from 1991, and its original goal then was to collect all of the characters needed for modern transmission, which was judged to need only a 16-bit set of characters. Unfortunately, the needs of ideographic-centric Chinese, Japanese, and Korean writing systems, particularly rare family names, turn out to rather fill up that space. Thus, in 1996, Unicode was changed to permit more characters: 17 planes of 65,536 characters each, of which the original set was termed the "Basic Multilingual Plane", or BMP for short. Systems that chose to adopt Unicode in those intervening 5 years often adopted a 16-bit character model as their standard internal format, so as to keep the benefits of fixed-width character encodings. However, with the change to a larger format, their fixed-width character encoding is no longer fixed-width.
This issue plagues anybody who works with systems that considered internationalization in that unfortunate window, which notably includes prominent programming languages like C#, Java, and JavaScript. Many cross-platform C and C++ programs implicitly require UTF-16 due to its pervasive inclusion in the Windows operating system and common internationalization libraries [3]. Unsurprisingly, non-BMP characters tend to quickly run into all sorts of hangups in unaware code. For example, right now, it is possible to coax Thunderbird into mangling these characters in, say, your subject string if the subject is just right, and I suspect similar bugs exist in a majority of email applications [4].
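The hazard is easy to reproduce. Python strings index by code point, but encoding to UTF-16 exposes the surrogate pair that a 16-bit-character system actually sees; a sketch:

```python
ch = "\U0001F4A9"                  # a non-BMP character
assert len(ch) == 1                # one code point...

units = ch.encode("utf-16-be")
assert len(units) == 4             # ...but two 16-bit code units in UTF-16

# Splitting between the two code units leaves an unpaired surrogate,
# which is not valid UTF-16 text on its own:
try:
    units[:2].decode("utf-16-be")
    unpaired_ok = True
except UnicodeDecodeError:
    unpaired_ok = False
assert not unpaired_ok
```

In a language like JavaScript, where string indices are UTF-16 code units, a naive `s.substring(0, n)` performs exactly this kind of split.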
For all of the flaws of Unicode [5], there is a tacit agreement that UTF-8 should be the character set to use for anyone not burdened by legacy concerns. Unfortunately, email is burdened by legacy concerns, and the use of 8-bit characters in headers that are not UTF-8 is more prevalent than it ought to be, RFC 6532 notwithstanding. In any case, email explicitly provides for handling a wide variety of alternative character sets without saying which ones should be supported. The official list [6] contains about 200 of them (including the UNKNOWN-8BIT character set), but not all of them see widespread use. In practice, the ones that definitely need to be supported are the ISO 8859-* and ISO 2022-* charsets, the EUC-* charsets, Windows-* charsets, GB18030, GBK, Shift-JIS, KOI8-{R,U}, Big5, and of course UTF-8. There are two other major charsets that don't come up directly in email but are important for implementing the entire suite of protocols: UTF-7, used in IMAP (more on that later), and Punycode (more on that later, too).
The suite of character sets falls into three main categories. First is the set of fixed-width character sets, most notably ASCII and the ISO 8859 suite of charsets, as well as UCS-2 (2 bytes per character) and UTF-32 (4 bytes per character). Since the major East Asian languages are all ideographic, requiring a rather large number of characters to be encoded, fixed-width character sets are infeasible for them. Instead, many choose to use a variable-width encoding: Shift-JIS lets some characters (notably ASCII characters and half-width katakana) remain a single byte and uses two bytes to encode all of its other characters. UTF-8 can use between 1 byte (for ASCII characters) and 4 bytes (for non-BMP characters) for a single character. The final set of character sets, such as the ISO 2022 ones, use escape sequences to change the interpretation of subsequent characters. As a result, taking a substring of an encoded string can change its interpretation while remaining valid. This will be important later.
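The escape-sequence hazard can be demonstrated with Python's iso2022_jp codec: dropping the leading escape sequence leaves bytes that still decode, but as entirely different (ASCII) text:

```python
encoded = "日本".encode("iso2022_jp")
# The stream begins with an escape sequence switching to JIS X 0208:
assert encoded.startswith(b"\x1b$B")

# Remove the escape: the same payload bytes now decode in the initial
# (ASCII) state, yielding unrelated ASCII characters.
reinterpreted = encoded[3:].decode("iso2022_jp")
assert "日" not in reinterpreted
assert all(ord(c) < 128 for c in reinterpreted)
```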
Two more problems related to character sets are worth mentioning. The first is the byte-order mark, or BOM, which is used to distinguish whether UTF-16 is written on a little-endian or big-endian machine. It is also sometimes used in UTF-8 to indicate that the text is UTF-8 versus some unknown legacy encoding. It is also not supposed to appear in email, but I have done some experiments which suggest that people use software that adds it without realizing that this is happening. The second issue, unsurprisingly [7], is that for some character sets (Big5 in particular, I believe), not everyone agrees on how to interpret some of the characters.
The largest problem of internationalization that applies in a general sense is the problem of case insensitivity. The 26 basic Latin letters all map nicely to case, having a single uppercase and a single lowercase variant for each letter. This practice doesn't hold in general—languages like Japanese lack even the notion of case, although it does have two kana variants that hold semantic differences. Rather, there are three basic issues with case insensitivity which showcase enough of its problems to make you want to run away from it altogether [8].
The simplest issue is the Greek sigma. Greek has two lowercase variants of the sigma character: σ and ς (the "final sigma"), but a single uppercase variant, Σ. Thus mapping a string s to uppercase and back to lowercase is not, in some cases, equivalent to mapping s directly to lowercase. Related to this issue is the story of the German ß character. This character evolved as a ligature of a long and short 's', and its uppercase form is generally held to be SS. The existence of a capital form is in some dispute, and Unicode only recently added it (ẞ, if your software supports it). As a result, merely interconverting between uppercase and lowercase versions of a string does not necessarily lead to a simple fixed point. The third issue is the Turkish dotless i (ı), which is the lowercase variant of the ASCII uppercase I character to those who speak Turkish. So it turns out that case insensitivity isn't quite the same across all locales.
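All three quirks are observable with Python's built-in case mappings; a quick sketch:

```python
# Greek sigma: two lowercase forms, one uppercase form, and the
# lowercasing of Σ depends on its position in the word.
assert "ς".upper() == "Σ" and "σ".upper() == "Σ"
assert "Σ".lower() == "σ"            # isolated sigma
assert "ΟΣ".lower() == "ος"          # word-final sigma becomes ς
assert "ς".upper().lower() != "ς"    # round-trip is not the identity

# German ß: uppercasing grows the string, and it doesn't come back.
assert "ß".upper() == "SS"
assert "SS".lower() == "ss"

# Turkish dotless i: Python's locale-unaware lower() gives the English
# answer; a Turkish locale would map I to ı instead.
assert "I".lower() == "i"
```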
Again unsurprisingly in light of the issues, the general tendency of internationalization-aware specifications toward case-folding or case-insensitive matching is to ignore the issues entirely. For example, when I asked for clarity on the process of case-insensitive matching for IMAP folder names, the response I got was "don't do it." HTML and CSS moved to the cumbersomely-named variant known as "ASCII-subset case-insensitivity", where only the 26 basic Latin letters are mapped to their (English) case variants. The solution for email is also a verbose variant of "unspecified," but then, that is only tradition for email (more on this later).
Now that you have a good idea of the general issues, it is time to delve into how the developers of email rose to the challenge of handling internationalization. It turns out that the developers of email have managed to craft one of the most perfect and exquisite examples I have seen of how to completely and utterly fail. The challenges of internationalized emails are so difficult that buggier implementations are probably more common than fully correct implementations, and any attempt to ignore the issue is completely and totally impossible. In fact, the faults of RFC 2047 are my personal least favorite part of email, and implementing it made me change the design of JSMime more than any other feature. It is probably the single hardest thing to implement correctly in an email client, and it is so broken that another specification was needed to be able to apply internationalization more widely (RFC 2231).
The basic problem RFC 2047 sets out to solve is how to reliably send non-ASCII characters across a medium where only 7-bit characters can be reliably sent. The solution that was set out in the original version, RFC 1342, is to encode specific strings in an "encoded-word" format: =?charset?encoding?encoded text?=. The encoding can either be a 'B' (for Base64) or a 'Q' (for quoted-printable). Except the quoted-printable encoding in this format isn't quite the same quoted-printable encoding used in bodies: the space character is encoded via a '_' character instead, as spaces aren't allowed in encoded-words. Naturally, the use of spaces in encoded-words is common enough to get at least one or two bugs filed a year about Thunderbird not supporting it, and I wonder if this subtle difference between two quoted-printable variants is what causes the prevalence of such emails.
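Python's email.header module decodes both variants, and shows the '_'-for-space convention; a sketch:

```python
from email.header import decode_header

# 'Q' encoding: quoted-printable, except '_' stands for a space.
[(octets, charset)] = decode_header("=?utf-8?Q?caf=C3=A9_menu?=")
assert charset == "utf-8"
assert octets.decode(charset) == "café menu"

# 'B' encoding: base64 over the same octets.
[(octets, charset)] = decode_header("=?utf-8?B?Y2Fmw6k=?=")
assert octets.decode(charset) == "café"
```

Note that decode_header returns raw octets plus a charset label; the charset conversion is left to the caller, which is exactly where broken encoders cause grief.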
One of my great hates with regard to email is the strict header line length limit. Since the encoded-word form can get naturally verbose, particularly when you consider languages like Chinese that are going to have little whitespace amenable for breaking lines, the ingenious solution is to have adjacent encoded-word tokens separated only by whitespace be treated as the same word. As RFC 6857 kindly summarizes, "whitespace behavior is somewhat unpredictable, in practice, when multiple encoded words are used." RFC 6857 also suggests that the requirement to limit encoded words to only 74 characters in length is also rather meaningless in practice.
A more serious problem arises when you consider the necessity of treating adjacent encoded-word tokens as a single unit. This one is so serious that it reaches the point where all of your options would break somebody. When implementing an RFC 2047 encoding algorithm, how do you write the code to break up a long span of text into multiple encoded words without ever violating the specification? The naive way of doing so is to encode the text once in one long string, and then break it into chunks which are then converted into the encoded-word form as necessary. This is, of course, wrong, as it breaks two strictures of RFC 2047. The first is that you cannot split the middle of multibyte characters. The second is that mode-switching character sets must return to ASCII by the end of a single encoded-word [9]. The smarter way of building encoded-words is to encode words by trying to figure out how much text can be encoded before needing to switch, and breaking the encoded-words when length quotas are exceeded. This is also wrong, since you could end up violating the return-to-ASCII rule if you don't double-check your converters. Also, if UTF-16 is used as the basis for the string before charset conversion, the encoder stands a good chance of splitting surrogate pairs, creating unpaired surrogates and a giant mess as a result.
For JSMime, the algorithm I chose to implement is specific to UTF-8, because I can use a property of the UTF-8 implementation to make encoding fast (every octet is looked at exactly three times: once to convert to UTF-8, once to count to know when to break, and once to encode into base64 or quoted-printable). The property of UTF-8 is that the second, third, and fourth octets of a multibyte character all start with the same two bits, and those bits never start the first octet of a character. Essentially, I convert the entire string to a binary buffer using UTF-8. I then pass through the buffer, keeping counters of the length that the buffer would be in base64 form and in quoted-printable form. When both counters are exceeded, I back up to the beginning of the character, and encode that entire buffer in a word and then move on. I made sure to test that I don't break surrogate characters by making liberal use of the non-BMP character U+1F4A9 [10] in my encoding tests.
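The boundary property amounts to a one-line bitmask test; a sketch of the back-up step (the function name is mine, not JSMime's):

```python
def backup_to_char_start(buf: bytes, i: int) -> int:
    """Move i backward to the first octet of the UTF-8 character it
    points into: continuation octets all match the pattern 10xxxxxx."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = "aé💩".encode("utf-8")   # 1-, 2-, and 4-octet characters
assert (buf[4] & 0xC0) == 0x80            # index 4 is mid-character...
assert backup_to_char_start(buf, 4) == 3  # ...whose first octet is at 3
assert buf[3:].decode("utf-8") == "💩"    # a clean break point
```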
The sheer ease of writing a broken encoder for RFC 2047 means that broken encodings exist in the wild, so an RFC 2047 decoder needs to support some level of broken RFC 2047 encoding. Unfortunately, "fixing" different kinds of broken encodings requires different behavior from decoders. Treating adjacent encoded-words as part of the same buffer when decoding makes split multibyte characters work properly but breaks non-return-to-ASCII cases; if they are decoded separately, the reverse is true. Recovering from isolated surrogates is at best time-consuming and difficult and at worst impossible.
Yet another problem with the way encoded-words are defined is that they are defined as specific tokens in the grammar of structured address fields. This means that you can't hide RFC 2047 encoding or decoding as a final processing step when reading or writing messages. Instead you have to do it during or after parsing (or during or before emission), so the parser becomes fully intertwined with support for encoded-words. Converting a fully UTF-8 message into a 7-bit form is thus a non-trivial operation: there is a specification solely designed to discuss how to do such downgrading, RFC 6857. It requires deducing what structure a header has, parsing that header, and then reencoding the parsed header. This sort of complicated structure makes it much harder to write general-purpose email libraries: the process of emitting a message basically requires doing a generic UTF-8-to-7-bit conversion. Thus, what is supposed to be a mere implementation detail of how to send out a message ends up permeating the entire stack.
Unfortunately, the developers of RFC 2047 were a bit too clever for their own good. The specification limits the encoded-words to occurring only inside of phrases (basically, display names for addresses), unstructured text (like the subject), or comments (…). I presume this was done to avoid requiring parsers to handle internationalization in email addresses themselves or possibly even things like MIME boundary delimiters. However, this list leaves out one common source of internationalized text: filenames of attachments. This was ultimately patched by RFC 2231.
RFC 2231 is by no means a simple specification, since it attempts to solve three problems simultaneously. The first is the use of non-ASCII characters in parameter values. Like RFC 2047, the excessively low header line length limit causes the second problem: the need to wrap parameter values across multiple lines. As a result, the encoding is complicated (it takes more lines of code to parse RFC 2231's new features alone than it does to parse the basic format [11]), but it's not particularly difficult.
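A sketch of both features (non-ASCII parameter values and continuation across lines) as handled by Python's email library; the filename is invented for illustration:

```python
import email

# filename is split across two segments; the trailing '*' on each marks
# it as percent-encoded, and the first segment names the charset.
raw = (
    "Content-Type: text/plain\n"
    "Content-Disposition: attachment;\n"
    " filename*0*=utf-8''caf%C3%A9%20long;\n"
    " filename*1*=%20name.txt\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)
# The library rejoins the segments and decodes the declared charset:
assert msg.get_filename() == "café long name.txt"
```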
The third problem RFC 2231 attempts to solve is a rather different issue altogether: it tries to conclusively assign a language tag to the encoded text, and it retrofits this "fix" onto RFC 2047's encoded-words as well. The stated rationale is to be able to have screen readers read the text aloud properly, but the other (much more tangible) benefit is to ameliorate the issues of Unicode's Han unification by clearly identifying whether the text is Chinese, Japanese, or Korean. While it sounds like a nice idea, it suffers from a major flaw: there is no way to use this data without converting internal data structures from flat strings to richer representations. Another issue is that actually setting this value correctly (especially if your goal is supporting screen readers' pronunciations) is difficult if not impossible. Fortunately, this is an entirely optional feature; though I do see very little email that needs to be concerned about internationalization, I have yet to find an example of someone using this in the wild.
If you're the sort of person who finds properly writing internationalized text via RFC 2231 or RFC 2047 too hard (or you don't realize that you need to actually worry about this sort of stuff), and you don't want to use any of the several dozen MIME libraries to do the hard stuff for you, then you will become the bane of everyone who writes email clients, because you've just handed us email messages that have 8-bit text in the headers. At which point everything goes mad, because we have no clue what charset you just used. Well, RFC 6532 says that headers are supposed to be UTF-8, but with the specification being only 19 months old and part of a system which is still (to my knowledge) not supported by any major clients, this should be taken with a grain of salt. UTF-8 has the very nice property that text that is valid UTF-8 is highly unlikely to be any other charset, even if you start considering the various East Asian multibyte charsets. Thus you can try decoding under the assumption that it is UTF-8 and switch to a designated fallback charset if decoding fails. Of course, knowing which designated fallback to use is a different matter entirely.
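The decoding strategy described here is a few lines in any language; a sketch (the choice of fallback charset is an assumption for illustration, not part of any RFC):

```python
def decode_8bit_header(raw: bytes, fallback: str = "windows-1252") -> str:
    """Trust UTF-8 if the bytes validate as UTF-8; otherwise decode
    with a designated legacy fallback charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")

assert decode_8bit_header("café".encode("utf-8")) == "café"
# A lone 0xE9 octet is invalid UTF-8, so the fallback wins:
assert decode_8bit_header(b"caf\xe9") == "café"
```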
Stepping outside email messages themselves, internationalization is still a concern. IMAP folder names are another well-known example. RFC 3501 specified that mailbox names should be in a modified version of UTF-7 in an awkward compromise. To my knowledge, this is the only remaining significant use of UTF-7, as many web browsers disabled support due to its use in security attacks. RFC 6855, another recent specification (6 months old as of this writing), finally allows UTF-8 mailbox names here, although it too is not yet in widespread usage.
You will note that missing from the list so far are email addresses. The topic of email addresses is itself worthy of lengthy discussion, but for the purposes of a discussion on internationalization, all you need to know is that, according to RFCs 821 and 822 and their cleaned-up successors, everything to the right of the '@' is a domain name and everything to the left is basically an opaque ASCII string [12]. It is here that internationalization really runs headlong into an immovable obstacle, for the email address has become the de facto unique identifier of the web, and everyone has their own funky ideas of what an email address looks like. As a result, the motto of "be liberal in what you accept" really breaks down with email addresses, and the amount of software that needs to change to accept internationalization extends far beyond the small segment interested only in the handling of email itself. Unfortunately, the relative newness of the latest specifications and the corresponding lack of implementations mean that I am less intimately familiar with this aspect of internationalization. Indeed, the impetus for this entire blogpost was a day-long struggle with trying to ascertain when two email addresses are the same if internationalized email addresses are involved.
The email address is split nicely by the '@' symbol, and internationalization of the two sides happens at two different times. Domains were internationalized first, by RFC 3490, a specification with the mouthful of a name "Internationalizing Domain Names in Applications" [13], or IDNA2003 for short. I mention the proper name of the specification here to make a point: the underlying protocol is completely unchanged, and all the work is intended to happen at roughly the level of getaddrinfo—the internal DNS resolver is supposed to be involved, but the underlying DNS protocol and tools are expected to remain blissfully unaware of the issues involved. That I mention the year of the specification should tell you that this is going to be a bumpy ride.
An internationalized domain name (IDN for short) is a domain name that has some non-ASCII characters in it. Domain names, according to DNS, are labels terminated by '.' characters, where each label may consist of up to 63 characters. The repertoire of characters is the ASCII alphanumerics and the '-' character, and labels are of course case-insensitive like almost everything else on the Internet. Encoding non-ASCII characters into this small subset while meeting these requirements is difficult for other contemporary schemes: UTF-7 uses Base64, which means 'A' and 'a' are not equivalent; percent-encoding eats up characters extremely quickly. So IDNs use a different specification for this purpose, called Punycode, which allows for a dense but utterly unreadable encoding. The basic algorithm for encoding an IDN is to take the input string, apply case-folding, normalize using NFKC, and then encode with Punycode.
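Python's built-in idna codec implements exactly this IDNA2003 pipeline (nameprep case-folding and normalization, then Punycode); a sketch:

```python
# Non-ASCII labels get case-folded, normalized, Punycode-encoded,
# and prefixed with "xn--"; ASCII labels pass through.
assert "Bücher.example".encode("idna") == b"xn--bcher-kva.example"

# ß case-folds to "ss" under nameprep, so no Punycode is needed and
# the name collapses to its ASCII look-alike:
assert "weiß.de".encode("idna") == b"weiss.de"

# Decoding reverses the transformation:
assert b"xn--bcher-kva".decode("idna") == "bücher"
```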
Case folding, as I mentioned several paragraphs ago, turns out to have some issues. The ß and ς characters were the ones that caused the most complaints. You see, if you were to register, say, www.weiß.de, you would actually be registering www.weiss.de. As there is no indication of Punycode involved in the name, browsers would show the domain in the ASCII variant. One way of fixing this problem would be to work with browser vendors to institute a "preferred name" specification for websites (much like there exists one for the little icons next to page titles), so that the world could know that the proper capitalization is of course www.GoOgle.com instead of www.google.com. Instead, the German and Greek registrars pushed for a change to IDNA, which they achieved in 2010 with IDNA2008.
IDNA2008 is defined principally in RFCs 5890-5895 and UTS #46. The principal change is that the normalization step no longer exists in the protocol and is instead supposed to be done by applications, in a possibly locale-specific manner, before looking up the domain name. One reason for doing this was to eliminate the hard dependency on a specific, outdated version of Unicode [14]. It also helps fix things like the Turkish dotless I issue, in theory at least. However, this different algorithm causes some domains to be processed differently from IDNA2003. UTS #46 specifies a "compatibility mode" which changes the algorithm to match IDNA2003 better in the important cases (specifically, ß, ς, and ZWJ/ZWNJ), with a note expressing the hope that this will eventually become unnecessary. To handle the lack of normalization in the protocol, registrars are asked to automatically register all classes of equivalent domain names at the same time. I should note that most major browsers (and email clients, if they implement IDN at all) are still using IDNA2003: an easy test of this fact is to attempt to go to ☃.net, which is valid under IDNA2003 but not IDNA2008.
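Both of the quirks above are visible from Python: its standard-library idna codec implements IDNA2003, so this sketch demonstrates the old behavior that IDNA2008 changed (the names are illustrative).

```python
# Under IDNA2003, nameprep case-folds ß to "ss" before Punycode, so
# the name that actually gets registered is the all-ASCII one -- the
# complaint that led the German registrar to push for IDNA2008.
weiss = "weiß".encode("idna")
# weiss == b"weiss" -- no Punycode marker at all

# The snowman is a symbol, which IDNA2003 permits but IDNA2008 bans;
# an IDNA2008 implementation (e.g. the third-party "idna" package)
# rejects this same string with an error.
snowman = "☃".encode("idna")
# snowman == b"xn--n3h"
```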
Unicode text processing is often vulnerable to an attack known as the "homograph attack." In most fonts, the Greek omicron and the Latin minuscule o are displayed in exactly the same way, so an attacker could pretend to be from, say, Google while instead sending you to Gοogle—I used Latin in the first word and Greek in the second. The standard solution is to display the Unicode form (rather than the Punycode form) only where this is not an issue; Firefox and Opera display Unicode only for a whitelist of registrars with acceptable policies, Chrome and Internet Explorer permit only scripts that the user claims to read, and Safari permits only scripts that don't enable the homograph attack (i.e., not Cyrillic or Greek). (Note: I've summarized this information from Chromium's documentation; forward any complaints of out-of-date information to them.)
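The attack is easy to demonstrate (a sketch; the spoofed name is made up for illustration):

```python
import unicodedata

real = "google"
spoof = "g\u03bfogle"  # second letter is GREEK SMALL LETTER OMICRON

# The two names render identically in most fonts...
assert unicodedata.name(spoof[1]) == "GREEK SMALL LETTER OMICRON"

# ...but they are different domains: the spoof Punycode-encodes to an
# xn-- label (IDNA2003 imposes no script-mixing restrictions), while
# the real name passes through as plain ASCII.
real_enc = real.encode("idna")    # b"google"
spoof_enc = spoof.encode("idna")  # some b"xn--..." label
```

Displaying the raw xn-- form to the user, as browsers do for untrusted registrars, makes the difference visible again.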
IDN satisfies the needs of internationalizing the second half of an email address, so a working group was commissioned to internationalize the first one. The result is EAI, which was first experimentally specified in RFCs 5335-5337, and the standards themselves are found in RFCs 6530-6533 and 6855-6858. The primary difference between the first, experimental version and the second, to-be-implemented version is the removal of attempts to downgrade emails in the middle of transit. In the experimental version, provisions were made to specify with every internationalized address an alternate, fully ASCII address to which a downgraded message could be sent if SMTP servers couldn't support the new specifications. These were removed after the experiment found that such automatic downgrading didn't work as well as hoped.
With automatic downgrading removed from the underlying protocol, the onus is on people who generate the emails—mailing lists and email clients—to figure out who can and who can't receive messages and then downgrade messages as appropriate for the recipients of the message. However, the design of SMTP is such that it is impossible to automatically determine if the client can receive these new kinds of messages. Thus, the options are to send them and hope that it works or to rely on the (usually clueless) user to inform you if it works. Clearly an unpalatable set of options, but it is one that can't be avoided due to protocol design.
The largest change of EAI is that the local parts of addresses are specified as a sequence of UTF-8 characters, omitting only the control characters [15]. The working group responsible for the specification adamantly refused to define a Unicode-to-ASCII conversion process, and thus a mechanism to make downgrading work smoothly, for several reasons. First, they didn't want to specify a prefix which could change the meaning of existing local-parts (the structure of local-parts is much less discoverable than the structure of all domain names). Second, they felt that the lack of support for displaying the Unicode variants of Punycode meant that users would have a much worse experience. Finally, the transition period would be hopefully short (although messy), so designing a protocol that supports that short period would worsen it in the long term. Considering that, at the moment of writing, only one of the major SMTP implementations has even a bug filed to support it, I think the working group underestimates just how long transition periods can take.
As far as changes to the message format go, that change is the only real change, considering how much effort is needed to opt-in. Yes, headers are now supposed to be UTF-8, but, in practice, every production MIME parser needs to handle 8-bit characters in headers anyways. Yes, message/global can have MIME encoding applied to it (unlike message/rfc822), but, in practice, you already need to assume that people are going to MIME-encode message/rfc822 in violation of the specification. So, in practice, the changes needed to a parser are to add message/global as an alias to message/rfc822 [16] and possibly tweaking some charset detection heuristics to prefer UTF-8. I would very much have liked the restriction on header line length removed, but, alas, the working group did not feel moved to make those changes. Still, I look forward to the day when I never have to worry about encoding text into RFC 2047 encoded-words.
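For the curious, an RFC 2047 encoded-word and its decoding look like this (a minimal sketch using Python's standard library; the header text is made up):

```python
from email.header import decode_header, make_header

# An RFC 2047 encoded-word: charset, encoding ("Q" = a header-flavored
# quoted-printable where "_" means space), and payload, all wrapped in
# =?...?= markers so it fits in an ASCII-only header line.
raw = "=?ISO-8859-1?Q?Caf=E9_corner?="

# decode_header splits the header into (bytes, charset) pairs;
# make_header converts and joins them into a displayable string.
decoded = str(make_header(decode_header(raw)))
```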
IMAP, POP, and SMTP are also all slightly modified to take account of the new specifications. Specifically, internationalized headers are supposed to be opt-in only—SMTP servers are supposed to reject these messages if they don't support them in the first place, and IMAP and POP servers are supposed to downgrade messages when requested unless the client asks for them not to be. As there are no major server implementations yet, I don't know how well these requirements will be followed, especially given that most of the changes already need to be tolerated by clients in practice. The experimental version of internationalization specified a format which would have wreaked havoc on many current parsers, so I suspect some of the strict requirements may be a holdover from that version.
And thus ends my foray into email internationalization, a collection of bad solutions to hard problems. I have probably done a poor job of covering the complete set of inanities involved, but what I have covered are the ones that annoy me the most. This certainly isn't the last I'll talk about the impossibility of message parsing either, but it should be enough at least to convince you that you really don't want to write your own message parser.
[1] Date/time, numbers, and currency are the other major aspects of internationalization.
[2] I am a native English speaker who converses with other people almost completely in English. That said, I can comprehend French, although I am not familiar with the finer points that come with fluency, such as collation concerns.
[3] C and C++ have a built-in internationalization and localization API, derived from POSIX. However, this API is generally unsuited to the full needs of people who actually care about these topics, so it's not really worth mentioning.
[4] The basic algorithm to encode RFC 2047 strings for any charset is to shift characters into the output string until you hit the maximum word length. If the internal character set for Unicode conversion is UTF-16 instead of UTF-32 and the code is ignorant of surrogate concerns, then this algorithm can break surrogate pairs apart. This is exactly how the bug is triggered in Thunderbird.
[5] I'm not discussing Han unification, which is arguably the single most controversial aspect of Unicode.
[6] Official list here means the official set curated by IANA as valid for use in the charset="" parameter. The actual set of values likely to be acceptable to a majority of clients is rather different.
[7] If you've read this far and find internationalization interoperability problems surprising, you are either incredibly ignorant or incurably optimistic.
[8] I'm not discussing collation (sorting) or word-breaking issues as this post is long enough already. Nevertheless, these also help very much in making you want to run away from internationalization.
[9] I actually, when writing this post, went to double-check to see if Thunderbird correctly implements return-to-ASCII in its encoder, which I can only do by running tests, since I myself find its current encoder impenetrable. It turns out that it does, but it also looks like if we switched conversion to ICU (as many bugs suggest), we may break this part of the specification, since I don't see the ICU converters switching to ASCII at the end of conversion.
[10] Chosen as a very adequate description of what I think of RFC 2047. Look it up if you can't guess it from context.
[11] As measured by implementation in JSMime, comments and whitespace included. This is biased by the fact that I created a unified lexer for the header parser, which rather simplifies the implementation of the actual parsers themselves.
[12] This is, of course, a gross oversimplification, so don't complain that I'm ignoring domain literals or the like. Email addresses will be covered later.
[13] A point of trivia: the 'I' in IDNA2003 is expanded as "Internationalizing" while the 'I' in IDNA2008 is for "Internationalized."
[14] For the technically-minded: IDNA2003 relied on a hard-coded list of banned codepoints in processing, while IDNA2008 derives its lists directly from Unicode codepoint categories, with a small set of hard-coded exceptions.
[15] Certain ASCII characters may require the local-part to be quoted, of course.
[16] Strictly speaking, message/rfc822 remains all-ASCII, and non-ASCII headers need message/global. Given the track record of message/news, I suspect that this distinction will, in practice, not remain for long.
Update, September 2014: Since this post is linked prominently from the Rust subreddit and other places, perhaps some context is in order. Here's a timeline of events:
On August 28, 2013, I said "please remember that not everyone in the IRC channel is a guy" to someone in #rust, in response to them addressing the people in the channel as "Guys". The person responded with "boobs or gtfo". I responded by asking the moderators to kick them out of the channel. (Kicking is a mild penalty; a kicked person can rejoin immediately.) They left the channel a short while later of their own accord. Later, I emailed some people on the Rust team about what happened, and one of them banned the user. You can find the chat logs linked below.
Several weeks later, in October 2013, I wrote this post. I wrote it because I have a long-running yearly tradition of writing about what goes on in my life during the last week in August -- or "had", I suppose; I haven't much felt like continuing the tradition since last year. For the first seven years that I wrote them, these last-week-of-August posts were largely about mundane, day-to-day things. I had expected 2013 to be no different. But when I sat down to write the post in October of that year, the main recollection I had from the last week of August was how bad it had felt to have someone say "boobs or gtfo" to me in the IRC channel of a project to which I had been contributing for two years. So, I just wrote about that.
The post got a lot of attention. A lot of people responded supportively. On the other hand, the top comment on Hacker News characterizes my saying "please remember that not everyone in the IRC channel is a guy" as an "attack", while, in the same breath, describing "boobs or gtfo" as "humor". So, you know, the usual.
Some time later, in April 2014, someone came into the Rust IRC channel and began spamming the channel with "boobs or gtfo" and "fuck lindsey" repeatedly. I tweeted about how I was getting IRC abuse for having written about IRC abuse. Various people who had previously been unaware of the whole saga found out about it. Again, I got a lot of attention, and a lot of people responded supportively, including the Rust team. (Among other things, they asked me to be a mod on #rust, and I accepted.) On the other hand, at least one person wrote a blog post about how quote-unquote "fucktarded" I am. (I won't be linking to it here; I'm sure you can find it if you really want to.) So, you know, the usual.
To sum up, there have been people who seem to think that I (or the #rust mods, or the Rust team) consider use of "guys" to be abusive, and who are upset with me (or the #rust mods, or the Rust team) for "attacking" people who use "guys". So, allow me to clarify:
"guys" is not abusive; "boobs or gtfo", however, is.
"please remember that not everyone in the IRC channel is a guy" is not an "attack"; "fuck lindsey", however, is.
Hope that clears things up a bit, folks! The original post from October 2013 follows.
I'm past due to write this year's My, What A Busy Week! post, but didn't feel like doing it in August and haven't felt like doing it since. Well, it had a good long run. I'd like to mark the passage of that week in what I hope is a less flip and more thoughtful way. The only really notable occurrence in my life during the last Monday-through-Saturday in August this year is that someone in the #rust channel on IRC said "boobs or gtfo" to me. So, let's talk about that for a minute.
I don't often choose the "don't use 'guys' to refer to mixed-gender groups" battle. Over the years, 'guys' has come to grate on me, but only in certain contexts, and even then, it only grates a tiny bit. It's almost never a battle worth choosing. So, when someone showed up in #rust and said [sic] "Guys, can you recommend library for getting something from iternets via http?", I almost let it go, because it just wasn't that big of a deal. But I was emboldened by my friend Adam's recent choice of the same battle on a (private) mailing list we're both on, so I decided to say something.
So I said, "please remember that not everyone in the IRC channel is a guy." The response came back: "boobs or gtfo". I said, out loud, "Holy shit!" My officemate, sitting across the room, said, "What?" But I couldn't actually make my voice work to tell him what had just happened, because by then, adrenaline was sweeping through me. It was ten seconds before I remembered how to talk. The only other times in my life that I can remember that happening have been in response to an immediate physical threat, like when a car was about to hit my bike.
It was unexpected. I hadn't been expecting this person to actually correct themselves, of course. But I had been expecting them to respond with something like "uh, 'guys' is a gender-neutral word", in which case my reaction would have been to shrug and say something like, "Yeah, I'm sure that's how you meant it." And I would have left it at that, because, again, it wasn't that big of a deal, and all I really aim to accomplish here is encourage people to think about it a little harder, maybe choose a different word next time. But that's not what happened. What happened was that when I made a polite request, a flood of hate came rushing out at me. And now it's hard for me to continue to pretend or assume that that hate doesn't boil under the surface of our community.
Later on, various members of the full-time Rust staff (who have ops in the channel) apologized to me, and banned the user, and added several more ops from various time zones so that it would be more likely that someone with ops would be around next time (since it had been lunchtime on the west coast and no one had been around). They also added a link to the conduct policy to the channel topic, to help reduce the chances of this sort of thing happening in the future. All that was great. The Rust team is awesome. But here's the thing: every time I point out something like this in a community I'm part of, whether it's the Rust community or any other, there's a part of me that insists on first checking to see how much social capital I have to spend there. How high up am I on the contributors list? Have I contributed to the next release yet? All right, I guess it's okay for me to say something -- as though it hurts the project to speak up about a community problem! And so I have a double-entry accounting system in my head for amount of code contributed and amount of abuse reported, and it's terrible and broken that I feel that that's necessary. The only qualification that any of us should need to be treated with humanity is that we are human.
I've been contributing to Rust for a long time. #rust is my community, and it's been my community since there were only twenty-odd people in it, instead of the three hundred-some we usually have today. I shouldn't have to worry about being attacked while standing on my own ground. And, indeed, right after it happened, a friend who worked on Rust with me last year privmsg'd me to say that it seemed to him that "something sacred had been violated".
Meanwhile, in the channel, my friends joked ruefully about "boobs or gtfo" being some sort of milestone -- that we hadn't been a real language before, and hooray, now we were. I laughed along with them. But I would like to make the wholly unoriginal and unrevolutionary suggestion that, if being a "real language" means that your longtime contributors get harassed by strangers in the official IRC channel for being female, then being a "real language" is a pretty fucking abysmal standard to aspire to. We'd all like to think that the Rust community is safe and welcoming to all, and for the most part, most of the time, I think we do okay -- or, at least, we do okay relative to the abysmal standard that's been set. I almost feel sheepish complaining, since the occasional "boobs or gtfo" from some stranger on IRC is a laughably insignificant problem for our community to have. I still can barely believe it happened; it seems surreal. It's certainly not the #rust that I'm accustomed to. But if we want to keep it that way, or maybe even do better, then we'll need to work at it, and we'll certainly need to work harder now than we would have had to work two and a half years ago, and if we're very fortunate and Rust continues to grow in popularity, we'll have to work harder still.
And this is work that I believe is worth doing. As much as I like to talk about pattern matching and default methods and (recently) hygienic macros and everything else in Rust that I like and care about, there's no getting around the fact that the most important feature of any programming language is its community. At the Haskell Symposium two weeks ago, Bastiaan Heeren said, "The strength and weakness of Haskell is GHC and its community." This followed Ken Shan's program chair report in which he exhorted the Haskell community to do a better job of treating one another with humanity. Rust's community is the strength and weakness of Rust, too, and we have a tremendous opportunity to make it more of a strength and less of a weakness, if we want to.
Which is harder, writing an email client or writing a web browser? Several years ago, I would have guessed the latter. Having worked on an email client for several years, I am now more inclined to guess that email is harder, although I never really worked on a web browser, so perhaps it's just bias. Nevertheless, HTML comes with a specification that tells you how to parse crap that pretends to be HTML; email messages come with no such specification, which forces people working with email to guess based on other implementations and bug reports. To vent some of my frustration with working with email, I've decided to post some of my thoughts on what email did wrong and why it is so hard to work with. Since there is so much to talk about, instead of devoting one post to it, I'll make it an ongoing series with occasional updates (i.e., updates will come out when I feel like it, so don't bother asking).
First off, what do I mean by an email client? The capabilities of, say, Outlook versus Gaia Email versus Thunderbird are all wildly different, and history has afforded many changes in support. I'll consider anything that someone might want to put in an email client as fodder for discussion in this series (so NNTP, RSS, LDAP, CalDAV, and maybe even IM stuff might find discussions later). What I won't consider are things likely to be found in a third-party library, so SSL, HTML, low-level networking, etc., are all out of scope, although I may mention them where relevant in later posts. If one is trying to build a client from scratch, the bare minimum one needs to understand first is the basic message formatting, MIME (which governs attachments), SMTP (email delivery), and either POP or IMAP (email receipt). Unfortunately, each of these requires cross-referencing a dozen RFCs individually when you start considering optional or not-really-optional features.
The current email architecture we work with today doesn't have a unique name, although "Internet email" [1] or "SMTP-based email" are probably the most appropriate appellations. Since there is only one in use in modern times, there is no real need to refer to it by anything other than "email." The reason for using SMTP rather than any other major protocol to describe the architecture is that the heart of the system is motivated by the need to support SMTP, and that SMTP is how email is delivered across organizational boundaries, even if other protocols (such as LMTP) are used internally.
Some history of email, at least the part that led up to SMTP, is in order. In the days of mainframes, mail generally only meant communicating between different users on the same machine, and so a bevy of incompatible systems started to arise. These incompatible systems grew to support connections with other computers as computer networking became possible. The ARPANET project brought with it an attempt to standardize mail transfer on ARPANET, separated into two types of documents: those that standardized message formats, and those that standardized message transfer. These would eventually culminate in RFC 822 and RFC 821, respectively. SMTP was designed in the context of ARPANET, and it was originally intended primarily to standardize the messages transferred only on this network. As a result, it was never intended to become the standard for modern email.
The main competitor to SMTP-based email that is worth discussing is X.400. X.400 was at one time expected to be the eventual global email interconnect protocol, and interoperability between SMTP and X.400 was a major focus in the 1980s and 1990s. SMTP has a glaring flaw, to those who work with it, in that it is not so much designed as evolved to meet new needs as they came up. In contrast, X.400 was designed to account for a lot of issues that SMTP hadn't dealt with yet, and included arguably better functionality than SMTP. However, it turned out to be a colossal failure, although theories differ as to why. The most convincing to me boils down to X.400 being developed at a time of great flux in computing (the shift from mainframes to networked PCs) combined with a development process that was ill-suited to reacting quickly to these changes.
I mentioned earlier that SMTP eventually culminates in RFC 821. This is a slight lie, for one of the key pieces of the Internet, and a core of the modern email architecture, didn't yet exist: DNS, which is the closest thing the Internet has to X.500 (a global, searchable directory of everything). Without DNS, figuring out how to route mail via SMTP is a bit of a challenge (hence SMTP allowed explicit source routing, deprecated post-DNS in RFC 2821). The documents which lay out how to use DNS to route mail are RFC 974, RFC 1035, and RFC 1123. So it's fair to say that RFC 1123 is really the point at which modern SMTP was developed.
But enough about history, and on to the real topic of this post. The most important artifact of SMTP-based architecture is that different protocols are used to send email from the ones used to read email. This is both a good thing and a bad thing. On the one hand, it's easier to experiment with different ways of accessing mailboxes, or only supporting limited functionality where such is desired. On the other, the need to agree on a standard format still keeps all the protocols more or less intertwined, and it makes some heavily-desired features extremely difficult to implement. For example, there is still, thirty years later, no feasible way to send a mail and save it to a "Sent" folder on your IMAP mailbox without submitting it twice [2].
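The workaround every client resorts to is exactly that double submission: deliver the message over SMTP, then upload the same bytes over IMAP APPEND. A sketch using Python's standard library; the hostnames and credentials are placeholders, and the function is defined but deliberately not run here.

```python
import imaplib
import smtplib
import time
from email.message import EmailMessage

def send_and_file(msg, smtp_host, imap_host, user, password):
    """Deliver msg via SMTP, then upload a copy to the "Sent" folder.

    The message crosses the wire twice because SMTP and IMAP share no
    channel for "deliver this, and also keep a copy for me".
    """
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)                      # submission #1
    with imaplib.IMAP4_SSL(imap_host) as imap:
        imap.login(user, password)
        imap.append("Sent", r"(\Seen)",
                    imaplib.Time2Internaldate(time.time()),
                    msg.as_bytes())                 # submission #2

# Build a message to hand to the function above.
msg = EmailMessage()
msg["From"] = "me@example.invalid"
msg["To"] = "you@example.invalid"
msg["Subject"] = "Submitted twice"
msg.set_content("Same bytes, two protocols.")
```

(BURL, mentioned in the footnote, was meant to eliminate the second upload, but server support never materialized.)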
The greatest flaws in the modern architecture, I think, lie in a bevy of historical design mistakes which remain unmitigated to this day, in particular in the base message format and MIME. Changing these specifications is not out of the question, but the rate at which changes become adopted is agonizingly slow, to the point that change is generally impossible unless necessary. Sending outright binary messages was proposed as experimental in 1995, proposed as a standard in 2000, and still remains relatively unsupported: the BINARYMIME SMTP keyword only exists on one of my 4 SMTP servers. Sending non-ASCII text is potentially possible, but it is still not used in major email clients to my knowledge (searching for "8BITMIME" leads to the top results generally being "how do I turn this off?"). It will be interesting to see how email address internationalization is handled, since it's the first major overhaul to email since the introduction of MIME—the first major overhaul in 16 years. Intriguingly enough, the NNTP and Usenet communities have shown themselves to be more adept at change: sending 8-bit Usenet messages generally works, and yEnc would have been a worthwhile addition to MIME if its author had ever attempted to push it through. His decision not to (with the weak excuses he gave) is emblematic of the resistance of the architecture to change, even in cases where such change would be pretty beneficial.
My biggest complaint with the email architecture isn't actually a flaw in the strict sense of the term but rather a disagreement. The core motto of email could perhaps be summed up as "Be liberal in what you accept and conservative in what you send." Now, I come from a compilers background, and the basic standpoint in compilers is, if a user does something wrong, to scream at them for being a bloody idiot and to reject their code. Actually, there's a tendency to do that even if they do something technically correct but possibly unintentionally wrong. I understand why people dislike this kind of strict checking, but I personally consider it a feature, not a bug. My experience with attempting to implement MIME is that accepting what amounts to complete crap not only means that everyone has to worry about parsing the crap, but it actually ends up encouraging it. The attitude in bug reports becomes "this is supported by <insert other client>, and your client is broken for not supporting it," even when it is pointed out that the message is in flagrant violation of the specification. As I understand it, HTML 5 has the luxury of specifying a massively complex parser that can, at least in theory, parse even /dev/urandom identically across different implementations, but no similar document exists for the modern email message. Yet we still have to deal with the utter crap people claim is a valid email message. Just this past week, upon sifting through my spam folder, I found a header which is best described as =?UTF-8?Q? ISO-8859-1, non-ASCII text ?= (spaces included). The only way people are going to realize that their tools are producing this kind of crap is if their tools stop working altogether.
These two issues come together most spectacularly when RFC 2047 is involved. This is worth a blog post by itself, but the very low technically-not-but-effectively-mandatory limit on header length (to aid people who read email without clients) means that encoded words need to be split up to fit on header lines. If you're not careful, you can end up splitting multibyte characters between different encoded words. This unfortunately occurs in practice. Properly handling it in my new parser required completely reworking the design of the innermost parsing function and greatly increased implementation complexity. I would estimate that this single attempt to "gracefully" handle a wrong-but-of-obvious-intent scenario accounts for 15% or so of the total complexity of decoding RFC 2047-encoded text.
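A simplified sketch of the hazard (not JSMime's actual code): the two UTF-8 bytes of "é" land in different encoded-words, so decoding word-by-word mangles the text, and a "graceful" decoder has to pool the raw bytes of adjacent same-charset words before charset conversion.

```python
import quopri
import re

# Two encoded-words that together carry "café": a careless encoder
# split the two UTF-8 bytes of "é" (0xC3 0xA9) between them.
header = "=?UTF-8?Q?caf=C3?= =?UTF-8?Q?=A9?="

word_re = re.compile(r"=\?([^?]+)\?[qQ]\?([^?]*)\?=")
words = word_re.findall(header)

# Naive decoder: decode each word independently. Each orphaned half
# of the multibyte character becomes U+FFFD.
naive = "".join(
    quopri.decodestring(text.replace("_", " ").encode("ascii"))
          .decode(charset, errors="replace")
    for charset, text in words)

# Graceful decoder: pool raw bytes across the adjacent words first,
# then convert once -- the approach that forces the parser redesign.
pooled = b"".join(
    quopri.decodestring(text.replace("_", " ").encode("ascii"))
    for _charset, text in words)
graceful = pooled.decode("utf-8")
```

The pooling approach is what drags state across what would otherwise be independent tokens, which is exactly where the extra complexity comes from.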
There are other issues with modern email, of course, but all of the ones that I've collected so far are not flaws in the architecture as a whole but rather flaws of individual portions of the architecture, so I'll leave them for later posts.
[1] The capital 'I' in "Internet email" is important, as it's referring to the "Internet" in "Internet Standard" or "Internet Engineering Task Force." Thus, "Internet email" means "the email standards developed for the Internet/by the IETF" and not "email used on the internet."
[2] Yes, I know about BURL. It doesn't count. Look at who supports it: almost nobody.
Among build-system developers, I am very much a newcomer. Despite working with Thunderbird for over 5 years, I've only grown to understand the comm-central build system in gory detail in the past year. Most of my work before then was in trying to get random projects working; understanding it more thoroughly is a result of attempting to eliminate the need for comm-central to maintain its own build system. The goal of moving our build system description to a series of moz.build files has made this task much more urgent.
At a high level, the core of the comm-central build system is not a single build system but rather three copies of the same build system. In simple terms, there's a matrix on two axes: which repository configures the code (whose config.status invokes it), and which repository builds it (whose rules.mk is used). Most code is configured and built by mozilla-central. That comm-central code which is linked into libxul is configured by mozilla-central but built by comm-central. tier_app is configured and built by comm-central. This matrix of possibilities causes interesting bugs—like the bustage caused by the XPIDLSRCS migration, or issues I'm facing working with xpcshell manifests—but it is probably possible to make all code get configured by mozilla-central and eliminate several issues once and for all.
With that in mind, here is a step-by-step look at how the amorphous monster that is the comm-central build system works:
python client.py checkout
And comm-central starts with a step that is unknown in mozilla-central. Back when everyone was in CVS, the process of building started with "check out client.mk from the server, set up your mozconfig, and then run make -f client.mk checkout." The checkout would download exactly the directories needed to build the program you were trying to build. When mozilla-central moved to Mercurial, the only external projects in the tree that Firefox used were NSPR and NSS, both of which were set up to pull from a specific revision. The decision was made to import NSPR and NSS as snapshots on a regular basis, so there was no need for the everyday user to use this feature. Thunderbird, on the other hand, pulls in the LDAP code externally, as well as mozilla-central itself, while SeaMonkey also pulls in the DOM inspector, Venkman, and Chatzilla as extensions. Importing a snapshot was not a tenable option for mozilla-central, as it updates at an aggressive rate, so the functionality of checkout was reimplemented for comm-central in Python, hence client.py.
./configure [comm-central]
The general purpose of configure is to discover the build environment and enable or disable components based on options specified by the user. This results in a long list of variables which is read in by the build system. Recently, I changed the script to eliminate the need to port changes from mozilla-central. Instead, this script reads in a few variables and tweaks them slightly to produce a modified command line to call mozilla-central's configure...
./configure [mozilla-central]
... which does all the hard work. There are hooks in the configure script here to run a few extra commands for comm-central's needs (primarily adding a few variables and configuring LDAP). This is done by running a bit of m4 over another file and invoking that as a shell script; the m4 is largely to make it look and feel "like" autoconf. At the end of the line, this dumps out all of the variables to a file called config.status; how these get configured in the script is not interesting.
./config.status [mozilla/comm-central]
But config.status is. At this point, we enter the mozbuild world and things become very insane; failure to appreciate what goes on here is a very good way to cause extended bustage for comm-central. The mozbuild code essentially starts at a directory and exhaustively walks it to build a map of all the code. One of the tasks of comm-central's configure is to alert mozbuild to the fact that some of our files use a different build system. However, we also carefully hide some of our files from mozbuild, so we run another copy of config.status to add in some more files (tier_app, in short). This leaves our code with two different notions of how the build system is split, which was never an intentional design. Originally, mozilla-central had no knowledge of the existence of comm-central, but changes made in the Firefox 4 timeframe suddenly required Thunderbird and SeaMonkey to link all of the mailnews code into libxul, which forced this contorted process into being.
make
Now that all of the Makefiles have been generated, building can begin. The first directory entered is the top of comm-central, which proceeds to immediately make all of mozilla-central. How mozilla-central builds itself is perhaps an interesting discussion, but not for this article. The important part is that partway through building, mozilla-central will be told to make ../mailnews (or one of the other few directories). Under recursive make, the only way to "tell" which build system is being used is by the directory that the $(DEPTH) variable points to, since $(DEPTH)/config/config.mk and $(DEPTH)/config/rules.mk are the files included to get the necessary rules. Since mozbuild was informed very early on that comm-central is special, the variables it points to in comm-central are different from those in mozilla-central—and thus comm-central's files are all invariably built with the comm-central build system despite being built from mozilla-central.
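The dispatch mechanism is simple enough to simulate. The point is that nothing but the DEPTH value decides which tree's rules apply:

```python
# Illustration of the $(DEPTH) dispatch described above: the rules a
# directory builds with are entirely determined by which tree DEPTH points
# into, since $(DEPTH)/config/rules.mk is what gets included.
def included_rules(depth):
    return f"{depth}/config/rules.mk"

# A comm-central directory whose DEPTH points at the comm-central top uses
# comm-central's rules, even when the build was started from mozilla-central;
# a directory whose DEPTH points into mozilla/ gets mozilla-central's rules.
print(included_rules("../.."))
print(included_rules("../../mozilla"))
```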
However, this is not true of all files. Some files, like those in the chat directory, are never mentioned to mozilla-central. Thus, after the comm-central top-level build finishes building mozilla-central, it proceeds to do a full build under what it thinks is the complete build system. It is here that later hacks to get things like xpcshell tests working correctly are done. Glossed over in this discussion is the fun process of tiers and other dependency voodoo tricks for recursive make.
The future
With all of the changes going on, this guide is going to become obsolete quickly. I'm experimenting with eliminating one of our three build system clones by making all comm-central code get configured by mozilla-central, so that mozbuild gets a truly global view of what's going on—which would help keep comm-central from breaking over things like eliminating the master xpcshell manifest or compiling IDLs in parallel. The long-term goal, of course, is to eliminate the ersatz comm-central build system altogether, although how that final setup will work is still not fully clear, as I'm still in the phase of "get it working when I symlink everything everywhere."
When running final tests for my latest patch queue, I discovered that someone has apparently added a new color to the repertoire: a hot pink. So I now present to you a snapshot of TBPL that uses all the colors except gray (running), gray (pending), and black (I've never seen this one):
This post is admittedly long overdue; I kept wanting to delay it until I actually had patches up for review. I have a tendency to want to post nothing until I've verified that the entire pipeline, consisting of over 5,000 lines of changes, is fully correct and documented. However, I'm now confident in all but roughly three of my major changes, so patch documentation and redistribution is (hopefully) all that remains before I start saturating all of the reviewers in Thunderbird with this code. Ironically, these changes are largely a sidetrack from my original goal: I wanted to land my charset-conversion patch, but I first thought it would be helpful to test with nsIMimeConverter using the new method, which required me to implement header parsing, which is best tested with nsIMsgHeaderParser, which turned out to need very major changes.
As you might have gathered, I am getting ready to land a major set of changes. This set of changes is being tracked in bugs 790855, 842632, and 858337. These patches are implementing structured header parsing and emission, as well as RFC 2047 decoding and encoding. My goal still remains to land all of these changes by Thunderbird 24, reviewers permitting.
The first part of JSMime landed back in Thunderbird 21, so anyone using the betas is already using part of it. One of the small auxiliary interfaces (nsIMimeHeaders) was switched over to the JS implementation instead of libmime's implementation, as well as the ad-hoc ones used in our test suites. The currently pending changes would use JSMime for the other auxiliary interfaces, nsIMimeConverter (which does RFC 2047 processing) and nsIMsgHeaderParser (which does structured processing of the addressing headers). The changes to the latter are very much API-breaking, requiring me to audit and fix every single callsite in all of comm-central. On the plus side, thanks to my changes, I know I will incidentally be fixing several bugs such as quoting issues in the compose windows, a valgrind error in nsSmtpProtocol.cpp, or the space-in-2047-encoded-words issue.
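For readers unfamiliar with what RFC 2047 processing involves: non-ASCII text in headers travels as "encoded-words." Python's standard library performs the same transformation JSMime's nsIMimeConverter replacement handles, which makes for a quick demonstration:

```python
from email.header import decode_header, make_header

# An RFC 2047 encoded-word: base64-encoded UTF-8 for "Héllo", followed by
# plain ASCII text. Decoding recovers the original Unicode string.
raw = "=?UTF-8?B?SMOpbGxv?= world"
decoded = str(make_header(decode_header(raw)))
print(decoded)
```

The whitespace handling between adjacent encoded and unencoded words is exactly the sort of corner (the space-in-2047-encoded-words issue mentioned above) where implementations historically disagree.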
That's not all of the changes, although being able to outright remove 2,000 lines of libmime is certainly a welcome one. The brunt of libmime remains the code related to processing email body parts into the final display form, which is the next target of my patches and which I originally intended to fix before I got sidetracked. Getting sidetracked isn't altogether a bad thing, since, for the first time, it lets me identify things that can be done in parallel with this work.
A useful change I've identified, one even more invasive than everything else to date, would be to alter our view of how message headers work. Right now, we tend to retrieve headers (from, say, nsIMsgDBHdr) as strings, and the consumer uses a standard API to reparse them before acting on their contents. A saner solution is to move the structured parsing into the retrieval APIs, by making an msgIStructuredHeaders interface, retrievable from nsIMsgDBHdr and nsIMsgCompFields, from which you can manipulate headers in their structured form instead of their string form. It's even more useful on nsIMsgCompFields, where keeping things in structured form as long as possible is desirable (I particularly want to kill nsIMsgCompFields.splitRecipients as an API).
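A toy sketch of the idea, in Python rather than XPCOM: keep addressing headers structured internally and only flatten to a MIME string at the edges. The class and method names here are hypothetical illustrations, not the proposed interface.

```python
from email.utils import formataddr

# Toy sketch of the "structured headers" idea: store (display name, email)
# pairs, and serialize to a MIME header string only when actually needed.
class StructuredHeaders:
    def __init__(self):
        self._headers = {}

    def set_addressing_header(self, name, pairs):
        # pairs: list of (display_name, email) tuples, kept structured
        self._headers[name] = list(pairs)

    def get_addressing_header(self, name):
        return self._headers[name]

    def serialize(self, name):
        # Only here do we flatten back to a MIME header string.
        return ", ".join(formataddr(p) for p in self._headers[name])

h = StructuredHeaders()
h.set_addressing_header("To", [("Alice", "alice@example.com"),
                               ("", "bob@example.com")])
print(h.serialize("To"))
```

Consumers manipulating the structured form never need to reparse, which is precisely what makes an API like splitRecipients unnecessary.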
Another useful change is that our underlying parsing code can properly handle groups, which means we can start using groups to handle mailing lists in our code instead of…the mess we have now. The current implementation sticks mailing lists as individual addresses to be expanded by code in the middle of the compose sequence, which is fragile and suboptimal.
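RFC 5322 group syntax is what makes a mailing list representable as a single structured object rather than a bag of expanded addresses. Python's modern email policy happens to parse groups too, which makes for a quick illustration of the structure involved:

```python
from email import message_from_string, policy

# A "group" in an addressing header: a display name, a colon, member
# addresses, and a terminating semicolon.
msg = message_from_string(
    "To: my-list: alice@example.com, bob@example.com;\n\nbody",
    policy=policy.default,
)
group = msg["To"].groups[0]
print(group.display_name)
print([a.addr_spec for a in group.addresses])
```

With parsing like this, a mailing list stays a single addressable unit until the moment it must be expanded, instead of being expanded mid-compose.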
The last useful independent change I can think of is rewriting the addressing widget in the compose frontend to store things internally in a structured form instead of the MIME header kludge it currently uses; this kind of work could also be shared with the similar annoying mailing list editing UI.
As for myself, I will be working on the body part conversion process. I haven't yet finalized the API that extensions will get to use here, as I need to do a lot of playing around with the current implementation to see how flexible it is. The last two entry points into libmime, the stream converter and Gloda, will first be controlled by preference, so that I can land partially-working versions before I land everything that is necessary. My goal is to land a functionality-complete implementation by Thunderbird 31 (i.e., the next ESR branch after 24), so that I can remove the old implementation in Thunderbird 32, but that timescale may still be too aggressive.
If you do anything with web development, you are probably well aware that Opera recently announced that it was ditching its Presto layout engine and switching to Webkit. The reception of the blogosphere to this announcement has been decidedly mixed, but I am disheartened by the news for a very simple reason. The loss of one of the largest competitors in the mobile market risks entrenching a monoculture in web browsing.
Now, many people have attempted to argue against this risk with one of three arguments: that Webkit already is a monoculture in mobile browsing; that Webkit is open source, so it can't be a "bad" monoculture; or that Webkit won't stagnate, so it can't be a "bad" monoculture. The first argument is rather specious, since it presumes that once a monoculture exists it is pointless to try to end it; walk through history, and it's easier to cite examples of monocultures that were broken than ones that weren't. The other two arguments are more dangerous, though, because they presume that a monoculture is bad only because of who is in charge of it, not because it is a monoculture.
The real reason monocultures are bad is not that the people in control do bad things with them. It's that their implementations—particularly including their bugs—become the standard instead of the specifications themselves. And to crack into that market, you have to be able to reverse-engineer bugs. Reverse-engineering bugs, even in open-source code, is far from trivial. Perhaps it's clearer to look at the problems of monoculture by case study.
In the web world, the most well-known monoculture is that of IE 6, which persisted as the sole major browser for about four years. One long-lasting ramification of IE is the necessity for all layout engines to support a document.all construct while pretending that they do not actually support it. This is a clear negative feature of monocultures: new things that get implemented become mandatory specification, independent of the actual quality of their implementation. Now, some fanboys might proclaim that everything Microsoft does is inherently evil and that this is a bad example, but I will point out known bad behaviors of Webkit later.
What about open source monocultures? Perhaps the best example here is GCC, which was effectively the only important C compiler for Linux until about two years ago, when Clang became self-hosting. This is probably the closest example I have to a mooted Webkit monoculture: a program that no one wants to write from scratch and that is maintained by necessity by a diverse group of contributors. So surely there are no compatibility aftereffects for Clang, right? Well, to be able to compile code on Linux, Clang has to pretty much mimic GCC, down to command-line compatibility and implementing (almost) all of GCC's extensions to C. This also implies that you have to match various compiler intrinsics (such as those for atomic operations) exactly as GCC does: when GCC first implemented proper atomic operations for C++11, Clang was forced to change its syntax for intrinsic atomic operations to match as well.
The problem of implementations becoming the de facto standard becomes brutally clear when a radically different implementation is necessary and backwards compatibility cannot be sacrificed. IE 6's hasLayout bug is a good example here: Microsoft found it easier to bundle an old version of the layout engine into their newest web browser to keep old webpages working than to try to support them adaptively. It is much easier to justify sacking backwards compatibility in a vibrant polyculture: if a website works in only one layout engine when there are four major ones, then it is a good sign that the website is broken and needs to be fixed.
All of these may seem like academic, theoretical objections, but I will point out that Webkit has already shown troubling signs that do not portend a "good" monoculture. The decision to never retire old Webkit prefixes is nothing short of an arrogant practice, and clearly shows that backwards compatibility (even for nominally non-production features) will be difficult to sacrifice. The feature that became CSS gradients underwent cosmetic changes that made things easier for authors, changes that wouldn't have happened in a world with only one layout engine. Chrome explicitly went against the CSS specification (although they are trying to change it) in one feature, with the rationale that it is necessary for better scrolling performance in Webkit's architecture—which neither Gecko nor Presto seems to require. So a Webkit monoculture is not a good thing.
If you are a regular user of DXR, you may have noticed that the website today looks rather different from what you are used to. This is because it has finally been updated to a version dating back to mid-January, which means it also includes many of the changes developed over the course of the last summer, previously visible only on the development snapshot (which didn't update mozilla-central versions). In the coming months, I also plan to expand the repertoire of indexed repositories from one to two by including comm-central. Other changes that are currently being developed include a documentation output feature for DXR as well as an indexer that will grok our JS half of the codebase.
The original purpose of DXR, back when the "D" in its name wasn't a historical artifact, was to provide a replacement for MXR that grokked Mozilla's source code a lot better. In the intervening years, DXR has become a lot better at being an MXR replacement, so I think that perhaps it is worth thinking about ways that DXR can start going above and beyond MXR—beyond just letting you search for things like "derived" and "calls" relationships.
The #ifdef problem
In discussing DXR at the 2011 LLVM Developers' Conference, perhaps the most common question I got asked what it did about the #ifdef problem: how does it handle code present in the source files but excluded via conditional compilation constructs? The answer then, as it is now, was "nothing": at present, it pretends that code not compiled doesn't really exist, beyond some weak attempts to lex it for the purposes of syntax highlighting. One item that has been very low priority for several years is an idea to fix this issue by essentially building the code in all of its variations and merging the resulting databases to produce a more complete picture of the code. I don't think it's a hard problem at all, but rather just an engineering concern that needs a lot of little details to be worked out, which makes it impractical to implement while the codebase is undergoing flux.
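The merging step can be sketched simply: index the tree once per preprocessor configuration, then union the per-config symbol tables, remembering which configurations saw each definition. The data below is entirely illustrative, not DXR's actual schema.

```python
# Sketch of the "build every variation and merge" idea: each configuration
# produces its own symbol -> location map; merging unions them and records
# which configurations contributed each definition.
def merge_indexes(per_config):
    merged = {}
    for config, symbols in per_config.items():
        for sym, location in symbols.items():
            merged.setdefault((sym, location), set()).add(config)
    return merged

per_config = {
    "linux":   {"InitWidget": "widget.cpp:10", "GtkHook": "widget.cpp:40"},
    "windows": {"InitWidget": "widget.cpp:10", "WinHook": "widget.cpp:80"},
}
for (sym, loc), configs in sorted(merge_indexes(per_config).items()):
    print(sym, loc, sorted(configs))
```

The engineering fiddliness lives in the details this sketch ignores: symbols whose definitions move between configurations, and presenting per-config results in the UI.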
Documentation
Documentation is an intractable unsolved problem that makes me wonder why I bring it up here…oh wait, it's not. Still, from the poor quality of most documentation tools out there when it comes to grokking very large codebases (Doxygen, I'm looking at you), it's a wonder that no one has built a better one. Clang added a feature that lets it associate comments to AST elements, which means that DXR has all the information it needs to be able to build documentation from our in-tree documentation. With complete knowledge of the codebase and a C++ parser that won't get confused by macros, we have all the information we need to be able to make good documentation, and we also have a very good place to list all of this documentation.
Indexing dynamic languages
Here is where things get really hard. A language like Java or C# is very easy to index: every variable is statically typed and named, and fully-qualified names are generally sufficient for global uniqueness. C-based languages lose the last bit, since nothing enforces global uniqueness of type names. C++ templates are effectively another programming language that relies on duck-typing. However, that typing is still static and can probably be solved with some clever naming and UI; dynamic languages like JavaScript or Python make accurately finding the types of variables difficult to impossible.
Assigning static types to dynamically typed code is a task I've given some thought to. The advantage in a tool like DXR is that we can afford to be marginally less accurate in typing in exchange for precision. An example of such an inaccuracy would be ignoring what happens with JavaScript's eval function. Inaccuracies here could be thought of as inaccuracies resulting from a type-unsafe language (much like any C-based callgraph information is almost necessarily inaccurate due to problems inherent to pointer alias analysis). The actual underlying algorithms for recovering types appear known and documented in academic literature, so I don't think that actually doing this is theoretically hard. On the other hand, those are very famous last words…
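As a degenerate illustration of the flavor of analysis involved (real recovery algorithms are far more involved), one can walk an AST and record the literal types each variable is ever assigned. A variable accumulating multiple types is exactly where dynamic languages get hard:

```python
import ast

# Toy type recovery: record the type of every literal assigned to each
# variable in a Python source string. This only handles direct literal
# assignments; real analyses must propagate types through calls, etc.
def infer_literal_types(source):
    seen = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    seen.setdefault(target.id, set()).add(
                        type(node.value.value).__name__)
    return seen

print(infer_literal_types("x = 1\nx = 2.5\ny = 'hi'"))
```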
I'm very excited to announce that I've started a new blog for writing about the research and research-related program activities that have taken over this journal in the last few years. Also, I'm excited to have a domain that's less embarrassing than rockstargirl.org.
I hesitate a bit in calling it a "research blog" -- that sounds so bloodless. Rather, I want it to be a blog about my experiences as a researcher, which necessarily include research but also include a lot of other things. If you've been reading my research posts here, I hope you'll consider reading there, too.
As June came to an end, I went to back-to-back Mountain Goats concerts -- John Darnielle solo shows, really -- on the 27th and 28th, bringing Alex with me for the first (it was his first Mountain Goats show) and my friend Emily for the second (her first, as well). Because I lived in Mountain View, I spent twice as much time getting to and from these shows on public transit as I spent actually at the shows, but it was worth it, and I was really happy to get to give Alex and Emily a proper introduction to a band I've loved for a long time. Neither show quite left my head for months. (LMA has the first night's recording, which happened to be the night that JD played not one but two songs about professional wrestling.)
July arrived, and Ryan and I made a final push to finish up our POPL submission on time, with the help of many friends and colleagues who read drafts of it for us. Ken Shan spotted a couple of particularly subtle bugs.
Just before the deadline, a friend of Ryan's pointed us to some related work we hadn't seen yet: a very recent tech report from the group working on the Bloom language at Berkeley. Then, it turned out that my Mozilla colleague Niko had a friend who had worked on Bloom. Niko put me in touch with his friend, I started talking to the Bloom folks, various enthusiastic emails were exchanged, and I ended up being invited to go visit Berkeley and give a talk in August.
With the POPL deadline behind me, I took two weeks off from Mozilla to attend OPLSS in Eugene, Oregon, an annual two-week-long summer school that draws PL folks from around the world. The lectures were intense, mind-bending, and entertaining, and I loved hanging out with the people at OPLSS.
But it was at around this point in the summer that I began to feel pulled apart at the seams, because there were so many things I had to do and they all seemed critically urgent. Although our paper was now done, I had returned to trying to fix the remaining bug in our determinism proof, so that Ryan and I could put up the tech report that was supposed to accompany the paper. I felt guilty for not doing the OPLSS homework -- after attending the lectures all day, whatever energy I had left was going toward working on the proof instead. On the other hand, I felt equally guilty for being at OPLSS at all instead of back at Mozilla hacking on traits, which had gone rather quickly in my mind from being a somewhat exotic, nice-to-have feature to an essential feature that the language sorely needed. (Indeed, Niko's type inference code was filled with plaintive comments about how the implementation could have been so much cleaner, if only we had traits.) And on top of everything else, I had to prepare for my upcoming talk at Berkeley.
After a few days at OPLSS, with all my various research and work obligations at war for my attention, I was a stressed-out wreck. Alex listened to my woes on the phone with characteristic patience and understanding. Another thing that helped was getting out of my dreary, isolating OPLSS dorm room (my roommate had been a no-show, so it was just me) and running on the beautiful network of trails around Eugene. Nevertheless, with all the other stuff I had to do, I began to fall behind on the 20-mile-a-week habit I would have had to keep up in order to make my 1000-mile goal for the year.
I began to make progress on the proof again; then, at the start of the second week of OPLSS, I got a big emotional boost: Ryan and I got official notification that our grant had been funded! This was pretty huge news for me: it meant that my plans for doing a Ph.D. with Ryan were starting to shape up in the way that I had hoped.
By the end of OPLSS, I had managed to isolate and finally really understand the renaming bug. I still hadn't managed to fix it properly, though, so Ryan and I decided to post an incomplete version of our tech report, and I went back to Mountain View with a vow to go back and fix it once my internship was over.
In August, back at Mozilla, I worked furiously to get at least some part of my traits proposal implemented before my internship ended. I had to give my final presentation on traits in my second-to-last week at Mozilla, before I had anything actually working yet; in it, I overoptimistically predicted that traits would be done by the 0.4 release. (Much of the work ended up being deferred to the 0.5 release, which just came out in December, and many bugs still remain to be addressed.)
But finally, on Monday of my last week, I successfully compiled and ran the first-ever Rust program with a Haskell-style default method. It was just a toy program, but I was pretty excited. I also felt heartened by all the enthusiasm I was seeing for traits, both on the team and in the Rust community beyond the core team. (As I said in my presentation, "I'm an academic -- I'm not used to working on things that people actually want.") For that matter, it was rather exciting that a Rust community beyond the core team could now actually be said to exist; that hadn't been the case back in 2011!
A few days before we had to head back to Indiana, I went to give my talk at Berkeley, where I had a great time chatting with Neil Conway, Joe Hellerstein and the rest of the Bloom and Daedalus folks. Their work on BloomL, which is an extension of Bloom, had turned out to be directly related to our work: what Ryan and I were doing for single-assignment parallel languages -- namely, parameterizing them by a user-specified lattice to give them a more flexible and general notion of monotonicity -- BloomL had already done for Bloom. Their goal was guaranteeing eventual consistency in distributed databases, rather than deterministic parallelism, but the math was largely the same.
Moreover -- and I might never have realized this, had we not come across the Bloom work -- the ideas about monotonicity that Ryan and I had been kicking around had already been in play in the distributed systems community for a while. The novel thing that I hope to be able to offer is a language-based account of those ideas, as part of a language-based approach to deterministic parallelism. But just because something looks like a language problem to me doesn't mean it's okay to succumb to PL myopia! We need to pay attention to what the distributed systems people are doing; it turns out they're working on really hard, interesting problems.
The highlight of my September was a trip to Denmark, where I attended ICFP in Copenhagen, then spent a few days adventuring with Chris. While ICFP was going on, Ryan and I received the reviews for our POPL submission and wrote our response to the reviews. I had been nervous that the reviewers would spot the remaining hole in our proof, but my fear turned out to be unfounded -- none of the reviewers took issue with the technical development. Rather, their complaints were that we needed to motivate the paper better and provide more real-world examples of why a new approach to deterministic parallelism was called for, which were both fair points.
Considering that it was my first time submitting a paper to POPL, I was actually quite happy with the outcome: the reviewers found our idea to be interesting and potentially useful, pending more examples. But since it looked like our paper wasn't going to be accepted to POPL, we started thinking about resubmitting it to ESOP. I also began talking to people about potential applications for our work, and Matt Might suggested using LVars for parallelizing CFAs, so I began thinking a bit about how that might work. (I didn't get particularly far with the idea, but as it turns out, two of my favorite colleagues, Will Byrd and Andy Keep, are just about to join Matt's research group as postdocs, and there's a possibility that we'll be collaborating on it in the future, which would be very cool.)
At the end of September, Alex and I went to Florida for a few days to be part of his sister Natalie's wedding, and happily welcomed her new husband, Bryan, into our family.
On October first, Ryan and I were notified that our paper didn't get into POPL, so we spent the first half of the month readying our ESOP submission. Since the ESOP page limit is shorter, we had to condense and cut a lot of the material, which was difficult, but I think ultimately improved the paper. Aaron Turon helped by commenting extensively on a draft. It was also during this time that I finally fixed the renaming bug, after some conversations with Amr and a few emails exchanged with Neel Krishnaswami (who had offered to help when we'd talked at ICFP). We posted an updated version of our tech report with the finished proof.
With the determinism proof finally complete, I went back to thinking about the frame-rule-like property we'd had to show along the way. Something about it was nagging at me. In my experience, people used frame rules to show that, if a program only used certain resources when it ran, then the resources it didn't use remained the same before and after the run. That was a useful property, but it seemed like only half the story. A frame rule could also show that if a program only used certain resources and produced a given result, then running it again with additional resources present would produce the same result.
For us, it was that second guarantee that was particularly important, because "the same result" was what determinism was all about! When I asked about it on cstheory.stackexchange, both Aaron and Neel had illuminating responses, and I realized that there was more to the frame rule than I'd thought. Conveniently, Larry Moss had just asked me to give a talk at the IU logic seminar in November, so I used the classic "agree to give a talk about something as a way of motivating yourself to learn about it" trick and told Larry that I would give a talk about frame properties, adding a paper that Aaron had recommended to my reading list.
Once we'd sent our paper off to ESOP, Ryan suggested that while we were waiting to hear back, I could help out with some monad-par hacking. He asked me to do what he thought of as a fairly minor task, but when I tried to dig into it, I was utterly lost. Since I had barely written any Haskell in my life up to that point, I decided to back up and work on actually learning the language first, which was a bit embarrassing, since Ryan had assumed that I was already pretty fluent with Haskell.
By this point in the year, I was so far behind on my 1000-mile running goal that I more or less gave up on it entirely, although I continued to accrue points on Fitocracy for my bike commutes to campus. Alex, though, continued to run regularly as preparation for another marathon he had coming up in November (he actually did a total of three marathons in 2012). He also became enamored of these things on a weekend trip we made to Atlanta to attend yet another wedding.
I crossed one minor academic cheevo off my list in October: the BloomL paper by Neil Conway et al. was published at SOCC, marking the first time that my work on LVars has been cited by someone else. Finally, in October Larry Moss and Ken Shan both agreed to join Ryan and Amr as members of my dissertation committee, bringing it up to the required total of four members.
As November began, my efforts to get better at Haskell limped along. I had frustrating interactions where I would say that I was struggling with Haskell, and then another grad student would proceed to start 'splaining basic type theory to me. The types weren't the hard part; the purity, the laziness, and the parallelism were the hard part! Eventually, I begged off of working on the monad-par project so that I could get back to my own stuff. I also spent some time organizing events for CS Club, especially the second annual CS Club "So you want to go to grad school?" panel.
Over our Thanksgiving break, Alex and I made a road trip to Iowa to visit my parents and my grandpa Joe, my mom's dad. We spent a few days traipsing around Cedar Rapids, where I lived for a year growing up, and Grinnell, where I went to undergrad.
After Thanksgiving, I gave my logic seminar talk about separation logic and frame properties. While preparing for it, I chatted with Aaron a bit more, and he suggested that I take a look at a paper on "views" that was about to come out at POPL; he suspected there might be an interesting relationship between views and LVars. I began doing some more background reading on concurrent separation logic so that I'd be better able to make sense of the views paper.
Finally, at the end of the month, we got the reviews back from our ESOP submission and wrote our response. The reviewers had liked many things about the paper, but, just as with POPL, they wanted to see more convincing examples of what LVars were good for.
In December, while looking through some old emails, I realized that back in October, Neel Krishnaswami had independently recommended that I read the same paper on views that Aaron had pointed me to. (Dun-dun-DUNNNNNNNN.) I emailed Aaron and Neel to see if they were interested in working with me to investigate a connection between views and LVars, and they responded enthusiastically, inviting me to visit Saarbruecken right after POPL so that we could spend some time working together. There are a couple of other LVar-semantics ideas we're interested in pursuing, as well.
Needless to say, I'm incredibly excited about collaborating with Aaron and Neel. For one thing, I've been wanting to do something involving logical relations and LVars right from the start; for another, the fact that a frame property just fell out in the process of doing the LVar determinism proof suggests the possibility of some kind of separation logic for deterministic parallelism. (Also, I've never been to Germany, and I'm quite looking forward to visiting, especially since Ezgi Cicek alerted me to the existence of Flammkuchen.)
In mid-December, Ryan and I got a very nice rejection letter from the ESOP committee, in which they reiterated that they liked our idea but that they wanted more convincing examples. Although I'm of course disappointed that we haven't been able to publish our work yet, I think that the two rounds of reviewing we've been through have put us in a good place to move forward with it. The technical development is now quite solid, the idea is novel as far as we know, and people seem to want to be enthusiastic about it. But it's clear that we need to motivate the work better. Everyone knows that nondeterminism can be nasty, but why would someone want to use our thing instead of just using a single-assignment language? What kinds of programs are expressible (or conveniently expressed) with a monotonic-multiple-assignment language that would be impossible (or inconvenient) to express with a single-assignment language or some other existing deterministic parallel model? Those are the questions I think we need to address next time.
The semester wrapped up, concluding with Andy Keep's dissertation defense, and Alex and I skipped town for the holidays, going first to New York for an end-of-the-world party that some friends were throwing, and then to Tallahassee to spend a few days with his family, as has become traditional. All this was great fun, except that just as we got to NYC, I caught a terrible cold that destroyed my voice and just wouldn't go away. When we got back to Bloomington, Alex convinced me to see a professional, who prescribed four different medications. Believe it or not, this is the first time in my adult life that I remember having to take prescription drugs of any kind, and I was a little reluctant to do so, but they seem to have done the job. Still, it's only now, after nine days of drugs, that my throat is getting back to normal.
Finally, I submitted a talk proposal about the LVars work for the POPL student short talk session, and it was accepted, so it looks like I'll be giving a short presentation at POPL in addition to the, uh, even shorter one that I might be giving at PLMW. I have a bit of work to do to prepare for both of those before I head off to POPL (in Rome!) in a couple of weeks.
To sum up, I'm happy with how my 2012 went. I made a lot of progress on research this year. Getting our grant proposal accepted was a big win. I'm proud of having submitted my first paper to POPL, having finally gotten the LVar determinism proof done, and having given various talks and successfully reached out to various people about research collaborations. At Mozilla, I contributed to every release of Rust, and I helped bring traits into the language. In my personal life, my marriage continued to go wonderfully. Alex is still the best. The paper rejections were unfortunate, but that's part of academic life; there's not much I can really complain about.
In 2013, I'd like to do another marathon; I'm tempted to go back to Toronto and try that course again in May. Research-milestone-wise, I want to get our first LVar paper published and get another paper or two underway, and I'm hoping to propose my thesis relatively soon -- sometime this summer, perhaps. Alex is planning to propose this year, too, so he and I are even more or less synchronized, which is awesome. Although we decided we don't want to do any more internships before we graduate, I want to keep contributing to Rust as a volunteer when I can, and to do what I can to help other people contribute. And that's about it for me; thanks for reading! How about you?
So it turns out I have a lot to say about how my 2012 went. Part 2 to follow!
In January, I began a research
assistant position with my advisor, Ryan, marking the first time in grad school that I've had no major responsibilities other than research (that is, no classes and no teaching). In December, Ryan and I had submitted an NSF grant proposal about lattice-based deterministic parallel computation, so now we began fleshing out the ideas we'd described in the proposal, putting together a semantics for a small language with lattice-based data structures that we called LVars.
Alex and I started training for the 2012 Toronto Marathon, and I opened a
Fitocracy account on January 1 with ambitious plans to log 1,000 miles by year's end.
I spent some time helping organize InWIC, a conference for women in computing in Indiana; the main thing I did was work together with my friends Will and Ian at Studio Cypher to design a cooperative puzzle game for the InWIC attendees to play.
At the end of the month, I went on a delightful road trip to POPL in Philadelphia with a whole pile of folks from IU and CMU. Highlights included playing board games in the hotel lobby and being too scared to say hello to Simon Peyton Jones in the elevator (later I got over it and we had a lovely chat).
In February, Ryan and I continued working on our project. As our LVar-based language began to solidify, I started working on a determinism proof for it, taking the determinism proof for the Featherweight CnC language in this paper as a starting point and frequently soliciting Amr's advice. It had actually been the Featherweight CnC proof that had gotten me started thinking about the relationship between monotonicity and determinism in the first place, so it was interesting to go back to that proof and try to adapt it to our setting.
It began to sink in that I was headed toward doing a Ph.D. with Ryan on deterministic parallelism, which was pretty exciting. Deterministic parallelism was still rather unfamiliar territory for me, but now that I had a big proof to work on, I was in my element. Amr also agreed to join my nascent thesis committee.
As March began, I was getting close -- or so I thought! -- to finishing the proof, and we considered submitting something to ICFP, but with the proof still not done, we decided we'd better wait and shoot for POPL in July instead.
During IU's spring break in mid-March, Alex and I went to Boston for a few days, where I spent some time working with Amal on the still-unfinished multi-language parametricity project I was doing with her. While there, I told Amal about what I was working on with Ryan, and when I showed her one of the properties I'd had to prove to make our determinism proof go through, she pointed out that it looked sort of like a frame rule. At the time, I didn't really understand what that meant, but the idea stuck with me.
While I was in Boston, Amr found a bug in my proof that had to do with the names of memory locations and the circumstances under which they could be renamed. It wasn't a showstopper, but it would continue to plague us, in one guise or another, for several months.
When I got back to Bloomington, after a couple of meetings in which Ryan, Amr, and I all found ourselves evaluating expressions by hand on Amr's whiteboard, we realized that we really needed a runnable model of our language, so I started hacking a Redex model together. I had a lot of fun bringing the paper semantics to life. On March 30th, which was my 30th birthday, I celebrated by giving a whiteboard talk at PL-wonks about what Ryan, Amr, and I had been doing, followed by a Redex demo.
The next day, Alex and I ran the IU Mini Marathon, which doubled as part of my Toronto marathon training.
Once back at home, I spent much of the rest of April trying to finish a particularly hairy case of the diamond lemma (the lemma that did the heavy lifting of our determinism proof) that had been thwarting us for the previous two months. I finally got it at the end of the month -- the solution required a tweak to the aforementioned frame-rule-like property that seemed obvious in retrospect -- and the only remaining known problem with the proof was the
location-renaming bug that Amr had pointed out. By now, the proof had
grown to 40 pages long, and we'd started drafting our POPL submission.
In May, the semester ended, and Alex and I headed to Toronto
for the marathon (following a last-minute trip to Chicago to get my passport renewed). Toronto was great! We got to hang out with our friends Ryan and Janice, who live around the corner from Honest Ed's, and I caught up with my old friend Steven from the Act IV days. And I successfully completed my seventh marathon. Given the course's negative change in elevation, it should have been an easy PR, but I missed my PR by five minutes due to a series of unfortunate gastrointestinal events.
After Toronto, Alex and I headed to California (with cats in tow) for our second summers of
working on Google Translate and Rust, respectively. It
was great to return to Mozilla, where I pretty much already knew everyone on the Rust team. Most impressively, all of the summer 2011
Rust interns -- Tim Chevalier, Eric Holk, Paul Stansifer, Michael Sullivan, and me -- were back for summer 2012, either as
returning interns or, in Tim's case, as full-time employees.
For me, returning to Mozilla for a second summer was something of an indulgence. I had reached a point in my Ph.D. when I really should have been concentrating on research, rather than going off and doing yet another internship, and I suspect that some of the other interns were in a similar position. But the opportunity to work on a project as cool as Rust, with a team as good as the Rust team, was too tempting for any of us to pass up, it seems.
Near the end of May, Ryan
and I got the incredibly exciting news that the grant proposal we'd submitted in December had
been recommended for funding. Being recommended for funding usually means that a grant will be funded, but we kept quiet about it because it wasn't yet a sure thing. So began an extremely squirmy two months of compulsive email-checking and NSF-website-page-reloading.
In June, I managed to keep up my running, aided by the perfect Mountain View weather and the fact that our apartment was right next to the Stevens Creek Trail. For Alex's part, he ran almost every day, sometimes as part of his commute.
At work, I had to spend a fair amount of time re-familiarizing myself with
Rust; the compiler had matured a lot since I'd left the previous August, and the language had sprouted a mysterious new region system, a typeclass system of sorts, and a few other features that were new to me. After some amount of flailing with the type inference engine (which had been overhauled (...again)),
I finally landed integer-literal suffix inference in the compiler in mid-June, in time for the 0.3 release.
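Suffix inference means the compiler works out an integer literal's type from context instead of requiring an explicit suffix on every literal. A minimal illustration of the idea in today's Rust, which inherited the behavior (the 2012 syntax differed):

```rust
// With inference, the same literal text can take on different integer
// types depending on the context it appears in.
fn byte_max() -> u8 { 255 }        // literal inferred as u8
fn big() -> i64 { 1_000_000 }      // literal inferred as i64

fn main() {
    assert_eq!(byte_max(), 255u8); // explicit suffixes still allowed
    assert_eq!(big(), 1_000_000i64);
}
```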
Once suffix inference was done, I needed a new project. The team had been mulling over the idea of adding traits to Rust for some time; as far as I know, Patrick was the first to bring up the idea of traits, in his object system proposal from November 2011. After reading a bit and asking some questions, I realized that what everyone had been calling traits were essentially Haskell-style typeclasses with default methods -- in the terminology of the original traits paper, "provided" methods were implementations, while "required" methods corresponded to signatures -- and that Rust's existing ifaces were just underpowered typeclasses that could be extended to traits (whereas traits and interfaces had been distinct notions in the original proposal). The idea of combining traits and interfaces was well received, and I spent a fair amount of time putting a proposal together, with help from Patrick and Dave, who had already done some thinking about separate compilation and so on. Adding default methods to an interface, it turns out, has surprisingly deep semantic ramifications -- the whole thing's...polarity?...flips, or something. I don't have a good explanation at hand, but something about it made me go "whoa" and sort of stagger back from the whiteboard.
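The typeclass-with-default-methods shape described above survives in today's Rust almost unchanged. A small sketch in modern syntax, with my own example names:

```rust
// A trait in the "required + provided" style: `name` is a required
// method (a signature only), while `greet` is a provided method (a
// default implementation that impls inherit or may override).
trait Greet {
    fn name(&self) -> String;                // required
    fn greet(&self) -> String {              // provided (default method)
        format!("Hello, {}!", self.name())
    }
}

struct Server;

impl Greet for Server {
    fn name(&self) -> String { "Bob".to_string() }
    // `greet` comes for free from the default.
}

fn main() {
    assert_eq!(Server.greet(), "Hello, Bob!");
}
```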
Meanwhile, I was continuing to work on Ryan's and my POPL submission in my off hours. We still had the location-renaming bug in the proof to contend with, but Ryan suggested that I put the proof aside for a while and focus on writing instead, since the POPL deadline was quickly approaching...
When writing custom passes for a compiler, it's often a good idea to try running them on real-world programs to assess things like scalability and correctness. Taking a large project and seeing your pass work (and provide useful results!) is an exhilarating feeling. On the other hand, trying to feed your compiler options into build systems is a good route to an insane asylum.
I complained some time ago about autoconf 2.13 failing because it assumes implicit int. People rightly pointed out that newer versions of autoconf don't assume that anymore. But new versions still come with their own cornucopias of pain. Libtool, for example, believes that the best route to linking a program is to delete every compiler flag from the command line except those it knows about. Even if you explicitly specify them in LDFLAGS. Then there's this conftest program that I found while compiling gawk:
/* Define memcpy to an innocuous variant, in case <limits.h> declares memcpy.
   For example, HP-UX 11i <limits.h> declares gettimeofday.  */
#define memcpy innocuous_memcpy

/* System header to define __stub macros and hopefully few prototypes,
   which can conflict with char memcpy (); below.
   Prefer <limits.h> to <assert.h> if __STDC__ is defined, since
   <limits.h> exists even on freestanding compilers.  */
#ifdef __STDC__
# include <limits.h>
#else
# include <assert.h>
#endif

#undef memcpy

/* Override any GCC internal prototype to avoid an error.
   Use char because int might match the return type of a GCC
   builtin and then its argument prototype would still apply.  */
#ifdef __cplusplus
extern "C"
#endif
char memcpy ();

/* The GNU C library defines this for functions which it implements
   to always fail with ENOSYS.  Some functions are actually named
   something starting with __ and the normal name is an alias.  */
#if defined __stub_memcpy || defined __stub___memcpy
choke me
#endif

int
main ()
{
return memcpy ();
  ;
  return 0;
}
…I think this code speaks for itself in how broken it is as a test. One part of the compiler pass asserted because memcpy wasn't being used in the right way, crashing the compiler. Naturally, this being autoconf, it proceeded to assume that I didn't have a memcpy and thus decided to provide me one, which causes later code to break in spectacularly bad fashion when you realize that memcpy is effectively a #define in modern glibc. And heaven forbid that I should try to compile with -Werror in configure scripts (nearly every test program causes a compiler warning along the lines of "builtin function is horribly misused").
The saddest part of all is that, as bad as autoconf is, it appears to be the least broken configuration system out there…
[11:13am] tjc: rustbot: build snap-stage3 qkzw
[11:14am] auREAX: qkzw?
[11:15am] tjc: auREAX: the FreeBSD bot
[11:15am] tjc: only brson knows why it’s called that
[11:15am] auREAX: ah
[11:15am] auREAX:
[11:16am] brson: qkzw was my grandmother’s name
[11:16am] tjc: oh, that’s touching
[11:17am] graydon: she was a ham radio operator?
[11:17am] brson: one of the best
[11:17am] graydon: (or maybe fighter pilot? hm. call signs…)
A lot of people have been asking me how to use protocols in Rust lately, so I thought I’d write up a little tutorial. Custom protocols are how you get the most out of Rust’s communication system: they provide the strongest safety guarantees and expose the most opportunities for optimization. They’re also more labor intensive, so some of the other library features such as streams or port sets might be a better starting point. This post, however, introduces how to write your own protocols.
Protocols are created using the proto! syntax extension. Each protocol has a collection of states, and each state in a protocol has a collection of messages that are allowed. Every protocol must have at least one state, though states needn’t have any messages. Thus, we will start with the simplest possible protocol definition:
proto! simple_proto {
StartState:send { }
}
This creates a protocol called simple_proto, with a single state called StartState. Protocols always start in the first state listed. There is one other decoration, :send. This indicates that StartState is a send state. Protocols are always written from the client perspective, so a send state means it is a state in which the client may send a message. The protocol compiler will generate a dual state for the server, meaning there will be a version of StartState that is expecting to receive a message.
Given this specification, the protocol compiler will create a module called simple_proto, that includes types and functions for communicating according to this protocol. One generated function is called init, which is used to create a pipe in the start state. We can start the protocol like this:
let (server, client) = simple_proto::init();
This creates a pair of endpoints. The client endpoint will have the type simple_proto::client::StartState, meaning it is in the client’s version of StartState. The server endpoint, on the other hand, has type simple_proto::server::StartState, which means it is in the server’s version of StartState. The difference means that client is expecting to send a message, while server is expecting to receive a message.
We can’t really do anything further with this protocol, since there are no messages defined. Strictly speaking, the server could try to receive, but it would block forever since there is no way for the sender to send a message. Let’s fix this by adding a message.
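A plausible sketch of the extended definition, in the proto! syntax shown above (my reconstruction; the new message simply loops the protocol back to its start state):

```
proto! simple_proto {
    StartState:send {
        SayHello -> StartState,
    }
}
```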
Messages have the form Name(arguments) -> NextState. In this case, we added a message called SayHello, which carries no data with it. After sending a SayHello message, the protocol transitions to (or stays in, really) the StartState. We can now write some functions that communicate with each other. Here’s an example client program:
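A client in this style might look roughly like the following sketch (illustrative, using the era's import syntax and the generated client module):

```
fn client(+channel: simple_proto::client::StartState) {
    import simple_proto::client;
    // Sending consumes the endpoint and returns one in the next state.
    let _channel = client::SayHello(channel);
}
```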
Receive, by itself, is a little trickier. I recommend using the select macro instead. For now, you can get the select macro here. Once macro import is working, the select macro will be included in the standard library; until then, to use it, save it in a file called select-macro.rs and add the following lines near the top of your program.
fn macros() {
include!("select-macro.rs");
}
Once you’ve done this, you can write the server as follows.
Select allows you to provide a set of endpoints to listen for messages on, followed by actions to take depending on which one receives a message. In this case, we only have one endpoint, called channel, which is in the server StartState state. After the =>, there is a block describing message patterns and code to execute if the pattern is matched. In this case, we only have one pattern, SayHello -> _channel. This mirrors the definition of the message in the protocol specification. It says “if we receive a SayHello message, bind an endpoint in the next protocol state to _channel and execute the code in the following block.” We use _channel for the next state because in this case we are not planning on sending or receiving any more messages.
Now let’s make this protocol a little more interesting by adding a new state and a message that carries data. We will do this by letting the client ask the server’s name and wait for a reply. The new protocol looks like this:
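Following the Name(arguments) -> NextState form, the two-state protocol can be sketched like this (my reconstruction):

```
proto! simple_proto {
    StartState:send {
        SayHello -> StartState,
        WhatsYourName -> GettingName,
    }
    GettingName:recv {
        MyNameIs(~str) -> StartState,
    }
}
```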
We’ve added a new message to StartState, which the client uses to ask the server’s name. After sending this, the protocol transitions to the GettingName state, where the client will wait to receive the MyNameIs message from the server. At this point, the protocol moves back to the StartState, and we can do it all over again. We’ve added an argument to MyNameIs, which means this message carries a string with it. Our client code now looks like this:
fn client(+channel: simple_proto::client::StartState) {
import simple_proto::client;
import simple_proto::MyNameIs;
let channel = client::SayHello(channel);
let channel = client::WhatsYourName(channel);
select! {
channel => {
MyNameIs(name) -> _channel {
io::println(fmt!("The server is named %s", *name));
}
}
}
}
At a high level, this code says hello to the server, then asks for its name, then waits for the response and reports the server’s name to the user. It probably looks a little odd that every line starts with let channel = .... This is because endpoints are single use. Any time you send or receive a message on an endpoint, the endpoint is consumed. Fortunately, all the send and receive functions return a new endpoint that you can use to continue the protocol.
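The single-use-endpoint discipline can be mimicked in today's Rust with methods that take self by value and hand back a new typed endpoint. Here is a toy sketch using modern std channels and my own made-up names (not the 2012 pipes API): consuming self is what makes "the endpoint is consumed" a compile-time guarantee.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Typed endpoint wrappers: one struct per protocol state.
struct Start { tx: Sender<String>, rx: Receiver<String> }
struct GettingName { rx: Receiver<String> }

impl Start {
    // Taking `self` by value makes the endpoint single-use: after the
    // call, only the next-state endpoint remains.
    fn whats_your_name(self) -> GettingName {
        self.tx.send("WhatsYourName".to_string()).unwrap();
        GettingName { rx: self.rx }
    }
}

impl GettingName {
    fn receive_name(self) -> String {
        self.rx.recv().unwrap()
    }
}

fn run_client() -> String {
    let (c2s_tx, c2s_rx) = channel::<String>();
    let (s2c_tx, s2c_rx) = channel::<String>();
    // A tiny server: wait for the question, reply with a name.
    thread::spawn(move || {
        if let Ok(msg) = c2s_rx.recv() {
            if msg == "WhatsYourName" {
                s2c_tx.send("Bob".to_string()).unwrap();
            }
        }
    });
    let client = Start { tx: c2s_tx, rx: s2c_rx };
    client.whats_your_name().receive_name()
}

fn main() {
    assert_eq!(run_client(), "Bob");
}
```

Trying to use a Start endpoint twice after calling whats_your_name is a move error, which is the analogue of the protocol compiler's state discipline.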
The use of select! here is similar to how it was in the previous server example, except that we’ve added name to the MyNameIs pattern. This matches the ~str parameter in the protocol specification, and it binds the string sent by the server to name, so that we can print it out in the handler code.
For the new server, we need to add another clause to the message patterns:
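Sketched, the extra clause might look like this (hypothetical reconstruction; server::MyNameIs stands for the generated send function for the server's reply):

```
WhatsYourName -> channel {
    server::MyNameIs(channel, ~"Bob");
}
```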
In this case, if we receive a WhatsYourName message, we send a MyNameIs message on the new endpoint (called channel), which contains the string ~"Bob", which is what this server has decided to call itself. The client will eventually receive this string and show it to the user.
This covers the basic definition and usage of protocols. There are several other features, however. This includes polymorphic states and terminal states. Polymorphic states allow you to create protocols that work for different types. One common example is the stream protocol, which lets you send a whole bunch of messages of a given type:
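A sketch of such a stream protocol (illustrative reconstruction; a single polymorphic send state that loops to itself):

```
proto! stream {
    Stream:send<T: send> {
        Send(T) -> Stream<T>
    }
}
```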
We can add as many type parameters as we want to each of the states, with arbitrary bounds as well. You’ll probably want all your data types to be send-bounded though. Then, each time a message transitions to a polymorphic state, it must provide type parameters. You can see this on the right-hand side of the Send message.
Sometimes, we want to have a message that ends the protocol. For example, in our previous example, we might want a GoodBye message. One way to do this is to make a state with no messages, and step to that:
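For example (hypothetical state and message names), GoodBye could step to an empty Done state:

```
proto! simple_proto {
    StartState:send {
        GoodBye -> Done,
    }
    Done:send { }
}
```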
However, this is a little verbose, and it also hides that fact that you really intended the protocol to end there. Thus, there is a special form that indicates sending a message ends the protocol. We write it like this:
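Sketched with the GoodBye message mentioned above:

```
proto! simple_proto {
    StartState:send {
        GoodBye -> !,
    }
}
```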
Stepping to ! represents closing the protocol and is analogous to how a function that returns ! actually never returns. When a message steps to !, neither the corresponding send function nor the receive function will return a new endpoint, meaning there is no way you could even attempt to send messages on this connection.
I hope this helps to get started with protocols in Rust. There are a few other features, but this covers the basics. Please feel free to ask questions!
My expeditions in code coverage date back several years, but I am now proud to report my biggest milestone to date. Instead of running all of these tests on various clusters I have access to (and running into problems with random test failures), I am running them on the same servers that power Mozilla's build automation. How, you might ask? Read on to find out.
I've uploaded the results from running Linux64 debug builds (both LCOV results and my treemap application). My treemap application has gone through major revisions, too. It now uses CSS transitions instead of the framework's builtin transitions (no more slow script dialogs, and the animations are snappier, although Firefox 14 still chokes on the whole view and Firefox nightly appears to suddenly thrash memory). I've also tweaked the UI a bit to fix minor things that you probably wouldn't notice if I didn't point them out. Instructions on using the view are now in place (it turns out that treemaps aren't as intuitive as I feel them to be). You can also break down results by testsuite, although that feature was very hurriedly hacked in and has very obvious pain points. Code for this portion of the site can be found on github.
The other potentially interesting part is how I get this to work on Try. I don't have this in a public repository yet, but you can follow along the changes on the patch I pushed to try. This is how it works: first, I patch the mozconfigs to include gcov results. Then, I package up all of the notes files and include them in the tarball that contains firefox. Now is where the fun begins: at opportune places in all of the test suites, I output (to standard out) a base64-encoded tarball of all the data files collected during the run of the program. For test suites other than make check, I need to munge the environment to set GCOV_PREFIX to a directory where I can find all of the data files again.
When the magic patch is pushed to try, and after it builds, I can now find in all of the output log files of the build a tarball (or several, in the case of mochitest-other) of all the coverage data. The steps after this are done manually, by dint of only getting the try stuff working last night. I download all of the log files, and extract the tarballs of coverage data into a directory. Then I pull the gcov note files into that directory and run the LCOV tools to collect data. I use ccov (my work-in-progress to replace LCOV) to combine all the test suites into a single file and then produce the data for the nice tree view. LCOV is used to produce the detailed per-file data. After everything is produced, I gather it all up and upload it.
Where to from here? I want to automate the latter part of the process, as babysitting steps that take 10-20 minutes each is surprisingly unproductive. I also want to tackle a replacement for LCOV's HTML output, since it both takes forever to run (well over an hour), and the output is surprisingly lossy in information (you don't know which lines are covered by which tests, and weird corner cases in branch and function coverage could be presented better). Eliminating LCOV altogether is another goal, but I can live with it in the first data collection step for now because gcov does really crazy stuff in coverage. I also want to expand all of these tests to run on more than just one platform—ideally, I want code coverage for all test suites run on all platforms. Assuming, of course, releng doesn't decide to kill me first.
Mac OS X builds of Firefox now use clang, and as work was put in to make the switch happen, this necessitated testing individual versions of clang. This means we have infrastructure in place that makes it (relatively) easy to run tests on the try server with even patched versions of clang. Here's how you do it:
Step 1: Build your version of clang
This should be trivial, although one wrinkle is that you need to specify where the gcc-toolchain is located explicitly (/tools/gcc-4.5-0moz3 for now). If you're like me and lazy, you can just use the build-clang.py script to make the final tarball, after tweaking it to include your patch. Note that it expects to be located in specific directories (/builds/slave/moz-toolchain in particular). If you're building by hand, be sure to make a .tar.bz2 of the
Step 2: Place packages in appropriate location
The next script assumes things in particular places. It wants a directory layout that looks like:
$ pwd
/path/ending/in/clang-SVN revision
$ ls
clang-darwin.tar.bz2 clang-linux32.tar.bz2 clang-linux64.tar.bz2
Step 3: Make manifests
This script is create-manifest.py, which has a requirement of simplejson 2.5 (which is newer than what mozilla-central's virtualenv python provides, alas). If you don't have a new enough python environment, just eliminate the item_sort_key=key_sort parameter and live with the fact that your output files are going to change more lines when importing to mozilla-central. This step produces files like darwin.manifest; these should be copied to browser/config/tooltool-manifests/platform/clang.manifest and the respective releng.manifest. It will also produce files with long, ugly filenames that look like someone dumped out a SHA512 hash as the filename (this is in fact what happens).
Step 4: Upload the files
First, copy the SHA512-named filenames somewhere public, like people.mozilla.org. Next, go into #it on IRC and bug someone about uploading the files to the tooltool server. When they've been uploaded, proceed to the next step.
Step 5: Push to try
Remember to use the modified manifests produced in step 3. On Linux and Linux64, you'll also need to modify the mozconfigs to use clang instead of gcc. An example is here:
Disabling warnings as errors is also probably a good idea, since it seems that Linux people can't stop those extra semicolons sneaking in. Then, push to try and hope for the best!
I have been posting code-coverage results of comm-central off and on for several years. One common complaint I get from developers is that I don't run any of the mozilla-central testsuites. So I finally buckled down and built mozilla-central's coverage treemap (and LCOV output too).
The testsuites here correspond to running top-level check, mochitest-plain, mochitest-a11y, mochitest-ipcplugins, mochitest-chrome, reftest, crashtest, jstestbrowser, and xpcshell-tests, which should correspond to most of the test suite that Tinderbox runs (excluding Talos and some of the ipc-only-looking things). The LCOV output can break things down by test results, but the treemap still lacks this functionality (I only built in multiple test support to the framework this afternoon while waiting for things to compile).
Caveats of course. This is an x86-64 Linux opt-without-optimizations build. This isn't my laptop, and X forwarding failed, so I had to resort to using Xvfb for the display (which managed to crash during one test run). It seems that some of the mochitests failed due to not having focus, and I have no idea how to make Xvfb give it focus, so not all mochitests ran. Some of the mozapps tests just fail generally because of recursion issues. So this isn't exactly what the tinderboxes run. Oh, and gcov consistently fails to parse jschuff.cpp's coverage data.
Lcov is also getting more painful to use—I finished running tests on Saturday night, and it took me most of Sunday and Monday to actually get the output (!!!). Fortunately, I've reimplemented most of the functionality in my own coverage analysis scripts, so the only parts missing are branch coverage data and the generated HTML index which I want to integrate with my web UI anyways.
This has been a busy week. First, two days ago my advisor and I submitted a paper to POPL about what we've been up to for the last six months. Here's the abstract of our paper, which is called "A Lattice-Theoretical Approach to Deterministic Parallelism with Shared State":
We present a new model for deterministic-by-construction parallel
programming that generalizes existing single-assignment models
to allow multiple assignments that are monotonically increasing
with respect to a user-specified partial order. Our model achieves
determinism by using a novel shared data structure with an API that
allows only monotonic writes and restricted reads. We give a proof
of determinism for our model and show that it is expressive enough
to subsume diverse existing deterministic parallel models, while
providing a sound theoretical foundation for exploring controlled
nondeterminism.
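The abstract's "monotonic writes and restricted reads" can be illustrated with a toy sketch in modern Rust: an LVar over the max lattice on u64, where put joins the new value in (takes a max) and get is a threshold read that only reveals whether the value has passed a given point, never the exact current value. This is my own illustration of the idea, not the paper's formal API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// An LVar over the lattice (u64, max): writes can only move the value up.
struct MaxLVar(AtomicU64);

impl MaxLVar {
    fn new() -> Self { MaxLVar(AtomicU64::new(0)) }

    // Monotonic write: join the new value in with `max`.
    fn put(&self, v: u64) {
        self.0.fetch_max(v, Ordering::SeqCst);
    }

    // Threshold read: block until the value is at least `thresh`, then
    // return `thresh` itself -- never the exact current value -- so the
    // answer is identical under every interleaving of the puts.
    fn get(&self, thresh: u64) -> u64 {
        while self.0.load(Ordering::SeqCst) < thresh {
            thread::yield_now();
        }
        thresh
    }
}

fn main() {
    let lv = Arc::new(MaxLVar::new());
    let mut handles = vec![];
    for v in vec![3u64, 7, 5] {
        let lv = Arc::clone(&lv);
        handles.push(thread::spawn(move || lv.put(v)));
    }
    // Deterministic no matter which put lands first.
    assert_eq!(lv.get(7), 7);
    for h in handles { h.join().unwrap(); }
}
```

A get of the exact current value would expose the write order; restricting reads to thresholds is what keeps the model deterministic.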
There's also a Redex model, and the tech report that accompanies the paper will be up within a week or so. Hooray!
Second, the Rust team released Rust 0.3 yesterday, including, among other things, the feature I worked on for the first part of the summer, suffix inference for integer literals. I'm proud of this, since having to write suffixes on int literals was one of the more annoying things about the language (we had it tagged with "papercut" in the issue tracker).
Next, I'll be working on traits for Rust, but first I head to Oregon for OPLSS!
The benchmarks in my last post had one thing in common: all communication was one sender to one receiver. It’s surprising how often this is sufficient, but sooner or later we are going to need a way to have multiple tasks sending to the same receiver. I’ve been experimenting with two ways of supporting many senders and one receiver, and I now have some results to show.
The pipes library includes a select operation. This lets you listen on several receive endpoints simultaneously. Unfortunately, the single-use nature of endpoints makes select a little clunky to use. To help alleviate this, I added a port_set to the library. Port sets allow you to easily treat several receive endpoints as a unit. This allows send to still be very fast, but receive is a little bit slower due to the overhead setting up and tearing down the select operation. The current implementation for select is O(n) in the number of endpoints, so this works well for small numbers of tasks, but breaks down as things get bigger.
The other option is to slow down the sending end, using something I call a shared_chan. This is a send endpoint wrapped in an exclusive ARC. Now all the senders have to contend with each other, but the receive side is exactly as cheap as before. For cases where you have a lot of senders that send messages relatively infrequently, this will likely outperform the port_set approach, at least until select is faster.
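Modern Rust's std::sync::mpsc bakes the shared_chan shape in directly: the Sender is cloneable, so many tasks contend on the sending side while the single receiver stays cheap. A small sketch of that pattern (my own example, not the 2012 library):

```rust
use std::sync::mpsc::channel;
use std::thread;

fn total_from_workers(n_senders: u64) -> u64 {
    let (tx, rx) = channel::<u64>();
    for id in 0..n_senders {
        // Cloning the sender is the moral equivalent of shared_chan:
        // every worker shares the one receiving endpoint.
        let tx = tx.clone();
        thread::spawn(move || tx.send(id).unwrap());
    }
    drop(tx); // close our copy so the receive loop can terminate
    rx.iter().sum()
}

fn main() {
    // Workers send 0 + 1 + ... + 9 in some arbitrary order.
    assert_eq!(total_from_workers(10), 45);
}
```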
Both of these are sufficient to run the msgsend benchmark that I talked about at the beginning of all of this. Here are the results, combined with the previous numbers.
Language                     Messages per second   Comparison
Rust port_set                            881,578       232.8%
Scala                                    378,740       100.0%
Rust port/chan (updated)                 227,020        59.9%
Rust shared_chan                         173,436        45.8%
Erlang (Bare)                             78,670        20.8%
Erlang (OTP)                              76,405        20.2%
The most obvious thing is that the port_set version is over twice as fast as Scala, the previous winner. I also re-ran the port/chan version for comparison, and it got a little bit faster. There has been quite a bit of churn in Rust recently, so it’s quite possible that unrelated changes showed up here as improved performance.
Writing the port_set version proved the most interesting to me. Relying on select ended up relaxing some of the ordering guarantees. Previously, if Task A sent a message to Task C and then a message to Task B, and Task B waited to receive Task A’s message before sending its own message to Task C, we could count on Task C seeing Task A’s message before Task B’s. With the port_set, this is no longer true, although we still preserve the order of messages sent by a single task. An easy way to work around this, however, was to rely on pipes’ closure-reporting ability: the server could tell when a worker would send no more messages because it would detect when the worker closed its end of the pipe.
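For what it's worth, today's channels make the same trade: cross-sender ordering is unspecified, while per-sender ordering is guaranteed. A modern sketch that checks the per-sender guarantee (function name and message counts are mine):

```rust
use std::sync::mpsc;
use std::thread;

// Messages from different senders may interleave arbitrarily, but each
// sender's own messages still arrive in the order it sent them.
fn per_sender_order_holds(msgs: u32) -> bool {
    let (tx, rx) = mpsc::channel::<(usize, u32)>();
    for id in 0..2usize {
        let tx = tx.clone();
        thread::spawn(move || {
            for seq in 0..msgs {
                tx.send((id, seq)).unwrap();
            }
        });
    }
    drop(tx);
    let mut next = [0u32; 2]; // next expected sequence number per sender
    for (id, seq) in rx {
        if seq != next[id] {
            return false; // would indicate reordering within one sender
        }
        next[id] += 1;
    }
    next == [msgs, msgs]
}
```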
I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. This allows, among other things, for message send/receive fastpath that requires only two atomic swaps.
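The two-swap fastpath can be illustrated with a modern-Rust sketch of a one-shot handoff cell. This is my own construction, not the actual pipes implementation, and it spins where the real code would block on the scheduler; but it shows the shape: one atomic swap to publish, one to claim.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

const EMPTY: u8 = 0;
const FULL: u8 = 1;

// A one-shot packet: the sender performs one atomic swap to publish,
// and the receiver one atomic swap to claim.
struct Packet<T> {
    state: AtomicU8,
    payload: UnsafeCell<Option<T>>,
}
unsafe impl<T: Send> Sync for Packet<T> {}

fn send<T>(p: &Packet<T>, value: T) {
    unsafe { *p.payload.get() = Some(value) }; // write the payload first...
    p.state.swap(FULL, Ordering::Release); // ...then publish it: swap #1
}

fn recv<T>(p: &Packet<T>) -> T {
    // Spin until the sender's swap is visible (real pipes would block).
    while p.state.swap(EMPTY, Ordering::AcqRel) != FULL {
        thread::yield_now();
    }
    unsafe { (*p.payload.get()).take().unwrap() } // claimed by swap #2
}

fn handoff(v: i32) -> i32 {
    let p = Arc::new(Packet {
        state: AtomicU8::new(EMPTY),
        payload: UnsafeCell::new(None),
    });
    let p2 = Arc::clone(&p);
    let h = thread::spawn(move || send(&*p2, v));
    let got = recv(&*p);
    h.join().unwrap();
    got
}
```

The Release/Acquire pairing on the swaps is what makes the unsynchronized payload write safe to read on the other side.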
Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a ring. Each task sends a message to its right-hand neighbor and receives a message from its left-hand neighbor. This repeats for a while. At the end, we look at the total time taken divided by the number of messages. This gives us roughly the fastest we can send and receive a message, modulo some task-spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:
Sent 1000000 messages in 0.227634 seconds
4.39301e+06 messages / second
0.227634 µs / message
This is about 17x faster!
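The shape of the ring (though not the original benchmark code) can be sketched in modern Rust with std threads and channels; the function `ring` and its parameters are my own illustration:

```rust
use std::sync::mpsc;
use std::thread;

// A token travels around a ring of n tasks, being incremented at each hop,
// until it reaches `limit`. Returns the total number of messages received.
fn ring(n: usize, limit: usize) -> usize {
    // Channel i feeds task i; task i forwards to task (i + 1) % n.
    let (txs, rxs): (Vec<_>, Vec<_>) =
        (0..n).map(|_| mpsc::channel::<usize>()).unzip();
    let mut handles = Vec::new();
    for (i, rx) in rxs.into_iter().enumerate() {
        let tx = txs[(i + 1) % n].clone();
        handles.push(thread::spawn(move || {
            let mut received = 0usize;
            while let Ok(v) = rx.recv() {
                received += 1;
                if v >= limit {
                    break; // dropping `tx` here unwinds the rest of the ring
                }
                let _ = tx.send(v + 1);
            }
            received
        }));
    }
    txs[0].send(1).unwrap(); // inject the token
    drop(txs); // each channel now has exactly one live sender
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Timing `ring(n, messages)` and dividing by the message count gives the same per-message figure the post reports.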
It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.
To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol; it just spawns a task to compute something, which then sends a single message back to the spawner. Porting this was a relatively small change, yet it got measurable speedups. Here are the results.
Benchmark       Port/chan time (s)   Pipe time (s)   Improvement (%)
Graph500 PBFS             0.914772        0.777784             17.6%
The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (i.e. the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes, and here are the results.
Benchmark               Port/chan time (s)   Pipe time (s)   Improvement (%)
Shootout K-Nucleotide                4.335           3.125             38.7%
Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. In the case of the shootout-threadring benchmark, it actually got significantly slower when I moved to pipes. The thread ring benchmark seems to mostly be measuring the time to switch between tasks, as only one should be runnable at any given time. My hypothesis is that because message passing got faster, this test now hammers the scheduler synchronization code harder, leading to more slowdown due to contention. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should improve this benchmark as well.
Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.
About a month ago, I posted that I was going to be working on improving Rust’s message passing performance. I quickly threw together a prototype of a new communication system based on a shared queue protected by a mutex. This was about twice as fast as the existing system, because it removed the global lock from the messaging paths. But the prototype hurt expressiveness somewhat, and it still seemed we could do a lot better.
Rust has some extremely powerful features in its type system. The fact that it can deal with concepts like uniqueness, initialization status, copyability, and other traits mean we can encode some very powerful invariants. Thus, I took some inspiration from the Singularity OS and set out to see if I could encode something like channel contracts in Rust. The result is a proposal for a feature I’m calling pipes.
The way pipes work is that when you create a pipe you get two endpoints that are forever entangled together. One endpoint can send one message, and the other endpoint can receive that one message. Sending and receiving destroys the endpoint, but the operation also produces a new endpoint to continue the communication. Endpoints have a state associated with them, which specifies which messages can be sent or received. This information is encoded in the type system, so Rust can statically guarantee that no task will send a message that is not legal in the given state. Pipes are not copyable; they are always for 1:1 communication. However, endpoints can be sent between tasks.
Critical to pipes is the associated protocol specification. Protocols have two views: the client and the server. Protocols are always written from the perspective of the client. This decision was arbitrary, but in general it makes sense to only write down one side of the protocol; the other perspective is generated by reversing the direction of all the messages. Here’s an example of what I’m envisioning for a protocol specification.
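A reconstruction of that specification, in the era's proto! notation (state and message names are taken from the surrounding description and client code; the exact syntax is approximate and will not compile on modern Rust):

```
proto! bank {
    login:send {
        login(str, str) -> login_response
    }

    login_response:recv {
        ok -> connected,
        invalid -> login
    }

    connected:send {
        deposit(float) -> connected,
        withdrawal(float) -> withdrawal_response
    }

    withdrawal_response:recv {
        money(float) -> connected,
        insufficient_funds -> connected
    }
}
```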
This describes the protocol you might use in an online banking situation. The protocol has four states (login, login_response, connected and withdrawal_response), each one annotated with whether the sender is allowed to send or receive in that state. In this case, a client would start out in the login state, where the client can attempt to login with a username and password. After sending a login message, the protocol enters the login_response state, where the server informs the client that either the login succeeded (in which case the protocol transitions to the connected state), or the login failed, in which case the protocol returns to the login state and the client can retry.
From the connected state, the client can try to deposit or withdrawal money. We assume that depositing money never fails, so sending a deposit message results in the protocol staying in the connected state. On the other hand, withdrawal can fail, for example, if the account does not have enough money. To model this, sending a withdrawal message results in the protocol going to the withdrawal_response state. Here, the client waits to either receive the requested money, or for a message saying there was not enough money in the account. In both cases, we end up back in the connected state.
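The endpoint-consuming discipline described above can be sketched with typestate-style types in modern Rust (illustrative names and stubbed server replies of my own, not the 2012 pipes API): each protocol state is a distinct type, and every operation consumes its endpoint and returns the next state's endpoint, so an illegal message sequence fails to compile.

```rust
// Each state of the bank protocol, as a distinct type.
struct Login;
struct Connected;
struct WithdrawalResponse;

enum LoginResult {
    Ok(Connected),
    Invalid(Login), // failed login: back to the login state to retry
}

enum WithdrawalResult {
    Money(f64, Connected),
    InsufficientFunds(Connected), // either way we return to connected
}

impl Login {
    fn login(self, _user: &str, _pass: &str) -> LoginResult {
        LoginResult::Ok(Connected) // stub: a real client would ask the server
    }
}

impl Connected {
    fn deposit(self, _amount: f64) -> Connected {
        Connected // depositing never fails, so we stay connected
    }
    fn withdrawal(self, _amount: f64) -> WithdrawalResponse {
        WithdrawalResponse
    }
}

impl WithdrawalResponse {
    fn response(self) -> WithdrawalResult {
        WithdrawalResult::Money(50.0, Connected) // stub server reply
    }
}
```

Because every method takes `self` by value, the compiler rejects, say, a deposit before a successful login: the `Connected` value simply does not exist yet.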
Below is a code example showing how a client might use this protocol.
fn bank_client(+bank: bank::client::login) {
    import bank::*;

    let bank = client::login(bank, "theincredibleholk", "1234");
    let bank = alt recv(bank) {
        some(ok(connected)) {
            #move(connected)
        }
        some(invalid(_)) { fail "login unsuccessful" }
        none { fail "bank closed the connection" }
    };

    let bank = client::deposit(bank, 100.00);
    let bank = client::withdrawal(bank, 50.00);

    alt recv(bank) {
        some(money(m, _)) {
            io::println("Yay! I got money!");
        }
        some(insufficient_funds(_)) {
            fail "someone stole my money"
        }
        none {
            fail "bank closed the connection"
        }
    }
}
All of the code in this post works on the latest Rust compiler as of this morning. I’ve also started transitioning some of our benchmarks to the new pipe system, and the results have been impressive. I’ll have a post diving into the performance of pipes soon.
For the last week or so, I’ve been mostly fighting with the performance of the Rust compiler. The rustc performance graph from the RustBot page tells a lot of the story:
The wildly varying performance is largely due to a change I made to move the code for addition on vectors out of the compiler and into the core library. The compiler used to generate very optimized vector code. For example, when doing v += [x] (that is, appending a single element to a vector in place), the compiler would avoid allocating a temporary vector for the [x]. In effect, the compiler generated what amounts to an inline call to vec::push.
My work involved adding an overloaded addition operator to Rust’s core library, and changing the compiler to use that version instead of its generated version. The landing of this change corresponds to about build number 70 or so on the graph, where there is a small but significant jump in the time rustc took. Due to the way we handle overloaded operators, the compiler must insert many more vector copies than it did before. It seemed too good to be true that I would only lose a small amount of performance.
And it was too good to be true. It turns out that, through a bunch of rebasing, I had managed to undo a small switch in is_binopable that controlled whether my new code ever got called. In the meantime, Sully has been doing a lot of work on the way Rust handles vectors, and he landed a change around build 90 in the performance graph above that ended up causing Rust to use the library version of vector append instead. This was more like the performance decrease I was expecting.
In order to fix this, I had to track down all the places in the compiler where we were using the addition operator on vectors, and replace them with things like vec::push, vec::push_all, vec::append and vec::append_one. This was tedious work, and I did it in stages. Each spot where the execution time improves in the graph is where I landed a couple more of these changes. Once this was finally done, we can see that rustc’s performance was even slightly better than it was before. I believe this is because the library versions of these functions are able to remove some copies that rustc could not remove before. Sadly, the code is much uglier than it was before. We are discussing the ability to specify modes on the self argument, which will allow the compiler to again avoid unnecessary copies when using the plus operator. This should allow us to make using the plus operator just as fast as the underlying library functions.
The moral of the story is that copies can kill performance. Rust warns often when it is inserting copies you may not have expected. Pay attention to these warnings and you will write much faster code.
Yes yes yes. Benchmarking Timelapse is something I’ve resisted doing, because there are no convincing metrics to be derived using standard benchmarks. Since the user ultimately cares about FPS, this type of solution seems promising.
Previously, I talked about a couple of ways to improve the message passing performance in Rust. The obvious bottleneck was that sending and receiving both involved taking a lock that was shared between all tasks. In order to remove this lock, I set out to write a new port/channel system that relied much less on the runtime library.
In order to do this, we first needed a variant of the atomic reference counter that allows us to share mutable data. We accomplish this by adding a mutex and condition variable. The mutex is your standard pthread mutex, while we’ve implemented our own condition variable that the Rust scheduler is aware of. Because we’re using a standard pthread mutex, using this exclusive access is unsafe and should be used with care; if you’re not careful, you could deadlock Rust. Fortunately, the API for Rust’s low-level locks and condition variables makes it harder to accidentally hold a lock for unbounded amounts of time.
Once we have this, ports and channels simply become a locked atomically reference-counted queue. There’s no more global lock, and things are far simpler because we only need the Rust runtime support for providing mutual exclusion and condition variable signalling. This part didn’t take long to implement, and I was eager to try it out. I spent a little while writing up a new benchmark that would really show the benefit of avoiding the global lock, and when I went to run it, things crashed.
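The "locked, atomically reference-counted queue" combination can be sketched in modern Rust with std's Arc, Mutex, and Condvar. This is my own minimal version for illustration, not the 2012 code:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A port as a locked, reference-counted queue: Arc shares it between
// tasks, the Mutex gives exclusive access, and the Condvar lets a
// receiver block until a sender signals that data is available.
struct Queue<T> {
    inner: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> Queue<T> {
    fn new() -> Self {
        Queue { inner: Mutex::new(VecDeque::new()), ready: Condvar::new() }
    }
    fn send(&self, v: T) {
        self.inner.lock().unwrap().push_back(v);
        self.ready.notify_one();
    }
    fn recv(&self) -> T {
        let mut q = self.inner.lock().unwrap();
        loop {
            if let Some(v) = q.pop_front() {
                return v;
            }
            q = self.ready.wait(q).unwrap(); // sleep until a sender signals
        }
    }
}

fn demo() -> i32 {
    let q = Arc::new(Queue::new());
    let q2 = Arc::clone(&q);
    let h = thread::spawn(move || {
        for i in 1..=10 {
            q2.send(i);
        }
    });
    let sum: i32 = (0..10).map(|_| q.recv()).sum();
    h.join().unwrap();
    sum
}
```

The recv loop re-checks the queue after every wakeup, which is the standard guard against spurious condition-variable wakeups.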
It turns out, I had discovered a bug whereby vector addition could copy things that weren’t copyable. I saw two approaches to fixing this: fixing trans (the Rust to LLVM translation pass), or moving vector addition out of trans and into the library. The idea behind the second option is that if we did this, vector addition would be going through the existing and well-tested function call path, rather than a special vector addition codegen path. This seemed like the best option overall, since the first option felt more like a band aid for the root cause that we have too much functionality that is duplicated in subtly different ways.
Thus, I set out to move vector addition to libcore. This exposed some subtle semantic issues around const vectors, but these were mostly not too painful to work out. My first working versions were too slow to finish building the compiler in a reasonable amount of time. I was expecting we’d take a 20% to 2x performance hit from this change, but thanks to some heroic optimization help from Niko, we got the performance to the point where in some cases the new code even performs ever so slightly better than the old code. In the course of making these changes, I also discovered another bug that led to us leaking memory, and in the course of fixing that, I discovered a way to make Rust segfault. Ah, the life of a compiler writer.
At any rate, we have fixes to all of these bugs in the works, and things are working well enough to run a benchmark. This test creates a ring of tasks, each of which sends a message to its neighbor on one side and receives from the other side. Here are the numbers for the old messaging system.
Sent 1000000 messages in 3.88114 seconds
257656 messages / second
3.88114 μs / message
And here are the numbers for the new system.
Sent 1000000 messages in 1.87881 seconds
532253 messages / second
1.87881 μs / message
As you can see, we’re about 1.9x faster than we were before.
This new system doesn’t yet have the exact same features as the old system, and it needs some help with ergonomics. My next task will be to work on adding these missing things and hopefully be able to incrementally replace the old code with the new, faster version.
The code for this post isn’t yet in the main Rust tree, but it should be landing soon as we squash the bugs we found in the process of doing this new message passing system.
Today we’re going to take a brief look at message passing performance in Rust. Rust is already pretty quick on this front, but there are some really obvious possibilities for making it even faster. It looks like I’ll be spending most of the summer working on this, so I thought it’d be good to start out with some baseline performance numbers. Rust’s testsuite has a benchmark called msgsend.rs, which is a microbenchmark for message passing performance. It is based on a benchmark from Paul Keeble that originally compared Erlang and Scala. Here’s a table summarizing the results:
Language        Messages per second   Comparison
Scala                       378,740       100.0%
Rust                        175,373        46.3%
Erlang (Bare)                78,670        20.8%
Erlang (OTP)                 76,405        20.2%
These numbers were generated on an Early 2011 MacBook Pro 2.3 GHz Intel Core i7 with 8GB RAM and Lion 10.7.4. Don’t read too much into these numbers; they are incredibly unscientific. The only real use for them will be to see if running the same benchmark for Rust in a few weeks yields a larger number. It’s also worth noting that my results disagree with the original author, who saw Erlang performing about 6 times faster than Scala. I suspect that Erlang’s message passing may not be as efficient on Mac OS X, but I do not know what system the tests were originally run on.
So, where do we go from here? I mentioned that there are some obvious first steps. The most obvious is that right now sending a message involves taking a global lock. In effect, this means we can only send one message at a time. Suppose we had four tasks, A, B, C and D, where A was sending to B and C to D. Intuitively, these two sends should be completely independent of each other, but this is not currently the case in Rust. This is one thing I hope to change.
It would be nice if we could do at least some of the communication without any locks at all, using a lock-free data structure. This is worth experimenting with, but we’ll have to wait for the benchmarks to show whether this is the best idea.
Somewhat orthogonal to message passing performance, I hope to be able to implement more of the communication system in Rust itself. When the communication system was first written, we had to put almost all of it in the runtime library that was written in C++. Rust is significantly more powerful now, and it seems like it would not take too much more work to write the communication system almost entirely in Rust.
In my last post, we saw how to get a parallel speedup on a breadth first search in Rust. One major flaw with this was that we had to use unsafe pointers all over the place to prevent from copying large data structures around. Given that Rust’s slogan is “a safe, concurrent, practical language,” it would be terribly sad to sacrifice safety in order to get concurrency. Fortunately, the tricks we were doing before were mostly safe. A human reviewer could see without too much trouble that the data we were sharing was immutable (so we wouldn’t have to worry about race conditions), and that we were careful to manage task and data lifetimes so we wouldn’t have a use-after-free error. Now the goal is to teach the compiler to verify the safety of these patterns on its own.
There are two safety properties we are concerned with:
Only immutable (i.e. constant) data can be shared.
Shared data must remain live as long as any task might access it.
For point (1), we have solved this by introducing a const kind. For the purposes of Rust, constant types are basically those which do not contain the mut annotation anywhere in them. Getting the details of this just right takes some care, and we will probably have to tweak the rules as time goes on. One current weakness is that only item functions (i.e. those that are not closures) count as const. Many Rust types already end up being const, which probably falls out of Rust’s immutable-by-default philosophy.
For (2), we’ve added a new module to the library called arc. This stands for atomic reference counting. This is a somewhat unfortunate name, as it is so similar to the more common automatic reference counting. We’d be happy to entertain suggestions for better names. The idea behind ARC is that it wraps over some piece of constant data. The ARC is non-copyable, but it can be sent (this required another tweak to Rust’s kind system to allow sendable resources). When you want to access the data the ARC is managing, you can ask it for a reference. Importantly, ARC provides a clone operation, which increments the reference count and returns another ARC pointing to the same data. When the ARC object goes out of scope, it decrements the reference count and potentially frees the data it is wrapping. ARCs can be safely shared between tasks because all reference counting is done using atomic operations. Additionally, Rust’s region system ensures that the ARC object outlives any references to the data inside of the ARC.
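Today's std::sync::Arc is the direct descendant of this design; a small sketch of the sharing pattern in modern Rust (function name and numbers are mine):

```rust
use std::sync::Arc;
use std::thread;

// Sharing immutable data between tasks via atomic reference counting:
// clone() bumps the count instead of deep-copying the vector, and the
// allocation is freed when the last Arc is dropped.
fn shared_sum() -> u64 {
    let data: Arc<Vec<u64>> = Arc::new((0..1000).collect());
    let mut handles = Vec::new();
    for i in 0..4 {
        let data = Arc::clone(&data); // cheap: no copy of the vector itself
        handles.push(thread::spawn(move || {
            // Each task sums a strided slice of the shared, read-only data.
            data.iter().skip(i).step_by(4).sum::<u64>()
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```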
We were able to implement ARC entirely in the Rust Standard Library, with a little bit of runtime support for atomic operations. ARC itself required some unsafe code in its implementation, but this fits with the “practical” part of Rust. We can implement limited parts of the program using unsafe primitives, and bless them as safe through manual inspection. Users can then build safe programs from the components in the standard library. Indeed, the actual parallel breadth first code no longer needs the unsafe keyword.
Along the way, Niko and I were also able to solve some annoying compiler bugs that make our standard library much more functional. The BFS code is also in the main Rust tree now, and later today I’m hoping to land a few more changes to the libraries to make building other parallel programs easier.
I’m happy to report that my parallel breadth first search program in Rust gets a 2.8 to 3x speedup now on my quad core MacBook Pro. While not in the ideal 4-8x range, I’m pretty satisfied with a 3x speedup. As you recall from the last post, the majority of the time was still spent doing memory allocation and copies. I tried several things to finally get a speedup, only one of which had much noticeable effect:
Allocate and update the result storage in place
Reuse tasks
Share more with unsafe pointers
Allocate and update in place
Allocation in place was my hypothesis from the last post. Basically, you allocate a big mutable vector for the results ahead of time. Then, each of the worker tasks writes its results directly into the final location, rather than incurring a bunch of memory-copy costs to aggregate the results in a second step. Part of the reason why this seems like a good idea comes from the plot below of CPU usage over time.
The benchmark searches for 64 randomly selected keys. This graph shows about one and a half of the searches. Each search consists of a parallel step to update all the vertex colors (from white to gray to black), and then a synchronization step to aggregate the results together. This repeats until no more nodes are gray. Traversing the whole graph should take about the diameter of the graph, which is usually about 4. This matches what we see on the graph. There are a bunch of spikes where all the cores are busy, followed by some sequential synchronization time. Each search actually consists of 8 spikes because there is a parallel step to test if we’re done and a parallel step to do the actual work.
The idea was that the sequential time between each spike came from appending all the intermediate result vectors together. By allocating in place, we should have been able to avoid this cost. Unfortunately, I did not see a noticeable performance improvement by doing this.
Reusing tasks
My next idea was that maybe we were spending too much time in task creation and destruction. To test this, I wrote a task pool, which would use the same tasks to process multiple work items. By design, Rust’s task primitives are very cheap, so this shouldn’t be necessary in practice. Once again, this transformation did not noticeably affect performance. This is actually good news, because it means Rust delivers on its goal of making spawning tasks cheap enough that developers shouldn’t worry about spawning too many.
Share more with unsafe pointers
After two failed attempts, Patrick Walton and I took a more in-depth look at the profiles. We discovered that compiling with the -g flag lets Instruments show us a lot more useful information about where we are spending our time. Patrick’s diagnosis was that we were spending too much time allocating and copying, and he advised me to try to track down the expensive copies. The profiles showed much of the time was spent in the beginning and end of the parallel block, meaning the problems were probably due to copying and freeing things in the closure. There are two fairly large structures that this code used: the edge list and the color vector. I replaced these with unsafe pointers to ensure they were actually shared rather than copied. Finally, I saw a parallel speedup! The edge list was a couple hundred megabytes, so this explains why copying it was rather expensive.
Future directions
My experience again shows several ways we can improve Rust. The most obvious is that we need a way to share data between tasks; copying is just too expensive. For the most part, this isn’t a problem since we would only be sharing read-only data. The trickiness comes in when we have to decide when to reclaim the shared data. One way of doing this is to use the patient parent pattern. A parent can spawn several parallel tasks which each get read-only access to all of the parent’s data. The parent suspends execution until all of the children complete, at which point the parent would free the data through the usual means. Here the children do not worry about freeing these structures at all.
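Modern Rust eventually gained a direct expression of the patient parent pattern in std::thread::scope; a sketch of how it reads (modern syntax, function name mine):

```rust
use std::thread;

// The "patient parent" pattern: the parent lends read-only borrows of its
// data to the children and blocks until all of them finish, so no copying
// or reference counting is needed and the parent frees the data as usual.
fn patient_parent() -> usize {
    let big = vec![1usize; 1_000_000]; // owned by the parent
    let total = thread::scope(|s| {
        let handles: Vec<_> = big
            .chunks(250_000)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<usize>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    }); // every child has exited by here
    total // `big` is dropped normally when it goes out of scope
}
```

The scope guarantees the children cannot outlive the borrowed data, which is exactly the lifetime discipline described above.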
Another approach is to use atomic reference counting for data that is shared between tasks. This gives more flexibility, because the lifetimes of tasks are not strictly bounded by their parents. It does mean we pay the cost of atomic reference counting, but this will hopefully not be bad since we only have to update reference counts when data structures cross task boundaries.
Finally, there are some smaller changes that would be nice. I spent a while worrying about time spent appending vectors. It might be worth adding a tree-vector to the standard library, which gives logarithmic time element lookup but constant time appending. Since the root cause of my performance problems was a large implicit copy, it’d be nice to have the compiler warn on potentially large copies, or maybe even make it an error where the programmer must explicitly copy something if this is really desired. As another optimization, it might be worth keeping tasks around once they die and reusing them to hopefully reduce the overhead of creating tasks even further. I have a page about this in the Rust wiki, but we’ll need more evidence that this is a good idea before going for it.
I’ve continued tuning the Graph500 code I talked about yesterday. I still haven’t managed to achieve a speedup over the sequential version, but I’m now running about an order of magnitude faster than I was yesterday.
If you recall, the last profile from yesterday showed that we were spending an enormous amount of time waiting on a spinlock related to malloc and free. The problem was basically that the closure we were passing to par::mapi included the adjacency lists for the whole graph. In order to guarantee safety, Rust was copying the adjacency lists into each task. Since these are immutable, however, we should have been able to share them in place.
It turns out Rust has a way to let you do this by dropping into the unsafe portions of the language. Now, instead of duplicating the closure, we simply pass an unsafe pointer to the closure to the spawned tasks. These dereference it in place, and we avoid the copy. Now, instead of having a 50-100x slowdown, we’re only at a 4-5x slowdown. The profile looks like this:
The troublesome spinlock from before only accounts for 3.7% of our time. The 4th function, which has 5.3% of the time, is the main worker for the breadth first search. Ideally, this would account for the majority of the time. The top two functions, however, appear to have to do with allocating and zeroing memory.
I also tried using unsafe pointers to the input vectors, but this did not make a material difference in the performance. I suspect the bzero and memmove time mostly comes from aggregating the results from each of the parallel tasks together. The next obvious optimizations are to try to pre-allocate the result space and update these in place.
I'm back in Mountain View, living at the same apartment complex, working at the same place with the same people, waiting at the same traffic lights on my way to and from work. Even the dented PLDI water bottle that I left in the kitchen cabinet at the office was still there after nine months.1
If I'm lucky, I might even manage to convince myself that I'm qualified for, and capable of, doing my job, on account of having literally done it before.
I'm actually delighted about this. For background, last summer all the Rust interns went to PLDI and received, among other swag, water bottles that I liked a lot. But my bottle, after being crushed underneath the fold-down bed in Alex's and my apartment, was dented badly enough that it was more or less unusable. Sully gave me his bottle as a replacement, and I used it happily for some time, but during a bike ride this past winter, the cap flew off and was lost forever, making that bottle more or less unusable. So it was great to come back to Mozilla and find my dented bottle (with intact cap), which, together with Sully's intact bottle (with missing cap), leaves me with a complete, intact bottle-and-cap pair!
Dave’s efforts to expose undergraduate computer science students to the realities of Firefox development are inspiring and unparalleled in Higher Ed. I highly recommend those interested in CS Ed to familiarize themselves with his past teaching work at Seneca, and the extremely positive outcomes for his students.
Similar to the previously-posted PoC Light Table, but less ambitious and actually available. Also has a lot of headroom wrt. visualizing code coverage, data flow, …
Back in September, I wrote about what the first half of grad school was like for me. That post covers a period of about three years, from the middle of 2008 to the middle of 2011. I've been wanting very much to write the companion piece covering the second half of grad school, but so far, I've been hampered by the fact that most of the second half of grad school has not actually transpired yet.
Still, I'm willing to give it a shot!1
To recap from last time: while I was at Mozilla working on Rust last summer, my advisor invited me to come with her to Northeastern University for the rest of my Ph.D., which meant that I had to choose between IU and NU. It was an extraordinarily difficult decision to make. NU has a world-class group of PL researchers; living in Boston probably would have been cool; and, of course, I'd have gotten to keep working with Amal in person. But my husband, Alex, is also in the middle of his Ph.D., and while it hasn't been nonstop rainbows and sunshine for him here at IU, starting over at yet another school (assuming he could find a graduate program in Boston that was a good fit for him), and then probably having to take yet more courses and jump through more qualifying exam hoops, would be far worse. Living apart isn't a viable option for us, either -- we did that for a year and a half, and we're not interested in trying to do it again. (Empirically, if we're apart for more than a week these days, I go nocturnal and spend all my time either reading genre fiction or watching the same Lady Gaga video over and over. Bad scene.)
So, when I got back to school last August, I started working with a new advisor, Ryan Newton, who had just joined the faculty at IU. I'm now being advised by both Amal and Ryan, but Ryan is who I see day to day. I've spent a lot of time in the last several months telling Ryan about my work with Amal, and now that Ryan and I have a project going as well, I've started to do some of the reverse. Happily, I learn a lot in the process of doing so!
In order to get a better sense of the kind of research Ryan does, I took his course on embedded DSLs for parallelism in the fall. This was a new area for me to be reading in, and one of the papers that particularly grabbed me was about the Concurrent Collections system that Ryan had worked on while he was at Intel (abbreviated "CnC", as per some Intel marketing team). The paper had a proof of determinism for a parallel language called Featherweight CnC that's a lightweight model of full CnC. I noticed that the determinism proof hinges on a monotonicity lemma that says that if one state s in the semantics steps to another state s', then either s' is an error, or the store in s' is a valid extension of the store that s had. The monotonicity lemma, in turn, hinges on a single-assignment property, which states that we can only write to previously unused memory locations. It was the sort of thing that smelled like it might fit into a possible-worlds semantic model like the ones that I have some familiarity with from working with Amal. I was also curious about the analogy between the single-assignment property and the "weak updates only" requirement that some systems have, which requires that updates to a store can only change the value, not the type, of the contents of a cell -- and whether any of the tricks that allow loosening of the weak-updates-only requirement had analogues that would allow loosening of the single-assignment restriction.
When I went to talk to Ryan about that, it turned out that he had also been thinking about loosening the single-assignment restriction on single-celled I-structures, sometimes known as IVars, which are the communication mechanism that the Haskell monad-par package uses. If it's okay to write to a cell once, it seems like it should be fine to write the same value multiple times. But at that point, it would be interesting to try to loosen the notion of "same" and replace it with a less restrictive equivalence relation. "Values that unify with each other" is one possibility that Ryan brought up; for me, good ol' "values that are indistinguishable for a given number of steps" comes to mind. Also, what would happen if we parameterized a language by an equivalence relation for sameishness, lambda-eek-style? What properties would have to hold of that equivalence relation to make the language deterministic? (Wren sloganed this idea as "determinism modulo theories", which I like.) Or, what about only allowing writes of values that are monotonically increasing according to some partial order that we also parameterize the language by? And speaking of monotonicity: what do the notions of monotonicity that turn up in various models of deterministic parallel computation -- for instance, in monad-par, in CnC, and in Kahn process networks -- have in common? What would it look like to have a unified theory of monotonicity that's general enough to account for existing deterministic parallel models as well as guide the design of new ones?
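To make the "same value twice is fine" idea concrete, here's a minimal sketch (hypothetical, not the monad-par API or anything we've formalized) of an IVar that relaxes single assignment: repeated writes are allowed as long as the new value is "the same" as the old one, where sameness is a pluggable equivalence relation, which is the "determinism modulo theories" idea in miniature:

```python
import threading

# A hypothetical relaxed IVar: a write-once cell that tolerates re-writes
# of an equivalent value and rejects conflicting ones. The `same` parameter
# is the equivalence relation the cell is parameterized by.

class ConflictError(Exception):
    pass

class IVar:
    def __init__(self, same=lambda a, b: a == b):
        self._lock = threading.Lock()
        self._filled = threading.Event()
        self._value = None
        self._same = same

    def put(self, value):
        with self._lock:
            if self._filled.is_set():
                if self._same(self._value, value):
                    return  # an equivalent re-write changes nothing observable
                raise ConflictError("conflicting write would break determinism")
            self._value = value
            self._filled.set()

    def get(self):
        self._filled.wait()  # block until some writer fills the cell
        return self._value

iv = IVar()
iv.put(42)
iv.put(42)  # writing the same value a second time is fine
assert iv.get() == 42

# A coarser relation: values equal modulo sign count as "the same" write.
jv = IVar(same=lambda a, b: abs(a) == abs(b))
jv.put(-3)
jv.put(3)   # allowed under this equivalence relation
assert jv.get() == -3
```

For the cell to actually be deterministic, `same` would need to be a genuine equivalence relation that's respected by every observation a reader can make; pinning down exactly what properties are required is the interesting question.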
By the time we were asking all these questions, it was late November, and Ryan suggested that we end the semester by writing a grant proposal. The NSF solicitation that he had in mind had a deadline of December 19th, so there was no time to waste, but we worked hard and got the proposal submitted on time. (Hooray!) "Generalizing Monotonic Data Structures for Expressive, Deterministic Parallel Programming" is the title of our proposal. It'll probably be a few months before we find out whether it will be recommended for funding, but one nice thing about being in theoretical CS is that there's no big outlay of funds needed to get a project off the ground -- the tools of our trade are pretty cheap.2 So we've been able to get started on the work anyway, and it's going to be my main focus for the next month.
In the meantime, the project that I've been working on with Amal for the last two years (!) is still underway. I made a trip to Boston back in October to work on it, and I'll be there again in mid-March. We were also able to sneak a little time to work on it during POPL last month. (While at POPL, which was awesome, I also had the chance to pick lots of clever people's brains about monotonic data structures, and I got some cool ideas that I'm still in the process of following up on and that I hope to have more to say about soon.) So, I'm still juggling separate research projects, but so far, I've mostly managed to work on each one for enough time at a stretch that the context-switching overhead doesn't kill my productivity. It also helps that I'm finally done taking classes, plus, this semester Ryan is funding me, so I don't have to work as a teaching assistant in order to pay the bills. So, all things considered, I'm very happy about how things are going! Although it's too early to say for sure, I think it's likely that I'll propose a thesis that has to do with the monotonic data structures project sometime in the next year or so, after we've submitted a paper or two. I'm excited about that, and I'm looking forward to seeing what the next couple of years will bring.
I'll end on that note, but first, one thing bears some explanation. Back in July, I was thinking I might want to develop an intermediate language for LLVM-based compilers (of which the Rust compiler is an example) as my dissertation work. Although that would be a cool project, it probably isn't the right project for me. I'm interested in doing more theoretical work. I probably won't be interested in doing more theoretical work forever, and in fact, I really enjoy doing heads-down compiler hacking, plan to do some more of it this summer, and could imagine myself doing a lot more of it after I graduate. I just don't particularly want to do it for my Ph.D.
If I'm completely, unflinchingly honest with myself, one reason that hacking on a Rust-related project came to mind as a potential thesis topic back in July was because right then, I was grappling with the decision between IU and NU, and some part of me fantasized that if I could get Mozilla to fund my research, then it wouldn't matter so much who my advisor or institutional affiliation was, and I could just avoid the decision entirely. Problem solved, right? Except, no, not at all, because the thing is that I would like to actually get a Ph.D. at the end of all this, and Mozilla, for all the cool things that it is, is not a degree-granting institution.3 Moreover, Mozilla would like me to actually get a Ph.D. at the end of all this, and as Dave Herman, my fantastic mentor at Mozilla, pointed out, they can only fund Ph.D. students for projects that the student's advisor is actively involved in, because otherwise, the student would essentially be doing contract work for Mozilla. That's not to say that there's anything wrong with doing contract work for Mozilla, of course, but you can't get a Ph.D. that way. An advisor is much more than a source of funding; an advisor is someone who guides their students through the grad school process and, along with the rest of a student's committee, eventually signs off on their Ph.D. So, although there are various ways to fund a Ph.D., to actually get one you really need to have an advisor -- or two.
Part of my job as a researcher is to try to make informed guesses about the future, right? If I wanted to really be cute, I could even title this post "The Next 700 Days of Grad School." The timing probably isn't even too far off.
Click this link. It’s one of the best overviews of architecture/strategy differences between Google, Amazon, Facebook, and Microsoft. I also think it is a good approximation to the differences between Firefox and Webkit as platforms.
My colleague Jamie already broke the news a while ago, so I guess there's nothing stopping me: my adviser1, Amal, has left Indiana University and joined the faculty at Northeastern. I've sat down to try to write about that a dozen times since Amal told me the news in mid-June, but it appears that my attempts to do so have instead yielded a 4000-word screed about how my Ph.D. career had been progressing up until about mid-June.
I'd like to share that story with you, but before we start, a warning about me. Some people manage to keep dedicated research blogs that they keep entirely separate from wherever it is, if anywhere, that they write about goings-on in their personal lives. I am not one of those people. I love to write about research, but if I have a research-related epiphany, not only am I going to tell you what it is, I'm going to tell you which metaphorical or actual bathroom stall it happened in. At length. I can do pithy, too, but if "pithy" is what you're after right now, I think you're in the wrong place, friend.
Okay; warning imparted -- off we go, then!
What I usually tell people is that for the last three years, I've been a grad student in the programming languages group at my school. That isn't quite true. In fact, I joined IU without being a member of any research group. In my first year, I had a hooray-you're-a-girl fellowship, so I didn't need to teach or do research in order to have funding; in effect, that meant that I was able to treat the first part of grad school as an extension of undergrad.2 All I had to do was take classes, and that was something I already knew how to do, so I did pretty well at it. I finished my first semester with excellent grades and a feeling of smug superiority to everyone I'd ever heard say that grad school was hard.
Then, Dan asked me to help teach his course during my second semester. My fellowship would have supported me until the end of the year, so I didn't have to teach anything. But I'd loved taking the course in the fall, and I was flattered to be asked to help out, so I went for it. Dan's student at the time, Will, was deeply into dissertation mode, so working for Dan meant a lot of hanging around Will as he worked on miniKanren. That was how I began to dip a toe into research, which is, I'm sure, what Dan had intended to have happen all along. For Dan, teaching and research are inextricably linked. Even just the teaching part by itself was a challenge for me, though, and I was doing it on top of my normal course load, which was harder than the previous semester's, since now all the courses I was taking were covering material I hadn't seen before. I stopped feeling smug pretty quickly, and it wasn't too long before I was really getting killed.3
Halfway through the spring, Dan invited me to spend the summer doing research with him and Will, and I very nearly declined the invitation, because I was exhausted and desperately needed a break. But I wrote to John Stone for advice, and thankfully, he talked me out of saying no. In May, I declared victory over the bloody remains of the semester and stuck around to hang out with Dan and Will over the summer. I think that it was at that point that I stopped treating grad school like Undergrad, Part Deux and started really attempting to be a researcher.
The summer was a mixed bag, though. I got a first publication out of it, and even an attempt at a second, plus quite a lot of free pizza. But I never felt like I knew what the hell I was doing, and now I can put my finger on at least one of the things that was wrong. In a theoretical field like ours, I think that reading is one of the most important parts of doing research, and I wasn't doing nearly enough reading. When I was working on our FLOPS submission, I didn't understand that citing someone else's work should mean not only that you've read and understood the relevant part of that person's work, but also that you've surveyed the rest of the nearby research landscape and come up with a reason why that particular paper, rather than some other one, is the right one to cite for your purposes. You might have to look at several papers for every one you actually end up citing. I was citing all kinds of stuff I hadn't really read. Moreover, because I hadn't read enough papers, I didn't have a sense of what a good research paper was supposed to look like or sound like or be. That all goes a long way toward explaining why the FLOPS submission was rejected. At no point had anyone sat me down and said, "You want to be a researcher? You need to learn to read papers if you want to learn to write them."
Amal joined the faculty at IU that fall, just as my second year was starting, and I took her type theory course and loved it. We began working on research together in the spring, and it started to become clear that working with Amal was a good fit for me. Part of the reason I didn't do very much reading when I was working with Dan is that Dan's modus operandi with students is to give them fiendish puzzles to solve -- koans, you might say -- and then actively discourage them from reading anything that might give away the answer, because that would deprive them of the learning experience that comes from figuring it out for themselves. For a certain kind of learner, I'm sure that that approach leads to enlightenment. For me, as a first-year grad student, it just led to frustration. But Amal was different: she gave me piles of stuff to read! I took two more courses from her during spring and fall 2010, both research seminars, and it was in those courses that I finally started to read a lot of papers. Some were more relevant to me than others, but they all filled the need of getting me to just read a whole lot of PL papers, so that I would know how to read papers and therefore be prepared to absorb information from sources that were relevant.
And "prepared to absorb information" was what I needed to be. The project that Amal had suggested for us to work on was an offshoot of research that she had done with Jacob Matthews in 2008. They had published a paper that dealt with how the parametricity property could be made to hold for a language that was partly statically typed and partly dynamically typed, where the two languages were combined using a Matthews-Findler-style multi-language embedding. Having dynamically typed code in the mix would ordinarily mean giving up on parametricity, but their multi-language system used a technique called dynamic sealing that forces programs to behave parametrically at run-time, or die trying. Dynamic sealing wasn't a new idea, but Jacob and Amal were the first to present a (certain style of) proof that it worked as advertised to preserve parametricity. The only problem was that their work had turned out to have a flaw in it that invalidated the proof.4 But Amal had an idea for fixing the problem, and the plan was that we'd fix it, redo the proof, revise the paper to go along with it, and then publish the updated paper as a journal article together.
The Brian Kernighan quotation goes, "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." This is true of math as well, and the flawed proof that we were to fix employed a technique that I would go so far as to call not only clever, but actually kind of sneaky. (All the best math is sneaky.) When we started on the project in January 2010, Amal thought we'd have it done by May. What actually happened, of course, was that it took until May for me to get enough background knowledge to be able to even start the project. I had only just started studying type theory at all, and I'd only had the briefest of encounters with logical relations as a proof technique, let alone the step-indexed logical relations that this proof used. I couldn't even do an SILR proof, much less debug one. Nor did I understand what a multi-language embedding was, how dynamic sealing worked, or even what parametricity was really all about. So I had a long way to go before I was finally ready to start working on the project in May.
According to the commit logs, I then worked on it from precisely May 1 to May 19, at which point I scuttled off to my summer 2010 internship at GrammaTech and proceeded to not look at the thing for four months. Working at GT was great, but when I came back in the fall after my internship ended and tried to pick up where I'd left off on our project in May, I had a hard time remembering what any of it really meant or where we were headed with it. I flailed a bit, thought a lot, wrote down some notes, and thought I'd gotten a decent handle on things again. But then I went to ICFP in Baltimore at the end of September, and Derek Dreyer asked me a very simple question about the project that I should have been able to answer but couldn't, and I wanted to curl up and die right there under the buffet table.
So I got to work. I worked pretty hard on the project for the rest of that semester -- largely made possible by the fact that I was finally, in my third year, being paid to do research rather than being paid to teach5 -- and I slowly and gradually started to understand what the hell was going on.
In our multi-language system, parametricity on the statically typed side is enforced the way it's usually enforced: to wit, statically, by the type system! But on the dynamically typed side, it's enforced dynamically. A seal is nothing more than a unique value, generated at run-time by gensym or its equivalent. If you want to hide the number 5, you generate a unique seal, then wrap the 5 in a function that takes a seal as argument and compares whatever seal was passed in to your uniquely generated seal, which the function has stored in its closure. If the function's caller knows the correct seal and passes it in, you'll get the 5 back; otherwise, you'll get an error. It's possible to get information hiding, and therefore parametricity, by creating a seal in a private scope and using it to seal values before handing them out to clients, then unsealing them when they come back. And if dynamically typed code is embedded inside statically typed code, it can use this dynamic sealing mechanism to wrap values that came from the static side and had abstract types that were hidden behind type variables over on that side. Thus parametricity can be enforced even in the part of the system that isn't statically typed. Run-time is a rather strange time to be trying to enforce parametricity, but nevertheless, that's when we were doing it. So if we wanted to show that parametricity was actually being enforced, our proof needed to somehow account for the run-time semantics of our language.
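The sealing mechanism described above can be sketched in a few lines of Python (an illustration only, not the paper's formal system; `fresh_seal` stands in for gensym, using object identity for uniqueness):

```python
# A sketch of dynamic sealing: hide a value behind a freshly generated,
# unique seal, and release it only to callers who present that seal.

class SealError(Exception):
    pass

def fresh_seal():
    # Each call yields a value unique by identity -- a stand-in for gensym.
    return object()

def seal(value, my_seal):
    # Wrap `value` in a function that stores the seal in its closure and
    # compares it against whatever seal the caller passes in.
    def unseal(presented_seal):
        if presented_seal is my_seal:
            return value
        raise SealError("wrong seal")
    return unseal

# A private scope generates a seal, seals the number 5, and hands out
# only the wrapper.
s = fresh_seal()
hidden = seal(5, s)
assert hidden(s) == 5  # the holder of the correct seal gets the 5 back
try:
    hidden(fresh_seal())  # anyone else gets an error
except SealError:
    pass
```

Because no client can forge the seal, the only thing a client can do with `hidden` is pass it around or hand it back, which is exactly the parametric behavior the proof needs to establish.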
What was needed was a sort of run-time environment that could capture information about the seals that had been generated so far. Then we needed to develop an equivalence relation on terms that was parameterized by just such a run-time environment. Such a relation turns out to be what is known as a Kripke logical relation parameterized by possible worlds.6 As it turns out, step-indexed logical relations are already a special case of Kripke logical relations, because a step index (which is just a natural number, so called because it represents the number of steps remaining for future execution of a program) can be thought of as a very low-information form of "world". Actually, for a lot of situations, that step index turns out to be all you need, so part of the contribution of our paper is to illustrate a situation where a step index alone is not enough. In fact, the reason the original version of the proof was broken was because it tried to use step indices alone. We needed to fix it by using richer worlds that capture information about which seals are generated, when they're generated, and which compile-time entities (in particular, type variables on the statically-typed side of the language) the generated seals correspond to. That's what I finally came to understand after the better part of a year.
The very end of the semester was rough, because I unexpectedly also had my oral qual exam to study for. A few days before the exam, everything that I had been working on and thinking about with regard to both quals and research culminated in a breathless middle-of-the-night email to Dan in which I explained how the step-by-step derivation of the Y combinator in The Little Schemer is precisely analogous to Appel and McAllester's (step-)indexed model of recursive types.7 Dan wrote back, "Thanks so much for sharing this with me. Amal explained it to me, but I did not see the relation with fixpoints. Your description is quite clear because I can relate to types and chapter 9 and much of Davey & Priestly. I think that D&P might hold some insight into how to make step-indexing more beautiful. What a very pretty model." That made me pretty happy, but then, one day later, Stevie Strickland and Aaron Turon made me even happier by inviting me to come give a talk at Northeastern's PL seminar in Boston. At that point, I had just committed to starting at Mozilla in March. (Oh, yeah, I was somehow also interviewing at Mozilla while all of this was going on. Oh, and also, getting engaged. Come to think of it, those things may have also had a little to do with the happiness.) So if I were going to do the talk in Boston, it would have to be in February at the latest. In December 2010, February 2011 seemed like a very long time away, so I said, "Sign me up!"
I took the qual exam in mid-December, and I passed, but I didn't make my best showing. I knew that with quals, the only thing that matters is whether you pass. Still, I wanted another chance to really show Amal what I had learned. I knew I'd learned a lot, but we'd been working together for a year and I still had no concrete results to point to other than a shaky paper draft, a half-finished proof, and a predilection for sending spastic emails to Dan in the middle of the night. So I decided that my talk at Northeastern needed to be about the multi-language parametricity project, even though it still wasn't done, and even though Amal was going to be on leave starting in January and wouldn't be around to help much.
I kept working on our proof through January and the early part of February, and I got most of the interesting cases done. It actually turned out to be a good thing that Amal wasn't there to answer my every question, because I learned more from wrestling with the proof on my own and digging up relevant references for myself. Then, in mid-February, I put the proof aside -- still not done, but closer! -- and started working on the talk. I worked really hard on that thing. I probably spent 40 hours working on the slides alone, if you count both times I overhauled them after confronting parts of the project about which my own understanding was fuzzy. One might have thought that working on our proof for so long would have forced me to have confronted those things already, but, embarrassingly, enough of this kind of work is mindless symbol-pushing that it's possible to get quite a long way on a proof without actually understanding some key idea. So it was a good thing that I was choosing to give the talk about this project -- because it made me understand my own project a lot better! I ended up going to Boston at the end of February feeling quite well prepared. The talk went well, and I got a wonderful message from Aaron afterward: "Your talk was one of the most enjoyable and comprehensible in recent memory. I particularly liked your intuitions on the duality between universal and existential types." He was referring to the beginning part of the talk, in which I'd tried to touch on What Parametricity Really Means And Why We Care About It In The First Place. I'd written that part with no guidance from Amal at all, so I was especially glad that Aaron liked it!
Another good thing about working on the talk was that it made me think critically about the way that Amal and I had organized our paper. The paper makes a series of false starts in which it sneaks something that won't work past the reader, but then turns around and says "Hah, tricked you! We can't do it that way. Let's do it this other way." That style of organization didn't seem to work for the talk, so I did it in a different way; the talk builds up the work in stages, but doesn't set out to fool the audience. After having written the talk that way, I'm tempted to change the paper to a similar style. I don't know if we actually will, but the point is that before working on the talk, I wouldn't have had enough perspective to be able to make meta-level criticisms of the paper. I feel a lot more comfortable doing so now.
After getting back from Boston, I went on vacation with Alex for a few days; then, I almost immediately moved to California and started working for Mozilla on Rust. Working on Rust didn't quite start out in the way I had expected. The job posting that had originally attracted me to Rust had mentioned its type-and-effect system, which was supposed to track mutability of values in Rust. The Rust team, the posting said, was interested in having a soundness proof of a simple model of Rust as reassurance of the cohesiveness of the type-and-effect system design. I thought that my familiarity with Kripke models would serve me well in working on such a thing, since one of the things Kripke models are good for is reasoning about type safety in the presence of mutable state. (Our parametricity project doesn't deal with mutable state as such, but seal generation is a stateful notion akin to storing a value in a new memory location.) By the time I arrived at Mozilla in March, though, the type-and-effect system was already on its way out of the language.8 So I needed to figure out what I was going to work on instead.
After flipping through a draft of the language manual, I was intrigued by Rust's classless, prototype-based object system with flexible, extensible objects -- pretty unusual for a statically typed language. I asked if the prototype-based stuff was for real, and was told, "Oh, most of that isn't implemented yet; you should implement it!" At that point, Rust had the beginnings of an object system: it was possible to create an object that had methods and fields and call those methods or access those fields from the methods, but objects weren't extensible, methods couldn't be overridden, and there was no form of self-dispatch, so methods couldn't call other methods on the same object except by way of a lengthy workaround. In fact, there wasn't even a "self" (or "this") keyword in the language at that time. So I spent my spring and summer implementing pieces of the Rust object system and learning that those pieces interact with each other in interesting ways. It was a great experience, and I learned a hell of a lot. By the end of the summer, I was also contemplating a thesis proposal idea based on Rust, but now I'm jumping ahead -- more on that will have to wait until the second half of this story.
I was originally only going to be at Mozilla for the summer, but Dave Herman had asked if I could show up any earlier than May, and there was nothing really stopping me, since Amal was going to be on leave anyway. I'm very glad that I arrived at Mozilla in March, because it meant that I had more time to work on my project -- I really needed all five of the months I had. But, moreover, it meant that I got to be there in April when the compiler went self-hosting -- first, the first time that the compiler written in Rust compiled itself, and then, a few days later, the first time that that compiler successfully compiled itself, producing a fixpoint. That was one of the cooler moments in my career so far -- so cool that I got the chills.
And that brings us more or less up to mid-June. I was halfway through my internship when I got the news from Amal, and the timing of it divides grad school in half rather neatly, too. So, sometime soon, I'll write about what's happened since then, and about my plans (and dreams!) for what the second half of grad school will be like.
I still spell "adviser" in the Grinnell style much of the time. If it's good enough for the Firefox spell-checker, it's good enough for me. (On the other hand, the Firefox spell-checker doesn't approve of "parametricity", so.)
An external fellowship can be a bad thing for precisely this reason. As Tim points out, "Being employed as a research assistant for a specific professor or research group integrates you socially and binds you to a commitment to deliver a particular kind of results -- a commitment that motivates you to finish your task by any means necessary, including collaborating with others. Having a fellowship empowers you to fuck around for almost three years and never get called on your shit." To be fair, though, the one-year fellowship I had did me a lot of good, because when I arrived at grad school, my formal CS education comprised precisely eight courses, instead of the twelve or sixteen that most people take before entering a CS graduate program. (Or more. At one point, I asked Alex to count how many CS courses he took at Georgia Tech -- he took twenty-two as an undergrad alone.) So the fellowship turned out to be a good thing, because at the beginning I really needed some time to concentrate on taking classes. But I think I'd be worse off had it gone on any longer.
It also didn't help that I kept going off to attend workshops that were only vaguely academic. I remember walking across campus at some point in April of 2009, trying to find a place where I could sit and work on my compiler. I couldn't decide where to go, so I thought to myself, "Where have I consistently been able to get a lot of work done on this project?" Then I realized that the answer was "SFO". When you live in Indiana, that's a strange realization to have.
Neis, Dreyer, and Rossberg writing in JFP this year about the Matthews and Ahmed 2008 paper: "A key technical difference is that their formulation of the logical relations does not use possible worlds to capture the type store (the latter is left implicit in their operational semantics). Unfortunately, this resulted in a significant flaw in their paper. They have since reportedly fixed the problem -- independently of our work -- using a technique similar to ours, but they have yet to write up the details." We're working on it!
Or to be female.
Actually, this is redundant, since as far as I can tell, authors in my field use "Kripke logical relation" interchangeably with "logical relation that is parameterized by possible worlds". (After squinting at the relevant Wikipedia talk pages, I see that not everyone is happy about the terms being used interchangeably, but I'm a computer scientist, not a logician, so for me, that's splitting hairs.) In her dissertation, Amal points out that Kripke models are named in honor of logician and philosopher Saul Kripke (you know; the Naming and Necessity guy), citing the 1963 paper in which he developed a model theory for propositional modal logic. But for the most part, PL authors don't seem to cite Kripke when they use Kripke LRs. From the point of view of a PL researcher, his work was general enough and long enough ago that I think citing him would be kind of like citing Alonzo Church every time you write a lambda. (Although that would be kind of awesome.) In fact, there are two POPL 2011 papers that actually mention Kripke LRs in the title, but neither cites Kripke at all.
And, more generally, that the notion of stratification in domain theory is analogous to the notion of stratification in Kripke models. Up until that point, I'd thought of domain theory, if I thought of it at all, as some difficult and mysterious thing that we didn't have to use because we used Kripke models instead. Now I recognize domain theory and Kripke models as sharing an essential ingredient.
It turns out, though, that some of the ideas that the type-and-effect system was about are working their way back into the language by way of the kind system.
I gave a brownbag internship talk on http://air.mozilla.org yesterday. I don’t have access to a recording of the talk yet, but I’m going to go ahead and post the (slightly revised) slides that I used. Let me know if anything is unclear. I’ll post a link to a recording once it’s available.
Browser instrumentation is the technical basis for performance tuning and many web-related research projects. Such projects boil down to recording interesting data as the browser runs, and optionally modifying the behavior of the browser itself based on recorded data. In an extensible browser such as Firefox, there are a number of means to accomplish these ends, each with certain tradeoffs:
Extensions/addons are code written in high-level JavaScript and dynamically loaded at runtime. Extensions can record and modify the behavior of the browser to the extent allowed by the APIs of browser components (in Firefox, XPCOM components).
Extensions are eminently portable, cross-platform and relatively easy to author, but are limited in the data and behavior to which they have access. Many low-level subsystems (such as the layout engine and JavaScript engine) are off-limits due to the performance and complexity costs of exposing certain functionality and data to a COM-like API. Furthermore, even if there are no performance considerations, many potentially interesting pieces of data remain unexposed for lack of (widespread) need.
Many platform/OS-level dynamic instrumentation frameworks (such as DTrace, SystemTap, and Windows ETW) allow for custom instrumentation in both kernel and user code. A common metaphor for these systems is the insertion of “probes” into the code, which perform some user-defined action.
The main advantages of these probe frameworks are that they have very low overhead and that probes can be added and removed dynamically without recompiling or restarting. As a building block for browser research and tools, however, they are inadequate: probes are difficult to write correctly, are not cross-platform, and the obtained data does not easily feed back into the program being inspected.
Modifying the C++ source code of the browser is the brute-force method used by many performance analysts and researchers. The developer creates a fork of the browser source tree, makes local modifications, and optionally distributes a modified binary or source tarball to others who want to run the instrumented browser.
“Hacking” the browser in this way provides the ultimate power to record and change any behavior, but has significant drawbacks. The researcher/developer must invest much time and energy to navigate and understand a large, foreign codebase just to figure out where to add instrumentation. Many independent developers do not have the resources to test their changes on all relevant platforms. Most importantly, these modifications almost always doom the forked codebase (and thus the research project) to become a transitory software artifact instead of a useful tool.
jsprobes
jsprobes is a cross-platform, portable, and flexible browser instrumentation framework. It follows the event-driven idiom common to DTrace and many browser components: certain locations in the code (“probe points”), when executed, can run custom bits of instrumentation. A significant feature of jsprobes is that this instrumentation (“probe handlers”) is written in JavaScript. Browser data structures are exposed to instrumentation through a safe, easy-to-use data API.
The framework was created to address the weak points of existing approaches and build on their strengths. More specifically, the design goals include:
cross-platform: jsprobes is implemented as part of the browser platform, instead of beneath it. There are no dependencies on specific operating system features, architectures, kernel modules, or escalated privileges (root access/jailbreaks).
flexible: Probe handlers have access to the full JavaScript language, and can leverage existing libraries. Adding new probe points or exposing additional data to instrumentation code is simple and customizable.
The handler execution model could be extended to serialize the probe event stream and data to disk, allowing for offline or remote evaluation of handlers.
easy-to-use: handlers are written in JavaScript, and handler-writers need not know the inner workings of Firefox data structures and architecture. Simple probes can be written using higher-level data APIs, while complex probes can use lower-level data APIs that more closely mirror the C++ view of browser data structures.
portable: Extensions are the de-facto way to share browser customizations; extensions can add, remove, and communicate with jsprobes-based probe handlers using a standard XPCOM interface.
cross-language: Firefox is implemented using a mix of C++ and JavaScript. jsprobes supports custom conversions between the data representations of C++ and JavaScript (or even JS-JS).
low overhead: The architecture of jsprobes minimizes time spent by the main thread to execute probe handler code. Most handlers are executed asynchronously on a side thread, and communication happens via asynchronous message passing.
Although not yet implemented, the design is also conducive to future adoption of “zero probe effect” techniques used by other instrumentation frameworks like DTrace. These techniques are a prerequisite for jsprobes to land in Firefox trunk.
Current status
In its current incarnation, jsprobes is exposed via the nsIProbeService XPCOM interface, which provides a high-level API to control the core instrumentation engine implemented inside SpiderMonkey. Accompanying this service and the core instrumentation engine are dozens of stub calls throughout the Firefox code base. These source locations are well-known and are shared by other instrumentation frameworks such as DTrace and ETW.
Currently jsprobes is distributed as a patch series, and is available on my bitbucket account. The patches track mozilla-inbound fairly closely as of the time of this writing. See the bitbucket page for more specifics.
Example: understanding GC behavior
Suppose you would like to know which benchmarks cause garbage collection (GC), the duration of such collections, and how much garbage is collected. At a data level, one needs to know the start and end times of each GC cycle, and the heap size before and after each GC.
Registering your instrumentation
Instrumentation is added and removed via an event handler-like interface, which should be familiar to DOM/JavaScript programmers. Each place where instrumentation could be added (i.e., at GC start and end) is called a probe point. Each probe point has a fixed set of arguments: for example, the current JavaScript context, the current runtime, or the GC compartment to be collected.
The instrumentation that should be executed upon reaching a probe point is called a probe handler (to distinguish it from event handlers). Probe handlers have access to the probe point arguments, but the actual handler code is run asynchronously on a separate thread with its own JavaScript heap. So, the probe point arguments are serialized when the probe point is reached, and then later deserialized into the handler-heap when the handler runs.
In our example, the probe points of interest are GC_DID_START and GC_WILL_END. These probe points are constants defined in the nsIProbeService interface, and each is described in the (forthcoming) documentation. Their interfaces are:
GC_DID_START(runtime)
GC_WILL_END(runtime)
Each argument type has a specific API as well. Below, to the left of the arrow is the field name, and to the right of the arrow is the data type. Capitalized types are representable by instances of the equivalent JavaScript prototypes. (env is a special implicit argument to all probe points, and is always available.)
runtime:
heapSize -> Number
gcTriggerBytes -> Number
heapLastSize -> Number
heapMaxSize -> Number
env:
currentTimeMS -> Date
At each of these points, we want to record heap size and current time. Fortunately, this data is readily available from the probe point arguments above. So, our two handlers should look like this:
/* handler for GC_DID_START */
/* format: [startTime, stopTime, startHeap, stopHeap] */
record = [env.currentTimeMS, 0, runtime.heapSize, 0];
/* handler for GC_WILL_END */
record[1] = env.currentTimeMS;
record[3] = runtime.heapSize;
pendingData.push(record);
There is one more piece required before we can register these probe handlers: a data specification. Probe handlers run asynchronously, but data must be captured and serialized on the main thread when the probe point is reached. Capturing all of the data would be expensive, so data specifications tell the instrumentation engine exactly which values need to be captured.
Data specifications are easy to write. Both of the above probe handlers have the same data specification:
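As a sketch of what such a specification could look like (the field names mirror the runtime and env APIs above; the syntax itself is hypothetical):

```javascript
// Hypothetical data specification (the exact jsprobes syntax may
// differ): list only the probe-point argument fields that the
// handlers actually read, so nothing else is serialized.
var gcDataSpec = {
    runtime: ["heapSize"],
    env: ["currentTimeMS"]
};
```

With a specification like this, only heapSize and currentTimeMS would be captured and serialized when either GC probe point fires.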
Note that probes.addHandler returns a cookie, which is usable later as an argument to the corresponding probes.removeHandler API method.
Now that we’ve registered some probe handlers, subsequent garbage collections will trigger our instrumentation. This is great, but we eventually want to know the results of our instrumentation. nsIProbeService has one more method, called asyncQuery. The arguments to asyncQuery are (1) some JavaScript source to run, and (2) an optional callback to invoke whenever a message is sent using postMessage from the handler thread to the probe service (main thread). For our example, we need two calls to asyncQuery: one for setup, and one, run periodically, that collects pending data records:
/* setup: run this before registering handlers */
probes.asyncQuery("pendingData = [];", function(){});
/* collection: run this periodically to marshal results */
probes.asyncQuery("while (pendingData.length) { " +
" postMessage(pendingData.pop()); " +
"} ",
function(msg) { results.push(msg); });
Once data starts rolling in via the callback, we can take action on the results array of records. For example, we could graph the data to show heap size over time, or plot GC pause length vs. garbage claimed. In fact, I have developed a demo extension called about:gc which does exactly this. It is available at bitbucket.org/burg/aboutgc.
Perhaps I should have read the Avro documentation a little more closely. It requires Boost headers, the Boost regular expression library, and flex and bison in order to compile. If I don’t want my code to have a chance of making it upstream, adding these big libraries is a good start. Meanwhile, it seems upb is still too immature. For now, I will just pass bulky, inefficient C++ data structures around :(
Since the last collabopt update, I’ve put some more work into the server software (dubbed optserver) that collects and aggregates profiles. This server is implemented in NodeJS and sqlite3. I have a support request in for a test phone that can run Fennec (aka Mobile Firefox). Once this comes, I will collect some profiles and see what the potential speedups are for the image load order optimization. (There may also be some tuning necessary to make the profile-recording extension work with mobile.)
Recently I have also thought of some other potential optimizations that should not be too hard to aggregate or optimize:
GC frequency/allocation rate: if we can observe that a particular website is animation-intensive or allocates a lot of short-lived objects, then we could pre-emptively run the GC more often than normal.
Compartment heap size: If we notice that particular websites keep a lot of objects alive, then fruitless GC time can be reduced by making the heap larger than normal.
Default image size: When loading webpages, the size of an image element is often unknown until the resource has been downloaded and decoded. In the absence of size hints (such as the width="80" HTML attribute or the min-width: 80px CSS property), the rendering engine must guess a default size for the image element. If this guess turns out to be inaccurate (which is known only after downloading the image), it may negatively affect any following page content. For example, if the guess was too big, page content may be pushed further down than necessary, and correcting it causes a layout reflow.
To avoid costly reflow, we can observe actual image dimensions during multiple loads of a website, and give this size information to new site visitors.
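This aggregation step can be sketched in a few lines; all names here are invented for illustration and this is not the actual collabopt code. Observed dimensions are tallied per image URL across profiled loads, and the most frequently seen size is handed to new visitors as a hint:

```javascript
// Illustrative aggregation for the image-size hint: tally observed
// dimensions per image URL, then report the most common size.
function ImageSizeHints() {
    this.observations = {};
}
ImageSizeHints.prototype.record = function (url, width, height) {
    var key = width + "x" + height;
    var tally = this.observations[url] || (this.observations[url] = {});
    tally[key] = (tally[key] || 0) + 1;
};
ImageSizeHints.prototype.hint = function (url) {
    var tally = this.observations[url];
    if (!tally) return null; // no profile data for this image
    var bestKey = null, bestCount = 0;
    for (var key in tally) {
        if (tally[key] > bestCount) {
            bestKey = key;
            bestCount = tally[key];
        }
    }
    var parts = bestKey.split("x");
    return { width: Number(parts[0]), height: Number(parts[1]) };
};
```

A real deployment would also need a policy for images whose size genuinely varies between loads, but the majority-vote sketch conveys the idea.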
All three of these optimizations are conceptually simple, but recording profiles with this data is difficult or impossible if one can only use the internal JavaScript APIs of the browser. The values we seek to record are deep in the guts of the JavaScript engine (the first two optimizations) and the layout engine (the third). We require a different approach to recording this data in a low-overhead way. This different recording approach is the next major technical challenge of my internship. I have been brainstorming most of this week on approaches, and will write another post once the design has settled more.
When I came to Mozilla this summer, I told my co-workers that I was casting about for a thesis topic. For the last few weeks, I've been mulling over an idea that I think could lead to one.
Although my idea addresses a problem that I don't think is specific to Rust, I did arrive at it by working on Rust, so I'll start there. The Rust compiler is implemented with an LLVM backend; it translates
Rust code to the LLVM IR. Before I started working on Rust this
spring, I knew nothing about LLVM, but I've become a fan. By
targeting LLVM, there are a whole set of machine-specific issues
(register allocation, instruction selection, code generation) that our
team (mostly) no longer has to think about, allowing us to (mostly) concentrate on higher-level language design issues. I like the modular design of LLVM; perhaps the best part is that by targeting LLVM we're able
to take advantage of optimizations that other people contribute. If
a compiler targets LLVM, then at least in theory, improvements to LLVM
will make the compiler better over time without the compiler writers having to do
anything. In the case of Rust, we've made a few contributions upstream to
LLVM, too, so the relationship is, we hope, mutually beneficial.
Now, the LLVM folks have a way of making it sound like it's very easy
to write an LLVM-backed compiler: all you have to
do is write a straightforward front-end that massages source
programs into LLVM IR, and you're done. This might not be too
far-fetched of a thing to say if the source language you're compiling
is C. If you're compiling a much higher-level language, though,
translating to LLVM IR is nontrivial. In the Rust compiler, the translation from the fully-fleshed-out-and-annotated AST to LLVM IR happens via a single, monolithic, 9000-line pass known
affectionately as "trans". It's almost impossible to reason about a compiler pass this monolithic. We can ameliorate some of the pain of dealing with trans by separating it into
modules that handle, say, task and communication-related things,
vector operations, and so on; we've started doing this, and it does
help. We've also talked about abstracting low-level stuff out into what we call an "LLVM combinator library", which would present higher-level abstractions for common things we do with LLVM, and that
will help, too. But what would really be nice would be if we had more intermediate
languages. Even just one well-designed intermediate language would go
a long way toward making this part of the compiler easier to reason about and less
buggy, and at least one Rust developer besides me has advocated for such an intermediate language in discussions on our IRC channel.
But here's the thing: it's not just Rust!
Everyone who compiles a high-level language to LLVM is going to have
this problem and is going to have to deal with it one way or another.
So, wouldn't it be great to have a language that sits on top of LLVM, compiles to LLVM in a
semantics-preserving way, and serves as a target language for compiler
writers who are trying to compile high-level languages but who want to
take advantage of the goodness of LLVM? That's the language that I propose designing.
In order to do the proof of semantics preservation, I'd need a
formal semantics for LLVM, and from what I can tell, nobody has one of those yet. (It looks like Steve Zdancewic and Jianzhou Zhao recently embarked
on an LLVM-in-Coq project, but they haven't published anything yet.) So, the main contributions of this work would be
as follows:
It would make compiler writers' lives easier by giving them a
language to target that isn't as low-level as LLVM IR, but still
lets them exploit the goodness of LLVM in their compiler -- which,
after working on Rust, I believe is a real need.
It would give a formal semantics to a subset of LLVM, which I think would
be useful in many contexts outside of this work.
Rather than having to give a semantics to all of LLVM, I
can just give a semantics to the target language of compilation, which
I think could be a rather smaller subset of LLVM IR. I think a proof
of semantics preservation would be valuable even if it's only for a small subset of LLVM, because part of
the point of the LLVM infrastructure is that it's okay to produce
correct-but-inefficient IR and hand it off to the LLVM optimizer. If
we don't have to worry too much about producing efficient
LLVM IR, as long as we produce correct LLVM IR, then we can
compile to some minimal subset of LLVM that's expressive enough for
our needs and then let the optimizer go to town on code written in
that subset.
What we're after here, by the way, is not end-to-end verified compilation; for
that, there's CompCert and its ilk. Instead, we are working on the
assumption that an LLVM backend is something some people are going to
want anyway. Having said that, Greg Morrisett, Jean-Baptiste Tristan,
and Paul Govereau have work
underway on validating LLVM optimizations, so if someone
did want an end-to-end verified compiler that used LLVM, this
work and theirs could be complementary pieces of that larger
puzzle.
In the last couple of weeks, I've gotten the chance to pick some brains about my idea. So far I've gotten some great questions and feedback:
Adam Foltzer asked how my proposed intermediate language
would compare to C-- in terms of its level of abstraction. Adam may have been thinking of how, in the LLVM-backed version of GHC, an intermediate language that's based on C-- compiles to LLVM. However, C-- isn't
really higher-level than LLVM; in fact, the level of abstraction they
offer is quite similar. The C-- developers characterize it as a
portable assembly language, and that sounds a lot like LLVM. (So why
does GHC compile from C-- to LLVM, then? There's a great
paper about the LLVM GHC backend from last year's Haskell
Symposium that answers that question in detail.) So the answer is
that I'm aiming for something higher-level than C--. How much
higher-level is still an open question. The trans pass of Rust gives
me an idea of the range of abstraction I'm aiming for, but between the
level of abstraction of the input to Rust's trans pass (which is a big, hairy
AST with a whole bunch of stuff hanging off of it) and the level of
abstraction of LLVM, there's still a lot of room. This is something
that I'd especially like more feedback from people about. I'd like to
choose a set of abstractions that serve as a sane target for a variety
of source languages -- say, a purely functional one, a
super-OO-Kool-Aid one, and a super-concurrent-Kool-Aid one. I
wouldn't, I hope, actually need to implement a compiler from each of
these languages to my intermediate language. That seems like too much
work for one grad student. But I'd like to be able to have a reasonably convincing story for how my language could be a target for
any of them.
Dan Grossman pointed out that I'd need to consider how a
compiler writer who wanted to target my language would deal with
run-time system matters, particularly garbage collection. That
reminded me that most of the complexity of the aforementioned enormous
trans pass in Rust is code that handles reference counting. This is a
stopgap measure that will be replaced by a real garbage collector one
day, and I suppose that might cut down on the complexity, but in any
case, I think I'd want my language to sit at a level of abstraction
above garbage collection. On the other hand, I don't want it to tie
the compiler writer to a particular set of GC choices. This, I think,
is an example of the high-level-abstraction-versus-low-level-control
tension that drives a lot of research in our field. Actually, C--
found a neat solution to the eat-cake-and-have-it-too dilemma by
providing an interface for compiler writers to plug in their own
garbage collector; this run-time interface thing is the most
interesting and novel aspect of C--. Putting an interface like that
in my language is a possibility, but I recently realized (from talking
with my co-workers and from reading the aforementioned LLVM-backed GHC
paper) that LLVM
now offers such an interface, too. So maybe my story can be "not
my problem; use the LLVM GC interface".
Caylee Hogg asked what the framework for semantics preservation would be. I think I'd want to mechanize a proof of semantics preservation in a proof assistant and then extract the compiler from the proof, like y' do. Of the various proof assistant options, Coq is the only one I've tried, and I'm not married to it; I'm willing to invest in learning something new. More generally, though, I need to pin down exactly what "semantics preservation" means in this setting. A lot of authors seem to use "type-preserving compilation" and "semantics-preserving compilation" interchangeably, so is type preservation what I want, or do I want something more? Or something less? I've always been fuzzy on this point, and I'd like to get very, very clear on it before embarking on a proof.
So now I'd like to pick your brain. Who should I be talking to? What related work should I be looking at? What potential pitfalls have I not considered? I'd love to know what you think!
A record/replay framework makes it possible to execute a program once and then replay that execution an arbitrary number of times. The main challenge in the record phase is to capture all sources of non-determinism (e.g., the schedule in a multi-threaded program). In the replay phase, this information is used to reconstruct the original (recorded) behavior of the program.
There are various levels of granularity at which information is recorded. For example, the recorder can log every executed instruction. One advantage of such a fine-grained recording strategy is that all information is available to replay the application deterministically. The disadvantage, however, is that recording slows down the program execution by approximately two orders of magnitude. Such a large slowdown renders recording of programs with large workloads impractical. Fine-grained record/replay frameworks are therefore used for debugging programs rather than for testing them. They are usually implemented with a dynamic binary translation tool (e.g., PIN (http://www.pintool.org/) or FastBT (http://nebelwelt.net/?id=47)).
Our work takes a different approach in that we do not record (results of) individual instructions but events that introduce non-determinism to the program. We monitor a program (without rewriting code) and record all events that (1) provide an input to the program and (2) can affect the control flow of the program. A list of events is described in the next paragraph, “Challenges”. The “Solution” paragraph explains how we eliminate or record events. Finally, the “Implementation” paragraph highlights the most important implementation details.
Challenges
Here is a list of sources of non-determinism we are considering so far. The list is probably not complete. So if you have an item to add, please let me know.
Address space randomization (ASR)
Signals
Schedule/interleaving of parallel programs
Instructions (e.g., rdtsc, rdmsr)
System calls (e.g., gettimeofday())
Solution
To eliminate the non-determinism that is introduced by the OS/hardware we propose the following strategies:
ASR: can be disabled by the OS. E.g., on a Linux system, sysctl -w kernel.randomize_va_space=0 disables ASR
Signals: To replay the original program execution as precisely as possible, signals should be delivered at exactly the same spot as in the recording phase. This is a non-trivial problem since signals are, in general, asynchronous; a process can receive a signal at an arbitrary program point and, based on the signal, change its control flow. We use hardware performance counters to count the number of retired conditional branches in the recording phase and count that value down in the replay phase. When the counter reaches zero, the replayed process is stopped and single-stepped to the eip where the original signal was received, and the signal is delivered there. This way, signal delivery in multi-threaded and multi-process applications can be replayed exactly.
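The countdown placement can be sketched abstractly. The following toy simulation over a recorded instruction trace stands in for the real PMU/ptrace mechanism; the function name and trace format are invented for illustration:

```javascript
// Toy simulation of replay-side signal placement: count recorded
// retired conditional branches down to zero, then step to the
// recorded eip and deliver the signal there. (The real implementation
// reads a hardware performance counter and single-steps via ptrace.)
function findSignalDeliveryPoint(trace, recordedBranches, recordedEip) {
    var remaining = recordedBranches;
    for (var i = 0; i < trace.length; i++) {
        // Once the branch count is exhausted, "single-step" until the
        // instruction pointer matches the one recorded for the signal.
        if (remaining === 0 && trace[i].eip === recordedEip) {
            return i; // deliver the signal before this instruction
        }
        if (trace[i].condBranch) {
            remaining--;
        }
    }
    return -1; // divergence: the recorded delivery point was never reached
}

// Example: the signal was recorded after two retired conditional
// branches, at eip 0x1c.
var exampleTrace = [
    { eip: 0x10, condBranch: false },
    { eip: 0x14, condBranch: true },
    { eip: 0x18, condBranch: true },
    { eip: 0x1c, condBranch: false }
];
findSignalDeliveryPoint(exampleTrace, 2, 0x1c); // index 3
```

Counting branches rather than raw instructions works because the branch counter pins down the control-flow path, and the eip then disambiguates the exact instruction within it.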
Scheduling: Multi-threaded programs that share memory can introduce non-determinism due to data races and deadlocks. Replaying parallel programs deterministically requires modifications to the kernel, the language, or the runtime system. All these efforts are still quite “academic”, and time will show which approach is most promising. To avoid these “parallel problems”, we serialize the execution of parallel programs by letting only one thread/process run at a time. More specifically, the recorder determines which process runs and logs the schedule, which the replayer then uses to reproduce the same schedule. Now you might object that we are well into the era of parallelism, and here we are building a recorder that serializes parallel programs. In fact, I think that many programs will remain mostly sequential, or use only a couple of threads, so serializing them will slow down recording/replay by only a modest factor. The reasoning behind this assumption is that many problems are inherently sequential, or at least large parts of them are; keeping Amdahl’s Law in mind, many programs will use only a couple of threads.
Instructions: There are some (x86) instructions that introduce non-determinism. For example, rdtsc reads the processor’s time-stamp counter, a 64-bit value. Clearly, this value will differ between the record and replay phases. Since rdtsc is a common instruction, we must make sure that the values are the same during record and replay. In a Linux environment, rdtsc can be virtualized using the prctl system call (PR_SET_TSC), which makes the instruction trap so that its result can be recorded or injected. The rdmsr instruction is used to access the PMU (Performance Monitoring Unit). Most data that can be gathered by the PMU is non-deterministic (e.g., cache hits/misses). Currently, we do not handle the rdmsr instruction specifically. However, since most applications do not make use of PMU data, we do not consider this a major limitation of our approach.
System calls: Many system calls provide input to the application; e.g., read(), gettimeofday(), or socket operations can give varying (run-to-run) inputs to the target application. Therefore, the recorder logs all such inputs in the record phase, and the replayer injects the recorded data in the replay phase. This way we can guarantee that the record and replay executions do not diverge due to system calls.
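The log-then-inject pattern for such inputs can be sketched as a toy; the names and shapes below are invented, standing in for a nondeterministic source like gettimeofday():

```javascript
// Toy record/replay of a single nondeterministic input source.
// Recording logs every result; replay injects the logged results
// instead of re-executing the call.
function makeRecorder(realCall) {
    var log = [];
    return {
        log: log,
        call: function () {
            var result = realCall.apply(null, arguments);
            log.push(result); // capture the nondeterministic result
            return result;
        }
    };
}

function makeReplayer(log) {
    var i = 0;
    return {
        call: function () {
            return log[i++]; // inject the recorded result
        }
    };
}

// Example: a "clock" that drifts between calls.
var t = 0;
function fakeClock() { return 1000 + (t += 7); }
var recorder = makeRecorder(fakeClock);
var firstRun = [recorder.call(), recorder.call()];
var replayer = makeReplayer(recorder.log);
var secondRun = [replayer.call(), replayer.call()];
// secondRun matches firstRun exactly
```

The real recorder does this at the ptrace syscall boundary, logging result values and output buffers rather than wrapping functions.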
Implementation
The current system is implemented for a 32-bit Linux-based system. Subsequently, you can find a list of the most important implementation/design decisions:
The monitoring process (recorder/replayer) is implemented as a separate process. The reason for using separate address spaces is that the execution of the original (non-monitored) program should be influenced as little as possible.
We use ptrace to stop at system call entries/exits and to intercept signals.
We virtualize every system call that uses file descriptors (including sockets)
Only system calls that must be executed are actually executed (e.g., mmap)
I will add more implementation details over time. The project is still in an experimental phase and things might change over time. I’ll keep you informed.
When I last wrote
about my work on Rust a month ago, I had just finished
implementing some very basic support for extending Rust objects with new methods. In the grand tradition of computer science publishing, the example program I gave was one that very carefully avoided doing any of the things that didn't work yet. The most egregious issue was that if you extended an object with new methods, you could, um, no longer call its original methods. (Really. Well, I had to start somewhere.) Even once that was fixed, a lot of things were wrong. Outer methods couldn't refer to inner ones, since object extension didn't play nicely with self. And method overriding didn't work at all.
In the last month, I've made all of those things work to some extent. For instance, programs like these two from the Rust test suite that exercise the features I just mentioned can now compile, run, and exit normally, having behaved as expected. (And do, on three platforms, whenever someone lands a change to the compiler.) Hooray!
A lot of work remains to be done on the object system, of course, including a pretty huge issue that I've been putting off fixing. Everyone loves a brainteaser, right? Consider the following Rust program:
use std;
fn main() {
obj cat() {
fn ack() -> str {
ret "ack";
}
fn meow() -> str {
ret "meow";
}
fn zzz() -> str {
ret self.meow();
}
}
auto shortcat = cat();
auto longcat = obj() {
fn lol() -> str {
ret "lol";
}
fn nyan() -> str {
ret "nyan";
}
with shortcat
};
assert (shortcat.zzz() == "meow");
assert (longcat.zzz() == "meow");
}
Here we've got shortcat, an object with three methods, created by calling the
cat() object constructor. When the call to
shortcat.zzz() is made, we look up zzz in
shortcat's vtable and run the code associated with it, which directs us to look up meow in
shortcat's vtable and run the code associated with it. Finally, that code returns the string "meow", and the first assertion succeeds.
Now consider the call to longcat.zzz(). The
longcat object was created by extending shortcat with two new
methods, lol and nyan. Neither one of the new
methods overrides any of shortcat's methods, and neither one should interfere with zzz or meow.
So we would expect the call to longcat.zzz() to return
"meow", too. Instead, in the current buggy implementation, it returns something else, and the assertion fails.
So, the brainteaser: can you guess what longcat.zzz() returns
instead of "meow", and why? (Note that the program compiles with no errors or warnings, doesn't segfault when run, and doesn't raise any errors under Valgrind.)
Scroll down for the answer and an explanation...
First, note that longcat has five entries in its
vtable, appearing in alphabetical order: ack,
lol, meow, nyan,
zzz. Of those, lol and nyan
are "normal" entries, and the others are "forwarding functions" that, roughly speaking, forward us along to the
appropriate vtable entries from shortcat. (Implementing
this forwarding behavior has taken me much of the last month.) So,
when we call longcat.zzz(), we get forwarded along to
shortcat's vtable entry for zzz, which contains the compiled code for
ret self.meow().
This is where the problem arises: the code we're looking at was
compiled under the assumption that self is shortcat, since of course that's what it was when shortcat's vtable was created. shortcat's vtable only has three methods,
ack, meow, and zzz, in that
order. The call to self.meow() therefore compiled
to, roughly, "do whatever the second slot in my vtable says to
do". But since self is now longcat, the second slot in
its vtable is not meow but lol. So longcat.zzz() returns
"lol", and the assertion fails.
Interestingly, as I mentioned, this code runs cleanly under Valgrind -- no memory corruption or anything like that. Furthermore, the assertion might have accidentally succeeded if I'd chosen names for the methods that alphabetize differently. So this is a truly nasty
bug.
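The slot-index mismatch can be modeled in a few lines of plain JavaScript; this is a toy stand-in for Rust's vtable machinery, not the actual implementation:

```javascript
// shortcat's vtable has three slots in alphabetical order:
// [ack, meow, zzz]. The body of zzz, "ret self.meow()", was compiled
// as "call slot 1 of self's vtable".
var shortcatSlots = [
    function (self) { return "ack"; },               // slot 0: ack
    function (self) { return "meow"; },              // slot 1: meow
    function (self) { return self.slots[1](self); }  // slot 2: zzz
];
var shortcat = { slots: shortcatSlots };

// longcat's vtable has five slots: [ack, lol, meow, nyan, zzz].
// ack, meow, and zzz are forwarding functions into shortcat's
// compiled code -- but the self they pass along is longcat.
var longcatSlots = [
    function (self) { return shortcatSlots[0](self); }, // slot 0: ack
    function (self) { return "lol"; },                  // slot 1: lol
    function (self) { return shortcatSlots[1](self); }, // slot 2: meow
    function (self) { return "nyan"; },                 // slot 3: nyan
    function (self) { return shortcatSlots[2](self); }  // slot 4: zzz
];
var longcat = { slots: longcatSlots };

shortcat.slots[2](shortcat); // "meow": slot 1 of shortcat's vtable is meow
longcat.slots[4](longcat);   // "lol":  slot 1 of longcat's vtable is lol
```

The fixed slot index is baked in at compile time, so whichever object shows up as self at run time determines what "slot 1" means.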
To fix the mismatch, we need the self in
shortcat.zzz() to have the same type as
shortcat does. By "same type", I mean that its vtable
must have the same number of methods, with the same names, in the same
order, and having the same types as the original. But self is supposed to be longcat at that point, so how are we going to manage to give it the same type as shortcat? The answer, unsurprisingly, is another level of indirection. We give self a vtable that has three
backwarding functions, one for each of the original three
methods ack, meow, and zzz, which point to the corresponding entries in longcat's
vtable. From there, of course, they will forward right back to
shortcat's vtable -- unless longcat has overridden one of them, in which case we'd use the overriding entry from longcat. (There's no overriding happening in this example, but we need a solution that would work in case there were.) With luck, I'll land this backwarding behavior in Rust sometime in the next week. Stay tuned for the next exciting episode of Lindsey Implements An Object System!
I found a deterministic hardware performance counter event on a Sandy Bridge architecture. The event counts retired conditional branches. That’s great news, since we can use this event to find the exact spot to deliver a signal in the replayer by counting down the retired conditional branch instructions that were recorded.
(My posts with the tag mozilla,
such as this one, are now being syndicated onto Planet Mozilla Research along with those of my colleagues. Yay!)
This week will have been my fourteenth week (!) on the Rust team at Mozilla. Fourteen
weeks is about the length of a normal internship, so now seems like a
good time to take stock of what I've accomplished so far. When I left off writing about Rust in April,
I'd just spent a couple of weeks learning my way around PLT Redex and getting started
on a Redex model of Rust. Working in Redex was a lot of fun, but I felt rather disconnected from the rest of the team, who were
all hacking on, you know, the actual implementation of Rust.
So it was good to get back to working on Rust proper at the end of
those two weeks. Of course, Rust is changing so fast that it's easy
to lose track of what's going on if you stop paying attention for even
a moment, so when I came back to Rust it was another week or so of catching up before
I was really ready to start writing code again.
If you're new here, Rust is a new systems programming language
pursuing the trifecta of safe, concurrent, and
fast. It has an expressive type system, pattern matching,
concurrency primitives, and a lightweight, structural object system.
My project for the summer is focused on the object system and type system: I'm
designing and implementing support for "self" expressions in Rust,
which involves having a notion of self-types, or "the type of the
current object". Figuring out an appropriate type for the current
object is surprisingly subtle, especially in the presence of object
extension. For instance, suppose that you add some more methods or
fields to an instance of an object, then call a method on the extended
object. If the method you call happens to return self,
what's the type of that return value? Is it the type of the extended
object, or the type of the original one? Ideally, it would be the
type of the extended object, which is a more precise type, but some
languages, like Java, aren't able to statically determine that the
returned value has the more precise type. (Apparently, it's possible
to fake
that behavior using generics, but it ain't pretty.) It would be
great if Rust could do better.
Since object extension is one of the things that makes determining
the type of self hard, I decided to work on object
extension first. This alone has taken a pretty long time. Rust
compiles to the LLVM intermediate language, and I was forced to
confront my ignorance of how our translation to LLVM works.1 As it turns out, it's pretty hairy -- I spent at least a couple of the last eight weeks
doing nothing but staring at the translation pass of the compiler and
adding hundreds of lines of comments to it. But I can finally report that as of
yesterday, Rust supports extending an object with new methods! Here's
a complete, if rather uninteresting, Rust program that uses the new
feature:
use std;

fn main() {
    obj normal() {
        fn foo() -> int { ret 2; }
    }
    auto my_normal_obj = normal();

    // Extending an object with a new method
    auto my_anon_obj = obj {
        fn bar() -> int {
            ret 3;
        }
        with my_normal_obj
    };

    assert (my_normal_obj.foo() == 2);
    assert (my_anon_obj.bar() == 3);

    auto another_anon_obj = obj {
        fn baz() -> int {
            ret 4;
        }
        with my_anon_obj
    };

    assert (another_anon_obj.baz() == 4);
}
Here, the expression we're assigning to my_anon_obj is
what we're calling an "anonymous object expression" -- it takes
my_normal_obj, which is an instance of the object
normal, adds a new method to it, and evaluates to a
brand-new object. (Note the with my_normal_obj part.)
It's not necessary to call any kind of constructor -- the new object
simply springs into being. This seems obvious in retrospect, but it
caused me a lot of confusion when I was trying to figure out how to
translate anonymous object expressions. When normal objects such as
normal are translated, the translation is side-effecting:
it causes an object constructor function to end up in the generated
executable. Translating anonymous object expressions, on the other
hand, doesn't do that -- it produces an object "inline", and no
constructor is generated. The resulting extended object is still a
bona fide object, though, like one you'd get from a constructor, so it
can be extended again with more methods, producing
another_anon_obj.
There's still a lot of work to do here -- as you can see, I'm not
doing anything with self yet. Also conspicuously absent
are calls like, say, my_anon_obj.foo(), which
should work fine but don't yet. To do that, I need
to put so-called "forwarding slots" in the anonymous object's vtable
that forward method calls to the appropriate methods in the object it
extended. And we also want to support field extension -- so those
things are all on my plate for the rest of the summer. I have nine weeks to go -- we'll see how far I get!
I'm also excited about continuing to work on Rust when the summer
is over. By August, I think I'll be in an unusual position as someone
who's familiar with the Rust compiler internals, but who has a
theoretical background and wants to do theoretical/foundational work. One thing I'm wondering about, for instance, is how well Rust's
translation to LLVM preserves the semantics of types. My guess is
that since many Rust types compile to one LLVM type, the translation is not particularly semantics-preserving -- so does a
semantics-preserving encoding exist in the LLVM type system, and if not, how much more
sophisticated would LLVM's type system have to be to make it possible?
(And a semantics for LLVM doesn't exist yet, which by itself is a huge
open problem that I think a lot of
people would like to have a solution for.)
In the more immediate term, though, I'm taking tomorrow off work to compete in the ICFP Programming Contest, which starts today at 5 p.m. local time and goes until 5 p.m. on Sunday. This will be the third year that Alex oniugnip and I have competed
as Team K&R2. As soon as our long weekend of hacking is over, Alex is headed off to a conference in Portland. He'll have something like three hours between the time the contest ends and the time his flight leaves, which meets our usual standards for ridiculousness. Who else is playing?
In fact, I didn't know anything about the LLVM compiler infrastructure, so I started by reading the chapter about LLVM from the recently published Architecture of Open Source Applications book. I highly recommend this book; everything I've read in it has been good, and it's available free online. They're also accepting contributions of more chapters.
I’ve tried to find the factor(s) in the system that cause the non-determinism in various counters that are supposed to be deterministic. After trying various hardware events, it seems that there is at least one event that is not influenced by page faults and hardware interrupts. This event is a sub-category of BR_INST_RETIRED (0xC4 on Core i7 architectures) and is called NEAR_CALL (0x02). The good news is that the event seems to be deterministic; the bad news is that this particular event does not occur very frequently. At least, the event can be used as a guaranteed lower bound. If the counter 0x2C4 is smaller than the recorded value, we can be sure that we haven’t already gone past the spot where we want to stop.
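The 0x2C4 value decomposes under Intel's raw event encoding (umask in bits 15:8, event code in bits 7:0); a small hypothetical helper makes both the encoding and the lower-bound check explicit:

```c
#include <stdbool.h>
#include <stdint.h>

/* Intel raw event encoding (low 16 bits): umask in bits 15:8, event code
   in bits 7:0. BR_INST_RETIRED (0xC4) with umask NEAR_CALL (0x02) is
   therefore the "counter 0x2C4" mentioned above. */
uint64_t raw_event(unsigned event_code, unsigned umask) {
    return ((uint64_t)umask << 8) | event_code;
}

/* The lower-bound property: while the live NEAR_CALL count is still below
   the recorded value, we cannot yet have passed the stopping point, so
   the replayer may keep running at full speed. */
bool definitely_before_target(uint64_t live_count, uint64_t recorded_count) {
    return live_count < recorded_count;
}
```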
Enough words, here are some graphs – more will follow.
Best,
Albert
Experimental settings:
– Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz
– OS: Ubuntu kernel: 2.6.39-0-generic
– hpc information gathering: perf_events
– SPEC CPU 2006 benchmarks, compiled with gcc 4.5.2 -O2
If you need more data, please let me know!
We use ptrace to control the process that is monitored. Execution is stopped before every system call and the content of the performance counters is read. After the data is obtained from the counters, all counters are reset to avoid a drift of the jitter throughout the execution. The graphs presented here do not include data from the Linux loader, since ldd seems to be non-deterministic. Address space randomization is disabled (sysctl -w kernel.randomize_va_space=0), and the process is bound to a specific core (sched_setaffinity(..)).
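A minimal sketch of that monitoring loop (Linux-only; the actual counter read/reset via perf_events is elided, and the child here just runs a trivial syscall so the loop has something to observe):

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stop a traced child at every system-call boundary with PTRACE_SYSCALL
   and count the stops. In the real tool, each stop is where the
   performance counters would be read and reset. Returns -1 if tracing
   is unavailable. */
long count_syscall_stops(void) {
    pid_t child = fork();
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);                 /* let the parent attach its loop */
        write(STDOUT_FILENO, "", 0);    /* a syscall for the parent to see */
        _exit(0);
    }
    int status;
    long stops = 0;
    waitpid(child, &status, WUNTRACED); /* wait for the initial SIGSTOP */
    while (1) {
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1) {
            kill(child, SIGKILL);       /* bail out cleanly on failure */
            waitpid(child, &status, 0);
            return -1;
        }
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        /* here: read the performance counters, then reset them */
        stops++;
    }
    return stops;
}
```

Each system call produces two stops (entry and exit), which is why the counters are read and immediately reset at every stop rather than accumulated.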
gcc benchmark: invoked with parameters ./gcc… 166.i -o 166.s
Jitter of retired instruction/store/branch/near_call count as a function of retired instruction count.
One week has now passed, and I am trying to figure out why the number of retired instructions for a deterministic application is not deterministic. I wrote a small framework, which should allow me to track all sources of non-determinism.
To do so, we eliminate as much randomness as possible before we start the application: (1) we bind the execution of the thread to a particular CPU, and (2) we disable address space randomization, which is enabled by default in newer Linux distributions. If you know of any more sources, please let me know! So far, so good.
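Both setup steps can be done in-process on Linux; this sketch pins the thread to a single CPU and uses personality(ADDR_NO_RANDOMIZE) as a per-process alternative to the sysctl (the function name is invented):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/personality.h>

/* Sketch of the two setup steps: (1) bind the calling thread to a single
   CPU (the first one in its current affinity mask) and (2) disable
   address space randomization for this process, equivalent in effect to
   sysctl -w kernel.randomize_va_space=0 for this one process. */
int make_execution_less_random(void) {
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            break;
        }
    }
    return personality(ADDR_NO_RANDOMIZE) == -1 ? -1 : 0;
}
```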
There are two known sources of randomness that we track during execution: (1) page faults and (2) hardware interrupts. More specifically, the RETIRED_INST counter adds +1 instruction for each page fault and for each hardware interrupt. However, from what I have found out so far (and which I am not quite sure of), the following scenario adds non-determinism (or the counter for hardware interrupts is not precise?):
inst: n
<- hw-signal 1
<- hw-signal 2
inst: n+1
In this example, the RETIRED_INST counter does not reflect the second signal. If anyone has comments on this (or has experienced the same), please let me know.
I just started my internship at Mozilla last week. So far, everything is great and the project I am going to work on this summer is very cool. The main aim of the project is to develop a debugger that enables the programmer to exactly trace down the source of a bug. Very often, the effects of a bug (e.g., write to an unintended memory location) are seen later (seconds, hours, days) in the program execution. Traditional debuggers cannot reveal such a bug, since (1) they do not record execution traces and (2) they cannot go back in time (reverse-debug).
There exist several approaches that record execution traces to enable reverse debugging. However, recording execution traces is extremely expensive. It can cause a slowdown of the application by up to a factor of 100! In a real-world environment, such a slowdown is not acceptable. To avoid excessive overhead, other approaches use checkpointing. Instead of recording every executed instruction, checkpoints are created (e.g., every second) from which the program can be re-executed an arbitrary number of times. The difficulty that comes with checkpointing is that the result of every non-deterministic operation must be recorded in order to guarantee that the replay produces the same output as the original run. E.g., the results of system calls must be recorded during the original run. In the replay run, the system call is not executed; instead, the recorded result is used.
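The record-the-result, replay-the-recording scheme can be sketched as follows; this is a toy, with a fake nondeterministic call standing in for a real system call, and all names are made up:

```c
#include <stddef.h>

/* Toy sketch of the record/replay scheme described above. */
#define LOG_MAX 64
static long result_log[LOG_MAX];
static size_t log_len, log_pos;
static int replay_mode;

/* Stand-in for a real nondeterministic system call. */
static long fake_nondet_call(long value) { return value; }

/* Record mode: perform the call and log its result.
   Replay mode: skip the call entirely and return the recorded result,
   so the replay run is guaranteed to see the same values. */
long checked_call(long value) {
    if (replay_mode)
        return result_log[log_pos++];
    long r = fake_nondet_call(value);
    result_log[log_len++] = r;
    return r;
}

/* Switch from recording to replaying, rewinding to the start of the log. */
void start_replay(void) { replay_mode = 1; log_pos = 0; }
```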
There exist frameworks which can do that. However, it remains unclear if their replay is 100% deterministic. Furthermore, they incur a slowdown of approx. 50%.
The goal of my project is to enable reverse debugging at almost zero overhead. To achieve this goal, we want to use hardware performance counters (HPC) that are available on all modern CPUs. In particular, upon every signal that is sent to the application, we intercept the signal and record its type. In a second step, we read the number of retired instructions. Consequently, we can record the source of non-determinism and the exact point when it happened. This should (at least we hope so) be enough information to enable a fully deterministic replay.
In a first step, we try to find out how deterministic hardware performance counters are. Any results will (hopefully) be published soon here!
Two weeks ago our friends at Sony managed to get personal information of 70 million users stolen from them. I got one of the notification emails a couple days ago myself (I must have signed up for the Playstation Network when I installed the PS3 we bought to do our Cell VM work back at UCI.) In this instance, Sony is a shining example of how NOT to handle user data.
What's user data?
User data is any piece of identifiable data about a user. It can be all sorts of obvious stuff like your name, address, birth date, passwords (all of these Sony managed to lose), but also less obvious data such as usage history, communications, and what not. Whether Sony lost the latter is not clear. Some of the information Sony lost clearly should never have been stored in the first place. I understand why Sony asked me for my birth date when I signed up for PSN. Some jurisdictions want you to be a certain age before you can engage in virtual mayhem and violence that is modern video gaming. But why the hell did Sony store this information in a database? Why not just flag my account as “remembers the first techno song being played on the radio, definitely old enough”?
Risks
Storing data is always risky. If you store any data in the cloud, eventually someone will break into it. Investing a lot of money (expensive equipment, expensive practices, expensive staff) helps delay that day of reckoning, but only within limits. There are a lot of financial incentives to steal this kind of data. Personal information of 70 million people is a fantastic starting point for all sorts of phishing attacks. And even if only one out of 10,000 people getting that Nigerian email falls for it, that's still plenty of people to take advantage of. Sony took a risk storing user data. Unfortunately, it was not a well-calculated risk. They could have easily reduced the fallout from a breach by storing less information, i.e. by not storing the birth date, or maybe even not storing personal information at all! Shocking proposition, I know.
Not knowing is bliss
Why does Sony need the names of its PSN subscribers in the PSN user database to begin with? Let people choose a handle and a password. If you want to personalize the experience, let users choose their name. Dear PSN, you may call me Tracemonkey now. You really don't need to know my real name and address. As for payment information, I think it's OK if PSN asks me once a year to re-enter my credit card info, which is then briefly processed but never stored. Had they followed this simple principle of touching (and storing) as little user data as possible, they would have saved themselves a lot of legal trouble and liability.
Browsers
I am not ranting about this topic out of thin air. Web browsers handle a lot of personal information, and it's tempting for browser vendors to get in on the whole social-networks, online-transactions, online-identity game. A number of people at Mozilla don't seem too comfortable with ever hosting any user data on our infrastructure. It's really scary and risky after all (just ask Sony). I think that's wrong. We absolutely should get into the key areas of social and identity. Why? Because the state of the art is crappy, and we can do better.
Microsoft Passport anyone?
Web identity is a total mess. I have at least 30 accounts in various places with different account names and passwords (well, at least I try). Various organizations and services have tried to establish a single login. Microsoft Passport was one of the earlier ones. I am really glad that didn't work out. Can you imagine the evil empire owning all your personal data and online identity? Microsoft has a lot of incentives to use and abuse such a powerful position, and it certainly does. I still get emails from Microsoft about my Passport account on a regular basis, almost a decade later. It's usually an invitation to try this new Microsoft feature Y or maybe try chatting with my passport account or … well, whatever. Microsoft sits on a lot of data, and it's tempting to monetize it. And of course it's not just them. Everyone else is just as bad. Ever noticed how Google and Facebook customize ads for you based on the data they have about you? Creepy. This is why I think Mozilla can do a lot better. We don't have any hidden agenda. We don't have any extra services we want to sell you. We don't have to monetize data we store to turn a profit and make shareholders happy. We simply don't have any shareholders. We are a company owned by a non-profit foundation that wants to make the web a better place. This puts us in a much better position to do what's best for our users, instead of what's best for our quarterly statement (we don't publish any of those, in case you didn't notice.)
Playing it safe
We are currently in the process of figuring out how exactly Mozilla should handle user data. I have exactly zero authority when it comes to these kinds of decisions, but Mozilla is a pretty open and democratic place and we tend to discuss this stuff pretty openly, giving anyone a voice who wants to speak up. I think it's imperative that we follow a couple of guiding principles as we explore ways to better serve our users through services such as identity or social:
Always keep the users' best interest in mind (and only the users' best interests). I don't care if we can ship a feature faster or cheaper if we store more user data (or store it unencrypted instead of encrypted). Our new Sync service is a great example of this. It's a total pain in the butt to encrypt the browser history on the client before it is uploaded to our services, but it's the right thing to do. It means that in the case of Sync we can never see any of your browser history, even if we tried, and your data is safe by default.
Always store as little data as possible in the cloud. If there is a way to implement a feature completely in the client without us ever having to see user data, that's always the right approach, even if it's harder. This is exactly the issue we are facing with our new F1 social browsing feature. It allows you to share websites on Facebook/Twitter/etc as you visit them. It's a really cool feature; I use it all the time. Unfortunately, the protocol Facebook/Twitter/etc offer to authenticate and access their APIs (OAuth) is totally broken, and conceptually doesn't really work for client applications. OAuth requires the client (Firefox/F1) and Facebook/Twitter/etc to negotiate a shared secret (called the consumer key). With a pure client solution this secret can never be kept (someone could peek into Firefox/F1 and extract the key). It seems Facebook has blacklisted consumer keys before because people checked them into open-source repositories. The only alternative is to put the key behind a service Mozilla runs and then let Firefox/F1 post via that service, but that means we would be able to see (in theory, not intentionally of course) all the Facebook/Twitter/etc status updates of millions of people every day. That's wrong. As tempting and quick as it would be to set up a Mozilla service that keeps the key hidden and posts for users, we should never put ourselves in a position where we handle user data without an overwhelming need for it. In this particular instance we should simply negotiate with Facebook/Twitter/etc to not enforce the shared-secret rule (Twitter already doesn't, it seems, since there are so many Twitter client apps out there), and maybe in parallel we should work on better protocols than OAuth.
Going fast
As we were discussing these various architectural aspects of how to handle user data (or how not to handle it) over the last few weeks, some people were tempted to go the easy route and store a lot more user data (in particular, in the clear) than necessary because it might get us to market faster. I think this is wrong by the above two principles, but it's also wrong because it will NOT get us to market faster. Dealing with user data from an infrastructure perspective is a total pain. To handle or even store things like Facebook/Twitter/etc account authentication tokens or user contacts, we have to build out a serious security infrastructure. We need to hire expensive, highly trained personnel, and we have to seriously tighten our security practices. That doesn't mean we are unsafe right now; it just means our current practices match our current threat scenarios. For example, we have external IT administrators who don't even work for Mozilla administering our source code repository access controls. They are simply Mozilla project volunteers. Considering the limited risks, this is acceptable. When it comes to storing user data, entirely different standards will be needed. And getting all that sorted out and implemented will require a lot of audits, careful planning ... and a lot of time. So if you want to go fast, go zero user data. Or encrypt the user data on the client so all we get to see are blobs of meaningless zeros and ones. That's how Sync works, and we got it up and running within a couple of months. That's how you go fast.
What’s next?
We are currently discussing what our process will be to store user data. Expect people with actual authority to make decisions (and to talk about them) to start talking about this publicly in a few weeks. I already know that the result of our internal deliberations will be a policy that will focus on what’s best for our users, and that will minimize risks for them (and in the end, for us). And expect a safe and secure implementation of F1 to show up in your browser really soon. You can already try out the prototype now. It really rocks.
This blog post represents my personal opinion, not the official position of Mozilla.
As of yesterday evening, we have a self-hosting Rust compiler, which means that the Rust compiler written in Rust can compile itself. The resulting binary is identical to the compiler built by the Rust bootstrap compiler (which was written in OCaml). This is a big milestone for the project and represents the culmination of many months of focused work.
So the obvious question is: what's next? The short answer is "a lot"! Here are a few areas we'll be focusing on in the next few months.
Compiler performance. Currently, it takes 15 minutes on an Opteron for rustc to compile itself. That's far too slow to be practical. We should be at least 50× faster. This is a tall order, but, especially with the renewed emphasis on compiler performance that clang, GCC, and Go have spearheaded in recent years, we can't afford to ignore this.
Missing features. We have many features that are part of the core language that have yet to be implemented in rustc. Just to name a few, we need destructors, threads, object upcasts, garbage collection, stack growth, alias analysis for safety, and an x86-64 port. Some features are partially working but unfinished, such as pattern matching.
Language additions. Several new language features are under discussion. Most of these are attempts to solve difficulties in the language that we encountered while working on the Rust compiler. (Being able to find these difficulties early is one of the great advantages of writing the compiler in the language itself!) These features include unique pointers, self types, overloading, blocks/lambdas, and macros.
New syntax. This one is contentious, as syntax always is, but the general consensus is that some parts of the syntax are too difficult for their own good. Some of the mutually-agreed-upon facets of the syntax that we'd like to change are (a) to put items and types in different namespaces, (b) to make the switch-case syntax less spiky and verbose, and (c) to use ECMAScript 4-like .<> notation for type parameter instantiation; e.g. map.<int>(...);
It's an exciting time for Rust—there's been lots of forward progress and the language is slowly but surely starting to feel less like a toy. As always, you can find the current state of the compiler on GitHub.
For the last two weeks at work, my little PLT Redex model of Rust has been taking shape. It's been a lot of fun. I started with lambda calculus, like y' do, but it became clear that that doesn't really make sense as a language to grow Rust out of; for one thing, functions are not expressions in Rust.1 Still, one of the nice, reusable pieces to have come out of that attempt is that I implemented the λgc language from "Abstract Models of Memory Management" by Morrisett, Felleisen and Harper. This was a good way to figure out what I was doing with Redex. It was instructive to go from reading the paper to having running code for λgc. The paper didn't gloss over a lot, but it did gloss over a little! The hardest part was, as usual, Naming Things.
I was able to borrow some ideas from λgc for the Rust model, which I'm calling baby-rust. Whereas λgc models a program as an initially empty heap followed by an expression, baby-rust models a program as a list of "items" (definitions of functions and types), followed by an initially empty heap, followed by an expression. I'm still not modeling the stack explicitly, so this only takes me half as far as I need to go for a realistic model, but items, at least, are a real concept in Rust, so it's a start.
Just for fun, I threw in a baby-rust typechecker, named typeck just like the real Rust typechecker is, even though my brain still insists on pronouncing that as "type eccch". Like Rust, baby-rust has tuples and tuple indexing operations. For instance, in baby-rust you can pull bar out of the tuple (tup foo bar baz) using the expression (index (tup foo bar baz) 1). The best part was when I realized that any old thing that evaluates to a number should be able to appear where the 1 is in that expression, and that therefore, in order to determine the type of the whole expression, I needed to know not only that that thing is a number, but which number it is.2 So now the typechecker calls off to the reduction relation! Excellent!
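The punchline above, that the typechecker must evaluate the index rather than merely check that it's a number, looks roughly like this in C rather than Redex (a toy, not baby-rust itself; all names are invented, and index expressions are limited to literals and additions):

```c
#include <stddef.h>

/* A tiny index-expression language: a literal, or the sum of two
   sub-expressions. */
typedef struct expr { int is_add; int lit; const struct expr *l, *r; } expr;

static int eval_index(const expr *e) {
    return e->is_add ? eval_index(e->l) + eval_index(e->r) : e->lit;
}

typedef enum { T_INT, T_BOOL, T_ERR } type_t;

/* To type (index tup e), the checker calls off to evaluation (the
   "reduction relation") to learn which element type the index picks.
   Out-of-bounds indices are rejected "statically", i.e. at typechecking
   time. */
type_t type_of_index(const type_t *elem_types, size_t n, const expr *idx) {
    int i = eval_index(idx);
    if (i < 0 || (size_t)i >= n)
        return T_ERR;
    return elem_types[(size_t)i];
}
```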
I'm ready to start thinking seriously about self-types now, so I'm going to try to put the model on the shelf and come back to it later, but I'm a little worried that I won't be able to stay away. Building languages out of ((languages out of) ...) Scheme just feels so good.
Before you flip out and stab: Rust is still mostly an expression language, and it does have something called bind expressions, which are like closures. I'm not modeling those yet, though. Okay, carry on flipping out and stabbing.
In principle, I could also reject the program as ill-typed if it tried to index out of bounds. I'm actually almost already doing that: a call to Racket's list-ref raises a run-time error, but list-ref's run-time is baby-rust's typechecking time, so I'm "statically" rejecting baby-rust programs that index out of bounds before I "run" them. Dependent types haha woo? Now we just need to implement all high-assurance systems software in Redex and we're set, right?
As it turns out, I picked an interesting
month to start working at Mozilla. On Tuesday of my second week, Firefox 4 launched, followed by the launch of Firefox Mobile during my
third week, then the launch of a big
advertising campaign. And last week, which was my fourth, was "all
hands" week, during which Mozilla employees from around the world
descended on the main office here in Mountain View so that they could
work in person with folks that they usually mostly interact with over
IRC. Mozilla is widely distributed geographically; there are 400
employees, but only 150 or so are in Mountain View. The rest either
work from home or from the six other, much smaller Mozilla offices in
Auckland, Beijing, Paris, Tokyo, Toronto, and Vancouver. A lot of
those 400 are, like me, new employees -- 50 new hires in 2011 so far,
or so I hear. With all these events and product launches and new people
and visitors to entertain, it's been something of a non-stop party at work. As I told Jesse jes5199 when he was here to visit this weekend, "They just keep on feeding us and feeding us and giving us t-shirts and feeding us."
In spite of all the partying, we're managing to get quite a lot of
work done on Rust. The first Rust compiler, which we call the
bootstrap compiler, is written in OCaml and is being used to
develop a second compiler, which is self-hosted, that is, written in Rust. Very soon, we'll be at the point where the
self-hosted compiler can compile itself. When that happens, which should be within weeks, maybe even days, the
bootstrap compiler can go onto the scrap heap.
I've been working on the self-hosted compiler. As a starter project, I made an improvement to the typechecker's error messages: if you try to do something like, say, add 5 to an instance of an object, your code now fails to compile
with a nice, short error message like "expected int but found thingie",
instead of "expected int but found [hundred-line pretty-print dump of thingie's
internal structure]" as was happening before. Most of the infrastructure for doing this existed in the compiler already, so all I had to do was take advantage of it. I'm pleased, though, that I managed to land this first patch on March 22nd, which happened to be the day of the Firefox 4 launch. Actually, I'm impressed that I was lucid enough to write code that day at all, after having gotten up at five in the morning for the Firefox launch party!
Next, Patrick asked me to implement the n-body benchmark from the Computer Language Benchmarks Game in Rust. Here's how new of a language Rust is: I needed to read a string from the command line and parse it into an integer (which ends up being the number of time steps that a toy physics simulation runs for). The Java version, for instance, uses Integer.parseInt() for this. I poked around in the Rust standard library for a while, looking for the analogous thing in Rust and not finding it anywhere. Finally, I asked
on IRC, "How do I turn a string into an int?" People said, "Oh, I don't think that's implemented yet. Go for it!" And that's how I found myself writing a piece of code that sort of looks like atoi went to grad school and started dating ML.1
// eventually this'll live in a library, hopefully with some error
// checking or something. only works up to 2^31 - 1. quality first!
fn str_to_int(str s) -> int {
    let int res = 0;
    for (u8 a in s) {
        // 48 = ascii for '0'
        res = (res * 10) + (a as int - 48);
    }
    ret res;
}
This ended up not making it into our final version of the program, because we decided to just hard-code the number of steps that the simulation would run. It's just as well, because, you know, when code that I helped write is being demoed by Brendan Eich in front of 400 Mozilla employees on the first morning of all-hands week, I'd really prefer my contribution to not be the kind of code about which "quality first!" can only be said sarcastically.
I've now started working on my longer-term goal for the summer, which is to design and implement self-types (that is, "the type of the current object", as found in, for example, Scala) for the Rust type system. Rust has the beginnings of an object system, but until a couple weeks ago, there wasn't even such a thing as a self keyword yet, so the first thing I did was implement that, along with support for self-calls like self.foo(). This took about a week and a half and necessitated touching pretty much every part of the compiler. I learned a lot from doing it, and now you can compile programs you couldn't before. (And they even work!) But self still doesn't exist as a standalone, first-class entity; right now, it's really nothing more than a magic prefix to method calls, and you can't do things like return self, nor can you write functions that take arguments of type self. For that, we need self-types. I know very little about object systems, but I find it reassuring that, deep down in their heart of hearts, self-types are just a special case of recursive types. That shouldn't make me less scared, but somehow it does.
So, that's what's been going on! Finally, this week I've also been spending some time learning my way around PLT Redex so that we can use it to develop an operational semantics for (a very small subset of) Rust. We really need a semantics, and I've been wanting to try out Redex for a while, so now seems like a good opportunity. I just started learning Redex, and I don't have a lot of experience doing this kind of thing, so chances are I'm not doinitrite. But I'm doin' it out in the open, 'cause that's how we do.
My jokey commentary notwithstanding, you shouldn't read too much into the way Rust code looks right now. The syntax is subject to change, and it'll probably change significantly once we're self-hosting.