Miscellany

Friday, April 29, 2011

The road ahead for Rust

As of yesterday evening, we have a self-hosting Rust compiler, which means that the Rust compiler written in Rust can compile itself. The resulting binary is identical to the compiler built by the Rust bootstrap compiler (which was written in OCaml). This is a big milestone for the project and represents the culmination of many months of focused work.

So the obvious question is: what's next? The short answer is "a lot"! Here are a few areas we'll be focusing on in the next few months.

Compiler performance. Currently, it takes 15 minutes on an Opteron for rustc to compile itself. That's far too slow to be practical. We should be at least 50× faster. This is a tall order, but, especially with the renewed emphasis on compiler performance that clang, GCC, and Go have spearheaded in recent years, we can't afford to ignore this.

Missing features. We have many features that are part of the core language that have yet to be implemented in rustc. Just to name a few, we need destructors, threads, object upcasts, garbage collection, stack growth, alias analysis for safety, and an x86-64 port. Some features are partially working but unfinished, such as pattern matching.

Language additions. Several new language features are under discussion. Most of these are attempts to solve difficulties in the language that we encountered while working on the Rust compiler. (Being able to find these difficulties early is one of the great advantages of writing the compiler in the language itself!) These features include unique pointers, self types, overloading, blocks/lambdas, and macros.

New syntax. This one is contentious, as syntax always is, but the general consensus is that some parts of the syntax are too difficult for their own good. Some of the mutually-agreed-upon facets of the syntax that we'd like to change are (a) to put items and types in different namespaces, (b) to make the switch-case syntax less spiky and verbose, and (c) to use ECMAScript 4-like .<> notation for type parameter instantiation; e.g. map.<int>(...);

It's an exciting time for Rust—there's been lots of forward progress and the language is slowly but surely starting to feel less like a toy. As always, you can find the current state of the compiler on GitHub.

Thursday, February 24, 2011

Build Times Visualized

Here's a graph (whipped up in a couple of lines of bash) of where we spend our time when building a clean Firefox from scratch. Tested on Mac OS X 10.6 on a Core 2 Duo MacBook Pro, with a -j2 parallel build.

Sunday, December 5, 2010

C++ design goals in the context of Rust

Recently I came across Bjarne Stroustrup's often-cited list of design criteria for C++ again, and I thought it'd be interesting to compare it with my take on the design of Rust. Since Rust is meant to be used in many of the same scenarios as C++ is today, the design goals overlap, but Rust has a different focus that led to different choices. Note that this is my take only and not an official decree as to the design of the language by any means.

C++ is designed to be a statically typed, general-purpose language that is as efficient and portable as C.

Rust is designed to be a statically-typed, general-purpose language that is designed to be as efficient and portable as idiomatic C++, without sacrificing safety.

It's impossible to be "as fast as C" in all cases while remaining safe. As Dave Herman put it, "competing with C++ when you're safe is fighting with one arm tied behind your back". C++ allows all sorts of low-level tricks, mostly involving circumventing the type system, that offer practically unlimited avenues for optimization. In practice, though, C++ programmers restrict themselves to a few tools for the vast majority of the code they write, including stack-allocated variables owned by one function and passed by alias, uniquely owned objects (often used with auto_ptr or the C++0x unique_ptr), and reference counting via shared_ptr or COM. One of the goals of Rust's type system is to support these patterns exactly as C++ does, but to enforce their safe usage. In this way, the goal is to be competitive with the vast majority of idiomatic C++ in performance, while remaining memory-safe.

C++ is designed to directly and comprehensively support multiple programming styles (procedural programming, data abstraction, object-oriented programming, and generic programming).

Rust is designed to directly and comprehensively support multiple programming styles (procedural programming, data abstraction, object-oriented programming, generic programming, functional programming, and parallel programming). With the increasing popularity of dynamic languages like JavaScript and Ruby, functional programming has become a mainstream part of programmers' toolboxes, and we hope to fully support it. And parallel programming is becoming critical for good performance, and one of the main reasons for Rust's existence is that parallel programming is difficult in C++, especially with the traditional threaded shared-heap approach.

C++ is designed to give the programmer choice, even if this makes it possible for the programmer to choose incorrectly.

Rust is designed to give the programmer choice, but to guide the programmer to the best choice. You can use a mutable data structure in Rust, but you have to specify that in the type declaration, and you lose the ability to send such data over channels. You can use dynamic assertions throughout your code, but you cut down on check calls by performing the assertions as early as possible and propagating the constraints down with predicates. You can use unsafe code, but you have to mark the functions using it as unsafe and mark the associated modules as unsafe in the .rc file. Rust isn't intended to be a "bondage-and-discipline" language, because writing code in the recommended style is designed to be as straightforward and friendly as possible, but it is designed to make the programmer aware of aspects of the program that could have a negative impact on safety, performance, or correctness.

C++ is designed to be as compatible with C as possible, therefore providing a smooth transition from C.

Rust is designed to be as compatible with external C code as possible, therefore providing a smooth transition from C. The runtime features a separate C stack that's very quick (8 instructions) to switch to, and destructors allow C resources to be managed like Rust resources. The source language is obviously incompatible with C, but every language needs a good foreign function interface (FFI) to survive. We're already making heavy use of the FFI in the production compiler to interface to LLVM.

C++ avoids features that are platform specific or not general purpose.

Rust attempts to be as portable as possible, but we aren't particularly conservative regarding the standard library. Instead, the goal is to adopt a "batteries included" model à la Python. We diverge significantly from C++ here; we don't want different programs and libraries inventing things like vectors and hash maps.

C++ does not incur overhead for features that are not used (the "zero-overhead principle").

Rust incurs minimal overhead for features that aren't used. "Zero overhead" is too restrictive in practice, as, taken literally, it forbids features like garbage collection. Rather than aiming for zero overhead, I hope to achieve two goals: first, to make the runtime system simple enough that programmers aren't surprised by its behavior; and second, to make the system conservative enough that most software projects use most of the features anyway. Most commonly in large C++ projects, the "zero-overhead principle" is important with respect to two features: exceptions and runtime type information (RTTI). In Rust, exceptions don't exist (we need stack unwinding, but it's simpler than that of C++), and runtime type information does in fact conform to the zero-overhead principle (although runtime type information will be frequently used in practice due to the standard library's implementation of collections).

C++ is designed to function without a sophisticated programming environment.

Rust is also designed to function without a sophisticated programming environment. Rust crates are self-contained modules in the operating system's native format. There are no runtimes for end users to download beyond a small rustrt.dll runtime and the standard library.

Ultimately, Rust and C++ have more in common than not, but the few different design decisions buy us radically different safety, concurrency, and (we hope!) usability properties.

Tuesday, November 30, 2010

A syntax highlighting specification

In light of the Skywriter-Cloud9 collaboration, I've been working on a specification for cross-browser syntax highlighting engines. I now have something to announce, and I'm thrilled to have been able to work with not only the Skywriter team, but also Fabian Jakobs of Cloud9 and Marijn Haverbeke of CodeMirror fame, to develop a unified syntax highlighting specification for JavaScript code editors.

The idea is that people can share syntax highlighting engines across a variety of JavaScript code editors and viewers. Syntax highlighters are simple to create and are essentially a state machine consisting of JavaScript regexes. Here's an example of a simple diff/patch highlighter in this format:

exports.getInfo = function() {
    return {
        name: "diff",
        fileexts: [ "diff", "patch" ],
        mimetypes: [ "application/x-diff" ]
    };
}

exports.getRules = function() {
    return {
        start: [
            { regex: /\+.*/, token: 'addition.diff' },
            { regex: /-.*/, token: 'deletion.diff' },
            { regex: /.*/, token: 'plain' }
        ]
    };
}

As you can see, the format is very simple to get started with. But it's also fairly powerful: it supports nested syntax highlighting modes (for JavaScript inside HTML and the like), multiple states (so that strings and comments can be highlighted correctly), and all the features of JavaScript regexes.

It's my hope that this format will allow greater sharing among code editors. We're hoping to implement this soon in the Skywriter/Cloud9 codebase. The specification can be found here in an EtherPad, and I'd be very grateful for any and all feedback!

Friday, October 15, 2010

jsctags is now known as DoctorJS

Just a quick note that jsctags is now known as DoctorJS. You can find the new repository here on GitHub. It features an all-new abstract interpretation engine, courtesy of Dimitris Vardoulakis, that scales much better to large files and is capable of providing much better results. Check it out!

Friday, October 1, 2010

Rust Features I: Type Inference

As the Rust programming language slowly starts to get up and running, I thought I'd start a series of short blog posts to show off some of the more interesting features of the language. I'm going to do my best to restrain my enthusiasm and present only bits of the language that are currently working in the compiler. (Nevertheless, all of this can change at any time as the language evolves! This goes double for syntax; we have to make sure the semantics of the language is nailed down before we can start designing how the language looks and feels.)

I'll start with a simple feature: type inference. Along with structural typing, this allows Rust to express some pretty powerful concepts without the boilerplate seen in other languages. Rust features what I call feed-forward type inference, which is summed up by this rule: Variables have no type until they are assigned a value, at which point their type can't change.

Let's take a look at some simple examples to see how this works.

    auto x = 5;
    auto y = "foo";

Here, we see the auto keyword, which declares a variable and leaves the compiler to figure out the type. In this form, it's similar to var in C# or Go. Because a value is immediately assigned to the variable, Rust infers that x has the type int and that y has the type string.

Now for a more interesting example:

    auto x;
    if (localtime().hours >= 8) {
        x = "awake!"
    } else {
        x = "asleep, go away."
    }
    log "I'm " + x;

Here, we didn't initialize x in its declaration, but this code still compiles and runs properly. This is because Rust is able to wait until x is assigned to determine its type. Unlike, say, C# (and, of course, languages like ML in which variables must be assigned a value at declaration time), Rust doesn't require that auto declarations be initially assigned a value to determine a type for them. At the same time, unlike C, the compiler will emit an error if you ever forget to assign a value to the variable in any branch of your code. For instance, if you omit the else branch above, the code won't compile.

Where type inference becomes especially interesting is in combination with structural record types. Take this very simple example (note that a couple of these APIs are currently missing, so this won't work if you try it right now):

    log "Enter two vectors in the format 'x1 y1 x2 y2' and I'll give you the dot product:"
    auto fields = _str.split(input.read_line(), ' ');
    auto ints = _vec.map(fields, _int.of_str);  // convert to ints
    auto a = rec(x=ints.(0), y=ints.(1));
    auto b = rec(x=ints.(2), y=ints.(3));
    auto ans = a.x * b.x + a.y * b.y;
    log ans;

Take a look at the two lines that begin with auto a = and auto b =. They initialize a record that wasn't defined anywhere else in the program. The C99 equivalent of this would be:

    struct point {
        int x;
        int y;
    };
    
    ...
    
    point a = { ints[0], ints[1] };
    point b = { ints[2], ints[3] };

C doesn't have structural typing, so we're forced to declare the point type up front. In contrast, Rust allows you to effectively pull new record types out of thin air at any time, by simply listing their keys and values inside a rec() constructor. Combined with type inference, this affords you much of the convenience of the (roughly-)equivalent idiomatic Python declaration:

    a = { 'x': ints[0], 'y': ints[1] }
    b = { 'x': ints[2], 'y': ints[3] }

But you gain the type safety of Rust and performance comparable to the C version. (You do lose the dynamism that Python dictionaries provide, however; for example, a["x"] isn't allowed. For that, you'll want the hashmap type available in the standard library.)

Hopefully, these examples give you a feel for the type inference in Rust and what it can provide you. Although Rust isn't a scripting language, it's designed to bring much of the convenience and readability of scripting languages (dare I say "agile development"?) to large-scale projects.

Tuesday, May 25, 2010

Introducing jsctags

On the Developer Tools team, we're always interested in making development of web apps easier, no matter what editor or browser you're using. To that end, I'm happy to introduce a new code indexing solution for JavaScript designed to make development of large JavaScript applications a little simpler and faster. It'll be the foundation of the code completion features in Bespin, but since it's based on the venerable ctags format, you can use it now in many editors. The project's known as jsctags, and it's located here on GitHub. It requires only the node.js framework, which was selected because it's well-known and easy to set up.

ctags has become the de facto standard for simple code indexing tasks. Editors that support ctags, such as Vim and TextMate (with the TmCodeBrowser plugin), can read ctags-compatible tags files to allow you to quickly find information about symbols. You can jump to the location where a symbol is defined, perform autocompletion against names defined in the tags file, and get module and type information for any listed symbol. For example, here's a screenshot of Vim using the output of jsctags to inspect the attr function from jQuery:

The obvious question is: Hasn't this been done in JavaScript before? After all, languages like C++, Java, and Python have featured good support for ctags for years now. It's true, in fact, that the Exuberant Ctags tool features a JavaScript language module already. The difficulty, however, is that JavaScript, especially modern JavaScript, is hard. The traditional pure parser-based methods of tag extraction simply don't work very well for the kind of JavaScript people write today. Here's a comparison of the number of tags (roughly, the number of exported functions and variables) found in some popular JavaScript libraries and plugins by jsctags and Exuberant Ctags 5.8:

Library/Plugin	Exuberant Ctags	jsctags
jQuery 1.4.2	1	1219
jsTree 0.9.9a	52	72
Prototype 1.6.1	18	389
script.aculo.us 1.8.3	3	234
MooTools 1.2.4	170	192

Moreover, jsctags has more useful output for developers: it knows about CommonJS modules and the CommonJS packaging standard, and it can automatically detect the module system in use.

How does all this work? Briefly, what's going on behind the scenes in jsctags is a very simple form of what's known as abstract interpretation. jsctags actually interprets the scripts sent to it using an interpreter loosely based on Narcissus. Its interpretation is very simple and restricted: no loops or conditionals are executed, no external functions are callable by the script, unreferenced variables spring into existence instead of throwing ReferenceErrors, and so on. jsctags isn't a full-blown JavaScript interpreter by any means: all it's concerned about is how definitions make it to the window and/or exports object.

There will be more in store for jsctags in the future! Dimitris Vardoulakis is beefing up the parser to support Mozilla-specific extensions and is investigating the possibility of adding type inference. In the near future, we'll be using jsctags in Bespin to provide code completion.

Feel free to leave suggestions in the comments and fork the project on GitHub. Feedback is much appreciated!

~pcwalton

Miscellaneous notes from a research hacker