Rewriting the lexer benchmark in Rust
I've reimplemented my lexical analyzer (lexer) for the TableGen language in Python, JS and Go so far. The latest post in the series discussed how several years of new Go versions improved my lexer's performance by roughly 2x, and how several additional optimizations won another 37%; the final result is a lexer that churns through 1 MiB of source code in just 5.6 milliseconds.
Since I've also been playing with Rust recently, I thought it would be interesting to re-implement this lexer once again, this time in Rust. Rust is certainly the lowest-level language among those I've used so far for this task, so I expect to see some top-notch performance.
The full code for this post is available on GitHub.
Designing an API

I find that Rust's strict ownership rules make one think carefully about API design from very early on. If we want to create a lexer and pass it a string as input - who owns the string? In Rust, the answer to this question is not an implicit contract (like it always is in C and sometimes in C++), and it cannot be deferred to the runtime either (like one would do in Python, JS or Go). The answer has to be explicitly encoded into the types of the program.
Since one of the goals of this series of posts is performance, I decided to go with a zero-copy API at first, where the user of the lexer owns the input string; the lexer, in turn, returns tokens that contain references into this input string - so the user ends up owning these too. Rust's lifetime specifiers make this fairly natural; here's the type:
```rust
pub struct Lexer<'source> {
    input: &'source str,
    iter: Peekable<CharIndices<'source>>,

    // c is the last char taken from iter, and ci is its offset in the input.
    c: char,
    ci: usize,

    // error is true iff the lexer encountered an error.
    error: bool,
}
```

Ignore the fields for now, focusing just on the struct definition. This is the constructor:
```rust
impl<'source> Lexer<'source> {
    pub fn new(input: &'source str) -> Self {
        // ...
    }
}
```

The 'source lifetime specifier is used to explicitly annotate the lifetime of the input string slice, since we want to store it in our Lexer and refer to it later. Once a Lexer is created, the user obtains new tokens by calling next_token:
```rust
pub fn next_token(&mut self) -> Token<'source> {
    // ...
}
```

Note that the returned Token also has the same lifetime annotation. Here's how the type is defined:
```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Token<'source> {
    pub value: TokenValue<'source>,
    pub pos: usize,
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TokenValue<'source> {
    EOF,
    Error,
    Plus,
    Minus,
    Multiply,
    Divide,
    Period,
    Backslash,
    Colon,
    Percent,
    Pipe,
    Exclamation,
    Question,
    Pound,
    Ampersand,
    Semi,
    Comma,
    LeftParen,
    RightParen,
    LeftAng,
    RightAng,
    LeftBrace,
    RightBrace,
    LeftBracket,
    RightBracket,
    Equals,
    Comment(&'source str),
    Identifier(&'source str),
    Number(&'source str),
    Quote(&'source str),
}
```

Now it should be clear how these lifetimes are tied together. Some tokens hold slices into the user's input, and this is encoded in the explicit lifetimes. The signature of the next_token method says "we return tokens with a lifetime that's tied to the lifetime of the input passed into the constructor".
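To see what this buys us, here's a minimal usage sketch (this snippet is mine, not from the repository); the borrow checker guarantees that tokens never outlive the input they borrow from:

```rust
fn main() {
    let input = String::from("x + 42");
    let mut lex = Lexer::new(&input);
    let tok = lex.next_token();
    println!("{:?} at {}", tok.value, tok.pos);

    // Something like the following would be rejected at compile time,
    // because `tok` (tied to 'source) would outlive `input`:
    //
    //   drop(input);
    //   println!("{:?}", tok.value);
}
```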
We can also provide a more natural iteration API for the Lexer by implementing the Iterator trait:
```rust
// Lexer is an Iterator; it returns tokens until EOF is encountered, when it
// returns None (the EOF token itself is not returned). Note that errors are
// still returned as tokens with TokenValue::Error.
impl<'source> Iterator for Lexer<'source> {
    type Item = Token<'source>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.error {
            // If an error has already been set before we invoke next_token,
            // it means we've already returned TokenValue::Error once and now
            // we should terminate the iteration.
            return None;
        }
        let tok = self.next_token();
        if tok.value == TokenValue::EOF {
            None
        } else {
            Some(tok)
        }
    }
}
```

The iterator implementation makes it possible to integrate the lexer with the rest of Rust very elegantly.
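For instance, counting all identifiers in the input requires nothing beyond the standard iterator adapters (this snippet is mine; the `lexer` module path is an assumption):

```rust
use lexer::{Lexer, TokenValue};

// Count identifier tokens using plain iterator adapters - no manual loop.
fn count_identifiers(input: &str) -> usize {
    Lexer::new(input)
        .filter(|tok| matches!(tok.value, TokenValue::Identifier(_)))
        .count()
}
```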
For example, to obtain all the tokens in a given input as a vector:

```rust
pub fn tokenize_all_collect<'source>(data: &'source str) -> Vec<Token<'source>> {
    let lex = Lexer::new(&data);
    lex.collect()
}
```

Implementation

Generally, the implementation of the lexer in Rust follows the same approach used by my previous hand-written lexers in this series. I'd like to highlight a couple of Rust-specific aspects here.
Our lexer fully supports Unicode, so I decided to use Rust's string iterator support to obtain chars from &str. Rust provides a helpful iterator called CharIndices, which yields chars along with their position in the input - we need the position to report token locations. Furthermore, since our lexer requires a bit of look-ahead [1], the iterator is wrapped in Peekable, which provides the peek method. As we've seen above already, the final definition is:
```rust
iter: Peekable<CharIndices<'source>>,
```
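The lexer advances through the input with a small helper; here's a plausible sketch of the scan_char method referenced by the scanning code below (the actual code in the repository may differ in details such as the end-of-input sentinel):

```rust
impl<'source> Lexer<'source> {
    // Advance to the next char, updating `c` and `ci`. At the end of
    // input, `c` becomes a NUL sentinel and `ci` points one past the
    // last byte, so pending slices terminate correctly.
    fn scan_char(&mut self) {
        match self.iter.next() {
            Some((index, chr)) => {
                self.ci = index;
                self.c = chr;
            }
            None => {
                self.ci = self.input.len();
                self.c = '\x00';
            }
        }
    }
}
```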
With this in mind, the code of the lexer should be very readable, even for someone not too familiar with Rust.
Another note is how sub-string extraction happens when tokens are returned. As an example, let's look at the scan_number method which is invoked when the lexer's current character is a digit:
```rust
fn scan_number(&mut self) -> Token<'source> {
    let startpos = self.ci;
    while self.c.is_digit(10) {
        self.scan_char();
    }
    Token {
        value: TokenValue::Number(&self.input[startpos..self.ci]),
        pos: startpos,
    }
}
```

Recall that the Number variant of the TokenValue enum is defined as Number(&'source str) - it contains a reference to the input source string. In the code for scan_number, we see how this is actually implemented, by sub-slicing the input slice. This creates a slice with the same lifetime as the input slice (which is encoded in the lifetimes of the types). As in Go, sub-slicing is a very cheap operation in Rust (no heap allocation).
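To make the "no heap allocation" point concrete, here's a tiny standalone demonstration (not from the lexer itself) showing that a sub-slice is just a pointer into the original buffer plus a length:

```rust
fn main() {
    let input = String::from("foo 12345 bar");
    let sub = &input[4..9];
    assert_eq!(sub, "12345");
    // The slice points directly into `input`'s buffer; nothing was copied.
    assert_eq!(sub.as_ptr(), input[4..].as_ptr());
    println!("sub-slicing allocates nothing");
}
```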
Performance

The performance of this Rust-implemented lexer is very good! I ran it on the same large TableGen file I've used for all the benchmarking, and it finishes tokenizing it in just 3.7 ms; this is about 33% faster than my fastest Go version from the previous post.
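The benchmark harness isn't reproduced in this post; a minimal sketch of such a measurement could look like this (the file name, iteration count and `lexer` module path are my own placeholders):

```rust
use lexer::Lexer;
use std::time::Instant;

fn main() {
    // Placeholder path; substitute any large TableGen source file.
    let data = std::fs::read_to_string("input.td").expect("failed to read input");
    let iters: usize = 100;

    let start = Instant::now();
    let mut total_tokens = 0;
    for _ in 0..iters {
        // Counting consumes the iterator, forcing a full tokenization pass.
        total_tokens += Lexer::new(&data).count();
    }
    let per_iter_ms = start.elapsed().as_secs_f64() * 1000.0 / iters as f64;
    println!("{} tokens, {:.3} ms per pass", total_tokens / iters, per_iter_ms);
}
```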
Profiling Rust is quite a bit trickier than Go, but I managed to cobble together enough magic flags and perf invocations to ascertain that next_token indeed takes the bulk of the time; this is good - the time is being spent where it's supposed to be spent.
Trying a variant with allocations

As described above, the API for this Rust lexer was designed to be zero-copy. Since my Go lexers experimented with different approaches, I've decided to see how a different API in Rust would look.
There are two aspects to ownership we can change: taking ownership of the inputstring, and/or returning owned Strings in the tokens.
For taking ownership of the input string, our constructor would have to look like:
```rust
pub fn new(input: String) -> Self
```
Does it make sense for a lexer to own its input? This isn't clear, and in reality it turned out to be tricky to implement due to lifetime issues. Rust is very unhappy when a struct field is a reference to another field of the same struct, because there is no way to guarantee proper destruction order [2]. In our case, the iterator (a struct field) needs a reference to the string (another field in the same struct), and this doesn't pass the Rust compiler's scrutiny. I wanted to be able to implement this without actively fooling the compiler by using opaque indices or unsafe.
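For the curious, this is roughly the shape the borrow checker rejects; it's an illustrative sketch (not code from the repository) and intentionally does not compile:

```rust
use std::iter::Peekable;
use std::str::CharIndices;

// There is no lifetime we can write for `iter` that means "borrows from
// the `input` field of this very struct".
struct OwningLexer<'a> {
    input: String,
    iter: Peekable<CharIndices<'a>>,
}

impl<'a> OwningLexer<'a> {
    fn new(input: String) -> Self {
        let iter = input.char_indices().peekable();
        // error: cannot move out of `input` because it is borrowed;
        // `iter` holds a reference into the String we're trying to move.
        OwningLexer { input, iter }
    }
}
```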
When the language fights you this hard, it may be a good sign that the design is wrong - it's better to leave the ownership to the code that creates a Lexer.
Ownership of returned tokens was easier to set up. The code for this is available in the owning.rs file of the repository. In this variant, the constructor doesn't change, but tokens are defined differently:
```rust
#[derive(Debug, PartialEq, Clone)]
pub struct Token {
    pub value: TokenValue,
    pub pos: usize,
}

#[derive(Debug, PartialEq, Clone)]
pub enum TokenValue {
    // .. types
    Comment(String),
    Identifier(String),
    Number(String),
    Quote(String),
}
```

Note that variants like Identifier now hold an owning String instead of a string reference. Therefore, lifetime annotations are no longer necessary on Token.
The scanning code now allocates a new String and reads characters into it from the iterator. We've seen previously how scan_number looks; here it is again, for the owning token variant (with some helper methods):
```rust
fn scan_number(&mut self) -> Token {
    let startpos = self.ci;
    Token {
        value: TokenValue::Number(self.scan_while_true(|c| c.is_digit(10))),
        pos: startpos,
    }
}

// Helper to scan chars while `pred(c)` returns true, into the given `s`.
fn scan_while_true_into<F>(&mut self, s: &mut String, pred: F)
where
    F: Fn(char) -> bool,
{
    while pred(self.c) {
        s.push(self.c);
        self.scan_char();
    }
}

// Helper to scan chars while `pred(c)` returns true and return all scanned
// chars in a new String.
fn scan_while_true<F>(&mut self, pred: F) -> String
where
    F: Fn(char) -> bool,
{
    let mut s = String::with_capacity(8);
    self.scan_while_true_into(&mut s, pred);
    s
}
```

Performance of this variant

When I first tried this variant, its performance wasn't great at all! It was about 30% slower than the string-copying version in Go. I think this has to do with the slight difference in approach - in Go, my lexer figures out the token boundaries using numerical indices and then converts a []byte into a string in one fell swoop. The Rust version fetches chars one by one from an iterator and writes them into a String.
In particular, since Rust's String is dynamically allocated, this may incur reallocations (depending on how much is allocated initially).
So my solution was to create these strings with_capacity - as you can see in the previous code sample. This cut down the execution time to be roughly equal to Go's version where strings are copied.
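A quick way to observe the reallocation behavior in isolation (the growth steps printed are an implementation detail of the standard library, not a guarantee):

```rust
fn main() {
    let mut s = String::new();
    let mut last_cap = s.capacity();
    for c in "a_fairly_long_identifier".chars() {
        s.push(c);
        if s.capacity() != last_cap {
            // Each capacity change is a reallocation plus a copy.
            println!("reallocated: {} -> {} bytes", last_cap, s.capacity());
            last_cap = s.capacity();
        }
    }

    // Pre-sizing avoids all of that for short tokens.
    let mut t = String::with_capacity(8);
    let cap_before = t.capacity();
    t.push_str("12345678");
    assert_eq!(t.capacity(), cap_before); // no reallocation occurred
}
```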
Which API is better?

IMHO there's little doubt that the original "slice" API is better, for multiple reasons:
- Performance: the results speak for themselves - the slice API is zero-copy and incurs no additional heap allocations. Like in Go, it deals in slice headers, which are just pointers to parts of a string.
- The ownership story is very clear, symmetrical and explicitly documented with lifetime annotations. The 'source lifetime controls everything: it's the lifetime of the &str passed into the lexer's constructor, and it's the lifetime of the tokens. The symmetry feels important - the code that creates a lexer controls the lifetime of the input, as well as the output.
- In general, there's a known good practice in API design which is not to force allocations on users, if possible. This is well articulated in this blog post for Go, but it applies equally well in Rust. In our slice API, the user can always call to_owned on the returned slice, if they so wish (see the sketch after this list). But why do it for them? Why should we assume the users want us to return them an owned string? Returning a slice provides more flexibility in using the API.
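For instance, a user who does want owned strings can build them with one call (again a hypothetical snippet; the `lexer` module path is an assumption):

```rust
use lexer::{Lexer, TokenValue};

fn owned_identifiers(input: &str) -> Vec<String> {
    Lexer::new(input)
        .filter_map(|tok| match tok.value {
            // The caller opts into the allocation only where it's needed.
            TokenValue::Identifier(name) => Some(name.to_owned()),
            _ => None,
        })
        .collect()
}
```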
[1] For properly tokenizing comments; when the lexer sees /, it needs to peek forward to see if this is the beginning of a //, or else the division operator. This is a common use case in lexers, especially when tokenizing multi-character operators (like >=).

[2] "Destruction order" exposes my C++ background; in Rust this is typically called Drop.