Rust Strings

27 Nov 21

If you've seen any Rust code, you've probably seen two string types: &str and String. These can be a little confusing to get used to, but they're actually simple.

You should think of the String like any other structure. All of the ownership rules we previously discussed apply as-is to the String type. The String type does not implement the Copy trait, meaning that assignments move the data to a new owner.
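To make the move behaviour concrete, here is a minimal sketch. The helper take_ownership and the variable names are made up for this example; they aren't part of any library.

```rust
// Hypothetical helper: takes ownership of the String and returns its length.
fn take_ownership(s: String) -> usize {
    s.len()
} // `s` is dropped (and its memory freed) here

fn main() {
    let a = String::from("goku");
    let b = a; // the String is moved from `a` to `b`
    // println!("{}", a); // would not compile: `a` was moved

    let c = b.clone(); // an explicit clone creates a second, independent String
    let len = take_ownership(b); // `b` is moved into the function
    println!("{} {}", c, len); // prints "goku 4"
}
```

Uncommenting the println! on `a` should produce a "borrow of moved value" error, which is the compiler enforcing the same ownership rules that apply to any other non-Copy structure.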

The &str type is a slice that references data owned by something else. This means that the &str type cannot outlive its owner.

To better understand strings, we need to look at some examples. However, there's one important detail we need to address: string literals are of type &str:

fn main() {
    let power_level = "9000!!!";
    println!("It's over {}", power_level);
}

In the above snippet power_level is a &str. This hopefully makes you ask: who owns the data that power_level references? For string literals, the data is baked into the executable's data section. We'll talk a little more about this later. For now, knowing that string literals are of type &str is enough to start understanding how the two types interact with each other and the ownership model.
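Because the data lives in the executable, a string literal's full type is &'static str: a slice that stays valid for the whole run of the program. As a small sketch (the function name power_level is only for illustration), this is why a function can return a literal without borrowing from any of its arguments:

```rust
// Hypothetical helper: the returned slice points at data baked into the
// executable, so it can safely carry the 'static lifetime.
fn power_level() -> &'static str {
    "9000!!!"
}

fn main() {
    let level: &'static str = power_level();
    println!("It's over {}", level);
}
```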

Let's write code that keeps a count of the words inputted into our program. First, let's look at the skeleton:

fn main() {
    loop {
        let mut input = String::new();
        std::io::stdin().read_line(&mut input).unwrap();

        let words: Vec<&str> = input
            .split_whitespace()
            .collect();
        println!("{:?}", words);
    }
}

Since input is a String, it owns what we typed, say "it's over 9000!!!" and words contains a list of slices referencing input. (We split_whitespace to create an iterator and use collect to automatically loop through the iterator and put the values into a list). This all works because our &str slices don't outlive the owner of the data they point to (input); it all falls out of scope, and thus gets freed, at the end of each loop iteration. To track the count of words across multiple inputs, you might try:

use std::collections::HashMap;

fn main() {
    // word => count
    let mut words: HashMap<&str, u32> = HashMap::new();

    loop {
        let mut input = String::new();
        std::io::stdin().read_line(&mut input).unwrap();

        for word in input.split_whitespace() {
            let count = match words.get(word) {
                None => 0,
                Some(count) => *count,
            };
            words.insert(word, count + 1);
        }
        println!("{:?}", words);
    }
}

(If you're wondering why we need to dereference count, i.e. Some(count) => *count, it's because the get method of HashMap returns a reference to the value, i.e. Option<&V>, which makes sense, as the HashMap still owns the value. Dereferencing copies the value out of the HashMap rather than moving it, which is allowed because u32 implements the Copy trait.)

The above snippet will not compile. It'll complain that input is dropped while still borrowed. If you walk through the code, you should come to the same conclusion. We're trying to store word in our words HashMap which outlives the data being referenced by word (i.e. input).

To prove to ourselves that the issue is with input's scope, we can "solve" this by moving words inside the loop:

fn main() {
    loop {
        let mut input = String::new();
        let mut words: HashMap<&str, u32> = HashMap::new();
        ...
    }
}

Now everything lives in the same scope, our loop, so everything works. But this "fix" doesn't satisfy our desired behaviour: we're now only counting words per input not across multiple inputs.

The real fix is to store Strings inside of our words counter, not &str:

use std::collections::HashMap;

fn main() {
    // Changed: HashMap<&str, u32> -> HashMap<String, u32>
    let mut words: HashMap<String, u32> = HashMap::new();

    loop {
        let mut input = String::new();
        std::io::stdin().read_line(&mut input).unwrap();

        for word in input.split_whitespace() {
            let count = match words.get(word) {
                None => 0,
                Some(count) => *count,
            };
            // Changed: word -> word.to_owned()
            words.insert(word.to_owned(), count + 1);
        }
        println!("{:?}", words);
    }
}

We changed two lines, highlighted by the two comments. Namely, instead of storing &str we store String, and to turn our word &str into a String we use the to_owned() method.

You'll probably find yourself using to_owned() frequently when dealing with strings. It's by far the simplest way to resolve any ownership and lifetime issues with string slices, but it's also, more often than not, semantically correct. In the above code, it's "right" that our words counter owns the Strings: the existence of the keys in our map should be tied to the map itself.
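As an aside, the get-then-insert pattern can be written more compactly with HashMap's entry API. The following is only a sketch of that alternative, not a change to the approach: the helper count_words is hypothetical, and it takes its input as a list of lines instead of reading stdin so that it's easy to run.

```rust
use std::collections::HashMap;

// Hypothetical helper (not from the code above): counts words across lines.
fn count_words(lines: &[&str]) -> HashMap<String, u32> {
    let mut words: HashMap<String, u32> = HashMap::new();
    for line in lines {
        for word in line.split_whitespace() {
            // entry() takes the key by value, so we still need to_owned():
            // the map owns its keys, exactly as before.
            *words.entry(word.to_owned()).or_insert(0) += 1;
        }
    }
    words
}

fn main() {
    let words = count_words(&["it's over 9000!!!", "over and over"]);
    println!("{:?}", words.get("over")); // Some(3)
}
```

Note that this version calls to_owned() for every word, even ones already in the map, so it is shorter but not cheaper than the original.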

Performance / Allocations

A String represents allocated memory on the heap. When the owning variable falls out of scope, the memory is freed. A &str references all or part of the memory allocated by a String, or in the case of a string literal, it references a part of our executable's data section. When we call to_owned() on a string slice (&str) a String is created by allocating memory on the heap.

This means that the above code allocates memory for each word that we type. A language with a garbage collector, such as Go, could implement the above more efficiently. But that efficiency would come with two significant costs: a garbage collector to track what data is and isn't being used, and a lack of transparency around memory allocation. Specifically, a slice in Go prevents the underlying memory from being garbage collected, which isn't always obvious and certainly isn't always efficient (you could pin gigabytes worth of data for a single small slice).

Rust is very flexible. We could write an implementation similar to Go's, but it would require considerably more code.

Greater transparency helps explain why string literals are represented as &str instead of String. Imagine a string literal in a loop:

fn main() {
    loop {
        let mut input = String::new();
        println!("> ");
        std::io::stdin().read_line(&mut input).unwrap();
        ...
    }
}

Representing "> " as a String would require allocating it on the heap for each iteration. This might not be obvious, and it certainly isn't necessary. Treating string literals as a &str means that allocation only happens when we explicitly require it (via to_owned()).

Mutability

From an implementation and mutability point of view, the String type behaves like Java's StringBuffer, .NET's StringBuilder and Go's strings.Builder. The push() and push_str() methods are used to append values to the string. Like any other data, these mutations require the binding to be declared as mutable:

fn main() {
    // note same as: String::from("hello")
    let fail = "hello".to_owned();
    fail.push_str(" world"); // Not mutable, won't compile

    // note same as: String::from("hello")
    let mut ok = "hello".to_owned();
    ok.push_str(" world"); // Mutable, will work
}

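A &mut str, on the other hand, is something you'll rarely, if ever, use. It doesn't own the underlying data, so it can't grow or shrink it; the only changes it allows are in-place ones that keep the byte length the same, such as make_ascii_uppercase().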
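For completeness, here is a small sketch of one of the few things a &mut str can do: change bytes in place without changing the length. The helper shout is made up for this example; make_ascii_uppercase() and as_mut_str() are standard library methods.

```rust
// Hypothetical helper: upper-cases ASCII letters in place through a &mut str.
// It cannot append to or shorten the string, since it doesn't own the memory.
fn shout(s: &mut str) {
    s.make_ascii_uppercase();
}

fn main() {
    let mut name = String::from("vegeta");
    shout(name.as_mut_str()); // borrow the String's contents as a &mut str
    println!("{}", name); // prints "VEGETA"
}
```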

Just like you'll commonly use to_owned() to ensure the ownership/lifetime of the value, you'll also commonly use to_owned() to mutate (often in the form of appending) the string. Fundamentally, both of these concepts are tied to the fact that String owns its data.

String -> &str

We saw how to_owned() (or the identical String::from) can be used to turn a &str into a String. To go the other way, we use the [start..end] slice syntax:

fn main() {
    let hi = "Hello World".to_owned();
    let slice = &hi[0..5];
    println!("{}", slice);
}

Notice that we did &hi[0..5] and not hi[0..5]. This is because there is a str type, but it isn't particularly useful on its own: its size isn't known at compile time, so it can only be used behind a reference. Technically, str is the slice and &str is a reference to it, stored as a pointer plus a length. But str is so infrequently used by itself that people just refer to &str as a slice.
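One detail worth knowing about the slice syntax: the indexes are byte offsets, and slicing in the middle of a multi-byte character panics at runtime. The sketch below uses the standard get() method, which returns an Option instead of panicking. The helper name prefix is hypothetical.

```rust
// Hypothetical helper: returns the first `n` bytes as a slice, or None if
// `n` is out of range or falls inside a multi-byte character.
fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(0..n)
}

fn main() {
    let hi = "Hello World".to_owned();
    println!("{:?}", prefix(&hi, 5)); // Some("Hello")

    let accented = "h\u{e9}llo"; // U+00E9 (e with acute) takes two bytes in UTF-8
    println!("{:?}", prefix(accented, 2)); // None: byte 2 is inside U+00E9
    println!("{:?}", prefix(accented, 3)); // Some: bytes 0..3 cover both characters
}
```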

You'll often write or use functions which don't need ownership of the string. Logically these functions should accept a &str. For example, consider the following function:

fn is_https(url: &str) -> bool {
    url.starts_with("https://")
}

Since it doesn't need ownership of the parameter, &str is the correct choice. We can, obviously, call this function with a &str either directly or by slicing a String. But since this is so common, the Rust compiler will also let us call this function with a &String, which it automatically converts to a &str (this is known as deref coercion):

// called with a &str
is_https("https://www.openmyind.net");

// called with a &String
let mut input = String::new();
std::io::stdin().read_line(&mut input).unwrap();
is_https(&input);

// exact same as previous line
is_https(&input[..]);

Common String Tasks

Here are a few common things you'll likely need to do with strings.

To create a new String from other Strings or &strs (you can mix and match) use the format! macro:

let key = format!("user.{}", self.id);
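Since the line above depends on a self that isn't shown, here is a self-contained sketch of the same idea that mixes a &str, a String and an integer. The helper user_key and its parameters are hypothetical.

```rust
// Hypothetical helper: builds a key from a &str prefix, a String name and an id.
fn user_key(prefix: &str, name: String, id: u32) -> String {
    // format! accepts anything that implements Display and returns a new String
    format!("{}.{}.{}", prefix, name, id)
}

fn main() {
    let key = user_key("user", String::from("goku"), 9001);
    println!("{}", key); // prints "user.goku.9001"
}
```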

To create a String from a Vec<u8> use String::from_utf8 (for a borrowed &[u8], std::str::from_utf8 gives you a &str instead). Note that this returns a Result as it will check to make sure the provided bytes are a valid UTF-8 string. A str is really just a [u8], so a &str is really a &[u8], both with the added restriction that the underlying slice must be a valid UTF-8 string. Similarly, a String is a Vec<u8>, also with the same additional requirement.
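A short sketch of both conversions follows. The helper bytes_to_string is made up for illustration; String::from_utf8 and std::str::from_utf8 are the standard library functions.

```rust
// Hypothetical helper: converts owned bytes into a String, or None if the
// bytes are not valid UTF-8.
fn bytes_to_string(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    let valid = vec![104, 105]; // the bytes for "hi"
    let invalid = vec![0xff, 0xfe]; // 0xff never appears in valid UTF-8
    println!("{:?}", bytes_to_string(valid)); // Some("hi")
    println!("{:?}", bytes_to_string(invalid)); // None

    // For a borrowed byte slice, std::str::from_utf8 validates the bytes and
    // returns a &str that points at the same memory (no allocation).
    let slice: &[u8] = b"hello";
    println!("{:?}", std::str::from_utf8(slice));
}
```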

Because String wraps a Vec<u8>, the String::len() method returns the number of bytes, not characters. The chars() method returns an iterator over the characters, so chars().count() will return the number of characters (and is O(N)). Note that chars() returns an iterator over Unicode Scalar Values, not graphemes. For graphemes, you'll want to use an external crate.
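The difference is easiest to see with non-ASCII text. In this sketch the helper sizes is hypothetical, and the strings are written with \u{...} escapes so the exact code points are unambiguous.

```rust
// Hypothetical helper: returns (number of bytes, number of chars).
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    println!("{:?}", sizes("hello")); // (5, 5)
    println!("{:?}", sizes("h\u{e9}llo")); // (6, 5): U+00E9 is two bytes
    // 'e' followed by a combining acute accent (U+0301): it displays as one
    // grapheme but is two chars and three bytes.
    println!("{:?}", sizes("e\u{301}")); // (3, 2)
}
```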

There's a FromStr trait (think interface) which many types implement. This is used to parse a string into a given type. The implementation for bool is easy to understand:

fn from_str(s: &str) -> Result<bool, ParseBoolError> {
    match s {
        "true" => Ok(true),
        "false" => Ok(false),
        _ => Err(ParseBoolError),
    }
}

To convert a string to a boolean, or a string to an integer, use:

// unwrapping will panic if the parsing fails!
let b: bool = "true".parse().unwrap();
let i: u32 = "9001".parse().unwrap();
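When the input comes from a user, you'll usually want to handle the failure instead of unwrapping. A minimal sketch, assuming we're happy to treat any bad input as "no value" (the helper parse_level is hypothetical):

```rust
// Hypothetical helper: parses a power level, returning None for bad input.
fn parse_level(s: &str) -> Option<u32> {
    // trim() first: read_line keeps the trailing newline, which parse() rejects
    s.trim().parse::<u32>().ok()
}

fn main() {
    match "9001\n".trim().parse::<u32>() {
        Ok(n) => println!("parsed {}", n),
        Err(e) => println!("bad input: {}", e),
    }
    println!("{:?}", parse_level("over 9000")); // None
}
```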

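Finally, as we already discussed, String has a push() method (for characters) and a push_str() method (for strings) to append values onto a String. You'd also be right to expect other mutating methods such as insert(), remove(), truncate() and clear(). Be aware that some methods that sound like they mutate, such as trim() and replace(), leave the original untouched: trim() returns a &str slice and replace() returns a new String.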
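To tie these together, here is a sketch that combines a few of these methods. The helper tidy and its input format (a leading '>' prompt character) are invented for the example.

```rust
// Hypothetical helper: strips whitespace and a leading '>' prompt character,
// then rewrites and extends the text.
fn tidy(input: &str) -> String {
    // trim() borrows: it returns a &str pointing into `input`.
    let trimmed: &str = input.trim();
    // replace() allocates and returns a new String; `trimmed` is unchanged.
    let mut s: String = trimmed.replace("over", "under");
    // push_str(), push() and remove() mutate `s` in place.
    s.push_str(" 9000");
    s.push('!');
    s.remove(0); // removes (and returns) the char at byte index 0, here '>'
    s
}

fn main() {
    println!("{}", tidy("  >it's over \n")); // prints "it's under 9000!"
}
```

Keep in mind that remove() panics if the index is out of bounds or not on a character boundary, so this sketch assumes a non-empty input.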