I am trying to write a simple TCP/IP client in Rust, and I need to print out the buffer I got from the server.
How do I convert a Vec<u8> (or a &[u8]) to a String?
To convert a slice of bytes to a string slice (assuming a UTF-8 encoding):
use std::str;
//
// pub fn from_utf8(v: &[u8]) -> Result<&str, Utf8Error>
//
// Assuming buf: &[u8]
//
fn main() {
    let buf = &[0x41u8, 0x41u8, 0x42u8];

    let s = match str::from_utf8(buf) {
        Ok(v) => v,
        Err(e) => panic!("Invalid UTF-8 sequence: {}", e),
    };

    println!("result: {}", s);
}
The conversion is in-place, and does not require an allocation. You can create a String from the string slice if necessary by calling .to_owned() on the string slice (other options are available).
If you are sure that the byte slice is valid UTF-8, and you don’t want to incur the overhead of the validity check, there is an unsafe version of this function, from_utf8_unchecked, which has the same behavior but skips the check.
If you need a String instead of a &str, you may also consider String::from_utf8 instead.
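For illustration, here is a minimal sketch combining those options, reusing the same buf as above (the variable names are just for the example):
use std::str;

fn main() {
    let buf = &[0x41u8, 0x41u8, 0x42u8];

    // Checked conversion, then promote the borrowed &str to an owned String.
    let owned: String = str::from_utf8(buf).unwrap().to_owned();

    // Unchecked conversion: skips validation, so the caller must guarantee
    // that `buf` is valid UTF-8, otherwise this is undefined behavior.
    let unchecked: &str = unsafe { str::from_utf8_unchecked(buf) };

    // Consuming a Vec<u8> directly into a String.
    let from_vec: String = String::from_utf8(buf.to_vec()).unwrap();

    println!("{} {} {}", owned, unchecked, from_vec);
}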
The library references for the conversion function:
std::str::from_utf8
std::str::from_utf8_unchecked
std::string::String::from_utf8
I prefer String::from_utf8_lossy:
fn main() {
    let buf = &[0x41u8, 0x41u8, 0x42u8];
    let s = String::from_utf8_lossy(buf);
    println!("result: {}", s);
}
It turns invalid UTF-8 bytes into �, so no error handling is required. That's good for when you don't need error handling, which in my experience is most of the time, and you effectively get a String out of it. It should make printing out what you're getting from the server a little easier.
Note that from_utf8_lossy actually returns a clone-on-write Cow<str>, so sometimes you may need to call its into_owned() method to get a true String.
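A small sketch of that clone-on-write behavior (the byte literals here are just examples):
use std::borrow::Cow;

fn main() {
    // Valid UTF-8: no allocation, the Cow borrows the input bytes.
    let ok = String::from_utf8_lossy(b"AAB");
    assert!(matches!(ok, Cow::Borrowed(_)));

    // Invalid UTF-8: a fresh String is allocated with � substituted.
    let fixed = String::from_utf8_lossy(b"AA\xFF");
    assert!(matches!(fixed, Cow::Owned(_)));

    // into_owned() yields a plain String in either case.
    let s: String = fixed.into_owned();
    println!("{}", s); // AA�
}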
If you actually have a vector of bytes (Vec<u8>) and want to convert it to a String, the most efficient way is to reuse the allocation with String::from_utf8:
fn main() {
    let bytes = vec![0x41, 0x42, 0x43];
    let s = String::from_utf8(bytes).expect("Found invalid UTF-8");
    println!("{}", s);
}
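One way to observe that the allocation really is reused, relying on the documented no-copy behavior of String::from_utf8 (the pointer comparison is only a demonstration aid):
fn main() {
    let bytes = vec![0x41, 0x42, 0x43];
    let ptr = bytes.as_ptr();

    // from_utf8 consumes the Vec and keeps its heap buffer.
    let s = String::from_utf8(bytes).expect("Found invalid UTF-8");
    assert_eq!(s.as_ptr(), ptr); // same allocation, no copy

    println!("{}", s); // ABC
}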
In my case I just needed to turn the numbers into a string, not interpret them as characters in some encoding, so I did:
fn main() {
    let bytes = vec![0x41, 0x42, 0x43];
    let s = format!("{:?}", &bytes);
    println!("{}", s); // [65, 66, 67]
}
To optimally convert a Vec<u8> possibly containing non-UTF-8 byte sequences into a UTF-8 String without any unneeded allocations, optimistically try String::from_utf8() and resort to String::from_utf8_lossy() only if that fails:
let buffer: Vec<u8> = ...;
let utf8_string = String::from_utf8(buffer)
    // Fallback: lossily decode the original bytes, replacing invalid
    // sequences with � and allocating a new String only in this case.
    .unwrap_or_else(|non_utf8| String::from_utf8_lossy(non_utf8.as_bytes()).into_owned());
The approach suggested in the other answers will result in two owned buffers in memory even in the happy case (with valid UTF-8 data in the vector): one with the original u8 bytes and the other in the form of a String owning its characters. This approach will instead attempt to consume the Vec<u8> and turn its allocation into a Unicode String directly; only if that fails will it allocate room for a new string containing the lossily UTF-8 decoded output.
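Wrapped in a helper for clarity (the name to_utf8_string is just for this sketch):
fn to_utf8_string(buffer: Vec<u8>) -> String {
    // Happy path: reuse the Vec's allocation. The fallback makes exactly
    // one lossy copy of the original bytes.
    String::from_utf8(buffer)
        .unwrap_or_else(|e| String::from_utf8_lossy(e.as_bytes()).into_owned())
}

fn main() {
    assert_eq!(to_utf8_string(vec![0x41, 0x42]), "AB");
    assert_eq!(to_utf8_string(vec![0x41, 0xFF]), "A\u{FFFD}");
}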
Related
In Rust, I often write code like the transform function below. I have some immutable input and return a newly allocated output:
// the transform function doesn't modify inp. instead, it returns something new.
fn transform(inp: String) -> String { // with reference: inp: &String
    let buf: Vec<_> = inp.as_bytes().iter().map(|x| x + 1).collect();
    String::from_utf8_lossy(&buf).to_string()
}
// using the transform function
fn main() {
    let msg: String = String::from("HAL9000");
    println!("{}", msg);
    println!("{}", transform(msg)); // with reference: &msg
}
Does it hurt performance if I don't use a reference for the inp parameter of the transform function? Or does the compiler recognize that inp, since it's never mutated, can be reused/referenced? Put another way: is it worth the extra effort of declaring immutable input parameters as references, as in C++, to avoid copying huge data structures?
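For comparison, a sketch of the borrowing variant that the comments in the code above allude to (using &str rather than &String, since &str is the idiomatic borrowed form; this only illustrates the question, it isn't a performance verdict):
// Borrowing variant: the caller keeps ownership of the input.
// Note that passing a String by value is a move of its (pointer,
// capacity, length) handle, not a deep copy of the heap data.
fn transform(inp: &str) -> String {
    let buf: Vec<_> = inp.as_bytes().iter().map(|x| x + 1).collect();
    String::from_utf8_lossy(&buf).to_string()
}

fn main() {
    let msg = String::from("HAL9000");
    println!("{}", msg);
    println!("{}", transform(&msg));
    println!("{}", msg); // msg is still usable: it was only borrowed
}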
The crate in the title is byteorder.
Here is how we can read binary data from a std::io::BufReader. BufReader implements the std::io::Read trait, and there is a blanket implementation of byteorder::ReadBytesExt for any type implementing Read. ReadBytesExt contains read_u16 and other methods that read binary data. This is the implementation:
fn read_u16<T: ByteOrder>(&mut self) -> Result<u16> {
    let mut buf = [0; 2];
    self.read_exact(&mut buf)?;
    Ok(T::read_u16(&buf))
}
It passes a reference to buf into BufReader; I suppose it passes the address of buf on the stack. Hence the resulting u16 is transferred from the internal buffer of BufReader (memory) to buf above (memory), probably using memcpy or something. Wouldn't it be more efficient if BufReader implemented ReadBytesExt by reading the data directly from its internal buffer? Or does the compiler optimize buf away?
TL;DR: It's all up to the Optimization Gods, but it should be efficient.
The key optimization here is inlining, as usual, and the probabilities are on our side, but who knows...
As long as the call to read_exact is inlined, it should just work.
Firstly, it can be inlined. In Rust, "inner" calls are always statically dispatched -- there's no inheritance -- and therefore the type of the receiver (self) in self.read_exact is known at compile-time. As a result, the exact read_exact function being called is known at compile-time.
Of course, there's no telling whether it'll be inlined. The implementation is fairly short, so chances are good, but that's out of our hands.
Secondly, what happens if it's inlined? Magic!
You can see the implementation here:
fn read_exact(&mut self, buf: &mut [u8]) -> io::Result<()> {
    if self.buffer().len() >= buf.len() {
        buf.copy_from_slice(&self.buffer()[..buf.len()]);
        self.consume(buf.len());
        return Ok(());
    }

    crate::io::default_read_exact(self, buf)
}
Once inlined, we therefore have:
fn read_u16<T: ByteOrder>(&mut self) -> Result<u16> {
    let mut buf = [0; 2];

    // self.read_exact(&mut buf)?;
    if self.buffer().len() >= buf.len() {
        buf.copy_from_slice(&self.buffer()[..buf.len()]);
        self.consume(buf.len());
        Ok(())
    } else {
        crate::io::default_read_exact(self, &mut buf)
    }?;

    Ok(T::read_u16(&buf))
}
Needless to say, all those buf.len() calls should be replaced by 2.
fn read_u16<T: ByteOrder>(&mut self) -> Result<u16> {
    let mut buf = [0; 2];

    // self.read_exact(&mut buf)?;
    if self.buffer().len() >= 2 {
        buf.copy_from_slice(&self.buffer()[..2]);
        self.consume(2);
        Ok(())
    } else {
        crate::io::default_read_exact(self, &mut buf)
    }?;

    Ok(T::read_u16(&buf))
}
So we're left with copy_from_slice, a memcpy invoked with a constant size (2).
The trick is that memcpy is so special that it's a builtin in most compilers, and it certainly is in LLVM. And it's a builtin specifically so that in special cases -- such as a constant size being specified which happens to be a register size -- its codegen can be specialized to... a mov instruction in the case of x86/x64.
So, as long as read_exact is inlined, then buf should live in a register from beginning to end... in the happy case.
In the cold path, when default_read_exact is called, then the compiler will need to use the stack and pass a slice. That's fine. It should not happen often.
If you find yourself repeatedly doing sequences of u16 reads, however... you may find yourself better served by reading larger arrays, to avoid the repeated if self.buffer().len() >= 2 checks.
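A sketch of that batching idea using only std, with u16::from_be_bytes standing in for byteorder's big-endian read (the endianness and the sample bytes are assumptions for the example):
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    let data: &[u8] = &[0x12, 0x34, 0x56, 0x78];
    let mut reader = BufReader::new(data);

    // One read (and one length check) for the whole batch,
    // instead of one per u16.
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;

    let a = u16::from_be_bytes([buf[0], buf[1]]);
    let b = u16::from_be_bytes([buf[2], buf[3]]);
    println!("{:#06x} {:#06x}", a, b); // 0x1234 0x5678
    Ok(())
}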
I'm looking to convert a slice of bytes ([]byte) into a UTF-8 string.
I want to write a function like that :
func bytesToUTF8string(bytes []byte) string {
    // Take the slice of bytes and encode it as a UTF-8 string,
    // then return the UTF-8 string.
}
What is the most efficient way to perform this?
EDIT:
Specifically, I want to convert the output of crypto.rsa.EncryptPKCS1v15 or the output of SignPKCS1v15 to a UTF-8 encoded string.
How can I do it?
func bytesToUTF8string(bytes []byte) string {
    return string(bytes)
}
It's such a common, simple operation that it's arguably not worth wrapping in a function. Unless, of course, you need to translate from a different source encoding; then it's an entirely different issue, with which the golang.org/x/text/encoding package might help.
I read this article a few days ago, and I thought about the best way to implement such a thing in Rust. The article suggests using a buffer instead of printing the string after each iteration.
Is it correct to say that String::with_capacity() (or Vec) is equal to malloc in C?
An example from the code:
String::with_capacity(size * 4096)
equal to:
char *buf = malloc(size * 4096);
It is not "equal": Rust's String is a composite object. String::with_capacity creates a String, which is not just a buffer; it is a wrapper around a Vec<u8>:
pub struct String {
    vec: Vec<u8>,
}
And a Vec is not just a section in memory - it also contains a RawVec and its length:
pub struct Vec<T> {
    buf: RawVec<T>,
    len: usize,
}
And a RawVec is not a primitive either:
pub struct RawVec<T> {
    ptr: Unique<T>,
    cap: usize,
}
So when you call String::with_capacity:
pub fn with_capacity(capacity: usize) -> String {
    String { vec: Vec::with_capacity(capacity) }
}
You are doing much more than just reserving a section of memory.
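A quick way to see that the String handle itself is this small composite (a sketch; it assumes the usual layout where all three fields are word-sized):
use std::mem::size_of;

fn main() {
    // String = Vec<u8> = (ptr, cap) + len: three words on typical targets.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    println!("String handle: {} bytes", size_of::<String>());
}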
That isn't quite accurate; it'd make more sense to say String::with_capacity is similar to C++'s std::string::reserve. From the documentation:
Creates a new empty String with a particular capacity.
Strings have an internal buffer to hold their data. The capacity is
the length of that buffer, and can be queried with the capacity
method. This method creates an empty String, but one with an initial
buffer that can hold capacity bytes. This is useful when you may be
appending a bunch of data to the String, reducing the number of
reallocations it needs to do.
If the given capacity is 0, no allocation will occur, and this method
is identical to the new method.
Whether or not it uses something similar to malloc for managing the internal buffer is an implementation detail.
In response to your edit:
You are explicitly allocating memory, whereas in C++ a memory allocation for std::string::reserve only occurs if the argument passed to reserve is greater than the existing capacity. Note that Rust's String does have a reserve method, but C++'s string does not have a with_capacity equivalent.
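A small demonstration of the documented behavior (the sizing just echoes the question's example):
fn main() {
    let size = 2;
    let mut s = String::with_capacity(size * 4096);
    assert_eq!(s.len(), 0); // empty, as the docs say
    assert!(s.capacity() >= size * 4096); // storage reserved up front

    // Appending within the reserved capacity does not reallocate.
    s.push_str("hello");
    assert!(s.capacity() >= size * 4096);
    println!("len={} cap={}", s.len(), s.capacity());
}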
Two things:
If you link to an allocator, well, just call malloc.
The hook into the default global allocator is still unstable, but if you're on nightly, you can call it directly.
On stable Rust today, the closest thing you can get is Vec if you want to use the global allocator, but it's not equivalent for reasons spelled out in other answers.
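A sketch of that stable-Rust approximation (sizes echo the earlier malloc example):
fn main() {
    let size = 2;

    // Roughly the role `char *buf = malloc(size * 4096);` plays in C:
    // memory is reserved, but the Vec tracks a logical length of 0.
    let mut buf: Vec<u8> = Vec::with_capacity(size * 4096);
    assert_eq!(buf.len(), 0);
    assert!(buf.capacity() >= size * 4096);

    // Closer to calloc: extend to the full size, zero-initialized.
    buf.resize(size * 4096, 0);
    println!("{} bytes usable", buf.len());
}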
In Rust it's possible to get UTF-8 from bytes by doing this:
if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}
This either works or it doesn't, but Python has the ability to handle errors, e.g.:
s = some_bytes.decode(encoding='utf-8', errors='surrogateescape')
In this example the argument surrogateescape converts invalid UTF-8 sequences to escape codes, so instead of being ignored or replaced, text that can't be decoded is replaced with a byte-literal expression, which is valid UTF-8. See the Python docs for details.
Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?
Yes, via String::from_utf8_lossy:
fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo
}
If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.
A quickly hacked-up example:
use std::str;
fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    // No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}
You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:
fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...
    handler(&mut output, bad);
    // ...
}

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
See also:
How do I convert a Vector of bytes (u8) to a string
How to print a u8 slice as text if I don't care about the particular encoding?
You can either:
Construct it yourself using the strict UTF-8 decoding, which returns an error indicating the position where decoding failed, and then escape the offending bytes. But that's inefficient, since you will decode each failed prefix twice.
Try third-party crates, which provide more customizable charset decoders.