Why does a newline character count as two keys in a BTreeMap?

I'm creating a compression/decompression library in Rust using Huffman encoding. One of the first steps is creating a data structure that contains all unique characters and the number of occurrences. I'm starting with just a simple text file, and having issues related to newline 'characters'.
My first attempt at solving this problem was constructing a BTreeMap, essentially a set of key-value pairs of unique characters and their occurrences, respectively. Unfortunately, a newline 'character' is \n, which I think is not being handled correctly due to being two characters. I then converted the BTreeMap into a Vec to order by value, but that didn't solve the newline issue.
Here's my initial attempt at my comp binary package. Calling the binary is done using cargo, and my sample file is reproduced at the end of this question:
cargo run <text-file-in> <compressed-output-file>
main.rs
extern crate comp;
use std::env;
use std::process;
use std::io::prelude::*;
use comp::Config;
fn main() {
// Collect command-line args into a vector of strings
let mut stderr = std::io::stderr();
let config = Config::new(env::args()).unwrap_or_else(|err| {
writeln!(&mut stderr, "Parsing error: {}", err).expect("Could not write to stderr");
process::exit(1)
});
println!("Filename In: {}", config.filename_in);
println!("Filename Out: {}", config.filename_out);
if let Err(e) = comp::run(config) {
writeln!(&mut stderr, "Application error: {}", e).expect("Could not write to stderr");
process::exit(1);
}
}
lib.rs
use std::collections::btree_map::BTreeMap;
use std::error::Error;
use std::fs::File;
use std::io::Read;
use std::iter::FromIterator;
pub struct Config {
pub filename_in: String,
pub filename_out: String
}
impl Config {
pub fn new(mut args: std::env::Args) -> Result<Config, &'static str> {
args.next();
let filename_in = match args.next() {
Some(arg) => arg,
None => return Err("Didn't get a filename_in string"),
};
let filename_out = match args.next() {
Some(arg) => arg,
None => return Err("Didn't get a filename_out string"),
};
Ok(Config {
filename_in: filename_in,
filename_out: filename_out,
})
}
}
pub fn run(config: Config) -> Result<(), Box<Error>> {
let mut f = File::open(config.filename_in)?;
let mut contents = String::new();
f.read_to_string(&mut contents)?;
for line in contents.lines() {
println!("{}", line);
}
// Put unique occurrences into a BTreeMap
let mut count = BTreeMap::new();
for c in contents.chars() {
*count.entry(c).or_insert(0) += 1;
}
// Put contents into a Vec to order by value
let mut v = Vec::from_iter(count);
v.sort_by(|&(_, a), &(_, b)| b.cmp(&a));
// Print key-value pair of input file
println!("Number of occurrences of each character");
for &(key, value) in v.iter() {
println!("{}: {}", key, value);
}
Ok(())
}
Sample text file, poem.txt:
I'm nobody! Who are you?
Are you nobody, too?
Then there's a pair of us — don't tell!
They'd banish us, you know.
How dreary to be somebody!
How public, like a frog
To tell your name the livelong day
To an admiring bog!
Usage:
$ cargo run poem.txt poem
Compiling comp v0.1.0 (file:///home/chris/Projects/learn_rust/comp-rs)
Finished dev [unoptimized + debuginfo] target(s) in 1.96 secs
Running `target/debug/comp poem.txt poem`
Filename In: poem.txt
Filename Out: poem
I'm nobody! Who are you?
Are you nobody, too?
Then there's a pair of us — don't tell!
They'd banish us, you know.
How dreary to be somebody!
How public, like a frog
To tell your name the livelong day
To an admiring bog!
Number of occurrences of each character
: 36
o: 24
e: 15
a: 10
n: 10
y: 10
< What's going on here?
: 9 < What's going on here?
r: 9
d: 8
l: 8
b: 7
i: 7
t: 7
u: 7
h: 6
s: 5
!: 4
': 4
T: 4
g: 4
m: 4
,: 3
w: 3
?: 2
H: 2
f: 2
k: 2
p: 2
.: 1
A: 1
I: 1
W: 1
c: 1
v: 1
—: 1

Unfortunately, a newline 'character' is \n, which I think is not being handled correctly due to being two characters.
No, it is not. A newline character (Unicode code point U+000A, encoded in UTF-8 as the single byte 0x0A) is a single character.
I think I need the newline character to be a key in my key-value pair, but it's currently two keys.
No, it is not. Such a thing cannot happen "accidentally" either. If you somehow had two keys, you would have to call insert twice; there's no built-in concept of a multi-key map.
All that's happening here is that a newline character is printed as... a newline!
y: 10
: 9
If you took the time to create an MCVE, you'd see this quickly:
fn main() {
let c = '\n';
println!(">{}<", c);
println!(">{:?}<", c);
}
>
<
>'\n'<

The newline character is written with an escape sequence. Although \n appears as two characters in source code, it is a placeholder for a single character, a new line, and the program treats it as one character at runtime.
The core issue here is that you're using println! to print the key itself, so the terminal shows an actual line break rather than the text \n. That is why the output for that entry is split across two lines. Most languages behave the same way.
At the cost of a little extra code, you could special-case the newline when printing:
// Print key-value pair of input file
println!("Number of occurrences of each character");
for &(key, value) in v.iter() {
if key == '\n' {
println!("\\n: {}", value);
} else {
println!("{}: {}", key, value);
}
}
As Shepmaster explained, consider creating an MCVE to test things thoroughly; it helps rule out misinterpretations of what is actually happening behind the scenes.
(NOTE: I am not a Rust master; there is probably a better way to achieve the above, but this is the shortest solution I came up with in a short period of time)
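As an alternative to special-casing '\n', the Debug formatter ({:?}) used in the MCVE above escapes every non-printable character. The following is only a minimal sketch; the helper names char_counts and format_counts are made up for illustration and are not part of the original comp crate.

```rust
use std::collections::BTreeMap;

/// Count every character in `text`, most frequent first.
/// (Hypothetical helper, mirroring the counting loop in the question.)
fn char_counts(text: &str) -> Vec<(char, usize)> {
    let mut count = BTreeMap::new();
    for c in text.chars() {
        *count.entry(c).or_insert(0usize) += 1;
    }
    let mut v: Vec<(char, usize)> = count.into_iter().collect();
    // Stable sort by descending count; ties keep the BTreeMap key order.
    v.sort_by(|a, b| b.1.cmp(&a.1));
    v
}

/// Format each pair with `{:?}` for the key, so a newline is shown
/// as '\n' instead of producing an actual line break.
fn format_counts(counts: &[(char, usize)]) -> Vec<String> {
    counts
        .iter()
        .map(|&(key, value)| format!("{:?}: {}", key, value))
        .collect()
}
```

Printing each element of format_counts(&char_counts(&contents)) with println! would then show the newline entry on a single line, at the cost of every key being wrapped in single quotes.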

Related

How to move a String out of a for-loop embedded within a Closure? Rust

New to Rust.
I am attempting to solve: https://leetcode.com/problems/longest-common-prefix/
My own solution is:
impl Solution {
pub fn longest_common_prefix(strs: Vec<String>) -> String {
let first = &strs[0];
let result = strs.iter() // [&string1, &string2, ..., &stringn] for &string in strs
.map(|&string| {
for (i, c) in first.chars().enumerate() {
println!("c is {}", &c);
let mut partial_res = String::new();
if c == string.chars().nth(i).unwrap() {
println!("c is {}", &c);
partial_res.push(c);
}
}
partial_res
}
)
.min_by_key(|string| string.len()).unwrap();
result
}
}
The idea is to iterate over strs with iter() and map a closure over each string.
The closure takes a &string and compares its characters with those of first (the first element of strs).
Finally, min_by_key searches the resulting iterator for the shortest string and returns it.
I encounter this error:
Line 16, Char 29: cannot find value `partial_res` in this scope (solution.rs)
|
16 | ... partial_res
| ^^^^^^^^^^^ not found in this scope
For more information about this error, try `rustc --explain E0425`.
error: could not compile `prog` due to previous error
mv: cannot stat '/leetcode/rust_compile/target/release/prog': No such file or directory
Therefore, my question is: How can I move a String out of a for-loop embedded within a Closure?
Note:
I understand the approach to solving the problem is not ideal; please disregard the algorithmic aspect here (an O(N^2) instead of an O(N) algorithm).
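One possible way to get past the E0425 error, offered only as a sketch: partial_res is declared inside the body of the for loop, so it no longer exists when the closure tries to return it. Declaring it before the loop keeps it in scope. The version below is written as a free function rather than inside impl Solution, takes each item as a plain reference (the |&string| pattern would try to move a String out of a borrow), and stops at the first mismatch; those choices are assumptions of this sketch, not part of the original code.

```rust
fn longest_common_prefix(strs: Vec<String>) -> String {
    // Return an empty prefix for an empty input instead of panicking.
    let first = match strs.first() {
        Some(s) => s,
        None => return String::new(),
    };

    strs.iter()
        .map(|string| {
            // Declared before the loop, so it is still in scope
            // when the closure returns it.
            let mut partial_res = String::new();
            for (c, d) in first.chars().zip(string.chars()) {
                if c != d {
                    break; // the common prefix ends at the first mismatch
                }
                partial_res.push(c);
            }
            partial_res // moved out of the closure by value
        })
        // The shortest per-string prefix is common to all strings.
        .min_by_key(|prefix| prefix.len())
        .unwrap_or_default()
}
```

Because the closure returns the String by value, ownership moves to the iterator and no borrow of a local variable escapes the closure.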

Ownership question (case with immutable and mutable borrow)

I have a newbie question about ownership. I'm trying to increment (+= 1) the last byte and print the resulting UTF-8 character each time.
But I hold a mutable borrow of the String s in order to change the last byte, so I can't print it (which needs an immutable borrow).
What would be the Rustacean way to do so?
Note: I'm aware I'm not doing it properly; I'm at the learning stage, thanks.
fn main() {
let s = vec![240, 159, 140, 145];
let mut s = unsafe {
String::from_utf8_unchecked(s)
};
unsafe {
let bytes = s.as_bytes_mut(); // mutable borrow occurs here
for _ in 0..7 {
println!("{}", s); // Compile error: immutable borrow occurs here
bytes[3] += 1;
}
}
println!("{}", s);
}
You can keep the data as bytes and use std::str::from_utf8 to make a &str from them whenever you want to print it as a string.
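A minimal sketch of that suggestion, assuming the data is kept in a Vec<u8> and no unsafe code is used. The function name bump_last_byte is hypothetical, and it collects the printed strings into a Vec only so that the result can be inspected afterwards.

```rust
/// Print the bytes as UTF-8, then increment the last byte, `steps` times.
/// Returns what was printed on each iteration.
fn bump_last_byte(bytes: &mut [u8], steps: usize) -> Vec<String> {
    let mut seen = Vec::new();
    for _ in 0..steps {
        // The shared borrow created here ends once the text has been
        // copied, so the mutation below is allowed.
        let text = match std::str::from_utf8(&*bytes) {
            Ok(s) => s.to_owned(),
            Err(_) => String::from("<invalid UTF-8>"),
        };
        println!("{}", text);
        seen.push(text);

        // Mutate the last byte; wrapping_add avoids an overflow panic.
        if let Some(last) = bytes.last_mut() {
            *last = last.wrapping_add(1);
        }
    }
    seen
}
```

Since from_utf8 validates the bytes on every call, an increment that leaves the valid UTF-8 range is reported through the Err branch rather than causing undefined behavior, which from_utf8_unchecked would risk.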

How do I accept a literal "*" as a command-line argument?

I am writing a very simple command-line calculator in Rust: it takes a number, an operator, and another number, does the calculation, and prints the result. To show what I am getting from the command-line args, I print them in a loop before the main code. It works fine for plus, minus, and division, but for multiplication I get an unexpected result: when I print the arguments, instead of a star (*) for multiplication, I get the list of files in my current directory.
Here is my Rust code. I would appreciate an explanation and a workaround, if there is one.
use std::env;
fn main(){
let args: Vec<String> = env::args().collect();
for arg in args.iter(){
println!("{}", arg);
}
let mut result = 0;
let opt = args[2].to_string();
let oper1 = args[1].parse::<i32>().unwrap();
let oper2 = args[3].parse::<i32>().unwrap();
match opt.as_ref(){
"+" => result = oper1 + oper2,
"-" => result = oper1 - oper2,
"*" => result = oper1 * oper2,
"/" => result = oper1 / oper2,
_ => println!("Error")
}
println!("{} {} {} = {}", oper1, opt, oper2, result);
}
The shell is expanding the wildcard (*) before your program runs: an unquoted * is replaced with the names of the files in the current directory (glob expansion), so your program never sees the * you actually typed.
To avoid this, wrap it in quotes (or escape it with a backslash), like so:
./program 1 "*" 1

Referencing / dereferencing a vector element in a for loop

In the code below, I want to retain number_list after iterating over it, since the .into_iter() that for uses by default would consume it. Thus, I am assuming that n: &i32 and that I can get the value of n by dereferencing.
fn main() {
let number_list = vec![24, 34, 100, 65];
let mut largest = number_list[0];
for n in &number_list {
if *n > largest {
largest = *n;
}
}
println!("{}", largest);
}
It was revealed to me that instead of this, we can use &n as a 'pattern':
fn main() {
let number_list = vec![24, 34, 100, 65];
let mut largest = number_list[0];
for &n in &number_list {
if n > largest {
largest = n;
}
}
println!("{}", largest);
number_list;
}
My confusion (and bear in mind I haven't covered patterns) is that I would expect that since n: &i32, then &n: &&i32 rather than it resolving to the value (if a double ref is even possible). Why does this happen, and does the meaning of & differ depending on context?
It can help to think of a reference as a kind of container. For comparison, consider Option, where we can "unwrap" the value using pattern-matching, for example in an if let statement:
let n = 100;
let opt = Some(n);
if let Some(p) = opt {
// do something with p
}
We call Some and None constructors for Option, because they each produce a value of type Option. In the same way, you can think of & as a constructor for a reference. And the syntax is symmetric:
let n = 100;
let reference = &n;
if let &p = reference {
// do something with p
}
You can use this feature in any place where you are binding a value to a variable, which happens all over the place. For example:
if let, as above
match expressions:
match opt {
Some(1) => { ... },
Some(p) => { ... },
None => { ... },
}
match reference {
&1 => { ... },
&p => { ... },
}
In function arguments:
fn foo(&p: &i32) { ... }
Loops:
for &p in iter_of_i32_refs {
...
}
And probably more.
Note that the last two won't work for Option because the pattern is refutable: it would fail to match if a None was found instead of a Some, so the compiler rejects it. That can't happen with references because they only have one constructor, &.
does the meaning of & differ depending on context?
Hopefully, if you can interpret & as a constructor instead of an operator, then you'll see that its meaning doesn't change. It's a pretty cool feature of Rust that you can use constructors on the right hand side of an expression for creating values and on the left hand side for taking them apart (destructuring).
Unlike in other languages (C++), &n in this case isn't a reference but a pattern, which means that it expects a reference to match against.
The opposite of this would be ref n which would give you &&i32 as a type.
This is also the case for closures, e.g.
(0..).filter(|&idx| idx < 10)...
Please note that this will move the value out of the reference, so you cannot do this with types that don't implement the Copy trait.
My confusion (and bear in mind I haven't covered patterns) is that I would expect that since n: &i32, then &n: &&i32 rather than it resolving to the value (if a double ref is even possible). Why does this happen, and does the meaning of & differ depending on context?
When you do pattern matching (for example when you write for &n in &number_list), you're not saying that n is an &i32, instead you are saying that &n (the pattern) is an &i32 (the expression) from which the compiler infers that n is an i32.
Similar things happen for all kinds of patterns. For example, when pattern-matching in if let Some(x) = Some(42) { /* … */ } we are saying that Some(x) is Some(42), therefore x is 42.
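To make the inferred types visible, here is a small sketch with explicit annotations. The function name binding_demo and the variable names are made up for illustration.

```rust
/// Returns the same value obtained three ways: via an `&` pattern,
/// via explicit dereferencing, and through a `ref` binding.
fn binding_demo() -> (i32, i32, i32) {
    let value = 100;
    let reference: &i32 = &value;

    // `&by_pattern` is matched against an `&i32`,
    // so `by_pattern` itself is an `i32` (copied out, since i32 is Copy).
    let &by_pattern = reference;
    let check_pattern: i32 = by_pattern;

    // The explicit dereference from the first version of the loop.
    let by_deref: i32 = *reference;

    // `ref` goes the other way: it adds a layer of reference,
    // so `nested` is an `&&i32`.
    let ref nested = reference;
    let check_nested: &&i32 = nested;

    (check_pattern, by_deref, **check_nested)
}
```

Each annotated let only compiles if the stated type is the one the compiler inferred, so the annotations double as a check of the explanation above.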

lifetime not long enough rust

I want to open a file, replace some characters, and make some splits. Then I want to return the list of strings. However, I get the error: broken does not live long enough. My code works when it is in main, so it is only an issue with lifetimes.
fn tokenize<'r>(fp: &'r str) -> Vec<&'r str> {
let data = match File::open(&Path::new(fp)).read_to_string(){
Ok(n) => n,
Err(e) => fail!("couldn't read file: {}", e.desc)
};
let broken = data.replace("'", " ' ").replace("\"", " \" ").replace(" ", " ");
let mut tokens = vec![];
for t in broken.as_slice().split_str(" ").filter(|&x| *x != "\n"){
tokens.push(t)
}
return tokens;
}
How can I make the value returned by this function live in the scope of the caller?
The problem is that your function signature says "the result has the same lifetime as the input fp", but that's simply not true. The result contains references to data, which is allocated inside your function; it has nothing to do with fp! As it stands, data will cease to exist at the end of your function.
Because you're effectively creating new values, you can't return references; you need to transfer ownership of that data out of the function. There are two ways I can think of to do this, off the top of my head:
Instead of returning Vec<&str>, return Vec<String>, where each token is a freshly-allocated string.
Return data inside a wrapper type which implements the splitting logic. Then, you can have fn get_tokens(&self) -> Vec<&str>; the lifetime of the slices can be tied to the lifetime of the object which contains data.
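Both options can be sketched in current Rust roughly as follows. This is an illustration under assumptions, not the original code: it takes the file contents as a &str instead of reading a file, it uses split_whitespace as a stand-in for the original split-and-filter logic, and the names pad_quotes, tokenize_owned, and Tokenizer are hypothetical.

```rust
/// Shared preprocessing: surround quote characters with spaces.
fn pad_quotes(data: &str) -> String {
    data.replace('\'', " ' ").replace('"', " \" ")
}

/// Option 1: return owned tokens, so nothing borrows from locals.
fn tokenize_owned(data: &str) -> Vec<String> {
    let broken = pad_quotes(data);
    broken.split_whitespace().map(str::to_owned).collect()
}

/// Option 2: a wrapper that owns the processed text.
struct Tokenizer {
    broken: String,
}

impl Tokenizer {
    fn new(data: &str) -> Self {
        Tokenizer { broken: pad_quotes(data) }
    }

    /// The returned slices borrow from `self`, so they stay valid
    /// for as long as the `Tokenizer` itself is alive.
    fn get_tokens(&self) -> Vec<&str> {
        self.broken.split_whitespace().collect()
    }
}
```

The first option allocates one String per token; the second allocates only the processed text once, but the caller has to keep the Tokenizer alive while using the tokens.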

Resources