Hashmap slower than string.find? - performance

I am doing exercises from leetcode as a way to learn Rust. One exercise involves finding the longest substring without any character repetition inside a string.
My first idea involved storing substrings in a string and searching the string to see if the character was already in it:
impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut unique_str = String::from("");
        let schars: Vec<char> = s.chars().collect();
        let mut longest = 0_i32;
        for x in 0..schars.len() {
            unique_str = schars[x].to_string();
            for y in x + 1..schars.len() {
                if is_new_char(&unique_str, schars[y]) {
                    unique_str.push(schars[y]);
                } else {
                    break;
                }
            }
            let cur_len = unique_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
        }
        longest
    }
}
fn is_new_char(unique_str: &str, c: char) -> bool {
    unique_str.find(c).is_none()
}
It works fine but the performance was on the low side. Hoping to shave a few ms on the "find" operation, I replaced unique_str with a HashMap:
use std::collections::HashMap;

impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut hash_str = HashMap::new();
        let schars: Vec<char> = s.chars().collect();
        let mut longest = 0_i32;
        for x in 0..schars.len() {
            hash_str.insert(schars[x], x);
            for y in x + 1..schars.len() {
                if hash_str.contains_key(&schars[y]) {
                    break;
                } else {
                    hash_str.insert(schars[y], y);
                }
            }
            let cur_len = hash_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
            hash_str.clear();
        }
        longest
    }
}
Surprisingly, the String.find() version is 3 times faster than the HashMap in the benchmarks, in spite of the fact that I am using the same algorithm (or at least I think so). Intuitively, I would have assumed that doing the lookups in a hashmap should be considerably faster than searching the string's characters, but it turned out to be the opposite.
Can someone explain why the HashMap is so much slower? (or point out what I am doing wrong).

When it comes to performance, one test is always better than ten reasons.
use std::hash::{Hash, Hasher};

fn main() {
    let start = std::time::SystemTime::now();
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    let s = "a";
    for _ in 0..100_000_000 {
        s.hash(&mut hasher);
        let _hash = hasher.finish();
    }
    eprintln!("{}", start.elapsed().unwrap().as_millis());
}
I used a debug build so that the compiler would not optimize away most of the code.
On my machine, taking the 100M hashes above takes 14 s. If I replace DefaultHasher with SipHasher, as some comments suggested, it takes 17 s.
Now, variant with string:
fn main() {
    let start = std::time::SystemTime::now();
    let string = "abcde";
    for _ in 0..100_000_000 {
        for _c in string.chars() {
            // do nothing
        }
    }
    eprintln!("{}", start.elapsed().unwrap().as_millis());
}
Executing this code with 5 chars in the string takes 24s. If there are 2 chars, it takes 12s.
Now, how does this answer your question?
To insert a value into a HashSet, a hash must be calculated. Then, every time you want to check whether a character is in the set, you need to calculate a hash again. There is also some small overhead for the lookup itself on top of calculating the hash.
As the tests show, calculating one hash of a single-character string takes about as long as iterating over a 3-symbol string. So say your unique_str holds abcde and you check whether the character x is in it. The check alone would be faster with a HashSet, but you also need to add x to the set, which makes it two hash calculations against iterating over a 5-symbol string.
So as long as your unique_str is shorter than 5 symbols on average, the string implementation is guaranteed to be faster. And for an input string like aaaaaaaaa...., it will be roughly 6 times faster than the HashSet option.
Of course, this analysis is very simplistic and many other factors can be in play (such as compiler optimizations and the specific implementations of Hash and find for strings), but it gives an idea of why HashSet can sometimes be slower than string.find().
Side note: your code uses a HashMap instead of a HashSet, which adds even more overhead and is not needed in your case.
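As a side experiment beyond the answers above (not from the original post), you can sidestep hashing entirely by indexing a plain bool array with the byte value, which works as long as the input is single-byte (e.g. ASCII) characters. A sliding-window sketch under that assumption:

```rust
// Sliding-window variant: `seen` marks which bytes are inside the current
// window, so each membership check is a single array access, no hashing.
// Assumes single-byte (ASCII) characters.
fn longest_unique_substring(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut seen = [false; 256];
    let mut start = 0;
    let mut longest = 0;
    for (end, &b) in bytes.iter().enumerate() {
        // Shrink the window from the left until `b` is no longer in it.
        while seen[b as usize] {
            seen[bytes[start] as usize] = false;
            start += 1;
        }
        seen[b as usize] = true;
        longest = longest.max(end - start + 1);
    }
    longest
}

fn main() {
    assert_eq!(longest_unique_substring("abcabcbb"), 3); // "abc"
    assert_eq!(longest_unique_substring("bbbbb"), 1);
    println!("{}", longest_unique_substring("pwwkew")); // prints 3 ("wke")
}
```

Unlike both versions in the question, this is a single pass over the input, so the hashing-versus-scanning question disappears altogether.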


Fastest way to instantiate a pair of integer values in Kotlin to use as a fraction

Motivation
I need to instantiate fractions in my code without the rounding error of floating-point values. Therefore, I decided to use a pair of integer values, one for the numerator and the other for the denominator.
Question
I don't know what to use: Pair<Int, Int>, List<Int>, or IntArray (an array or list of size 2)? Which instance would be the fastest to create and dispose of?
Measurements
I wrote this code:
import kotlin.system.measureNanoTime

fun main() {
    var b: Any
    val elapsedPair = measureNanoTime {
        for (i in 0..100000000) {
            b = Pair(-2, 1)
        }
    }
    println(elapsedPair)
    val elapsedList = measureNanoTime {
        for (i in 0..100000000) {
            b = listOf(-2, 1)
        }
    }
    println(elapsedList)
    val elapsedArray = measureNanoTime {
        for (i in 0..100000000) {
            b = intArrayOf(-2, 1)
        }
    }
    println(elapsedArray)
}
And got these results every time (not the exact numbers, but their order):
> 16338200
> 1340355300
> 6129200
It is clear that arrays are the fastest (because they are on the stack) and lists are the slowest. But the compiler could have applied some optimizations to the array case, so these results may not be representative. Maybe there are underlying optimizations for pair instantiation that would make pair creation faster than array creation in most cases.
Use a data class:
data class Fraction(val numerator: Int, val denominator: Int)
It's really convenient to use:
val fraction = Fraction(2,7)
val (numerator, denominator) = fraction
And you can even add your own operators:
data class Fraction(val numerator: Int, val denominator: Int) {
    operator fun div(divisor: Int) = Fraction(numerator, denominator * divisor)
}

val fraction = Fraction(2, 7)
val divided = fraction / 3
As for performance, it's not a problem until you've proven it's a problem. As always with performance problems, you need to measure and be sure what the real underlying issue is, before sacrificing code readability.

How to generate a random number within a range in substrate?

I want to generate a random number within a certain range. How do I do that in Substrate?
fn draw_juror_for_citizen_profile_function(
    citizen_id: u128,
    length: usize,
) -> DispatchResult {
    let nonce = Self::get_and_increment_nonce();
    let random_seed = T::RandomnessSource::random(&nonce).encode();
    let random_number = u64::decode(&mut random_seed.as_ref())
        .expect("secure hashes should always be bigger than u32; qed");
    Ok(())
}
I can't use the rand crate because it doesn't support no_std.
rng.gen_range(0..10);
I think you need to use the Randomness chain extension for this. See the Randomness docs.
This example shows how to call Randomness from a contract.
There is some discussion and another code example here.
EDIT: I'm not sure how random or appropriate this is, but you could build on top of your random_seed snippet. In your example you say you need a random number between 0 and 10, so you could do:
fn max_index(array: &[u8]) -> usize {
    let mut i = 0;
    for (j, &value) in array.iter().enumerate() {
        if value > array[i] {
            i = j;
        }
    }
    i
}

// generate your random seed
let arr1 = [0; 2];
let seed = self.env().random(&arr1).0;
// find the index of the maximum value in the slice [0..10]
let rand_index = max_index(&seed.as_ref()[0..10]);
The returned number would be in the range 0-9, since it is an index into a 10-element slice. However, this is obviously limited by the fact that you're starting with a [u8; 32]. For larger ranges, maybe you simply concatenate u8 arrays.
Also note that this code simply takes the first max index if there are duplicates.
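An alternative way to reduce the seed to a range (my sketch, not from the answer above) is to read the first bytes of the seed as an integer and take it modulo the range size; note this has a slight modulo bias for ranges that don't evenly divide 2^32. The hard-coded seed below is a stand-in for what the chain's randomness source would return:

```rust
// Reduce a 32-byte random seed to a number in 0..range by interpreting the
// first four bytes as a u32 and applying a modulo (slightly biased).
fn seed_to_range(seed: &[u8; 32], range: u32) -> u32 {
    let first = u32::from_le_bytes([seed[0], seed[1], seed[2], seed[3]]);
    first % range
}

fn main() {
    // Fixed stand-in seed; on-chain this would come from the randomness source.
    let seed = [7u8; 32];
    let n = seed_to_range(&seed, 10);
    assert!(n < 10);
    println!("{}", n);
}
```

This uses only core operations, so it works in a no_std context where the rand crate is unavailable.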

Circular Buffer for Strings

I'm buffering the last X lines of stdout, stderr & stdin of a process.
I'd like to keep the last X lines and be able to access a line by its id (line number).
So if we store 100 lines and insert 200 of them, you can access lines 100-200.
(In reality we want to store ~2000 lines.)
The performance case is insertion. So insertion itself should be fast. Retrieving will occasionally happen but is probably at 10% of the use case.
(We won't look into the output for most of the time.)
Old approach, fragmenting
I used a wrapping ArrayDeque and kept track of the line count, but this means we're using a [Vec<u8>; 100] in the example above: an array of String, and thus an array of Vec<u8>.
New approach, with open questions
My* new idea is to store the data in a single array of u8 and keep track of the start position and length of each entry in that array. The problem here is that the bookkeeping itself would need to be some kind of ring buffer too, erasing old entries the moment our data array has to wrap. Maybe there are also better ways to implement this? At least this approach takes full advantage of a ring buffer and prevents memory fragmentation.
*thanks also to sebk from the rust community
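To make that open question concrete, here is one way the bookkeeping could look (my sketch, not the asker's code, using std's VecDeque in place of ArrayDeque). To keep it simple it never splits a line across the wrap point: when a line doesn't fit at the end, the write position restarts at offset 0, bip-buffer style, and any entries whose bytes get overwritten are evicted:

```rust
use std::collections::VecDeque;

// Entry records where a line lives inside the flat byte buffer.
struct Entry {
    start: usize,
    len: usize,
}

pub struct SliceRing {
    data: Vec<u8>,          // fixed-capacity byte storage
    index: VecDeque<Entry>, // bookkeeping, oldest entry at the front
    head: usize,            // next write offset into `data`
}

impl SliceRing {
    pub fn new(capacity: usize) -> Self {
        SliceRing { data: vec![0; capacity], index: VecDeque::new(), head: 0 }
    }

    pub fn push(&mut self, line: &[u8]) {
        assert!(line.len() <= self.data.len());
        // Bip-buffer style: if the line does not fit before the end,
        // wrap the write position back to offset 0 instead of splitting.
        if self.head + line.len() > self.data.len() {
            self.head = 0;
        }
        let start = self.head;
        let end = start + line.len();
        // Evict bookkeeping entries whose bytes we are about to overwrite.
        self.index.retain(|e| e.start + e.len <= start || e.start >= end);
        self.data[start..end].copy_from_slice(line);
        self.index.push_back(Entry { start, len: line.len() });
        self.head = end;
    }

    pub fn get(&self, i: usize) -> Option<&[u8]> {
        self.index.get(i).map(|e| &self.data[e.start..e.start + e.len])
    }

    pub fn len(&self) -> usize {
        self.index.len()
    }
}

fn main() {
    let mut ring = SliceRing::new(10);
    ring.push(b"hello");
    ring.push(b"world");
    ring.push(b"abc"); // wraps: overwrites "hello", keeps "world"
    assert_eq!(ring.get(0), Some(&b"world"[..]));
    assert_eq!(ring.get(1), Some(&b"abc"[..]));
    println!("{} lines buffered", ring.len()); // prints "2 lines buffered"
}
```

The trade-off of not splitting lines is some wasted space at the end of the buffer, in exchange for every get returning a contiguous slice.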
Current easy approach
const MAX: usize = 5;

pub struct LineRingBuffer {
    counter: Option<usize>,
    data: ArrayDeque<[String; MAX], Wrapping>,
    min_line: usize,
}

impl LineRingBuffer {
    pub fn new() -> Self {
        Self {
            counter: None,
            data: ArrayDeque::new(),
            min_line: 0,
        }
    }

    pub fn get(&self, pos: usize) -> Option<&String> {
        if let Some(max) = self.counter {
            if pos >= self.min_line && pos <= max {
                return self.data.get(pos - self.min_line);
            }
        }
        None
    }

    pub fn insert(&mut self, line: String) {
        self.data.push_back(line);
        if let Some(ref mut v) = self.counter {
            *v += 1;
            if *v - self.min_line >= MAX {
                self.min_line += 1;
            }
        } else {
            self.counter = Some(0);
        }
    }
}
Draft of the new idea I am asking about:
pub struct SliceRingbuffer {
    counter: Option<usize>,
    min_line: usize,
    data: Box<[u8; 250_000]>,
    index: ArrayDeque<Entry, Wrapping>,
}

struct Entry {
    start: usize,
    length: usize,
}
For whatever reason, the current approach is still pretty fast, even though I expect a lot of allocations of different sizes (depending on the lines) and thus fragmentation.
Let's go back to basics.
A circular buffer typically guarantees no fragmentation because it is not typed by the content you store, but by size. You might define a 1MB circular buffer, for example. For fixed-length types, this gives you a fixed number of elements you can store.
You're evidently not doing this. By storing Vec<u8> as an element, even though the overarching array is fixed-length, the content is not. Each element stored in the array is really a (pointer, length, capacity) triple pointing at heap-allocated content.
Naturally, when you insert, you will therefore have to:
Create this Vec (this is the fragmentation you're thinking of, but not really seeing, as the rust allocator is pretty efficient at this kind of stuff)
Insert the vec where it should be, shifting everything sideways if you have to (the standard circular buffer techniques are at play here)
Your second option is an actual circular buffer. You gain fixed size and zero allocations if you do it right; you lose the guarantee that an entire, unsplit line always sits at the start of your buffer.
Before we head into the wide lands of DIY, a quick pointer to VecDeque is in order. This is a much more optimized version of what you implemented, albeit with some (fully warranted) unsafe sections.
Implementing our own circular buffer
We're going to make a bunch of assumptions and set some requirements for this:
We want to be able to store large strings
Our buffer stores bytes. The entire stack is therefore dealing with owned u8
We will make use of a simple Vec; in practice you would not reimplement this entire structure at all, the array is purely there for demonstration
The result of these choices is the following element structure:
| Element size | Data |
|--------------|----------|
| 4 bytes | N bytes |
We therefore lose 4 bytes ahead of every message to get a clear pointer/skip reference to the next element (whose maximum size is whatever a u32 can represent).
A naive implementation example is as follows (playground link):
use byteorder::{NativeEndian, ReadBytesExt, WriteBytesExt};

pub struct CircularBuffer {
    data: Vec<u8>,
    tail: usize,
    elements: usize,
}

impl CircularBuffer {
    pub fn new(max: usize) -> Self {
        CircularBuffer {
            data: Vec::with_capacity(max),
            elements: 0,
            tail: 0,
        }
    }

    /// Number of elements in the buffer
    pub fn elements(&self) -> usize {
        self.elements
    }

    /// Number of used bytes in the buffer, including metadata
    pub fn len(&self) -> usize {
        self.tail
    }

    /// Length of the first element in the ring buffer
    pub fn next_element_len(&self) -> Option<usize> {
        self.data
            .get(0..4)
            .and_then(|mut v| v.read_u32::<NativeEndian>().ok().map(|r| r as usize))
    }

    /// Remove the first element in the ring buffer (wrap)
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        self.next_element_len().map(|chunk_size| {
            self.tail -= chunk_size + 4;
            self.elements -= 1;
            self.data
                .splice(..(chunk_size + 4), vec![])
                .skip(4)
                .collect()
        })
    }

    pub fn get(&self, idx: usize) -> Option<&[u8]> {
        if self.elements <= idx {
            return None;
        }
        let mut current_head = 0;
        let mut current_element = 0;
        while current_head < self.len() {
            // Get the length of the next block, reading at the current head
            let element_size = self
                .data
                .get(current_head..(current_head + 4))
                .and_then(|mut v| v.read_u32::<NativeEndian>().ok().map(|r| r as usize))
                .unwrap();
            if current_element == idx {
                return self
                    .data
                    .get((current_head + 4)..(current_head + element_size + 4));
            }
            current_element += 1;
            current_head += 4 + element_size;
        }
        None
    }

    pub fn insert(&mut self, mut element: Vec<u8>) {
        let e_len = element.len();
        let capacity = self.data.capacity();
        while self.len() + e_len + 4 > capacity {
            self.pop();
        }
        self.data.write_u32::<NativeEndian>(e_len as u32).unwrap();
        self.data.append(&mut element);
        self.tail += 4 + e_len;
        self.elements += 1;
    }
}
Do note again that this is a naive implementation aimed at showing how you would go about the problem of clipping strings in your buffer. The "real", optimal implementation would use unsafe to shift and remove elements.

Can I randomly sample from a HashSet efficiently?

I have a std::collections::HashSet, and I want to sample and remove a uniformly random element.
Currently, what I'm doing is randomly sampling an index using gen_range, then iterating over the HashSet to that index to get the element. Then I remove the selected element. This works, but it's not efficient. Is there an efficient way to randomly sample an element?
Here's a stripped down version of what my code looks like:
use std::collections::HashSet;
extern crate rand;
use rand::thread_rng;
use rand::Rng;
let mut hash_set = HashSet::new();
// ... Fill up hash_set ...
let index = thread_rng().gen_range(0, hash_set.len());
let element = hash_set.iter().nth(index).unwrap().clone();
hash_set.remove(&element);
// ... Use element ...
The only data structures allowing uniform sampling in constant time are data structures with constant time index access. HashSet does not provide indexing, so you can’t generate random samples in constant time.
I suggest converting your hash set to a Vec first, and then sampling from the vector. To remove an element, simply move the last element into its place – the order of the elements in the vector is immaterial anyway.
If you want to consume all elements from the set in random order, you can also shuffle the vector once and then iterate over it.
Here is an example implementation for removing a random element from a Vec in constant time:
use rand::{thread_rng, Rng};

pub trait RemoveRandom {
    type Item;

    fn remove_random<R: Rng>(&mut self, rng: &mut R) -> Option<Self::Item>;
}

impl<T> RemoveRandom for Vec<T> {
    type Item = T;

    fn remove_random<R: Rng>(&mut self, rng: &mut R) -> Option<Self::Item> {
        if self.is_empty() {
            None
        } else {
            let index = rng.gen_range(0..self.len());
            Some(self.swap_remove(index))
        }
    }
}
(Playground)
Thinking about Sven Marnach's answer, I want to use a vector, but I also need constant-time insertion without duplication. Then I realized that I can maintain both a vector and a set, and ensure that they always contain the same elements. This allows both constant-time insertion with deduplication and constant-time random removal.
Here's the implementation I ended up with:
use std::collections::HashSet;
use rand::{thread_rng, Rng};

struct VecSet<T> {
    set: HashSet<T>,
    vec: Vec<T>,
}

impl<T> VecSet<T>
where
    T: Clone + Eq + std::hash::Hash,
{
    fn new() -> Self {
        Self {
            set: HashSet::new(),
            vec: Vec::new(),
        }
    }

    fn insert(&mut self, elem: T) {
        assert_eq!(self.set.len(), self.vec.len());
        let was_new = self.set.insert(elem.clone());
        if was_new {
            self.vec.push(elem);
        }
    }

    fn remove_random(&mut self) -> T {
        assert_eq!(self.set.len(), self.vec.len());
        let index = thread_rng().gen_range(0, self.vec.len());
        let elem = self.vec.swap_remove(index);
        let was_present = self.set.remove(&elem);
        assert!(was_present);
        elem
    }

    fn is_empty(&self) -> bool {
        assert_eq!(self.set.len(), self.vec.len());
        self.vec.is_empty()
    }
}
Sven's answer suggests converting the HashSet to a Vec, in order to randomly sample from the Vec in O(1) time. This conversion takes O(n) time and is suitable if the conversion needs to be done only sparingly; e.g., for taking a series of random samples from an otherwise unchanging hashset. It is less suitable if conversions need to be done often, e.g., if, between taking random samples, one wants to intersperse some O(1) removals-by-value from the HashSet, since that would involve converting back and forth between HashSet and Vec, with each conversion taking O(n) time.
isaacg's solution is to keep both a HashSet and a Vec and operate on them in tandem. This allows O(1) lookup by index, O(1) random removal, and O(1) insertion, but not O(1) lookup by value or O(1) removal by value (because the Vec can't do those).
Below, I give a data structure that allows O(1) lookup by index or by value, O(1) insertion, and O(1) removal by index or value:
It is a HashMap<T, usize> together with a Vec<T>, such that the Vec maps indexes (which are usizes) to Ts, while the HashMap maps Ts to usizes. The HashMap and Vec can be thought of as inverse functions of one another, so that you can go from an index to its value, and from a value back to its index. The insertion and deletion operations are defined so that the indexes are precisely the integers from 0 to size()-1, with no gaps allowed. I call this data structure a BijectiveFiniteSequence. (Note the take_random_val method; it works in O(1) time.)
use std::collections::HashMap;
use std::hash::Hash;
use rand::{thread_rng, Rng};

#[derive(Clone, Debug)]
struct BijectiveFiniteSequence<T: Eq + Copy + Hash> {
    idx_to_val: Vec<T>,
    val_to_idx: HashMap<T, usize>,
}

impl<T: Eq + Copy + Hash> BijectiveFiniteSequence<T> {
    fn new() -> BijectiveFiniteSequence<T> {
        BijectiveFiniteSequence {
            idx_to_val: Vec::new(),
            val_to_idx: HashMap::new(),
        }
    }

    fn insert(&mut self, val: T) {
        self.idx_to_val.push(val);
        self.val_to_idx.insert(val, self.len() - 1);
    }

    fn take_random_val(&mut self) -> Option<T> {
        let mut rng = thread_rng();
        let rand_idx: usize = rng.gen_range(0..self.len());
        self.remove_by_idx(rand_idx)
    }

    fn remove_by_idx(&mut self, idx: usize) -> Option<T> {
        match idx < self.len() {
            true => {
                let val = self.idx_to_val[idx];
                let last_idx = self.len() - 1;
                self.idx_to_val.swap(idx, last_idx);
                self.idx_to_val.pop();
                // update the hashmap entry after the swap above
                // (unless the removed element was itself the last one)
                if idx < self.idx_to_val.len() {
                    self.val_to_idx.insert(self.idx_to_val[idx], idx);
                }
                self.val_to_idx.remove(&val);
                Some(val)
            }
            false => None,
        }
    }

    fn remove_val(&mut self, val: T) -> Option<T> {
        // nearly identical to the implementation of remove_by_idx above
        match self.contains(&val) {
            true => {
                let idx: usize = *self.val_to_idx.get(&val).unwrap();
                let last_idx = self.len() - 1;
                self.idx_to_val.swap(idx, last_idx);
                self.idx_to_val.pop();
                // update the hashmap entry after the swap above
                if idx < self.idx_to_val.len() {
                    self.val_to_idx.insert(self.idx_to_val[idx], idx);
                }
                self.val_to_idx.remove(&val);
                Some(val)
            }
            false => None,
        }
    }

    fn get_idx_of(&self, val: &T) -> Option<&usize> {
        self.val_to_idx.get(val)
    }

    fn get_val_at(&self, idx: usize) -> Option<T> {
        match idx < self.len() {
            true => Some(self.idx_to_val[idx]),
            false => None,
        }
    }

    fn contains(&self, val: &T) -> bool {
        self.val_to_idx.contains_key(val)
    }

    fn len(&self) -> usize {
        self.idx_to_val.len()
    }

    // etc. etc. etc.
}
According to the documentation for HashSet::iter it returns "An iterator visiting all elements in arbitrary order."
Arbitrary is perhaps not exactly uniform randomness, but if it's close enough for your use case, this is O(1) and will return different values each time:
// Build a set of integers 0 - 99
let mut set = HashSet::new();
for i in 0..100 {
    set.insert(i);
}

// Sample
for _ in 0..10 {
    let n = set.iter().next().unwrap().clone();
    println!("{}", n);
    set.remove(&n);
}
Like the author, I wanted to remove the value after sampling from the HashSet. Note that sampling multiple times this way without altering the HashSet seems to yield the same result each time.

How (if possible) to sort a BTreeMap by value in Rust?

I am following a course on Software Security, for which one of the assignments is to write some basic programs in Rust. For one of these assignments I need to analyze a text file and generate several statistics. One of them is a list of the ten most used words in the text.
I have written this program that performs all tasks in the assignment except for the word frequency statistic mentioned above, the program compiles and executes the way I expect:
extern crate regex;

use std::collections::BTreeMap;
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;
use std::io::BufReader;
use std::path::Path;

use regex::Regex;

fn main() {
    // Create a path to the desired file
    let path = Path::new("text.txt");
    let display = path.display();
    let file = match File::open(&path) {
        Err(why) => panic!("couldn't open {}: {}", display, why.description()),
        Ok(file) => file,
    };

    let mut wordcount = 0;
    let mut averagesize = 0;
    let mut wordsize = BTreeMap::new();
    let mut words = BTreeMap::new();

    for line in BufReader::new(file).lines() {
        let re = Regex::new(r"([A-Za-z]+[-_]*[A-Za-z]+)+").unwrap();
        for cap in re.captures_iter(&line.unwrap()) {
            let word = cap.at(1).unwrap_or("");
            let lower = word.to_lowercase();
            let s = lower.len();
            wordcount += 1;
            averagesize += s;
            *words.entry(lower).or_insert(0) += 1;
            *wordsize.entry(s).or_insert(0) += 1;
        }
    }
    averagesize = averagesize / wordcount;

    println!("This file contains {} words with an average of {} letters per word.", wordcount, averagesize);
    println!("\nThe number of times a word of a certain length was found.");
    for (size, count) in wordsize.iter() {
        println!("There are {} words of size {}.", count, size);
    }

    println!("\nThe ten most used words.");
    let mut popwords = BTreeMap::new();
    for (word, count) in words.iter() {
        if !popwords.contains_key(count) {
            popwords.insert(count, "");
        }
        let newstring = format!("{} {}", popwords.get(count), word);
        let mut e = popwords.get_mut(count);
    }

    let mut i = 0;
    for (count, words) in popwords.iter() {
        i += 1;
        if i > 10 {
            break;
        }
        println!("{} times: {}", count, words);
    }
}
I have a BTreeMap (that I chose with these instructions), words, that stores each word as key and its frequency in the text as value. This functionality works as I expect, but there I am stuck. I have been trying to find a way to sort the BTreeMap by value, or to find another data structure in Rust that is natively sorted by value.
I am looking for the correct way to achieve this data structure (a list of words with their frequency, sorted by frequency) in Rust. Any pointers are greatly appreciated!
If you only need to analyze a static dataset, the easiest way is to just convert your BTreeMap into a Vec<T> in the end and sort the latter (Playground):
use std::iter::FromIterator;
let mut v = Vec::from_iter(map);
v.sort_by(|&(_, a), &(_, b)| b.cmp(&a));
The vector contains the (key, value) pairs as tuples. To sort the vector, we have to use sort_by() or sort_by_key(). To sort the vector in decreasing order, I used b.cmp(&a) (as opposed to a.cmp(&b), which would be the natural order). But there are other possibilities for reversing the order of a sort.
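One such possibility (a small sketch with made-up word counts) is std::cmp::Reverse combined with sort_by_key, which avoids writing the comparator by hand:

```rust
use std::cmp::Reverse;

// Sort (word, count) pairs by count, highest first. Reverse flips the key's
// ordering; the sort is stable, so equal counts keep their original order.
fn sort_by_count_desc(v: &mut Vec<(&str, u32)>) {
    v.sort_by_key(|&(_, count)| Reverse(count));
}

fn main() {
    let mut v = vec![("the", 12), ("a", 7), ("fox", 1), ("of", 7)];
    sort_by_count_desc(&mut v);
    println!("{:?}", v); // [("the", 12), ("a", 7), ("of", 7), ("fox", 1)]
}
```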
However, if you really need some data structure such that you have a streaming calculation, it's getting more complicated. There are many possibilities in that case, but I guess using some kind of priority queue could work out.
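The priority-queue idea can be sketched with std's BinaryHeap: keep at most k entries, and since BinaryHeap is a max-heap, wrap items in Reverse so the smallest count sits on top and is the one evicted. This is only an illustration of the streaming case with invented counts, not a full word counter:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keep only the k most frequent entries while streaming (count, word) pairs.
fn top_k(pairs: Vec<(u32, &str)>, k: usize) -> Vec<(u32, &str)> {
    let mut heap: BinaryHeap<Reverse<(u32, &str)>> = BinaryHeap::new();
    for p in pairs {
        heap.push(Reverse(p));
        if heap.len() > k {
            heap.pop(); // evict the entry with the current minimum count
        }
    }
    let mut out: Vec<_> = heap.into_iter().map(|Reverse(p)| p).collect();
    out.sort_by(|a, b| b.cmp(a)); // highest count first
    out
}

fn main() {
    let pairs = vec![(3, "fox"), (12, "the"), (7, "of"), (1, "qux"), (9, "and")];
    println!("{:?}", top_k(pairs, 3)); // [(12, "the"), (9, "and"), (7, "of")]
}
```

This keeps memory bounded by k regardless of how many distinct words stream through, at O(log k) per insertion.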
