I have a std::collections::HashSet, and I want to sample and remove a uniformly random element.
Currently, what I'm doing is randomly sampling an index using rand's gen_range, iterating over the HashSet to that index to get the element, and then removing the selected element. This works, but it's not efficient. Is there an efficient way to randomly sample an element?
Here's a stripped down version of what my code looks like:
use std::collections::HashSet;
extern crate rand;
use rand::thread_rng;
use rand::Rng;
let mut hash_set = HashSet::new();
// ... Fill up hash_set ...
let index = thread_rng().gen_range(0..hash_set.len());
let element = hash_set.iter().nth(index).unwrap().clone();
hash_set.remove(&element);
// ... Use element ...
The only data structures allowing uniform sampling in constant time are data structures with constant-time index access. HashSet does not provide indexing, so you can't generate random samples from it in constant time.
I suggest converting your hash set to a Vec first, and then sampling from the vector. To remove an element, simply move the last element into its place – the order of the elements in the vector is immaterial anyway.
If you want to consume all elements from the set in random order, you can also shuffle the vector once and then iterate over it.
Here is an example implementation for removing a random element from a Vec in constant time:
use rand::{thread_rng, Rng};

pub trait RemoveRandom {
    type Item;
    fn remove_random<R: Rng>(&mut self, rng: &mut R) -> Option<Self::Item>;
}

impl<T> RemoveRandom for Vec<T> {
    type Item = T;

    fn remove_random<R: Rng>(&mut self, rng: &mut R) -> Option<Self::Item> {
        if self.is_empty() {
            None
        } else {
            let index = rng.gen_range(0..self.len());
            Some(self.swap_remove(index))
        }
    }
}
Thinking about Sven Marnach's answer, I wanted to use a vector, but I also needed constant-time insertion without duplication. Then I realized I could maintain both a vector and a set, and keep their elements in sync at all times. This allows both constant-time insertion with deduplication and constant-time random removal.
Here's the implementation I ended up with:
struct VecSet<T> {
    set: HashSet<T>,
    vec: Vec<T>,
}

impl<T> VecSet<T>
where
    T: Clone + Eq + std::hash::Hash,
{
    fn new() -> Self {
        Self {
            set: HashSet::new(),
            vec: Vec::new(),
        }
    }

    fn insert(&mut self, elem: T) {
        assert_eq!(self.set.len(), self.vec.len());
        let was_new = self.set.insert(elem.clone());
        if was_new {
            self.vec.push(elem);
        }
    }

    fn remove_random(&mut self) -> T {
        assert_eq!(self.set.len(), self.vec.len());
        let index = thread_rng().gen_range(0..self.vec.len());
        let elem = self.vec.swap_remove(index);
        let was_present = self.set.remove(&elem);
        assert!(was_present);
        elem
    }

    fn is_empty(&self) -> bool {
        assert_eq!(self.set.len(), self.vec.len());
        self.vec.is_empty()
    }
}
Sven's answer suggests converting the HashSet to a Vec, in order to randomly sample from the Vec in O(1) time. This conversion takes O(n) time and is suitable if it only needs to happen sparingly, e.g. for taking a series of random samples from an otherwise unchanging HashSet. It is less suitable if conversions need to happen often, e.g. if, between random samples, one wants to intersperse some O(1) removals-by-value from the HashSet, since that would mean converting back and forth between HashSet and Vec, with each conversion taking O(n) time.
isaacg's solution is to keep both a HashSet and a Vec and operate on them in tandem. This allows O(1) lookup by index, O(1) random removal, and O(1) insertion, but not O(1) lookup by value or O(1) removal by value (because the Vec can't do those).
Below, I give a data structure that allows O(1) lookup by index or by value, O(1) insertion, and O(1) removal by index or value:
It is a HashMap<T, usize> together with a Vec<T>, such that the Vec maps indexes (which are usizes) to Ts, while the HashMap maps Ts to usizes. The HashMap and Vec can be thought of as inverse functions of each other, so that you can go from an index to its value, and from a value back to its index. The insertion and deletion operations are defined so that the indexes are precisely the integers from 0 to len() - 1, with no gaps allowed. I call this data structure a BijectiveFiniteSequence. (Note the take_random_val method; it works in O(1) time.)
use std::collections::HashMap;
use std::hash::Hash;

use rand::{thread_rng, Rng};

#[derive(Clone, Debug)]
struct BijectiveFiniteSequence<T: Eq + Copy + Hash> {
    idx_to_val: Vec<T>,
    val_to_idx: HashMap<T, usize>,
}

impl<T: Eq + Copy + Hash> BijectiveFiniteSequence<T> {
    fn new() -> BijectiveFiniteSequence<T> {
        BijectiveFiniteSequence {
            idx_to_val: Vec::new(),
            val_to_idx: HashMap::new(),
        }
    }

    fn insert(&mut self, val: T) {
        // Skip duplicates so the Vec and HashMap stay inverses of each other.
        if !self.val_to_idx.contains_key(&val) {
            self.idx_to_val.push(val);
            self.val_to_idx.insert(val, self.len() - 1);
        }
    }

    fn take_random_val(&mut self) -> Option<T> {
        if self.len() == 0 {
            return None;
        }
        let mut rng = thread_rng();
        let rand_idx: usize = rng.gen_range(0..self.len());
        self.remove_by_idx(rand_idx)
    }

    fn remove_by_idx(&mut self, idx: usize) -> Option<T> {
        match idx < self.len() {
            true => {
                let val = self.idx_to_val[idx];
                let last_idx = self.len() - 1;
                self.idx_to_val.swap(idx, last_idx);
                self.idx_to_val.pop();
                self.val_to_idx.remove(&val);
                // Update the hashmap entry of the element swapped into `idx`
                // (unless we just removed the last slot).
                if idx < self.idx_to_val.len() {
                    self.val_to_idx.insert(self.idx_to_val[idx], idx);
                }
                Some(val)
            }
            false => None,
        }
    }

    fn remove_val(&mut self, val: T) -> Option<T> {
        // Delegates to remove_by_idx above, which does the swap-and-pop.
        match self.val_to_idx.get(&val) {
            Some(&idx) => self.remove_by_idx(idx),
            None => None,
        }
    }

    fn get_idx_of(&self, val: &T) -> Option<&usize> {
        self.val_to_idx.get(val)
    }

    fn get_val_at(&self, idx: usize) -> Option<T> {
        match idx < self.len() {
            true => Some(self.idx_to_val[idx]),
            false => None,
        }
    }

    fn contains(&self, val: &T) -> bool {
        self.val_to_idx.contains_key(val)
    }

    fn len(&self) -> usize {
        self.idx_to_val.len()
    }

    // etc. etc. etc.
}
According to the documentation for HashSet::iter, it returns "an iterator visiting all elements in arbitrary order."
Arbitrary order is not necessarily uniformly random, but if it's close enough for your use case, this is O(1) and will return different values each time:
// Build a set of integers 0 - 99
let mut set = HashSet::new();
for i in 0..100 {
    set.insert(i);
}

// Sample
for _ in 0..10 {
    let n = set.iter().next().unwrap().clone();
    println!("{}", n);
    set.remove(&n);
}
Like the author, I wanted to remove the value after sampling from the HashSet. Sampling multiple times this way without altering the HashSet seems to yield the same result each time.
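A quick self-contained check of that caveat. It relies only on the observation that a HashSet that is not mutated iterates in a stable (if arbitrary) order within a single run:

```rust
use std::collections::HashSet;

fn main() {
    let mut set: HashSet<i32> = (0..100).collect();

    // Without mutation, repeated iter().next() calls start from the same
    // point in the table, so they yield the same element.
    let a = *set.iter().next().unwrap();
    let b = *set.iter().next().unwrap();
    assert_eq!(a, b);

    // Removing the sampled element is what makes the next sample differ:
    // `a` is no longer in the set, so the next element cannot equal it.
    set.remove(&a);
    let c = *set.iter().next().unwrap();
    assert_ne!(a, c);
}
```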
I want to generate a random number within a certain range. How can I do that in Substrate?
fn draw_juror_for_citizen_profile_function(
    citizen_id: u128,
    length: usize,
) -> DispatchResult {
    let nonce = Self::get_and_increment_nonce();
    let random_seed = T::RandomnessSource::random(&nonce).encode();
    let random_number = u64::decode(&mut random_seed.as_ref())
        .expect("secure hashes should always be bigger than u32; qed");
    Ok(())
}
I can't use the rand crate because it doesn't support no_std. What I want is the equivalent of:
rng.gen_range(0..10);
I think you need to use the Randomness chain extension for this. See the Randomness docs.
This example shows how to call Randomness from a contract.
There is some discussion and another code example here.
EDIT: I'm not sure how random or appropriate this is, but you could build on top of your random_seed snippet. In your example you say you need a random number between 0 and 10, so you could do:
fn max_index(array: &[u8]) -> usize {
    let mut i = 0;
    for (j, &value) in array.iter().enumerate() {
        if value > array[i] {
            i = j;
        }
    }
    i
}

// generate your random seed
let arr1 = [0; 2];
let seed = self.env().random(&arr1).0;

// find the maximum index within the first 10 bytes of the seed
let rand_index = max_index(&seed.as_ref()[0..10]);
The returned number would be in the range 0-9 (the maximum index of a 10-element slice). However, this is obviously limited by the fact that you're starting with a [u8; 32]. For larger ranges maybe you simply concatenate u8 arrays.
Also note that this code simply takes the first max index if there are duplicates.
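An alternative sketch that avoids the max-index trick: interpret the first bytes of the seed as an integer and reduce it modulo the range size. seed_to_range is a made-up helper name, not a Substrate API, and the plain modulo introduces a slight bias toward smaller values unless the range size divides 2^64, which may or may not matter for your use case:

```rust
// Assumes the seed provides at least 8 bytes (a [u8; 32] hash does).
fn seed_to_range(seed: &[u8], n: u64) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&seed[..8]);
    // Interpret the first 8 seed bytes as a little-endian u64, then
    // reduce modulo the range size (slightly biased).
    u64::from_le_bytes(bytes) % n
}

fn main() {
    // Stand-in for a 32-byte random seed from the runtime.
    let seed = [0xABu8; 32];
    let r = seed_to_range(&seed, 10);
    assert!(r < 10);
}
```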
I am doing exercises from leetcode as a way to learn Rust. One exercise involves finding the longest substring without any character repetition inside a string.
My first idea involved storing substrings in a string and searching the string to see if the character was already in it:
impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut unique_str = String::from("");
        let schars: Vec<char> = s.chars().collect();
        let mut longest = 0 as i32;
        for x in 0..schars.len() {
            unique_str = schars[x].to_string();
            for y in x + 1..schars.len() {
                if is_new_char(&unique_str, schars[y]) {
                    unique_str.push(schars[y]);
                } else {
                    break;
                }
            }
            let cur_len = unique_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
        }
        longest
    }
}

fn is_new_char(unique_str: &str, c: char) -> bool {
    unique_str.find(c) == None
}
It works fine but the performance was on the low side. Hoping to shave a few ms on the "find" operation, I replaced unique_str with a HashMap:
use std::collections::HashMap;

impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut hash_str = HashMap::new();
        let schars: Vec<char> = s.chars().collect();
        let mut longest = 0 as i32;
        for x in 0..schars.len() {
            hash_str.insert(schars[x], x);
            for y in x + 1..schars.len() {
                if hash_str.contains_key(&schars[y]) {
                    break;
                } else {
                    hash_str.insert(schars[y], y);
                }
            }
            let cur_len = hash_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
            hash_str.clear();
        }
        longest
    }
}
Surprisingly, the String.find() version is 3 times faster than the HashMap in the benchmarks, in spite of the fact that I am using the same algorithm (or at least I think so). Intuitively, I would have assumed that doing the lookups in a hashmap should be considerably faster than searching the string's characters, but it turned out to be the opposite.
Can someone explain why the HashMap is so much slower? (or point out what I am doing wrong).
When it comes to performance, one test is always better than ten reasons.
use std::hash::{Hash, Hasher};

fn main() {
    let start = std::time::SystemTime::now();
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    let s = "a";
    for _ in 0..100000000 {
        s.hash(&mut hasher);
        let _hash = hasher.finish();
    }
    eprintln!("{}", start.elapsed().unwrap().as_millis());
}
I use a debug build so that the compiler does not optimize out most of the code.
On my machine, taking 100M hashes as above takes 14s. If I replace DefaultHasher with SipHasher, as some comments suggested, it takes 17s.
Now, variant with string:
fn main() {
    let start = std::time::SystemTime::now();
    let string = "abcde";
    for _ in 0..100000000 {
        for _c in string.chars() {
            // do nothing
        }
    }
    eprintln!("{}", start.elapsed().unwrap().as_millis());
}
Executing this code with 5 chars in the string takes 24s. If there are 2 chars, it takes 12s.
Now, how does this answer your question?
To insert a value into a hashset, a hash must be calculated. Then, every time you want to check whether a character is in the hashset, you need to calculate a hash again. There is also some small overhead for checking whether the value is in the hashset beyond just calculating the hash.
As we can see from the tests, calculating one hash of a single-character string takes around the same time as iterating over a three-character string. So say you have a unique_str with value abcde, and you check whether the character x is in it. Just the check would be faster with a HashSet, but then you also need to add x to the set, which makes it two hash calculations against iterating over a five-character string.
So as long as your unique_str is shorter than five characters on average, the string implementation is guaranteed to be faster. And in the case of an input string like aaaaaaaaa...., it will be ~6 times faster than the HashSet option.
Of course this analysis is very simplistic, and there can be many other factors in play (like compiler optimizations and the specific implementations of hashing and find for strings), but it gives an idea of why in some cases a HashSet can be slower than string::find().
Side note: in your code you use a HashMap instead of a HashSet, which adds even more overhead and is not needed in your case.
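To illustrate the side note, here is the same brute-force algorithm with the HashMap swapped for a HashSet. HashSet::insert returns false when the value was already present, so the membership check and the insertion collapse into a single hash calculation. This is my sketch, not code from the question:

```rust
use std::collections::HashSet;

// Same brute-force window scan as in the question, but the inner check
// only needs membership, so a HashSet of chars suffices.
fn length_of_longest_substring(s: &str) -> i32 {
    let schars: Vec<char> = s.chars().collect();
    let mut longest = 0;
    let mut seen = HashSet::new();
    for x in 0..schars.len() {
        seen.clear();
        seen.insert(schars[x]);
        for y in x + 1..schars.len() {
            // insert() returns false on a repeat: one hash does both
            // the lookup and the insertion.
            if !seen.insert(schars[y]) {
                break;
            }
        }
        longest = longest.max(seen.len() as i32);
    }
    longest
}

fn main() {
    assert_eq!(length_of_longest_substring("abcabcbb"), 3);
    assert_eq!(length_of_longest_substring("bbbbb"), 1);
    assert_eq!(length_of_longest_substring("pwwkew"), 3);
}
```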
How can I check whether all elements of vector_a also appear, in the same order, in vector_b?
vector_b could be very long, there is no assumption that it is sorted, but it does not have duplicate elements.
I could not find a method implemented for Vec or in itertools, so I tried implementing by doing:
Create a hashmap from vector_b mapping value -> index
Iterate over vector_a and check that:
Each element exists in the hashmap
Its index is strictly greater than the previous element's index
I am not really happy with this as it is not space efficient due to the creation of the hashmap.
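For concreteness, the approach described above might look like the sketch below (is_subsequence_via_map is my name for it; as the question states, it assumes vector_b has no duplicates):

```rust
use std::collections::HashMap;

// Map each value of `b` to its index, then walk `a` and check that the
// looked-up indices are strictly increasing.
fn is_subsequence_via_map<T: std::hash::Hash + Eq>(a: &[T], b: &[T]) -> bool {
    let idx: HashMap<&T, usize> = b.iter().enumerate().map(|(i, v)| (v, i)).collect();
    let mut prev: Option<usize> = None;
    for v in a {
        match idx.get(v) {
            Some(&i) if prev.map_or(true, |p| i > p) => prev = Some(i),
            _ => return false,
        }
    }
    true
}

fn main() {
    assert!(is_subsequence_via_map(b"059", b"0123456789"));
    assert!(!is_subsequence_via_map(b"543", b"0123456789"));
}
```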
Search for each element of the needle in the haystack, in order. Each time you find a matching element, continue the search only in the remaining portion of the haystack. You can express this nicely by taking a new subslice of the haystack each time you match an element.
fn is_subsequence<T: PartialEq>(needle: &[T], mut haystack: &[T]) -> bool {
    for search in needle {
        if let Some(index) = haystack.iter().position(|el| search == el) {
            haystack = &haystack[index + 1..];
        } else {
            return false;
        }
    }
    true
}
assert!(is_subsequence(b"", b"0123456789"));
assert!(is_subsequence(b"0", b"0123456789"));
assert!(is_subsequence(b"059", b"0123456789"));
assert!(is_subsequence(b"345", b"0123456789"));
assert!(is_subsequence(b"0123456789", b"0123456789"));
assert!(!is_subsequence(b"335", b"0123456789"));
assert!(!is_subsequence(b"543", b"0123456789"));
A slice is just a pointer and a size, stored on the stack, so this does no new allocations. It runs in O(n) time and should be close to the fastest possible implementation - or at least in the same ballpark.
The easiest way to do it is to iterate over the two vectors jointly:
fn contains<T: PartialEq>(needle: &[T], haystack: &[T]) -> bool {
    let mut idx = 0;
    for it in needle {
        while (idx < haystack.len()) && (&haystack[idx] != it) {
            idx += 1;
        }
        if idx == haystack.len() {
            return false;
        }
    }
    true
}
I'm buffering the last X lines of stdout, stderr & stdin of a process.
I'd like to keep the last X lines and be able to access a line by its id (line number).
So if we store 100 lines and insert 200 of them, you can access lines 100-200.
(In reality we want to store ~2000 lines.)
The performance-critical case is insertion, so insertion itself should be fast. Retrieval will happen occasionally, but it is maybe 10% of the use case.
(We won't look into the output for most of the time.)
Old approach, fragmenting
I used a wrapping ArrayDeque and kept track of the line count, but this means we're using a [Vec<u8>; 100] in the example above: an array of Strings, and thus an array of Vec<u8>s.
New approach, with open questions
My* new idea is to store the data in one array of u8 and keep track of the start position and length of each entry in the array. The problem here is that the bookkeeping would also need to be some kind of ringbuffer, erasing old entries the moment the data array has to wrap. Maybe there are better ways to implement this? At least this takes full advantage of a ringbuffer and prevents memory fragmentation.
*thanks also to sebk from the rust community
Current easy approach
const MAX: usize = 5;

pub struct LineRingBuffer {
    counter: Option<usize>,
    data: ArrayDeque<[String; MAX], Wrapping>,
    min_line: usize,
}

impl LineRingBuffer {
    pub fn new() -> Self {
        Self {
            counter: None,
            data: ArrayDeque::new(),
            min_line: 0,
        }
    }

    pub fn get(&self, pos: usize) -> Option<&String> {
        if let Some(max) = self.counter {
            if pos >= self.min_line && pos <= max {
                return self.data.get(pos - self.min_line);
            }
        }
        None
    }

    pub fn insert(&mut self, line: String) {
        self.data.push_back(line);
        if let Some(ref mut v) = self.counter {
            *v += 1;
            if *v - self.min_line >= MAX {
                self.min_line += 1;
            }
        } else {
            self.counter = Some(0);
        }
    }
}
Draft of the new idea questioned about:
pub struct SliceRingbuffer {
    counter: Option<usize>,
    min_line: usize,
    data: Box<[u8; 250_000]>,
    index: ArrayDeque<Entry, Wrapping>,
}

struct Entry {
    start: usize,
    length: usize,
}
For whatever reason, the current approach is still pretty fast, even though I'd expect a lot of allocations of different sizes (depending on the lines) and thus fragmentation.
Let's go back to basics.
A circular buffer typically guarantees no fragmentation because it is not typed by the content you store, but by size. You might define a 1MB circular buffer, for example. For fixed-length types, this gives you a fixed number of elements you can store.
You're evidently not doing this. By storing Vec<u8> as the element type, even though the overarching array is fixed-length, the content is not. Each element stored in the array is effectively a handle to a separate heap allocation (pointer, length and capacity).
Naturally, when you insert, you will therefore have to:
Create this Vec (this is the fragmentation you're thinking of, but not really seeing, as the rust allocator is pretty efficient at this kind of stuff)
Insert the vec where it should be, shifting everything sideways if you have to (the standard circular buffer techniques are at play here)
Your second option is an actual circular buffer. You gain in fixed size and zero allocs if you do it right, you lose out on the ability to store entire lines with 100% guarantee of having an entire line at the start of your buffer.
Before we head into the wide lands of DIY, a quick pointer to VecDeque is in order. This is a much more optimized version of what you implemented, albeit with some (fully warranted) unsafe sections.
Implementing our own circular buffer
We're going to make a bunch of assumptions and set some requirements for this:
We want to be able to store large strings
Our buffer stores bytes. The entire stack is therefore dealing with owned u8
We will make use of a simple Vec; in practice you would not reimplement this entire structure at all, the array is purely there for demonstration
The result of these choices is the following element structure:
| Element size | Data |
|--------------|----------|
| 4 bytes | N bytes |
We are therefore losing 4 bytes ahead of every message in order to get a clear pointer/skip reference to the next element (with a maximum element size of whatever fits in a u32).
A naive implementation example is as follows:
use byteorder::{NativeEndian, ReadBytesExt, WriteBytesExt};

pub struct CircularBuffer {
    data: Vec<u8>,
    tail: usize,
    elements: usize,
}

impl CircularBuffer {
    pub fn new(max: usize) -> Self {
        CircularBuffer {
            data: Vec::with_capacity(max),
            elements: 0,
            tail: 0,
        }
    }

    /// Number of elements in the buffer
    pub fn elements(&self) -> usize {
        self.elements
    }

    /// Number of used bytes in the buffer, including metadata
    pub fn len(&self) -> usize {
        self.tail
    }

    /// Length of the first element in the ringbuffer
    pub fn next_element_len(&self) -> Option<usize> {
        self.data
            .get(0..4)
            .and_then(|mut v| v.read_u32::<NativeEndian>().ok().map(|r| r as usize))
    }

    /// Remove the first element from the ringbuffer (wrap)
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        self.next_element_len().map(|chunk_size| {
            self.tail -= chunk_size + 4;
            self.elements -= 1;
            self.data
                .splice(..(chunk_size + 4), vec![])
                .skip(4)
                .collect()
        })
    }

    pub fn get(&self, idx: usize) -> Option<&[u8]> {
        if self.elements <= idx {
            return None;
        }
        let mut current_head = 0;
        let mut current_element = 0;
        while current_head < self.len() - 4 {
            // Get the length of the block starting at current_head
            let element_size = self
                .data
                .get(current_head..current_head + 4)
                .and_then(|mut v| v.read_u32::<NativeEndian>().ok().map(|r| r as usize))
                .unwrap();
            if current_element == idx {
                return self
                    .data
                    .get((current_head + 4)..(current_head + element_size + 4));
            }
            current_element += 1;
            current_head += 4 + element_size;
        }
        None
    }

    pub fn insert(&mut self, mut element: Vec<u8>) {
        let e_len = element.len();
        let capacity = self.data.capacity();
        while self.len() + e_len + 4 > capacity {
            self.pop();
        }
        self.data.write_u32::<NativeEndian>(e_len as u32).unwrap();
        self.data.append(&mut element);
        self.tail += 4 + e_len;
        self.elements += 1;
    }
}
Do note again that this is a naive implementation, aimed at showing you how to go about the problem of clipping strings in your buffer. A "real", optimal implementation would use unsafe code to shift and remove elements.
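As an aside, the 4-byte length header itself does not require the byteorder crate. Here is a minimal sketch of the framing using only the standard library (push_frame and first_frame_len are names I made up for illustration; native endianness mirrors the NativeEndian choice above):

```rust
// Append one length-prefixed element to a byte buffer.
fn push_frame(buf: &mut Vec<u8>, element: &[u8]) {
    buf.extend_from_slice(&(element.len() as u32).to_ne_bytes());
    buf.extend_from_slice(element);
}

// Read the first element's length back out of its 4-byte header.
fn first_frame_len(buf: &[u8]) -> Option<usize> {
    if buf.len() < 4 {
        return None;
    }
    let header = [buf[0], buf[1], buf[2], buf[3]];
    Some(u32::from_ne_bytes(header) as usize)
}

fn main() {
    let mut buf = Vec::new();
    push_frame(&mut buf, b"hello");
    assert_eq!(first_frame_len(&buf), Some(5));
    assert_eq!(&buf[4..9], b"hello");
}
```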
I have a Vec that is the allocation for a circular buffer. Let's assume the buffer is full, so there are no elements in the allocation that aren't in the circular buffer. I now want to turn that circular buffer into a Vec where the first element of the circular buffer is also the first element of the Vec. As an example I have this (allocating) function:
fn normalize(tail: usize, buf: Vec<usize>) -> Vec<usize> {
    let n = buf.len();
    buf[tail..n]
        .iter()
        .chain(buf[0..tail].iter())
        .cloned()
        .collect()
}
Obviously this can also be done without allocating anything, since we already have an allocation that is large enough, and we have a swap operation to swap arbitrary elements of the allocation.
fn normalize(tail: usize, mut buf: Vec<usize>) -> Vec<usize> {
    for _ in 0..tail {
        for i in 0..(buf.len() - 1) {
            buf.swap(i, i + 1);
        }
    }
    buf
}
Sadly this requires buf.len() * tail swap operations. I'm fairly sure it can be done in buf.len() + tail swap operations. For concrete values of tail and buf.len() I have been able to figure out solutions, but I'm not sure how to do it in the general case.
The simplest solution is to use 3 reversals, indeed this is what is recommended in Algorithm to rotate an array in linear time.
// rotate to the left by `k`.
fn rotate<T>(array: &mut [T], k: usize) {
    if array.is_empty() {
        return;
    }
    let k = k % array.len();
    array[..k].reverse();
    array[k..].reverse();
    array.reverse();
}
While this is linear, it requires reading and writing each element at most twice (reversing a range with an odd number of elements does not touch the middle element). On the other hand, the very predictable access pattern of the reversals plays nicely with prefetching; YMMV.
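A self-contained check of the three-reversal rotation (the function is restated here so the snippet compiles on its own). Note that the standard library now provides this operation directly as slice::rotate_left:

```rust
// Rotate to the left by `k` using three reversals.
fn rotate_left<T>(array: &mut [T], k: usize) {
    if array.is_empty() {
        return;
    }
    let k = k % array.len();
    array[..k].reverse();
    array[k..].reverse();
    array.reverse();
}

fn main() {
    let mut v = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    rotate_left(&mut v, 3);
    assert_eq!(v, [3, 4, 5, 6, 7, 8, 9, 0, 1, 2]);
}
```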
This operation is typically called a "rotation" of the vector; e.g., the C++ standard library has std::rotate to do this. There are known algorithms for doing the operation, although you may have to be quite careful when porting them if you're trying to do it generically/with non-Copy types, where swaps become key, as one can't generally just read something straight out of a vector.
That said, one is likely to be able to use unsafe code with std::ptr::read/std::ptr::write for this, since data is just being moved around, and hence there's no need to execute caller-defined code or very complicated concerns about exception safety.
A port of the C code in the link above (by user ker):
fn rotate(k: usize, a: &mut [i32]) {
    if k == 0 {
        return;
    }
    let mut c = 0;
    let n = a.len();
    let mut v = 0;
    while c < n {
        let mut t = v;
        let mut tp = v + k;
        let tmp = a[v];
        c += 1;
        while tp != v {
            a[t] = a[tp];
            t = tp;
            tp += k;
            if tp >= n {
                tp -= n;
            }
            c += 1;
        }
        a[t] = tmp;
        v += 1;
    }
}