Speed of comparing structs using derive(Eq) versus derive(Serialize) - performance

I was curious about how much faster it is to call Eq::eq to compare two large vectors, versus serializing both vectors to a string and comparing the output strings.
I've done a basic speed-test here: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=73fc1cc61873fa70566eb7b2bf0793f9
The results, in release-mode, for a vector of 100 thousand simple structs:
Comparing by Eq::eq(x, y): 20ms
Comparing by serde_json::to_value(x) == serde_json::to_value(y): 172ms
Comparing by serde_json::to_string(x) == serde_json::to_string(y): 34ms
These results surprise me a bit; I did not expect that serializing all the way to a string would be nearly as fast as the derived implementation of Eq::eq.
Is my speed-test flawed in some way? If not, what would explain why there is not a speed advantage to the Eq::eq(x, y) approach, given that it seems like it should have a lot less work to do?

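The culprit is vec.clone(): cloning the vectors costs far more than comparing them. Here is the same test passing references instead: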
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=8f5badd5e2d1b7258eda6b384acd9205
The results after passing references:
Times: 1 (Eq::eq): 1, 2 (to_value): 127, 3 (to_string): 39
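The same pitfall can be sketched in Python as a stand-in for the Rust test (this is not the code from either playground link; `make_records`, `time_call`, `eq_direct` and `eq_via_json` are names made up for the illustration). The point is that the inputs are built, and any copies made, before the timer starts, so only the comparison itself is measured:

```python
import json
import time


def make_records(n):
    # Hypothetical "simple structs": small dicts with two fields.
    return [{"id": i, "name": f"item{i}"} for i in range(n)]


def time_call(fn, *args):
    # Time only the call itself; all setup must happen before this point.
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def eq_direct(x, y):
    # Structural comparison, the analogue of Eq::eq(&x, &y).
    return x == y


def eq_via_json(x, y):
    # Serialize both sides to strings and compare the strings.
    return json.dumps(x) == json.dumps(y)


if __name__ == "__main__":
    x = make_records(100_000)
    y = make_records(100_000)  # built outside the timed region
    for fn in (eq_direct, eq_via_json):
        same, seconds = time_call(fn, x, y)
        print(fn.__name__, same, f"{seconds * 1000:.1f} ms")
```

If the copy were made inside the function passed to `time_call`, its cost would be added to whichever comparison happened to perform it.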

Related

Ruby: Help improving hashing algorithm

I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into the language. I am trying to generate a "hash" to check against in a low-level block-chain verifier, and I am wondering if there are any "convenience methods" that I could use to try to make this hashing algorithm more efficient. I think I can make this more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code which takes in a string to hash, unpacks it into an array of UTF-8 values, does computationally intensive math to each one of those values, adds up all of those values after the math is done to them, takes that value modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
Usually best practice involves taking a value(s) of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393
  h += value2 ^ 17
  h
end
In your example, you are for some reason forcing the hash to the max value of a 16-bit unsigned integer, when typically 32 bits are used, although if you are comparing on the Ruby side, this would be 31 bits due to how Ruby masks Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold exists between how a Bignum and a Fixnum are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This gives a quick and efficient way to determine whether two strings are equal with hardly any overhead, and comparing two symbols is exactly the same as comparing two integers. There is a caveat to this method if you are comparing a vast number of strings. Once a symbol is created, it is alive for the life of the program. Any additional strings equal to it will return the same symbol, but you cannot reclaim the memory of the symbol (just a few bytes) for as long as the program runs. That is not good if you use this method to compare thousands and thousands of unique strings.
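Separately from the advice above, if the formula from the question has to stay exactly as it is, most of its cost comes from building huge intermediate integers before the final modulo. A minimal sketch in Python (used here as a stand-in for Ruby; `generate_hash_naive` and `generate_hash_fast` are illustrative names) reduces modulo 65,536 at every step, which gives the same result because addition, subtraction and multiplication all commute with the modulo:

```python
MODULUS = 65_536  # same modulus as in the question


def generate_hash_naive(string):
    # Direct transcription of the question's algorithm: big integers first,
    # a single modulo at the very end.
    total = 0
    for x in map(ord, string):  # code points, like unpack('U*')
        total += (x ** 2000) * ((x + 2) ** 21) - ((x + 5) ** 3)
    return format(total % MODULUS, "x")


def generate_hash_fast(string):
    # Same arithmetic, but every intermediate value stays below MODULUS.
    total = 0
    for x in map(ord, string):
        term = pow(x, 2000, MODULUS) * pow(x + 2, 21, MODULUS)
        term -= pow(x + 5, 3, MODULUS)
        total = (total + term) % MODULUS
    return format(total, "x")
```

Ruby's `Integer#pow` also accepts a modulus as a second argument, so the same restructuring should carry over, but that port is not shown or tested here.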

Fast check if element is in MATLAB matrix

I would like to verify whether an element is present in a MATLAB matrix.
At the beginning, I implemented as follows:
if ~isempty(find(matrix(:) == element))
which is obviously slow. Thus, I changed to:
if sum(matrix(:) == element) ~= 0
but this is still slow: I call the function that contains this instruction many times, and I lose 14 seconds each time!
Is there a way to further optimize this instruction?
Thanks.
If you just need to know if a value exists in a matrix, using the second argument of find to specify that you just want one value will be slightly faster (25-50%) and even a bit faster than using sum, at least on my machine. An example:
matrix = randi(100,1e4,1e4);
element = 50;
~isempty(find(matrix(:)==element,1))
However, in recent versions of Matlab (I'm using R2014b), nnz is finally faster for this operation, so:
matrix = randi(100,1e4,1e4);
element = 50;
nnz(matrix==element)~=0
On my machine this is about 2.8 times faster than any other approach (including using any, strangely) for the example provided. To my mind, this solution also has the benefit of being the most readable.
In my opinion, there are several things you could try to improve performance:
Following your initial idea, I would use the function any to test whether any of the equality tests succeeded:
if any(matrix(:) == element)
I tested this on a 1000 by 1000 matrix and it is faster than the solutions you have tested.
I do not think that the unfolding matrix(:) is a penalty, since it is equivalent to a reshape, and Matlab handles this in a smart way: it does not actually allocate and move memory, because you are not modifying the temporary object matrix(:).
If your matrix does not change between the calls to the function, or changes rarely, you could keep another vector containing all the elements of your matrix, but sorted. This way you could use a more efficient O(log(N)) search algorithm to test for the presence of your element.
I personally like the ismember function for this kind of problem. It might not be the fastest, but for non-critical parts of the code it greatly improves readability and maintainability. I prefer to spend one hour coding something that will take a day to run over spending one day coding something that will run in one hour. This of course depends on how often you use the program, but it is something one should never forget.
If you can have a sorted copy of the elements of your matrix, you could consider using the undocumented Matlab function ismembc but remember that inputs must be sorted non-sparse non-NaN values.
If performance really is critical you might want to write your own mex file and for this task you could even include some simple parallelization using openmp.
Hope this helps,
Adrien.
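The sorted-copy suggestion above can be sketched as follows, in Python rather than MATLAB and with made-up helper names (`build_sorted_index`, `contains`). It assumes the matrix changes rarely, so the one-time sort is amortized over many lookups:

```python
import bisect


def build_sorted_index(matrix):
    # Flatten the matrix (a list of rows) and sort it once: O(N log N).
    return sorted(value for row in matrix for value in row)


def contains(sorted_values, element):
    # Binary search for the leftmost position where element could sit: O(log N).
    i = bisect.bisect_left(sorted_values, element)
    return i < len(sorted_values) and sorted_values[i] == element
```

The index has to be rebuilt (or updated in place) whenever the matrix changes, which is why this only pays off when lookups greatly outnumber modifications.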

Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistentVector must create a new path, which consists of 4 arrays of size 32. This means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), which has been claimed in several Clojure videos I have seen recently? Has it something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.Vector is basically the same thing, right?
Clojure's PersistentVectors have a special tail buffer to enable efficient operations at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding, "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?", yes! These are known as transients in Clojure, and are used for efficient batch changes.
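To see why the tail buffer matters, here is a rough cost model in Python. It only counts reference copies per append and rests on assumptions that go beyond the answer: a fixed tree depth of 4 (the figure from the question), a tail that is copied into a new array one element longer on each append, and a full path copy once every 32 appends when the tail is pushed into the tree. The function names are invented for the illustration, and allocation, caching and GC costs are ignored entirely:

```python
BRANCHING = 32  # node width used by the vector


def refs_copied_without_tail(depth=4):
    # Every append copies one 32-slot node per tree level.
    return depth * BRANCHING


def avg_refs_copied_with_tail(depth=4):
    # Model one full cycle of 32 appends.
    total = 0
    for tail_len in range(BRANCHING):
        # Copy the existing tail entries into a new array, then add one element.
        total += tail_len + 1
    # Once per cycle the full tail is pushed into the tree (one path copy).
    total += depth * BRANCHING
    return total / BRANCHING
```

Under these assumptions the model gives 128 copied references per append without a tail and 20.5 on average with one, which is at least in the direction of the small constant factors quoted in the question.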
Cannot tell about Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vectors) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation. They are 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But Scala also has mutable vectors: they're hidden in the class VectorBuilder. Appending to a mutable vector does not preserve the previous version of the vector, but mutates it in place by keeping a pointer to the rightmost leaf in the vector. So, yes: keeping the vector mutable internally, and then returning an immutable reference, is exactly what's done in Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing, they work with reference objects to avoid any bias:
comparison of Scala List, Vector and Conc
comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

Algorithms to represent a set of integers with only one integer

This may not be a programming question, but it's a problem that arose recently at work. Some background: big C development with special interest in performance.
I have a set of integers and want to test the membership of another given integer. I would love to implement an algorithm that can check it with a minimal set of algebraic functions, using only an integer to represent the whole space of integers contained in the first set.
I've tried a composite Cantor pairing function, for instance, but with a 30-element set it seems too complicated, and with a focus on performance it makes no sense. I played with some operations, like XORing and negating, but they give poor estimates of membership. Then I tried successions of additions and finally got lost.
Any ideas?
For sets of unsigned long of size 30, the following is one fairly obvious way to do it:
store each set as a sorted array, 30 * sizeof(unsigned long) bytes per set.
to look up an integer, do a few steps of a binary search, followed by a linear search (profile in order to figure out how many steps of binary search is best - my wild guess is 2 steps, but you might find out different, and of course if you test bsearch and it's fast enough, you can just use it).
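The two steps above can be sketched like this, in Python rather than C for brevity. The function name `hybrid_contains` and the default of 2 binary steps (the answer's "wild guess") are placeholders to be tuned by profiling:

```python
def hybrid_contains(sorted_vals, target, binary_steps=2):
    # Invariant: if target is present, it lies in sorted_vals[lo:hi].
    lo, hi = 0, len(sorted_vals)

    # A few steps of binary search to narrow the window.
    for _ in range(binary_steps):
        if hi - lo <= 1:
            break
        mid = (lo + hi) // 2
        if sorted_vals[mid] <= target:
            lo = mid
        else:
            hi = mid

    # Linear scan over the remaining window, stopping early once we pass target.
    for i in range(lo, hi):
        value = sorted_vals[i]
        if value == target:
            return True
        if value > target:
            return False
    return False
```

With 30 elements and 2 binary steps, the linear scan covers at most 8 entries.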
So the next question is why you want a big-maths solution, which will tell me what's wrong with this solution other than "it is insufficiently pleasing".
I suspect that any big-math solution will be slower than this. A single arithmetic operation on an N-digit number takes at least linear time in N. A single number to represent a set can't be very much smaller than the elements of the set laid end to end with a separator in between. So even a linear search in the set is about as fast as a single arithmetic operation on a big number. With the possible exception of a Goedel representation, which could do it in one division once you've found the nth prime number, any clever mathematical representation of sets is going to take multiple arithmetic operations to establish membership.
Note also that there are two different reasons you might care about the performance of "look up an integer in a set":
You are looking up lots of different integers in a single set, in which case you might be able to go faster by constructing a custom lookup function for that data. Of course in C that means you need either (a) a simple virtual machine to execute that "function", or (b) runtime code generation, or (c) to know the set at compile time. None of which is necessarily easy.
You are looking up the same integer in lots of different sets (to get a sequence of all the sets it belongs to), in which case you might benefit from a combined representation of all the sets you care about, rather than considering each set separately.
I suppose that very occasionally, you might be looking up lots of different integers, each in a different set, and so neither of the reasons applies. If this is one of them, you can ignore that stuff.
One good start is to try Bloom Filters.
Basically, it's a probabilistic data structure that gives you no false negatives, but some false positives. So when an integer matches a Bloom filter, you then have to check whether it really is in the set, but it's a big speedup because it greatly reduces the number of sets to check.
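As an illustration, here is a minimal Bloom filter sketch in Python whose whole state is a single integer used as a bit field, which is close to what the question asks for. The class name, the 256-bit size, the 3 hash functions and the SHA-256-based hashing are all arbitrary choices for the example, not a tuned design:

```python
import hashlib


class IntBloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the entire filter lives in this one integer

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))
```

A `True` result still has to be confirmed against the real set, as described above; only a `False` result can be trusted on its own.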
If I understood you correctly, here is a Python example:
>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len_a = len(a)
>>> b = [1]
>>> if len(set(a) - set(b)) < len_a:
...     print('this integer exists in set')
...
this integer exists in set
math base: http://en.wikipedia.org/wiki/Euler_diagram

Hashtables/Dictionaries that use floats/doubles

I read somewhere about other data structures similar to hashtables, dictionaries but instead of using ints, they were using floats/doubles, etc.
Does anyone know what they are?
If you mean using floats/doubles as keys in your hash, that's easy. For example, in .NET, it's just using Dictionary<double,MyValueType>.
If you're talking about having the hash be based off a double instead of an int....
Technically, you can have any element as your internal hash. Normally, this is done using an int or long, since these are fast, and the hashing algorithm is easy to compute.
However, the hash is really just a BitArray at heart, so anything would work. There really isn't much advantage to making this something other than an int or long, other than potentially allowing a larger set of hash values (i.e., if you go to an 8-byte or larger type for your hash).
You mean as keys? That strikes me as tricky.
If you're using them as arbitrary keys, they're no better than integers.
If you expect to calculate a floating-point value and use it to look something up in a hash table, you're living very dangerously. Floating point numbers do not have infinite precision, and calculating the same thing in two slightly different ways can result in very tiny differences in the result. Hash keys rely on getting the exact same thing every time, so you'd have to be careful to round, and round in exactly the same way at all times. This is trickier than it sounds, by the way.
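A small Python sketch of the danger described above: two ways of computing "0.3" produce different doubles, so the lookup misses. The `quantize` helper and its 9-digit precision are assumptions made for the example, and rounding like this is only a partial fix, since values that fall close to a rounding boundary can still land in different buckets:

```python
def quantize(x, digits=9):
    # Round every key the same way before storing or looking it up.
    return round(x, digits)


table = {0.3: "found"}
computed = 0.1 + 0.2  # mathematically 0.3, but not the same double

print(computed == 0.3)    # False
print(computed in table)  # False: the key hashes and compares differently

rounded_table = {quantize(0.3): "found"}
print(quantize(computed) in rounded_table)  # True
```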
So, what would you do with floating-point hashes?
A hash algorithm is, in general terms, just a function that produces a smaller output from a larger input. Good hash functions have interesting properties like a large change in output for a small change in the input, and an assurance that they produce every possible output value for some input.
It's not hard to write a simple polynomial type hash function that outputs a floating-point value, rather than an integer value, but it's difficult to ensure that the resulting hash function has the desired properties without getting into the details of the particular floating-point representation used.
At least part of the reason that hash functions are nearly always implemented in integer arithmetic is because proving various properties about an integer calculation is easier than doing the same for a floating point calculation.
It's fairly easy to prove that some (sum of prime factors) modulo (another prime) must, necessarily, produce every possible output for some input. Doing the same for a calculation with a bunch of floating-point fractions would be a drag.
Add to that the relative difficulty of storing and transmitting floating-point values without corruption, and it's just not worth it.
Your question history shows that you use .Net, so I'll answer in that context.
If you want a Dictionary that is type-aware, such that you can specify it should use floats or doubles for the keys or values, use System.Collections.Generic.Dictionary<TKey, TValue> http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
If you want a Dictionary that is type-blind, such that you can use floats AND doubles for keys and values, use System.Collections.Hashtable http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx
