Alternative hash table equality testing for keys

SBCL profiling shows one of my Common Lisp hash table functions is consuming a significant amount of time. The function compares two hash tables to determine if they have the same keys:
(defun same-keys (ht1 ht2)
  "Returns t if two hash tables have the same keys."
  (declare (hash-table ht1 ht2))
  (when (= (hash-table-count ht1) (hash-table-count ht2))
    (maphash (lambda (ht1-key ht1-value)
               (declare (ignore ht1-value))
               ;; Caveat: GETHASH's primary value is NIL both for a missing
               ;; key and for a present key whose stored value is NIL, so
               ;; this test can misfire if NIL (the empty list) is stored.
               (unless (gethash ht1-key ht2)
                 (return-from same-keys nil)))
             ht1)
    t))
Is there a way to speed this up given the hash tables are always #'eql with fixnum keys? I'm also loading the lparallel library, but would it make any sense to somehow parallelize the function in this case?
Edit: The size of the hash tables ranges from about 10 to 100 entries. The key range extends from 100 up to 999,999,999,999, but the set of fixnums actually used in this range is sparse. Each value is either t or a list. The key-value associations for all hash tables are set at load time. New hash tables are created at run time by copying existing ones and adding or removing entries incrementally. Routine hash table reading, writing, and copying do not seem to be a problem.

Apart from low-level optimizations, it depends on the size of the hash-tables and the possible range of values of the keys.
If the key range is not much bigger than the table size, you may be faster with vectors instead of hash-tables. If the size is small (less than about 20-50 entries) but the range is large (e.g. UUIDs), alists may be better suited.
If writing to these hash-tables is not the bottleneck, you could wrap your hash-tables in objects that also hold a helper data structure for the key comparison. This might be a bit-vector marking the used keys, a complete custom hash of all used keys, or (if size and range are really big) something like a Bloom filter.
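To make the wrapper idea concrete, here is a minimal sketch under the question's stated workload (eql tables with fixnum keys, modified incrementally): keep a cheap commutative signature of the key set, here the XOR of all present keys, updated on every add or remove. Unequal signatures prove the key sets differ; equal signatures still require the full comparison. All names below are illustrative, not from the original post.

(defstruct keyed-ht
  (table (make-hash-table :test #'eql) :type hash-table)
  (key-sig 0 :type fixnum)) ; XOR of all present keys

(defun keyed-ht-put (kht key value)
  (let ((table (keyed-ht-table kht)))
    (unless (nth-value 1 (gethash key table)) ; new key: toggle into signature
      (setf (keyed-ht-key-sig kht) (logxor (keyed-ht-key-sig kht) key)))
    (setf (gethash key table) value)))

(defun keyed-ht-del (kht key)
  (when (remhash key (keyed-ht-table kht)) ; key removed: toggle it back out
    (setf (keyed-ht-key-sig kht) (logxor (keyed-ht-key-sig kht) key))))

(defun same-keys* (kht1 kht2)
  (and (= (keyed-ht-key-sig kht1) (keyed-ht-key-sig kht2)) ; fast reject
       (= (hash-table-count (keyed-ht-table kht1))
          (hash-table-count (keyed-ht-table kht2)))
       (loop :for key :being :the :hash-keys :of (keyed-ht-table kht1)
             :always (nth-value 1 (gethash key (keyed-ht-table kht2))))))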
Parallelizing might make sense if your problem is big enough in some dimension to make it worth the overhead: for example, either the frequency of independent comparisons is very high, or the number of keys per hash-table very big.
One possible low-level optimization is to use loop instead of maphash, which most of the time can be compiled to much faster code:
(loop :for key1 :being :the :hash-keys :of ht1
      :always (nth-value 1 (gethash key1 ht2)))
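As for parallelizing: with lparallel already loaded, a parallel variant is short to write over a snapshot of the keys, sketched below (lparallel:pevery is the real lparallel function; same-keys-parallel is an illustrative name). With only 10-100 entries per table, though, the per-task overhead will almost certainly swamp any gain, so this only pays off for much larger tables or for many independent comparisons.

;; Assumes a kernel was created once at startup, e.g.:
;; (setf lparallel:*kernel* (lparallel:make-kernel 4))
(defun same-keys-parallel (ht1 ht2)
  ;; ht2 is only read, never written, while the workers run.
  (when (= (hash-table-count ht1) (hash-table-count ht2))
    (lparallel:pevery (lambda (key) (nth-value 1 (gethash key ht2)))
                      (loop :for key :being :the :hash-keys :of ht1
                            :collect key))))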

Related

When and why to use hash tables in CL instead of a-lists?

I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
The a-list is the most important one to me; I use it all the time.
When and why do you (or should you) use hash tables?
My reluctance to use them is that, unlike the other data structures, hash tables in CL are not visible lists, which honestly I find weird considering almost everything is a list.
Maybe I am missing something in my inexperience?
The hash table is very useful when you have to access a large set of values through a key, since the complexity of this operation with a hash table is O(1), while the complexity of the operation using an a-list is O(n), where n is the length of the list.
So, I use it when I need to access a set of values multiple times and the set has more than a few elements.
There are a lot of assumptions to address in your question:
I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
I don't think this is particularly true; the standard libraries of popular languages (C++, Java, Rust, Python) are filled with a lot of data structures too.
When and why do you (or should you) use hash tables?
Data structures come with costs in terms of memory and processor usage: a list must be searched linearly to find an element, whereas a hash-table has a constant lookup cost; for small lists, however, the linear search might be faster than the constant-time lookup. Moreover, there are other criteria, like: do I want to access the data concurrently? A list can be manipulated in a purely functional way, making data sharing across threads easier than with a hash-table (but a hash-table can be associated with a mutex, etc.)
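As a quick illustration of that trade-off (a sketch; the sizes and the timing harness are arbitrary, and real numbers depend on the implementation):

(let ((alist (loop :for i :below 1000 :collect (cons i (* i i))))
      (table (make-hash-table :test #'eql)))
  (loop :for i :below 1000 :do (setf (gethash i table) (* i i)))
  ;; Worst case for the alist: the key is at the far end, O(n) per lookup.
  (time (loop :repeat 100000 :do (assoc 999 alist)))
  ;; The hash-table lookup cost does not depend on where the key "is".
  (time (loop :repeat 100000 :do (gethash 999 table))))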
My reluctance to use them is that, unlike the other data structures, hash tables in CL are not visible lists, which honestly I find weird considering almost everything is a list.
The source code of Lisp programs is made mostly of lists and symbols, even if there is no such restriction. But at runtime, CL has a lot of different types that are not at all related to lists: bignums, floating-point numbers, rational numbers, complex numbers, vectors, arrays, packages, symbols, strings, classes and structures, hash-tables, readtables, functions, etc. You can model a lot of data at runtime by putting it in lists, which works well in many cases, but lists are by far not the only types available.
Just to emphasize a little bit, when you write:
(vector 0 1 2)
This might look like a list in your code, but at runtime the value really is a different kind of object, a vector. Do not be confused by how things are expressed in code and how they are represented during code execution.
If you don't use it already, I suggest installing and using the Alexandria Lisp library (see https://alexandria.common-lisp.dev/). There are useful functions to convert hash-tables from and to alists or plists.
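For example (assuming Alexandria is loaded, e.g. via Quicklisp):

(defparameter *table*
  (alexandria:alist-hash-table '((:a . 1) (:b . 2)) :test #'eql))

(alexandria:hash-table-alist *table*) ; => ((:B . 2) (:A . 1)), order unspecified
(alexandria:hash-table-keys *table*)  ; => (:B :A), order unspecified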
More generally, I think it is important to architect your libraries and programs in a way that hides implementation details: you define a function make-person and accessors person-age, person-name, etc., as well as other user-facing functions. The actual implementation can use hash tables, lists, etc., but this is not a concern that should be exposed, because exposing it is a risk: you won't be able to easily change your mind later if you find out that the performance is bad or if you want to add a cache, use a database, etc.
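A minimal sketch of that idea, reusing the make-person and accessor names from above (the hash-table representation is just one possible choice):

(defun make-person (&key name age)
  (let ((person (make-hash-table :test #'eq)))
    (setf (gethash :name person) name
          (gethash :age person) age)
    person))

(defun person-name (person) (gethash :name person))
(defun person-age (person) (gethash :age person))

;; Callers only ever see MAKE-PERSON and the accessors; switching the
;; representation to a struct, a CLOS class, or an alist later only
;; changes these three definitions.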
I find however that CL is good at making nice interfaces that do not come with too much accidental complexity.
My reluctance to use them is that, unlike the other data structures, hash tables in CL are not visible lists.
They are definitely not lists, and indeed they are not visible either:
#<HASH-TABLE :TEST EQL :COUNT 1 {100F4BA883}>
this doesn't show what's inside the hash-table. During development it will require more steps to inspect what's inside (inspect, describe, alexandria:hash-table-alist, defining a non-portable print-object method…).
serapeum:dict
I like very much serapeum:dict, coupled with (serapeum:toggle-pretty-print-hash-table) (also the Cookbook).
CL-USER> (serapeum:dict :a 1 :b 2 :c 3)
;; => #<HASH-TABLE :TEST EQUAL :COUNT 3 {100F6012D3}>
CL-USER> (serapeum:toggle-pretty-print-hash-table)
;; print the above HT again:
CL-USER> **
(SERAPEUM:DICT
 :A 1
 :B 2
 :C 3
 )
Not only is it printed readably, but it also allows you to create the hash-table with initial elements at the same time (unlike make-hash-table), and you can read it back in. It's even easy to save such a structure to a file.
Serapeum is a solid library.
Now, use hash-tables more easily.
When to use a hash-table: You need to do fast (approximately "constant time") look-ups of data.
When to use an a-list: You have a need to dynamically shadow data you pass on to functions.
If neither of these obviously applies, you have to make a choice. And then, possibly, benchmark your choice. And then evaluate whether rewriting it using the other choice would be better. In some experimentation that someone else did, well over a decade ago, the crossover point between an a-list and a hash-map in most Common Lisp implementations was somewhere in the region of 5 to 20 keys.
However, if you have a need to "shadow" bindings for functions you call, an a-list does provide that "for free", and a hash-map does not. So if that is something your code does a lot of, an a-list MAY be the better choice.
* (defun lookup (alist key) (assoc key alist))
LOOKUP
* (lookup '((key1 . value1) (key2 . value2)) 'key1)
(KEY1 . VALUE1)
* (lookup '((key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE2)
* (lookup '((key2 . value3) (key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE3)

What is the performance cost of converting between seqs and vectors?

Many core Clojure functions return lazy sequences, even when vectors are passed into them. For example, if I had a vector of numbers, and wanted to filter them based on some predicate but get another vector back, I'd have to do something like this:
(into [] (filter my-pred my-vec))
Or:
(vec (filter my-pred my-vec))
Though I'm not sure if there's any meaningful difference between the two.
Is this operation expensive, or do you get it effectively for free, as when converting to/from a transient?
I understand that the seq is lazy so nothing will actually get calculated until you plop it into the output vector, but is there an overhead to converting from a seq and a concrete collection? Can it be characterized in terms of big-O, or does big-O not make sense here? What about the other way, when converting from a vector to a seq?
There's an FAQ on the Clojure site about good use cases for transducers, which can be handy for some complex transformations (more than just filtering, or when the predicate is fairly complex). Otherwise you can leverage filterv, which is in the core library, and you can assume it does any reasonable optimization for you.
TL;DR Don't worry about it
Longer version:
The main cost is memory allocation/GC. Usually this is trivial. If you have too much data to fit simultaneously in RAM, the lazy version can save you.
If you want to measure toy problems, you can experiment with the Criterium library. Try powers of 10 from 10^2 up to 10^9.
;; assumes Criterium is loaded: (require '[criterium.core :as crit])
(crit/quick-bench (println :sum (reduce + 0 (into [] (range (Math/pow 10 N))))))
for N=2..9 with and without the (into [] ...) part.

Hash Function For Sequence of Unique Ids (UUID)

I am storing message sequences in a database; each sequence can have up to N messages. I want to create a hash function that represents a message sequence and enables a faster check of whether the sequence exists.
Each message has a case-sensitive alphanumeric universally unique id (UUID).
Consider following messages (M1, M2, M3) with ids-
M1 - a3RA0000000e0taBB
M2 - a3RA00033000e0taC
M3 - a3RA0787600e0taBB
Message sequences can be
Sequence-1 : (M1,M2,M3)
Sequence-2 : (M1,M3,M2)
Sequence-3 : (M2,M1,M3)
Sequence-4 : (M1,M2)
Sequence-5 : (M2,M3)
...etc...
Following is the database structure example for storing message sequence
Given a message sequence, we need to check whether that sequence exists in the database. For example, check whether the message sequence M1 -> M2 -> M3, i.e. the UIDs (a3RA0000000e0taBB -> a3RA00033000e0taC -> a3RA0787600e0taBB), exists in the database.
Instead of scanning the rows in the table, I want to create a hash function that represents the message sequence as a single hash value. Looking up that hash value in the table should be faster.
My simple hash function is-
I am wondering what would be an optimal hash function for storing the message sequence hash for a faster existence check.
You don't need a full-blown cryptographic hash, just a fast one, so how about having a look at FastHash: https://github.com/ZilongTan/Coding/tree/master/fast-hash. If you believe 32 or 64 bit hashes are not enough (i.e. produce too many collisions) then you could use the longer MurmurHash: https://en.wikipedia.org/wiki/MurmurHash (actually, the author of FastHash recommends this approach)
There's a list of more algorithms on Wikipedia: https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions
In any case, hashes using bit operations (SHIFT, XOR ...) should be faster than the multiplication in your approach, even on modern machines.
How about using the MD5 algorithm to generate the hash for a concatenated string of message UIDs?
For instance, consider messages
M1 - a3RA0000000e0taBB
M2 - a3RA00033000e0taC
M3 - a3RA0787600e0taBB
For message sequence M1->M2->M3 the string would be a3RA0000000e0taBB;a3RA00033000e0taC;a3RA0787600e0taBB, which has the MD5 hash 176B1CDE75EDFE1554888DAA863671C4.
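In Common Lisp, for instance, this could look like the following sketch, assuming the Ironclad library for MD5 (the library choice and the function name are assumptions; any MD5 implementation does the same job):

(defun sequence-hash (uuids)
  "Return the MD5 hex digest of the ;-joined UUID strings."
  (ironclad:byte-array-to-hex-string
   (ironclad:digest-sequence
    :md5
    (ironclad:ascii-string-to-byte-array (format nil "~{~A~^;~}" uuids)))))

;; (sequence-hash '("a3RA0000000e0taBB" "a3RA00033000e0taC" "a3RA0787600e0taBB"))
;; should reproduce the digest quoted above.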
According to this answer, MD5 is robust against accidental collisions, though no longer against deliberately constructed ones. In the given scenario there is no need for security, so MD5 may suffice.
Premature optimisation is the root of all evil. Start with the hashing function that is built into your language of choice, and then hash the lists (M1, M2), etc. Then profile it and see whether hashing is the bottleneck before you start using third-party hash libraries.
My guess is that database lookup will be slower than the hash computation, so it won't matter which hash you use.
In Python you can just call
hash((m1, m2, m3))  # tuples are hashable; lists are not
In Java call the hashCode method on your ArrayList.
Any regular string hash algorithm (say, your language's base-library string hash) applied to the concatenation of the messages' UUIDs would suffice, as long as you then select all messages matching that hash and check that they are indeed your messages in the correct order. That may or may not be efficient depending on how many messages are usually in a sequence (also think about the worst case). There is no way to guarantee collision-free hash calculation in general, so you should think about what you are going to do in case of a collision.
Now, if you want to optimize this to make sure your hash is unique, it could be possible in some circumstances. You will learn about a collision once you try to insert the data, so you can do something about it (say, apply a salt or a dummy message to the sequence to modify the hash, and keep doing so until you get an empty spot), but it will require sufficiently large hashes and potentially other app-specific modifications.

Why is hash output fixed in length?

Hash functions always produce a fixed-length output regardless of the input (e.g. MD5 produces 128 bits, SHA-256 produces 256 bits), but why?
I know that it is how the designers designed them to be, but why did they design the output to have the same length?
So that it can be stored in a consistent fashion? So it is easier to compare? So it is less complicated?
Because that is what the definition of a hash is. Refer to Wikipedia:
A hash function is any function that can be used to map digital data
of arbitrary size to digital data of fixed size.
If your question relates to why it is useful for a hash to be a fixed size, there are multiple reasons (non-exhaustive list):
Hashes typically encode a larger (often arbitrary size) input into a smaller size, generally in a lossy way, i.e. unlike compression functions, you cannot reconstruct the input from the hash value by "reversing" the process.
Having a fixed size output is convenient, especially for hashes designed to be used as a lookup key.
You can predictably (pre)allocate storage for hash values and index them in a contiguous memory segment such as an array.
For hashes of "native word sizes", e.g. 16, 32 and 64 bit integer values, you can do very fast equality and ordering comparisons.
Any algorithm working with hash values can use a single set of fixed size operations for generating and handling them.
You can predictably combine hashes produced with different hash functions in e.g. a bloom filter.
You don't need to waste any space to encode how big the hash value is.
There do exist special hash functions that are capable of producing an output hash of a specified length, such as so-called sponge functions.
As you can see, it is the standard. Also, what you want is specified in the standard:
Some application may require a hash function with a message digest
length different than those provided by the hash functions in this
Standard. In such cases, a truncated message digest may be used,
whereby a hash function with a larger message digest length is applied
to the data to be hashed, and the resulting message digest is
truncated by selecting an appropriate number of the leftmost bits.
Often it's because you want to use the hash value, or some part of it, to quickly store and look up values in a fixed-size array. (This is how a non-resizable hashtable works, for example.)
And why use a fixed-size array instead of some other, growable data structure (like a linked list or binary tree)? Because accessing them tends to be both theoretically and practically fast: provided that the hash function is good and the fraction of occupied table entries isn't too high, you get O(1) lookups (vs. O(log n) lookups for tree-based data structures or O(n) for lists) on average. And these accesses are fast in practice: after calculating the hash, which usually takes linear time in the size of the key with a low hidden constant, there's often just a bit shift, a bit mask and one or two indirect memory accesses into a contiguous block of memory that (a) makes good use of cache and (b) pipelines well on modern CPUs because few pointer indirections are needed.
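To make the last point concrete, here is a minimal Common Lisp sketch of that lookup path, assuming a power-of-two table size so that reducing the hash to a slot is a single mask (the function name is illustrative):

(defun table-index (key table-size)
  "Map KEY's hash to a slot index in a TABLE-SIZE (power of two) array."
  (logand (sxhash key) (1- table-size)))

;; (table-index "some key" 1024) => an index in [0, 1023]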

Looking for an one-way function with small input and long output

I'm looking for an algorithm which is a one-way function, like a hash function, but which accepts a small input (several bits, less than 512 bits) and maps it to a long output (1 KB or more). Do you know of an algorithm or a function like this?
From Shannon's theorem, you don't gain any security by having a ciphertext bigger than your plaintext, unless the key (or the procedure to create the ciphertext) is different for every input. Even in this case, you will need to assign only one key (or mechanism) to each input x, otherwise you violate the definition of a function. So if you apply an encryption mechanism f: X (set of inputs) -> Y (set of outputs), the image f(X) has at most |X| elements.
All this is to say that if your input is less than 512 bits, you gain nothing by producing a 1 KB output. Now, I recommend using one of the functions listed on the one-way function wiki page.
Keccak has variable-length output (although this was not evaluated for SHA-3); its "security claim is disentangled from the output length. There is a minimum output length..." And the Skein hash function has a variable output of up to 16 exabytes.
Whatever your reasons are, you can calculate hashes of the same small data using different algorithms, then concatenate those hashes. If the output is not large enough, calculate hashes of hashes and append them.
As pointed out in other answers, this doesn't make much sense from a security perspective.
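For what it's worth, here is a Common Lisp sketch of the hash-of-hashes idea, assuming the Ironclad library; it is essentially the MGF1 construction (hash the seed together with a counter and concatenate the blocks), and all names are illustrative:

(defun stretch (seed-bytes n-bytes)
  "Expand SEED-BYTES (a byte vector) into N-BYTES of one-way output."
  (let ((out (make-array n-bytes :element-type '(unsigned-byte 8)))
        (filled 0))
    ;; The single-byte counter caps the output at 256 blocks (8 KiB here).
    (loop :for counter :from 0
          :while (< filled n-bytes)
          :do (let ((chunk (ironclad:digest-sequence
                            :sha256
                            (concatenate '(vector (unsigned-byte 8))
                                         seed-bytes
                                         (vector (ldb (byte 8 0) counter))))))
                ;; Copy as much of this 32-byte block as still fits.
                (loop :for octet :across chunk
                      :while (< filled n-bytes)
                      :do (setf (aref out filled) octet)
                          (incf filled))))
    out))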
