When and why to use hash tables in CL instead of a-lists? - data-structures

I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
The a-list being the most important one to me. I use it all the time.
When and why do you (or should you) use hash tables?
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists. Which honestly, I find weird considering almost everything is a list.
Maybe I am missing something in my inexperience?

A hash table is very useful when you have to access a large set of values by key: a lookup in a hash table is O(1) on average, while a lookup in an a-list is O(n), where n is the length of the list.
So I use one when I need to look things up repeatedly in a set of values that has more than a few elements.
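A minimal sketch of the two lookups side by side, in plain Common Lisp. The variable names, function names and sample data are made up for illustration.

```lisp
;; Hypothetical sample data: the same three entries stored both ways.
(defparameter *ages-alist* '((alice . 31) (bob . 27) (carol . 45)))

(defparameter *ages-table*
  (let ((table (make-hash-table :test #'eql)))
    (loop for (name . age) in *ages-alist*
          do (setf (gethash name table) age))
    table))

;; ASSOC walks the list from the front until it finds the key: O(n).
(defun alist-age (name)
  (cdr (assoc name *ages-alist*)))

;; GETHASH hashes the key and goes (on average) straight to the entry.
(defun table-age (name)
  (values (gethash name *ages-table*)))
```

Both functions return the same answers; the difference only shows up as the number of entries grows.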

There are a lot of assumptions to address in your question:
I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
I don't think this is particularly true. The standard libraries of popular languages (C++, Java, Rust, Python) are full of data structures too.
When and why do you (or should you) use hash tables?
Data structures come with costs in terms of memory and processor usage. A list must be searched linearly to find an element, whereas a hash table has a constant lookup cost; for small lists, however, the linear search might be faster than the constant-time lookup. There are other criteria too, such as: do I want to access the data concurrently? A list can be manipulated in a purely functional way, which makes sharing data across threads easier than with a hash table (although a hash table can be guarded by a mutex, etc.).
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists. Which honestly, I find weird considering almost everything is a list.
The source code of Lisp programs is made mostly of lists and symbols, even if there is no such restriction. But at runtime, CL has a lot of different types that are not at all related to lists: bignums, floating-point numbers, rational numbers, complex numbers, vectors, arrays, packages, symbols, strings, classes and structures, hash tables, readtables, functions, etc. You can model a lot of data at runtime by putting it in lists, which works well in a lot of cases, but lists are by far not the only types available.
Just to emphasize a little bit, when you write:
(vector 0 1 2)
This might look like a list in your code, but at runtime the value really is a different kind of object, a vector. Do not be confused by how things are expressed in code and how they are represented during code execution.
If you don't use it already, I suggest installing the Alexandria library (see https://alexandria.common-lisp.dev/). It has useful functions for converting hash tables to and from alists or plists.
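For a sense of what such conversions involve, here is a hand-written sketch in plain Common Lisp. The names my-alist-to-table and my-table-to-alist are hypothetical; Alexandria ships its own versions (alexandria:alist-hash-table and alexandria:hash-table-alist), which are the ones to use in practice.

```lisp
(defun my-alist-to-table (alist &key (test #'eql))
  "Build a fresh hash table from ALIST. Earlier entries win, as with ASSOC."
  (let ((table (make-hash-table :test test)))
    (loop for (key . value) in alist
          do (unless (nth-value 1 (gethash key table))
               (setf (gethash key table) value)))
    table))

(defun my-table-to-alist (table)
  "Return the entries of TABLE as a fresh alist, in no particular order."
  (let ((alist '()))
    (maphash (lambda (key value) (push (cons key value) alist)) table)
    alist))
```

Letting the first entry win is a choice made in this sketch so that a shadowed key in the alist keeps the value assoc would have found.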
More generally, I think it is important to architect your libraries and programs in a way that hides implementation details: you define a function make-person and accessors person-age, person-name, etc., as well as other user-facing functions. The actual implementation can use hash tables, lists, etc., but this is not a concern that should be exposed. Exposing it is a risk: you won't be able to change your mind easily later if you find out that the performance is bad, or if you want to add a cache, use a database, etc.
I find however that CL is good at making nice interfaces that do not come with too much accidental complexity.
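As an illustration of the make-person idea, here is one possible shape for such an interface. The hash-table representation and the describe-person function are assumptions made for this sketch; nothing outside the first three definitions depends on them.

```lisp
;; The representation (a hash table here) is an implementation detail.
;; Callers only ever see MAKE-PERSON, PERSON-NAME and PERSON-AGE.
(defun make-person (&key name age)
  (let ((person (make-hash-table :test #'eq)))
    (setf (gethash :name person) name
          (gethash :age person) age)
    person))

(defun person-name (person)
  (values (gethash :name person)))

(defun person-age (person)
  (values (gethash :age person)))

;; Client code never mentions hash tables, so the three definitions above
;; could later be replaced by a DEFSTRUCT, an alist, or a database lookup.
(defun describe-person (person)
  (format nil "~a (~a)" (person-name person) (person-age person)))
```

Swapping the representation later means rewriting make-person and the two accessors, and nothing else.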

My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists.
They are definitely not lists, but indeed they are not visible either:
#<HASH-TABLE :TEST EQL :COUNT 1 {100F4BA883}>
This doesn't show what's inside the hash table. During development it takes more steps to inspect the contents (inspect, describe, alexandria:hash-table-alist, defining a non-portable print-object method…).
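One portable way to peek inside, without touching print-object, is a small helper built on maphash. The name show-table is made up for this sketch.

```lisp
(defun show-table (table &optional (stream *standard-output*))
  "Print each key/value pair of TABLE on its own line, then return TABLE."
  (maphash (lambda (key value)
             (format stream "~s => ~s~%" key value))
           table)
  table)
```

The order of the printed entries is whatever maphash happens to use, so it should not be relied on.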
serapeum:dict
I very much like serapeum:dict, coupled with (serapeum:toggle-pretty-print-hash-table) (see also the Cookbook).
CL-USER> (serapeum:dict :a 1 :b 2 :c 3)
;; => #<HASH-TABLE :TEST EQUAL :COUNT 3 {100F6012D3}>
CL-USER> (serapeum:toggle-pretty-print-hash-table)
;; print the above HT again:
CL-USER> **
(SERAPEUM:DICT
:A 1
:B 2
:C 3
)
Not only is it printed readably, it also lets you create the hash table with initial elements in one step (unlike make-hash-table), and you can read it back in. It's even easy to save such a structure to a file.
Serapeum is a solid library.
Now you can use hash tables more easily.

When to use a hash-table: You need to do fast (approximately "constant time") look-ups of data.
When to use an a-list: You have a need to dynamically shadow data you pass on to functions.
If neither of these obviously applies, you have to make a choice. Then, possibly, benchmark your choice, and evaluate whether rewriting it the other way would be better. In some experiments that someone else ran well over a decade ago, the break-even point between an a-list and a hash table in most Common Lisp implementations was somewhere in the region of 5 to 20 keys.
However, if you have a need to "shadow" bindings, for functions you call, an a-list does provide that "for free", and a hash-map does not. So if that is something that your code does a lot of, an a-list MAY be the better choice.
* (defun lookup (alist key) (assoc key alist))
LOOKUP
* (lookup '((key1 . value1) (key2 . value2)) 'key1)
(KEY1 . VALUE1)
* (lookup '((key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE2)
* (lookup '((key2 . value3) (key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE3)
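The same shadowing can be used to hand a callee a modified view of the bindings without changing what the caller sees. A minimal sketch; lookup-value, call-with-shadowed-binding, *settings* and demo are hypothetical names introduced here.

```lisp
(defun lookup-value (alist key)
  (cdr (assoc key alist)))

(defun call-with-shadowed-binding (alist key value function)
  "Call FUNCTION with a view of ALIST in which KEY maps to VALUE.
ACONS allocates one new cell in front; ALIST itself is not modified."
  (funcall function (acons key value alist)))

(defparameter *settings* '((verbose . nil) (depth . 3)))

;; Inside the call DEPTH is 10; afterwards the caller still sees 3.
(defun demo ()
  (list (call-with-shadowed-binding
         *settings* 'depth 10
         (lambda (settings) (lookup-value settings 'depth)))
        (lookup-value *settings* 'depth)))
;; (demo) returns (10 3)
```

Doing the same with a hash table would mean copying the table or saving and restoring the old value around the call.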

Related

What is the performance cost of converting between seqs and vectors?

Many core Clojure functions return lazy sequences, even when vectors are passed into them. For example, if I had a vector of numbers, and wanted to filter them based on some predicate but get another vector back, I'd have to do something like this:
(into [] (filter my-pred my-vec))
Or:
(vec (filter my-pred my-vec))
Though I'm not sure if there's any meaningful difference between the two.
Is this operation expensive, or do you get it effectively for free, as when converting to/from a transient?
I understand that the seq is lazy so nothing will actually get calculated until you plop it into the output vector, but is there an overhead to converting between a seq and a concrete collection? Can it be characterized in terms of big-O, or does big-O not make sense here? What about the other way, when converting from a vector to a seq?
There's an FAQ on the Clojure site on good use cases for transducers, which can be handy for complex transformations (more than just filtering, or when the predicate is fairly complex). Otherwise you can use filterv, which is in the core library, and assume it does any reasonable optimization for you.
TL;DR Don't worry about it
Longer version:
The main cost is memory allocation/GC. Usually this is trivial. If you have too much data to fit simultaneously in RAM, the lazy version can save you.
If you want to measure toy problems, you can experiment with the Criterium library. Try powers of 10 from 10^2 up to 10^9.
(crit/quick-bench (println :sum (reduce + 0 (into [] (range (Math/pow 10 N))))))
for N=2..9 with and without the (into [] ...) part.

Python dictionary or map in elisp

What is the equivalent of a python dictionary like {'a':1, 'b':2} in elisp?
And again, does elisp have any map-reduce api?
Besides association lists (whose algorithmic complexity is OK for small tables but not for large ones), there are hash tables, which you can construct with make-hash-table and puthash or, if you prefer immediate values, write as #s(hash-table data (a 1 b 2)).
Association lists are the most commonly used associative containers in elisp. An association list is just a list of key-value cons cells, like ((key . value)). You can use the assoc function to get the cell for a key (its cdr is the value) and rassoc to get the cell holding a given value.
Elisp comes with the built-in function mapcar, which does map, but AFAIK there is no good fold facility. You could emulate one using any of the looping facilities provided. However, the better solution is to use cl-lib and slip into Common Lisp land. In particular, it supplies cl-mapcar and cl-reduce.

Where is DropWhile in Mathematica?

Mathematica 6 added TakeWhile, which has the syntax:
TakeWhile[list, crit]
gives elements e_i from the beginning of list, continuing so long as crit[e_i] is True.
There is however no corresponding "DropWhile" function. One can construct DropWhile using LengthWhile and Drop, but it almost seems as though one is discouraged from using DropWhile. Why is this?
To clarify, I am not asking for a way to implement this function. Rather: why is it not already present? It seems to me that there must be a reason for its absence other than an oversight, or it would have been corrected by now. Is there something inefficient, undesirable, or superfluous about DropWhile?
There appears to be some ambiguity about the function of DropWhile, so here is an example:
DropWhile = Drop[#, LengthWhile[#, #2]] &;
DropWhile[{1,2,3,4,5}, # <= 3 &]
Out= {4, 5}
Just a blind guess.
There are a lot of list operations that could take a While criterion. For example:
Total..While
Accumulate..While
Mean..While
Map..While
Etc..While
They are not difficult to construct, anyway.
I think those are not included just because the list of "primitive" functions is already growing too long, and the criterion "is it frequently needed and difficult for the user to implement with good performance?" prevails in those cases.
The ubiquitous lists in Mathematica are fixed-length vectors, and when they hold machine numbers they are stored as packed arrays.
Thus the natural functions for a recursively defined linked list (e.g. in Lisp or Haskell) are not the primary tools in Mathematica.
So I am inclined to think this explains why Wolfram did not fill out its repertoire of manipulation functions.

Reusing memory of immutable state in eager evaluation?

I'm studying purely functional languages and currently thinking about an implementation of immutable data.
Here is some pseudo-code.
List a = [1 .. 10000]
List b = NewListWithoutLastElement a
b
When evaluating b, the list must be copied in an eager/strict implementation of immutable data.
But in this case, a is not used anywhere afterwards, so the memory of a can safely be reused to avoid the cost of copying.
Furthermore, the programmer could force the compiler to always do this by marking the type List with some keyword meaning must-be-disposed-after-using, which would make any logic that cannot avoid the copy a compile-time error.
This could be a huge performance gain, because it can be applied to huge object graphs too.
What do you think? Are there any implementations?
This would be possible, but severely limited in scope. Keep in mind that the vast majority of complex values in a functional program will be passed to many functions to extract various properties from them - and, most of the time, those functions are themselves arguments to other functions, which means you cannot make any assumptions about them.
For example:
let map2 f g x = f x, g x
let apply f =
  let a = [1 .. 10000]
  f a
// in another file:
apply (map2 NewListWithoutLastElement NewListWithoutFirstElement)
This is fairly standard in functional code, and there is no way to place a must-be-disposed-after-using attribute on a because no specific location has enough knowledge about the rest of the program. Of course, you could try adding that information to the type system, but type inference on this is decidedly non-trivial (not to mention that types would grow quite large).
Things get even worse when you have compound objects, such as trees, that might share sub-elements between values. Consider this:
let a = binary_tree [ 1; 2; 5; 7; 9 ]
let result_1 = complex_computation_1 (insert a 6)
let result_2 = complex_computation_2 (remove a 5)
In order to allow memory reuse within complex_computation_2, you would need to prove that complex_computation_1 does not alter a, does not store any part of a within result_1 and is done using a by the time complex_computation_2 starts working. While the two first requirements might seem the hardest, keep in mind that this is a pure functional language: the third requirement actually causes a massive performance drop because complex_computation_1 and complex_computation_2 cannot be run on different threads anymore!
In practice, this is not an issue in the vast majority of functional languages, for three reasons:
They have a garbage collector built specifically for this. It is faster for them to just allocate new memory and reclaim the abandoned one, rather than try to reuse existing memory. In the vast majority of cases, this will be fast enough.
They have data structures that already implement data sharing. For instance, NewListWithoutFirstElement already provides full reuse of the memory of the transformed list without any effort. It's fairly common for functional programmers (and any kind of programmers, really) to determine their use of data structures based on performance considerations, and rewriting a "remove last" algorithm as a "remove first" algorithm is kind of easy.
Lazy evaluation already does something equivalent: a lazy list's tail is initially just a closure that can evaluate the tail if you need to—so there's no memory to be reused. On the other hand, this means that reading an element from b in your example would read one element from a, determine if it's the last, and return it without really requiring storage (a cons cell would probably be allocated somewhere in there, but this happens all the time in functional programming languages and short-lived small objects are perfectly fine with the GC).
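The data-sharing point can be seen concretely with singly linked lists. The sketch below uses Common Lisp as a stand-in for the pseudo-code above (an assumption; the question names no language): cdr plays the role of NewListWithoutFirstElement and butlast the role of NewListWithoutLastElement.

```lisp
(defparameter *a* (list 1 2 3 4 5))

;; "Remove first" is just CDR: no allocation, the result is a's own tail.
(defparameter *without-first* (cdr *a*))

;; "Remove last" has to build new cells, because every cell before the
;; last one would otherwise still point at the old tail.
(defparameter *without-last* (butlast *a*))

;; The cells of *WITHOUT-FIRST* are literally cells of *A*.
(defun shares-cells-p ()
  (eq (cdr *without-first*) (cddr *a*)))
```

This is why rewriting a "remove last" algorithm as a "remove first" one removes the copying cost without any help from the compiler.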

What does it mean to 'hash cons'?

When to use it and why?
My question comes from the sentence: "hash cons with some classes and compare their instances with reference equality"
From Odersky, Spoon and Venners (2007), Programming in Scala, Artima Press, p. 243:
You hash cons instances of a class by caching all instances you have created in a weak collection. Then, any time you want a new instance of the class, you first check the cache. If the cache already has an element equal to the one you are about to create, you can reuse the existing instance. As a result of this arrangement, any two instances that are equal with equals() are also equal with reference equality.
Putting everyone's answers together:
ACL2 (A Computational Logic for Applicative Common Lisp) is a software system consisting of a programming language, an extensible theory in a first-order logic, and a mechanical theorem prover.
-- Wiki ACL2
In computer programming, cons (pronounced /ˈkɒnz/ or /ˈkɒns/) is a fundamental function in most dialects of the Lisp programming language. cons constructs (hence the name) memory objects which hold two values or pointers to values. These objects are referred to as (cons) cells, conses, or (cons) pairs. In Lisp jargon, the expression "to cons x onto y" means to construct a new object with (cons x y). The resulting pair has a left half, referred to as the car (the first element), and a right half (the second element), referred to as the cdr.
-- Wiki Cons
Logically, hons is merely another name for cons, i.e., the following is an ACL2 theorem:
(equal (hons x y) (cons x y))
Hons generally runs slower than cons because in creating a hons, an attempt is made to see whether a hons already exists with the same car and cdr. This involves search and the use of hash-tables.
-- http://www.cs.utexas.edu/~moore/acl2/current/HONS.html
Given your question:
hash cons with some classes and compare their instances with reference equality
It appears that hash consing is the process of hashing the arguments to a constructor (cons, in Lisp) to find out whether an equal object already exists, and returning that object instead of creating a new one.
http://en.wikipedia.org/wiki/Hash_cons now redirects.
It is cons with hashing to allow eq (reference) comparison instead of a deep one. This is more efficient for memory (because identical objects are stored as references), and is of course faster if comparison is a common operation.
http://www.cs.utexas.edu/~moore/acl2/current/HONS.html describes an implementation for Lisp.
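A toy version of the idea in plain Common Lisp, to make the mechanism concrete. This is only a sketch and not how ACL2 implements hons: the table here holds strong references, so nothing is ever reclaimed, whereas the quoted description uses a weak collection; the name *hons-table* is made up.

```lisp
;; Table of every pair handed out so far. EQUAL compares conses by content.
(defvar *hons-table* (make-hash-table :test #'equal))

(defun hons (x y)
  "Like CONS, but return the previously created pair when one with an
EQUAL car and cdr already exists."
  (let ((candidate (cons x y)))
    (or (gethash candidate *hons-table*)
        (setf (gethash candidate *hons-table*) candidate))))
```

With this in place, (eq (hons 1 2) (hons 1 2)) is true, while two calls to cons with the same arguments give pairs that are equal but not eq. Pairs returned by hons must be treated as immutable, since modifying one would silently change every "copy".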

Resources