What does it mean to 'hash cons'? - performance

When to use it and why?
My question comes from the sentence: "hash cons with some classes and compare their instances with reference equality"

From Odersky, Spoon and Venners (2007), Programming in Scala, Artima Press, p. 243:
You hash cons instances of a class by caching all instances you have created in a weak collection. Then, any time you want a new instance of the class, you first check the cache. If the cache already has an element equal to the one you are about to create, you can reuse the existing instance. As a result of this arrangement, any two instances that are equal with equals() are also equal with reference equality.
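The caching scheme described in that quote can be sketched in Python (a stand-in for Scala here; the `Point` class is hypothetical, and `weakref.WeakValueDictionary` plays the role of the weak collection):

```python
import weakref

class Point:
    """An immutable 2-D point, hash-consed through a weak cache."""
    _cache = weakref.WeakValueDictionary()  # the "weak collection"
    __slots__ = ("x", "y", "__weakref__")

    def __new__(cls, x, y):
        key = (x, y)
        cached = cls._cache.get(key)
        if cached is not None:
            return cached              # reuse the existing instance
        self = super().__new__(cls)
        self.x, self.y = x, y
        cls._cache[key] = self
        return self

# Equal arguments yield the very same object, so identity works like equality:
assert Point(1, 2) is Point(1, 2)
```

Because the cache holds its values weakly, instances that the program no longer references can still be garbage-collected, which is exactly why the quote insists on a *weak* collection.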

Putting everyone's answers together:
ACL2 (A Computational Logic for Applicative Common Lisp) is a software system consisting of a programming language, an extensible theory in a first-order logic, and a mechanical theorem prover.
-- Wiki ACL2
In computer programming, cons (pronounced /ˈkɒnz/ or /ˈkɒns/) is a fundamental function in most dialects of the Lisp programming language. cons constructs (hence the name) memory objects which hold two values or pointers to values. These objects are referred to as (cons) cells, conses, or (cons) pairs. In Lisp jargon, the expression "to cons x onto y" means to construct a new object with (cons x y). The resulting pair has a left half, referred to as the car (the first element), and a right half (the second element), referred to as the cdr.
-- Wiki Cons
Logically, hons is merely another name for cons, i.e., the following is an ACL2 theorem:
(equal (hons x y) (cons x y))
Hons generally runs slower than cons because in creating a hons, an attempt is made to see whether a hons already exists with the same car and cdr. This involves search and the use of hash-tables.
-- http://www.cs.utexas.edu/~moore/acl2/current/HONS.html
Given your question:
hash cons with some classes and compare their instances with reference equality
It appears that hash consing is the process of hashing the arguments of a Lisp constructor to determine, via an equality comparison, whether an equal object already exists, so that it can be reused.

http://en.wikipedia.org/wiki/Hash_cons now redirects.
It is cons with hashing to allow eq (reference) comparison instead of a deep one. This is more memory-efficient (identical objects are stored once and shared by reference), and is of course faster if comparison is a common operation.
http://www.cs.utexas.edu/~moore/acl2/current/HONS.html describes an implementation for Lisp.

Related

When and why to use hash tables in CL instead of a-lists?

I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
The a-list being the most important one to me. I use it all the time.
When and why do you (or should you) use hash tables?
My reluctance to use them is that, unlike the other data structures, hash tables in CL are not visible as lists, which, honestly, I find weird considering almost everything is a list.
Maybe I am missing something in my inexperience?
The hash table is very useful when you have to access a large set of values through a key, since the complexity of this operation with a hash table is O(1), while the complexity of the operation using an a-list is O(n), where n is the length of the list.
So, I use one when I need to access a set of values multiple times and the set has more than a few elements.
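The O(1) versus O(n) difference can be sketched in Python (a list of pairs standing in for an a-list, `dict` for the hash table; the names are illustrative):

```python
# An a-list analogue: an association stored as a list of (key, value) pairs.
def assoc(alist, key):
    """Linear O(n) scan, like ASSOC on an a-list."""
    for k, v in alist:
        if k == key:
            return v
    return None

alist = [("a", 1), ("b", 2), ("c", 3)]
table = dict(alist)   # hash table: amortized O(1) lookup

assert assoc(alist, "b") == 2
assert table["b"] == 2
```

For three keys both are instantaneous; the asymptotic gap only matters once the association grows past a handful of entries, which matches the advice above.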
There are a lot of assumptions to address in your question:
I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
I don't think this is particularly true: the standard libraries of popular languages (C++, Java, Rust, Python) are filled with plenty of data structures too.
When and why do you (or should you) use hash tables?
Data structures come with costs in terms of memory and processor usage: a list must be searched linearly to find an element, whereas a hash-table has a constant lookup cost; for small lists, however, the linear search might be faster than the constant-time lookup. Moreover, there are other criteria, such as: do I want to access the data concurrently? A list can be manipulated in a purely functional way, making data-sharing across threads easier than with a hash-table (though a hash-table can be associated with a mutex, etc.).
My reluctance to use them is that, unlike the other data structures, hash tables in CL are not visible as lists, which, honestly, I find weird considering almost everything is a list.
The source code of Lisp programs is made mostly of lists and symbols, even though there is no such restriction. But at runtime, CL has a lot of different types that are not at all related to lists: bignums, floating-point numbers, rationals, complex numbers, vectors, arrays, packages, symbols, strings, classes and structures, hash-tables, readtables, functions, etc. You can model a lot of data at runtime by putting it in lists, which works well in many cases, but lists are by far not the only types available.
Just to emphasize a little bit, when you write:
(vector 0 1 2)
This might look like a list in your code, but at runtime the value really is a different kind of object: a vector. Do not confuse how things are expressed in code with how they are represented during execution.
If you don't use it already, I suggest installing and using the Alexandria Lisp library (see https://alexandria.common-lisp.dev/). It has useful functions to convert hash-tables to and from alists or plists.
More generally, I think it is important to architect your libraries and programs in a way that hides implementation details: you define a function make-person and accessors person-age, person-name, etc., as well as other user-facing functions. The actual implementation can use hash tables, lists, etc., but this is not a concern that should be exposed, because exposing it is a risk: you won't be able to easily change your mind later if you find out that performance is bad, or if you want to add a cache, use a database, etc.
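That separation can be sketched in Python (the names `make_person`, `person_name`, and `person_age` are taken from the paragraph above; the dict is just one possible hidden representation):

```python
# Public interface: a constructor and accessors. Callers never touch the dict,
# so the representation can later become a class, a hash table, a DB row, ...
def make_person(name, age):
    return {"name": name, "age": age}

def person_name(person):
    return person["name"]

def person_age(person):
    return person["age"]

p = make_person("Alice", 30)
assert person_name(p) == "Alice"
assert person_age(p) == 30
```

Only the three functions are part of the contract; swapping the dict for another structure would not break any caller.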
I find however that CL is good at making nice interfaces that do not come with too much accidental complexity.
My reluctance to use them is that, unlike the other data structures, hash tables in CL are not visible as lists.
They are definitely not lists, and indeed they are not very visible either:
#<HASH-TABLE :TEST EQL :COUNT 1 {100F4BA883}>
this doesn't show what's inside the hash-table. During development it takes extra steps to inspect the contents (inspect, describe, alexandria:hash-table-alist, defining a non-portable print-object method…).
serapeum:dict
I very much like serapeum:dict, coupled with (serapeum:toggle-pretty-print-hash-table) (see also the Cookbook).
CL-USER> (serapeum:dict :a 1 :b 2 :c 3)
;; => #<HASH-TABLE :TEST EQUAL :COUNT 3 {100F6012D3}>
CL-USER> (serapeum:toggle-pretty-print-hash-table)
;; print the above HT again:
CL-USER> **
(SERAPEUM:DICT
:A 1
:B 2
:C 3
)
Not only is it printed readably, it also lets you create a hash-table with initial elements in one step (unlike make-hash-table), and you can read it back in. It's even easy to save such a structure to a file.
Serapeum is a solid library.
All this makes hash-tables easier to use.
When to use a hash-table: You need to do fast (approximately "constant time") look-ups of data.
When to use an a-list: You have a need to dynamically shadow data you pass on to functions.
If neither of these obviously applies, you have to make a choice, then possibly benchmark it, and then evaluate whether rewriting it with the other choice would be better. In some experimentation that someone did well over a decade ago, the break-even point between an a-list and a hash-map in most Common Lisp implementations was somewhere in the region of 5 to 20 keys.
However, if you need to "shadow" bindings for functions you call, an a-list provides that "for free", and a hash-map does not. So if your code does that a lot, an a-list MAY be the better choice.
* (defun lookup (alist key) (assoc key alist))
LOOKUP
* (lookup '((key1 . value1) (key2 . value2)) 'key1)
(KEY1 . VALUE1)
* (lookup '((key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE2)
* (lookup '((key2 . value3) (key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE3)

Can't cons cells be implemented efficiently at the library level in Clojure?

Clojure has its own collections and has no need for the traditional Lispy cons cells. But I find the concept interesting, and it is used in some teaching materials (e.g., SICP). I have been wondering whether there is any reason this cons primitive needs to be a primitive. Can't we just implement it (and the traditional functions that operate on it) in a library? I searched, but found no such library already written.
Cons cells are an important building block in Lisp for s-expressions. See for example the various publications by McCarthy about Lisp and Symbolic Expressions from 1958 onwards (for example Recursive Functions of Symbolic Expressions). Every list in Lisp is made of cons cells.
It's definitely possible to implement linked lists (and trees, ...) with cons cells as a library. But they are so central to Lisp that it needs them early on, with a very efficient implementation.
In a Lisp system typically there are many cons cells and a high rate of allocating new cons cells (called consing). Thus the implementors of a Lisp may want to optimize their Lisp implementation for:
small size of cons cells -> not more than two machine words, one word for the car and one word for the cdr
fast allocation of new cons cells
efficient garbage collection of cons cells (find no-longer used cons cells very quickly)
storing primitive data (numbers, characters, ...) directly in cons cells -> no pointer overhead
optimize locality of Lisp data like cons cell structures (lists, assoc lists, trees, ...) for example by using a generational/copying garbage collector and/or memory regions for cons cells
Thus Lisp systems use all kinds of tricks to achieve this. For example, pointers may encode whether they point to a cons cell, so the cons cell itself does not need a type tag. Fixnums have very few tag bits and fit into the CAR or CDR of a cons cell. The MIT Lisp Machine could even omit the CDR part of a cons cell when it was part of a linear list.
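As an illustration of the tag-bit idea, a small integer can live directly in a machine word by reserving a couple of low bits for the type tag. This is a made-up 2-bit scheme, not any particular Lisp's actual layout:

```python
# Hypothetical tagging scheme: low bits 00 mark a fixnum stored in the word
# itself; another tag value (e.g. 01) would mark a pointer to a cons cell.
TAG_BITS = 2
TAG_MASK = (1 << TAG_BITS) - 1
TAG_FIXNUM = 0b00

def box_fixnum(n):
    """Store a small integer directly in the word, shifted past the tag."""
    return (n << TAG_BITS) | TAG_FIXNUM

def unbox_fixnum(word):
    assert word & TAG_MASK == TAG_FIXNUM, "not a fixnum"
    return word >> TAG_BITS

word = box_fixnum(42)
assert word & TAG_MASK == TAG_FIXNUM   # type check without dereferencing
assert unbox_fixnum(word) == 42
```

The payoff is that a fixnum needs no heap allocation at all, and a type check is a single mask-and-compare.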
To achieve all these optimization goals one usually needs a hand-tuned implementation of the Lisp runtime in assembler and/or C. A Lisp processor or a Lisp VM will usually provide CAR, CDR, CONS, CONSP, ... as machine instructions.
It's like TFB said: one can similarly implement floating-point numbers in a library, but it will not be efficient compared to native floating-point numbers and CPU-supported operations. Lisp implementations provide cons cells at a very, very low level.
But outside such a Lisp implementation, it's clearly possible to implement cons cells as a library, with somewhat worse space and time efficiency.
Side note
Maclisp had cons cells with more than two slots, called hunks.
You could implement it yourself. Here is an attempt:
(defprotocol cons-cell
  (car [this])
  (cdr [this])
  (rplaca [this v])
  (rplacd [this v]))

(deftype Cons [^:volatile-mutable car
               ^:volatile-mutable cdr]
  cons-cell
  (car [this] car)   ; mutable deftype fields are private, so return the field directly
  (cdr [this] cdr)
  (rplaca [this value] (set! car value))
  (rplacd [this value] (set! cdr value)))

(defn cons [car cdr]
  (Cons. car cdr))
Circular list:
(let [head (cons 0 nil)]
  (rplacd head head)
  head)
Of course, you can implement cons cells with no tools other than lambda (called fn in Clojure).
(defn cons' [a d]
  (fn [f] (f a d)))

(defn car' [c]
  (c (fn [a d] a)))

(defn cdr' [c]
  (c (fn [a d] d)))
user> (car' (cdr' (cons' 1 (cons' 2 nil))))
2
This is as space-efficient as you can get in Clojure (a lambda closing over two bindings is just an object with two fields). car' and cdr' could obviously be more time-efficient if you used a record or something instead; the point is that yes, of course you can make cons cells, even with next to no tools available.
Why isn't it done, though? We already have better tools available. Clojure's sequence abstraction makes a better list than cons cells do, and vectors are a perfectly fine tuple. There's just no great need for cons cells. Combine that with the fact that anyone who does want them will find it trivially easy to implement anew, and there are no customers for a prospective library solution.

Why does list ++ require scanning all elements of the list on its left?

The Haskell tutorial says to be cautious: when we use "Hello" ++ " World", the new list construction has to visit every single element (here, every character of "Hello"), so if the list on the left of ++ is long, then using ++ will bring down performance.
Am I not understanding this correctly? Did Haskell's developers never tune the performance of list operations? Why does this operation remain slow; is it to keep some kind of syntactic consistency with lambda functions or currying?
Any hints? Thanks.
In some languages, a "list" is a general-purpose sequence type intended to offer good performance for concatenation, splitting, etc. In Haskell, and most traditional functional languages, a list is a very specific data structure, namely a singly-linked list. If you want a general-purpose sequence type, you should use Data.Sequence from the containers package (which is already installed on your system and offers very good big-O asymptotics for a wide variety of operations), or perhaps some other one more heavily optimized for common usage patterns.
If you have an immutable list with a head and a reference to the tail, you cannot change its tail. If you want to add something to the 'end' of the list, you have to reach the end and then put all items one by one onto the head of the right list. It is the fundamental property of immutable lists: concatenation is expensive.
Haskell lists are singly-linked lists: they are either empty or they consist of a head and a (possibly empty) tail. Hence, when appending something to a list, you first have to walk the entire list to get to the end. So you end up traversing the entire list (the list to which you append, that is), which needs O(n) runtime.
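A minimal singly-linked list in Python (a stand-in for Haskell's list type; the names are illustrative) makes the O(n) behaviour visible: appending rebuilds one cell per element of the left list and shares the right list untouched.

```python
class Node:
    """A cons-style cell: a head value and a reference to the tail."""
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

def from_list(xs):
    node = None
    for x in reversed(xs):
        node = Node(x, node)
    return node

def append(xs, ys):
    """Like Haskell's ++: copies every cell of xs, shares ys as-is."""
    if xs is None:
        return ys
    return Node(xs.head, append(xs.tail, ys))  # one new cell per element of xs

def to_list(node):
    out = []
    while node is not None:
        out.append(node.head)
        node = node.tail
    return out

left, right = from_list([1, 2, 3]), from_list([4, 5])
assert to_list(append(left, right)) == [1, 2, 3, 4, 5]
```

Note that the cost is proportional to the length of the *left* argument only, which is exactly the caution in the tutorial quoted above.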

Why does Scheme need the special notion of procedure's location tag?

Why does Scheme need the special notion of procedure's location tag?
The standard says:
Each procedure created as the result of evaluating a lambda expression
is (conceptually) tagged with a storage location, in order to make
eqv? and eq? work on procedures
The eqv? procedure returns #t if:
obj1 and obj2 are procedures whose location tags are equal
Eq? and eqv? are guaranteed to have the same behavior on ... procedures ...
But at the same time:
Variables and objects such as pairs, vectors, and strings implicitly denote locations or sequences of locations
The eqv? procedure returns #t if:
obj1 and obj2 are pairs, vectors, or strings that denote the same locations in the store
Eq? and eqv? are guaranteed to have the same behavior on ... pairs ... and non-empty strings and vectors
Why not just apply "implicitly denote locations or sequences of locations" to procedures too?
I thought this concerned them as well
I don't see anything special about procedures in that matter
Pairs, vectors, and strings are mutable. Hence the identity (or location) of such objects matters.
Procedures are immutable, so they can be copied or coalesced arbitrarily with no apparent difference in behaviour. In practice, that means an optimising compiler can inline them, effectively making "multiple copies". R6RS, in particular, says that for an expression like
(let ((p (lambda (x) x)))
  (eqv? p p))
the result is not guaranteed to be true, since it could have been inlined as (eqv? (lambda (x) x) (lambda (x) x)).
R7RS's notion of location tags gives the assurance that that expression does indeed evaluate to true, even if an implementation does inlining.
Treating procedures as values works in languages like ML, where they are truly immutable. But in Scheme, procedures can effectively be mutated, because their local variables can be. In effect, procedures are a poor man's objects (though the case can also be made that OO-style objects are just a poor man's procedures!). The location tag serves the same purpose as the object identity that distinguishes two pairs with identical cars and cdrs.
In particular, giving global procedures identity makes it possible to ask directly whether a predicate we have been passed is specifically eq? or eqv? or equal?, which is not portably possible in R6RS (though possible in practice in R6RS implementations).
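The point that procedures carry mutable state through their closed-over variables, and therefore have observable identity, can be sketched with closures in Python (a stand-in for Scheme lambdas here):

```python
def make_counter():
    """Each call returns a distinct closure over its own mutable `count`."""
    count = 0
    def counter():
        nonlocal count
        count += 1
        return count
    return counter

a, b = make_counter(), make_counter()
assert a() == 1 and a() == 2   # `a` has its own state...
assert b() == 1                # ...so `b` is observably a different procedure
assert a is not b              # identity matters, as with Scheme's location tags
```

Two closures built from the same source text behave differently once their state diverges, which is why coalescing them (or duplicating one) would be observable.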

Difference between "open-ended lists" and "difference lists"

What is the difference between "open-ended lists" and "difference lists"?
As explained at http://homepages.inf.ed.ac.uk/pbrna/prologbook/node180.html, an open list is a tool used to implement a difference list.
An open list is any list with an unbound variable at some point, e.g. [a,b,c|X]. You can use an open list to implement a data structure called a difference list, which formally pairs two terms, one pointing to the first element and one to the open end, traditionally written [a,b,c|X]-X, to make operating on such lists easier.
For example, if all you have is an open list, adding an element to the end is possible, but you need to iterate over all items. With a difference list you can use the end-of-list variable (called a Hole on the page above) to skip the iteration and perform the operation in constant time.
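The constant-time append can be mimicked in Python with explicit mutation standing in for Prolog's instantiation of the hole variable (the class names are illustrative, not any standard API):

```python
class Cell:
    """A cons-style cell; `tail is None` marks the open end (the 'hole')."""
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

class DiffList:
    """Keeps pointers to both the first and the last cell, so appending
    at the end needs no traversal: O(1) per element."""
    def __init__(self):
        self.first = None
        self.last = None   # plays the role of the hole variable

    def push_back(self, value):
        cell = Cell(value)
        if self.last is None:
            self.first = cell
        else:
            self.last.tail = cell   # fill the hole, creating a new one
        self.last = cell

    def to_list(self):
        out, node = [], self.first
        while node is not None:
            out.append(node.head)
            node = node.tail
        return out

d = DiffList()
for x in "abc":
    d.push_back(x)
assert d.to_list() == ["a", "b", "c"]
```

In Prolog no mutation happens; unifying the hole variable with a new cell achieves the same effect declaratively, and a fresh variable becomes the new hole.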
Both notions look like lists, but in fact they are not: one is a concrete term, the other rather a convention.
Open-ended lists, partial lists
Open-ended lists are terms that are not lists but can be instantiated such that they become lists. In standard lingo, they are called partial lists. For example, X, [a|X], and [X|X] are all partial lists.
The notion of open-ended lists suggests a certain usage of such lists to simulate open-ended state. Think of a dictionary represented by an open-ended list: every time you add a new item, the variable "at the end of the partial list" is instantiated to a new element. While this programming technique is quite possible in Prolog, it has one big downside: the programs will depend heavily on a procedural interpretation, and in many situations there is no way to have a declarative interpretation at all.
Difference lists
Difference lists are effectively not lists as such, but a certain way lists are used, such that the intended list is represented by two variables: one for the start and one for the end of the list. For this reason it would help a lot to rather talk of list differences instead of difference lists.
Consider:
el(E, [E|L],L).
Here, the last two arguments can be seen as forming a difference: a list that contains the single element E, that is, the list [E]. You can now construct more complex lists out of simpler ones, provided you respect certain conventions, essentially that the second argument is only passed further on. The differences as such are never compared to each other!
el2(E, F, L0,L) :-
   el(E, L0,L1),
   el(F, L1,L).
Note that this is merely a convention. The lists are not enforced. Think of:
?- el2(E, F, L, nonlist).
L = [E,F|nonlist].
This technique is also used to implement DCGs (definite clause grammars).
For example:
Open-ended list: [a,b,c | _]
Difference list: [a,b,c|U]-U
