Python dictionary or map in elisp

What is the equivalent of a python dictionary like {'a':1, 'b':2} in elisp?
And again, does elisp have any map-reduce api?

Besides association lists (whose algorithmic complexity is fine for small tables but not for large ones), there are hash tables, which you can construct with make-hash-table and puthash; or, if you prefer writing them as literals, you can use the read syntax #s(hash-table data (a 1 b 2)).

Association lists are the most commonly used associative containers in elisp. An alist is just a list of key-value cons cells, like ((key . value)). You can use the assoc function to find the cell whose key matches (and take its cdr for the value), and rassoc to find a cell by its value.
Elisp comes with the built-in function mapcar, which covers map, but AFAIK there is no good fold facility built in. You could emulate one with any of the looping facilities provided. However, the better solution is to use cl-lib and slip into Common Lisp land. In particular, it supplies cl-mapcar and cl-reduce.

Related

When and why to use hash tables in CL instead of a-lists?

I believe common lisp is the only language I have worked with that has a variety of extremely useful data structures.
The a-list being the most important one to me. I use it all the time.
When and why do you (or should you) use hash tables?
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists. Which honestly, I find weird considering almost everything is a list.
Maybe I am missing something in my inexperience?
The hash table is very useful when you have to access a large set of values through a key, since the complexity of this operation with a hash table is O(1), while the complexity of the operation using an a-list is O(n), where n is the length of the list.
So, I use it when I need to access a set of values multiple times and that set has more than a few elements.
There are a lot of assumptions to address in your question:
I believe common lisp is the only language I have worked with that has a variety of extremely useful data structures.
I don't think this is particularly true; the standard libraries of popular languages (C++, Java, Rust, Python) are filled with useful data structures too.
When and why do you (or should you) use hash tables?
Data structures come with costs in terms of memory and processor usage: a list must be searched linearly to find an element, whereas a hash-table has a constant lookup cost; for small lists, however, the linear search might be faster than the constant-time lookup. There are other criteria, too: do I want to access the data concurrently? A list can be manipulated in a purely functional way, making data-sharing across threads easier than with a hash-table (though a hash-table can be associated with a mutex, etc.).
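The trade-off can be sketched in Python, with a list of pairs standing in for an a-list and a dict standing in for a hash-table (function and variable names are illustrative):

```python
def alist_lookup(alist, key):
    """Linear scan over (key, value) pairs: O(n), like CL's ASSOC."""
    for k, v in alist:
        if k == key:
            return v
    return None

pairs = [("a", 1), ("b", 2), ("c", 3)]
table = dict(pairs)  # hash table: amortized O(1) lookup

print(alist_lookup(pairs, "b"))  # -> 2
print(table["b"])                # -> 2
```

For three keys both are effectively instant; the difference only matters once the collection grows past a few dozen entries.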
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists. Which honestly, I find weird considering almost everything is a list.
The source code of Lisp programs is made mostly of Lists and symbols, even if there is no such restriction. But at runtime, CL has a lot of different types that are not at all related to lists: bignums, floating points, rational numbers, complex numbers, vectors, arrays, packages, symbols, strings, classes and structures, hash-tables, readtables, functions, etc. You can model a lot of data at runtime by putting them in lists, which is something that works well for a lot of cases, but they are by far not the only types available.
Just to emphasize a little bit, when you write:
(vector 0 1 2)
This might look like a list in your code, but at runtime the value really is a different kind of object, a vector. Do not be confused by how things are expressed in code and how they are represented during code execution.
If you don't use it already, I suggest installing and using the Alexandria library (see https://alexandria.common-lisp.dev/). It has useful functions to convert hash-tables to and from alists or plists.
More generally, I think it is important to architect your libraries and programs in a way that hides implementation details: you define a function make-person and accessors person-age, person-name, etc., as well as other user-facing functions. The actual implementation can use hash tables, lists, etc., but this is not a concern that should be exposed, because exposing it is a risk: you won't be able to easily change your mind later if you find out that the performance is bad, or if you want to add a cache, use a database, etc.
I find however that CL is good at making nice interfaces that do not come with too much accidental complexity.
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists.
They are definitely not lists, and indeed their contents are not visible by default:
#<HASH-TABLE :TEST EQL :COUNT 1 {100F4BA883}>
this doesn't show what's inside the hash-table. During development it takes extra steps to inspect the contents (inspect, describe, alexandria:hash-table-alist, defining a non-portable print-object method…).
serapeum:dict
I like very much serapeum:dict, coupled with (serapeum:toggle-pretty-print-hash-table) (also the Cookbook).
CL-USER> (serapeum:dict :a 1 :b 2 :c 3)
;; => #<HASH-TABLE :TEST EQUAL :COUNT 3 {100F6012D3}>
CL-USER> (serapeum:toggle-pretty-print-hash-table)
;; print the above HT again:
CL-USER> **
(SERAPEUM:DICT
:A 1
:B 2
:C 3
)
Not only is it printed readably, but it also lets you create the hash-table with initial elements in one step (unlike make-hash-table), and you can read it back in. It's even easy to save such a structure to a file.
Serapeum is a solid library.
Now, use hash-tables more easily.
When to use a hash-table: You need to do fast (approximately "constant time") look-ups of data.
When to use an a-list: You have a need to dynamically shadow data you pass on to functions.
If neither of these obviously apply, you have to make a choice. And then, possibly, benchmark your choice. And then evaluate if rewriting it using the other choice would be better. In some experimentation that someone else did, well over a decade ago, the trade-off between an a-list and a hash-map in most Common Lisp implementations is somewhere in the region of 5 to 20 keys.
However, if you have a need to "shadow" bindings, for functions you call, an a-list does provide that "for free", and a hash-map does not. So if that is something that your code does a lot of, an a-list MAY be the better choice.
* (defun lookup (alist key) (assoc key alist))
LOOKUP
* (lookup '((key1 . value1) (key2 . value2)) 'key1)
(KEY1 . VALUE1)
* (lookup '((key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE2)
* (lookup '((key2 . value3) (key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE3)

Hash Function For Sequence of Unique Ids (UUID)

I am storing message sequences in a database; each sequence can have up to N messages. I want to create a hash function that represents a message sequence and enables a faster check of whether the sequence already exists.
Each message has a case-sensitive alphanumeric universally unique id (UUID).
Consider the following messages (M1, M2, M3) with IDs:
M1 - a3RA0000000e0taBB
M2 - a3RA00033000e0taC
M3 - a3RA0787600e0taBB
Message sequences can be
Sequence-1 : (M1,M2,M3)
Sequence-2 : (M1,M3,M2)
Sequence-3 : (M2,M1,M3)
Sequence-4 : (M1,M2)
Sequence-5 : (M2,M3)
...etc...
The following is an example database structure for storing message sequences.
Given the message sequence, we need to check whether that message sequence exists in the database. For example, check if message sequence M1 -> M2 -> M3 i.e. with UIDs (a3RA0000000e0taBB -> a3RA00033000e0taC -> a3RA0787600e0taBB) exists in the database.
Instead of scanning the rows in the table, I want to create a hash function that represents the message sequence as a single hash value. Looking up that hash value in the table is supposedly faster.
My simple hash function is-
I am wondering what would be an optimal hash function for storing the message-sequence hash for faster existence checks.
You don't need a full-blown cryptographic hash, just a fast one, so have a look at FastHash: https://github.com/ZilongTan/Coding/tree/master/fast-hash. If you believe 32- or 64-bit hashes are not enough (i.e. they produce too many collisions), you could use the longer MurmurHash: https://en.wikipedia.org/wiki/MurmurHash (the author of FastHash actually recommends this approach).
There's a list of more algorithms on Wikipedia: https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions
In any case, hashes using bit operations (SHIFT, XOR ...) should be faster than the multiplication in your approach, even on modern machines.
How about using the MD5 algorithm to generate the hash of a concatenated string of message UIDs?
For instance, consider the messages:
M1 - a3RA0000000e0taBB
M2 - a3RA00033000e0taC
M3 - a3RA0787600e0taBB
For the message sequence M1->M2->M3, the string would be a3RA0000000e0taBB;a3RA00033000e0taC;a3RA0787600e0taBB, whose MD5 hash is 176B1CDE75EDFE1554888DAA863671C4.
According to this answer, MD5 is robust against accidental collisions (it is no longer considered secure against deliberately engineered ones). Since the given scenario has no security requirement, MD5 may suffice.
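A sketch of this approach in Python using the standard hashlib module (the function name is illustrative):

```python
import hashlib

def sequence_hash(uuids):
    """Hash a message sequence by joining its UUIDs with a separator.

    The ';' separator keeps e.g. ("ab", "c") distinct from ("a", "bc"),
    and the order of UUIDs changes the digest, so M1->M2->M3 and
    M1->M3->M2 hash differently.
    """
    joined = ";".join(uuids)
    return hashlib.md5(joined.encode("ascii")).hexdigest().upper()

m1 = "a3RA0000000e0taBB"
m2 = "a3RA00033000e0taC"
m3 = "a3RA0787600e0taBB"

h123 = sequence_hash([m1, m2, m3])
h132 = sequence_hash([m1, m3, m2])
print(h123 != h132)  # order-sensitive, so the two sequences get distinct keys
```

The resulting 32-character hex digest can be stored in an indexed column and used as the lookup key.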
Premature optimisation is the root of all evil. Start with the hashing function that is built into your language of choice, then hash the lists (M1, M2), etc. Profile it and see whether hashing is actually the bottleneck before reaching for third-party hash libraries.
My guess is that database lookup will be slower than the hash computation, so it won't matter which hash you use.
In Python you can just call
hash((m1, m2, m3))
(a tuple, not a list: lists are mutable and therefore not hashable).
In Java call the hashCode method on your ArrayList.
Any regular string-hash algorithm (say, the string hash from your language's standard library) applied to the concatenation of the messages' UUIDs would suffice, as long as you then select all messages matching that hash and check that they really are your messages, in the correct order. That may or may not be efficient depending on how many messages a sequence usually contains (also think about the worst case). There is no way to guarantee collision-free hashing in general, so you should think about what you are going to do in case of a collision.
Now, if you want to optimize this to make sure your hash is unique, that can be possible in some circumstances. You will learn about a collision once you try to insert the data, so you can do something about it (say, apply a salt or a dummy message to the sequence to modify the hash, and keep doing so until you get an empty spot), but it requires sufficiently large hashes and potentially other app-specific modifications.
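That collision-handling loop could be sketched like this in Python, with a dict standing in for the database table (all names are illustrative):

```python
import hashlib

def insert_with_salt(table, uuids):
    """Insert a sequence keyed by its hash; on a collision with a
    *different* sequence, append a salt and rehash until a free slot
    (or the already-stored copy of this sequence) is found."""
    salt = 0
    while True:
        payload = ";".join(uuids) + ("" if salt == 0 else "#%d" % salt)
        key = hashlib.md5(payload.encode()).hexdigest()
        existing = table.get(key)
        if existing is None:
            table[key] = list(uuids)   # free slot: store the sequence
            return key
        if existing == list(uuids):
            return key                  # same sequence already stored
        salt += 1  # true collision: perturb the input and retry

db = {}
k = insert_with_salt(db, ["a3RA0000000e0taBB", "a3RA00033000e0taC"])
print(k == insert_with_salt(db, ["a3RA0000000e0taBB", "a3RA00033000e0taC"]))
```

Note that the salt must also be stored (or be reconstructible) for later lookups to find the salted entry.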

Why does list ++ require scanning all elements of the list on its left?

The Haskell tutorial warns that when we use "Hello" ++ " World", constructing the new list has to visit every element of the list on the left (here, every character of "Hello"), so if the left operand of ++ is long, using ++ will hurt performance.
Perhaps I am not understanding this correctly: do Haskell's developers never tune the performance of list operations? Why does this operation remain slow? Is it to keep some kind of consistency with lambda functions or currying?
Any hints? Thanks.
In some languages, a "list" is a general-purpose sequence type intended to offer good performance for concatenation, splitting, etc. In Haskell, and most traditional functional languages, a list is a very specific data structure, namely a singly-linked list. If you want a general-purpose sequence type, you should use Data.Sequence from the containers package (which is already installed on your system and offers very good big-O asymptotics for a wide variety of operations), or perhaps some other one more heavily optimized for common usage patterns.
If you have an immutable list, which has a head and a reference to its tail, you cannot change that tail. If you want to add something to the 'end' of the list, you have to reach the end and then push the items of the left list one by one onto the head of your right list. It is a fundamental property of immutable lists: concatenation is expensive.
Haskell lists are singly-linked lists: they are either empty or they consist of a head and a (possibly empty) tail. Hence, when appending something to a list, you first have to walk the entire list to get to its end. So you end up traversing the entire list (the list to which you append, that is), which needs O(n) runtime.
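As a rough sketch (in Python, with a hypothetical Cons class standing in for Haskell's (:) cells), ++ must copy every cell of its left argument, while the right argument is shared unchanged:

```python
# Minimal cons-cell sketch mirroring singly-linked lists; illustrative only.
class Cons:
    def __init__(self, head, tail):
        self.head, self.tail = head, tail

def from_str(s):
    """Build a cons list from a string, consing from the right."""
    node = None
    for ch in reversed(s):
        node = Cons(ch, node)
    return node

def concat(xs, ys):
    """Like (++): copies every cell of xs, shares ys -> O(len(xs))."""
    if xs is None:
        return ys
    return Cons(xs.head, concat(xs.tail, ys))

def to_str(xs):
    out = []
    while xs is not None:
        out.append(xs.head)
        xs = xs.tail
    return "".join(out)

print(to_str(concat(from_str("Hello"), from_str(" World"))))  # Hello World
```

The cost is proportional to the left list only; the right list's cells are reused as-is, which is why repeatedly appending on the right of a growing list is quadratic.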

Constant-time list concatenation in OCaml

Is it possible to implement constant-time list concatenation in OCaml?
I imagine an approach where we deal directly with memory and concatenate lists by pointing the end of the first list to the beginning of the second list. Essentially, we're creating some type of linked-list like object.
With the normal list type, no, you can't. The algorithm you gave is exactly the one implemented ... but you still have to actually find the end of the first list...
There are various methods to implement constant-time concatenation (see Okasaki's Purely Functional Data Structures for the fancy details). I will just give you names of OCaml libraries that implement it: BatSeq, BatLazyList (both in Batteries), sequence, gen, Core.Sequence.
Pretty sure there is a diff-list implementation somewhere too.
Lists are already (singly) linked lists, but list nodes are immutable, so you cannot change any node's pointer to point to anything different. To concatenate two lists you must therefore copy all the nodes in the first list.
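The diff-list idea mentioned above can be sketched in Python (illustrative only, not any OCaml library's API): represent a list as a function that prepends its elements to a supplied tail. Concatenation is then mere function composition, an O(1) operation, and the full traversal is deferred until the list is forced:

```python
def dlist(items):
    """Wrap a concrete list as a difference list: a function from tail
    to full list. Building the real list is deferred until forced."""
    return lambda tail: list(items) + tail

def dappend(d1, d2):
    """O(1) concatenation: compose the two functions; no traversal here."""
    return lambda tail: d1(d2(tail))

def to_list(d):
    """Force the difference list, paying the construction cost once."""
    return d([])

xs = dlist([1, 2])
ys = dlist([3, 4])
print(to_list(dappend(xs, ys)))  # [1, 2, 3, 4]
```

The trick is that each append only builds a closure; all the copying happens in one pass when to_list finally supplies the empty tail.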

Speedy attribute lookup in dynamically typed language?

I'm currently developing a dynamically typed language.
One of the main problems I'm facing during development is how to do fast runtime symbol lookups.
For general, free global and local symbols I simply index them and let each scope (global or local) keep an array of the symbols and quickly look them up using the index. I'm very happy with this approach.
However, for attributes in objects the problem is much harder. I can't use the same indexing scheme on them, because I have no idea which object I'm currently accessing, thus I don't know which index to use!
Here's an example in python which reflects what I want working in my language:
from random import random

class A:
    def __init__(self):
        self.a = 10
        self.c = 30

class B:
    def __init__(self):
        self.c = 20

def test():
    if random() < 0.5:
        foo = A()
    else:
        foo = B()
    # There could even be an eval here that sets foo
    # to something different or removes attribute c from foo.
    print(foo.c)
Does anyone know any clever tricks to do the lookup quickly? I know about hash maps and splay trees, so I'm interested in whether there are ways to make it as efficient as my other lookups.
Once you've reached the point where looking up properties in the hash table isn't fast enough, the standard next step is inline caching. You can do this in JIT languages, or even bytecode compilers or interpreters, though it seems to be less common there.
If the shape of your objects can change over time (i.e. you can add new properties at runtime) you'll probably end up doing something similar to V8's hidden classes.
A technique known as maps can store the values for each attribute in a compact array. The knowledge which attribute name corresponds to which index is maintained in an auxiliary data structure (the eponymous map), so you don't immediately gain a performance benefit (though it does use memory more efficiently if many objects share a set of attributes). With a JIT compiler, you can make the map persistent and constant-fold lookups, so the final machine code can use constant offsets into the attributes array (for constant attribute names).
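A minimal sketch of the maps idea in Python (all names are illustrative; real VMs add map transitions, inline caches, and JIT constant-folding on top of this):

```python
class Map:
    """Shared per-shape table: attribute name -> slot index."""
    def __init__(self):
        self.slots = {}
    def add(self, name):
        self.slots[name] = len(self.slots)
    def index_of(self, name):
        return self.slots[name]

class Obj:
    """An object stores only its map reference and a compact value array."""
    def __init__(self, map_):
        self.map = map_
        self.values = [None] * len(map_.slots)
    def get(self, name):
        return self.values[self.map.index_of(name)]
    def set(self, name, value):
        self.values[self.map.index_of(name)] = value

point_map = Map()
point_map.add("x")
point_map.add("y")

p, q = Obj(point_map), Obj(point_map)  # both objects share one map
p.set("x", 1)
p.set("y", 2)
print(p.get("y"))  # 2
```

All objects with the same attribute set share a single Map, so per-object storage is just a flat array, and a JIT that knows an object's map can replace index_of with a constant offset.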
In an interpreter (I'll assume byte code), things are much harder because you don't have much opportunity to specialize code for specific objects. However, I have an idea myself for turning attribute names into integral keys. Maintain a global mapping assigning integral IDs to attribute names. When adding new byte code to the VM (loading from disk or compiling in memory), scan for strings used as attributes, and replace them with the associated ID, creating a new ID if the string hasn't been seen before. Instead of storing hash tables or similar mappings on each object - or in the map, if you use maps - you can now use sparse arrays, which are hopefully more compact and faster to operate on.
I haven't had a chance to implement and test this, and you still need a sparse array. Unless you want to make all objects (or maps) take as many words of memory as there are distinct attribute names in the whole program, that is. At least you can replace string hash tables with integer hash tables.
Just by tuning a hash table for IDs as keys, you can make several optimizations: don't invoke a hash function (use the ID as its own hash), remove some indirection and hence cache misses, save yourself the complexity of dealing with pathologically bad hash functions, etc.
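The interning idea could be sketched like this in Python (names are illustrative; a real VM would perform the interning once, while loading or compiling byte code, rather than on every access):

```python
_ids = {}  # global table: attribute name -> small integer ID

def intern_attr(name):
    """Assign each distinct attribute name a stable small integer ID."""
    if name not in _ids:
        _ids[name] = len(_ids)
    return _ids[name]

class Obj:
    """Per-object attribute storage keyed by integer ID instead of string."""
    def __init__(self):
        self.attrs = {}  # sparse: int ID -> value
    def set(self, name, value):
        self.attrs[intern_attr(name)] = value
    def get(self, name):
        return self.attrs[intern_attr(name)]

o = Obj()
o.set("color", "red")
print(o.get("color"))                                # red
print(intern_attr("color") == intern_attr("color"))  # IDs are stable
```

Here a Python dict plays the role of the integer-keyed sparse array; the point is that after interning, every attribute access compares and hashes small integers rather than strings.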
