persistent vs immutable data structure - data-structures

Is there any difference in a persistent and immutable data structure? Wikipedia refers to immutable data structure when discussing persistence but I have a feeling there might be a subtle difference between the two.

Immutability is an implementation technique. Among other things, it provides persistence, which is an interface. The persistence API is something like:
version update(operation o, version v) performs operation o on version v, returning a new version. If the data structure is immutable, the new version is a new structure (that may share immutable parts of the old structure). If the data structure isn't immutable, the returned version might just be a version number. The version v remains a valid version, and it shouldn't change in any observe-able way because of this update - the update is only visible in the returned version, not in v.
data observe(query q, version v) observes a data structure at version v without changing it or creating a new version.
For more about these differences, see:
The chapter "Persistent data structures" by Haim Kaplan in Handbook on Data Structures and Applications
"Making Data Structures Persistent" by Driscoll et al.
The first lecture of the Spring 2012 incarnation of MIT's 6.851: Advanced Data Structures

Yes, there is a difference. An immutable data structure cannot be modified in any way after its creation. The only way to effective modify it would be to make a mutable copy or something similar (e.g. slightly modifying the parameters you pass to the constructor of the new one). A persistent data structure, on the other hand, is mutable in the sense that the exposed API appears to allow changes to the data structure. In truth, though, any changes will retain a pointer to the existing data structure (and therefore every preceding structure); they only seem to mutate the data structure because the exposed API returns a new pointer that may include pointers to a subset of the previous data structure (in trees, e.g., we will point at the node whose subtree hasn't changed as a result of the operation).

Related

Basic differences between HashTable and HashMap?

I am researching about hash tables and hash maps, everything I have read or watched gives a very vague description of the differences. From messing around on Netbeans with them both, they seem to have the same functions and do the same things, what are the fundamental differences between these two data structures?
There are no differences, but you can find that the same thing called differently in different programming languages, so how people call something depends on their background and programming language they use. For example: in c++ it will be HashMap and in java it will be HashTable.
Also, there could be one difference concluded based on the naming: HashTable allows only store hashed keys, but not values whereas HashMap allows to retrieve a value by hashed key. Internally the both will use the same algorithm and can be considered as same data structure.
HashTable sounds to me like a concrete data structure, although it has numerous variants depending on what happens when a collision occurs, when the table fills up, when it empties.
Map sounds like a abstract data structure, something defined by the available operations (Dictionary would be a potential other name for the same data structure, but I'd not be surprised if some nomenclature defined both with a nuance somewhere).
HashMap sounds like an implementation of the Map abstract data structure using an HashTable concrete data structure.
Again, I'd not be surprised if a language or a library provided both, with a nuance somewhere (HashMap for instance could provide only the operations defined for a Map, but HashTable provides everything which make sense for an HashTable).

How to achieve Time Travel with clojure

Is there a way to achieve time traveling in Clojure, for example if I have a vector (which is internally a tree implemented as a persistent data structure) is there a way to achieve time traveling and get preivous versions of that vector? Kind of what Datomic does at the database level, since Clojure and Datomic share many concepts including the facts being immutable implemented as persistent data strcutures, technically the older version of the vector is still there. So I was wondering if time traveling and getting previous versions is possible in plain Clojure similarly to what it is done in Datomic at the database level
Yes, but you need to keep a reference to it in order to access it, and in order to prevent it from being garbage collected. Clojurists often implement undo/redo in this way; all you need to do is maintain a list of historical states of your data, and then you can trivially step backward.
David Nolen has described this approach here, and you can find a more detailed example and explanation here.
Datomic is plain Clojure. You can use Datomic as a Clojure library either with an in-memory database (for version tracking) or with no database at all.

Efficient view updating with functional data model

In functional programming, data models are immutable, and updating a data model is done by applying a function on the data model, and getting a new version of the data model in return. I'm wondering how people write efficient viewers/editors for such data models, though (more specifically in Clojure)
A simplified example: suppose that you want to implement a viewer for a huge tree. In the non-functional world, you could have a controller for the Tree, with a function updateNode(Node, Value), which could then notify all observers to tell them that a specific node in the tree has been updated. On the viewer side, you would put all the nodes in a TreeView widget, keep a mapping of Node->WidgetNode, and when you are notified that a Node has changed, you can update just the one corresponding NodeWidget in the tree that needs updating.
The solution described in another Clojure MVC question talks about keeping the model in a ref, and adding a watcher. While this would indeed allow you to get notified of a change in the model, you still wouldn't know which node was updated, and would have to traverse the whole tree, correct?
The best thing I can come up with from the top of my head requires you to in the worst case update all the nodes on the path from root to the changed node (as all these nodes will be different)
What is the standard solution for updating views on immutable data models?
I'm not sure how this is a problem that's unique to functional programming. If you kept all of your state in a singly rooted mutable object graph with a notify when it changed, the same issue would exist.
To get around this, you could simply store the current state of model, and some information about what changed for the last edit. You could even keep a history of these things to allow for easy undo/redo because Clojure's persistent data structures make that extremely efficient with their shared underlying state.
That's just one thought on how to attack it. I'm sure there are many more.
I also think it's worth asking, "How efficient does it need to be?" The answer is, "just efficient enough for the circumstances." It might be the the plain map of data will work because you don't really have all that much data to deal with in a given application.

External store for complex collections that can be accessed by Key-Value

Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be
either a List, SortedSet or an Array
and E can be either a String or byte-array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document where you need to be able to quickly access the qth word of the pth sentence given documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.
In terms of platform choice you have two options - A full document database will support arbitrarily complex objects, but won't have built in commands for working with specific data structures. Something like Redis which does have optimised code for specific data structures can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure. DS1<DS2<E>> becomes DS1<int> and DS2<E>, with the int from DS1 and a prefix giving you the key holding DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
I hesitate to "recommend" it, but one of the only storage engines I know of which handles multi-dimensional data of this sort efficiently is Intersystems Cache. I had to use it at my last job, mostly coding against it using it's built in MUMPS-based language. I would not recommend the native approach, unless you hate yourself or your developers. However, they do have decent Java adapters, which appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. However, this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages that aren't M or C.
Any Key-Value store supports complex values, you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)

Writing datastructures requiring pointers/references in Clojure?

I've been working on toy a Database in Clojure and wanted to implement a B+ Tree. When I started thinking about it, I realised there may not be a way to have something like a pointer/reference to other nodes in Clojure. It doesn't matter for something like a BST or a lot of other Tree structures since all you need is to store a Node's child. But what do I do in something like a B+ tree where I need to be able to refer to a Node's sibling?
When looking for solutions, I came across a post in Google Groups about how you don't implement a Doubly linked list in Clojure because there are other ways of doing things in Clojure.
What do I do for a B+ Tree though?
It's not that it's difficult to have references to objects in clojure; but generally, these references are immutable. It's immutability which makes the doubly linked list impossible, because unlike a singly-linked list, you can't change any part of it without creating a mutation somewhere.
To see this, suppose I have a singly linked list,
a -> b -> c
and suppose I want to change the head of it. I can do so, with changing the entirety of the list. I create a new list by creating a new value for the head value, and reuse the tail:
a'-> b -> c
But doubly linked lists are impossible. So in clojure, and other functional languages, we sometimes use a zipper in such situations.
Now, suppose you really need mutable references in Clojure -- how do it? Well, depending on what concurrency semantics you need, clojure has vars, refs, atoms, etc.
Also, with deftype, you can create objects that have mutable fields, and these mutable fields can hold references to other things. You can also use raw java arrays in clojure for this same purpose.
Is your database going to be an in-memory database, or a disk-backed database? If on disk, I think that the issue of pointer swizzling is trickier than that of having mutable references.
Getting back to the issue of functional data structures, I believe that it is possible to create B-trees which have purely functional semantics. The first clue here is that it's a tree, and trees are the bread butter and meat of functional data structures. Secondly, note that there are databases which work in an append-only fashion -- couchDB for instance. This has the benefit that the database is its own log, in a sense. To get more of an idea of the costs and benefits of this approach you might want to watch Slava Akhmechet's presentation. His company, RethinkDB, eventually took a sort of hybrid approach, IIRC.
You may wish to look at Chouser's finger trees in Clojure to see how the functionality of a doubly-linked list may be implemented using functional style.
Alternatively, you may simply want to step back and ask yourself why you believe that B+ is a good choice of data structure for a functional language.
If you are unfamiliar with the alternatives, you may want to look at Chris Okazaki's book "Purely Functional Data Structures."

Resources