Writing data structures requiring pointers/references in Clojure? - data-structures

I've been working on a toy database in Clojure and wanted to implement a B+ tree. When I started thinking about it, I realised there may not be a way to have something like a pointer/reference to other nodes in Clojure. It doesn't matter for something like a BST or a lot of other tree structures, since all you need is to store a node's child. But what do I do in something like a B+ tree, where I need to be able to refer to a node's sibling?
When looking for solutions, I came across a post on Google Groups about how you don't implement a doubly linked list in Clojure because there are other ways of doing things in Clojure.
What do I do for a B+ Tree though?

It's not that it's difficult to have references to objects in Clojure; but generally, these references are immutable. It's immutability that makes the doubly linked list impossible, because unlike a singly linked list, you can't change any part of it without creating a mutation somewhere.
To see this, suppose I have a singly linked list,
a -> b -> c
and suppose I want to change the head of it. I can do so without changing the entirety of the list: I create a new list with a new value at the head, and reuse the tail:
a'-> b -> c
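To make the sharing concrete, here is a rough, hypothetical sketch of a persistent singly linked list in C++ (the Node and cons names are made up; Clojure's built-in lists already behave this way):

    #include <cassert>
    #include <memory>
    #include <string>

    // Hypothetical sketch of a persistent singly linked list: nodes are
    // immutable once built, so a "modified" list can share its tail with
    // the original.
    struct Node {
        std::string value;
        std::shared_ptr<const Node> next;
    };

    std::shared_ptr<const Node> cons(std::string value,
                                     std::shared_ptr<const Node> next) {
        return std::make_shared<const Node>(Node{std::move(value), std::move(next)});
    }

    int main() {
        // a -> b -> c
        auto c = cons("c", nullptr);
        auto b = cons("b", c);
        auto a = cons("a", b);

        // "Change" the head: build a new head node that reuses the old tail.
        // a' -> b -> c   (b and c are shared, not copied)
        auto a_prime = cons("a'", b);

        assert(a->next == a_prime->next);  // both lists point at the same tail
    }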
But a doubly linked list can't be updated this way: every node is referenced by both of its neighbours, so replacing any one node would mean rebuilding (or mutating) all the others. So in Clojure, and other functional languages, we sometimes use a zipper in such situations.
Now, suppose you really need mutable references in Clojure -- how do you do that? Well, depending on what concurrency semantics you need, Clojure has vars, refs, atoms, etc.
Also, with deftype, you can create objects that have mutable fields, and these mutable fields can hold references to other things. You can also use raw Java arrays in Clojure for the same purpose.
Is your database going to be an in-memory database, or a disk-backed database? If on disk, I think that the issue of pointer swizzling is trickier than that of having mutable references.
Getting back to the issue of functional data structures, I believe that it is possible to create B-trees which have purely functional semantics. The first clue here is that it's a tree, and trees are the bread and butter of functional data structures. Secondly, note that there are databases which work in an append-only fashion -- CouchDB, for instance. This has the benefit that the database is its own log, in a sense. To get more of an idea of the costs and benefits of this approach, you might want to watch Slava Akhmechet's presentation. His company, RethinkDB, eventually took a sort of hybrid approach, IIRC.

You may wish to look at Chouser's finger trees in Clojure to see how the functionality of a doubly-linked list may be implemented using functional style.
Alternatively, you may simply want to step back and ask yourself why you believe a B+ tree is a good choice of data structure for a functional language.
If you are unfamiliar with the alternatives, you may want to look at Chris Okasaki's book "Purely Functional Data Structures."

Related

Do linked lists have a practical performance advantage in eager functional languages?

I'm aware that in lazy functional languages, linked lists take on generator-esque semantics, and that under an optimizing compiler, their overhead can be completely removed when they're not actually being used for storage.
But in eager functional languages, their use seems no less heavy, while optimizing them out seems more difficult. Is there a good performance reason that languages like Scheme use them over flat arrays as their primary sequence representation?
I'm aware of the obvious time complexity implications of using singly-linked lists; I'm more thinking about the in-practice performance consequences of using eager singly-linked lists as a primary sequence representation, compiler optimizations considered.
TL;DR: No performance advantage!
The first Lisp had cons (the linked-list cell) as its only data structure, and it used it for everything. Both Common Lisp and Scheme have vectors today, but vectors are not a good match for functional style. With a linked list, each recursive step can add zero or more elements in front of an accumulator, and at the end you have a list built with sharing between the iterations. The operation might recurse more than once, producing several versions that all share the same tail. I would say sharing is the most important aspect of the linked list. If you implement a minimax algorithm and store the state in a linked list, you can produce a new state without having to copy the unchanged parts.
Bjarne Stroustrup, the creator of C++, mentions in a talk that the penalty of having the data scattered in memory and double the size, as in a linked list, means a vector easily outperforms it even when you insert in order and need to move half the elements in the data structure. Keep in mind these were doubly linked lists with mutating inserts, but he points out that most of the time went to following the pointers linearly, taking CPU cache misses on the way, just to find the correct spot; since every insertion into a sorted list starts with that O(n) search, a vector would be better.
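As a rough, hypothetical sketch of that trade-off (not taken from the talk), the following C++ program keeps a sequence sorted by inserting each element at its position, once in a std::vector and once in a std::list; both spend O(n) on the linear search, and in practice the contiguous vector usually wins despite having to shift elements:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <random>
    #include <vector>

    // Insert N random ints while keeping the container sorted.
    // Both versions find the insertion point with a linear scan, so the
    // only difference is what "insert here" costs: shifting elements in
    // the contiguous vector vs. relinking nodes in the list.
    int main() {
        const int N = 20000;
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> dist(0, 1000000);
        std::vector<int> values(N);
        for (int& v : values) v = dist(rng);

        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        std::vector<int> sorted_vec;
        for (int v : values) {
            auto pos = std::find_if(sorted_vec.begin(), sorted_vec.end(),
                                    [v](int x) { return x >= v; });
            sorted_vec.insert(pos, v);  // shifts everything after pos
        }
        auto t1 = clock::now();

        std::list<int> sorted_list;
        for (int v : values) {
            auto pos = std::find_if(sorted_list.begin(), sorted_list.end(),
                                    [v](int x) { return x >= v; });
            sorted_list.insert(pos, v);  // O(1) relink once pos is known
        }
        auto t2 = clock::now();

        std::printf("vector: %lld ms, list: %lld ms\n",
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
    }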
If you have a program where you do many inserts into a list somewhere other than the front, then perhaps a tree is a better choice, and you can build one in CL and Scheme with cons. In fact, all of Chris Okasaki's purely functional data structures can be implemented with cons. Most "mutable" structures in Haskell are implemented similarly.
If you run into performance problems in Scheme and profiling shows you should replace a linked-list operation with an array-based one, nothing stands in the way of doing that. In the end, every algorithm choice has pros and cons. Hard computations are hard in any language.

What is a data structure? Simple, straightforward explanation required

I have to explain what a data structure is to someone, so what would be the easiest way to explain it? Would it be right if I say:
"A data structure is used to organize data (arrange data in some fashion) so that we can perform certain operations quickly with as little resource usage as possible"
A data structure describes how values are placed in locations together, with their addresses and indices stored as values too. Treated as very abstract "structures", this gives you linked lists, arrays, pointers, graphs, binary trees, and so on, together with the things you can do with them (the algorithms) and their capabilities: being sorted, requiring sortedness, fast access, and so on.
This is fundamental and not too complicated, and a good grasp of data structures and their correct usage can solve problems elegantly. For learning data structures, a language like Pascal is more beneficial than C.
In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently.
Source: Wikipedia (https://en.wikipedia.org/wiki/Data_structure)
I would say what you wrote is pretty close. :)

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is having a data structure with the following properties:
search: O(log n)
insert: O(log n)
delete: O(log n)
iterate: O(n)
After some research, what I have been thinking about is implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
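For reference, a minimal sketch (with made-up values) of how std::set, typically implemented as a red-black tree, already gives exactly those bounds:

    #include <cassert>
    #include <iostream>
    #include <set>

    // std::set is usually implemented as a self-balancing binary search
    // tree (commonly a red-black tree), which gives exactly the bounds
    // asked for: O(log n) search/insert/erase and O(n) ordered iteration.
    int main() {
        std::set<int> s;

        s.insert(42);                    // O(log n)
        s.insert(7);
        s.insert(19);

        assert(s.find(19) != s.end());   // O(log n) search
        s.erase(7);                      // O(log n) delete

        for (int x : s)                  // O(n) iteration, in sorted order
            std::cout << x << ' ';
        std::cout << '\n';               // prints: 19 42
    }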
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild, you can combine the two by using a tree instead of a linked list to handle hash collisions (although this is quite likely to actually make your implementation slower unless you make massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function) although this only has uses in special applications (I had the opportunity to use a perfect minimal hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope so you can also go wild on that front.

Iterable O(1) insert and random delete collection

I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking of a doubly linked list that requires the elements to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly know of another data structure that has the characteristics I want.
It'd be hard to beat a hash map.
Take a look at tries.
Apparently they can beat hash tables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.
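If you want to experiment, here is a minimal, hypothetical trie sketch over lowercase ASCII keys (the names are illustrative only); note how insert and lookup follow almost the same code path, which is the point the quote above makes:

    #include <array>
    #include <cassert>
    #include <memory>
    #include <string>

    // A minimal trie over lowercase ASCII keys. Both insert and lookup
    // just walk one child pointer per character, so their code paths are
    // nearly identical.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child;
        bool is_key = false;
    };

    void insert(TrieNode& root, const std::string& key) {
        TrieNode* node = &root;
        for (char ch : key) {
            auto& next = node->child[ch - 'a'];
            if (!next) next.reset(new TrieNode());
            node = next.get();
        }
        node->is_key = true;
    }

    bool contains(const TrieNode& root, const std::string& key) {
        const TrieNode* node = &root;
        for (char ch : key) {
            node = node->child[ch - 'a'].get();
            if (!node) return false;
        }
        return node->is_key;
    }

    int main() {
        TrieNode root;
        insert(root, "car");
        insert(root, "cart");
        assert(contains(root, "car"));
        assert(contains(root, "cart"));
        assert(!contains(root, "ca"));   // prefix of a key, but not a key itself
    }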
In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
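A minimal usage sketch (the element type and values are made up) showing average O(1) insert, lookup, and erase by value:

    #include <cassert>
    #include <string>
    #include <unordered_set>

    // Average-case O(1) insert, find and erase; iteration visits elements
    // in an unspecified order, which matches "order is not important".
    int main() {
        std::unordered_set<std::string> items;

        items.insert("alpha");
        items.insert("beta");
        items.insert("gamma");

        assert(items.find("beta") != items.end());  // O(1) average lookup
        items.erase("beta");                        // O(1) average removal by value

        std::size_t count = 0;
        for (const auto& s : items) { (void)s; ++count; }  // O(n) iteration
        assert(count == 2);
    }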
Many other languages have "hash sets" as well, certainly Java and C#.

What are some uses for linked lists?

Do linked lists have any practical uses at all? Many computer science books compare them to arrays and say the main advantage is that they are mutable. However, most languages provide mutable versions of arrays. So do linked lists have any actual uses in the real world, or are they just part of computer science theory?
They're absolutely precious (in both the popular doubly-linked version and the less-popular, but simpler and faster when applicable!, single-linked version). For example, inserting (or removing) a new item in a specified "random" spot in a "mutable version of an array" (e.g. a std::vector in C++) is O(N) where N is the number of items in the array, because all that follow (on average half of them) must be shifted over, and that's an O(N) operation; in a list, it's O(1), i.e., constant-time, if you already have e.g. the pointer to the "previous" item. Big-O differences like this are absolutely huge -- the difference between a real-world usable and scalable program, and a toy, "homework"-level one!-)
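To make that concrete, a small hypothetical sketch: once you hold an iterator to the right spot in a std::list, insertion and erasure touch only a couple of pointers, no matter how long the list is:

    #include <cassert>
    #include <iterator>
    #include <list>

    // Given an iterator to the right position, splicing a new element in
    // (or taking one out) relinks only a couple of pointers: O(1),
    // regardless of how long the list is.
    int main() {
        std::list<int> items = {1, 2, 4, 5};

        auto pos = std::next(items.begin(), 2);  // iterator at the value 4
        items.insert(pos, 3);                    // O(1) insert before 4
        items.erase(pos);                        // O(1) erase of 4 (pos stays valid)

        assert((items == std::list<int>{1, 2, 3, 5}));
    }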
Linked lists have many uses. For example, implementing data structures that appear to the end user to be mutable arrays.
If you are using a programming language that provides implementations of various collections, many of those collections will be implemented using linked lists. When programming in those languages, you won't often be implementing a linked list yourself but it might be wise to understand them so you can understand what tradeoffs the libraries you use are making. In other words, the set "just part of computer science theory" contains elements that you just need to know if you are going to write programs that just work.
The main applications of linked lists are:
- Representing polynomials, i.e. addition/subtraction/multiplication of two polynomials.
  E.g. p1 = 2x^2 + 3x + 7 and p2 = 3x^3 + 5x + 2 give p1 + p2 = 3x^3 + 2x^2 + 8x + 9 (a sketch follows below).
- Dynamic memory management: allocating and releasing memory at runtime.
- Symbol tables.
- Balancing parentheses.
- Representing sparse matrices.
Ref: http://www.cs.ucf.edu/courses/cop3502h.02/linklist3.pdf
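As a rough sketch of the polynomial example above (the Term name and representation are made up), a polynomial can be stored as a linked list of (coefficient, exponent) terms kept in decreasing exponent order, and addition becomes a single merge pass over both lists:

    #include <cstdio>
    #include <list>

    // A polynomial as a linked list of (coefficient, exponent) terms,
    // kept in decreasing order of exponent. Adding two polynomials is a
    // single merge pass over both lists.
    struct Term { int coef; int exp; };

    std::list<Term> add(const std::list<Term>& p1, const std::list<Term>& p2) {
        std::list<Term> sum;
        auto a = p1.begin(), b = p2.begin();
        while (a != p1.end() && b != p2.end()) {
            if (a->exp > b->exp)      sum.push_back(*a++);
            else if (b->exp > a->exp) sum.push_back(*b++);
            else {                    // same exponent: add the coefficients
                int c = a->coef + b->coef;
                if (c != 0) sum.push_back({c, a->exp});
                ++a; ++b;
            }
        }
        sum.insert(sum.end(), a, p1.end());
        sum.insert(sum.end(), b, p2.end());
        return sum;
    }

    int main() {
        std::list<Term> p1 = {{2, 2}, {3, 1}, {7, 0}};   // 2x^2 + 3x + 7
        std::list<Term> p2 = {{3, 3}, {5, 1}, {2, 0}};   // 3x^3 + 5x + 2
        for (const Term& t : add(p1, p2))                // 3x^3 + 2x^2 + 8x + 9
            std::printf("%+dx^%d ", t.coef, t.exp);
        std::printf("\n");
    }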
"So do linked lists have any actual uses in the real world?"
A use/example of a (doubly) linked list is a lift in a building:
- A person has to go through all the floors to reach the top (the tail, in linked-list terms).
- A person can never jump to some random floor directly (they have to pass through the intermediate floors/nodes).
- A person can never go beyond the top floor (the node after the tail is null).
- A person can never go below the ground floor (the node before the head is null).
Yes, of course it's useful, for many reasons.
Any time, for example, that you want efficient insertion into and deletion from a list: finding the place to insert costs an O(N) search, but the insertion itself, once you have the correct position, is O(1).
Also, the concepts you learn from working with linked lists help you learn how to build tree-based data structures and many others.
A primary advantage of a linked list over a vector is that random insertion is as simple as decoupling a pair of pointers and recoupling them to the new object (this is, of course, slightly more work for a doubly linked list). A vector, on the other hand, generally reorganizes its memory on insertion, which makes it significantly slower. A list is not as efficient, however, at things like appending at the end of the container, because of the need to walk all the way through the list.
An Immutable Linked List is the most trivial example of a Persistent Data Structure, which is why it is the standard (and sometimes even only) data structure in many functional languages. Lisp, Scheme, ML, Haskell, Scala, you name it.
Linked lists are very useful in dynamic memory allocation, and they are used in operating systems. Insertion and deletion in linked lists are very cheap. Complex data structures like trees and graphs are implemented using linked lists.
Arrays that grow as needed are always just an illusion, because of the way computer memory works. Under the hood, an array is a contiguous block of memory that has to be reallocated when enough new elements have been added. Likewise, if you remove elements from the array, you'll have to allocate a new block of memory, copy the array, and release the previous block to reclaim the unused memory. A linked list lets you grow and shrink a list of elements without having to reallocate the rest of the list.
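A tiny, hypothetical demonstration of that reallocation: watching std::vector's data pointer change as elements are appended makes the block moves visible:

    #include <cstdio>
    #include <vector>

    // Appending to a growable array occasionally forces a reallocation:
    // a new, larger block is allocated, existing elements are copied over,
    // and the old block is freed. The changing data() pointer shows when
    // the whole block has moved.
    int main() {
        std::vector<int> v;
        const int* last = v.data();
        for (int i = 0; i < 100; ++i) {
            v.push_back(i);
            if (v.data() != last) {
                std::printf("reallocated at size %zu (capacity %zu)\n",
                            v.size(), v.capacity());
                last = v.data();
            }
        }
    }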
Linked lists are useful because elements can be efficiently spliced in and removed from the middle, as others noted. However, a downside of linked lists is poor locality of reference. I prefer not to use lists for this reason unless I have an explicit need for their capabilities.

Resources