Persistent data structures in Scala

Are all immutable data structures in Scala persistent? If not, which of them are and which not? What are the behavioural characteristics of those which are persistent? Also, how do they compare to the persistent data structures in Clojure?

Scala's immutable data structures are all persistent, in the sense that an 'update' operation preserves the old value. In fact, I do not know of a difference between immutable and persistent; for me the two terms are aliases.
Two of Scala 2.8's immutable data structures are vectors and hash tries, represented as 32-ary trees. These were originally designed by Phil Bagwell, who was working with my team at EPFL, then adopted for Clojure, and now finally adopted for Scala 2.8. The Scala implementation shares a common root with the Clojure implementation, but is certainly not a port of it.
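To make "the old value is preserved" concrete, here is a minimal sketch of a persistent singly linked list. It is written in Java rather than Scala, and PList is a made-up name for illustration; it is not how Scala or Clojure implement their collections. The 32-ary trees mentioned above rely on the same idea of sharing unchanged nodes, only with a wider tree in place of a chain.

```java
// A minimal persistent (immutable) singly linked list.
// PList is a hypothetical name used only for this illustration.
final class PList<T> {
    final T head;          // null only for the empty list
    final PList<T> tail;   // null only for the empty list
    final int size;

    private static final PList<Object> EMPTY = new PList<>(null, null, 0);

    private PList(T head, PList<T> tail, int size) {
        this.head = head;
        this.tail = tail;
        this.size = size;
    }

    @SuppressWarnings("unchecked")
    static <T> PList<T> empty() {
        return (PList<T>) EMPTY;
    }

    // "Update" by prepending: returns a new list and leaves this one untouched.
    // The new list shares every existing node with the old one.
    PList<T> prepend(T value) {
        return new PList<>(value, this, size + 1);
    }

    public static void main(String[] args) {
        PList<String> v1 = PList.<String>empty().prepend("b").prepend("a");
        PList<String> v2 = v1.prepend("z");
        System.out.println(v1.size + " " + v1.head); // prints "2 a": v1 is unchanged
        System.out.println(v2.size + " " + v2.head); // prints "3 z"
        System.out.println(v2.tail == v1);           // prints "true": structure is shared
    }
}
```

Both versions stay usable after the update, and the update costs one new node rather than a full copy.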

Please have a look at these excellent articles by Daniel Spiewak:
http://www.codecommit.com/blog/scala/implementing-persistent-vectors-in-scala
http://www.codecommit.com/blog/scala/more-persistent-vectors-performance-analysis
He's also referring to the Clojure implementation.

List, Vector, HashMap and HashSet are all persistent in Scala 2.8. There are other persistent data structures, but since these cover all the major uses, I'm not sure there's any point in enumerating all of them.

For the last part of your question, I remember Rich Hickey mentioning in a presentation that the Clojure data structures have been ported to Scala. Also, Michael Fogus mentions plans for Scala 2.8 to adopt some of Clojure's data structures in this interview.
Sorry this is so short on details... I'm not sure what the status is on the above-mentioned Scala 2.8 plans, but I remembered Rich and Michael mentioning this and thought it might be an interesting thing for you to google if you're interested.

Related

When would it be a good idea to implement data structures rather than using built-in ones?

What is the purpose of creating your own linked list, or other data structure like maps, queues or hash function, for some programming language, instead of using built-in ones, or why should I create it myself? Thank you.
Good question! There are several reasons why you might want to do this.
For starters, not all programming languages ship with all the nice data structures that you might want to use. For example, C doesn't have built-in libraries for any data structures (though it does have bsearch and qsort for arrays), so if you want to use a linked list, hash table, etc. in C you need to either build it yourself or use a custom third-party library.
Other languages (say, JavaScript) have built-in support for some but not all types of data structures. There's no native JavaScript support for linked lists or binary search trees, for example. And I'm not aware of any mainstream programming language that has a built-in library for tries, though please let me know if that's not the case!
The above examples indicate places where a lack of support, period, for some data structure would require you to write your own. But there are other reasons why you might want to implement your own custom data structures.
A big one is efficiency. Put yourself in the position of someone who has to implement a dynamic array, hash table, and binary search tree for a particular programming language. You can't possibly know what workflows people are going to subject your data structures to. Are they going to do a ton of inserts and deletes, or are they mostly going to be querying things? For example, if you're writing a binary search tree type where insertions and deletions are common, you probably would want to look at something like a red/black tree, but if insertions and deletions are rare then an AVL tree would work a lot better. But you can't know this up front, because you have to write one implementation that stands the test of time and works pretty well for all applications. That might counsel you to pick a "reasonable" choice that works well in many applications, but isn't aggressively performance-tuned for your specific application. Coding up a custom data structure, therefore, might let you take advantage of the particular structure of the problem you're solving.
In some cases, the language specification makes it impossible or difficult to use fast implementations of data structures as the language standard. For example, C++ requires its associative containers to allow for deletions and insertions of elements without breaking any iterators into them. This makes it significantly more challenging or inefficient to implement those containers with types like B-trees that might actually perform a bit better than regular binary search trees due to the effects of caches. Similarly, the implementation of the unordered containers has an interface that assumes chained hashing, which isn't necessarily how you'd want to implement a hash table. That's why, for example, there are Google's alternatives to the standard containers, which are optimized to use custom data structures that don't easily fit into the language framework.
Another reason why libraries might not provide the fastest containers would be challenges in providing a simple interface. For example, cuckoo hashing is a somewhat recent hashing scheme that has excellent performance in practice and guarantees worst-case efficient lookups. But to make cuckoo hashing work, you need the ability to select multiple hash functions for a given data type. Most programming languages have a concept that each data type has "a" hash function (std::hash<T>, Object.hashCode, __hash__, etc.), which isn't compatible with this idea. The languages could in principle require users to write families of hash functions with the idea that there would be many different hashes to pick from per object, but that complicates the logistics of writing your own custom types. Leaving it up to the programmer to write families of hash functions for types that need it then lets the language stay simple.
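As a rough illustration of why cuckoo hashing wants more than one hash function per type, here is a minimal sketch of a cuckoo hash set for int keys in Java. The class name CuckooIntSet, the seeded mixing function, the MAX_KICKS limit and the doubling policy are all assumptions made for this example, not any library's API or a tuned implementation.

```java
import java.util.Random;

// Hypothetical sketch: a cuckoo hash set of ints using two seeded hash functions.
final class CuckooIntSet {
    private static final int MAX_KICKS = 32;   // arbitrary eviction limit
    private final Random rng = new Random(42);
    private Integer[] t0, t1;                  // one table per hash function
    private int seed0, seed1;
    private int size;

    CuckooIntSet() { init(8); }

    // Allocate fresh tables and pick two new members of the hash family.
    private void init(int capacity) {
        t0 = new Integer[capacity];
        t1 = new Integer[capacity];
        seed0 = rng.nextInt();
        seed1 = rng.nextInt();
    }

    // The "family" of hash functions: each seed selects a different one.
    private int hash(int key, int seed) {
        int h = (key ^ seed) * 0x9E3779B9;
        h ^= h >>> 16;
        return Math.floorMod(h, t0.length);
    }

    // A lookup probes at most two slots, whatever the table contents.
    boolean contains(int key) {
        Integer a = t0[hash(key, seed0)];
        Integer b = t1[hash(key, seed1)];
        return (a != null && a == key) || (b != null && b == key);
    }

    boolean add(int key) {
        if (contains(key)) return false;
        insert(key);
        size++;
        return true;
    }

    int size() { return size; }

    private void insert(int key) {
        int cur = key;
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            int i0 = hash(cur, seed0);
            if (t0[i0] == null) { t0[i0] = cur; return; }
            int evicted = t0[i0];              // kick out the occupant of table 0
            t0[i0] = cur;
            cur = evicted;
            int i1 = hash(cur, seed1);
            if (t1[i1] == null) { t1[i1] = cur; return; }
            evicted = t1[i1];                  // kick out the occupant of table 1
            t1[i1] = cur;
            cur = evicted;
        }
        rehash(cur);                           // too many evictions: start over
    }

    // Double the tables, choose new hash functions, and reinsert every key.
    private void rehash(int pending) {
        Integer[] old0 = t0, old1 = t1;
        init(old0.length * 2);
        for (Integer k : old0) if (k != null) insert(k);
        for (Integer k : old1) if (k != null) insert(k);
        insert(pending);
    }
}
```

The rehash step is where a single fixed hash function per type falls short: when an insertion keeps cycling, the table needs to pick different functions, which is easy for a seeded family and awkward with a lone hashCode-style method.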
And finally, there's just plain innovation in the space. New data structures get invented all the time, and languages are often slow to grow and change. There's been a bunch of research into new faster binary search trees recently (check out WAVL trees as an example) or new hashing strategies (cuckoo hashing and the "Swiss Table" that Google developed), and language designers and implementers aren't always able to keep pace with them.
So, overall, the answer is a mix of "because you can't assume your favorite data structure will be there" and "because you might be able to get better performance rolling your own implementations."
There's one last reason I can think of, and that's "to learn how the language and the data structure work." Sometimes it's worthwhile building out custom data types just to sharpen your skills, and you'll often find some really clever techniques in data structures when you do!
All of this being said, I wouldn't recommend defaulting to coding your own version of a data structure every time you need one. Library versions are usually a pretty safe bet unless you're looking for extra performance or you're missing some features that you need. But hopefully this gives you a better sense as to why you may want to consider setting aside the default, well-tested tools and building out your own.
Hope this helps!

How To Use Classic Custom Data Structures As Java 8 Streams

I saw a SO question yesterday about implementing a classic linked list in Java. It was clearly an assignment from an undergraduate data structures class. It's easy to find questions and implementations for lists, trees, etc. in all languages.
I've been learning about Java lambdas and trying to use them at every opportunity to get the idiom under my fingers. This question made me wonder: How would I write a custom list or tree so I could use it in all the Java 8 lambda machinery?
All the examples I see use the built in collections. Those work for me. I'm more curious about how a professor teaching data structures ought to rethink their techniques to reflect lambdas and functional programming.
I started with an Iterator, but it doesn't appear to be fully featured.
Does anyone have any advice?
Exposing a stream view of arbitrary data structures is pretty easy. The key interface you have to implement is Spliterator, which, as the name suggests, combines two things -- sequential element access (iteration) and decomposition (splitting).
Once you have a Spliterator, you can turn that into a stream easily with StreamSupport.stream(). In fact, here's the default stream() method from Collection (which most collections just inherit):
    default Stream<E> stream() {
        return StreamSupport.stream(spliterator(), false);
    }
All the real work is in the spliterator() method -- and there's a broad range of spliterator quality. The absolute minimum you need to implement is tryAdvance, but if that's all you implement, it will work sequentially but will lose out on most of the stream optimizations. Look in the JDK sources (Arrays.stream(), IntStream.range()) for examples of how to do better.
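As a minimal sketch of the "absolute minimum" route, the following hand-rolled linked list exposes a stream by writing only tryAdvance. SimpleList is a hypothetical class invented for this example; it extends Spliterators.AbstractSpliterator, which supplies a generic trySplit, so the result is correct but not tuned for parallel streams.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// SimpleList is a hypothetical classroom-style linked list, not a JDK class.
final class SimpleList<E> {
    private static final class Node<E> {
        final E value;
        Node<E> next;
        Node(E value) { this.value = value; }
    }

    private Node<E> head, tail;
    private int size;

    void add(E value) {
        Node<E> n = new Node<>(value);
        if (head == null) head = n; else tail.next = n;
        tail = n;
        size++;
    }

    // Only tryAdvance is written by hand; splitting uses the inherited default.
    Spliterator<E> spliterator() {
        return new Spliterators.AbstractSpliterator<E>(
                size, Spliterator.ORDERED | Spliterator.SIZED) {
            private Node<E> cur = head;

            @Override
            public boolean tryAdvance(Consumer<? super E> action) {
                if (cur == null) return false;   // no elements left
                action.accept(cur.value);
                cur = cur.next;
                return true;
            }
        };
    }

    // Same shape as the JDK's default stream() method: a sequential stream.
    Stream<E> stream() {
        return StreamSupport.stream(spliterator(), false);
    }

    public static void main(String[] args) {
        SimpleList<String> list = new SimpleList<>();
        list.add("list");
        list.add("vector");
        list.add("trie");
        System.out.println(list.stream()
                .filter(s -> s.length() == 4)
                .map(String::toUpperCase)
                .collect(Collectors.toList()));  // prints [LIST, TRIE]
    }
}
```

A better spliterator for this list would override trySplit to hand off batches of nodes, which is the kind of refinement the JDK sources mentioned above show.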
I'd look at http://www.javaslang.io for inspiration, a library that does exactly what you want to do: Implement custom lists, trees, etc. in a Java 8 manner.
It specifically doesn't closely couple with the JDK collections outside of importing/exporting methods, but re-implements all the immutable collection semantics that a Scala (or other FP language) developer would expect.

Purely functional soft heap

Are there any implementations of a purely functional soft heap data structure in any language?
A quick search of the ACM digital library indicates that Chazelle's soft heap structure, despite being very interesting, has received relatively little study, and that persistent/functional soft heaps are thus an open research topic.
So I would say no, there are no known approaches for persistent soft heaps. Describing one would be a publishable result (it may boil down to adding copying where you would mutate the original structure, and identifying sharing opportunities).
The Haim Kaplan, Robert E. Tarjan, Uri Zwick paper describes but doesn't fully analyze a purely functional variant. It can be found at:
http://phdtree.org/pdf/44150182-soft-heaps-simplified/
This project has Java code that might not be too terrible to translate to Scala... and then make it more functional.
https://github.com/lowasser/SoftSelect
But as noted previously, the Purely Functional Data Structures book has Haskell code that may be easier to adapt to soft heaps, especially given the example Java code.
https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf

Reimplementing data structures in the real world

The topic of algorithms class today was reimplementing data structures, specifically ArrayList in Java. The fact that you can customize a structure in various ways definitely got me interested, particularly with variations of the add() and iterator.remove() methods.
But is reimplementing and customizing a data structure something that is of more interest to the academics vs the real-world programmers? Has anyone reimplemented their own version of a data structure in a commercial application/program, and why did you pick that route over your particular language's implementation?
Knowing how data structures are implemented and can be implemented is definitely of interest to everyone, not just academics. While you will most likely not reimplement a data structure if the language already provides an implementation with suitable functions and performance characteristics, it is very possible that you will have to create your own data structure by composing other data structures... or you may need to implement a data structure with slightly different behavior than a well-known data structure. In that case, you certainly will need to know how the original data structure is implemented. Alternatively, you may end up needing a data structure that does not exist or which provides similar behavior to an existing data structure, but the way in which it is used requires that it be optimized for a different set of functions. Again, such a situation would require you to know how to implement (and alter) the data structure, so yes it is of interest.
Edit
I am not advocating that you reimplement existing data structures! Don't do that. What I'm saying is that the knowledge does have practical application. For example, you may need to create a bidirectional map data structure (which you can implement by composing two unidirectional map data structures), or you may need to create a stack that keeps track of a variety of statistics (such as min, max, mean) by using an existing stack data structure with an element type that contains the value as well as these various statistics. These are some trivial examples of things that you might need to implement in the real world.
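Both examples can be sketched in a few lines of Java by composing JDK collections. The class names BiMap and MinStack are hypothetical, the map assumes non-null keys and values, and the stack tracks only the minimum to keep the sketch short; max or mean would be extra fields on the same entry type.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical bidirectional map built from two ordinary maps.
final class BiMap<K, V> {
    private final Map<K, V> forward = new HashMap<>();
    private final Map<V, K> backward = new HashMap<>();

    void put(K key, V value) {
        // Drop stale pairs so the two maps stay mirror images of each other.
        V oldValue = forward.remove(key);
        if (oldValue != null) backward.remove(oldValue);
        K oldKey = backward.remove(value);
        if (oldKey != null) forward.remove(oldKey);
        forward.put(key, value);
        backward.put(value, key);
    }

    V get(K key) { return forward.get(key); }
    K inverseGet(V value) { return backward.get(value); }
}

// Hypothetical stack that reports its minimum in constant time.
final class MinStack {
    // Each entry stores the value together with the minimum at or below it.
    private static final class Entry {
        final int value, min;
        Entry(int value, int min) { this.value = value; this.min = min; }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    void push(int value) {
        int min = entries.isEmpty() ? value : Math.min(value, entries.peek().min);
        entries.push(new Entry(value, min));
    }

    int pop() { return entries.pop().value; }   // NoSuchElementException if empty
    int min() { return entries.peek().min; }    // NullPointerException if empty

    public static void main(String[] args) {
        MinStack s = new MinStack();
        s.push(5); s.push(2); s.push(7);
        System.out.println(s.min());  // prints 2
        s.pop(); s.pop();
        System.out.println(s.min());  // prints 5

        BiMap<String, Integer> ids = new BiMap<>();
        ids.put("alice", 1);
        System.out.println(ids.inverseGet(1));  // prints alice
    }
}
```

Neither class reimplements hashing or storage; the only new logic is the bookkeeping that keeps the composed structures consistent.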
I have re-implemented some of a language's built-in data structures, functions, and classes on a number of occasions. As an embedded developer, the main reason I would do that is for speed or efficiency. The standard libraries and types were designed to be useful in a variety of situations, but there are many instances where I can create a more specialized version that is custom-tailored to take advantage of the features and limitations of my current platform. If the language doesn't provide a way to open up and modify existing classes (like you can in Ruby, for instance), then re-implementing the class/function/structure can be the only way to go.
For example, one system I worked on used a MIPS CPU that was speedy when working with 32-bit numbers but slower when working with smaller ones. I re-wrote several data structures and functions to use 32-bit integers instead of 16-bit integers, and also specified that the fields be aligned to 32-bit boundaries. The result was a noticeable speed boost in a section of code that was bottlenecking other parts of the software.
That being said, it was not a trivial process. I ended up having to modify every function that used that structure and I ended up having to re-write several standard library functions as well. In this particular instance, the benefits outweighed the work. In the general case, however, it's usually not worth the trouble. There's a big potential for hard-to-debug problems, and it's almost always more work than it looks like. Unless you have specific requirements or restrictions that the existing structures/classes don't meet, I would recommend against re-implementing them.
As Michael mentions, it is indeed useful to know how to re-implement structures even if you never do so. You may find a problem in the future that can be solved by applying the principles and techniques used in existing data structures.

What are algorithms and data structures in layman’s terms?

I currently work with PHP and Ruby on Rails as a web developer. My question is why would I need to know algorithms and data structures? Do I need to learn C, C++ or Java first? What are the practical benefits of knowing algorithms and data structures? What are algorithms and data structures in layman’s terms? (As you can tell unfortunately I have not done a CS course.)
Please provide as much information as possible and thank you in advance ;-)
Data structures are ways of storing stuff. Just as you can put stuff in stacks, queues, heaps and buckets, you can do the same thing with data.
Algorithms are recipes or instructions: the quick-start manual for your coffee maker is an algorithm for making coffee.
Algorithms are, quite simply, the steps by which you do something. For instance the Coffee Maker Algorithm would run something like
Turn on Coffee Maker
Grind Coffee Beans
Put in filter and place coffee in filter
Add Water
Start brewing process
Drink coffee
A data structure is a means by which we store information in an organized fashion. For further info, check out the Wikipedia article.
An algorithm is a list of instructions and data structures are ways to represent information. If you're writing computer programs then you're already using algorithms and data structures even if you don't know what the words mean.
I think the biggest advantages in knowing standard algorithms and data structures are:
You can communicate with other programmers using a common language.
Other people will be able to understand your code once you've left.
You will also learn better methods for solving common problems. You could probably solve these problems eventually anyway even without knowing the standard way to do it, but you will spend a lot of time reinventing the wheel and it's unlikely your solutions will be as good as those that thousands of experts have worked on and improved over the years.
An algorithm is a sequence of well defined steps leading to the solution of a type of problem.
A data structure is a way to store and organize data to facilitate access and modifications.
The benefit of knowing standard algorithms and data structures is that they are mostly better than what you could develop yourself. They are the result of months or even years of work by people who are far more intelligent than the majority of programmers. Knowing a range of data structures and algorithms allows you to fit a problem roughly to a data structure and/or algorithm and tweak as required.
In the classic "cooking/baking equivalent", algorithms are recipes and data structures are your measuring cups, your baking sheets, your cookie cutters, mixing bowls and essentially any other tool you would be using (your cooker is your compiler/interpreter, though).
(source: mit.edu)
This book is the bible on algorithms. In general, data structures relate to how to organize your data to access it in memory, and algorithms are methods or small programs to solve problems (e.g. sorting a list).
The reason you should care is first to understand what can go wrong in your code; poorly implemented algorithms can perform very badly compared to "proven" ones. Knowing classic algorithms and what performance to expect from them helps in knowing how good your code can be, and whether you can/should improve it.
Then there is no need to reinvent the wheel, and rewrite a buggy or sub-optimal implementation of a well-known structure or algorithm.
An algorithm is a representation of the process involved in a computation.
If you wanted to add two numbers then the algorithm might go:
Get first number;
Get second number;
Add first number to second number;
Return result.
At its simplest, an algorithm is just a structured list of things to do - its use in computing is that it allows people to see the intent behind the code and makes logical (as opposed to syntactical) errors easier to spot.
e.g. if step three above said multiply instead of add then someone would be able to point out the error in the logic without having to debug code.
A data structure is a representation of how a system's data should be referenced. It might match a table structure exactly or may be de-normalised to make data access easier. At its simplest it should show how the entities in a system are related.
It is too large a topic to go into in detail but there are plenty of resources on the web.
Data structures are critical the second your software has more than a handful of users. Algorithms is a broad topic, and you'll want to study it if a good knowledge of data structures doesn't fix your performance problems.
You probably don't need a new programming language to benefit from data structures knowledge, though PHP (and other high level languages) will make a lot of it invisible to you, unless you know where to look. Java is my personal favorite learning language for stuff like this, but that's pretty subjective.
My question is why would I need to know algorithms and data structures?
If you are doing any non-trivial programming, it is a good idea to understand the classic data structures and algorithms and their uses in order to avoid reinventing the wheel. For example, if you need to put an array of things in order, you need to understand the various ways of sorting, so that you can choose the most appropriate one for the task in hand. If you choose the wrong approach, you can end up with a program that is grossly inefficient in some circumstances.
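To give the sorting example some shape, here is a small sketch in Java that puts a hand-written insertion sort next to the library sort. SortChoice is a made-up class name. Insertion sort is a reasonable choice for tiny or nearly sorted arrays, but it does quadratic work on large unordered input, which is the kind of gross inefficiency described above.

```java
import java.util.Arrays;

// Hypothetical example class comparing two ways to sort the same data.
final class SortChoice {
    // Insertion sort: O(n^2) comparisons in the worst case, but close to O(n)
    // when the input is already nearly sorted.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int x = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > x) {
                a[j + 1] = a[j];   // shift larger elements one slot right
                j--;
            }
            a[j + 1] = x;          // drop x into the gap
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 4, 2, 3};
        int[] byHand = data.clone();
        int[] byLibrary = data.clone();
        insertionSort(byHand);
        Arrays.sort(byLibrary);    // the JDK's general-purpose sort
        System.out.println(Arrays.toString(byHand));           // prints [1, 2, 3, 4, 5]
        System.out.println(Arrays.equals(byHand, byLibrary));  // prints true
    }
}
```

Both calls produce the same result; what differs is how the running time grows with the size and shape of the input, and that is what knowing the algorithms lets you predict.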
Do I need to learn C, C++ or Java first?
You need to know how to program in some language in order to understand what the algorithms and data structures do.
What are the practical benefits of knowing algorithms and data structures?
The main practical benefits are:
to avoid having to reinvent the wheel all of the time,
to avoid the problem of square wheels.
