Reimplementing data structures in the real world - data-structures

The topic of algorithms class today was reimplementing data structures, specifically ArrayList in Java. The fact that you can customize a structure for in various ways definitely got me interested, particularly with variations of add() & iterator.remove() methods.
But is reimplementing and customizing a data structure something that is of more interest to the academics vs the real-world programmers? Has anyone reimplemented their own version of a data structure in a commercial application/program, and why did you pick that route over your particular language's implementation?

Knowing how data structures are implemented and can be implemented is definitely of interest to everyone, not just academics. While you will most likely not reimplement a datastructure if the language already provides an implementation with suitable functions and performance characteristics, it is very possible that you will have to create your own data structure by composing other data structures... or you may need to implement a data structure with slightly different behavior than a well-known data structure. In that case, you certainly will need to know how the original data structure is implemented. Alternatively, you may end up needing a data structure that does not exist or which provides similar behavior to an existing data structure, but the way in which it is used requires that it be optimized for a different set of functions. Again, such a situation would require you to know how to implement (and alter) the data structure, so yes it is of interest.
Edit
I am not advocating that you reimplement existing datastructures! Don't do that. What I'm saying is that the knowledge does have practical application. For example, you may need to create a bidirectional map data structure (which you can implement by composing two unidirectional map data structures), or you may need to create a stack that keeps track of a variety of statistics (such as min, max, mean) by using an existing stack data structure with an element type that contains the value as well as these various statistics. These are some trivial examples of things that you might need to implement in the real world.

I have re-implemented some of a language's built-in data structures, functions, and classes on a number of occasions. As an embedded developer, the main reason I would do that is for speed or efficiency. The standard libraries and types were designed to be useful in a variety of situations, but there are many instances where I can create a more specialized version that is custom-tailored to take advantage of the features and limitations of my current platform. If the language doesn't provide a way to open up and modify existing classes (like you can in Ruby, for instance), then re-implementing the class/function/structure can be the only way to go.
For example, one system I worked on used a MIPS CPU that was speedy when working with 32-bit numbers but slower when working with smaller ones. I re-wrote several data structures and functions to use 32-bit integers instead of 16-bit integers, and also specified that the fields be aligned to 32-bit boundaries. The result was a noticable speed boost in a section of code that was bottlenecking other parts of the software.
That being said, it was not a trivial process. I ended up having to modify every function that used that structure and I ended up having to re-write several standard library functions as well. In this particular instance, the benefits outweighed the work. In the general case, however, it's usually not worth the trouble. There's a big potential for hard-to-debug problems, and it's almost always more work than it looks like. Unless you have specific requirements or restrictions that the existing structures/classes don't meet, I would recommend against re-implementing them.
As Michael mentions, it is indeed useful to know how to re-implement structures even if you never do so. You may find a problem in the future that can be solved by applying the principles and techniques used in existing data structures.

Related

When would it be a good idea to implement data structures rather than using built-in ones?

What is the purpose of creating your own linked list, or other data structure like maps, queues or hash function, for some programming language, instead of using built-in ones, or why should I create it myself? Thank you.
Good question! There are several reasons why you might want to do this.
For starters, not all programming languages ship with all the nice data structures that you might want to use. For example, C doesn't have built-in libraries for any data structures (though it does have bsearch and qsort for arrays), so if you want to use a linked list, hash table, etc. in C you need to either build it yourself or use a custom third-party library.
Other languages (say, JavaScript) have built-in support for some but not all types of data structures. There's no native JavaScript support for linked lists or binary search trees, for example. And I'm not aware of any mainstream programming language that has a built-in library for tries, though please let me know if that's not the case!
The above examples indicate places where a lack of support, period, for some data structure would require you to write your own. But there are other reasons why you might want to implement your own custom data structures.
A big one is efficiency. Put yourself in the position of someone who has to implement a dynamic array, hash table, and binary search tree for a particular programming language. You can't possibly know what workflows people are going to subject your data structures to. Are they going to do a ton of inserts and deletes, or are they mostly going to be querying things? For example, if you're writing a binary search tree type where insertions and deletions are common, you probably would want to look at something like a red/black tree, but if insertions and deletions are rare then an AVL tree would work a lot better. But you can't know this up front, because you have to write one implementation that stands the test of time and works pretty well for all applications. That might counsel you to pick a "reasonable" choice that works well in many applications, but isn't aggressively performance-tuned for your specific application. Coding up a custom data structure, therefore, might let you take advantage of the particular structure of the problem you're solving.
In some cases, the language specification makes it impossible or difficult to use fast implementations of data structures as the language standard. For example, C++ requires its associative containers to allow for deletions and insertions of elements without breaking any iterators into them. This makes it significantly more challenging / inefficient to implement those containers with types like B-trees that might actually perform a bit better than regular binary search trees due to the effects of caches. Similarly, the implementation of the unordered containers has an interface that assumes chained hashing, which isn't necessarily how you'd want to implement a hash table. That's why, for example, there's Google's alternatives to the standard containers that are optimized to use custom data structures that don't easily fit into the language framework.
Another reason why libraries might not provide the fastest containers would be challenges in providing a simple interface. For example, cuckoo hashing is a somewhat recent hashing scheme that has excellent performance in practice and guarantees worst-case efficient lookups. But to make cuckoo hashing work, you need the ability to select multiple hash functions for a given data type. Most programming languages have a concept that each data type has "a" hash function (std::hash<T>, Object.hashCode, __hash__, etc.), which isn't compatible with this idea. The languages could in principle require users to write families of hash functions with the idea that there would be many different hashes to pick from per object, but that complicates the logistics of writing your own custom types. Leaving it up to the programmer to write families of hash functions for types that need it then lets the language stay simple.
And finally, there's just plain innovation in the space. New data structures get invented all the time, and languages are often slow to grow and change. There's been a bunch of research into new faster binary search trees recently (check out WAVL trees as an example) or new hashing strategies (cuckoo hashing and the "Swiss Table" that Google developed), and language designers and implementers aren't always able to keep pace with them.
So, overall, the answer is a mix of "because you can't assume your favorite data structure will be there" and "because you might be able to get better performance rolling your own implementations."
There's one last reason I can think of, and that's "to learn how the language and the data structure work." Sometimes it's worthwhile building out custom data types just to sharpen your skills, and you'll often find some really clever techniques in data structures when you do!
All of this being said, I wouldn't recommend defaulting to coding your own version of a data structure every time you need one. Library versions are usually a pretty safe bet unless you're looking for extra performance or you're missing some features that you need. But hopefully this gives you a better sense as to why you may want to consider setting aside the default, well-tested tools and building out your own.
Hope this helps!

B-Trees and its applications

I am studying B Trees and performing their respective implementation in C++. So, I will submit a final project for the course "Analysis and Design of Algorithms I", where the emphasis is on the study of abstract data types, programming techniques for optimizing the complexity of the algorithms and, precisely, planning algorithms on various structures.
The problem is that only deliver the work of the implementation of the structure and their respective operations sounds a bit crude. Then I have to find an application to includ in my project. The point is that the only application that I've found is the creation of a database engine or a file system for an operative system. I'm not sure about database design and to make matters worse, the databases uses B+Trees.
So, can you list some applications that can be implemented using B Trees?
Thanks!
You are right, file systems and databases are the first thing that come to mind with B-Trees.
But, in general, every application that stores some kind of sortable data could be stored with a B-Tree. So you could just write a small address book where you store names, addresses, etc., you can back this with your own B-Tree implementation. (Of course, in practice, it's usually a better idea to use an existing library or database for that ...)

Implementing data structures/algorithms in languages that already support them

Does it makes sense to implement your own version of data structures and algorithms in your language of choice even if they are already supported, knowing that care has been taking into tuning them for best possible performance?
Sometimes - yes. You might need to optimise the data structure for your specific case, or give it some specific extra functionality.
A java example is apache Lucene (A mature, widely used Information Retrieval library). Although the Map<S,T> interface and implementations already exists - for performance issues, its usage is not good enough, since it boxes the int to an Integer, and a more optimized IntToIntMap was developed for this purpose, instead of using a Map<Integer,Integer>.
The question contains a false assumption, that there's such a thing as "best possible performance".
If the already-existing code was tuned for best possible performance with your particular usage patterns, then it would be impossible for you to improve on it in respect of performance, and attempting to do so would be futile.
However, it wasn't tuned for best possible performance with your particular usage. Assuming it was tuned at all, it was designed to have good all-around performance on average, taken across a lot of possible usage patterns, some of which are irrelevant to you.
So, it is possible in principle that by implementing the code yourself, you can apply some tweak that helps you and (if the implementers considered that tweak at all) presumably hinders some other user somewhere else. But that's OK, they don't have to use your code. Maybe you like cuckoo hashing and they like linear probing.
Reasons that the implementers might not have considered the tweak include: they're less smart than you (rare, but it happens); the tweak hadn't been invented when they wrote the code and they aren't following the state of the art for that structure / algorithm; they have better things to do with their time and you don't. In those cases perhaps they'd accept a patch from you once you're finished.
There are also reasons other than performance that you might want a data structure very similar to one that your language supports, but with some particular behavior added or removed. If you can't implement that on top of the existing structure then you might well do it from scratch. Obviously it's a significant cost to do so, up front and in future support, but if it's worth it then you do it.
It may makes sense when you are using a compiled language (like C, Assembly..).
When using an interpreted language you will probably have a performance loss, because the native structure parsers are already compiled, and won't waste time "interpreting" the new structure.
You will probably do it only when the native structure or algorithm lacks something you need.

Modern data structures

I just realized all the data structures I regularly use are really old and really simple. Linked lists, hash tables, trees, and even the more complex variants such as VLists or RBTrees are all pretty old inventions.
Most of them were conceived for a serial, single CPU world and require adapting to work in parallel environments.
What kind of newer, better data structures do we have? Why are they not widely used?
I understand using a plain old linked list if you have to implement it and prefer the simplicity, but having huge STLs and piles of third party libraries like Guava or Boost, why am I still placing locks around hashes?
Don't we have potentially standard, hard-proven modern data structures that can actually replace the trusty old-timers?
There is nothing wrong with the old ones. A good way to keep flexibility is to separate concerns. Normal (old style) datastructures are concerned with the way how data is stored. Locking is a completely different concern, which should not be part of the datastructure.
Locking is a potentially expensive operation, so if you can, you should lock multiple structures at once to optimize your code. I.e. lock critical sections not datastructures. If you directly add locking to your datastructures, you do not have the possibility to optimize this way. Also this will introduce implicit synchronisation points, that you may not want and canot control.
This does not answer a different aspect of your question. The part of "Why do we need locking at all". The answer is, that sometimes there is just no way around it. You either need to have lock somewhere, completely rely on atomic operations or disallow mutation altogether.
Method one is not feasible, as I have pointed out above, because you loose potential for optimization and you have implicit synchronisation points.
Only using atomic operations in your data structure (i.e. non-locking structures) is still an open research question, and might not always be possible. I know of some non-locking structures, i.e. queues, lists etc, but I have never heard of a non locking tree. Also non-locking structures tend to become much more complicated and slower, so we still need some better structure for thread local data, and can only add these to our datastructure zoo.
Not having mutable datastructures at all is in my opinion the best way of all of them. Mutability is often more of a hassle than it is worth. However this is a concept from functional programming and only makes sense in such an environment. Functional programming however is regarded as an esoteric concept by most programmers. Most languages which are actually used in production work mainly use non-functional concepts (this does not mean it actually is more complicated or is more error prone, it is just reflecting the current state of training among developers). In my opinion functional programming will become more wide spread, once people start to note it solves their threading problems automatically in a blink. Several other languages are now borrowing already from functional languages, so this is probably where we will find the next evolution of data structures.
If you want lock-free data structures, study persistent data structures. These are mostly popular in the functional programming world, but are applicable in other domains as well. Most persistent DSs are variants of plain lists, trees etc. but newer ones such as hash tries have surfaced in recent years.

Do you use linked lists, doubly linked lists and so on, in business programming?

Are data structures like linked lists something that are purely academic for real programming or do you really use them? Are they things that are covered by generics so that you don't need to build them (assuming your language has generics)? I'm not debating the importance of understanding what they are, just the usage of them outside of academia. I ask from a front end web, backend database perspective. I'm sure someone somewhere builds these. I'm asking from my context.
Thank you.
EDIT: Are Generics so that you don't have to build linked lists and the like?
It will depend on the language and frameworks you're using. Most modern languages and frameworks won't make you reinvent these wheels. Instead, they'll provide things like List<T> or HashTable.
EDIT:
We probably use linked lists all the time, but don't realize it. We don't have to write implementations of linked lists on our own, because the frameworks we use have already written them for us.
You may also be getting confused about "generics". You may be referring to generic list classes like List<T>. This is just the same as the non-generic class List, but where the element is always of type T. It is probably implemented as a linked list, but we don't have to care about that.
We also don't have to worry about allocation of physical memory, or how interrupts work, or how to create a file system. We have operating systems to do that for us. But we may be taught that information in school just the same.
Certainly. Many "List" implementations in modern languages are actually linked lists, sometimes in combination with arrays or hash tables for direct access (by index as opposed to iteration).
Linked lists (especially doubly linked lists) are very commonly used in "real-world" data structures.
I would dare to say every common language has a pre-built implementation of linked list, either as a language primitive, native template library (e.g. C++), native library (e.g. Java) or some 3rd party implementation (probably open-source).
That being said, several times in the past I wrote a linked list implementation from scratch myself when creating infrastructure code for complex data structures. Sometimes it's a good idea to have full control over the implementation, and sometimes you need to add a "twist" to the classic implementation for it to satisfy your specific requirement. There's no right or wrong when it comes to whether to code your own implementation, as long as you understand the alternatives and trade-offs. In most cases, and certainly in very modern languages like C# I would avoid it.
Another point is when you should use lists versus array/vectors or hash tables. From your question I understand you are aware of the trade-offs here so I won't go too much into it, but basically, if your main usage is traversing lists by-order, and the list size may vary significantly, a list may be a viable option. Another consideration is the type of insertion. If a common use case is "inserting in the middle", than lists have a significant advantage over arrays/vectors. I can go on but this information is in the classic CS books :)
Clarification: My answer is language agnostic and does not relate specifically to Generics which to my understanding have a linked list implementation.
A singly-linked list is the only way to have a memory efficient immutable list which can be composed to "mutate" it. Look at how Erlang does it. It may be slightly slower than an array-backed list but it has very useful properties in multithreaded and purely-functional implementations.
Yes, there are real world application that use linked list, I sometimes have to maintain a huge application that makes very have use of linked lists.
And yes, linked lists are included in just about any class library from C++/STL to .net.
And I wish it used arrays instead.
In the real world linked lists are SLOW because of things like paging and CPU cache size (linked lists tend to spread you data and that makes it more likely that you will need to access data from different areas of memory and that is much slower on todays computers than using arrays that store all the data in one sequence).
Google "locality of reference" for more info.
Never used hand-made lists except for homeworks at university.
Depending on usage a linked list could be the best option. Deletes from the front of the list are much faster with a linked list than an array list.
In a Java program that I maintain profiling showed that I could increase performance by moving from an ArrayList to a LinkedList for a List that had lots of deletes at the beginning.
I've been developing line of business applications (.NET) for years and I can only think of one instance where I've used linked list and even then I did not have to create the object.
This has just been my experience.
I would say it depends on the usage, in some cases they are quicker than typical random access containers.
Also I think they are used by some libraries as an underlying collection type, so what may look like a non-linked list might actually be one underneath.
In a C/C++ application I developed at my last company we used doubly linked lists all the time. They were essential to what we were doing, which was real-time 3D graphics.
Yes all sorts of data-structures are very useful in daily software development. In most languages that I know (C/C++/Python/Objective-C) there are frameworks that implement those data-structures so that you don't have to reinvent the wheel.
And yes, data-structures are not only for academics, they are very useful and you would not be able to write software without them (depends on what you do).
You use data-structures in message queues, data maps, hash tables, keeping data ordered, fast access/removal/insertion and so on depends what needs to be done.
Yes, I do. It all depends on the situation. If I won't be storing a lot of data in them, or if the specific application needs a FIFO structure, I'll use them without a second thought because they are fast to implement.
However, in applications for other developers I know, there are times that a linked list would fit perfectly except that poor locality causes a lot of cache misses.
I cannot imagine many programs that doesn't deal with lists.
The minute you need to deal with more than 1 thing of something, lists in all forms and shapes becomes needed, as you need somewhere to store these things. That list might be a singly/doubly linked list, an array, a set, a hashtable if you need to index your things based on a key, a priority queue if you need to sort it etc.
Typically you'd store these lists in a database system, but somewhere you need to fetch them from the db, store them in your application and manipulate them, even if it's as simple to retrieve a little list of things you populate into a drop-down combobox.
These days, in languages such as C#,Python,Java and many more, you're usually abstracted away from having to implement your own lists. These languages come with a great deal of abstractions of containers you can store stuff in. Either via standard libraries or built into the language.
You're still at an advantage of learning these topics, e.g. if you're working with C# you'd want to know how an ArrayList works, and wheter you'd choose ArrayList or something else depending on your need to add/insert/search/random index such a list.

Resources