Navigate and update collections (without creating new collections): how to do it with Java 8

I have an aList and a bList; both have one field in common, which is my reference for matching the two lists.
Once the references in the two lists match, I want to update the bList objects with the matching aList objects.
The conventional approach is shown below. How can I achieve the same in Java 8?
// How can the two nested iterations below (along with compare* and update*)
// be replaced using Java 8?
// Stream filter will return a new Collection but will not update the same one (bList)
for (A a : aList)
{
for(B b: bList )
{
// compare*
if(a.getStrObj().equalsIgnoreCase(b.getStrObj()))
{
// update*
// assume aObjs is initialized
b.getAObjs().add(a);
}
}
}
// Reference for Objects declaration
List<A> aList;
class A {
String strObj;
public String getStrObj()
{ return strObj; }
}
List<B> bList;
class B {
String strObj;
List<A> aObjs;
public String getStrObj()
{ return strObj; }
public void setAObjs(List<A> aObjs)
{ this.aObjs= aObjs; }
public List<A> getAObjs()
{ return this.aObjs;}
}

Your nested loop is not the best way to do it, even before Java 8 (unless you can prove that the lists will always be rather small). You should use a temporary Map with a fast lookup for one of the lists to avoid performing m×n operations (string comparisons).
One way to do that with Java 8 is
Map<String, List<A>> m=aList.stream().collect(Collectors.groupingBy(A::getStrObj));
bList.forEach(b -> b.getAObjs()
.addAll(m.getOrDefault(b.getStrObj(), Collections.emptyList())));
Here we are performing m+n operations rather than m×n operations, which scales much better with growing list sizes. Note that the Map lookup compares keys case-sensitively, unlike the equalsIgnoreCase call in your loop; if case must be ignored, normalize the keys (for example to lower case) on both sides.
You can create an equivalent implementation with pre-Java 8 constructs, i.e. two independent loops rather than two nested loops, and the resulting code isn’t necessarily worse than the above Java 8 code.
Still, the above Java 8 code might introduce you to some of the most important features (a method reference, a lambda expression, a stream collect operation and one of the new default operations of the Map interface), so you know where to start next time when solving a similar problem.
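For comparison, here is a minimal sketch of the two-independent-loops variant mentioned above, written without streams or lambdas. The class name TwoLoopJoin, the attach method and the key helper are hypothetical names introduced for this illustration; the key helper lower-cases the strings to approximate the equalsIgnoreCase comparison from the question, and the simplified A and B classes assume aObjs is always initialized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TwoLoopJoin {
    static final class A {
        final String strObj;
        A(String strObj) { this.strObj = strObj; }
        String getStrObj() { return strObj; }
    }

    static final class B {
        final String strObj;
        final List<A> aObjs = new ArrayList<>();
        B(String strObj) { this.strObj = strObj; }
        String getStrObj() { return strObj; }
        List<A> getAObjs() { return aObjs; }
    }

    // Hypothetical helper: normalizes a key so that the lookup ignores case.
    static String key(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    static void attach(List<A> aList, List<B> bList) {
        // Loop 1: index aList by normalized key (m steps).
        Map<String, List<A>> byKey = new HashMap<>();
        for (A a : aList) {
            String k = key(a.getStrObj());
            List<A> group = byKey.get(k);
            if (group == null) {
                group = new ArrayList<>();
                byKey.put(k, group);
            }
            group.add(a);
        }
        // Loop 2: one map lookup per element of bList (n steps).
        for (B b : bList) {
            List<A> matches = byKey.get(key(b.getStrObj()));
            if (matches != null) {
                b.getAObjs().addAll(matches);
            }
        }
    }
}
```

As in the nested-loop version, each B ends up with its matching A objects in aList order, because each group is filled while iterating aList from front to back.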

Related

JAVA 8 Extract predicates as fields or methods?

What is the cleaner way of extracting predicates that will have multiple uses: methods or class fields?
The two examples:
1.Class Field
void someMethod() {
IntStream.range(1, 100)
.filter(isOverFifty)
.forEach(System.out::println);
}
private IntPredicate isOverFifty = number -> number > 50;
2.Method
void someMethod() {
IntStream.range(1, 100)
.filter(isOverFifty())
.forEach(System.out::println);
}
private IntPredicate isOverFifty() {
return number -> number > 50;
}
For me, the field way looks a little bit nicer, but is this the right way? I have my doubts.
Generally you cache things that are expensive to create and these stateless lambdas are not. A stateless lambda will have a single instance created for the entire pipeline (under the current implementation). The first invocation is the most expensive one - the underlying Predicate implementation class will be created and linked; but this happens only once for both stateless and stateful lambdas.
A stateful (capturing) lambda, by contrast, gets a new instance each time the lambda expression is evaluated (again, under the current implementation), and it might make sense to cache those, but your example is stateless, so I would not.
If you still want that (for readability, I assume), I would put it in a utility class, say Predicates. It would be reusable across different classes as well, something like this:
public final class Predicates {
private Predicates(){
}
public static IntPredicate isOverFifty() {
return number -> number > 50;
}
}
You should also notice that using Predicates.isOverFifty inside a Stream and using x -> x > 50, while semantically the same, will have different memory usage.
In the first case, only a single instance (and class) will be created and served to all clients, while the second (x -> x > 50) will create not only a different instance, but also a different class for each of its clients (think of the same expression used in different places inside your application). This happens because the linkage happens per CallSite, and in the second case the CallSite is always different.
But that is something you should not rely on (and probably not even consider): these objects and classes are fast to build and fast for the GC to remove. Use whatever fits your needs.
To answer this, it helps to expand those lambda expressions into old-fashioned (pre-Java 8) Java. These are the two forms we used to write in our code, so the answer is that it all depends on how you write the particular code segment.
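// Field variant: the anonymous class is instantiated once per enclosing object.
private IntPredicate isOverFifty = new IntPredicate() {
    @Override
    public boolean test(int number) {
        return number > 50;
    }
};
// Method variant: the anonymous class is instantiated on every call.
private IntPredicate isOverFifty() {
    return new IntPredicate() {
        @Override
        public boolean test(int number) {
            return number > 50;
        }
    };
}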
1) In the field case, a predicate is allocated for every new instance of your class. That is not a big deal if there are only a few instances, as with a service. But if this is a value object with many (N) instances, it is not a good solution. Also keep in mind that someMethod() may never be called at all. One possible solution is to make the predicate a static field.
2) In the method case, a new predicate is created on every call to someMethod(), and the GC discards it afterwards.
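The static-field option mentioned in point 1 might look like the following minimal sketch. The class name OverFiftyExample, the constant name IS_OVER_FIFTY and the countOverFifty method are hypothetical and only serve the illustration.

```java
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

final class OverFiftyExample {
    // A single stateless predicate shared by all instances of the class,
    // so creating more OverFiftyExample objects allocates no extra predicates.
    private static final IntPredicate IS_OVER_FIFTY = number -> number > 50;

    static long countOverFifty() {
        // range(1, 100) covers 1..99; the filter keeps 51..99.
        return IntStream.range(1, 100)
                .filter(IS_OVER_FIFTY)
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countOverFifty());
    }
}
```

The predicate is created when the class is initialized, regardless of whether any method using it is ever called, which is usually negligible for a stateless lambda.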

Why filter with side effects performs better than a Spliterator based implementation?

Regarding the question How to skip even lines of a Stream obtained from the Files.lines I followed the accepted answer approach implementing my own filterEven() method based on Spliterator<T> interface, e.g.:
public static <T> Stream<T> filterEven(Stream<T> src) {
Spliterator<T> iter = src.spliterator();
AbstractSpliterator<T> res = new AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED)
{
@Override
public boolean tryAdvance(Consumer<? super T> action) {
iter.tryAdvance(item -> {}); // discard
return iter.tryAdvance(action); // use
}
};
return StreamSupport.stream(res, false);
}
which I can use in the following way:
Stream<DomainObject> res = filterEven(Files.lines(src))
.map(line -> toDomainObject(line));
However measuring the performance of this approach against the next one which uses a filter() with side effects I noticed that the next one performs better:
final int[] counter = {0};
final Predicate<String> isEvenLine = item -> ++counter[0] % 2 == 0;
Stream<DomainObject> res = Files.lines(src)
.filter(isEvenLine)
.map(line -> toDomainObject(line));
I tested the performance with JMH, and I am not including the file load in the benchmark: I load the file into an array beforehand. Each benchmark then starts by creating a Stream<String> from that array, filters the even lines, applies mapToInt() to extract the value of an int field, and finally performs a max() operation. Here is one of the benchmarks (you can check the whole Program here and here you have the data file with about 186 lines):
@Benchmark
public int maxTempFilterEven(DataSource src){
Stream<String> content = Arrays.stream(src.data)
.filter(s-> s.charAt(0) != '#') // Filter comments
.skip(1); // Skip line: Not available
return filterEven(content) // Filter daily info and skip hourly
.mapToInt(line -> parseInt(line.substring(14, 16)))
.max()
.getAsInt();
}
I don't understand why the filter() approach has better performance (~80 ops/ms) than filterEven() (~50 ops/ms).
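To make the comparison concrete, the following self-contained sketch puts the two approaches from the question side by side on a small in-memory stream. The class name EvenLinesDemo and the method name filterEvenStateful are hypothetical; filterEven follows the question's code. It only checks that both variants select the same elements and says nothing about their relative speed.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class EvenLinesDemo {
    // Spliterator-based variant: discard one element, then pass on the next.
    static <T> Stream<T> filterEven(Stream<T> src) {
        Spliterator<T> iter = src.spliterator();
        AbstractSpliterator<T> res =
                new AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED) {
            @Override
            public boolean tryAdvance(Consumer<? super T> action) {
                iter.tryAdvance(item -> { });   // discard
                return iter.tryAdvance(action); // use
            }
        };
        return StreamSupport.stream(res, false);
    }

    // filter() variant with a side effect; only safe for sequential streams.
    static <T> Stream<T> filterEvenStateful(Stream<T> src) {
        final int[] counter = {0};
        Predicate<T> isEvenLine = item -> ++counter[0] % 2 == 0;
        return src.filter(isEvenLine);
    }

    public static void main(String[] args) {
        List<String> viaSpliterator =
                filterEven(Stream.of("a", "b", "c", "d", "e")).collect(Collectors.toList());
        List<String> viaFilter =
                filterEvenStateful(Stream.of("a", "b", "c", "d", "e")).collect(Collectors.toList());
        System.out.println(viaSpliterator.equals(viaFilter));
    }
}
```

Both variants keep the second, fourth, and so on elements of the source.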
Intro
I think I know the reason, but unfortunately I have no idea how to improve the performance of the Spliterator-based solution (at least not without rewriting the whole Streams API feature).
Sidenote 1: performance was not the most important design goal when the Stream API was designed. If performance is critical, rewriting the code without the Stream API will most probably make it faster. (For example, the Stream API unavoidably increases memory allocation and thus GC pressure.) On the other hand, in most scenarios the Stream API provides a nicer, higher-level API at the cost of a relatively small performance degradation.
Part 1 or Short theoretical answer
Stream is designed to implement a kind of internal iteration as the main means of consumption, and external iteration (i.e. Spliterator-based) is an additional means that is kind of "emulated". Thus external iteration involves some overhead. Laziness adds some limits to the efficiency of external iteration, and the need to support flatMap makes it necessary to use some kind of dynamic buffer in this process.
Sidenote 2: In some cases Spliterator-based iteration might be as fast as internal iteration (i.e. filter in this case). In particular, this is so when you create a Spliterator directly from the data-containing Stream. To see it, you can modify your tests to materialize your first filter into a String array:
String[] filteredData = Arrays.stream(src.data)
.filter(s-> s.charAt(0) != '#') // Filter comments
.skip(1)
.toArray(String[]::new);
and then compare the performance of maxTempFilter and maxTempFilterEven modified to accept that pre-filtered String[] filteredData. If you want to know why this is so, you should probably read the rest of this long answer, or at least Part 2.
Part 2 or Longer theoretical answer:
Streams were designed to be mainly consumed as a whole by some terminal operation. Iterating over elements one by one, although supported, is not designed as the main way to consume streams.
Note that using the "functional" Stream API such as map, flatMap, filter, reduce, and collect you can't say at some step "I have had enough data, stop iterating over the source and pushing values". You can discard some incoming data (as filter does) but can't stop iteration. (The limit and skip transformations are actually implemented using a Spliterator inside; and anyMatch, allMatch, noneMatch, findFirst, findAny, etc. use the non-public API j.u.s.Sink.cancellationRequested. They are also easier, as there can't be several terminal operations.) If all transformations in the pipeline are synchronous, you can combine them into a single aggregated function (Consumer) and call it in a simple loop (optionally splitting the loop execution over several threads). This is what my simplified version of the state-based filter represents (see the code in the Show me some code section). It gets a bit more complicated if there is a flatMap in the pipeline, but the idea is still the same.
A Spliterator-based transformation is fundamentally different because it adds an asynchronous consumer-driven step to the pipeline. Now the Spliterator rather than the source Stream drives the iteration process. If you ask for a Spliterator directly on the source Stream, it might be able to return some implementation that just iterates over its internal data structure, and this is why materializing the pre-filtered data should remove the performance difference. However, if you create a Spliterator for some non-empty pipeline, there is no (simple) choice other than asking the source to push elements one by one through the pipeline until some element passes all the filters (see also the second example in the Show me some code section). The fact that source elements are pushed one by one rather than in batches is a consequence of the fundamental decision to make Streams lazy. The need for a buffer instead of just one element is a consequence of supporting flatMap: pushing one element from the source can produce many elements for the Spliterator.
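The difference between asking the source for a Spliterator and asking a non-empty pipeline for one can be observed with a small sketch like the one below. The class name SpliteratorSourceDemo and the drain helper are hypothetical. The printed class names are implementation-specific and may differ between JDK versions, so the sketch prints them without assuming particular values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

final class SpliteratorSourceDemo {
    // External iteration: pull the elements one at a time via tryAdvance.
    static List<String> drain(Spliterator<String> sp) {
        List<String> out = new ArrayList<>();
        while (sp.tryAdvance(out::add)) {
            // nothing to do; each call delivers at most one element
        }
        return out;
    }

    public static void main(String[] args) {
        String[] data = {"#comment", "a", "b"};

        // Spliterator requested directly on the source stream.
        Spliterator<String> direct = Arrays.stream(data).spliterator();

        // Spliterator requested on a pipeline with one intermediate operation.
        Spliterator<String> wrapped = Arrays.stream(data)
                .filter(s -> s.charAt(0) != '#')
                .spliterator();

        System.out.println(direct.getClass().getName());
        System.out.println(wrapped.getClass().getName());
        System.out.println(drain(direct));
        System.out.println(drain(wrapped));
    }
}
```

Both spliterators deliver their elements correctly; what differs is the machinery behind each tryAdvance call, which is the overhead discussed above.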
Part 3 or Show me some code
This part tries to provide some backing with the code (both links to the real code and simulated code) of what was described in the "theoretical" parts.
First of all, you should know that the current Streams API implementation accumulates non-terminal (intermediate) operations into a single lazy pipeline (see j.u.s.AbstractPipeline and its children such as j.u.s.ReferencePipeline). Then, when the terminal operation is applied, all the elements from the original Stream are "pushed" through the pipeline.
What you see is the result of two things: first, stream pipelines are different when they have a Spliterator-based step inside; second, your OddLines is not the first step in the pipeline.
The code with a stateful filter is more or less similar to the following straightforward code:
static int similarToFilter(String[] data)
{
final int[] counter = {0};
final Predicate<String> isEvenLine = item -> ++counter[0] % 2 == 0;
int skip = 1;
boolean reduceEmpty = true;
int reduceState = 0;
for (String outerEl : data)
{
if (outerEl.charAt(0) != '#')
{
if (skip > 0)
skip--;
else
{
if (isEvenLine.test(outerEl))
{
int intEl = parseInt(outerEl.substring(14, 16));
if (reduceEmpty)
{
reduceState = intEl;
reduceEmpty = false;
}
else
{
reduceState = Math.max(reduceState, intEl);
}
}
}
}
}
return reduceState;
}
Note that this is effectively a single loop with some calculations (filtering/transformations) inside.
When you add a Spliterator into the pipeline, on the other hand, things change significantly, and even with simplifications, code that is reasonably similar to what actually happens becomes much larger, such as:
interface Sp<T>
{
public boolean tryAdvance(Consumer<? super T> action);
}
static class ArraySp<T> implements Sp<T>
{
private final T[] array;
private int pos;
public ArraySp(T[] array)
{
this.array = array;
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
if (pos < array.length)
{
action.accept(array[pos]);
pos++;
return true;
}
else
{
return false;
}
}
}
static class WrappingSp<T> implements Sp<T>, Consumer<T>
{
private final Sp<T> sourceSp;
private final Predicate<T> filter;
private final ArrayList<T> buffer = new ArrayList<T>();
private int pos;
public WrappingSp(Sp<T> sourceSp, Predicate<T> filter)
{
this.sourceSp = sourceSp;
this.filter = filter;
}
@Override
public void accept(T t)
{
buffer.add(t);
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
while (true)
{
if (pos >= buffer.size())
{
pos = 0;
buffer.clear();
sourceSp.tryAdvance(this);
}
// failed to fill buffer
if (buffer.size() == 0)
return false;
T nextElem = buffer.get(pos);
pos++;
if (filter.test(nextElem))
{
action.accept(nextElem);
return true;
}
}
}
}
static class OddLineSp<T> implements Sp<T>, Consumer<T>
{
private Sp<T> sourceSp;
public OddLineSp(Sp<T> sourceSp)
{
this.sourceSp = sourceSp;
}
@Override
public boolean tryAdvance(Consumer<? super T> action)
{
if (sourceSp == null)
return false;
sourceSp.tryAdvance(this);
if (!sourceSp.tryAdvance(action))
{
sourceSp = null;
}
return true;
}
@Override
public void accept(T t)
{
}
}
static class ReduceIntMax
{
boolean reduceEmpty = true;
int reduceState = 0;
public int getReduceState()
{
return reduceState;
}
public void accept(int t)
{
if (reduceEmpty)
{
reduceEmpty = false;
reduceState = t;
}
else
{
reduceState = Math.max(reduceState, t);
}
}
}
static int similarToSpliterator(String[] data)
{
ArraySp<String> src = new ArraySp<>(data);
int[] skip = new int[1];
skip[0] = 1;
WrappingSp<String> firstFilter = new WrappingSp<String>(src, (s) ->
{
if (s.charAt(0) == '#')
return false;
if (skip[0] != 0)
{
skip[0]--;
return false;
}
return true;
});
OddLineSp<String> oddLines = new OddLineSp<>(firstFilter);
final ReduceIntMax reduceIntMax = new ReduceIntMax();
while (oddLines.tryAdvance(s ->
{
int intValue = parseInt(s.substring(14, 16));
reduceIntMax.accept(intValue);
})) ; // do nothing in the loop body
return reduceIntMax.getReduceState();
}
This code is larger because the logic is impossible (or at least very hard) to represent without some non-trivial stateful callbacks inside the loop. Here interface Sp is a mix of j.u.s.Stream and j.u.Spliterator interfaces.
Class ArraySp represents a result of Arrays.stream.
Class WrappingSp is similar to j.u.s.StreamSpliterators.WrappingSpliterator, which in the real code represents an implementation of the Spliterator interface for any non-empty pipeline, i.e. a Stream with at least one intermediate operation applied to it (see the j.u.s.AbstractPipeline.spliterator method). In my code I merged it with a StatelessOp subclass and put the logic responsible for the filter method implementation there. Also, for simplicity, I implemented skip using filter.
OddLineSp corresponds to your OddLines and its resulting Stream
ReduceIntMax represents ReduceOps terminal operation for Math.max for int
So what's important in this example? The important thing here is that since you first filter your original stream, your OddLineSp is created from a non-empty pipeline, i.e. from a WrappingSp. And if you take a closer look at WrappingSp, you'll notice that every time tryAdvance is called, it delegates the call to the sourceSp and accumulates the result(s) into a buffer. Moreover, since you have no flatMap in the pipeline, elements will be copied to the buffer one by one. That is, every time WrappingSp.tryAdvance is called, it will call ArraySp.tryAdvance, get back exactly one element (via the callback), and pass it on to the consumer provided by the caller (unless the element doesn't match the filter, in which case ArraySp.tryAdvance will be called again and again, but the buffer is still never filled with more than one element at a time).
Sidenote 3: If you want to look at the real code, the most interesting places are j.u.s.StreamSpliterators.WrappingSpliterator.tryAdvance, which calls j.u.s.StreamSpliterators.AbstractWrappingSpliterator.doAdvance, which in turn calls j.u.s.StreamSpliterators.AbstractWrappingSpliterator.fillBuffer, which in turn calls the pusher that is initialized at j.u.s.StreamSpliterators.WrappingSpliterator.initPartialTraversalState.
So the main thing that's hurting performance is this copying into the buffer.
Unfortunately for us ordinary Java developers, the current implementation of the Stream API is pretty much closed, and you can't modify only some aspects of its internal behavior using inheritance or composition.
You may use some reflection-based hacking to make copying-to-buffer more efficient for your specific case and gain some performance (but sacrifice laziness of the Stream) but you can't avoid this copying altogether and thus Spliterator-based code will be slower anyway.
Going back to the example from the Sidenote #2, Spliterator-based test with materialized filteredData works faster because there is no WrappingSp in the pipeline before OddLineSp and thus there will be no copying into an intermediate buffer.

Avoiding duplicate code when performing operation on different object properties

I have recently run into a problem which has had me thinking in circles. Assume that I have an object of type O with properties O.A and O.B. Also assume that I have a collection of instances of type O, where O.A and O.B are defined for each instance.
Now assume that I need to perform some operation (like sorting) on a collection of O instances using either O.A or O.B, but not both at any given time. My original solution is as follows.
Example -- just for demonstration, not production code:
public class O {
int A;
int B;
}
public static class Utils {
public static void SortByA (O[] collection) {
// Sort the objects in the collection using O.A as the key. Note: this is custom sorting logic, so it is not simply a one-line call to a built-in sort method.
}
public static void SortByB (O[] collection) {
// Sort the objects in the collection using O.B as the key. Same logic as above.
}
}
What I would love to do is this...
public static void SortAgnostic (O[] collection, FieldRepresentation x /* some non-bool, non-int variable representing whether to choose O.A or O.B as the sorting key */) {
// Sort by whatever "x" represents...
}
... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me. Perhaps I am incorrect on that (and I am sure someone will correct me if that statement is wrong :D), but that is my current thought nonetheless.
Question: What is the best way to implement this method? The logic that I have to implement is difficult to break down into smaller methods, as it is already fairly optimized. At the root of the issue is the fact that I need to perform the same operation using different properties of an object. I would like to stay away from using codes/flags/etc. in the method signature if possible so that the solution can be as robust as possible.
Note: When answering this question, please approach it from an algorithmic point of view. I am aware that some language-specific features may be suitable alternatives, but I have encountered this problem before and would like to understand it from a relatively language-agnostic viewpoint. Also, please do not constrain responses to sorting solutions only, as I have only chosen it as an example. The real question is how to avoid code duplication when performing an identical operation on two different properties of an object.
"The real question is how to avoid code duplication when performing an identical operation on two different properties of an object."
This is a very good question as this situation arises all the time. I think, one of the best ways to deal with this situation is to use the following pattern.
public class O {
int A;
int B;
}
public void doOperationX1() {
doOperationX(something to indicate which property to use);
}
public void doOperationX2() {
doOperationX(something to indicate which property to use);
}
private void doOperationX(input) {
// actual work is done here
}
In this pattern, the actual implementation is performed in a private method, which is called by public methods, with some extra information. For example, in this case, it can be
doOperationX(A), or doOperationX(B), or something like that.
My Reasoning: In my opinion this pattern is optimal as it achieves two main requirements:
It keeps the public interface descriptive and clear, as it keeps operations separate, and avoids flags etc that you also mentioned in your post. This is good for the client.
From the implementation perspective, it prevents duplication, as it is in one place. This is good for the development.
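One possible way to express this pattern in Java is sketched below, using a key-extractor function as the "something to indicate which property to use". The names OSorter, sortByA, sortByB and sortBy are hypothetical, and the insertion sort merely stands in for the custom sorting logic mentioned in the question.

```java
import java.util.function.ToIntFunction;

final class OSorter {
    static final class O {
        final int a;
        final int b;
        O(int a, int b) { this.a = a; this.b = b; }
    }

    // Public, descriptive entry points: no flags in the signatures.
    public static void sortByA(O[] collection) { sortBy(collection, o -> o.a); }
    public static void sortByB(O[] collection) { sortBy(collection, o -> o.b); }

    // The single private implementation. An insertion sort is used here
    // only as a placeholder for the real, custom algorithm.
    private static void sortBy(O[] items, ToIntFunction<O> key) {
        for (int i = 1; i < items.length; i++) {
            O current = items[i];
            int currentKey = key.applyAsInt(current);
            int j = i - 1;
            while (j >= 0 && key.applyAsInt(items[j]) > currentKey) {
                items[j + 1] = items[j]; // shift larger elements to the right
                j--;
            }
            items[j + 1] = current;
        }
    }
}
```

Callers only see sortByA and sortByB, while the algorithm itself exists in one place.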
A simple way to approach this, I think, is to move the choice of the sort field into the class O itself. This way the solution can be language-agnostic.
The implementation in Java could use an abstract class for O, where the purpose of the abstract method getSortField() would be to return the field to sort by. All that the invoking code would need to do is implement the abstract method to return the desired field.
O o = new O() {
public int getSortField() {
return A;
}
};
The problem might be reduced to obtaining the value of the specified field from the given object so it can be used for sorting purposes, for example:
TField getValue(TEntity entity, string fieldName)
{
// Return value of field "A" from entity,
// implementation depends on language of choice, possibly with
// some sort of reflection support
}
This method can be used to substitute comparisons within the sorting algorithm,
if (getValue(o[i], "A") > getValue(o[j], "A"))
{
swap(i, j);
}
The field name can then be parametrized, as,
public static void SortAgnostic (O[] collection, string fieldName)
{
if (getValue(collection[i], fieldName) > getValue(collection[j], fieldName))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, "A").
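In Java, a getValue of this kind could be sketched with reflection as follows. The class name FieldLookup is hypothetical, and the sketch assumes the fields are int fields that are accessible from the calling class (here they are package-private). Reflection gives up compile-time checking of the field name, so a misspelled name only fails at run time.

```java
import java.lang.reflect.Field;

final class FieldLookup {
    static final class O {
        int A;
        int B;
        O(int a, int b) { A = a; B = b; }
    }

    // Reads the int field with the given name from the entity.
    static int getValue(O entity, String fieldName) {
        try {
            Field field = O.class.getDeclaredField(fieldName);
            return field.getInt(entity);
        } catch (ReflectiveOperationException e) {
            // Covers both an unknown field name and an inaccessible field.
            throw new IllegalArgumentException("Cannot read field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        O o = new O(7, 9);
        System.out.println(getValue(o, "A") < getValue(o, "B"));
    }
}
```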
Some languages allow you to express the field in a more elegant way,
public static void SortAgnostic (O[] collection, Expression fieldExpression)
{
if (getValue(collection[i], fieldExpression) >
getValue(collection[j], fieldExpression))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, entity => entity.A).
And yet another option can be passing a pointer to a function which will return the value of the field needed,
public static void SortAgnostic (O[] collection, Function getValue)
{
if (getValue(collection[i]) > getValue(collection[j]))
{
swap(i, j);
}
...
}
Given a function such as
TField getValueOfA(TEntity entity)
{
return entity.A;
}
you can call it like SortAgnostic(collection, getValueOfA).
"... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me"
That is why you should use available tools, such as frameworks or other types of code libraries, that provide the solution you need.
When some mechanism is common, that means it can be moved to a higher level of abstraction. When you cannot find a proper solution, try to create your own. Think about the result of the operation as not being part of the class's functionality. Sorting is only a feature, which is why it should not be part of your class from the beginning. Try to keep the class as simple as possible.
Do not worry prematurely about whether it makes sense to have something small just because it is small. Focus on how it will finally be used. If you use one type of sorting very often, just create a definition of it so you can reuse it. You do not necessarily have to create a utility class and then call it. Sometimes the base functionality enclosed in a utility class is good enough.
I assume that you use Java:
In your case the wheel has already been invented, in the form of Collections#sort(List, Comparator).
To make use of it, you could create an enum type that implements the Comparator interface with predefined sorting types.
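A minimal sketch of such an enum is shown below. The names EnumSortDemo, OOrder, BY_A and BY_B are hypothetical; only Collections.sort and Comparator come from the Java library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class EnumSortDemo {
    static final class O {
        final int a;
        final int b;
        O(int a, int b) { this.a = a; this.b = b; }
    }

    // Each constant is a predefined ordering for O.
    enum OOrder implements Comparator<O> {
        BY_A {
            @Override
            public int compare(O x, O y) { return Integer.compare(x.a, y.a); }
        },
        BY_B {
            @Override
            public int compare(O x, O y) { return Integer.compare(x.b, y.b); }
        }
    }

    public static void main(String[] args) {
        List<O> items = new ArrayList<>(Arrays.asList(new O(3, 1), new O(1, 3), new O(2, 2)));
        Collections.sort(items, OOrder.BY_A);
        System.out.println(items.get(0).a);
        Collections.sort(items, OOrder.BY_B);
        System.out.println(items.get(0).b);
    }
}
```

The enum keeps the set of supported orderings closed and named, at the price of being exactly the kind of small dedicated type the question hoped to avoid.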

Efficient implementation of immutable (double) LinkedList

Having read the question "Immutable or not immutable?" and the answers to my previous questions on immutability, I am still a bit puzzled about how to efficiently implement a simple LinkedList that is immutable. For an array that seems to be easy: copy the array and return a new structure based on that copy.
Supposedly we have a general class of Node:
class Node{
private Object value;
private Node next;
}
And class LinkedList based on the above allowing the user to add, remove etc. Now, how would we ensure immutability? Should we recursively copy all the references to the list when we insert an element?
I am also curious about the answers in "Immutable or not immutable?" that mention a certain optimization leading to log(n) time and space with the help of a binary tree. Also, I read somewhere that adding an element to the front is O(1) as well. This puzzles me greatly: if we don't copy the references, then in reality we are modifying the same data structure from two different sources, which breaks immutability...
Would any of your answers also work on doubly-linked lists? I look forward to any replies/pointers to any other questions/solutions. Thanks in advance for your help.
Supposedly we have a general class of Node and class LinkedList based on the above allowing the user to add, remove etc. Now, how would we ensure immutability?
You ensure immutability by making every field of the object readonly, and ensuring that every object referred to by one of those readonly fields is also an immutable object. If the fields are all readonly and only refer to other immutable data, then clearly the object will be immutable!
Should we recursively copy all the references to the list when we insert an element?
You could. The distinction you are getting at here is the difference between immutable and persistent. An immutable data structure cannot be changed. A persistent data structure takes advantage of the fact that a data structure is immutable in order to re-use its parts.
A persistent immutable linked list is particularly easy:
abstract class ImmutableList
{
public static readonly ImmutableList Empty = new EmptyList();
private ImmutableList() {}
public abstract int Head { get; }
public abstract ImmutableList Tail { get; }
public abstract bool IsEmpty { get; }
public abstract ImmutableList Add(int head);
private sealed class EmptyList : ImmutableList
{
public override int Head { get { throw new Exception(); } }
public override ImmutableList Tail { get { throw new Exception(); } }
public override bool IsEmpty { get { return true; } }
public override ImmutableList Add(int head)
{
return new List(head, this);
}
}
private sealed class List : ImmutableList
{
private readonly int head;
private readonly ImmutableList tail;
public List(int head, ImmutableList tail)
{
this.head = head;
this.tail = tail;
}
public override int Head { get { return head; } }
public override ImmutableList Tail { get { return tail; } }
public override bool IsEmpty { get { return false; } }
public override ImmutableList Add(int head)
{
return new List(head, this);
}
}
}
...
ImmutableList list1 = ImmutableList.Empty;
ImmutableList list2 = list1.Add(100);
ImmutableList list3 = list2.Add(400);
And there you go. Of course you would want to add better exception handling and more methods, like IEnumerable<int> methods. But there is a persistent immutable list. Every time you make a new list, you re-use the contents of an existing immutable list; list3 re-uses the contents of list2, which it can do safely because list2 is never going to change.
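For readers more at home in Java, the same idea can be sketched as follows. This is an illustrative translation under simplifying assumptions, not the code above: the class name PersistentIntList is hypothetical, and a single class with a sentinel EMPTY instance replaces the two nested subclasses.

```java
final class PersistentIntList {
    // Sentinel for the empty list; its head and tail are never exposed.
    static final PersistentIntList EMPTY = new PersistentIntList(0, null);

    private final int head;
    private final PersistentIntList tail;

    private PersistentIntList(int head, PersistentIntList tail) {
        this.head = head;
        this.tail = tail;
    }

    boolean isEmpty() { return this == EMPTY; }

    int head() {
        if (isEmpty()) throw new IllegalStateException("empty list");
        return head;
    }

    PersistentIntList tail() {
        if (isEmpty()) throw new IllegalStateException("empty list");
        return tail;
    }

    // O(1): the new node simply points at the existing, unchangeable list.
    PersistentIntList add(int value) {
        return new PersistentIntList(value, this);
    }

    public static void main(String[] args) {
        PersistentIntList list1 = PersistentIntList.EMPTY;
        PersistentIntList list2 = list1.add(100);
        PersistentIntList list3 = list2.add(400);
        // list3 re-uses list2 itself rather than a copy of it.
        System.out.println(list3.tail() == list2);
    }
}
```

Because every field is final and no method mutates a node, sharing list2 between several longer lists is safe.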
Would any of your answers also work on doubly-linked lists?
You can of course easily make a doubly-linked list that does a full copy of the entire data structure every time, but that would be dumb; you might as well just use an array and copy the entire array.
Making a persistent doubly-linked list is quite difficult but there are ways to do it. What I would do is approach the problem from the other direction. Rather than saying "can I make a persistent doubly-linked list?" ask yourself "what are the properties of a doubly-linked list that I find attractive?" List those properties and then see if you can come up with a persistent data structure that has those properties.
For example, if the property you like is that doubly-linked lists can be cheaply extended from either end, cheaply broken in half into two lists, and two lists can be cheaply concatenated together, then the persistent structure you want is an immutable catenable deque, not a doubly-linked list. I give an example of an immutable non-catenable deque here:
http://blogs.msdn.com/b/ericlippert/archive/2008/02/12/immutability-in-c-part-eleven-a-working-double-ended-queue.aspx
Extending it to be a catenable deque is left as an exercise; the paper I link to on finger trees is a good one to read.
UPDATE:
According to the above, we need to copy the prefix up to the insertion point. By the logic of immutability, if we delete anything from the prefix, we get a new list, and the same goes for the suffix... Why copy only the prefix then, and not the suffix?
Well consider an example. What if we have the list (10, 20, 30, 40), and we want to insert 25 at position 2? So we want (10, 20, 25, 30, 40).
What parts can we reuse? The tails we have in hand are (20, 30, 40), (30, 40) and (40). Clearly we can re-use (30, 40).
Drawing a diagram might help. We have:
10 ----> 20 ----> 30 -----> 40 -----> Empty
and we want
10 ----> 20 ----> 25 -----> 30 -----> 40 -----> Empty
so let's make
| 10 ----> 20 --------------> 30 -----> 40 -----> Empty
|                        /
| 10 ----> 20 ----> 25 -/
We can re-use (30, 40) because that part is in common to both lists.
UPDATE:
Would it be possible to provide the code for random insertion and deletion as well?
Here's a recursive solution:
ImmutableList InsertAt(int value, int position)
{
if (position < 0)
throw new Exception();
else if (position == 0)
return this.Add(value);
else
return tail.InsertAt(value, position - 1).Add(head);
}
Do you see why this works?
Now as an exercise, write a recursive DeleteAt.
Now, as an exercise, write a non-recursive InsertAt and DeleteAt. Remember, you have an immutable linked list at your disposal, so you can use one in your iterative solution!
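If you want to check your answer to the first exercise, here is one possible Java sketch of the recursive insertion together with a recursive deletion. The class name IntCons and its members are hypothetical, and the node class is deliberately minimal. It copies the prefix and shares the suffix, as in the diagram above.

```java
final class IntCons {
    static final IntCons EMPTY = new IntCons(0, null);

    final int head;
    final IntCons tail;

    private IntCons(int head, IntCons tail) {
        this.head = head;
        this.tail = tail;
    }

    boolean isEmpty() { return this == EMPTY; }

    IntCons add(int value) { return new IntCons(value, this); }

    // Rebuilds the nodes before 'position'; everything after is shared.
    IntCons insertAt(int value, int position) {
        if (position < 0) throw new IndexOutOfBoundsException();
        if (position == 0) return add(value);
        if (isEmpty()) throw new IndexOutOfBoundsException();
        return tail.insertAt(value, position - 1).add(head);
    }

    // Same shape: copy the prefix, then link it to the shared suffix.
    IntCons deleteAt(int position) {
        if (position < 0 || isEmpty()) throw new IndexOutOfBoundsException();
        if (position == 0) return tail;
        return tail.deleteAt(position - 1).add(head);
    }

    public static void main(String[] args) {
        IntCons list = EMPTY.add(40).add(30).add(20).add(10); // (10, 20, 30, 40)
        IntCons inserted = list.insertAt(25, 2);              // (10, 20, 25, 30, 40)
        // The (30, 40) suffix is the very same object in both lists.
        System.out.println(inserted.tail.tail.tail == list.tail.tail);
    }
}
```

Both operations allocate one new node per element of the prefix, so they are O(position) in time and extra space.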
Should we recursively copy all the references to the list when we insert an element?
You should recursively copy the prefix of the list up until the insertion point, yes.
That means that insertion into an immutable linked list is O(n). (As is inserting (not overwriting) an element in an array.)
For this reason insertion is usually frowned upon (along with appending and concatenation).
The usual operation on immutable linked lists is "cons", i.e. appending an element at the start, which is O(1).
You can see clearly the complexity in e.g. a Haskell implementation. Given a linked list defined as a recursive type:
data List a = Empty | Node a (List a)
we can define "cons" (inserting an element at the front) directly as:
cons a xs = Node a xs
This is clearly an O(1) operation. Insertion, by contrast, must be defined recursively: find the insertion point, copy the prefix that comes before it, and link that copy to the new node, which in turn references the (immutable, shared) tail.
The important thing to remember about linked lists is:
linear access
For immutable lists this means:
copying the prefix of a list
sharing the tail.
If you are frequently inserting new elements, a structure with logarithmic-time operations, such as a tree, is preferred.
There is a way to emulate "mutation": using immutable maps.
For a linked list of Strings (in Scala-style pseudocode):
case class ListItem(s:String, id:UUID, nextID: UUID)
then the ListItems can be stored in a map where the key is UUID:
type MyList = Map[UUID, ListItem]
If I want to insert a new ListItem into val list : MyList :
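def insertAfter(l:MyList, e:ListItem)={
  // getElementBefore and getElementAfter are assumed helpers that return the future neighbours of e
  val beforeE=l.getElementBefore(e)
  val afterE=l.getElementAfter(e)
  // the new item points at its successor
  val eToInsert=e.copy(nextID=afterE.id)
  // the predecessor now points at the new item
  val beforeE_new=beforeE.copy(nextID=e.id)
  val l_tmp=l.update(beforeE.id,beforeE_new)
  return l_tmp.add(eToInsert)
}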
Here add, update and get take effectively constant time when a hash-based immutable Map is used: http://docs.scala-lang.org/overviews/collections/performance-characteristics
A doubly linked list can be implemented in the same way.

Sorted hash table (map, dictionary) data structure design

Here's a description of the data structure:
It operates like a regular map with get, put, and remove methods, but has a sort method that can be called to sort the map. However, the map remembers its sorted structure, so subsequent calls to sort can be much quicker (if the structure doesn't change too much between calls to sort).
For example:
I call the put method 1,000,000 times.
I call the sort method.
I call the put method 100 more times.
I call the sort method.
The second time I call the sort method should be a much quicker operation, as the map's structure hasn't changed much. Note that the map doesn't have to maintain sorted order between calls to sort.
I understand that it might not be possible, but I'm hoping for O(1) get, put, and remove operations. Something like TreeMap provides guaranteed O(log(n)) time cost for these operations, but always maintains a sorted order (no sort method).
So what's the design of this data structure?
Edit 1 - returning the top-K entries
Although I'd enjoy hearing the answer to the general case above, my use case has gotten more specific: I don't need the whole thing sorted; just the top K elements.
Data structure for efficiently returning the top-K entries of a hash table (map, dictionary)
Thanks!
For "O(1) get, put, and remove operations" you essentially need O(1) lookup, which implies a hash function (as you know), but the requirements of a good hash function often conflict with the requirement to be easily sorted. (If you had a hash table where adjacent values mapped to the same bucket, it would degenerate to O(N) on lots of common data, which is the worst case you typically want a hash function to avoid.)
I can think of how to get you 90% of the way there. Set up a hashtable alongside a parallel index that is sorted. The index has a clean part (ordered) and a dirty part (unordered). The index would map keys to the values (or references to the values stored in the hashtable - whichever suits you in terms of performance or memory use). When you add to the hashtable, the new entry is pushed onto the back of the dirty list. When you remove from the hashtable, the entry is nulled/removed from the clean and dirty parts of the index. You can sort the index, which sorts the dirty entries only, then merges them into the already sorted 'clean' part of the index. And obviously you can iterate over the index.
As far as I can see, this gives you O(1) everywhere except on the remove operation, and it is still fairly simple to implement with standard containers (at least as provided by C++, Java, or Python). It also gives you the "second sort is cheaper" condition: you only need to sort the dirty index entries and then do an O(N) merge. The cost of all this is obviously extra memory for the index and extra indirection when using it.
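To make the clean/dirty idea concrete, here is a rough Java sketch. The class LazySortedMap and its method names are made up for this illustration, and it indexes keys only (values stay in the hash table). It assumes keys are Comparable and that removal is allowed to be O(n), as noted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a hash map with a lazily sorted key index (hypothetical names). */
final class LazySortedMap<K extends Comparable<K>, V> {
    private final Map<K, V> table = new HashMap<>();
    private List<K> clean = new ArrayList<>();        // sorted as of the last sort()
    private final List<K> dirty = new ArrayList<>();  // keys added since the last sort()

    V get(K key) {
        return table.get(key);
    }

    /** O(1) expected: new keys are just pushed onto the dirty list. */
    V put(K key, V value) {
        if (!table.containsKey(key)) {
            dirty.add(key);
        }
        return table.put(key, value);
    }

    /** O(n): the key has to be taken out of whichever index part holds it. */
    V remove(K key) {
        if (!table.containsKey(key)) {
            return null;
        }
        int i = Collections.binarySearch(clean, key);
        if (i >= 0) {
            clean.remove(i);      // remove by position in the sorted part
        } else {
            dirty.remove(key);    // linear scan of the unsorted part
        }
        return table.remove(key);
    }

    /** Sorts only the dirty keys, then merges them into the clean part. */
    void sort() {
        Collections.sort(dirty);
        List<K> merged = new ArrayList<>(clean.size() + dirty.size());
        int i = 0;
        int j = 0;
        while (i < clean.size() && j < dirty.size()) {
            if (clean.get(i).compareTo(dirty.get(j)) <= 0) {
                merged.add(clean.get(i++));
            } else {
                merged.add(dirty.get(j++));
            }
        }
        while (i < clean.size()) merged.add(clean.get(i++));
        while (j < dirty.size()) merged.add(dirty.get(j++));
        clean = merged;
        dirty.clear();
    }

    /** Keys in the order of the last sort(), followed by the not-yet-sorted keys. */
    List<K> keys() {
        List<K> all = new ArrayList<>(clean);
        all.addAll(dirty);
        return all;
    }
}
```

With d dirty keys and n clean ones, sort() costs roughly O(d log d + n + d), so the second sort in the question's scenario (100 new entries on top of 1,000,000) is dominated by the linear merge.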
Why exactly do you need a sort() function?
What you perhaps want and need is a red-black tree.
http://en.wikipedia.org/wiki/Red-black_tree
These trees automatically keep your input sorted by a comparator you supply. They are complex, but have excellent O(log n) characteristics for insertion, lookup and removal. Couple the tree (for ordered keys) with a hash map (as the dictionary) and you get your data structure.
In Java it is implemented as TreeMap, which is an implementation of SortedMap.
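As a small illustration of what TreeMap gives you out of the box, including the top-K case from the question's edit, something along these lines should work. The helper class TreeMapDemo and the method topKeys are hypothetical names; descendingKeySet is a standard TreeMap method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class TreeMapDemo {
    /** Returns up to k of the largest keys, largest first, without any explicit sort. */
    static <K, V> List<K> topKeys(TreeMap<K, V> map, int k) {
        List<K> result = new ArrayList<>();
        // The tree is always ordered, so we just walk it from the high end.
        for (K key : map.descendingKeySet()) {
            if (result.size() == k) {
                break;
            }
            result.add(key);
        }
        return result;
    }
}
```

The trade-off is the one mentioned in the question: every put, get and remove is O(log n) instead of O(1), in exchange for never needing a separate sort step.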
What you're looking at is a hashtable with pointers in the entries to the next entry in sorted order. It's a lot like the LinkedHashMap in Java, except that the links track a sort order rather than the insertion order. You can actually implement this entirely by wrapping a LinkedHashMap and having the implementation of sort transfer the entries from the LinkedHashMap into a TreeMap and then back into a LinkedHashMap.
Here's an implementation that sorts the entries in an array list rather than transferring to a tree map. I think the sort algorithm used by Collections.sort will do a good job of merging the new entries into the already sorted portion.
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SortaSortedMap<K extends Comparable<K>,V> implements Map<K,V> {
    // Iteration order of this map is the order established by the last sort(),
    // followed by any entries added since then.
    private LinkedHashMap<K,V> innerMap;
    public SortaSortedMap() {
        this.innerMap = new LinkedHashMap<K,V>();
    }
    public SortaSortedMap(Map<K,V> map) {
        this.innerMap = new LinkedHashMap<K,V>(map);
    }
    public Collection<V> values() {
        return innerMap.values();
    }
    public int size() {
        return innerMap.size();
    }
    public V remove(Object key) {
        return innerMap.remove(key);
    }
    public V put(K key, V value) {
        return innerMap.put(key, value);
    }
    public Set<K> keySet() {
        return innerMap.keySet();
    }
    public boolean isEmpty() {
        return innerMap.isEmpty();
    }
    public Set<Entry<K, V>> entrySet() {
        return innerMap.entrySet();
    }
    public boolean containsKey(Object key) {
        return innerMap.containsKey(key);
    }
    public V get(Object key) {
        return innerMap.get(key);
    }
    public boolean containsValue(Object value) {
        return innerMap.containsValue(value);
    }
    public void clear() {
        innerMap.clear();
    }
    public void putAll(Map<? extends K, ? extends V> m) {
        innerMap.putAll(m);
    }
    // Copies the entries out, sorts them by key, and rebuilds the linked map in that order.
    public void sort() {
        List<Map.Entry<K,V>> entries = new ArrayList<Map.Entry<K,V>>(innerMap.entrySet());
        Collections.sort(entries, new KeyComparator());
        LinkedHashMap<K,V> newMap = new LinkedHashMap<K,V>();
        for (Map.Entry<K,V> e: entries) {
            newMap.put(e.getKey(), e.getValue());
        }
        innerMap = newMap;
    }
    private class KeyComparator implements Comparator<Map.Entry<K,V>> {
        public int compare(Entry<K, V> o1, Entry<K, V> o2) {
            return o1.getKey().compareTo(o2.getKey());
        }
    }
}
I don't know if there's a name, but you could store the current index of each item on the hash.
That is, you have a HashMap<Object, Pair<Integer, Object>>
and a List<Object> objects
When you put, add to the tail or head of the list and insert into the hashmap with your data and the index of insertion. This is O(1).
When you get, pull from the hashmap and ignore the index. This is O(1).
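When you remove, you pull from the map. Take the index and remove from the list as well. This is O(1) as long as removal from the list at a known position is constant time, for example by moving the last element into the vacated slot and updating that element's stored index.
When you sort, just sort the list. Either update the indexes in the map during the sort, or update them after the sort is complete. This does not affect the O(n log n) sort, as it's a linear step: O(n log n + n) == O(n log n).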
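A possible Java sketch of this, under a few assumptions of mine: the list holds keys rather than arbitrary objects, keys are Comparable, and removal moves the last list element into the vacated slot so it stays O(1). IndexedMap and Slot are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hash map whose entries remember their position in a sortable key list (sketch). */
final class IndexedMap<K extends Comparable<K>, V> {
    /** Value plus the current index of its key in 'order'. */
    private static final class Slot<T> {
        T value;
        int index;

        Slot(T value, int index) {
            this.value = value;
            this.index = index;
        }
    }

    private final Map<K, Slot<V>> table = new HashMap<>();
    private final List<K> order = new ArrayList<>();

    V get(K key) {
        Slot<V> slot = table.get(key);
        return slot == null ? null : slot.value;
    }

    /** O(1) expected: append to the list and remember the position. */
    void put(K key, V value) {
        Slot<V> slot = table.get(key);
        if (slot != null) {
            slot.value = value;   // existing key keeps its position
            return;
        }
        table.put(key, new Slot<>(value, order.size()));
        order.add(key);
    }

    /** O(1) expected: the last key is moved into the removed key's slot. */
    V remove(K key) {
        Slot<V> slot = table.remove(key);
        if (slot == null) {
            return null;
        }
        int last = order.size() - 1;
        K moved = order.get(last);
        order.set(slot.index, moved);
        order.remove(last);
        if (slot.index != last) {
            table.get(moved).index = slot.index;
        }
        return slot.value;
    }

    /** O(n log n) sort followed by a linear pass that refreshes the stored indexes. */
    void sort() {
        Collections.sort(order);
        for (int i = 0; i < order.size(); i++) {
            table.get(order.get(i)).index = i;
        }
    }

    List<K> keys() {
        return new ArrayList<>(order);
    }
}
```

Note that the swap on removal disturbs the sorted order locally, so the list is only guaranteed to be sorted right after a call to sort(), which matches what the question asks for.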
Ordered Dictionary
Recent versions of Python (2.7, 3.1) have "ordered dictionaries" which sound like what you're describing.
The official Python "ordered dictionary" implementation is inspired by previous third-party implementations, as described in PEP 372.
References:
collections.OrderedDict documentation for Python 2.7
collections.OrderedDict documentation for Python 3.1
PEP 372
ActiveState Ordered Dictionary recipe for Python ≥ 2.4
I'm not aware of a data structure classification with that exact behavior, at least not in Java Collections (or from nonlinear data structures class). Perhaps you can implement it, and it will henceforth be known as the RudigerMap.

Resources