Mux & Demux w/ LINQ - linq

I'm playing around with using LINQ to Objects for multiplexing and demultiplexing but it seems to me that this is a pretty tricky problem.
See this demuxer signature:
public static IEnumerable<IEnumerable<TSource>> Demux<TSource>(this IEnumerable<TSource> source, int multiplexity)
On an abstract level this is easy but ideally one would want to
remain lazy for the source stream
remain lazy for each multiplexed stream
not reiterate over the same elements
How would you do this?
I'm a bit tired so it could be my concentration failing me here...

Assuming that you want (0, 1, 2, 3) to end up as (0, 2) and (1, 3) when demuxing to two streams, you basically can't do it without buffering. You could buffer only when necessary, but that would be difficult. Basically you need to be able to cope with two contradictory ways of using the call...
Getting both iterators, and reading one item from each of them:
// Ignoring disposing of iterators etc
var query = source.Demux(2);
var demuxIterator = query.GetEnumerator();
demuxIterator.MoveNext();
var first = demuxIterator.Current;
demuxIterator.MoveNext();
var second = demuxIterator.Current;
first.MoveNext();
Console.WriteLine(first.Current); // Prints 0
second.MoveNext();
Console.WriteLine(second.Current); // Prints 1
Or getting one iterator, then reading both items:
// Ignoring disposing of iterators etc
var query = source.Demux(2);
var demuxIterator = query.GetEnumerator();
demuxIterator.MoveNext();
var first = demuxIterator.Current;
first.MoveNext();
Console.WriteLine(first.Current); // Prints 0
first.MoveNext();
Console.WriteLine(first.Current); // Prints 2
In the second case, it has to either remember the 1, or be able to reread it.
Any chance you could deal with IList<T> instead of IEnumerable<T>? That would "break" the rest of LINQ to Objects, admittedly - lazy projections etc would be a thing of the past.
Note that this is quite similar to the problem that operations like GroupBy have - they're deferred, but not lazy: as soon as you start reading from a GroupBy result, it reads the whole of the input data.

Related

Recursion transformation without stack frames code repetitions

I have the following pseudo-code:
function X(data, limit, level = 0)
{
result = [];
foreach (Y(data, level) as entity) {
if (level < limit) {
result = result + X(entity, limit, level + 1);
} else {
//trivial recursion case:
result = result + Z(entity);
}
}
return result;
}
which I need to turn into a plain (e.g. without recursive calls). So far I'm out of ideas regarding how to do that elegantly. Following this answer I see that I must construct the entire stack frames which are basically the code repetitions (i.e. I will place same code again and again with different return addresses).
Or I tried stuff like these suggestions - where there is a phrase
Find a recursive call that’s not a tail call.
Identify what work is being done between that call and its return statement.
But I do not understand how can the "work" be identified in the case when it is happening from within internal loop.
So, my problem is that all examples above are providing cases when the "work can be easily identified" because there are no control instructions from within the function. I understand the concept behind recursion on a compilation level, but what I want to avoid is code repetition. So,
My question: how to approach transformation of the pseudo-code above which does not mean code repetitions for simulating stack frames?
It looks like an algorithm to descend a nested data structure (lists of lists) and flatten it out into a single list. It would have been good to have a simple description like that in the question.
To do that, you need to keep track of multiple indices / iterators / cursors, one at each level that you've descended through. A recursive implementation does that by using the call stack. A non-recursive implementation will need a manually-implemented stack data structure where you can push/pop stuff.
Since you don't have to save context (registers) and a return address on the call stack, just the actual iterator (e.g. array index), this can be a lot more space efficient.
When you're looping over the result of Y and need to call X or Z, push the current state onto the stack. Branch back to the beginning of the foreach, and call Y on the new entity. When you get to the end of a loop, pop the old state if there is any, and pick up in the middle of that loop.

Java8 style for comparing arrays? (Streams and Math3)

I'm just beginning to learn Java8 streams and Apache commons Math3 at the same time, and looking for missed opportunities to simplify my solution for comparing instances for equality. Consider this Math3 RealVector:
RealVector testArrayRealVector =
new ArrayRealVector(new double [] {1d, 2d, 3d});
and consider this member variable containing boxed doubles, plus this copy of it as an array list collection:
private final Double [] m_ADoubleArray = {13d, 14d, 15d};
private final Collection<Double> m_CollectionArrayList =
new ArrayList<>(Arrays.asList(m_ADoubleArray));
Here is my best shot at comparing these in a functional style in a JUnit class (full gist here), using protonpack from codepoetix because I couldn't find zip in the Streams library. This looks really baroque to my eyes and I wonder whether I've missed ways to make this shorter, faster, simpler, better because I'm just beginning to learn this stuff and don't know much.
// Make a stream out of the RealVector:
DoubleStream testArrayRealVectorStream =
Arrays.stream(testArrayRealVector.toArray());
// Check the type of that Stream
assertTrue("java.util.stream.DoublePipeline$Head" ==
testArrayRealVectorStream.getClass().getTypeName());
// Use up the stream:
assertEquals(3, testArrayRealVectorStream.count());
// Old one is used up; make another:
testArrayRealVectorStream = Arrays.stream(testArrayRealVector.toArray());
// Make a new stream from the member-var arrayList;
// do arithmetic on the copy, leaving the original unmodified:
Stream<Double> collectionStream = getFreshMemberVarStream();
// Use up the stream:
assertEquals(3, collectionStream.count());
// Stream is now used up; make new one:
collectionStream = getFreshMemberVarStream();
// Doesn't seem to be any way to use zip on the real array vector
// without boxing it.
Stream<Double> arrayRealVectorStreamBoxed =
testArrayRealVectorStream.boxed();
assertTrue(zip(
collectionStream,
arrayRealVectorStreamBoxed,
(l, r) -> Math.abs(l - r) < DELTA)
.reduce(true, (a, b) -> a && b));
where
private Stream<Double> getFreshMemberVarStream() {
return m_CollectionArrayList
.stream()
.map(x -> x - 12.0);
}
Again, here is a gist of my entire JUnit test class.
It seems you are trying to bail in Streams at all cost.
If I understand you correctly, you have
double[] array1=testArrayRealVector.toArray();
Double[] m_ADoubleArray = {13d, 14d, 15d};
as starting point. Then, the first thing you can do is to verify the lengths of these arrays:
assertTrue(array1.length==m_ADoubleArray.length);
assertEquals(3, array1.length);
There is no point in wrapping the arrays into a stream and calling count() and, of course, even less in wrapping an array into a collection to call stream().count() on it. Note that if your starting point is a Collection, calling size() will do as well.
Given that you already verified the length, you can simply do
IntStream.range(0, 3).forEach(ix->assertEquals(m_ADoubleArray[ix]-12, array1[ix], DELTA));
to compare the elements of the arrays.
or when you want to apply arithmetic as a function:
// keep the size check as above as the length won’t change
IntToDoubleFunction f=ix -> m_ADoubleArray[ix]-12;
IntStream.range(0, 3).forEach(ix -> assertEquals(f.applyAsDouble(ix), array1[ix], DELTA));
Note that you can also just create a new array using
double[] array2=Arrays.stream(m_ADoubleArray).mapToDouble(d -> d-12).toArray();
and compare the arrays similar to above:
IntStream.range(0, 3).forEach(ix -> assertEquals(array1[ix], array2[ix], DELTA));
or just using
assertArrayEquals(array1, array2, DELTA);
as now both arrays have the same type.
Don’t think about that temporary three element array holding the intermediate result. All other attempts consume far more memory…

java 8 stream interference versus non-interference

I understand why the following code is ok. Because the collection is being modified before calling the terminal operation.
List<String> wordList = ...;
Stream<String> words = wordList.stream();
wordList.add("END"); // Ok
long n = words.distinct().count();
But why is this code is not ok?
Stream<String> words = wordList.stream();
words.forEach(s -> if (s.length() < 12) wordList.remove(s)); // Error—interference
Stream.forEach() is a terminal operation, and the underlying wordList collection is modified after the terminal has been started/called.
Joachim's answer is correct, +1.
You didn't ask specifically, but for the benefit of other readers, here are a couple techniques for rewriting the program a different way, avoiding stream interference problems.
If you want to mutate the list in-place, you can do so with a new default method on List instead of using streams:
wordList.removeIf(s -> s.length() < 12);
If you want to leave the original list intact but create a modified copy, you can use a stream and a collector to do that:
List<String> newList = wordList.stream()
.filter(s -> s.length() >= 12)
.collect(Collectors.toList());
Note that I had to invert the sense of the condition, since filter takes a predicate that keeps values in the stream if the condition is true.

Scala collection transformation performance: single looping vs. multiple looping

When there is a collection and you must perform two or more operations on all of its elements, what is faster?:
val f1: String => String = _.reverse
val f2: String => String = _.toUpperCase
val elements: Seq[String] = List("a", "b", "c")
iterate multiple times and perform one operation on one loop
val result = elements.map(f1).map(f2)
This approach does have the advantage, that the result after application of the first function could be reused.
iterate one time and perform all operation on each element together
val result = elements.map(element => f2(f1(element)))
or
val result = elements.map(element => f1.compose(f2)
Is there any difference in performance between these two approaches? And if yes, which is faster?
Here's the thing, transformation of a collection is more or less of runtime O(N) , * runtime cost of all the functions applied. So I doubt the 2nd set of choices you present above would make even the slightest difference in runtime. The first option you list, is a different story. New collection creation can be avoided, because that could result in overhead. That's where "view" collections come in (see this good example I spotted)
In Scala, what does "view" do?
If you had the apply several mapping operations you might do this:
val result = elements.view.map(f1).map(f2).force
(force at the end, causes all functions to evaluate)
The 2nd set of examples above would maybe be a tiny bit faster, but the "view" option could make your code more readable if you had a lot of these or complex anonymous functions used in the mapping.
Composing functions to produce a single pass transformation will probably gain you some performance, but will quickly become unreadable. Consider using views as an alernative. While this will create intermediate collections:
val result = elements.map(f1).map(f2)
This will perform lazy evaluation and will perform functional composition the same way you do:
val result = elements.view.map(f1).map(f2)
Notice that result type will be SeqView so you might want to convert it to list later with toList.

Scala: Mutable vs. Immutable Object Performance - OutOfMemoryError

I wanted to compare the performance characteristics of immutable.Map and mutable.Map in Scala for a similar operation (namely, merging many maps into a single one. See this question). I have what appear to be similar implementations for both mutable and immutable maps (see below).
As a test, I generated a List containing 1,000,000 single-item Map[Int, Int] and passed this list into the functions I was testing. With sufficient memory, the results were unsurprising: ~1200ms for mutable.Map, ~1800ms for immutable.Map, and ~750ms for an imperative implementation using mutable.Map -- not sure what accounts for the huge difference there, but feel free to comment on that, too.
What did surprise me a bit, perhaps because I'm being a bit thick, is that with the default run configuration in IntelliJ 8.1, both mutable implementations hit an OutOfMemoryError, but the immutable collection did not. The immutable test did run to completion, but it did so very slowly -- it takes about 28 seconds. When I increased the max JVM memory (to about 200MB, not sure where the threshold is), I got the results above.
Anyway, here's what I really want to know:
Why do the mutable implementations run out of memory, but the immutable implementation does not? I suspect that the immutable version allows the garbage collector to run and free up memory before the mutable implementations do -- and all of those garbage collections explain the slowness of the immutable low-memory run -- but I'd like a more detailed explanation than that.
Implementations below. (Note: I don't claim that these are the best implementations possible. Feel free to suggest improvements.)
def mergeMaps[A,B](func: (B,B) => B)(listOfMaps: List[Map[A,B]]): Map[A,B] =
(Map[A,B]() /: (for (m <- listOfMaps; kv <-m) yield kv)) { (acc, kv) =>
acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
}
def mergeMutableMaps[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] =
(mutable.Map[A,B]() /: (for (m <- listOfMaps; kv <- m) yield kv)) { (acc, kv) =>
acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
}
def mergeMutableImperative[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] = {
val toReturn = mutable.Map[A,B]()
for (m <- listOfMaps; kv <- m) {
if (toReturn contains kv._1) {
toReturn(kv._1) = func(toReturn(kv._1), kv._2)
} else {
toReturn(kv._1) = kv._2
}
}
toReturn
}
Well, it really depends on what the actual type of Map you are using. Probably HashMap. Now, mutable structures like that gain performance by pre-allocating memory it expects to use. You are joining one million maps, so the final map is bound to be somewhat big. Let's see how these key/values get added:
protected def addEntry(e: Entry) {
val h = index(elemHashCode(e.key))
e.next = table(h).asInstanceOf[Entry]
table(h) = e
tableSize = tableSize + 1
if (tableSize > threshold)
resize(2 * table.length)
}
See the 2 * in the resize line? The mutable HashMap grows by doubling each time it runs out of space, while the immutable one is pretty conservative in memory usage (though existing keys will usually occupy twice the space when updated).
Now, as for other performance problems, you are creating a list of keys and values in the first two versions. That means that, before you join any maps, you already have each Tuple2 (the key/value pairs) in memory twice! Plus the overhead of List, which is small, but we are talking about more than one million elements times the overhead.
You may want to use a projection, which avoids that. Unfortunately, projection is based on Stream, which isn't very reliable for our purposes on Scala 2.7.x. Still, try this instead:
for (m <- listOfMaps.projection; kv <- m) yield kv
A Stream doesn't compute a value until it is needed. The garbage collector ought to collect the unused elements as well, as long as you don't keep a reference to the Stream's head, which seems to be the case in your algorithm.
EDIT
Complementing, a for/yield comprehension takes one or more collections and return a new collection. As often as it makes sense, the returning collection is of the same type as the original collection. So, for example, in the following code, the for-comprehension creates a new list, which is then stored inside l2. It is not val l2 = which creates the new list, but the for-comprehension.
val l = List(1,2,3)
val l2 = for (e <- l) yield e*2
Now, let's look at the code being used in the first two algorithms (minus the mutable keyword):
(Map[A,B]() /: (for (m <- listOfMaps; kv <-m) yield kv))
The foldLeft operator, here written with its /: synonymous, will be invoked on the object returned by the for-comprehension. Remember that a : at the end of an operator inverts the order of the object and the parameters.
Now, let's consider what object is this, on which foldLeft is being called. The first generator in this for-comprehension is m <- listOfMaps. We know that listOfMaps is a collection of type List[X], where X isn't really relevant here. The result of a for-comprehension on a List is always another List. The other generators aren't relevant.
So, you take this List, get all the key/values inside each Map which is a component of this List, and make a new List with all of that. That's why you are duplicating everything you have.
(in fact, it's even worse than that, because each generator creates a new collection; the collections created by the second generator are just the size of each element of listOfMaps though, and are immediately discarded after use)
The next question -- actually, the first one, but it was easier to invert the answer -- is how the use of projection helps.
When you call projection on a List, it returns new object, of type Stream (on Scala 2.7.x). At first you may think this will only make things worse, because you'll now have three copies of the List, instead of a single one. But a Stream is not pre-computed. It is lazily computed.
What that means is that the resulting object, the Stream, isn't a copy of the List, but, rather, a function that can be used to compute the Stream when required. Once computed, the result will be kept so that it doesn't need to be computed again.
Also, map, flatMap and filter of a Stream all return a new Stream, which means you can chain them all together without making a single copy of the List which created them. Since for-comprehensions with yield use these very functions, the use of Stream inside the prevent unnecessary copies of data.
Now, suppose you wrote something like this:
val kvs = for (m <- listOfMaps.projection; kv <-m) yield kv
(Map[A,B]() /: kvs) { ... }
In this case you aren't gaining anything. After assigning the Stream to kvs, the data hasn't been copied yet. Once the second line is executed, though, kvs will have computed each of its elements, and, therefore, will hold a complete copy of the data.
Now consider the original form::
(Map[A,B]() /: (for (m <- listOfMaps.projection; kv <-m) yield kv))
In this case, the Stream is used at the same time it is computed. Let's briefly look at how foldLeft for a Stream is defined:
override final def foldLeft[B](z: B)(f: (B, A) => B): B = {
if (isEmpty) z
else tail.foldLeft(f(z, head))(f)
}
If the Stream is empty, just return the accumulator. Otherwise, compute a new accumulator (f(z, head)) and then pass it and the function to the tail of the Stream.
Once f(z, head) has executed, though, there will be no remaining reference to the head. Or, in other words, nothing anywhere in the program will be pointing to the head of the Stream, and that means the garbage collector can collect it, thus freeing memory.
The end result is that each element produced by the for-comprehension will exist just briefly, while you use it to compute the accumulator. And this is how you save keeping a copy of your whole data.
Finally, there is the question of why the third algorithm does not benefit from it. Well, the third algorithm does not use yield, so no copy of any data whatsoever is being made. In this case, using projection only adds an indirection layer.

Resources