Converting collections performance

In my code I work with different types of collections and often convert one to another. I do this easily by calling the toList, toVector, toSet, and toArray functions.
Now I am interested in the performance of these operations. I found information about the performance of length, head, tail, and apply in the documentation. What actually happens when I call these functions (toList, toVector, toSet, toArray) on the List, Set, Array, and Vector implementations in Scala?
P.S. The question is only about the standard immutable Scala collections.

Well, my advice would be: look at the source code yourself! For instance, the generic conversion method to, which conversions such as toList and toVector go through, is defined as follows in the TraversableOnce trait (annotated by myself):
def to[Col[_]](implicit cbf: CanBuildFrom[Nothing, A, Col[A @uV]]): Col[A @uV] = {
  val b = cbf()  // generic way to build the target collection; if it were a List, this would create an empty List builder
  b ++= seq      // add all the elements
  b.result()     // produce the target collection
}
So it means that these conversions (and toSet, which is built the same way) run in O(n), since you traverse all the elements once. I believe all the collections inheriting this trait use this implementation.

Related

How to join two publishers based on a common attribute and construct a single publisher out of it, in Spring Reactor / WebFlux?

Suppose I have two fluxes Flux<Class1> and Flux<Class2> and both Class1 and Class2 have a common attribute, say "id".
The use case is to join the two fluxes based on the common attribute "id" and construct a single Flux<Tuple<Class1, Class2>>, similar to joining two SQL tables.
- There will always be a 1-to-1 match on the attribute id between the two fluxes.
- The fluxes won't contain more than 100 objects.
- The fluxes are not ordered by id.
How do I achieve this in Project Reactor / Spring WebFlux?
Assuming that:
- both collections aren't very big (you can hold them in memory without risking OOM issues)
- they're not sorted by id
- each element in a collection has its counterpart in the other
First, you should make Class1 and Class2 implement Comparable, or at least prepare comparator implementations that you can use to sort them by their id.
Then you can use the zip operator for that:
Flux<Class1> flux1 = ...
Flux<Class2> flux2 = ...
Flux<Tuple2<Class1,Class2>> zipped = Flux.zip(flux1.sort(comparator1), flux2.sort(comparator2));
Tuple2 is a Reactor core class that lets you access each element of the tuple like this:
Tuple2<Class1,Class2> tuple = ...
Class1 klass1 = tuple.getT1();
Class2 klass2 = tuple.getT2();
In this case, sort will buffer all elements and this might cause memory/latency issues if the collections are large. Depending on how the ordering is done in those collections (let's say the ordering is not guaranteed, but those were batch inserted), you could also buffer some of them (using window) and do the sorting on each window (with sort).
Of course, ideally, being able to fetch both already sorted would avoid buffering data and would improve backpressure support in your application.
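For completeness, here is a minimal self-contained sketch of that zip approach. The Class1/Class2 records, their fields, and the sample data are made up for the example; the point is only the Reactor operators (sort, zip):
import java.util.Comparator;
import reactor.core.publisher.Flux;
import reactor.util.function.Tuple2;

public class ZipJoinSketch {
    // Hypothetical domain types; only the common "id" attribute matters here.
    record Class1(long id, String name) {}
    record Class2(long id, String value) {}

    public static void main(String[] args) {
        Flux<Class1> flux1 = Flux.just(new Class1(2, "b"), new Class1(1, "a"));
        Flux<Class2> flux2 = Flux.just(new Class2(1, "x"), new Class2(2, "y"));

        // Sort both fluxes by id, then pair them element by element.
        Flux<Tuple2<Class1, Class2>> zipped = Flux.zip(
                flux1.sort(Comparator.comparingLong(Class1::id)),
                flux2.sort(Comparator.comparingLong(Class2::id)));

        zipped.subscribe(t -> System.out.println(t.getT1().name() + " -> " + t.getT2().value()));
    }
}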
I think this should work with the following constraints:
- the 2nd Flux needs to emit the same elements to all subscribers since it gets subscribed to over and over again.
- this is basically the equivalent of a nested loop join, so highly inefficient for large fluxes.
- every element of the first Flux has a matching element in the second one.
flux1.flatMap(f1 ->
    flux2.filter(f2 -> f2.id.equals(f1.id)) // scan the second flux for the matching id
         .take(1)                           // take the first element with a matching id
         .map(f2 -> Tuples.of(f1, f2)));    // combine into a Tuple2 (reactor.util.function.Tuples)
Written without an IDE; consider it pseudo-code.
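For reference, here is a compilable version of that nested-loop join, using the same kind of hypothetical record shapes as in the zip sketch above (an id() accessor instead of a public id field):
import reactor.core.publisher.Flux;
import reactor.util.function.Tuple2;
import reactor.util.function.Tuples;

public class FlatMapJoinSketch {
    // Hypothetical domain types, made up for the example.
    record Class1(long id) {}
    record Class2(long id) {}

    public static void main(String[] args) {
        Flux<Class1> flux1 = Flux.just(new Class1(1), new Class1(2));
        Flux<Class2> flux2 = Flux.just(new Class2(2), new Class2(1));

        // Nested-loop join: for each element of flux1, scan flux2 for the first matching id.
        Flux<Tuple2<Class1, Class2>> joined = flux1.flatMap(f1 ->
                flux2.filter(f2 -> f2.id() == f1.id())
                     .take(1)
                     .map(f2 -> Tuples.of(f1, f2)));

        joined.subscribe(t -> System.out.println(t.getT1().id() + " <-> " + t.getT2().id()));
    }
}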

Is there any difference between the two types of union in Spark Streaming?

DStream provides two types of union:
StreamingContext.union(dstreams)
DStream.union(anotherDStream)
So I want to know what the difference is, especially regarding parallelism and performance.
Looking at the source code of the two operations, there is no difference other than one taking a single DStream as input and the other a list.
StreamingContext:
def union[T: ClassTag](streams: Seq[DStream[T]]): DStream[T] = withScope {
  new UnionDStream[T](streams.toArray)
}
DStream:
def union(that: DStream[T]): DStream[T] = ssc.withScope {
  new UnionDStream[T](Array(this, that))
}
Hence, which one you use depends on your preference; there are no performance gains to be had. When you have a list of streams to unite, the method on StreamingContext simplifies the code a bit, so it could be preferable in that case.
Your claim "DStream provides two types of union" is not quite right.
The reference docs mention different signatures, and more specifically different classes that provide the union operation.
StreamingContext.union(*dstreams)
Create a unified DStream from multiple DStreams of the same type and same slide duration.
DStream.union(other)
Return a new DStream by unifying data of another DStream with this DStream.
Parameters: other – Another DStream having the same interval (i.e., slideDuration) as this DStream.
The latter is discussed on the Spark user list: "The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs".
Source code of StreamingContext:
def union(self, *dstreams):
    ...
    first = dstreams[0]
    jrest = [d._jdstream for d in dstreams[1:]]
    return DStream(self._jssc.union(first._jdstream, jrest), self, first._jrdd_deserializer)
Source code of DStream:
def union(self, other):
    return self.transformWith(lambda a, b: a.union(b), other, True)
You can see that the first delegates to the underlying JVM union over the whole list of streams, while the other uses transformWith, which is defined in the same class and transforms each RDD.
The thing to remember is the Level of Parallelism in Data Receiving: in cases where data receiving becomes a bottleneck in the system, it is a good idea to consider parallelizing the data receiving process.
As a result, applying the union() method to multiple DStreams is encouraged, which is why a method is provided to do this easily while keeping your code clean. IMHO, there wouldn't be a difference in performance.
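To make the comparison concrete, here is a minimal sketch using the Java API. The socket sources and ports are made up, and the list-based StreamingContext.union overload shown here is the one documented for Spark 1.x/2.x (newer versions expose a varargs variant instead):
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class UnionSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("union-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Two receiver streams of the same type (hypothetical socket sources).
        JavaDStream<String> lines1 = jssc.socketTextStream("localhost", 9998);
        JavaDStream<String> lines2 = jssc.socketTextStream("localhost", 9999);

        // DStream.union: pairwise union of this stream with one other stream.
        JavaDStream<String> pairwise = lines1.union(lines2);

        // StreamingContext.union: unify a whole list of streams in a single call.
        JavaDStream<String> unified = jssc.union(lines1, Arrays.asList(lines2));

        pairwise.print();
        unified.print();

        jssc.start();
        jssc.awaitTermination();
    }
}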

Why does Map<K,V> not extend Function<K,V>?

While playing around with the new Java 8 Stream API I got to wondering, why not:
public interface Map<K,V> extends Function<K, V>
Or even:
public interface Map<K,V> extends Function<K, V>, Predicate<K>
It would be fairly easy to implement with default methods on the Map interface:
@Override default boolean test(K k) {
    return containsKey(k);
}
@Override default V apply(K k) {
    return get(k);
}
And it would allow for the use of a Map in a map method:
final MyMagicMap<String, Integer> map = new MyMagicHashMap<>();
map.put("A", 1);
map.put("B", 2);
map.put("C", 3);
map.put("D", 4);
final Stream<String> strings = Arrays.stream(new String[]{"A", "B", "C", "D"});
final Stream<Integer> remapped = strings.map(map);
Or as a Predicate in a filter method.
I find that a significant proportion of my use cases for a Map are exactly that construct or a similar one - as a remapping/lookup Function.
So, why did the JDK designers not decide to add this functionality to the Map during the redesign for Java 8?
The JDK team was certainly aware of the mathematical relationship between java.util.Map as a data structure and java.util.function.Function as a mapping function. After all, Function was named Mapper in early JDK 8 prototype builds. And the stream operation that calls a function on each stream element is called Stream.map.
There was even a discussion about possibly renaming Stream.map to something else like transform because of possible confusion between a transforming function and a Map data structure. (Sorry, can't find a link.) This proposal was rejected, with the rationale being the conceptual similarity (and that map for this purpose is in common usage).
The main question is, what would be gained if java.util.Map were a subtype of java.util.function.Function? There was some discussion in comments about whether subtyping implies an "is-a" relationship. Subtyping is less about "is-a" relationships of objects -- since we're talking about interfaces, not classes -- but it does imply substitutability. So if Map were a subtype of Function, one would be able to do this:
Map<K,V> m = ... ;
source.stream().map(m).collect(...);
Right away we're confronted with baking in the behavior of what is now Function.apply to one of the existing Map methods. Probably the only sensible one is Map.get, which returns null if the key isn't present. These semantics are, frankly, kind of lousy. Real applications are probably going to have to write their own methods that supply key-missing policy anyway, so there seems to be very little advantage of being able to write
map(m)
instead of
map(m::get)
or
map(x -> m.getOrDefault(x, def))
The question is “why should it extend Function?”
Your example of using strings.map(map) doesn’t really justify the idea of changing the type inheritance (which implies adding methods to the Map interface), given the small difference from strings.map(map::get). And it’s not clear whether using a Map as a Function is really so common that it should get special treatment compared to, e.g., using map::remove as a Function, or using map::get of a Map<…,Integer> as a ToIntFunction, or map::get of a Map<T,T> as a UnaryOperator.
That’s even more questionable in the case of a Predicate; should map::containsKey really get a special treatment compared to map::containsValue?
It’s also worth noting the type signatures of the methods. Map.get has a functional signature of Object → V, while you suggest that Map<K,V> should extend Function<K,V>, which is understandable from a conceptual view of maps (or just by looking at the type); but it shows that there are two conflicting expectations, depending on whether you look at the method or at the type. The best solution is not to fix the functional type. Then you can assign map::get to either a Function<Object,V> or a Function<K,V> and everyone is happy…
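A tiny illustration of that last point (the map instance and values are made up): because method references are target-typed, the same map::get adapts to both functional shapes without Map having to commit to either one:
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class MapGetTargetTyping {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("A", 1);

        // Map.get(Object) allows the same method reference to satisfy both shapes:
        Function<Object, Integer> loose = m::get;  // matches get's declared parameter type
        Function<String, Integer> strict = m::get; // matches the map's key type

        System.out.println(loose.apply("A"));  // 1
        System.out.println(strict.apply("A")); // 1
    }
}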
Because a Map is not a Function. Inheritance is for "A is a B" relationships, not for "A can be the subject of various kinds of B" relationships.
To have a function transforming a key to its value, you just need
Function<K, V> f = map::get;
To have a predicate testing whether an object is contained in a map as a key, you just need
Predicate<Object> p = map::containsKey;
That is both clearer and more readable than your proposal.

IEnumerable<T>.Count() vs List<T>.Count with Entity Framework

I am retrieving a list of items using Entity Framework, and if any items are retrieved, I do something with them.
var items = db.MyTable.Where(t => t.Expiration < DateTime.Now).ToList();
if(items.Count != 0)
{
// Do something...
}
The if statement could also be written as
if(items.Count() != 0)
{
// Do something...
}
In the first case, the .Count is a List<T>.Count property. In the second case, the .Count() is IEnumerable<T>.Count() extension method.
Both approaches achieve the same result; however, is one preferred over the other? (Possibly some difference in performance?)
Enumerable.Count<T> (the extension method for IEnumerable<T>) just calls Count if the underlying type is an ICollection<T>, so for List<T> there is no difference.
Queryable.Count<T> (the extension method for IQueryable<T>) will use the underlying query provider, which in many cases will push the count down to the actual SQL, which will perform faster than counting the objects in memory.
If a filter is applied (e.g. Count(i => i.Name == "John")) or if the underlying type is not an ICollection<T>, the collection is enumerated to compute the count.
Is one preferred over the other?
I generally prefer to use Count() since 1) it's more portable (the underlying type can be anything that implements IEnumerable<T> or IQueryable<T>) and 2) it's easier to add a filter later if necessary.
As Tim states in his comment, I also prefer using Any() to Count() > 0 since it doesn't have to actually count the items - it will just check for the existence of one item. Conversely I use !Any() instead of Count() == 0.
It depends on the underlying collection and where LINQ will be pulling from. For example, if it's SQL, then using .ToList() will cause the query to pull back the entire list and then count it, whereas the .Count() extension method (applied to the queryable) will translate into a SQL COUNT statement on the database side. In that case there will be an obvious performance difference.
For just a standard List or Collection it's as stated in D. Stanley's answer.
I would say that it depends on what's going on inside the if block. If you're simply doing the check to determine whether to perform a sequence of operations on the underlying enumeration, then it's probably not needed in any event. Simply iterate over the enumeration (omitting ToList as well). If you're not using the collection inside the if block, then you should avoid using ToList and definitely use Any over any Count/Count() method.
Once you've performed the ToList, you're no longer using Entity Framework, and I expect that Count() is only marginally slower than Count since, if the underlying collection is an ICollection<T>, it defers to that implementation. The only overhead would be determining whether it implements that interface.
http://msdn.microsoft.com/en-us/library/bb338038.aspx
Remarks:
If the type of source implements ICollection<T>, that implementation is used to obtain the count of elements. Otherwise, this method determines the count.

Java: customize a HashMap's values

I am working on a real-time application in Java, and I have a data structure that looks like this:
HashMap<Integer, Object> myMap;
Now, this works really well for storing the data that I need, but it kills me on getting data out. The underlying problem that I run into is that if I call
Collection<Object> myObjects = myMap.values();
Iterator<Object> it = myObjects.iterator();
while (it.hasNext()) { Object o = it.next(); }
I declare the iterator and collection as variables in my class and assign them each iteration, but iterating over the collection is very slow. This is a real-time application, so I need to iterate at least 25x per second.
Looking at the profiler I see that there is a new instance of the iterator being created every update.
I was thinking of a few ways of possibly changing the hashmap to fix my problems:
1. Cache the iterator somehow, although I'm not sure if that's possible.
2. Change the return type of hashmap.values() to return a List instead of a Collection.
3. Use a different data structure, but I don't know what I could use.
If this is still open, use the Google Guava collections. They have things like Multimap for the structures you are defining. OK, these might not be an exact replacement, but close:
From the website here: https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained
Every experienced Java programmer has, at one point or another, implemented a Map<K, List<V>> or Map<K, Set<V>>, and dealt with the awkwardness of that structure. For example, Map<K, Set<V>> is a typical way to represent an unlabeled directed graph. Guava's Multimap framework makes it easy to handle a mapping from keys to multiple values. A Multimap is a general way to associate keys with arbitrarily many values.
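As a rough sketch of what that looks like (key and value types made up to mirror the question), Multimap replaces hand-rolled Map<K, List<V>> bookkeeping, and its values() and get(key) methods return live views, so no fresh collection is copied on each read:
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

public class MultimapSketch {
    public static void main(String[] args) {
        // Replaces HashMap<Integer, List<String>>-style bookkeeping.
        Multimap<Integer, String> byId = ArrayListMultimap.create();
        byId.put(1, "first");
        byId.put(1, "second");
        byId.put(2, "third");

        // values() is a live view over all values in the multimap.
        for (String value : byId.values()) {
            System.out.println(value);
        }

        // get(key) is a live view of the values mapped to one key.
        System.out.println(byId.get(1)); // [first, second]
    }
}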
