ParallelStream with Maps - parallel-processing

I want to read the values from a complex map and run one function on each entry in parallel. For that I am using a ForkJoinPool from Java 8.
The issue I am facing is that, for a few values, the function is executed twice.
Initially I thought the cause was that HashMap is not thread-safe, so I tried a Hashtable instead, but the result is the same...
final Map<CakeOrder, List<CakeOrderDetails>> cakeOrderMap = cakeUpdatesQueueItemDao.getCakeOrdersByStatus(EItemStatus.OPENED, country.getDbTableSuffix(), true);
final Map<CakeOrder, List<CakeOrderDetails>> cakeOrderTableMap = new Hashtable<>();
cakeOrderMap.forEach((k, v) -> cakeOrderTableMap.put(k, v));
final ForkJoinPool pool = new ForkJoinPool(20);
pool.submit(() -> cakeOrderTableMap.entrySet()
        .stream()
        .parallel()
        .forEach(entry -> cakeService.updateCakeOrderStatusAndPrice(entry.getKey(), entry.getValue()))).invoke();

For a parallel stream, thread-safety of the input collection is unnecessary as long as the collection is not updated during the operation. On the other hand, Hashtable.entrySet() is a very bad source for a parallel stream, as Hashtable does not have a custom spliterator implementation. Using HashMap is much better. Nevertheless, the provided code should work correctly with either HashMap or Hashtable (though much less efficiently with Hashtable). I suspect that the problem is inside cakeService.updateCakeOrderStatusAndPrice, which is not shown here and is likely not thread-safe.
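To check whether the duplicates come from the stream plumbing or from the service, it can help to run the same pipeline against a counter. The following is a minimal sketch, not the original code: ParallelMapDemo and processAll are hypothetical names, and the counting map stands in for cakeService.updateCakeOrderStatusAndPrice. It iterates a plain HashMap and waits for the submitted task with join(). It is an assumption, not something the question confirms, that calling invoke() on a task that was already submitted may let the calling thread run the task body as well; waiting with join() rules that possibility out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;

final class ParallelMapDemo {

    // Handles every entry of `source` inside a dedicated pool and returns
    // how many times each key was handled.
    static Map<Integer, Integer> processAll(Map<Integer, String> source, int parallelism) {
        // Thread-safe counters: the stand-in for the real (hypothetical) service call.
        Map<Integer, AtomicInteger> calls = new ConcurrentHashMap<>();
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.submit(() -> source.entrySet()
                    .parallelStream()
                    .forEach(e -> calls
                            .computeIfAbsent(e.getKey(), k -> new AtomicInteger())
                            .incrementAndGet()))
                .join(); // wait for completion without re-running the task
        } finally {
            pool.shutdown();
        }
        Map<Integer, Integer> result = new HashMap<>();
        calls.forEach((k, v) -> result.put(k, v.get()));
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> orders = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            orders.put(i, "order-" + i);
        }
        Map<Integer, Integer> counts = processAll(orders, 4);
        long duplicates = counts.values().stream().filter(c -> c > 1).count();
        System.out.println("entries handled: " + counts.size());
        System.out.println("entries handled more than once: " + duplicates);
    }
}
```

If every count is 1 here but the real service still sees duplicates, the service itself is the more likely culprit, as the answer suspects.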

Related

How expensive is the new Gson() constructor in production?

I am creating a new Netty pipeline and I am trying to:
avoid premature optimization.
write code that is easy to explain to one of my interns.
This sort of factory method is certainly easy to explain:
public String toJSON()
{
    Gson gson = new Gson();
    return gson.toJson(this);
}
In a somewhat related question, the OP asks if it is OK (meaning thread-safe) to re-use a single instance of a Gson object. My question is slightly different: is there a good reason to share the object? If so, at what level of complexity is it worth sharing the Gson object? Where is the trade-off?
It’s expensive, and the cost scales with the complexity of the data models you're using Gson to bind. I wrote a post, Reflection Machines, that explains why creating Gson instances is expensive and should be minimized.

Is there any difference between the two types of union in Spark Streaming?

DStream provides two types of union:
StreamingContext.union(Dstreams)
Dstream.union(anotherDstream)
So I want to know what the difference is, especially in terms of parallelism and performance.
Looking at the source code of the two operations, there is no difference other than one taking a single DStream as input and the other a list.
StreamingContext:
def union[T: ClassTag](streams: Seq[DStream[T]]): DStream[T] = withScope {
  new UnionDStream[T](streams.toArray)
}
DStream:
def union(that: DStream[T]): DStream[T] = ssc.withScope {
  new UnionDStream[T](Array(this, that))
}
Hence, which one you use depends on your preference; there are no performance gains to be had. When you have a list of streams to unite, the method in StreamingContext simplifies the code a bit, so it could be preferable in that case.
Your claim "DStream provide two types of union" is not quite right.
The reference mentions different signatures and, more specifically, different classes that provide the union operation.
StreamingContext.union(*dstreams)
Create a unified DStream from multiple DStreams of the same type and same slide duration.
DStream.union(other)
Return a new DStream by unifying data of another DStream with this DStream.
Parameters: other – Another DStream having the same interval (i.e., slideDuration) as this DStream.
The latter is discussed in the Spark User List: "The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs".
Source code of StreamingContext:
def union(self, *dstreams):
    ...
    first = dstreams[0]
    jrest = [d._jdstream for d in dstreams[1:]]
    return DStream(self._jssc.union(first._jdstream, jrest), self, first._jrdd_deserializer)
Source code of DStream:
def union(self, other):
    return self.transformWith(lambda a, b: a.union(b), other, True)
You can see that the first delegates to the union method of the underlying JVM StreamingContext, while the other uses transformWith, which is defined in the same class and transforms each RDD.
The thing to remember is the Level of Parallelism in Data Receiving: if data receiving becomes a bottleneck in the system, consider parallelizing it.
As a result, applying union() to multiple DStreams is encouraged, which led to a method that does this easily while keeping your code clean. IMHO, there wouldn't be a difference in performance.

Why does Map<K,V> not extend Function<K,V>?

While playing around with the new Java 8 Stream API I got to wondering, why not:
public interface Map<K,V> extends Function<K, V>
Or even:
public interface Map<K,V> extends Function<K, V>, Predicate<K>
It would be fairly easy to implement with default methods on the Map interface:
@Override default boolean test(K k) {
    return containsKey(k);
}
@Override default V apply(K k) {
    return get(k);
}
And it would allow for the use of a Map in a map method:
final MyMagicMap<String, Integer> map = new MyMagicHashMap<>();
map.put("A", 1);
map.put("B", 2);
map.put("C", 3);
map.put("D", 4);
final Stream<String> strings = Arrays.stream(new String[]{"A", "B", "C", "D"});
final Stream<Integer> remapped = strings.map(map);
Or as a Predicate in a filter method.
I find that a significant proportion of my use cases for a Map are exactly that construct or a similar one - as a remapping/lookup Function.
So, why did the JDK designers not decide to add this functionality to the Map during the redesign for Java 8?
The JDK team was certainly aware of the mathematical relationship between java.util.Map as a data structure and java.util.function.Function as a mapping function. After all, Function was named Mapper in early JDK 8 prototype builds. And the stream operation that calls a function on each stream element is called Stream.map.
There was even a discussion about possibly renaming Stream.map to something else like transform because of possible confusion between a transforming function and a Map data structure. (Sorry, can't find a link.) This proposal was rejected, with the rationale being the conceptual similarity (and that map for this purpose is in common usage).
The main question is, what would be gained if java.util.Map were a subtype of java.util.function.Function? There was some discussion in comments about whether subtyping implies an "is-a" relationship. Subtyping is less about "is-a" relationships of objects -- since we're talking about interfaces, not classes -- but it does imply substitutability. So if Map were a subtype of Function, one would be able to do this:
Map<K,V> m = ... ;
source.stream().map(m).collect(...);
Right away we're confronted with baking in the behavior of what is now Function.apply to one of the existing Map methods. Probably the only sensible one is Map.get, which returns null if the key isn't present. These semantics are, frankly, kind of lousy. Real applications are probably going to have to write their own methods that supply key-missing policy anyway, so there seems to be very little advantage of being able to write
map(m)
instead of
map(m::get)
or
map(x -> m.getOrDefault(x, def))
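The two alternatives above can be compared side by side. This is a small sketch with hypothetical names (MapLookupDemo, lookupRaw, lookupWithDefault): the first method passes m::get to Stream.map and lets missing keys come through as null, the second supplies an explicit key-missing policy through getOrDefault.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MapLookupDemo {

    // Uses the map as a lookup function; keys that are absent map to null.
    static List<Integer> lookupRaw(Map<String, Integer> m, List<String> keys) {
        return keys.stream().map(m::get).collect(Collectors.toList());
    }

    // Same lookup, but with an explicit policy for missing keys.
    static List<Integer> lookupWithDefault(Map<String, Integer> m, List<String> keys, Integer def) {
        return keys.stream().map(k -> m.getOrDefault(k, def)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("A", 1);
        m.put("B", 2);
        List<String> keys = Arrays.asList("A", "B", "Z");
        System.out.println(lookupRaw(m, keys));
        System.out.println(lookupWithDefault(m, keys, -1));
    }
}
```

Either form is one short expression at the call site, which is the point being made: little would be gained by writing map(m) instead.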
The question is “why should it extend Function?”
Your example of using strings.map(map) doesn’t really justify the idea of changing the type inheritance (implying adding methods to the Map interface), given the little difference to strings.map(map::get). And it’s not clear whether using a Map as a Function is really so common that it should get special treatment compared to, e.g. using map::remove as a Function or using map::get of a Map<…,Integer> as ToIntFunction or map::get of a Map<T,T> as UnaryOperator.
That’s even more questionable in the case of a Predicate; should map::containsKey really get a special treatment compared to map::containsValue?
It’s also worth noting the type signatures of the methods. Map.get has a functional signature of Object → V, while you suggest that Map<K,V> should extend Function<K,V>, which is understandable from a conceptual view of maps (or just by looking at the type), but it shows that there are two conflicting expectations, depending on whether you look at the method or at the type. The best solution is not to fix the functional type. Then you can assign map::get to either Function<Object,V> or Function<K,V> and everyone is happy…
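As an illustration of leaving the functional type open, the sketch below assigns the same method reference to three different functional interfaces. The class and method names (MapGetSignatureDemo and its helpers) are hypothetical. Note that the ToIntFunction variant unboxes the result, so it would throw a NullPointerException for a missing key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToIntFunction;

final class MapGetSignatureDemo {

    // Matches the declared signature of Map.get, which accepts any Object.
    static Function<Object, Integer> asObjectFunction(Map<String, Integer> m) {
        return m::get;
    }

    // Matches the "conceptual" view of a Map<K,V> as a function from K to V.
    static Function<String, Integer> asKeyFunction(Map<String, Integer> m) {
        return m::get;
    }

    // Primitive specialization; the Integer result is unboxed on each call.
    static ToIntFunction<String> asIntFunction(Map<String, Integer> m) {
        return m::get;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("A", 1);
        System.out.println(asObjectFunction(m).apply(42));
        System.out.println(asKeyFunction(m).apply("A"));
        System.out.println(asIntFunction(m).applyAsInt("A"));
    }
}
```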
Because a Map is not a Function. Inheritance is for A is a B relationships. Not for A can be the subject of various kinds of B relationships.
To have a function transforming a key to its value, you just need
Function<K, V> f = map::get;
To have a predicate testing if an object is contained in a map, you just need
Predicate<Object> p = map::containsKey;
That is both clearer and more readable than your proposal.

java customize a hashmap values

I am working on a real-time application in Java, and I have a data structure that looks like this.
HashMap<Integer, Object> myMap;
Now this works really well for storing the data that I need, but it kills me on getting data out. The underlying problem that I run into is that if I call
Collection<Object> myObjects = myMap.values();
Iterator<Object> it = myObjects.iterator();
while (it.hasNext()) { Object o = it.next(); }
I declare the iterator and collection as variables in my class and assign them on each iteration, but iterating over the collection is very slow. This is a real-time application, so I need to iterate at least 25 times per second.
Looking at the profiler I see that there is a new instance of the iterator being created every update.
I was thinking of a few ways of changing the HashMap to possibly fix my problem:
1. Cache the iterator somehow, although I'm not sure if that's possible.
2. Change the return type of HashMap.values() to return a List instead of a Collection.
3. Use a different data structure, but I don't know what I could use.
If this is still open, use the Google Guava collections. They have types like Multimap for the structures you are defining. OK, these might not be an exact replacement, but they are close:
From the website here: https://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained
Every experienced Java programmer has, at one point or another, implemented a Map<K, List<V>> or Map<K, Set<V>>, and dealt with the awkwardness of that structure. For example, Map<K, Set<V>> is a typical way to represent an unlabeled directed graph. Guava's Multimap framework makes it easy to handle a mapping from keys to multiple values. A Multimap is a general way to associate keys with arbitrarily many values.

ConcurrentModificationException when processing HashMap

I'm trying to put a HashMap<Object, List<Object>> into my dataModel, but when I call the template.process() method, I get the following exception:
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
at java.util.HashMap$KeyIterator.next(HashMap.java:828)
at freemarker.template.SimpleCollection$SimpleTemplateModelIterator.next(SimpleCollection.java:142)
at freemarker.core.IteratorBlock$Context.runLoop(IteratorBlock.java:157)
at freemarker.core.Environment.visit(Environment.java:351)
at freemarker.core.IteratorBlock.accept(IteratorBlock.java:95)
at freemarker.core.Environment.visit(Environment.java:196)
at freemarker.core.MixedContent.accept(MixedContent.java:92)
at freemarker.core.Environment.visit(Environment.java:196)
at freemarker.core.IteratorBlock$Context.runLoop(IteratorBlock.java:172)
at freemarker.core.Environment.visit(Environment.java:351)
at freemarker.core.IteratorBlock.accept(IteratorBlock.java:95)
at freemarker.core.Environment.visit(Environment.java:196)
at freemarker.core.MixedContent.accept(MixedContent.java:92)
at freemarker.core.Environment.visit(Environment.java:196)
at freemarker.core.Environment.process(Environment.java:176)
at freemarker.template.Template.process(Template.java:232)
After looking over some articles and older questions, I've tried to use a ConcurrentHashMap instead, to the same result. I've also tried making a copy using new HashMap<Object, List<Object>>(oldHashMap). Are there any other common fixes to this problem I could try?
EDIT: I know the general cause of ConcurrentModificationExceptions. Please only reply if you can help me understand why the framework Freemarker is throwing these exceptions, mkay? =)
Thanks!
The ConcurrentModificationException is caused by using an invalid iterator after the underlying collection has been changed. The only way to fix this is not changing the collection you are iterating over. In most cases this is not caused by multi-threading.
Simple Example:
// throws an exception in the second iteration
for (String s : list) {
    list.remove(s); // changes the collection
}
Fix 1, not supported by all iterators:
Iterator<String> iter = list.iterator();
while (iter.hasNext()) {
    iter.next();
    iter.remove(); // iterator still valid
}
Fix 2:
List<String> toRemove = ...;
for (String s : list) {
    toRemove.add(s);
}
list.removeAll(toRemove);
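The failing loop and both fixes can be put into one runnable sketch. The class and method names (SafeRemovalDemo, failsFast, removeWithIterator, removeAfterLoop) are hypothetical, and an ArrayList is assumed as the list type, since its iterator supports remove(). Each method works on a copy so the caller's list is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

final class SafeRemovalDemo {

    // Reproduces the problem: removing from the list inside a for-each loop.
    static boolean failsFast(List<String> input) {
        List<String> list = new ArrayList<>(input);
        try {
            for (String s : list) {
                list.remove(s); // structural change behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Fix 1: remove through the iterator, which keeps it valid.
    static List<String> removeWithIterator(List<String> input, String unwanted) {
        List<String> list = new ArrayList<>(input);
        Iterator<String> it = list.iterator();
        while (it.hasNext()) {
            if (it.next().equals(unwanted)) {
                it.remove();
            }
        }
        return list;
    }

    // Fix 2: collect the elements first, modify the list after the loop.
    static List<String> removeAfterLoop(List<String> input, String unwanted) {
        List<String> list = new ArrayList<>(input);
        List<String> toRemove = new ArrayList<>();
        for (String s : list) {
            if (s.equals(unwanted)) {
                toRemove.add(s);
            }
        }
        list.removeAll(toRemove);
        return list;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println("for-each removal throws: " + failsFast(data));
        System.out.println(removeWithIterator(data, "b"));
        System.out.println(removeAfterLoop(data, "b"));
    }
}
```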
The exception means that, while you're iterating over the map, something has changed the map's contents.
Your best course of action is to figure out what that "something" is. For example, it could be another thread, or it could be that you have a foreach loop and modify the map from within the loop.
It is very hard to give advice on how to best fix the problem until we understand what exactly is causing it and what the desired behaviour is.
You'll get this kind of problem on List and Map when doing something like this:
List<A> list = ...; // a list with few elements
for (A anObject : list) {
    list.add(anotherObject); // modify list inside the loop
}
The same goes for maps. The solution is to look for possible places where you might be modifying the map inside the loop over that map. Or, if you are using a multi-threaded application, it's possible that another thread is looping over the map while you are modifying it (or vice versa). In such a case you'll need to synchronize access to the map in both places: the looping code and the map-modifying code.
There is some info on it in the Java API for TreeMap here.
The iterators returned by the iterator method of the collections returned by all of this class's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove method, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
Synchronise access to the hashmap so that only one thread can be accessing the hashmap at once.
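One way to do that with the standard library is sketched below, assuming the cause really is a second thread writing to the map. The names SynchronizedMapDemo, sumWhileWriting and sum are hypothetical. Collections.synchronizedMap guards single calls such as put, but iteration over a view like values() still has to be wrapped in a synchronized block on the map itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SynchronizedMapDemo {

    // A writer thread fills the map while the caller iterates over it.
    // Returns the sum of all values once the writer has finished.
    static int sumWhileWriting(int writes) {
        Map<Integer, Integer> map = Collections.synchronizedMap(new HashMap<>());
        Thread writer = new Thread(() -> {
            for (int i = 0; i < writes; i++) {
                map.put(i, 1); // each put takes the map's lock
            }
        });
        writer.start();
        for (int i = 0; i < 100; i++) {
            sum(map); // iterate while the writer is (possibly) still running
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum(map);
    }

    // Iteration must hold the same lock the wrapper uses for put/get.
    static int sum(Map<Integer, Integer> map) {
        int total = 0;
        synchronized (map) {
            for (int v : map.values()) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("sum after concurrent writes: " + sumWhileWriting(5000));
    }
}
```

Without the synchronized block around the loop, the same program can throw a ConcurrentModificationException from the reader, though only on a best-effort basis.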

Resources