sort map groupingBy counting - sorting

Is there a way to sort the map by value inside this collector without creating a new stream?
Now it prints: AAS : 5 ABA : 195 ABC : 12 ABE : 52.
Desired: ABA : 195 ABE : 52 ...
getTrigraphStream(path) returns: HOW THE TRA GRI ONC GRI ONC INS WHE INS WHE
public Map<String, Long> getTrigraphStatisticsMapAlphabet(String path) {
    return getTrigraphStream(path)
            .collect(groupingByConcurrent(Function.identity(),
                    ConcurrentSkipListMap::new, counting()));
}

Is there a way to sort the map by value inside this collector without creating a new stream?
The answer is No. You can only sort by the value of the grouping after you've finished grouping.
Options:
Invoke the stream() method on the map's entry set after grouping, sort by value, then collect into an order-preserving map such as LinkedHashMap.
Use collectingAndThen with a finishing function that does the sorting and then collects into a map, i.e.
return getTrigraphStream(path)
.collect(collectingAndThen(groupingByConcurrent(Function.identity(), counting()),
map -> map.entrySet().stream().sorted(....).collect(...)));
By the way, what's the problem with creating a new stream? It's a cheap operation.
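The collectingAndThen option can be sketched roughly as follows. This is only an illustration: the class and method names (TrigraphCounts, sortByCountDescending, countSortedByValue) are made up for the example, a plain groupingBy is used instead of groupingByConcurrent, and ties are broken alphabetically by key, which the question does not specify. The result is collected into a LinkedHashMap because that map keeps insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TrigraphCounts {

    // Finisher: re-sorts the counted entries by count (highest first),
    // breaking ties by key, and copies them into an insertion-ordered map.
    static Map<String, Long> sortByCountDescending(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,            // keys are already unique, so this merge is never used
                        LinkedHashMap::new));   // keeps the sorted order
    }

    // Groups and counts, then applies the sorting finisher in the same collect call.
    static Map<String, Long> countSortedByValue(Stream<String> trigraphs) {
        return trigraphs.collect(Collectors.collectingAndThen(
                Collectors.groupingBy(Function.identity(), Collectors.counting()),
                TrigraphCounts::sortByCountDescending));
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of(
                "HOW", "THE", "TRA", "GRI", "ONC", "GRI", "ONC", "INS", "WHE", "INS", "WHE");
        countSortedByValue(sample).forEach((k, v) -> System.out.println(k + " : " + v));
    }
}
```

Note that a second stream over the entry set is still created inside the finisher; collectingAndThen only hides it behind a single collect call.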

Related

Print distinct from the Array Stream in Java 8

How to print the distinct element from the Array Stream in java 8?
I am playing with Java-8 and trying to understand how it works with distinct.
Collection<String> list = Arrays.asList("A", "B", "C", "D", "A", "B", "C");
// Get collection without duplicate i.e. distinct only
List<String> distinctElements = list.stream().distinct().collect(Collectors.toList());
//Let's verify distinct elements
System.out.println(distinctElements);
// Array Stream
System.out.println("------------------------------");
int[] numbers = {2, 3, 5, 7, 11, 13, 2,3};
System.out.println(Arrays.stream(numbers).sum()); // ==> Sum
System.out.println(Arrays.stream(numbers).count()); // ==> Count
System.out.println(Arrays.stream(numbers).distinct()); // ==> Distinct
The last line merely prints an object reference, but I want the actual values:
[A, B, C, D]
------------------------------
46
8
java.util.stream.ReferencePipeline$4@2d98a335
You don't see distinct values directly because IntStream.distinct() is not a terminal operation and it returns IntStream as stated in the documentation:
Returns a stream consisting of the distinct elements of this stream.
You have to terminate your stream, similarly to code you already have in your example:
List<String> distinctElements = list.stream()
        .distinct()
        .collect(Collectors.toList());
Here you call Stream.collect(Collector<? super T,A,R> collector) method which is a terminal operation and you get a list of distinct elements in return.
Both IntStream.count() and IntStream.sum() are terminal operations: they perform the calculation right away, consuming the stream and returning a value.
Arrays.stream() normally returns a Stream, but it has an overloaded version: stream(int[] array), which returns an IntStream, which is a stream of primitive ints. IntStream.distinct() returns an IntStream as well.
In order to collect it, you could use collect(Collectors.toList()):
Arrays.stream(numbers)
.distinct()
.boxed()
.collect(Collectors.toList());
You could also store the result into an int[]:
Arrays.stream(numbers)
.distinct()
.toArray();
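Putting the two variants together, a small self-contained sketch might look like this. The class and method names (DistinctDemo, distinctArray, distinctList) are invented for the example; Arrays.toString is used because printing an int[] directly would again show only an object reference.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DistinctDemo {

    // Terminates the IntStream with toArray(), giving a primitive int[].
    static int[] distinctArray(int[] numbers) {
        return Arrays.stream(numbers).distinct().toArray();
    }

    // Boxes each int to Integer so the values can be collected into a List.
    static List<Integer> distinctList(int[] numbers) {
        return Arrays.stream(numbers).distinct().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[] numbers = {2, 3, 5, 7, 11, 13, 2, 3};
        System.out.println(Arrays.toString(distinctArray(numbers))); // [2, 3, 5, 7, 11, 13]
        System.out.println(distinctList(numbers));                    // [2, 3, 5, 7, 11, 13]
    }
}
```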

Get the maximum value using stream for Map

I have a class called Test. This class has a method called getNumber which returns an int value.
public class Test{
.
.
.
.
public int getNumber(){
return number;
}
}
Also I have a HashMap which the key is a Long and the value is a Test object.
Map<Long, Test> map = new HashMap<Long, Test>();
I want to print the key together with the getNumber value of the entry whose getNumber is the maximum, using a single stream expression.
I can print the maximum Number via below lines
final Comparator<Test> comp = (p1, p2) -> Integer.compare(p1.getNumber(), p2.getNumber());
map.entrySet().stream().map(m -> m.getValue())
.max(comp).ifPresent(d -> System.out.println(d.getNumber()));
However my question is How can I return the key of the maximum amount? Can I do it with one round using stream?
If I understood you correctly:
Entry<Long, Test> entry = map.entrySet()
.stream()
.max(Map.Entry.comparingByValue(Comparator.comparingInt(Test::getNumber)))
.get();
If you want to find the key-value pair corresponding to the maximum 'number' value in the Test instances, you can use Collections.max() combined with a comparator that compares the entries by this criterion.
import static java.util.Comparator.comparingInt;
...
Map.Entry<Long, Test> maxEntry =
    Collections.max(map.entrySet(), comparingInt(e -> e.getValue().getNumber()));
If you want to use the stream way, then remove the mapping (because you lost the key associated with the value), and provide the same comparator:
map.entrySet()
.stream()
.max(comparingInt(e -> e.getValue().getNumber()))
.ifPresent(System.out::println);
Note that there is a small difference in both snippets, as the first one will throw a NoSuchElementException if the provided map is empty.
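For completeness, here is a runnable sketch of the stream variant. The nested Test class is only a stand-in for the class in the question, and MaxEntryDemo and maxByNumber are names made up for the example. Returning the Optional leaves the empty-map decision to the caller instead of throwing.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class MaxEntryDemo {

    // Stand-in for the Test class from the question (assumed shape).
    static final class Test {
        private final int number;

        Test(int number) {
            this.number = number;
        }

        int getNumber() {
            return number;
        }
    }

    // Finds the entry whose value has the largest getNumber(); empty Optional for an empty map.
    static Optional<Map.Entry<Long, Test>> maxByNumber(Map<Long, Test> map) {
        return map.entrySet().stream()
                .max(Map.Entry.comparingByValue(Comparator.comparingInt(Test::getNumber)));
    }

    public static void main(String[] args) {
        Map<Long, Test> map = new HashMap<>();
        map.put(1L, new Test(10));
        map.put(2L, new Test(42));
        map.put(3L, new Test(7));

        // Prints "2 -> 42": both the key and the maximum number are available from the entry.
        maxByNumber(map).ifPresent(e ->
                System.out.println(e.getKey() + " -> " + e.getValue().getNumber()));
    }
}
```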

How to retainAll of List of Lists using stream reduce

I faced the following problem. I have a list of lists on which I simply want to call retainAll. I'm trying to do it with streams:
private List<List<Long>> ids = new ArrayList<List<Long>>();
// some ids.add(otherLists);
List<Long> reduce = ids.stream().reduce(ids.get(0), (a, b) -> a.addAll(b));
unfortunately I got the error
Error:(72, 67) java: incompatible types: bad return type in lambda expression
boolean cannot be converted to java.util.List<java.lang.Long>
If you want to reduce (I think you mean flatten by that) the list of lists, you should do it like this:
import static java.util.stream.Collectors.toList;
...
List<Long> reduce = ids.stream().flatMap(List::stream).collect(toList());
Using reduce, the first value should be the identity value which is not the case in your implementation, and your solution will produce unexpected results when running the stream in parallel (because addAll modifies the list in place, and in this case the identity value will be the same list for partial results).
You'd need to copy the content of the partial result list, and add the other list in it to make it working when the pipeline is run in parallel:
List<Long> reduce = ids.parallelStream().reduce(new ArrayList<>(), (a, b) -> {
List<Long> list = new ArrayList<Long>(a);
list.addAll(b);
return list;
});
addAll returns a boolean, not the union of the two lists. You want
List<Long> reduce = ids.stream().reduce(ids.get(0), (a, b) -> {
a.addAll(b);
return a;
});
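Since the title asks about retainAll rather than addAll, the same copy-then-combine idea can be applied to an intersection. The following is only a sketch under that reading of the question; RetainAllDemo and intersection are hypothetical names. It uses the identity-less reduce overload, so none of the input lists is modified, and an empty outer list yields an empty result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RetainAllDemo {

    // Intersects all lists; each step works on a fresh copy so the inputs stay untouched.
    // With exactly one input list, that list itself is returned (no reduction step runs).
    static List<Long> intersection(List<List<Long>> lists) {
        return lists.stream()
                .reduce((a, b) -> {
                    List<Long> copy = new ArrayList<>(a);
                    copy.retainAll(b);   // returns boolean, so the copy is returned explicitly
                    return copy;
                })
                .orElseGet(ArrayList::new);
    }

    public static void main(String[] args) {
        List<List<Long>> ids = Arrays.asList(
                Arrays.asList(1L, 2L, 3L, 4L),
                Arrays.asList(2L, 3L, 4L, 5L),
                Arrays.asList(3L, 4L, 6L));
        System.out.println(intersection(ids)); // [3, 4]
    }
}
```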

Conversion between different implementations of Seq in Scala

From what I have read, one should prefer to use generic Seq when defining sequences instead of specific implementations such as List or Vector.
Though I have parts of my code when a sequence will be used mostly for full traversal (mapping, filtering, etc) and some parts of my code where the same sequence will be used for indexing operations (indexOf, lastIndexWhere).
In the first case, I think it is better to use LinearSeq (implem is List) whereas in the second case it is better to use IndexedSeq (implem is Vector).
My question is: do I need to explicitly call the conversion methods toList and toIndexedSeq in my code, or is the conversion done under the hood in an intelligent manner? If I use these conversions, is there a performance penalty when going back and forth between IndexedSeq and LinearSeq?
Thanks in advance
Vector will almost always out-perform List. Unless your algorithm uses only ::, head and tail, Vector will be faster than List.
Using List is more of a conceptual question on your algorithm (data is stack-structured, only accessing head/tail, only adding elements by prepending, use of pattern matching (which can be used with Vector, just feels more natural to me to use it with List)).
You might want to look at Why should I use vector in scala
Now for some nice number to compare (obviously not a 'real' benchmark, but eh) :
val l = List.range(1,1000000)
val a = Vector.range(1,1000000)
import System.{currentTimeMillis=> milli}
val startList = milli
l.map(_*2).map(_+2).filter(_%2 == 0)
println(s"time for list map/filter operations : ${milli - startList}")
val startVector = milli
a.map(_*2).map(_+2).filter(_%2 == 0)
println(s"time for vector map/filter operations : ${milli - startVector}")
Output :
time for list map/filter operations : 1214
time for vector map/filter operations : 364
Edit :
Just realized this doesn't actually answer your question. As far as I know, you will have to call toList/toVector yourself. As for performances, it depends on your sequence, but unless you're going back and forth all the time, it shouldn't be a problem.
Once again, not a serious benchmark, but :
val startConvToList = milli
a.toList
println(s"time for conversion to List: ${milli - startConvToList}")
val startConvToVector = milli
l.toVector
println(s"time for conversion to Vector: ${milli - startConvToVector}")
Output :
time for conversion to List: 48
time for conversion to Vector: 18
I have done the same for indexOf and Vector is also more performant
val l = List.range(1,1000000)
val a = Vector.range(1,1000000)
import System.{currentTimeMillis=> milli}
val startList = milli
l.indexOf(500000)
println("time for list index operation : " + (milli - startList))
val startVector = milli
a.indexOf(500000)
println("time for vector index operation : " + (milli - startVector))
Output :
time for list index operation : 36
time for vector index operation : 33
So I guess I should use Vector all the time in internal implementations, but I must use Seq when I build an interface, as specified here:
Difference between a Seq and a List in Scala

how to sort numerically in hadoop's shuffle/sort phase?

The data looks like this, first field is a number,
3 ...
1 ...
2 ...
11 ...
And I want to sort these lines according to the first field numerically instead of alphabetically, which means after sorting it should look like this,
1 ...
2 ...
3 ...
11 ...
But hadoop keeps giving me this,
1 ...
11 ...
2 ...
3 ...
How do I correct it?
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator should be added to streaming command
You need to provide type of sorting required using mapred.text.key.comparator.options. Some useful ones are -n : numeric sort, -r : reverse sort
EXAMPLE :
Create an identity mapper and reducer with the following code
This is the mapper.py & reducer.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print("%s" % line.strip())
This is the input.txt
1
11
2
20
7
3
40
This is the Streaming command
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /user/input.txt \
    -output /user/output.txt \
    -file ~/mapper.py \
    -mapper ~/mapper.py \
    -file ~/reducer.py \
    -reducer ~/reducer.py
And you will get the required output
1
2
3
7
11
20
40
NOTE :
I have used a simple one-key input. If, however, you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case, my example is limited to this.
An identity mapper is needed since you will need at least one mapper for an MR job to run.
Identity reducer is needed since shuffle/sort phase will not work if it is a pure map only job.
Hadoop's default comparator compares your keys based on the Writable type (more precisely WritableComparable) you use. If you are dealing with IntWritable or LongWritable then it will sort them numerically.
I assume you are using Text in your example, so you end up with a lexicographic (byte-wise) ordering of the keys.
In special cases, however, you can also write your own comparator.
E.g: for testing purposes only, here's a quick sample how to change the sort order of Text keys: this will treat them as integers and will produce numerical sort order:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;

public class MyComparator extends WritableComparator {
    public MyComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Decode the raw key bytes and compare them as integers instead of as text
            String v1 = Text.decode(b1, s1, l1);
            String v2 = Text.decode(b2, s2, l2);
            int v1Int = Integer.valueOf(v1.trim());
            int v2Int = Integer.valueOf(v2.trim());
            return (v1Int < v2Int) ? -1 : ((v1Int > v2Int) ? 1 : 0);
        }
        catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
In the jobrunner class set:
Job job = new Job();
...
job.setSortComparatorClass(MyComparator.class);
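The difference between the two orderings can be reproduced without Hadoop. The sketch below is plain Java and only mimics the comparison logic: KeyOrderDemo and its methods are made-up names, and String's natural order stands in for Text's byte-wise order, which gives the same result for ASCII digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class KeyOrderDemo {

    // Sorts keys as text, which is what happens with Text keys by default.
    static List<String> lexicographic(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Sorts keys by their integer value, as the custom comparator above does.
    static List<String> numeric(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparingInt(k -> Integer.parseInt(k.trim())));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("3", "1", "2", "11");
        System.out.println(lexicographic(keys)); // [1, 11, 2, 3]
        System.out.println(numeric(keys));       // [1, 2, 3, 11]
    }
}
```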
For streaming with older Hadoop (which may use -jobconf instead of -D for configuration), you can sort by key:
-jobconf stream.num.map.output.key.fields=2\
-jobconf mapreduce.partition.keycomparator.options="-k2,2nr"\
-jobconf mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
With stream.num.map.output.key.fields=2, the 1st and 2nd columns form the key (key fields 1 and 2).
mapreduce.partition.keycomparator.options="-k2,2nr" means sorting in reverse order by using 2nd key (from 2nd to 2nd keys) as numeric value.
It is pretty much like Linux sort command!
