Sorting Map/TreeMap/LinkedHashMap by Value (custom data structures)

I am currently having some issues in trying to sort a Map<String, IncreaseDetails>, where IncreaseDetails is simply a custom data structure holding a few fields.
So far I have understood fairly clearly that a TreeMap is the wrong tool here, since a TreeMap is always sorted by its keys rather than by its values.
I have therefore tried switching to both HashMap and LinkedHashMap, but simply calling
Collections.sort(map,comparator) doesn't seem to do the trick. Since I am on Java 8, I was planning on trying the Stream API, but I don't know it too well.
So far my comparator looks like this:
import java.util.Comparator;
import java.util.Map;
// The map is keyed by String, so the comparator must compare String keys, not Float.
public class CompTool implements Comparator<String> {
Map<String, IncreaseDetails> unsortedMap;
public CompTool(Map<String, IncreaseDetails> unsortedMap)
{
this.unsortedMap = unsortedMap;
}
public int compare(String countryOne, String countryTwo)
{
Float countryOneValue = unsortedMap.get(countryOne).getRealIncrease();
Float countryTwoValue = unsortedMap.get(countryTwo).getRealIncrease();
return countryTwoValue.compareTo(countryOneValue);
}
}
Any suggestion would be very much welcome, as I have found a lot of similar questions or videos but none too useful for my current situation.

Your question is somewhat unclear. I assume that you want to sort the unsortedMap entries by the value returned by getRealIncrease(), in descending order. This can be done by streaming the original map's entries, sorting them, and collecting the result into a LinkedHashMap, which preserves insertion order:
Map<String, IncreaseDetails> sortedMap = unsortedMap.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(
(Comparator.comparing(IncreaseDetails::getRealIncrease).reversed())))
.collect(Collectors.toMap(
Map.Entry::getKey,
Map.Entry::getValue,
(a, b) -> a,
LinkedHashMap::new));
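For reference, here is a self-contained sketch of the same pipeline. The IncreaseDetails class below is a hypothetical stand-in (only getRealIncrease() comes from the question), and SortByValueDemo and sortByRealIncreaseDesc are names made up for this example.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the asker's class; only getRealIncrease() is from the question.
class IncreaseDetails {
    private final Float realIncrease;
    IncreaseDetails(float realIncrease) { this.realIncrease = realIncrease; }
    public Float getRealIncrease() { return realIncrease; }
}

class SortByValueDemo {
    // Returns a new LinkedHashMap whose iteration order is descending by getRealIncrease().
    public static Map<String, IncreaseDetails> sortByRealIncreaseDesc(Map<String, IncreaseDetails> unsorted) {
        return unsorted.entrySet()
                .stream()
                .sorted(Map.Entry.comparingByValue(
                        Comparator.comparing(IncreaseDetails::getRealIncrease).reversed()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,            // keys are unique, so this merge function is never used
                        LinkedHashMap::new));   // preserves the sorted insertion order
    }

    public static void main(String[] args) {
        Map<String, IncreaseDetails> unsorted = new HashMap<>();
        unsorted.put("France", new IncreaseDetails(1.5f));
        unsorted.put("Spain", new IncreaseDetails(3.2f));
        unsorted.put("Italy", new IncreaseDetails(2.1f));
        System.out.println(sortByRealIncreaseDesc(unsorted).keySet()); // prints [Spain, Italy, France]
    }
}
```

Note that the original map is left untouched; the sorted order lives only in the returned LinkedHashMap.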

Related

Collections Navigate and update, (no new collections) How to do with Java 8

I have an aList and a bList; both have one field in common, which is my reference for matching the two lists.
Once a reference in the two lists matches, I want to update the bList objects with the matching aList objects.
The conventional approach is shown below. How can I achieve the same in Java 8?
// How to save below piece of two iterations (along with compare* and update*)
// using java 8 ?
// Stream filter will return new Collection but not update same (bList)
for (A a : aList)
{
for(B b: bList )
{
// compare*
if(a.getStrObj().equalsIgnoreCase(b.getStrObj()))
{
// update*
// assume aObjs is initialized
b.getAObjs().add(a);
}
}
}
// Reference for Objects declaration
List<A> aList;
class A {
String strObj;
public String getStrObj()
{ return strObj; }
}
List<B> bList;
class B {
String strObj;
List<A> aObjs;
public String getStrObj()
{ return strObj; }
public void setAObjs(List<A> aObjs)
{ this.aObjs= aObjs; }
public List<A> getAObjs()
{ return this.aObjs;}
}
Your nested loop is not the best way to do it, even before Java 8 (unless you can prove that the lists will always be rather small). You should use a temporary Map with fast lookup for one of the lists to avoid performing m×n operations (string comparisons).
One way to do that with Java 8 is
Map<String, List<A>> m=aList.stream().collect(Collectors.groupingBy(A::getStrObj));
bList.forEach(b -> b.getAObjs()
.addAll(m.getOrDefault(b.getStrObj(), Collections.emptyList())));
Here we are performing m+n operations rather than m×n operations which scales much better with growing list sizes.
You can create an equivalent implementation with pre Java 8 constructs, i.e. two independent loops rather than two nested loops and the resulting code isn’t necessarily worse than the above Java 8 code.
Still, the above code might introduce to you some of the most important features (a method reference, a lambda expression, a stream collect operation and one of the new default operations of the Map interface), so you know where to start next time when solving a similar problem.
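A self-contained sketch of this approach follows. One caveat: the question compares with equalsIgnoreCase, while groupingBy on the raw string is case-sensitive. The sketch lower-cases the key on both sides, which is an assumption and only approximates equalsIgnoreCase for unusual Unicode input. The A and B classes are minimal stand-ins, and JoinDemo and attach are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class A {
    private final String strObj;
    A(String strObj) { this.strObj = strObj; }
    public String getStrObj() { return strObj; }
}

class B {
    private final String strObj;
    private final List<A> aObjs = new ArrayList<>();
    B(String strObj) { this.strObj = strObj; }
    public String getStrObj() { return strObj; }
    public List<A> getAObjs() { return aObjs; }
}

class JoinDemo {
    // Normalizes the key so that lookups ignore case (assumption, see lead-in).
    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // One pass over aList to build the index, one pass over bList to update it in place.
    public static void attach(List<A> aList, List<B> bList) {
        Map<String, List<A>> byKey = aList.stream()
                .collect(Collectors.groupingBy(a -> key(a.getStrObj())));
        bList.forEach(b -> b.getAObjs()
                .addAll(byKey.getOrDefault(key(b.getStrObj()), Collections.emptyList())));
    }
}
```

The bList objects are mutated in place, so no new collection of B is created.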

Creating composite key class for Secondary Sort

I am trying to create a composite key class of a String uniqueCarrier and an int month for Secondary Sort. Can anyone tell me what the steps are?
It looks like you have an equality problem, since you're not using uniqueCarrier in your compareTo method. You need to use uniqueCarrier in both your compareTo and equals methods (and define an equals method if you have not). From the Javadoc for java.lang.Comparable:
The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
You can also implement a RawComparator so that you can compare them without deserializing for some faster performance.
However, I recommend (as I always do) to not write things like Secondary Sort yourself. These have been implemented (as well as dozens of other optimizations) in projects like Pig and Hive. E.g. if you were using Hive, all you need to write is:
SELECT ...
FROM my_table
ORDER BY month, carrier;
The above is a lot simpler to write than trying to figure out how to write Secondary Sorts (and eventually when you need to use it again, how to do it in a generic fashion). MapReduce should be considered a low level programming paradigm and should only be used (IMHO) when you need high performance optimizations that you don't get from higher level projects like Pig or Hive.
EDIT: Forgot to mention about Grouping comparators, see Matt's answer
Your compareTo() implementation is incorrect. You need to sort first on uniqueCarrier, then on month to break equality:
@Override
public int compareTo(CompositeKey other) {
if (this.getUniqueCarrier().equals(other.getUniqueCarrier())) {
return this.getMonth().compareTo(other.getMonth());
} else {
return this.getUniqueCarrier().compareTo(other.getUniqueCarrier());
}
}
One suggestion though: I typically choose to implement my attributes directly as Writable types if possible (for example, IntWritable month and Text uniqueCarrier). This allows me to call write and readFields directly on them, and also use their compareTo. Less code to write is always good...
Speaking of less code, you don't have to call the parent constructor for your composite key.
Now for what is left to be done:
My guess is you are still missing a hashCode() method, which should only return the hash of the attribute you want to group on, in this case uniqueCarrier. This method is called by the default Hadoop partitioner to distribute work across reducers.
I would also write custom GroupingComparator and SortingComparator to make sure grouping happens only on uniqueCarrier, and that sorting behaves according to CompositeKey compareTo():
public class CompositeGroupingComparator extends WritableComparator {
public CompositeGroupingComparator() {
super(CompositeKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.getUniqueCarrier().compareTo(second.getUniqueCarrier());
}
}
public class CompositeSortingComparator extends WritableComparator {
public CompositeSortingComparator()
{
super (CompositeKey.class, true);
}
@Override
public int compare (WritableComparable a, WritableComparable b){
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.compareTo(second);
}
}
Then, tell your Driver to use those two:
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);
Edit: Also see Pradeep's suggestion of implementing RawComparator to prevent having to unmarshall to an Object each time, if you want to optimize further.
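To pull the compareTo, equals and hashCode advice together, here is a plain-Java sketch of the key. It deliberately leaves out the Hadoop parts (a real key would also implement WritableComparable with write and readFields), and it assumes month is a primitive int as stated in the question, so it uses Integer.compare rather than a compareTo call on the month.

```java
import java.util.Objects;

// Plain-Java sketch only: no Hadoop Writable plumbing is included here.
class CompositeKey implements Comparable<CompositeKey> {
    private final String uniqueCarrier;
    private final int month;

    CompositeKey(String uniqueCarrier, int month) {
        this.uniqueCarrier = Objects.requireNonNull(uniqueCarrier);
        this.month = month;
    }

    public String getUniqueCarrier() { return uniqueCarrier; }
    public int getMonth() { return month; }

    // Sort on uniqueCarrier first, then on month to break ties.
    @Override
    public int compareTo(CompositeKey other) {
        int byCarrier = uniqueCarrier.compareTo(other.uniqueCarrier);
        return byCarrier != 0 ? byCarrier : Integer.compare(month, other.month);
    }

    // Consistent with compareTo: equal only if both fields match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey other = (CompositeKey) o;
        return month == other.month && uniqueCarrier.equals(other.uniqueCarrier);
    }

    // Hash on the grouping attribute only, so every month of a carrier
    // lands in the same partition. Equal keys still get equal hashes.
    @Override
    public int hashCode() {
        return uniqueCarrier.hashCode();
    }
}
```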

Avoiding duplicate code when performing operation on different object properties

I have recently run into a problem which has had me thinking in circles. Assume that I have an object of type O with properties O.A and O.B. Also assume that I have a collection of instances of type O, where O.A and O.B are defined for each instance.
Now assume that I need to perform some operation (like sorting) on a collection of O instances using either O.A or O.B, but not both at any given time. My original solution is as follows.
Example -- just for demonstration, not production code:
public class O {
int A;
int B;
}
public static class Utils {
public static void SortByA (O[] collection) {
// Sort the objects in the collection using O.A as the key. Note: this is custom sorting logic, so it is not simply a one-line call to a built-in sort method.
}
public static void SortByB (O[] collection) {
// Sort the objects in the collection using O.B as the key. Same logic as above.
}
}
What I would love to do is this...
public static void SortAgnostic (O[] collection, FieldRepresentation x /* some non-bool, non-int variable representing whether to choose O.A or O.B as the sorting key */) {
// Sort by whatever "x" represents...
}
... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me. Perhaps I am incorrect on that (and I am sure someone will correct me if that statement is wrong :D), but that is my current thought nonetheless.
Question: What is the best way to implement this method? The logic that I have to implement is difficult to break down into smaller methods, as it is already fairly optimized. At the root of the issue is the fact that I need to perform the same operation using different properties of an object. I would like to stay away from using codes/flags/etc. in the method signature if possible so that the solution can be as robust as possible.
Note: When answering this question, please approach it from an algorithmic point of view. I am aware that some language-specific features may be suitable alternatives, but I have encountered this problem before and would like to understand it from a relatively language-agnostic viewpoint. Also, please do not constrain responses to sorting solutions only, as I have only chosen it as an example. The real question is how to avoid code duplication when performing an identical operation on two different properties of an object.
"The real question is how to avoid code duplication when performing an identical operation on two different properties of an object."
This is a very good question as this situation arises all the time. I think, one of the best ways to deal with this situation is to use the following pattern.
public class O {
int A;
int B;
}
public doOperationX1() {
doOperationX(something to indicate which property to use);
}
public doOperationX2() {
doOperationX(something to indicate which property to use);
}
private doOperationX(input ) {
// actual work is done here
}
In this pattern, the actual implementation is performed in a private method, which is called by public methods, with some extra information. For example, in this case, it can be
doOperationX(A), or doOperationX(B), or something like that.
My Reasoning: In my opinion this pattern is optimal as it achieves two main requirements:
It keeps the public interface descriptive and clear, as it keeps operations separate, and avoids flags etc that you also mentioned in your post. This is good for the client.
From the implementation perspective, it prevents duplication, as it is in one place. This is good for the development.
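As a concrete sketch of this pattern in Java, the "something to indicate which property to use" can be a small function that reads the property. Everything here is an assumption for illustration: the operation is a simple "find the maximum" rather than the asker's real logic, and maxByA, maxByB and maxBy are hypothetical names.

```java
import java.util.function.ToIntFunction;

class O {
    int A;
    int B;
    O(int a, int b) { A = a; B = b; }
}

class Utils {
    // Public, descriptive entry points: no flags or codes in the signature.
    public static O maxByA(O[] items) { return maxBy(items, o -> o.A); }
    public static O maxByB(O[] items) { return maxBy(items, o -> o.B); }

    // The single private implementation; the key function selects the property.
    private static O maxBy(O[] items, ToIntFunction<O> key) {
        O best = null;
        for (O item : items) {
            if (best == null || key.applyAsInt(item) > key.applyAsInt(best)) {
                best = item;
            }
        }
        return best; // null when items is empty
    }
}
```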
A simple way to approach this I think is to internalize the behavior of choosing the sort field to the class O itself. This way the solution can be language-agnostic.
The implementation in Java could be using an Abstract class for O, where the purpose of the abstract method getSortField() would be to return the field to sort by. All that the invocation logic would need to do is to implement the abstract method to return the desired field.
O o = new O() {
public int getSortField() {
return A;
}
};
The problem can be reduced to obtaining the value of the specified field from the given object so that it can be used for sorting purposes, for example:
TField getValue(TEntity entity, string fieldName)
{
// Return value of field "A" from entity,
// implementation depends on language of choice, possibly with
// some sort of reflection support
}
This method can be used to substitute comparisons within the sorting algorithm,
if (getValue(o[i], "A") > getValue(o[j], "A"))
{
swap(i, j);
}
The field name can then be parametrized, as,
public static void SortAgnostic (O[] collection, string fieldName)
{
if (getValue(collection[i], fieldName) > getValue(collection[j], fieldName))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, "A").
Some languages allow you to express the field in a more elegant way,
public static void SortAgnostic (O[] collection, Expression fieldExpression)
{
if (getValue(collection[i], fieldExpression) >
getValue(collection[j], fieldExpression))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, entity => entity.A).
And yet another option can be passing a pointer to a function which will return the value of the field needed,
public static void SortAgnostic (O[] collection, Function getValue)
{
if (getValue(collection[i]) > getValue(collection[j]))
{
swap(i, j);
}
...
}
which given a function,
TField getValueOfA(TEntity entity)
{
return entity.A;
}
can be called like SortAgnostic(collection, getValueOfA).
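In Java, the function-passing variant might look like the sketch below. The insertion sort is only a placeholder for the asker's custom sorting logic, and SortAgnosticDemo and the generic parameters are names chosen for this example.

```java
import java.util.function.Function;

class O {
    int A;
    int B;
    O(int a, int b) { A = a; B = b; }
}

class SortAgnosticDemo {
    // Sorts in place by whatever key getValue extracts.
    // Insertion sort stands in for the real, custom sorting logic.
    public static <T, K extends Comparable<K>> void sortAgnostic(T[] collection, Function<T, K> getValue) {
        for (int i = 1; i < collection.length; i++) {
            T current = collection[i];
            K currentKey = getValue.apply(current);
            int j = i - 1;
            // Shift larger elements one slot to the right.
            while (j >= 0 && getValue.apply(collection[j]).compareTo(currentKey) > 0) {
                collection[j + 1] = collection[j];
                j--;
            }
            collection[j + 1] = current;
        }
    }
}
```

A caller would write sortAgnostic(items, o -> o.A) or sortAgnostic(items, o -> o.B); the sorting code itself never mentions either field.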
"... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me"
That is why you should use available tools, such as frameworks or other kinds of code libraries, that already provide the solution you need.
When a mechanism is common, it can be moved to a higher level of abstraction. When you cannot find a suitable solution, try to create your own. Think of the result of the operation as something that is not part of the class's functionality. Sorting is only a feature, which is why it should not be part of your class in the first place. Try to keep the class as simple as possible.
Do not worry prematurely about whether something is worth having just because it is small. Focus on how it will finally be used. If you use one kind of sorting very often, just create a definition of it so that you can reuse it. You do not necessarily have to create a util class and then call it. Sometimes the basic functionality enclosed in a util class is good enough.
I assume that you use Java:
In your case the wheel has already been invented in the form of Collections#sort(List, Comparator).
To complete it, you could create an enum type that implements the Comparator interface with predefined sort orders.
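A minimal sketch of such an enum, assuming the O class from the question with int fields A and B. The enum name OOrder and its constants are hypothetical.

```java
import java.util.Comparator;
import java.util.function.ToIntFunction;

class O {
    int A;
    int B;
    O(int a, int b) { A = a; B = b; }
}

// Each constant is a ready-made Comparator for one property of O.
enum OOrder implements Comparator<O> {
    BY_A(o -> o.A),
    BY_B(o -> o.B);

    private final ToIntFunction<O> key;

    OOrder(ToIntFunction<O> key) {
        this.key = key;
    }

    @Override
    public int compare(O left, O right) {
        return Integer.compare(key.applyAsInt(left), key.applyAsInt(right));
    }
}
```

With this in place, Collections.sort(list, OOrder.BY_A) and Collections.sort(list, OOrder.BY_B) cover both cases without a flag parameter.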

Two arrays or one in Map structure?

I'm trying to create a Map where the data will be static and not change after the program starts (actually loaded from a server)
Is it better to have two arrays, e.g. in Java:
String keys[] = new String[10];
String values[] = new String[10];
where keys[i] corresponds to values[i]?
or to keep them in a single array, e.g.
String[][] map = new String[10][2];
where map[i][0] is the key and map[i][1] is the value?
Personally, the first makes more sense to me, but the second makes more sense to my partner. Is either better performance-wise? Easier to understand?
Update: I'm looking to do this in JavaScript where Map and KeyValuePairs don't exist
Using a Map implementation (in Java) would make this easier to understand as the association is clearer:
static final Map<String, String> my_map;
static
{
my_map = new HashMap<String, String>();
// Populate.
}
A Hashtable looks like what you need. It hashes the keys in such a way that lookup can happen in O(1).
So, you're looking to do this in JavaScript. Any array or object in JS is a map, so you could just do
var mymap = {'key1':'value1','key2':'value2'};

Sorted hash table (map, dictionary) data structure design

Here's a description of the data structure:
It operates like a regular map with get, put, and remove methods, but has a sort method that can be called to sort the map. However, the map remembers its sorted structure, so subsequent calls to sort can be much quicker (if the structure doesn't change too much between calls to sort).
For example:
I call the put method 1,000,000 times.
I call the sort method.
I call the put method 100 more times.
I call the sort method.
The second time I call the sort method should be a much quicker operation, as the map's structure hasn't changed much. Note that the map doesn't have to maintain sorted order between calls to sort.
I understand that it might not be possible, but I'm hoping for O(1) get, put, and remove operations. Something like TreeMap provides guaranteed O(log(n)) time cost for these operations, but always maintains a sorted order (no sort method).
So what's the design of this data structure?
Edit 1 - returning the top-K entries
Although I'd enjoy hearing the answer to the general case above, my use case has gotten more specific: I don't need the whole thing sorted; just the top K elements.
Data structure for efficiently returning the top-K entries of a hash table (map, dictionary)
Thanks!
For "O(1) get, put, and remove operations" you essentially need O(1) lookup, which implies a hash function (as you know), but the requirements of a good hash function often conflict with the requirement to be easily sorted. (If you had a hash table where adjacent values mapped to the same bucket, it would degenerate to O(N) on lots of common data, which is the worst case a hash function is typically meant to avoid.)
I can think of how to get you 90% of the way there. Set up a hashtable alongside a parallel index that is sorted. The index has a clean part (ordered) and a dirty part (unordered). The index would map keys to the values (or references to the values stored in the hashtable - whichever suits you in terms of performance or memory use). When you add to the hashtable, the new entry is pushed onto the back of the dirty list. When you remove from the hashtable, the entry is nulled/removed from the clean and dirty parts of the index. You can sort the index, which sorts the dirty entries only, then merges them into the already sorted 'clean' part of the index. And obviously you can iterate over the index.
As far as I can see, this gives you the O(1) everywhere except on the remove operation and is still fairly simple to implement with standard containers (at least as provided by C++, Java, or Python). It also gives you the "second sort is cheaper" condition by only needing to sort the dirty index entries and then letting you do an O(N) merge. The cost of all this is obviously extra memory for the index and extra indirection when using it.
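A rough Java sketch of this clean/dirty index follows, with one deliberate variation that is an assumption rather than part of the description above: remove does not touch the index at all, and stale keys are filtered out during the next sort, which keeps remove at expected O(1). LazySortedMap and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LazySortedMap<K extends Comparable<K>, V> {
    private final Map<K, V> table = new HashMap<>();
    private List<K> clean = new ArrayList<>();     // keys in sorted order as of the last sort()
    private final Set<K> dirty = new HashSet<>();  // keys added since the last sort()

    public V get(K key) { return table.get(key); }

    public V put(K key, V value) {
        if (!table.containsKey(key)) {
            dirty.add(key);                        // new key: not yet in the sorted part
        }
        return table.put(key, value);
    }

    public V remove(K key) {
        return table.remove(key);                  // index entry is dropped lazily in sort()
    }

    // Sorts only the dirty keys, then merges them into the clean part in one linear pass.
    public void sort() {
        List<K> added = new ArrayList<>();
        for (K key : dirty) {
            if (table.containsKey(key)) added.add(key);
        }
        Collections.sort(added);

        List<K> merged = new ArrayList<>(table.size());
        int i = 0;
        for (K key : clean) {
            // Skip keys that were removed, or removed and re-added (those come from 'added').
            if (!table.containsKey(key) || dirty.contains(key)) continue;
            while (i < added.size() && added.get(i).compareTo(key) < 0) {
                merged.add(added.get(i++));
            }
            merged.add(key);
        }
        while (i < added.size()) {
            merged.add(added.get(i++));
        }
        clean = merged;
        dirty.clear();
    }

    // Keys in sorted order as of the last sort(); later puts are not reflected until sort() runs again.
    public List<K> sortedKeys() { return Collections.unmodifiableList(clean); }
}
```

With n existing keys and d new ones, a repeated sort costs roughly O(d log d + n) instead of O(n log n).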
Why exactly do you need a sort() function ?
What you perhaps want and need is a red-black tree.
http://en.wikipedia.org/wiki/Red-black_tree
These trees automatically keep your input sorted by a comparator you supply. They are complex, but have excellent O(log n) characteristics for lookup, insertion, and removal. Pair the tree (for ordering) with a hash
map (as the dictionary) and you get your data structure.
In Java, a red-black tree is implemented by TreeMap, which is a SortedMap.
What you're looking at is a hashtable with pointers in the entries to the next entry in sorted order. It's a lot like the LinkedHashMap in java except that the links are tracking a sort order rather than the insertion order. You can actually implement this totally by wrapping a LinkedHashMap and having the implementation of sort transfer the entries from the LinkedHashMap into a TreeMap and then back into a LinkedHashMap.
Here's an implementation that sorts the entries in an array list rather than transferring them to a tree map. I think the sort algorithm used by Collections.sort will do a good job of merging the new entries into the already sorted portion.
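import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
public class SortaSortedMap<K extends Comparable<K>,V> implements Map<K,V> {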
private LinkedHashMap<K,V> innerMap;
public SortaSortedMap() {
this.innerMap = new LinkedHashMap<K,V>();
}
public SortaSortedMap(Map<K,V> map) {
this.innerMap = new LinkedHashMap<K,V>(map);
}
public Collection<V> values() {
return innerMap.values();
}
public int size() {
return innerMap.size();
}
public V remove(Object key) {
return innerMap.remove(key);
}
public V put(K key, V value) {
return innerMap.put(key, value);
}
public Set<K> keySet() {
return innerMap.keySet();
}
public boolean isEmpty() {
return innerMap.isEmpty();
}
public Set<Entry<K, V>> entrySet() {
return innerMap.entrySet();
}
public boolean containsKey(Object key) {
return innerMap.containsKey(key);
}
public V get(Object key) {
return innerMap.get(key);
}
public boolean containsValue(Object value) {
return innerMap.containsValue(value);
}
public void clear() {
innerMap.clear();
}
public void putAll(Map<? extends K, ? extends V> m) {
innerMap.putAll(m);
}
public void sort() {
List<Map.Entry<K,V>> entries = new ArrayList<Map.Entry<K,V>>(innerMap.entrySet());
Collections.sort(entries, new KeyComparator());
LinkedHashMap<K,V> newMap = new LinkedHashMap<K,V>();
for (Map.Entry<K,V> e: entries) {
newMap.put(e.getKey(), e.getValue());
}
innerMap = newMap;
}
private class KeyComparator implements Comparator<Map.Entry<K,V>> {
public int compare(Entry<K, V> o1, Entry<K, V> o2) {
return o1.getKey().compareTo(o2.getKey());
}
}
}
I don't know if there's a name, but you could store the current index of each item on the hash.
That is, you have a HashMap<Object, Pair<Integer, Object>>
and a List<Object> objects
When you put, add to the tail or head of the list and insert into the hashmap with your data and the index of insertion. This is O(1).
When you get, pull from the hashmap and ignore the index. This is O(1).
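When you remove, you pull from the map. Take the index and remove from the list as well. This is O(1) if you fill the hole by moving the last list element into it and updating that element's stored index; removing from the middle of an array-backed list without that trick is O(n).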
When you sort, just sort the list. Either update the indexes in the map during the sort, or update after the sort is complete. This does not affect the O(nlgn) sort, as it's a linear step. O(nlgn + n) == O(nlgn)
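A sketch of this design in Java is below. It assumes an array-backed list and uses the swap-with-last trick on removal, which is an addition to the description above; the swap disturbs the order until the next sort, which the question explicitly allows. IndexedMap and Slot are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexedMap<K extends Comparable<K>, V> {
    // Value plus the key's current position in the 'order' list.
    private static final class Slot<T> {
        T value;
        int index;
        Slot(T value, int index) { this.value = value; this.index = index; }
    }

    private final Map<K, Slot<V>> table = new HashMap<>();
    private final List<K> order = new ArrayList<>();

    public V get(K key) {
        Slot<V> slot = table.get(key);
        return slot == null ? null : slot.value;
    }

    public void put(K key, V value) {
        Slot<V> slot = table.get(key);
        if (slot != null) {            // existing key: update in place, position unchanged
            slot.value = value;
            return;
        }
        table.put(key, new Slot<>(value, order.size()));
        order.add(key);                // new key goes to the tail
    }

    public void remove(K key) {
        Slot<V> slot = table.remove(key);
        if (slot == null) return;
        int last = order.size() - 1;
        K moved = order.get(last);
        order.set(slot.index, moved);  // fill the hole with the last key
        order.remove(last);            // removing the tail is O(1)
        if (slot.index != last) {
            table.get(moved).index = slot.index;
        }
    }

    // Sorts the key list, then refreshes the stored indexes in one linear pass.
    public void sort() {
        Collections.sort(order);
        for (int i = 0; i < order.size(); i++) {
            table.get(order.get(i)).index = i;
        }
    }

    public List<K> keys() { return Collections.unmodifiableList(order); }
}
```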
Ordered Dictionary
Recent versions of Python (2.7, 3.1) have "ordered dictionaries" which sound like what you're describing.
The official Python "ordered dictionary" implementation is inspired by previous 3rd-party implementations, as described in the PEP 372.
References:
collections.OrderedDict documentation for Python 2.7
collections.OrderedDict documentation for Python 3.1
PEP 372
ActiveState Ordered Dictionary recipe for Python ≥ 2.4
I'm not aware of a data structure classification with that exact behavior, at least not in Java Collections (or from nonlinear data structures class). Perhaps you can implement it, and it will henceforth be known as the RudigerMap.

Resources