Spark - sort by value with a JavaPairRDD - sorting

I am working with Apache Spark using Java. I have a JavaPairRDD<String,Long> and I want to sort this dataset by its value. However, it seems there is only a sortByKey method on it. How can I sort it by the Long value?

dataset.mapToPair(x -> x.swap())   // make the Long value the key
       .sortByKey(false)           // sort descending by that key
       .mapToPair(x -> x.swap())   // swap back to (String, Long)
       .take(100);

'Secondary sort' is not supported by Spark yet (see SPARK-3655 for details).
As a workaround, you can sort by value by swapping key <-> value and sorting by key as usual.
In Scala it would be something like:
val kv: RDD[(String, Long)] = ???
// swap key and value
val vk = kv.map(_.swap)
val vkSorted = vk.sortByKey()

I did this using a List, which (since Java 8) has a sort(Comparator) method:
List<Tuple2<String, Long>> tuples = new ArrayList<>(myRdd.collect()); // collect() pulls the whole RDD to the driver
tuples.sort((Tuple2<String, Long> o1, Tuple2<String, Long> o2) -> o2._2.compareTo(o1._2)); // descending by value
It is longer than @Atul's solution, and I don't know if it is better performance-wise; on an RDD with 500 items it shows no difference, but I wonder how it behaves with an RDD of a million records.
You can also use Collections.sort and pass in the list returned by collect() together with the lambda-based Comparator, as in the sketch below.
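For reference, a minimal sketch of that variant, assuming the same myRdd as above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import scala.Tuple2;

List<Tuple2<String, Long>> tuples = new ArrayList<>(myRdd.collect());
// Descending by the Long value; swap o1 and o2 for ascending order.
Collections.sort(tuples, (o1, o2) -> o2._2.compareTo(o1._2));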

Related

MongoTemplate, Criteria and Hashmap

Good morning.
I'm starting to learn some Mongo right now.
I'm facing this problem and starting to wonder whether this is the best approach to the task, or whether it is better to turn around and solve the problem another way.
My goal is to iterate a simple map of values (key) and vectors/arrays (values).
My test map will be received by a REST layer:
{
"1": ["1","2","3"]
}
Now, after some logic, I need to use the DAO in order to look into the DB.
The key will be a "realm"; the values inside the vector are "castles".
Every realm has some castles, and every castle has some "rules".
I need to find every rule for each available combination of realm and castle.
AccessLevel is a POJO labeled with the @Document annotation, and it has various params, such as castle and realm (both simple ints).
So the idea is to iterate the map and write a long query for every combination of key and value.
public AccessLevel searchAccessLevel(Map<String, Integer[]> request) {
    Query q = new Query();
    Criteria c = new Criteria();
    request.forEach((k, v) -> {
        for (int i : Arrays.asList(v)) {
            q.addCriteria(c.andOperator(
                    Criteria.where("realm").is(k),
                    Criteria.where("castle").is(i)));
        }
    });
    List<AccessLevel> response = db.find(q, AccessLevel.class);
    for (AccessLevel x : response) {
        System.out.println(x.toString());
    }
    return null; // the snippet in the question ends here
}
As you can see, I'm facing an error concerning $and:
Due to limitations of the org.bson.Document, you can't add a second '$and' expression specified as [...]
It seems Mongo can't handle multiple $and expressions, something I'm pretty used to abusing in SQL:
select * from a where id=1 and id=2 and id=3 and id=4
(not the best, since I could use IN(), but SQL allows me).
So, the point is: can Mongo actually work this way and I just need to dig deeper into the problem, or do I need another approach, like using Criteria.in() and making N queries via MongoTemplate, one for every key in my map?
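If the in() route is acceptable, a hypothetical sketch that keeps everything in a single query is to build one criterion per realm and join them with orOperator, i.e. a single $or of realm/castle pairs instead of stacked $and expressions (searchAccessLevels is a made-up name; the field names and the db MongoTemplate are taken from the question):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public List<AccessLevel> searchAccessLevels(Map<String, Integer[]> request) {
    List<Criteria> perRealm = new ArrayList<>();
    // One criterion per map entry: realm matches the key AND castle is any of the values.
    // Note: the question stores realm/castle as ints, so Integer.valueOf(realm) may be needed.
    request.forEach((realm, castles) ->
            perRealm.add(Criteria.where("realm").is(realm)
                    .and("castle").in(Arrays.asList(castles))));
    Query q = new Query(new Criteria().orOperator(perRealm.toArray(new Criteria[0])));
    return db.find(q, AccessLevel.class);
}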

Pass comparator into map with collectors [duplicate]

Let's say I have a list of Brand objects. The POJO contains a getName() method that returns a String. I want to build a
Map<String, Brand>
out of this, with the String being the name... but I want the key to be case-insensitive.
How do I make this work using Java streams? Trying:
brands.stream().collect(Collectors.groupingBy(brand -> brand.getName().toLowerCase()));
doesn't work, which I think is because I'm not using groupingBy correctly.
Collect the results into a case insensitive map
Map<String, Brand> map = brands
        .stream()
        .collect(Collectors.toMap(
                Brand::getName,           // the key
                Function.identity(),      // the value
                (first, second) -> first, // how to handle duplicates
                () -> new TreeMap<String, Brand>(String.CASE_INSENSITIVE_ORDER))); // supply the map implementation
Collectors#groupingBy won't work here because it returns a Map<KeyType, List<ValueType>>, but you don't want a List as the value; from what I've understood, you just want a Brand.
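A quick illustration of the case-insensitive lookup this gives you (assuming the list contains a brand named "Nike"):

Brand nike = map.get("NIKE");
// Both lookups hit the same entry, because the TreeMap compares
// keys with String.CASE_INSENSITIVE_ORDER.
System.out.println(nike == map.get("nike")); // true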

Spring Sorting with Nulls_Last Handling not working

I created a sort rule like this for sorting my list:
Sort sort = new Sort(new Sort.Order(Sort.Direction.ASC,"productOrder", Sort.NullHandling.NULLS_LAST), new Sort.Order(Sort.Direction.ASC,"producedYear", Sort.NullHandling.NULLS_LAST));
With this rule I want to sort by productOrder first; if productOrder is the same, producedYear is compared and sorted. Any null values present should be sorted to the end of the list. productOrder has type Long and producedYear has type Double.
My Repository interface extends the JpaRepository:
public interface ProductRepository extends JpaRepository<Product, String> {
List<Product> findByDisabledAndValid(int disabled, int valid, Sort sort);
}
But the sorted list I receive always contains the null values at the beginning. This means the null values are sorted first, then productOrder, and finally producedYear. It seems that the third parameter I pass to the Sort.Order constructor has no effect.
Does anyone have an idea why? Thank you very much.
Instead of using Spring's integrated Sort.NullHandling.NULLS_LAST, I solved the problem by sorting in memory (as far as I know, Spring Data JPA ignores the NullHandling hint because JPQL has no NULLS FIRST/LAST clause, so null ordering falls back to the database default):
Collections.sort(resultList, Comparator
.comparing(Product::getProductOrder, Comparator.nullsLast(Comparator.naturalOrder()))
.thenComparing(Product::getProducedYear, Comparator.nullsLast(Comparator.naturalOrder())));
Hope this helps if someone else faces the same problem.
Regards!

How to get keySet() and size() for entire GridGain cluster?

GridCache.keySet(), .primarySize(), and .size() only return information for the local node.
How do I get this information for the whole cluster?
Scanning the entire cluster "works", but all I need is the keys or the count, not the values.
The problem is that an SQL query works when I search on an indexed field, but I can't search on the grid cache entry key itself.
My workaround works, but it is far from elegant and performant:
Set<String> ruleIds = FluentIterable
        .from(cache.queries().createSqlFieldsQuery("SELECT property FROM YagoRule").execute().get())
        .<String>transform((it) -> (String) it.iterator().next())
        .toSet();
This requires that the key be the same as one of the fields, and that field needs to be indexed for performance reasons.
Next release of GridGain (6.2.0) will have globalSize() and globalPrimarySize() methods which will ask the cluster for the sizes.
For now you can use the following code:
// Only grab nodes on which cache "mycache" is started.
GridCompute compute = grid.forCache("mycache").compute();

Collection<Integer> res = compute.broadcast(
    // This code will execute on every caching node.
    new GridCallable<Integer>() {
        @Override public Integer call() {
            return grid.cache("mycache").size();
        }
    }
).get();

int sum = 0;
for (Integer i : res)
    sum += i;
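Until the global methods land, the same broadcast pattern should work for the key set as well. This is a sketch following the snippet above, assuming String keys as in the question; the set union also removes keys duplicated by backup copies:

Collection<Set<String>> keySets = compute.broadcast(
    // Runs on every node hosting "mycache"; returns that node's local keys.
    new GridCallable<Set<String>>() {
        @Override public Set<String> call() {
            GridCache<String, Object> cache = grid.cache("mycache");
            // Copy into a HashSet so the returned set is serializable.
            return new HashSet<>(cache.keySet());
        }
    }
).get();

Set<String> allKeys = new HashSet<>();
for (Set<String> keys : keySets)
    allKeys.addAll(keys); // union de-duplicates backup keys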

Does HBase scan returns sorted columns?

I am working on an HBase map-reduce job and need to understand whether the columns in a single column family are returned sorted by their names (keys). If so, I wouldn't need to sort them in the shuffle/sort stage.
Thanks
I have a very similar data model to yours. Upon insertion, however, I set my own values for the timestamps on the Put object: I took a "seed" from the current time and appended an incrementing counter for each event I persisted in the batch, roughly as in the sketch below.
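A minimal sketch of that insertion scheme, using the same KeyValue-era client API as the comparator below (Event, events, table, and the "cf"/"q" family and qualifier names are hypothetical placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

long seed = System.currentTimeMillis(); // one seed per batch
long counter = 0;
List<Put> puts = new ArrayList<>();
for (Event e : events) { // Event/events stand in for whatever is persisted
    Put put = new Put(e.getRowKey());
    // Explicit cell timestamp: seed plus an incrementing counter preserves insertion order.
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), seed + counter++, e.getValue());
    puts.add(put);
}
table.put(puts); // table is an HTable opened elsewhere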
When I pulled the results out from the Scan, I wrote a comparator:
public class KVTimestampComparator implements Comparator<KeyValue> {
    @Override
    public int compare(KeyValue kv1, KeyValue kv2) {
        Long kv1Timestamp = kv1.getTimestamp();
        Long kv2Timestamp = kv2.getTimestamp();
        return kv1Timestamp.compareTo(kv2Timestamp);
    }
}
Then sorted the raw row:
List<KeyValue> row = Arrays.asList(result.raw());
Collections.sort(row, new KVTimestampComparator());
I got this idea from the person who answered this: Sorted results from hbase scanner
No, columns are not sorted.
They are stored internally as key-value pairs in a long byte array. But you should clarify your question about what you actually need this for.
