For a research project, I tried sorting the elements in an RDD. I did this using two different approaches.
In the first approach, I applied a mapPartitions() function to the RDD so that it would sort the contents of each partition and produce a result RDD whose only record per partition is that partition's sorted list. Then I applied a reduce function that merges sorted lists.
I ran these experiments on an EC2 cluster containing 30 nodes, set up using the spark-ec2 script. The data file was stored in HDFS.
In the second approach I used the sortBy method in Spark.
I performed these operations on the US Census data (100 MB) found here.
A single line looks like this:
9, Not in universe, 0, 0, Children, 0, Not in universe, Never married, Not in universe or children, Not in universe, White, All other, Female, Not in universe, Not in universe, Children or Armed Forces, 0, 0, 0, Nonfiler, Not in universe, Not in universe, Child <18 never marr not in subfamily, Child under 18 never married, 1758.14, Nonmover, Nonmover, Nonmover, Yes, Not in universe, 0, Both parents present, United-States, United-States, United-States, Native- Born in the United States, 0, Not in universe, 0, 0, 94, - 50000.
I sorted based on the 25th value in the CSV. In this line that is 1758.14.
I noticed that sortBy performs worse than the other method. Is this expected? If it is, why wouldn't mapPartitions() and reduce() be the default sorting approach?
Here is my implementation
public static void sortBy(JavaSparkContext sc){
    JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
    long start = System.currentTimeMillis();
    rdd.sortBy(new Function<String, Double>(){
        @Override
        public Double call(String v1) throws Exception {
            String[] arr = v1.split(",");
            return Double.parseDouble(arr[24]);
        }
    }, true, 9).collect();
    long end = System.currentTimeMillis();
    System.out.println("SortBy: " + (end - start));
}
public static void sortList(JavaSparkContext sc){
    JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
    long start = System.currentTimeMillis();
    JavaRDD<LinkedList<Tuple2<Double, String>>> rdd3 = rdd.mapPartitions(new FlatMapFunction<Iterator<String>, LinkedList<Tuple2<Double, String>>>(){
        @Override
        public Iterable<LinkedList<Tuple2<Double, String>>> call(Iterator<String> t) throws Exception {
            LinkedList<Tuple2<Double, String>> lines = new LinkedList<Tuple2<Double, String>>();
            while (t.hasNext()) {
                String s = t.next();
                String[] arr1 = s.split(",");
                Tuple2<Double, String> t1 = new Tuple2<Double, String>(Double.parseDouble(arr1[24]), s);
                lines.add(t1);
            }
            // sort this partition's records locally
            Collections.sort(lines, new IncomeComparator());
            LinkedList<LinkedList<Tuple2<Double, String>>> list = new LinkedList<LinkedList<Tuple2<Double, String>>>();
            list.add(lines);
            return list;
        }
    });
    rdd3.reduce(new Function2<LinkedList<Tuple2<Double, String>>, LinkedList<Tuple2<Double, String>>, LinkedList<Tuple2<Double, String>>>(){
        @Override
        public LinkedList<Tuple2<Double, String>> call(
                LinkedList<Tuple2<Double, String>> a,
                LinkedList<Tuple2<Double, String>> b) throws Exception {
            // merge two sorted lists
            LinkedList<Tuple2<Double, String>> result = new LinkedList<Tuple2<Double, String>>();
            while (a.size() > 0 && b.size() > 0) {
                if (a.getFirst()._1.compareTo(b.getFirst()._1) <= 0)
                    result.add(a.poll());
                else
                    result.add(b.poll());
            }
            while (a.size() > 0)
                result.add(a.poll());
            while (b.size() > 0)
                result.add(b.poll());
            return result;
        }
    });
    long end = System.currentTimeMillis();
    System.out.println("MapPartitions: " + (end - start));
}
collect() is a major bottleneck as it returns all of the results to the driver.
It incurs both an I/O hit and additional network traffic to a single destination (in this case, the driver).
It also blocks other operations.
Instead of collect() in your first code segment (sortBy()),
try performing a parallel operation such as saveAsTextFile(tmp), then read the data back using sc.textFile(tmp), as sketched below.
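A rough sketch of that substitution, reusing the sortBy() call from above (the "/tmp/sorted-output" path is only an assumed scratch location, not something from the original code):
// Sketch only: keep the sort, but replace collect() with a parallel write.
JavaRDD<String> sorted = rdd.sortBy(new Function<String, Double>() {
    @Override
    public Double call(String line) throws Exception {
        return Double.parseDouble(line.split(",")[24]);
    }
}, true, 9);
sorted.saveAsTextFile("/tmp/sorted-output");                  // executors write in parallel
JavaRDD<String> readBack = sc.textFile("/tmp/sorted-output"); // read back later if needed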
The other code segment (sortList()) utilizes both the mapPartitions() and reduce() parallel APIs - so the entire work is done in parallel.
It would seem that this is the cause for the difference in end-to-end performance times.
Note that your findings don't necessarily mean that the sum of execution times over all machines is worse.
Related
I am a Hibernate novice. I have the following code which persists a large number (say 10K) of rows from a List<String>:
@Override
@Transactional(readOnly = false)
public void createParticipantsAccounts(long studyId, List<String> subjectIds) throws Exception {
    StudyT study = studyDAO.getStudyByStudyId(studyId);
    Authentication auth = SecurityContextHolder.getContext().getAuthentication();
    for (String subjectId : subjectIds) { // loop with saveAndFlush() for each subject
        // ... (user object created here)
        user.setRoleTypeId(4);
        user.setActiveFlag("Y");
        user.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
        user.setCreatedDate(new Date());
        List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
        StudyParticipantsT sp = new StudyParticipantsT();
        sp.setStudyT(study);
        sp.setUsersT(user);
        sp.setSubjectId(subjectId);
        sp.setLocked("N");
        sp.setCreatedBy(auth.getPrincipal().toString().toLowerCase());
        sp.setCreatedDate(new Date());
        participants.add(sp);
        user.setStudyParticipantsTs(participants);
        userDAO.saveAndFlush(user);
    }
}
But this operation takes too long, about 5-10 minutes for 10K rows. What is the proper way to improve this? Do I really need to rewrite the whole thing with a batch insert, or is there something simple I can tweak?
NOTE: I also tried userDAO.save() without the flush, with a single userDAO.flush() at the end outside the for-loop, but this didn't help; performance was just as bad.
We solved it. Batch inserts are done with saveAll(). We define a batch size (say 1000), saveAll() the accumulated list, and then clear it; if we are at the end of the subject IDs (the edge condition), we also save whatever is left. This dramatically sped up all the inserts.
int batchSize = 1000;
// List for batch inserts
List<UsersT> batchInsertUsers = new ArrayList<UsersT>();
for (int i = 0; i < subjectIds.size(); i++) {
    String subjectId = subjectIds.get(i);
    UsersT user = new UsersT();
    // Fill out the object here...
    // ...
    // Add to the batch-insert list; when the list reaches the batch size,
    // or we are at the last subjectId, do a batch insert with saveAll() and clear the list
    batchInsertUsers.add(user);
    if (batchInsertUsers.size() == batchSize || i == subjectIds.size() - 1) {
        userDAO.saveAll(batchInsertUsers);
        // Reset the list
        batchInsertUsers.clear();
    }
}
I have a class Element with a list; my intended output is a Map<String, List<Element>> like this:
{
    1 = [Element3, Element1],
    2 = [Element2, Element1],
    3 = [Element2, Element1],
    4 = [Element2]
}
My input is a set of Element objects. I used forEach to get the desired outcome, but I'm looking for how to collect it using Collectors.toMap. Any input is much appreciated.
Set<Element> changes = new HashSet<>();
List<String> interestList = new ArrayList<>();
interestList.add("1");
interestList.add("2");
interestList.add("3");
Element element = new Element(interestList);
changes.add(element);

interestList = new ArrayList<>();
interestList.add("2");
interestList.add("3");
interestList.add("4");
element = new Element(interestList);
changes.add(element);

Map<String, List<Element>> collect2 = new HashMap<>();
changes.forEach(elem -> {
    elem.getInterestedList().forEach(tracker -> {
        collect2.compute(tracker, (key, val) -> {
            List<Element> elementList = val == null ? new ArrayList<Element>() : val;
            elementList.add(elem);
            return elementList;
        });
    });
});
class Element {
    List<String> interestedList;
    static AtomicInteger sequencer = new AtomicInteger(0);
    String mName;

    public Element(List<String> aList) {
        interestedList = aList;
        mName = "Element" + sequencer.incrementAndGet();
    }

    public List<String> getInterestedList() {
        return interestedList;
    }

    @Override
    public String toString() {
        return mName;
    }
}
You can do it by using Collectors.groupingBy instead of Collectors.toMap, along with Collectors.mapping, which adapts a collector to another collector:
Map<String, List<Element>> result = changes.stream()
.flatMap(e -> e.getInterestedList().stream().map(t -> Map.entry(t, e)))
.collect(Collectors.groupingBy(
Map.Entry::getKey,
Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
You need to use the Stream.flatMap method first and then pair each element of the inner lists with the current Element instance. I did this via Java 9's new Map.entry(key, value) method. If you're not on Java 9 yet, you can change it to new AbstractMap.SimpleEntry<>(key, value).
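For example, on Java 8 the same pipeline with AbstractMap.SimpleEntry would look roughly like this (a sketch only; everything apart from the entry factory is unchanged):
Map<String, List<Element>> result = changes.stream()
        .flatMap(e -> e.getInterestedList().stream()
                .map(t -> new AbstractMap.SimpleEntry<>(t, e)))
        .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));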
After flatmapping, we need to collect instances of Map.Entry. So I'm using Collectors.groupingBy to classify entries by key (where we had previously stored each element of the inner lists, aka what you call tracker in your code). Then, as we don't want to have instances of List<Map.Entry<String, Element>> as the values of the map, we need to transform each Map.Entry<String, Element> of the stream to just Element (that's why I'm using Map.Entry::getValue as the first argument of Collectors.mapping). We also need to specify a downstream collector (here Collectors.toList()), so that the outer Collectors.groupingBy collector knows where to place all the adapted elements of the stream that belong to each group.
A shorter and surely more efficient way to do the same (similar to your attempt) could be:
Map<String, List<Element>> result = new HashMap<>();
changes.forEach(e ->
e.getInterestedList().forEach(t ->
result.computeIfAbsent(t, k -> new ArrayList<>()).add(e)));
This uses Map.computeIfAbsent, which is a perfect fit for your use case.
My reducer gives this output:
Country-Year,Medals
India-2008,60
United States-2008,1237
Zimbabwe-2008, 2
Namibia-2009,22
China-2009,43
United States-2009,54
And I want this: sorting should happen based on medals, and only the top three should be shown.
Country-Year,Medals
United States-2008,1237
India-2008,60
United States-2009,54
Someone suggested that I do this sorting in a custom RecordReader (I understand that it is used on the mapper side), and I browsed through some resources but couldn't find much on sorting. Please share any ideas or links to resources. Thanks in advance!
You can implement the MapReduce Top K design pattern to achieve your objective.
The Top K design pattern sorts your records by value and picks the top k records.
You can go through this link for implementing the Top K design pattern on your data.
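As a minimal sketch of the idea (assuming k = 3 and the Country-Year/medal-count records from your reducer output; this is not the linked article's code), the reducer can keep a bounded TreeMap and emit it only in cleanup():
// Sketch only: keep the 3 largest medal counts seen so far, emit them at the end.
// Note: ties on the medal count overwrite each other in this simplified version.
public class TopThreeReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final TreeMap<Integer, String> top3 = new TreeMap<>();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        top3.put(sum, key.toString());
        if (top3.size() > 3) {
            top3.remove(top3.firstKey()); // drop the smallest entry
        }
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit in descending order of medal count
        for (Map.Entry<Integer, String> e : top3.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
        }
    }
}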
When you aggregate the mapper output in the Reducer class, rather than writing each result to the output directly, collect it into a map, then sort the map and write the results accordingly.
Key = Country-Year, Value = Medals
Dummy code to showcase how to implement this:
public class Medal_reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Change access modifier as per your need
    public Map<String, Integer> map = new HashMap<String, Integer>();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        // Sum up the medal counts for this key and store the result in the map
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        map.put(key.toString(), count);
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        // cleanup() is called once at the end to finish off anything for the reducer.
        // Here we write our final output.
        Map<String, Integer> sortedMap = sortMap(map);
        for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }

    public Map<String, Integer> sortMap(Map<String, Integer> unsortMap) {
        // LinkedHashMap preserves insertion order, so the top entries stay sorted
        Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();
        int count = 0;
        List<Map.Entry<String, Integer>> list = new LinkedList<Map.Entry<String, Integer>>(unsortMap.entrySet());
        // Sort the list we created from the unsorted map, in descending order of value
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        for (Map.Entry<String, Integer> entry : list) {
            // only keep the top 3 in the sorted map
            if (count > 2)
                break;
            sortedMap.put(entry.getKey(), entry.getValue());
            count++;
        }
        return sortedMap;
    }
}
Hopefully this will help.
Initial data:
public class Stats {
    int passesNumber;
    int tacklesNumber;

    public Stats(int passesNumber, int tacklesNumber) {
        this.passesNumber = passesNumber;
        this.tacklesNumber = tacklesNumber;
    }

    public int getPassesNumber() {
        return passesNumber;
    }

    public void setPassesNumber(int passesNumber) {
        this.passesNumber = passesNumber;
    }

    public int getTacklesNumber() {
        return tacklesNumber;
    }

    public void setTacklesNumber(int tacklesNumber) {
        this.tacklesNumber = tacklesNumber;
    }
}
Map<String, List<Stats>> statsByPosition = new HashMap<>();
statsByPosition.put("Defender", Arrays.asList(new Stats(10, 50), new Stats(15, 60), new Stats(12, 100)));
statsByPosition.put("Attacker", Arrays.asList(new Stats(80, 5), new Stats(90, 10)));
I need to calculate the average of the Stats by position, so the result should be a map with the same keys, but with each List of values aggregated (reduced) to a single Stats object:
{
"Defender" => Stats((10 + 15 + 12) / 3, (50 + 60 + 100) / 3),
"Attacker" => Stats((80 + 90) / 2, (5 + 10) / 2)
}
I don't think there's anything new in Java 8 that could really help in solving this problem, at least not efficiently.
If you look carefully at all the new APIs, you will see that the majority of them are aimed at providing more powerful primitives for working on single values and their sequences - that is, on sequences of double, int, ? extends Object, etc.
For example, to compute an average over a sequence of double values, the JDK introduces a new class, DoubleSummaryStatistics, which does the obvious thing: it collects a summary over an arbitrary sequence of double values.
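On its own, it can be used like this, for instance:
DoubleSummaryStatistics summary = DoubleStream.of(10, 15, 12).summaryStatistics();
System.out.println(summary.getAverage()); // 12.333...
System.out.println(summary.getMin());     // 10.0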
I would actually suggest that you go for a similar approach yourself: make your own StatsSummary class along the lines of this:
// assuming this is what your Stats class looks like:
class Stats {
    public final double a, b; // the two stats

    public Stats(double a, double b) {
        this.a = a;
        this.b = b;
    }
}

// the summary will go along these lines:
class StatsSummary implements Consumer<Stats> {
    DoubleSummaryStatistics a, b; // summary of stats collected so far

    StatsSummary() {
        a = new DoubleSummaryStatistics();
        b = new DoubleSummaryStatistics();
    }

    // this is how we collect it:
    @Override
    public void accept(Stats stat) {
        a.accept(stat.a);
        b.accept(stat.b);
    }

    public void combine(StatsSummary other) {
        a.combine(other.a);
        b.combine(other.b);
    }

    // now for the actual methods that return stuff. I will implement only average and min,
    // but the rest are not hard
    public Stats average() {
        return new Stats(a.getAverage(), b.getAverage());
    }

    public Stats min() {
        return new Stats(a.getMin(), b.getMin());
    }
}
Now, the above implementation will actually let you express your intent properly when using Streams and the like: by building a rigid API and using classes available in the JDK as building blocks, you get fewer errors overall. For example, it plugs straight into a collector, as sketched below.
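Assuming the Stats and StatsSummary classes above (and a map whose values hold that Stats class), a usage sketch could look like this; the variable name averagesByPosition is just for illustration:
Map<String, Stats> averagesByPosition = statsByPosition.entrySet().stream()
        .collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> e.getValue().stream()
                        .collect(Collector.of(
                                StatsSummary::new,     // supplier
                                StatsSummary::accept,  // accumulator
                                (left, right) -> { left.combine(right); return left; }, // combiner
                                StatsSummary::average)))); // finisher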
However, if you only want to compute the average once somewhere and don't need anything else, coding this class is a bit of overkill, and here's a quick-and-dirty solution:
Map<String, Stats> computeAverage(Map<String, List<Stats>> statsByPosition) {
    Map<String, Stats> averaged = new HashMap<>();
    statsByPosition.forEach((position, statsList) -> {
        averaged.put(position, averageStats(statsList));
    });
    return averaged;
}

Stats averageStats(Collection<Stats> stats) {
    double a = 0, b = 0;
    int len = stats.size();
    for (Stats stat : stats) {
        a += stat.a;
        b += stat.b;
    }
    return len == 0 ? new Stats(0, 0) : new Stats(a / len, b / len);
}
There is probably a cleaner solution with Java 8, but this works well and isn't too complex:
Map<String, Stats> newMap = new HashMap<>();
statsByPosition.forEach((key, statsList) -> {
newMap.put(key, new Stats(
(int) statsList.stream().mapToInt(Stats::getPassesNumber).average().orElse(0),
(int) statsList.stream().mapToInt(Stats::getTacklesNumber).average().orElse(0))
);
});
The functional forEach method lets you iterate over every key-value pair of your given map.
You just put a new entry into your map for the averaged values, using the key you already have in the given map. The new value is a new Stats object whose constructor arguments are calculated directly.
Just take the value from your old map (the statsList in the forEach function), map the stats to int values with mapToInt, and use the average function.
This function returns an OptionalDouble, which is nearly the same as Optional<Double>. To guard against the case where nothing could be averaged, you use its orElse() method and pass a default value (like 0). Since the average values are double, you have to cast the value to int.
As mentioned, there could probably be an even shorter version using reduce.
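For what it's worth, a reduce-based variant might look roughly like this (a sketch only, reusing the constructor and getters from the question's Stats class; it isn't actually shorter):
Map<String, Stats> averages = new HashMap<>();
statsByPosition.forEach((key, statsList) -> {
    // sum up both fields, then divide by the list size (integer average, as above)
    Stats sum = statsList.stream()
            .reduce(new Stats(0, 0), (acc, s) -> new Stats(
                    acc.getPassesNumber() + s.getPassesNumber(),
                    acc.getTacklesNumber() + s.getTacklesNumber()));
    averages.put(key, new Stats(
            sum.getPassesNumber() / statsList.size(),
            sum.getTacklesNumber() / statsList.size()));
});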
You might as well use a custom collector. Let's add the following methods to the Stats class:
public Stats() {
}

public void accumulate(Stats stats) {
    passesNumber += stats.passesNumber;
    tacklesNumber += stats.tacklesNumber;
}

public Stats combine(Stats acc) {
    passesNumber += acc.passesNumber;
    tacklesNumber += acc.tacklesNumber;
    return this;
}

@Override
public String toString() {
    return "Stats{" +
            "passesNumber=" + passesNumber +
            ", tacklesNumber=" + tacklesNumber +
            '}';
}
Now we can use Stats in the collect method:
System.out.println(statsByPosition.entrySet().stream().collect(
Collectors.toMap(
entity -> entity.getKey(),
entity -> {
Stats entryStats = entity.getValue().stream().collect(
Collector.of(Stats::new, Stats::accumulate, Stats::combine)
); // get stats for each map key.
// get average
entryStats.setPassesNumber(entryStats.getPassesNumber() / entity.getValue().size());
// get average
entryStats.setTacklesNumber(entryStats.getTacklesNumber() / entity.getValue().size());
return entryStats;
}
))); // {Attacker=Stats{passesNumber=85, tacklesNumber=7}, Defender=Stats{passesNumber=12, tacklesNumber=70}}
If Java 9 and StreamEx are available, you could do:
public static Map<String, Stats> third(Map<String, List<Stats>> statsByPosition) {
return statsByPosition.entrySet().stream()
.collect(Collectors.groupingBy(e -> e.getKey(),
Collectors.flatMapping(e -> e.getValue().stream(),
MoreCollectors.pairing(
Collectors.averagingDouble(Stats::getPassesNumber),
Collectors.averagingDouble(Stats::getTacklesNumber),
(a, b) -> new Stats(a, b)))));
}
public List<Errand> interestFeed(Person person, int skip, int limit) throws ControllerException {
    person = validatePerson(person);
    String query = String
            .format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n ORDER BY n.added DESC SKIP %s LIMIT %S",
                    person.getLongitude(), person.getLatitude(),
                    person.getWidth(), skip, limit);
    String queryFast = String
            .format("START n=node:ErrandLocation('withinDistance:[%.2f, %.2f, %.2f]') RETURN n SKIP %s LIMIT %S",
                    person.getLongitude(), person.getLatitude(),
                    person.getWidth(), skip, limit);
    Set<Errand> errands = new TreeSet<Errand>();
    System.out.println(queryFast);
    Result<Map<String, Object>> results = template.query(queryFast, null);
    Iterator<Errand> objects = results.to(Errand.class).iterator();
    return copyIterator(objects);
}

public List<Errand> copyIterator(Iterator<Errand> iter) {
    Long start = System.currentTimeMillis();
    Double startD = start.doubleValue();
    List<Errand> copy = new ArrayList<Errand>();
    while (iter.hasNext()) {
        Errand e = iter.next();
        copy.add(e);
        System.out.println(e.getType());
    }
    Long end = System.currentTimeMillis();
    Double endD = end.doubleValue();
    System.out.println((endD - startD) / 1000);
    return copy;
}
When I profile the copyIterator function, it takes about 6 seconds to fetch just 10 results. I use Spring Data Neo4j REST to connect to a Neo4j server running on my local machine. I even added a print statement to see how fast the iterator is converted to a list, and it does appear slow. Does each iterator.next() make a new HTTP call?
If Errand is a node entity, then yes, Spring Data Neo4j will make an HTTP call for each entity to fetch all of its labels (this is a shortcoming of Neo4j, which doesn't return labels when you return a whole node in Cypher).
You can enable debug-level logging for org.springframework.data.neo4j.rest.SpringRestCypherQueryEngine to log all Cypher statements going to Neo4j.
To avoid these per-entity calls, use @QueryResult: http://docs.spring.io/spring-data/data-neo4j/docs/current/reference/html/#reference_programming-model_mapresult
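For illustration, a @QueryResult projection could look roughly like this; the interface name, the accessors, and the column aliases are assumptions for this sketch, and the Cypher query would need to return matching columns (e.g. RETURN n.type AS type, n.added AS added):
// Hypothetical sketch; names and columns are assumed, not taken from the Errand entity.
@QueryResult
public interface ErrandSummary {
    @ResultColumn("type")
    String getType();

    @ResultColumn("added")
    Long getAdded();
}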