How to implement sort in Hadoop?

My problem is sorting the values in a file.
Keys and values are integers, and the keys must stay associated with their values after sorting. For example:
key value
1 24
3 4
4 12
5 23
output:
1 24
5 23
4 12
3 4
I am working with massive data and must run the code on a cluster of Hadoop machines.
How can I do this with MapReduce?

You can probably do this (I'm assuming you are using Java here).
From your mappers, emit the value as the key and the original key as the value:
context.write(24, 1);
context.write(4, 3);
context.write(12, 4);
context.write(23, 5);
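In real Hadoop Java code the keys and values have to be Writable objects rather than bare ints, so a mapper along these lines might look roughly like this (just a sketch; the class name and the tab-separated key<TAB>value input format are assumptions, not from the original question):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SwapMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final IntWritable outKey = new IntWritable();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Input line: "<key>\t<value>"; emit (value, key) so Hadoop sorts on the value.
        String[] parts = line.toString().split("\t");
        if (parts.length == 2) {
            outKey.set(Integer.parseInt(parts[1]));
            outValue.set(Integer.parseInt(parts[0]));
            context.write(outKey, outValue);
        }
    }
}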
So all the values that need to be sorted should be the keys in your MapReduce job.
Hadoop by default sorts keys in ascending order.
Hence, to sort in descending order, either do this,
job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
Or, alternatively, set a custom descending sort comparator on your job, which goes something like this:

public static class DescendingKeyComparator extends WritableComparator {
    protected DescendingKeyComparator() {
        super(LongWritable.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        LongWritable key1 = (LongWritable) w1;
        LongWritable key2 = (LongWritable) w2;
        return -1 * key1.compareTo(key2);
    }
}
The shuffle and sort phase in Hadoop will then take care of sorting your keys in descending order: 24, 23, 12, 4.
Edit, after the comment:
If you require a descending IntWritable comparator, you can create one and use it like this:
job.setSortComparatorClass(DescendingIntWritableComparable.DecreasingComparator.class);
In case you are using the old JobConf API, set it like this:
jobConfObject.setOutputKeyComparatorClass(DescendingIntWritableComparable.DecreasingComparator.class);
Your main() method would look something like this:
public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new YourDriver(), args);
    System.exit(exitCode);
}
// this class is defined outside of main(), not inside it
public static class DescendingIntWritableComparable extends IntWritable {
    /** A decreasing Comparator optimized for IntWritable. */
    public static class DecreasingComparator extends Comparator {
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);
        }
    }
}
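For completeness, the YourDriver referenced above would implement Tool and configure the job. A rough sketch follows; the class, mapper, and path arguments are placeholders (SwapMapper refers back to the mapper sketch earlier in this answer), not something from the original post:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class YourDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "descending sort");
        job.setJarByClass(YourDriver.class);
        job.setMapperClass(SwapMapper.class);           // emits (value, key) pairs
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setSortComparatorClass(DescendingIntWritableComparable.DecreasingComparator.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}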

Related

Data structure for a log search service?

A few months back, I was asked to design a service which takes a start and end time interval and lists the number of exceptions/errors grouped by exception type/code. Basically the intention was to create, or reuse, a data structure that supports an efficient search. Here is what I coded.
public class ErrorDetail {
    private int code;
    private String message;
}

public class ExceptionSearchService {
    private Map<Long, ErrorDetail> s = new TreeMap<Long, ErrorDetail>();

    public ArrayList<ErrorDetail> getErrors(long start, long end) {
        ErrorDetail e1 = find(start, s);
        ErrorDetail e2 = find(end, s);
        // do an in-order traversal between e1 and e2, add it to an array list and return
    }

    public void addError(long time, ErrorDetail e) {
        s.put(time, e);
    }
}
I realized that I should not have mentioned a TreeMap; instead I should have had my own class like TreeNode. But the idea was to have a tree structure, and a distributed one, because we are talking about thousands of services serving millions of requests per minute and generating errors.
Could I have used a better data structure in this case?
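For what it's worth, the in-order traversal that the comment in getErrors describes maps directly onto TreeMap's range view. A minimal sketch of that method, assuming s is declared as TreeMap or NavigableMap rather than plain Map so that subMap is visible, and that there is at most one error per timestamp:

public ArrayList<ErrorDetail> getErrors(long start, long end) {
    // subMap gives a view of the entries with start <= time < end, already in key order
    return new ArrayList<ErrorDetail>(s.subMap(start, end).values());
}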

LRU Cache Key, Value, Node.Value Real World Interpretation

I understand how in principle an LRU cache works. For example, see here: https://java2blog.com/lru-cache-implementation-java/
However, I am having great difficulty understanding how this is interpreted in a real world setting. For example, if I want to store objects (which have no natural numbering/order), I understand that the value (in the hashmap) is just a pointer to a node in the linked list, but what does the key represent?
Furthermore, what does the node.value represent? I think this is the actual object which is being cached. However, how does this correspond to the key in the hashmap?
A typical hashmap has a key and a value, both of arbitrary type. The key is the thing you want to index the structure by, and the value is the thing you want to store and retrieve. Consider a normal hashmap in Java:
Map<UUID, Person> peopleById = new HashMap<>();
You can pass in a UUID to a .get method and get the person associated with that UUID, if it exists.
The LRU caches used in the real world are like that as well:
Map<UUID, Person> cachedPeopleById = new LRUCache<>(10);
The UUID is the key, and the Person is the value.
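As an aside, the same key/value semantics can be demonstrated with the JDK's LinkedHashMap running in access order. This is not the implementation from the linked article, just a minimal self-contained sketch (the UUID keys and the placeholder Person class are illustrative):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class Person {
    final String name;
    Person(String name) { this.name = name; }
}

public class LruDemo {
    public static void main(String[] args) {
        int capacity = 2;
        // accessOrder = true: iteration order is least-recently-accessed first.
        Map<UUID, Person> cache = new LinkedHashMap<UUID, Person>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<UUID, Person> eldest) {
                return size() > capacity;   // evict the least recently used entry
            }
        };

        UUID alice = UUID.randomUUID(), bob = UUID.randomUUID(), carol = UUID.randomUUID();
        cache.put(alice, new Person("Alice"));
        cache.put(bob, new Person("Bob"));
        cache.get(alice);                     // touch Alice, so Bob becomes the eldest
        cache.put(carol, new Person("Carol")); // capacity exceeded, Bob is evicted

        System.out.println(cache.containsKey(bob));   // false
        System.out.println(cache.containsKey(alice)); // true
    }
}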
The reference implementation you linked to doesn't use generics, it only supports int to int, which is the equivalent of Map<Integer, Integer>. The Node class in the reference implementation isn't something that ought to be exposed in public methods. So in that reference implementation, Node should be hidden, and delete(Node) and setHead(Node) should be private, because otherwise they expose implementation details of the cache.
A better implementation would be something more like this (doing this off the top of my head, might have compilation errors, for illustrative purposes only):
public class LRUCache<KeyType, ValueType> implements Map<KeyType, ValueType> {
    private static class Node<KeyType, ValueType> {
        KeyType key;
        ValueType value;
        Node prev;
        Node next;

        public Node(KeyType key, ValueType value) {
            this.key = key;
            this.value = value;
        }
    }

    int capacity;
    HashMap<KeyType, Node> map = new HashMap<>();
    Node head = null;
    Node end = null;

    public LRUCache(int capacity) {
        this.capacity = capacity;
    }

    public ValueType get(KeyType key) {
        ...
    }

    public void set(KeyType key, ValueType value) {
        ...
    }

    private void delete(Node<KeyType, ValueType> node) {
        ...
    }

    private void setHead(Node<KeyType, ValueType> node) {
        ...
    }
}

How can I use Java 8 streams to sort an ArrayList of objects by a primitive int member?

Here is an example class. I know the simplest thing would be to change the members from primitive int to Integer objects and use stream/lambda/sorted, but there may be reasons to keep a primitive int, such as space.
How could I use the streams API to sort a List<DateRange> by int member startRange?
List<DateRange> listToBeSorted = new ArrayList<DateRange>();

private static class DateRange {
    private int startRange;
    private int endRange;

    public int getStartRange() {
        return startRange;
    }
    public void setStartRange(int startRange) {
        this.startRange = startRange;
    }
    public int getEndRange() {
        return endRange;
    }
    public void setEndRange(int endRange) {
        this.endRange = endRange;
    }
}
You may do it like so,
List<DateRange> sortedList = listToBeSorted.stream()
        .sorted(Comparator.comparingInt(DateRange::getStartRange))
        .collect(Collectors.toList());
I know you asked for a way to do it with streams, but if you are OK with sorting the original list in-place, you don't need streams for this. Just use the List.sort method:
listToBeSorted.sort(Comparator.comparingInt(DateRange::getStartRange));

MapReduce sorting with heap

I am trying to analyze the social network data which contains follower and followee pairs. I want to find the top 10 users who have the most followees using MapReduce.
I made pairs of userID and number_of_followee with one MapReduce step.
With this data, however, I am not sure how to sort it across a distributed system.
I am not sure how a priority queue can be used in either the Mappers or the Reducers, since the data is distributed.
Can someone explain to me how I can use data structures to sort this massive data?
Thank you very much.
If you have a big input file (or files) in the format user_id = number_of_followers, a simple MapReduce algorithm to find the top N users is the following (a mapper sketch follows below):
each mapper processes its own split and keeps only its local top N users (a bounded priority queue works well here), then writes them to a single reducer;
the single reducer receives number_of_mappers * N rows and finds the top N users among them.
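A rough sketch of the mapper side of that idea, keeping a bounded min-heap of the N largest counts and emitting it only in cleanup(). The class name and the tab-separated user_id and follower-count input format are assumptions:

import java.io.IOException;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopNMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final int N = 10;
    // Min-heap ordered by follower count: the smallest of the current top N sits at the head.
    private final PriorityQueue<long[]> topN =
            new PriorityQueue<>((a, b) -> Long.compare(a[1], b[1]));

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] parts = line.toString().split("\t");
        if (parts.length != 2) {
            return;
        }
        long userId = Long.parseLong(parts[0]);
        long followers = Long.parseLong(parts[1]);
        topN.offer(new long[] { userId, followers });
        if (topN.size() > N) {
            topN.poll();   // drop the smallest, keeping only the local top N
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top N once all of its input has been seen.
        for (long[] entry : topN) {
            context.write(NullWritable.get(), new Text(entry[0] + "\t" + entry[1]));
        }
    }
}

The driver would set job.setNumReduceTasks(1) so that all of the local top-N lists reach one reducer, which can repeat the same bounded-heap trick over roughly number_of_mappers * N rows.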
To sort the data in descending order, you need another MapReduce job. The mapper would emit the number of followers as the key and the Twitter handle as the value.
class SortingMap extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final LongWritable outKey = new LongWritable(0);
    private final Text outValue = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assuming the input data is "TwitterId <number of followers>" separated by a tab
        String[] tokens = line.split(Pattern.quote("\t"));
        if (tokens.length > 1) {
            outKey.set(Long.parseLong(tokens[1]));
            outValue.set(tokens[0]);
            context.write(outKey, outValue);
        }
    }
}
For the reducer, use an identity reducer (IdentityReducer<K,V> in the old API; in the new API the base Reducer class already passes records through unchanged).
// Sort comparator class
public class DescendingOrderKeyComparator extends WritableComparator {
    protected DescendingOrderKeyComparator() {
        super(LongWritable.class, true);
    }

    @SuppressWarnings("unchecked")
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        return -1 * w1.compareTo(w2);
    }
}
In the driver class, set the sort comparator:
job.setSortComparatorClass(DescendingOrderKeyComparator.class);

Implementation of an ArrayWritable for a custom Hadoop type

How do I define an ArrayWritable for a custom Hadoop type? I am trying to implement an inverted index in Hadoop, with custom Hadoop types to store the data.
I have an IndividualPosting class which stores the term frequency, document id, and list of byte offsets for the term in the document.
I have a Posting class which has a document frequency (the number of documents the term appears in) and a list of IndividualPostings.
I have defined a LongArrayWritable extending the ArrayWritable class for the list of byte offsets in IndividualPosting.
When I defined a custom ArrayWritable for IndividualPosting, I encountered some problems after local deployment (using Karmasphere, Eclipse):
all the IndividualPosting instances in the list in the Posting class would be the same, even though I get different values in the reduce method.
From the documentation of ArrayWritable:
A Writable for arrays containing instances of a class. The elements of this writable must all be instances of the same class. If this writable will be the input for a Reducer, you will need to create a subclass that sets the value to be of the proper type. For example:

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}
You've already cited doing this with a WritableComparable type defined by Hadoop. Here's what I assume your implementation looks like for LongWritable:
public static class LongArrayWritable extends ArrayWritable {
    public LongArrayWritable() {
        super(LongWritable.class);
    }
    public LongArrayWritable(LongWritable[] values) {
        super(LongWritable.class, values);
    }
}
You should be able to do this with any type that implements WritableComparable, as given by the documentation. Using their example:
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable other) {
        int thisValue = this.counter;
        int thatValue = other.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}
And that should be that. This assumes you're using revision 0.20.2 or 0.21.0 of the Hadoop API.
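Putting the two together, the ArrayWritable subclass for the custom type from the question would follow the same pattern. A sketch, assuming IndividualPosting implements Writable/WritableComparable as shown above:

public static class IndividualPostingArrayWritable extends ArrayWritable {
    public IndividualPostingArrayWritable() {
        super(IndividualPosting.class);
    }
    public IndividualPostingArrayWritable(IndividualPosting[] values) {
        super(IndividualPosting.class, values);
    }
    // Note: get() still returns Writable[], so elements read back in a reducer
    // need to be cast to IndividualPosting.
}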
