Suppose I have a tab delimited file containing user activity data formatted like this:
timestamp user_id page_id action_id
I want to write a hadoop job to count user actions on each page, so the output file should look like this:
user_id page_id number_of_actions
I need something like a composite key here - it would contain user_id and page_id. Is there any generic way to do this with Hadoop? I couldn't find anything helpful. So far I'm emitting the key like this in my mapper:
context.write(new Text(user_id + "\t" + page_id), one);
It works, but I feel that it's not the best solution.
Just compose your own Writable. In your example a solution could look like this:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

import com.google.common.collect.ComparisonChain;

public class UserPageWritable implements WritableComparable<UserPageWritable> {

    private String userId;
    private String pageId;

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readUTF();
        pageId = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeUTF(pageId);
    }

    @Override
    public int compareTo(UserPageWritable o) {
        return ComparisonChain.start().compare(userId, o.userId)
                .compare(pageId, o.pageId).result();
    }
}
Although your IDs could probably be longs, here is the String version. This is basically just normal serialization over the Writable interface; note that Hadoop instantiates the class via reflection, so it needs a default (no-argument) constructor, which you should always provide.
The compareTo logic determines how the dataset is sorted, and it also tells the reducer which elements are equal so they can be grouped.
ComparisonChain is a nice utility from Guava.
Don't forget to override equals and hashCode! The partitioner determines the target reducer from the hash code of the key.
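For completeness, here's a hedged sketch of those missing pieces: the no-argument constructor Hadoop needs for reflection, plus equals and hashCode so the default HashPartitioner sends equal keys to the same reducer (Objects.hash is just one reasonable choice, not the only one):

// additions to the UserPageWritable class above (a sketch, not the only way)
public UserPageWritable() {
    // no-arg constructor: Hadoop instantiates Writables via reflection
}

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof UserPageWritable)) return false;
    UserPageWritable other = (UserPageWritable) o;
    return java.util.Objects.equals(userId, other.userId)
            && java.util.Objects.equals(pageId, other.pageId);
}

@Override
public int hashCode() {
    // the default HashPartitioner picks the reducer from this hash code
    return java.util.Objects.hash(userId, pageId);
}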
You could write your own class that implements WritableComparable (which already extends Writable) and compares your two fields.
In order to increase performance, my API is going to receive an array of strings that I need to convert to an array of objects.
My array looks like this:
List<String> listPersons = Arrays.asList("1, Franck, 1980-01-01T00:00:00, 00.00", "2, Martin, 1989-01-01T00:00:00, 00.00");
How could I easily convert it to a List<Person>, ideally using Java 8, so I don't have to write a loop and manually split each String?
class Person {
    private Integer id;
    private String name;
    private Date dateOfBirth;
    // getters and setters
}
Ideally I'd like to automate this directly in Spring Boot, using a custom converter such as:
public class StringToPersonConverter implements Converter<String, Person> {

    @Override
    public Person convert(String from) {
        String[] data = from.split(",");
        return new Person(Integer.parseInt(data[0]), data[1], new Date(data[2]));
    }
}
Declaring the converter:
@Configuration
public class WebConfig implements WebMvcConfigurer {

    @Override
    public void addFormatters(FormatterRegistry registry) {
        registry.addConverter(new StringToPersonConverter());
    }
}
And ideally map it from my controller directly?
@RequestMapping(value = "/insertPersons", method = RequestMethod.POST)
@ResponseBody
public String savePersons(@RequestBody List<Person> listPersons) {}
Unfortunately, it doesn't seem to detect my converter and it throws an error :/
Any idea? Thanks!
As a side note: String[] data = from.split(","); will not work with , as the separator, because the values themselves contain that character, for example: 1980-01-01T00:00:00, 00.00.
In order to increase performance, my API is going to receive an array of strings that I need to convert to an array of objects.
Here is a non-answer. JSON is a format that uses very little memory (literally a few characters) to convey structure. So you don't need to, and really shouldn't, concatenate into a single String distinct pieces of information that here represent different Person objects.
Designing an API that accepts unstructured JSON/text and restructures it in the backend is an anti-pattern: it is not more efficient, it makes the API unclear, and it adds boilerplate code in both the front end and the back end.
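To make that concrete, here's a hedged sketch of the structured alternative (endpoint name taken from the question; everything else assumes Spring Boot's default Jackson setup, which binds a JSON array straight to List<Person> and can parse ISO-8601 dates into java.util.Date). Custom Converters registered via addFormatters apply to query and path parameters, not to @RequestBody, which is likely why the converter above is never picked up:

// POST /insertPersons with a structured JSON body such as:
// [
//   {"id": 1, "name": "Franck", "dateOfBirth": "1980-01-01T00:00:00"},
//   {"id": 2, "name": "Martin", "dateOfBirth": "1989-01-01T00:00:00"}
// ]
@RequestMapping(value = "/insertPersons", method = RequestMethod.POST)
@ResponseBody
public String savePersons(@RequestBody List<Person> listPersons) {
    // Jackson has already built the Person objects here; just persist them
    return "saved " + listPersons.size() + " persons";
}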
I am trying to analyze the social network data which contains follower and followee pairs. I want to find the top 10 users who have the most followees using MapReduce.
I made pairs of userID and number_of_followee with one MapReduce step.
With this data, however, I am not sure how to sort them in a distributed system.
I am not sure how a priority queue can be used in either the Mappers or the Reducers, since the data is distributed across them.
Can someone explain to me how I can use data structures to sort this massive data?
Thank you very much.
If you have a big input file (or files) in the format user_id = number_of_followers, a simple map-reduce algorithm to find the top N users is:
- each mapper processes its own input and finds the top N users in its split, then writes them to a single reducer
- the single reducer receives number_of_mappers * N rows and finds the top N users among them
A sketch of the mapper side is shown below.
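Here's a hedged sketch of that mapper (new-API Hadoop; class and field names are made up for illustration): keep a local top N in a TreeMap keyed by the followee count and emit it only in cleanup(). Note that ties on the count overwrite each other in this simple version:

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopNMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private static final int N = 10;
    // followee count -> "userId<TAB>count", kept sorted by count
    private final TreeMap<Long, String> topN = new TreeMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // assuming input lines of the form "userId<TAB>numberOfFollowees"
        String[] fields = line.toString().split("\t");
        topN.put(Long.parseLong(fields[1]), fields[0] + "\t" + fields[1]);
        if (topN.size() > N) {
            topN.remove(topN.firstKey()); // drop the smallest count so far
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit this mapper's local top N; the single key funnels all rows to one reducer
        for (String row : topN.values()) {
            context.write(NullWritable.get(), new Text(row));
        }
    }
}

The reducer repeats the same TreeMap trick over the number_of_mappers * N rows it receives; set job.setNumReduceTasks(1) so they all land in one place.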
To sort the data in descending order, you need another MapReduce job. The mapper would emit the number of followers as the key and the Twitter handle as the value.
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortingMap extends Mapper<LongWritable, Text, LongWritable, Text> {

    private final Text outValue = new Text();
    private final LongWritable outKey = new LongWritable(0);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assuming the input data is "TwitterId <number of followers>" separated by a tab
        String[] tokens = line.split(Pattern.quote("\t"));
        if (tokens.length > 1) {
            outKey.set(Long.parseLong(tokens[1]));
            outValue.set(tokens[0]);
            context.write(outKey, outValue);
        }
    }
}
For the reducer, use an identity reducer: IdentityReducer<K,V> in the old API; in the new API the base Reducer class already passes pairs through unchanged, so you don't need to set a reducer class at all.
// Sort comparator class
public class DescendingOrderKeyComparator extends WritableComparator {

    protected DescendingOrderKeyComparator() {
        super(LongWritable.class, true); // register the key type; true = create instances for comparison
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        return -1 * w1.compareTo(w2);
    }
}
In the driver class, set the sort comparator:
job.setSortComparatorClass(DescendingOrderKeyComparator.class);
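For reference, a hedged sketch of a full driver for this job (class and path names are assumptions; the identity reduce comes for free because no reducer class is set):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort by followee count, descending");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortingMap.class);
        // no setReducerClass call: the new-API base Reducer is already the identity
        job.setSortComparatorClass(DescendingOrderKeyComparator.class);
        job.setNumReduceTasks(1); // one reducer gives a single, globally sorted output file
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}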
I have a method that fetches all the data, and I am caching its result, but I am not able to evict that result.
@Component("cacheKeyGenerator")
public class CacheKeyGenerator implements KeyGenerator {

    @Override
    public Object generate(Object target, Method method, Object... params) {
        final List<Object> key = new ArrayList<>();
        key.add(method.getDeclaringClass().getName());
        return key;
    }
}
The cached methods:
@Override
@Cacheable(value = "appCache", keyGenerator = "cacheKeyGenerator")
public List<Contact> showAllContacts() {
    return contactRepository.findAll();
}

@Override
@CachePut(value = "appCache", key = "#result.id")
public Contact addData(Contact contact) {
    return contactRepository.save(contact);
}
Now, whenever addData is called, I want the entry in "appCache" stored under the key produced by "cacheKeyGenerator" to be evicted, so that the data returned by showAllContacts() is accurate. Can anyone please help!
The entire code can be found at https://github.com/iftekharkhan09/SpringCaching
Assuming you have a known, constant cache key for showAllContacts, the solution is to simply add @CacheEvict on addData, passing in the cache name and key value:
@Override
@Caching(
    put = { @CachePut(value = "appCache", key = "#result.id") },
    evict = { @CacheEvict(cacheNames = "appCache", key = "someConstant") }
)
public Contact addData(Contact contact) {
    return contactRepository.save(contact);
}
However, because you use a key generator, it is a bit more involved. Given what your key generator does, you could instead pick a fixed value for that cache key, making sure it can't collide with the values produced by #result.id, and use that value instead of the key-generator-returned one.
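A hedged sketch of that suggestion, with "allContacts" as the arbitrarily chosen constant (any value that can't be a contact id works; note the inner single quotes in the evict key, which make it a SpEL string literal rather than an expression):

@Component("cacheKeyGenerator")
public class CacheKeyGenerator implements KeyGenerator {

    @Override
    public Object generate(Object target, Method method, Object... params) {
        // fixed key for the cached showAllContacts() result
        return "allContacts";
    }
}

@Override
@Caching(
    put = { @CachePut(value = "appCache", key = "#result.id") },
    evict = { @CacheEvict(value = "appCache", key = "'allContacts'") }
)
public Contact addData(Contact contact) {
    return contactRepository.save(contact);
}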
I am caching some objects using the Spring cache abstraction; the underlying cache is EhCache. I am trying to evict entries based on a wildcard search for the keys: because of the way I stored them, I only know part of each key. Hence I wanted to do something like the code below. I searched this forum for a relevant answer but did not find any.
@CacheEvict(beforeInvocation = true, key = "userId+%")
public User getUser(String userId) {
    // some implementation
}
Now if I try this, I get an error for the SpEL. I also tried to create a custom key generator; with it, the eviction works if the generator returns one key, but my search produces several keys.
@CacheEvict(beforeInvocation = true, keyGenerator = "cacheKeyEvictor")
public User getUser(String userId) {
    // some implementation
}

// Custom key generator for eviction
public class CacheKeyEvictor implements KeyGenerator {

    @Override
    public Object generate(Object arg0, Method arg1, Object... arg2) {
        // loop over the cache, do a like search, and return the keys
        return object; // works if I return one key; won't work for a list of keys
    }
}
Any help on this is appreciated.
How do I define an ArrayWritable for a custom Hadoop type? I am trying to implement an inverted index in Hadoop, using custom Hadoop types to store the data.
I have an IndividualPosting class which stores the term frequency, document id, and list of byte offsets for the term in the document.
I have a Posting class which has a document frequency (the number of documents the term appears in) and a list of IndividualPostings.
I have defined a LongArrayWritable extending the ArrayWritable class for the list of byte offsets in IndividualPosting.
When I defined a custom ArrayWritable for IndividualPosting, I encountered some problems after local deployment (using Karmasphere, Eclipse):
All the IndividualPosting instances in the list in the Posting class would be the same, even though I get different values in the reduce method.
From the documentation of ArrayWritable:
A Writable for arrays containing instances of a class. The elements of this writable must all be instances of the same class. If this writable will be the input for a Reducer, you will need to create a subclass that sets the value to be of the proper type. For example:

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}
You've already cited doing this with a WritableComparable type defined by Hadoop. Here's what I assume your implementation looks like for LongWritable:
public static class LongArrayWritable extends ArrayWritable {

    public LongArrayWritable() {
        super(LongWritable.class);
    }

    public LongArrayWritable(LongWritable[] values) {
        super(LongWritable.class, values);
    }
}
You should be able to do this with any type that implements WritableComparable, as given by the documentation. Using their example:
public class MyWritableComparable implements
WritableComparable<MyWritableComparable> {
// Some data
private int counter;
private long timestamp;
public void write(DataOutput out) throws IOException {
out.writeInt(counter);
out.writeLong(timestamp);
}
public void readFields(DataInput in) throws IOException {
counter = in.readInt();
timestamp = in.readLong();
}
public int compareTo(MyWritableComparable other) {
int thisValue = this.counter;
int thatValue = other.counter;
return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
}
}
And that should be that. This assumes you're using revision 0.20.2 or 0.21.0 of the Hadoop API.
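Putting the two pieces together for your case, the ArrayWritable subclass for a custom type looks exactly like the LongWritable one; a sketch using the documentation's example class (the class name here is made up):

public class MyWritableComparableArrayWritable extends ArrayWritable {

    public MyWritableComparableArrayWritable() {
        super(MyWritableComparable.class);
    }

    public MyWritableComparableArrayWritable(MyWritableComparable[] values) {
        super(MyWritableComparable.class, values);
    }
}

One more thing worth checking, given the symptom you describe: Hadoop reuses the same Writable instance across iterations of the reducer's value iterator, so if you store references to those objects directly, every element ends up looking like the last one. Deep-copy each value before adding it to your Posting's list.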