I'm a newbie in the Hadoop ecosystem.
What I want to ask is: "Are member variables of a Reducer class thread-safe?"
1. The Mapper passes data to the Reducer with a unique key.
2. There is a collection (ConcurrentLinkedQueue) which is a member variable of the Reducer class.
3. The collection is initialized in the setup(Context) method of the Reducer class.
4. Some Query objects (jOOQ) are created and appended to the collection in the reduce(...) method of the Reducer class.
5. jooq.batch(collection).execute() is called at the end of the reduce(...) method once a specified threshold (e.g. 1000) is reached, and then the collection is cleared with clear().
6. The remainder of the collection from step 4 is processed in the cleanup(Context) method the same way as in step 5.
Question: Do I need to synchronize step 5?
Code:
public class SomeReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    private Queue<Query> queries;

    @Override
    protected void setup(Context context) {
        ...
        queries = new ConcurrentLinkedQueue<>();
    }

    @Override
    protected void cleanup(Context context) {
        if (!queries.isEmpty()) db.batch(queries).execute();
        ...
    }

    @Override
    public void reduce(Text key, Iterable<Session> sessions, Context context) {
        for (...iteration...) { queries.add(...create Query object...); }

        // Should this code snippet below be synchronized?
        if (queries.size() >= 1000) {
            db.batch(queries).execute();
            queries.clear();
        }
    }
}
A Reducer is thread-safe. You will most likely have multiple Reducers running in parallel, but they are completely isolated from each other and only see their own data and instance variables.
So to answer your question, you do not need to synchronize your code or even use a ConcurrentLinkedQueue; it could just be a normal ArrayList.
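For illustration, here is a minimal sketch of the same reducer using a plain ArrayList and no synchronization. The db field (a jOOQ DSLContext), the createQuery(...) helper and the Text value type are assumptions standing in for the details elided in the question:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.Text;
import org.jooq.DSLContext;
import org.jooq.Query;

public class SomeReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {

    private static final int BATCH_SIZE = 1000;   // flush threshold from the question

    private DSLContext db;                        // assumed to be configured elsewhere
    private List<Query> queries;                  // a plain list is enough: one reducer task = one thread

    @Override
    protected void setup(Context context) {
        queries = new ArrayList<>();
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) {
        for (Text value : values) {
            queries.add(createQuery(value));      // hypothetical helper building a jOOQ Query
        }
        if (queries.size() >= BATCH_SIZE) {       // no synchronization needed
            db.batch(queries).execute();
            queries.clear();
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (!queries.isEmpty()) {
            db.batch(queries).execute();          // flush whatever is left
        }
    }

    private Query createQuery(Text value) {
        // placeholder for the question's "...create Query object..." logic
        return null;
    }
}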
I am calling a function which has a @CacheEvict annotation on it. It is being called from a function that is itself executed asynchronously.
It seems that the cache is not being evicted after the function has been executed.
Here is the sample code:
#Async("executor1")
public void function1()
{
// do something
anotherFunction("name", 123, 12);
// do something more
}
#CacheEvict(cacheNames = {"cache1", "cache2", "cache3"}, key = "#testId")
public List<Integer> anotherFunction(String name, int testId, int packageId)
{
// some code here
}
What I want is that the entries corresponding to testId should be cleared from all the caches.
However, in another call, I can still see old entries in cache1. function1 is being called from the controller, and both functions are inside the service. Is this configuration correct? If yes, what may be the possible reasons that the cache is not being cleared?
Any help appreciated. Thanks in advance.
I think your problem is that Spring proxies are not reentrant. To implement @Async and @CacheEvict, Spring creates a proxy. So, in your example, the call stack will be:
A -> B$$proxy.function1() -> B.function1() -> B.anotherFunction()
B$$proxy contains the logic for the async execution and the cache eviction, which won't apply when anotherFunction is called directly. In fact, even if you remove the @Async, it still won't work.
A trick you can use is to inject the proxied bean into the class itself, so that you delegate to the proxy of the class instead of calling methods on this.
public class MyClass {

    private MyClass meWithAProxy;

    @Autowired
    ApplicationContext applicationContext;

    @PostConstruct
    public void init() {
        meWithAProxy = applicationContext.getBean(MyClass.class);
    }

    @Async("executor1")
    public void function1() {
        meWithAProxy.anotherFunction("name", 123, 12);
    }

    @CacheEvict(cacheNames = "cache1", key = "#testId")
    public List<Integer> anotherFunction(String name, int testId, int packageId) {
        return Collections.emptyList();
    }
}
It works. But there's a catch: if you now call anotherFunction directly, it won't work. I consider this to be a Spring bug and will file it as such.
I have a driver class, a mapper class and a reducer class. The MapReduce job runs fine, but the desired output is not produced. I have put System.out.println statements in the reducer. I looked at the logs of the mapper and the reducer: the System.out.println statements that I put in the mapper can be seen in the logs, but the println statements in the reducer are not. Could it be that the reducer is not launched at all?
This is the log file from the reducer.
I assume this question is based on the code in your earlier question: mapreduce composite Key sample - doesn't show the desired output
public class CompositeKeyReducer extends Reducer<Country, IntWritable, Country, IntWritable> {

    public void reduce(Country key, Iterator<IntWritable> values, Context context) throws IOException, InterruptedException {
    }
}
The reduce isn't running because the reduce method signature is wrong. You have:
public void reduce(Country key, Iterator<IntWritable> values, Context context)
It should be:
public void reduce(Country key, Iterable<IntWritable> values, Context context)
To make sure this doesn't happen again, you should add the @Override annotation to the method. The compiler will then tell you if you've got the signature wrong.
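As a sketch, the corrected reducer could look like the following; the body that sums the counts per key is an assumption, the point is the @Override plus the Iterable signature:
public class CompositeKeyReducer extends Reducer<Country, IntWritable, Country, IntWritable> {

    @Override   // fails to compile if the signature doesn't match Reducer.reduce(...)
    public void reduce(Country key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;   // assumed: sum the counts for each country
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}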
No change in the code. It works now.
All I did was restart my Hadoop Cloudera image, and it works now. I can't believe this happened.
This is a question regarding the performance of writable variables and allocation within a map reduce step. Here is a reducer:
static public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) {
        for (Text val : values) {
            context.write(key, new Text(val));
        }
    }
}
Or is this better performance-wise:
static public class MyReducer extends Reducer<Text, Text, Text, Text> {

    private Text myText = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) {
        for (Text val : values) {
            myText.set(val);
            context.write(key, myText);
        }
    }
}
In Hadoop: The Definitive Guide, all the examples use the first form, but I'm not sure whether that is just to keep the code samples short or because it is more idiomatic.
The book may use the first form because it is more concise. However, it is less efficient. For large input files, that approach will create a large number of objects. This excessive object creation would slow down your performance. Performance-wise, the second approach is preferable.
Some references that discuss this issue:
Tip 7 here,
On Hadoop object re-use, and
This JIRA.
Yes, the second approach is preferable if the reducer has a lot of data to process. The first approach keeps creating objects, and cleaning them up is left to the garbage collector.
Suppose I have a tab delimited file containing user activity data formatted like this:
timestamp user_id page_id action_id
I want to write a Hadoop job to count user actions on each page, so the output file should look like this:
user_id page_id number_of_actions
I need something like a composite key here; it would contain user_id and page_id. Is there any generic way to do this with Hadoop? I couldn't find anything helpful. So far I'm emitting the key like this in the mapper:
context.write(new Text(user_id + "\t" + page_id), one);
It works, but I feel that it's not the best solution.
Just compose your own Writable. In your example a solution could look like this:
public class UserPageWritable implements WritableComparable<UserPageWritable> {

    private String userId;
    private String pageId;

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readUTF();
        pageId = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeUTF(pageId);
    }

    @Override
    public int compareTo(UserPageWritable o) {
        return ComparisonChain.start().compare(userId, o.userId)
                .compare(pageId, o.pageId).result();
    }
}
Although I think your IDs could be longs, here you have the String version. This is basically just normal serialization over the Writable interface; note that Hadoop needs the default constructor, so you should always provide one.
The compareTo logic obviously tells how to sort the dataset, and it also tells the reducer which elements are equal so they can be grouped.
ComparisonChain is a nice utility from Guava.
Don't forget to override equals and hashCode! The partitioner will determine the reducer by the hashCode of the key.
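For completeness, a sketch of what the default constructor, equals and hashCode could look like for the UserPageWritable above (assuming both fields are non-null):
public UserPageWritable() {
    // no-arg constructor required by Hadoop for deserialization
}

public UserPageWritable(String userId, String pageId) {
    this.userId = userId;
    this.pageId = pageId;
}

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof UserPageWritable)) return false;
    UserPageWritable other = (UserPageWritable) o;
    return userId.equals(other.userId) && pageId.equals(other.pageId);
}

@Override
public int hashCode() {
    // the default HashPartitioner uses this hash to pick the reducer
    return 31 * userId.hashCode() + pageId.hashCode();
}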
You could write your own class that implements WritableComparable and compares your two fields.
How do I define an ArrayWritable for a custom Hadoop type? I am trying to implement an inverted index in Hadoop, with custom Hadoop types to store the data.
I have an IndividualPosting class which stores the term frequency, document id and list of byte offsets for the term in the document.
I have a Posting class which has a document frequency (number of documents the term appears in) and a list of IndividualPostings.
I have defined a LongArrayWritable extending the ArrayWritable class for the list of byte offsets in IndividualPosting.
When I defined a custom ArrayWritable for IndividualPosting, I encountered some problems after local deployment (using Karmasphere, Eclipse): all the IndividualPosting instances in the list in the Posting class would be the same, even though I get different values in the reduce method.
From the documentation of ArrayWritable:
A Writable for arrays containing instances of a class. The elements of this writable must all be instances of the same class. If this writable will be the input for a Reducer, you will need to create a subclass that sets the value to be of the proper type. For example:
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}
You've already cited doing this with a WritableComparable type defined by Hadoop. Here's what I assume your implementation looks like for LongWritable:
public static class LongArrayWritable extends ArrayWritable {

    public LongArrayWritable() {
        super(LongWritable.class);
    }

    public LongArrayWritable(LongWritable[] values) {
        super(LongWritable.class, values);
    }
}
You should be able to do this with any type that implements WritableComparable, as given by the documentation. Using their example:
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {

    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable other) {
        int thisValue = this.counter;
        int thatValue = other.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}
And that should be that. This assumes you're using revision 0.20.2 or 0.21.0 of the Hadoop API.
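Putting the two together, a subclass of ArrayWritable for your own type would follow the same pattern as the LongArrayWritable above; IndividualPostingArrayWritable is a hypothetical name and IndividualPosting stands for your WritableComparable implementation:
public static class IndividualPostingArrayWritable extends ArrayWritable {

    public IndividualPostingArrayWritable() {
        super(IndividualPosting.class);              // fixes the element class for deserialization
    }

    public IndividualPostingArrayWritable(IndividualPosting[] values) {
        super(IndividualPosting.class, values);
    }
}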