Getting friends within a specified degree with MapReduce - hadoop

Do you know how can I implement this algorithm using the MapReduce paradigm?
def getFriends(self, degree):
friendList = []
self._getFriends(degree, friendList)
return friendList
def _getFriends(self, degree, friendList):
friendList.append(self)
if degree:
for friend in self.friends:
friend._getFriends(degree-1, friendList)
Let's say that we have the following bi-directional friendships:
(1,2), (1,3), (1,4), (4,5), (4,6), (5,7), (5,8)
How can, for example, to get the 1st, 2nd and 3rd degree connections of user 1? The answer must be 1 -> 2, 3, 4, 5, 7, 8
Thanks

Maybe you can use hive which support the sql-like query!

As far as I understand, you want to collect all friends in the n-th circle of some person in a social graph. Most graph algorithms are recursive, and recursion is not well-suitable for a MapReduce way of solving tasks.
I can suggest you to use Apache Giraph to solve this problem (actually it uses MapReduce under the hood). It's mostly async and you write your jobs describing behaviour of a single node like:
1. Send a message from root node to all friends to get their friendlist.
2.1. Each friend sends a message with friendlist to root node.
2.2. Each friend sends a message to all it's sub-friends to get their friendlist.
3.1. Each sub-friend sends a message with friendlist to root node.
3.2. Each sub-friend sends a message to all it's sub-sub-friends to get their friendlist.
...
N. Root node collects all these messages and merges them in a single list.
Also you can use a cascade of map-reduce jobs to collect circles, but it's not very effective way to solve the task:
Export root user friends to a file circle-001
Use circle-001 as an input to a job that exports each user friends from circle-001 to a circle-002
Do the same, but use circle-002 as an input
...
Repeat N times
The first approach is more suitable if you have a lot of users to calculate their circles. The second has huge overhead of starting multiple MR jobs, but it's much simpler and is OK for small input set of users.

I am novice in this field but here is my though on that.
You could use a conventional BFS algorithm following the below pseudo code.
At each iteration you launch an Hadoop job that discovers all the child nodes of the current working set that were not yet visited.
BFS (list curNodes, list visited, int depth){
if (depth <= 0){
return visited;
}
//run Hadoop job on the current working set curNodes restricted by visited
//the job will populate some result list with the list of child nodes of the current working set
//then,
visited.addAll(result);
curNodes.empty();
curNodes.addAll(result);
BFS(curNodes, visited, depth-1);
}
The mapper and reducer of this job will look as below.
In this example I just used static members to hold the working set, visited and result sets.
It should have been implemented using a temp file. Probably there are ways to optimize the persistence of the temporary data accumulated from one iteration to the next.
the input file I used for the job contains list of topples one topple per line e.g.
1,2
2,3
5,4
...
...
public static class VertexMapper extends
Mapper<Object, Text, IntWritable, IntWritable> {
private static Set<IntWritable> curVertex = null;
private static IntWritable curLevel = null;
private static Set<IntWritable> visited = null;
private IntWritable key = new IntWritable();
private IntWritable value = new IntWritable();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(), ",");
if (itr.countTokens() == 2) {
String keyStr = itr.nextToken();
String valueStr = itr.nextToken();
try {
this.key.set(Integer.parseInt(keyStr));
this.value.set(Integer.parseInt(valueStr));
if (VertexMapper.curVertex.contains(this.key)
&& !VertexMapper.visited.contains(this.value)
&& !key.equals(value)) {
context.write(VertexMapper.curLevel, this.value);
}
} catch (NumberFormatException e) {
System.err.println("Found key,value <" + keyStr + "," + valueStr
+ "> which cannot be parsed as int");
}
} else {
System.err.println("Found malformed line: " + value.toString());
}
}
}
public static class UniqueReducer extends
Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
private static Set<IntWritable> result = new HashSet<IntWritable>();
public void reduce(IntWritable key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
for (IntWritable val : values) {
UniqueReducer.result.add(new IntWritable(val.get()));
}
// context.write(key, key);
}
}
Running a job will be something like that
UniqueReducer.result.clear();
VertexMapper.curLevel = new IntWritable(1);
VertexMapper.curVertex = new HashSet<IntWritable>(1);
VertexMapper.curVertex.add(new IntWritable(1));
VertexMapper.visited = new HashSet<IntWritable>(1);
VertexMapper.visited.add(new IntWritable(1));
Configuration conf = getConf();
Job job = new Job(conf, "BFS");
job.setJarByClass(BFSExample.class);
job.setMapperClass(VertexMapper.class);
job.setCombinerClass(UniqueReducer.class);
job.setReducerClass(UniqueReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
job.setOutputFormatClass(NullOutputFormat.class);
boolean result = job.waitForCompletion(true);
BFSExample bfs = new BFSExample();
ToolRunner.run(new Configuration(), bfs, args);

Related

Partitioner is not working correctly

I am trying to code one MapReduce scenario in which i have created some User ClickStream data in the form of JSON. After that i have written Mapper class to fetch the required data from the file my mapper code is :-
private final static String URL = "u";
private final static String Country_Code = "c";
private final static String Known_User = "nk";
private final static String Session_Start_time = "hc";
private final static String User_Id = "user";
private final static String Event_Id = "event";
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String aJSONRecord = value.toString();
try {
JSONObject aJSONObject = new JSONObject(aJSONRecord);
StringBuilder aOutputString = new StringBuilder();
aOutputString.append(aJSONObject.get(User_Id).toString()+",");
aOutputString.append(aJSONObject.get(Event_Id).toString()+",");
aOutputString.append(aJSONObject.get(URL).toString()+",");
aOutputString.append(aJSONObject.get(Known_User)+",");
aOutputString.append(aJSONObject.get(Session_Start_time)+",");
aOutputString.append(aJSONObject.get(Country_Code)+",");
context.write(new Text(aOutputString.toString()), key);
System.out.println(aOutputString.toString());
} catch (JSONException e) {
e.printStackTrace();
}
}
}
And my reducer code is :-
public void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
String aString = key.toString();
context.write(new Text(aString.trim()), new Text(""));
}
And my partitioner code is :-
public int getPartition(Text key, LongWritable value, int numPartitions) {
String aRecord = key.toString();
if(aRecord.contains(Country_code_Us)){
return 0;
}else{
return 1;
}
}
And here is my driver code
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Click Stream Analyzer");
job.setNumReduceTasks(2);
job.setJarByClass(ClickStreamDriver.class);
job.setMapperClass(ClickStreamMapper.class);
job.setReducerClass(ClickStreamReducer.class);
job.setPartitionerClass(ClickStreamPartitioner.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here i am trying to partition my data on the basis of country code. But its not working, it is sending each and every record in a single reducer file i think file other then the one created for US reduce.
One more thing when i see the output of mappers it shows some extra space added at the end of each record.
Please suggest if i am making any mistake here.
Your problem with the partitioning is due to the number of reducers. If it is 1, all your data will be sent to it, independently to what you return from your partitioner. Thus, setting mapred.reduce.tasks to 2 will solve this issue. Or you can simply write:
job.setNumReduceTasks(2);
In order to have 2 reducers as you want.
Unless you have very specific requirement, you can set reducers as below for job parameters.
mapred.reduce.tasks (in 1.x) & mapreduce.job.reduces(2.x)
Or
job.setNumReduceTasks(2) as per mark91 answer.
But leave the job to Hadoop fraemork by using below API. Framework will decide number of reducers as per the file & block sizes.
job.setPartitionerClass(HashPartitioner.class);
I have used NullWritable and it works. Now i can see records are getting partitioned in different files. Since i was using longwritable as a null value instead of null writable , space is added in the last of each line and due to this US was listed as "US " and partition was not able to divide the orders.

map emitted keys change inside reducer/combiner

I need to do two separate matrix multiplications (S*A and P*A) inside my mappers and emit the results of both. I know i can do that easily with two mapreduce job, but inorder to save running time I need to do them in one job. So what I do is that after doing both multiplications I put both of outputs in context object, but with different keys so that I can distinguish them inside reducer:
LongWritable One = new LongWritable();
One.set(1);
context.write(One, partialSA);
LongWritable two = new LongWritable();
two.set(2);
context.write(two, partialPA);
In reduce, I just need to add all partialSA matrices together and add all partialPA matrices together too. The problem is that If I use combiner, the emitted keys I receive inside combiner are 0 and 1 instead of 1 and two!!! And if i dont use combiner, inside reducer I receive 0 and 1 as keys instead of 1 and 2.
Why is this happening? what is the problem?
Here is the exact cleanup function of my mapper:
public void cleanup(Context context) throws IOException, InterruptedException{
LongWritable one = new LongWritable();
one.set(1);
LongWritable two = new LongWritable();
two.set(2)
context.write(one, partialSA);
context.write(two, partialPA);
}
Here is the reducer() code:
public void reduce(LongWritable key, Iterable<MatrixWritable> values, Context context) throws IOException, InterruptedException{
System.out.println("*** In reduce() **** "+key.get());
Iterator<MatrixWritable> itr = values.iterator();
if(key.get() == 1){
while(itr.hasNext()){
SA.addMatrices(itr.next());
}
}else if(key.get() == 2){
while(itr.hasNext()){
PA.addMatrices(itr.next());
}
}
}

why Hadoop combiner output not merged by reducer

I ran a simple wordcount MapReduce example adding combiner with a small change in combiner output, The output of combiner is not merged by reducer. scenario is as follows
Test:
Map -> Combiner ->Reducer
In combiner i added two extra lines to out put a word different and count 1, reducer is not suming the "different" word count. output pasted below.
Text t = new Text("different"); // Added a my own output
context.write(t, new IntWritable(1)); // Added my own output
public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable>{
#Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
Text t = new Text("different"); // Added my own output
context.write(t, new IntWritable(1)); // Added my own output
}
}
Input:
I ran a simple wordcount MapReduce example adding combiner with a small change in combiner output, The output of combiner is not merged by reducer. scenario is as follows
In combiner I added two extra lines to out put a word different and count 1, reducer is not suming the "different" word count. output pasted below.
Output:
"different" 1
different 1
different 1
I 2
different 1
In 1
different 1
MapReduce 1
different 1
The 1
different 1
...
How can this happen?
fullcode:
I ran wordcount program with combiner and just for fun i tweaked it in combiner, so i faced this issue.
I have three separate classes for mapper, combiner and reducer.
Driver:
public class WordCount {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// TODO Auto-generated method stub
Job job = Job.getInstance(new Configuration());
job.setJarByClass(wordcountmapper.class);
job.setJobName("Word Count");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(wordcountmapper.class);
job.setCombinerClass(wordcountcombiner.class);
job.setReducerClass(wordcountreducer.class);
job.getConfiguration().set("fs.file.impl", "com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true)? 0 : 1);
}
}
Mapper:
public class wordcountmapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text word = new Text();
IntWritable one = new IntWritable(1);
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String line = value.toString();
StringTokenizer token = new StringTokenizer(line);
while (token.hasMoreTokens())
{
word.set(token.nextToken());
context.write(word, one);
}
}
}
Combiner:
public class wordcountcombiner extends Reducer<Text, IntWritable, Text, IntWritable>{
#Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
Text t = new Text("different");
context.write(t, new IntWritable(1));
}
}
Reducer:
public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable>{
#Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
The output is normal because you're having two lines doing wrong things :
Why are you having this code
Text t = new Text("different"); // Added my own output
context.write(t, new IntWritable(1)); // Added my own output
In your reducer you're doing the sum and then you're adding to the output different 1 ....
You are writing in the final output of the job a new "1 different" in the reduce function, without doing any kind of aggregation. The reduce function is called once per key, as you can see in the method signature, it takes as arguments a key and the list of values for that key, which means that it is called once for each of the keys.
Since you are using as key a word, and in each call of reduce you are writing to the output "1 different", you will get one of those for each of the words in the input data.
hadoop requires that the reduce method in the combiner writes only the same key that it receives as input. This is required because hadoop sorts the keys only before the combiner is called, it does not re-sort them after the combiner has run. In your program, the reduce method writes the key "different" in addition to the key that it received as input. This means that the key "different" then appears in different positions in the order of keys, and these occurrences are not merged before they get passed to the reducer.
For example:
Assume the sorted list of keys output by the mapper is: "alpha", "beta", "gamma"
Your combiner is then called three times (once for "alpha", once for "beta", once for "gamma") and produces keys "alpha", "different", then keys "beta", "different", then keys "gamma", "different".
The "sorted" (but actually not sorted) list of keys after the combiner has executed is then:
"alpha", "different", "beta", "different", "gamma", "different"
This list does not get sorted again, so the different occurrences of "different" do not get merged.
The reducer is then called separately six times, and the key "different" appears 3 times in the output of the reducer.

Why is MultipleOutputs not working for this Map Reduce program?

I have a Mapper class that is giving a text key and IntWritable value which could be 1 two or three. Depending upon the values I have to write three different files with different keys. I am getting a Single File output with No record in it.
Also, is there any good Multiple Outputs example(with explanation) you could guide me to?
My Driver Class Had this code:
MultipleOutputs.addNamedOutput(job, "name", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "attributes", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "others", TextOutputFormat.class, Text.class, IntWritable.class);
My reducer class is:
public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {
private MultipleOutputs mos;
public void setup(Context context) {
mos = new MultipleOutputs(context);
}
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
String CheckKey = values.toString();
if("1".equals(CheckKey)) {
mos.write("name", key, new IntWritable(1));
}
else if("2".equals(CheckKey)) {
mos.write("attributes", key, new IntWritable(2));
}
else if("3".equals(CheckKey)) {
mos.write("others", key,new IntWritable(3));
}
/* for (IntWritable val : values) {
sum += val.get();
}*/
//context.write(key, null);
}
#Override
public void cleanup(Context context) throws IOException, InterruptedException {
mos.close();
}
}
P.S I am new to HADOOP/MAP-Reduce Programming.
ArrayList<Integer> l = new ArrayList<Integer>();
l.add(1);
System.out.println(l.toString());
results in "[1]" not 1 so
values.toString()
will not give "1"
Apart from that I just tried to print an Iterable and it just gave a reference, so that is definitely your problem. If you want to iterate over the values do as in the example below:
Iterator<Text> valueIterator = values.iterator();
while (valueIterator.hasNext()){
}
Note that you can only iterate once!
Your problem statement is muddled. What do you mean, "depending on the values"? The reducer gets an Iterable of values, not a single value. Something tells me that you need to move the multiple output code in your reducer inside the loop you have commented out for taking the sum.
Or perhaps you don't need a reducer at all and can take care of this in the map phase. If you are using the reduce phase to end up with exactly 4 files by using a single reduce task, then you can also achieve what you want by flipping the key and value in your map phase and forgetting about MultipleOutputs altogether, because you'll end up with only 3 working reduce tasks, one for each of your int values. To get the 4th one you can output two copies of the record in each map call using a special key to indicate that the output is meant for the normal file, not one of the three special files. Normally I would not recommend such a course of action as you have severe bounds on the level of parallelism you can achieve in the reduce phase when the number of keys is small.
You should also include some anomalous data handling code to the end of your 'if' ladder that increments a counter or something if you encounter a value that is not one of the three you are expecting.

Any suggestions for reading two different dataset into Hadoop at the same time?

Dear hadooper:
I'm new for hadoop, and recently try to implement an algorithm.
This algorithm needs to calculate a matrix, which represent the different rating of every two pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in my HDFS. Let's call this dataset A (size=160G)
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profile first(which is 5G size), let call this dataset B, and then calculate use the dataset A.
But now I don't know how to read the two dataset from a single hadoop program. Or can I read the dataset B into RAM then do the calculation?( I guess I can't, because the HDFS is a distribute system, and I can't read the dataset B into a single machine's memory).
Any suggestions?
You can use two Map function, Each Map Function Can process one data set if you want to implement different processing. You need to register your map with your job conf. For eg:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollectoroutput, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input outpur file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf();
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
Hadoop allows you to use different map input formats for different folders. So you can read from several datasources and then cast to specific type in Map function i.e. in one case you got (String,User) in other (String,SongSongRating) and you Map signature is (String,Object).
The second step is selection recommendation algorithm, join those data in some way so aggregator will have least information enough to calculate recommendation.

Resources