Hadoop MapReduce - I obtain the same value for all keys

I have a problem with MapReduce. Given as input a list of song records ("Songname"#"UserID"#"boolean"), I must produce a song list that states how many different users listened to each song, i.e. output of the form ("Songname","timelistening").
I used a Hashtable so that each (song, user) pair is counted only once.
With short files it works well, but when I use an input of about 1,000,000 records it returns the same value (20) for every song.
This is my mapper:
public static class CanzoniMapper extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable userID = new IntWritable(0);
    private Text song = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] caratteri = value.toString().split("#");
        if (caratteri[2].equals("1")) {
            song.set(caratteri[0]);
            userID.set(Integer.parseInt(caratteri[1]));
            context.write(song, userID);
        }
    }
}
This is my reducer:
public static class CanzoniReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        Hashtable<IntWritable, Text> doppioni = new Hashtable<IntWritable, Text>();
        for (IntWritable val : values) {
            doppioni.put(val, key);
        }
        result.set(doppioni.size());
        doppioni.clear();
        context.write(key, result);
    }
}
And the main method:
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(Canzoni.class);
job.setMapperClass(CanzoniMapper.class);
//job.setCombinerClass(CanzoniReducer.class);
//job.setNumReduceTasks(2);
job.setReducerClass(CanzoniReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Any ideas?

Maybe I solved it. It's an input problem. There were too many records compared to the number of songs, so in this list every song had been listened to at least once by every user.
My test data had 20 different users, so naturally the result is 20 for each song.
I need to increase the number of different songs.
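The counting logic itself can be checked outside Hadoop. The following is a minimal plain-Java sketch, not the asker's job: the class name DistinctListeners and the sample records are made up for illustration. It stores the user ID as a plain integer value in a set. That matters in a real reducer too, because Hadoop reuses the same Writable instance while iterating over the values, so keeping val.get() (or a copy) is safer than keeping the IntWritable object itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the mapper + reducer pair, using only the JDK.
final class DistinctListeners {
    /** Counts distinct user IDs per song from "Songname#UserID#flag" records with flag "1". */
    static Map<String, Integer> countDistinctUsers(List<String> records) {
        Map<String, Set<Integer>> usersBySong = new TreeMap<>();
        for (String record : records) {
            String[] fields = record.split("#");
            if (fields.length < 3 || !fields[2].equals("1")) {
                continue; // skip malformed lines and records whose flag is not "1"
            }
            // Store the integer value itself, never a reusable holder object.
            usersBySong.computeIfAbsent(fields[0], k -> new HashSet<>())
                       .add(Integer.parseInt(fields[1]));
        }
        Map<String, Integer> counts = new TreeMap<>();
        usersBySong.forEach((song, users) -> counts.put(song, users.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "SongA#1#1", "SongA#1#1", "SongA#2#1",
            "SongB#1#1", "SongB#3#0");
        System.out.println(countDistinctUsers(records));
    }
}
```

If every user in the input has listened to every song, this returns the number of users for each song, which matches the observation above.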

Related

How to reduce running time of prime BigInteger mapreduce code?

I am trying to generate prime BigIntegers of 1764 bits (531 digits). On my local computer this takes a very long time, so I tried MapReduce to generate the BigIntegers and ran it on single-node Cloudera (CDH 4). But the map phase takes a lot of time. Can I reduce the time by running the MapReduce job on a multi-node cluster? My second question: can this program be made more efficient, and how?
My input file consists of 90 entries, each containing "1764", the number of bits of the random BigInteger to generate. Here is my MapReduce code:
public final class Primes {
    public final static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final Job job = new Job(conf, "Primes");
        job.setJarByClass(Primes.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(PrimesMap.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

    public static final class PrimesMap extends Mapper<LongWritable, Text, NullWritable, Text> {
        final NullWritable nw = NullWritable.get();
        private Text str = new Text();

        public final void map(final LongWritable key, final Text value, final Context context)
                throws IOException, InterruptedException {
            final int number = Integer.parseInt(value.toString());
            // probablePrime is a static method, so call it on the class
            BigInteger num = BigInteger.probablePrime(number, new SecureRandom());
            str.set(num.toString());
            context.write(nw, str);
        }
    }
}
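Each prime here is generated independently of the others, so the work can be spread over as many workers as are available. One thing worth checking on the Hadoop side is how many map tasks actually run: a 90-line text file is small enough that it will normally end up in a single input split, which means one mapper does all 90 generations one after another (an input format such as NLineInputFormat is one way to get more splits). The sketch below only illustrates the parallelism on one machine with a thread pool; the class name ParallelPrimes and the sizes in main are assumptions for the example, not part of the question's code.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: generates probable primes on several threads.
final class ParallelPrimes {
    /** Generates count probable primes of the given bit length using a fixed thread pool. */
    static List<BigInteger> generate(int count, int bits, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<BigInteger>> pending = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                // Each task is independent: its own SecureRandom, no shared state.
                Callable<BigInteger> task = () -> BigInteger.probablePrime(bits, new SecureRandom());
                pending.add(pool.submit(task));
            }
            List<BigInteger> primes = new ArrayList<>();
            for (Future<BigInteger> f : pending) {
                primes.add(f.get()); // blocks until that task has finished
            }
            return primes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for primes", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("prime generation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        // 512 bits keeps the demo quick; the question uses 1764.
        for (BigInteger p : generate(4, 512, threads)) {
            System.out.println(p.bitLength() + " bits: " + p);
        }
    }
}
```

The cost per prime is unchanged; only the wall-clock time for the whole batch drops, roughly in proportion to the number of cores (or map tasks) that really run in parallel.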

Hadoop not all values get assembled for one key

I have some data that I would like to aggregate by key using Mapper code and then perform something on all values that belong to a key using Reducer code. For example if I have:
key = 1, val = 1,
key = 1, val = 2,
key = 1, val = 3
I would like to get key=1, val=[1,2,3] in my Reducer.
The thing is, I get something like
key = 1, val=[1,2]
key = 1, val=[3]
Why is that so?
I thought that all the values for one specific key will be assembled in one reducer, but now it seems that there can be more key, val [ ] pairs, since there can be multiple reducers, is that so?
Should I set number of reducers to be 1?
I'm new to Hadoop so this confuses me.
Here's the code
public class SomeJob {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(SomeJob.class);
        FileInputFormat.addInputPath(job, new Path("/home/pera/data/input/some.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/home/pera/data/output"));
        job.setMapperClass(SomeMapper.class);
        job.setReducerClass(SomeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.waitForCompletion(true);
    }
}

public class SomeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String parts[] = line.split(";");
        context.write(new Text(parts[0]), new Text(parts[4]));
    }
}

public class SomeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String properties = "";
        for (Text value : values)
        {
            properties += value + " ";
        }
        context.write(key, new Text(properties));
    }
}
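On the question of reducer count: the default partitioner sends a pair to reducer (hash of key & Integer.MAX_VALUE) % numReduceTasks, so every pair with the same key lands on the same reducer no matter how many reducers there are, and that reducer sees all of the key's values in one call. The plain-Java sketch below simulates this with String keys standing in for Text; the class name ShuffleSketch and the sample pairs are made up. It also shows one possible cause of a "split" group, offered as an assumption rather than a diagnosis: keys that print the same but are not identical, such as "1" and "1 " with a trailing space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of partitioning and grouping by key.
final class ShuffleSketch {
    /** Same formula as Hadoop's default HashPartitioner, applied to a String key. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Assigns each (key, value) pair to a reducer, then groups the values per key. */
    static List<Map<String, List<String>>> shuffle(String[][] pairs, int numReduceTasks) {
        List<Map<String, List<String>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String[] kv : pairs) {
            reducers.get(partition(kv[0], numReduceTasks))
                    .computeIfAbsent(kv[0], k -> new ArrayList<>())
                    .add(kv[1]);
        }
        return reducers;
    }

    public static void main(String[] args) {
        // The last key has a trailing space, so it forms its own group.
        String[][] pairs = {{"1", "1"}, {"1", "2"}, {"1", "3"}, {"1 ", "4"}};
        List<Map<String, List<String>>> reducers = shuffle(pairs, 3);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

Under this model, setting the number of reducers to 1 changes where groups are processed, not how values are grouped; trimming the key in the mapper is a cheap way to rule out the whitespace case.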

Writing a value to file without moving to reducer

I have an input of records like this,
a|1|Y,
b|0|N,
c|1|N,
d|2|Y,
e|1|Y
Now, in the mapper, I have to check the value of the third column. If it is 'Y', the record has to be written directly to the output file without going to the reducer; otherwise ('N'), the record has to go to the reducer for further processing.
So,
a|1|Y,
d|2|Y,
e|1|Y
should not go to reducer but
b|0|N,
c|1|N
should go to reducer and then to output file.
How can I do this?
What you can probably do is use MultipleOutputs to separate the 'Y' and 'N' records into two different files in the mappers.
Next, run separate jobs for the two newly generated 'Y' and 'N' data sets.
For the 'Y' type, set the number of reducers to 0 so that no reducers are used. For the 'N' type, process it the way you want using reducers.
Hope this helps.
See if this works,
public class Xxxx {
    public static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Random r = new Random();
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();
            FSDataOutputStream out = fs.create(new Path(fileName + "-m-" + r.nextInt()));
            String parts[];
            String line = value.toString();
            String[] splits = line.split(",");
            for (String s : splits) {
                parts = s.split("\\|");
                if (parts[2].equals("Y")) {
                    out.writeBytes(line);
                } else {
                    context.write(key, value);
                }
            }
            out.close();
            fs.close();
        }
    }

    public static class MyReducer extends
            Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterable<Text> values,
                Context context) throws IOException, InterruptedException {
            for (Text t : values) {
                context.write(key, t);
            }
        }
    }

    /**
     * @param args
     * @throws IOException
     * @throws InterruptedException
     * @throws ClassNotFoundException
     */
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        Job job = new Job(conf, "Xxxx");
        job.setJarByClass(Xxxx.class);
        Path outPath = new Path("/output_path");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        FileInputFormat.addInputPath(job, new Path("/input.txt"));
        FileOutputFormat.setOutputPath(job, outPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In your map function, you get the input line by line. Split each line on the | delimiter using the String.split() method. Because split() takes a regular expression, the pipe has to be escaped.
It will look like this:
String[] line = value.toString().split("\\|");
Access the third element of this array with line[2].
Then, using a simple if/else statement, emit only the records with the N value for further processing.
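The split-and-branch step can be tried out without a cluster. This is a minimal sketch under two assumptions: each record arrives as its own line without the trailing comma shown in the sample, and the class RecordRouter with its two lists is a hypothetical stand-in for the two destinations. In a real mapper the 'Y' branch would be a direct write (for example through MultipleOutputs, as suggested above) and the 'N' branch a context.write.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper's routing decision.
final class RecordRouter {
    final List<String> direct = new ArrayList<>();    // 'Y' records: written straight to output
    final List<String> toReducer = new ArrayList<>(); // 'N' records: passed on to the reducer

    void route(String record) {
        // split() takes a regular expression, so the pipe must be escaped.
        String[] parts = record.trim().split("\\|");
        if (parts.length < 3) {
            return; // ignore malformed records
        }
        if (parts[2].equals("Y")) {
            direct.add(record);
        } else {
            toReducer.add(record);
        }
    }

    public static void main(String[] args) {
        RecordRouter router = new RecordRouter();
        for (String record : new String[] {"a|1|Y", "b|0|N", "c|1|N", "d|2|Y", "e|1|Y"}) {
            router.route(record);
        }
        System.out.println("direct:    " + router.direct);
        System.out.println("toReducer: " + router.toReducer);
    }
}
```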

How to pass variable between two map reduce jobs

I have chained two MapReduce jobs. Job1 has only one reducer, in which I compute a float value. I want to use this value in the reducer of Job2. This is my main method setup.
public static String GlobalVriable;

public static void main(String[] args) throws Exception {
    int runs = 0;
    for (; runs < 10; runs++) {
        String inputPath = "part-r-000" + nf.format(runs);
        String outputPath = "part-r-000" + nf.format(runs + 1);
        MyProgram.MR1(inputPath);
        MyProgram.MR2(inputPath, outputPath);
    }
}

public static void MR1(String inputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    conf.set("var1", "");
    Job job = new Job(conf, "This is job1");
    job.setJarByClass(MyProgram.class);
    job.setMapperClass(MyMapper1.class);
    job.setReducerClass(MyReduce1.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.waitForCompletion(true);
    GlobalVriable = conf.get("var1"); // I am getting NULL here
}

public static void MR2(String inputPath, String outputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "This is job2");
    ...
}

public static class MyReduce1 extends
        Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float s = 0;
        for (FloatWritable val : values) {
            s += val.get();
        }
        String sum = Float.toString(s);
        context.getConfiguration().set("var1", sum);
    }
}
As you can see, I need to iterate the entire program multiple times. Job1 computes a single number from the input. Since it is just a single number and there are a lot of iterations, I don't want to write it to HDFS and read it back. Is there a way to share the value computed in MyReduce1 and use it in the reducer of Job2?
UPDATE: I have tried passing the value using conf.set and conf.get. The value is not being passed.
Here's how to pass back a float value via a counter ...
First, in the first reducer, transform the float value into a long by multiplying by 1000 (to maintain 3 digits of precision, for example) and putting the result into a counter:
public void cleanup(Context context) {
    long result = (long) (floatValue * 1000);
    context.getCounter("Result", "Result").increment(result);
}
In the driver class, retrieve the long value and transform it back to a float:
public static void MR1(String inputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "This is job1");
    job.setJarByClass(MyProgram.class);
    job.setMapperClass(MyMapper1.class);
    job.setReducerClass(MyReduce1.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.waitForCompletion(true);
    long result = job.getCounters().findCounter("Result", "Result").getValue();
    float value = ((float) result) / 1000;
}
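The float-to-long round trip used by the counter approach can be checked on its own. The sketch below is plain Java with a hypothetical class name, CounterCodec; it mirrors the multiply-by-1000 encoding from the answer and makes two of its properties visible: the cast truncates toward zero, so digits beyond the third decimal are lost, and because a counter is incremented rather than set, the encoding only returns the intended value if the increment happens exactly once (one reducer, one cleanup call).

```java
// Hypothetical helper mirroring the scaling used with the job counter.
final class CounterCodec {
    static final int SCALE = 1000; // keeps 3 decimal digits

    /** Encodes a float as a long counter value; the cast truncates toward zero. */
    static long encode(float value) {
        return (long) (value * SCALE);
    }

    /** Decodes a counter value back to a float. */
    static float decode(long counter) {
        return ((float) counter) / SCALE;
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        long counter = encode(original);
        System.out.println(original + " -> " + counter + " -> " + decode(counter));
    }
}
```

A larger SCALE keeps more digits at the cost of a smaller representable range; which trade-off is acceptable depends on the values Job1 produces.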
You could use ZooKeeper for this. It's great for any inter-job coordination or message passing like this.
Can't you just change the return type of MR1 to int (or whatever data type is appropriate) and return the number you computed:
int myNumber = MyProgram.MR1(inputPath);
Then add a parameter to MR2 and call it with your computed number:
MyProgram.MR2(inputPath, outputPath, myNumber);

Writing text output from map function hadoop

Input :
a,b,c,d,e
q,w,34,r,e
1,2,3,4,e
In mapper, I would grab all the values of the last field, and I want to emit (e,(a,b,c,d)) i.e. it emits (key, (rest of the fields from the line)).
Help appreciated.
Current code:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString(); // reads the input line by line
        String[] attr = line.split(","); // extract each attribute values from the csv record
        context.write(attr[argno-1],line); // gives error seems to like only integer? how to override this?
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // further process , loads the chunk into 2d arraylist object for processing
    }
}

public static void main(String[] args) throws Exception {
    String line;
    String arguements[];
    Configuration conf = new Configuration();
    // compute the total number of attributes in the file
    FileReader infile = new FileReader(args[0]);
    BufferedReader bufread = new BufferedReader(infile);
    line = bufread.readLine();
    arguements = line.split(","); // split the fields separated by comma
    conf.setInt("argno", arguements.length); // saving that attribute value
    Job job = new Job(conf, "nb");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class); /* The method setMapperClass(Class<? extends Mapper>) in the type Job is not applicable for the arguments (Class<Map>) */
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Please note the errors I get (see comments).
So this is simple. First parse your string to get the key and pass the rest of the line as the value. Then use the identity reducer, which receives all values for the same key together.
So your map function will output:
e, (a,b,c,d,e)
e, (q,w,34,r,e)
e, (1,2,3,4,e)
After the shuffle, the reducer sees them grouped under the one key:
e, {a,b,c,d,e; q,w,34,r,e; 1,2,3,4,e}
Note that the identity reducer writes each value back out as its own key/value line; to emit one combined record per key, the reducer has to join the values itself.
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString(); // one input line per call
        String[] attr = line.split(","); // extract each attribute value from the CSV record
        word.set(attr[attr.length - 1]); // the last field is the key; context.write needs Text, not String
        context.write(word, value); // emit (last field, whole line)
    }
}
public static void main(String[] args) throws Exception {
    String line;
    String arguements[];
    Configuration conf = new Configuration();
    // compute the total number of attributes in the file
    FileReader infile = new FileReader(args[0]);
    BufferedReader bufread = new BufferedReader(infile);
    line = bufread.readLine();
    arguements = line.split(","); // split the fields separated by comma
    conf.setInt("argno", arguements.length); // saving that attribute value
    Job job = new Job(conf, "nb");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
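The question asks for the last field as the key and the remaining fields as the value, while the answer above passes the whole line as the value. The following plain-Java sketch shows the first variant together with the grouping a reducer would see; the class name LastFieldKey and both method names are made up for illustration, and the grouping is only a simulation of what the shuffle does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the map step plus the grouping done by the shuffle.
final class LastFieldKey {
    /** Splits "a,b,c,d,e" into key "e" and value "a,b,c,d". */
    static String[] toKeyValue(String line) {
        int cut = line.lastIndexOf(',');
        if (cut < 0) {
            return new String[] {line, ""}; // single field: it is the key, the value is empty
        }
        return new String[] {line.substring(cut + 1), line.substring(0, cut)};
    }

    /** Groups the values by key, in input order. */
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] kv = toKeyValue(line);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("a,b,c,d,e", "q,w,34,r,e", "1,2,3,4,e")));
    }
}
```

Using lastIndexOf avoids needing the field count (the argno value) in the mapper at all, as long as the key is always the final field.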
Found alternate logic. Implemented, tested and verified.
