How to reduce running time of prime BigInteger mapreduce code? - performance

I am trying to generate prime BigIntegers of 1764 bits (531 digits). When I do this on my local computer it takes a very long time, so I tried MapReduce to generate the BigIntegers, running on single-node Cloudera (CDH 4). But the mapping still takes a lot of time. Can I reduce the time by applying MapReduce and running it on a multi-node cluster? And my second question: can this program be improved for better efficiency, and how?
My input file consists of 90 entries containing "1764", which is the bit length of each random BigInteger to generate. Here is my MapReduce code:
import java.io.IOException;
import java.math.BigInteger;
import java.security.SecureRandom;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public final class Primes {
    public final static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final Job job = new Job(conf, "Primes");
        job.setJarByClass(Primes.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(PrimesMap.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

    public static final class PrimesMap extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final NullWritable nw = NullWritable.get();
        private final Text str = new Text();

        public final void map(final LongWritable key, final Text value, final Context context)
                throws IOException, InterruptedException {
            // each input line holds the requested bit length (here always 1764)
            final int number = Integer.parseInt(value.toString());
            // probablePrime is a static factory; no intermediate BigInteger is needed
            final BigInteger num = BigInteger.probablePrime(number, new SecureRandom());
            str.set(num.toString());
            context.write(nw, str);
        }
    }
}
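Since the input is a tiny 90-line file, it will normally land in a single split and therefore a single map task, no matter how many nodes the cluster has. One way to spread the 90 probablePrime calls across map tasks (not from the original post, just a sketch) is NLineInputFormat, which puts a fixed number of input lines into each split:

// import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
// in main(), replace the TextInputFormat line with:
job.setInputFormatClass(NLineInputFormat.class);
// one input line, i.e. one 1764-bit prime request, per split and hence per map task
NLineInputFormat.setNumLinesPerSplit(job, 1);
// the existing FileInputFormat.addInputPath(...) call stays as it is

On a single node this only helps up to the number of available map slots/cores; on a multi-node cluster the tasks can actually run on different machines. The probablePrime call itself stays just as expensive, so the overall speedup is bounded by how many tasks run concurrently.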

Related

MultipleOutputFormat does only one iteration in Reduce step

Here is my Reducer. The Reducer takes in EdgeWritable and NullWritable.
An EdgeWritable holds 4 integers, say <71, 74, 7, 2000>:
the communication is from 71 (FromID) to 74 (ToID) in 7 (July) of 2000 (Year).
The Mapper outputs 10787 such records to the Reducer, but the Reducer only outputs 1.
I need to output 44 files, one for each of the 44 months between Oct 1998 and July 2002. The output should be named "out" + month + year; for example, the July 2002 records should go to file out72002.
I have debugged the code. After one iteration it outputs one file and stops without taking the next record. Please suggest how I should use MultipleOutputs. Thanks.
public class MultipleOutputReducer extends Reducer<EdgeWritable, NullWritable, IntWritable, IntWritable> {
    private MultipleOutputs<IntWritable, IntWritable> multipleOutputs;

    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<IntWritable, IntWritable>(context);
    }

    @Override
    public void reduce(EdgeWritable key, Iterable<NullWritable> val, Context context)
            throws IOException, InterruptedException {
        int year  = key.get(3).get();
        int month = key.get(2).get();
        int to    = key.get(1).get();
        int from  = key.get(0).get();
        //if(year >= 1997 && year <= 2001){
        if ((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)) {
            multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out" + month + year);
        }
        //}
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}
Driver
public class TimeSlicingDriver extends Configured implements Tool {

    static final SimpleDateFormat sdf = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Enter <input path> <output path>");
            System.exit(-1);
        }
        Configuration setup = new Configuration();
        //setup.set("Input Path", args[0]);
        Job job = new Job(setup, "Time Slicing");
        //job.setJobName("Time Slicing");
        job.setJarByClass(TimeSlicingDriver.class);
        job.setMapperClass(TimeSlicingMapper.class);
        job.setReducerClass(MultipleOutputReducer.class);
        //MultipleOutputs.addNamedOutput(setup, "output", org.apache.hadoop.mapred.TextOutputFormat.class, EdgeWritable.class, NullWritable.class);
        job.setMapOutputKeyClass(EdgeWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        /** Set the input file path and output file path */
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
You are not iterating over your Iterable "val"; for that reason the logic in your code is executed only once for each key group.
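A minimal sketch of that suggestion (keeping the poster's EdgeWritable and its get(i) accessors): iterate the values so one record is written for every incoming mapper record rather than once per key group.

@Override
public void reduce(EdgeWritable key, Iterable<NullWritable> val, Context context)
        throws IOException, InterruptedException {
    int year  = key.get(3).get();
    int month = key.get(2).get();
    int to    = key.get(1).get();
    int from  = key.get(0).get();
    if ((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)) {
        for (NullWritable ignored : val) {
            // one write per record the mappers emitted for this edge
            multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out" + month + year);
        }
    }
}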

Hbase MapReduce job using MultipleInputs: cannot cast LongWritable to ImmutableBytesWritable

I am writing an MR job which takes an HBase table as input and dumps it to HDFS files. I use the MultipleInputs class (from Hadoop) since I plan to take multiple data sources. I wrote a very simple MR program (see the source code below). Unfortunately, I run into the following error:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hbase.io.ImmutableBytesWritable
I run on pseudo-distributed Hadoop (1.2.0) and pseudo-distributed HBase (0.95.1-hadoop1).
Here is the complete source code. An interesting thing: if I comment out the MultipleInputs line "MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, TableMap.class);", the MR job runs fine.
public class MixMR {

    public static class TableMap extends TableMapper<Text, Text> {
        public static final byte[] CF = "cf".getBytes();
        public static final byte[] ATTR1 = "c1".getBytes();

        public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
            String key = Bytes.toString(row.get());
            String val = new String(value.getValue(CF, ATTR1));
            context.write(new Text(key), new Text(val));
        }
    }

    public static class Reduce extends Reducer<Object, Text, Object, Text> {
        public void reduce(Object key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String ks = key.toString();
            for (Text val : values) {
                context.write(new Text(ks), val);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path inputPath1 = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        String tableName1 = "test";

        Configuration config = HBaseConfiguration.create();
        Job job = new Job(config, "ExampleRead");
        job.setJarByClass(MixMR.class);   // class that contains mapper

        Scan scan = new Scan();
        scan.setCaching(500);             // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);       // don't set to true for MR jobs
        scan.addFamily(Bytes.toBytes("cf"));

        TableMapReduceUtil.initTableMapperJob(
                tableName1,               // input HBase table name
                scan,                     // Scan instance to control CF and attribute selection
                TableMap.class,           // mapper
                Text.class,               // mapper output key
                Text.class,               // mapper output value
                job);
        job.setReducerClass(Reduce.class); // reducer class
        job.setOutputFormatClass(TextOutputFormat.class);

        // inputPath1 here has no effect for HBase table
        MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, TableMap.class);

        FileOutputFormat.setOutputPath(job, outputPath);
        job.waitForCompletion(true);
    }
}
I got the answer: in the following statement, replace TextInputFormat.class with TableInputFormat.class:
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, TableMap.class);
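For reference, the corrected line would look like this (TableInputFormat here is the one from org.apache.hadoop.hbase.mapreduce, whose keys are ImmutableBytesWritable rather than LongWritable, which is what causes the ClassCastException with TextInputFormat):

// import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
MultipleInputs.addInputPath(job, inputPath1, TableInputFormat.class, TableMap.class);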

How to pass variable between two map reduce jobs

I have chained two MapReduce jobs. Job1 has only one reducer, in which I compute a float value. I want to use this value in the reducer of Job2. This is my main method setup.
public static String GlobalVriable;

public static void main(String[] args) throws Exception {
    int runs = 0;
    for (; runs < 10; runs++) {
        String inputPath = "part-r-000" + nf.format(runs);
        String outputPath = "part-r-000" + nf.format(runs + 1);
        MyProgram.MR1(inputPath);
        MyProgram.MR2(inputPath, outputPath);
    }
}

public static void MR1(String inputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    conf.set("var1", "");
    Job job = new Job(conf, "This is job1");
    job.setJarByClass(MyProgram.class);
    job.setMapperClass(MyMapper1.class);
    job.setReducerClass(MyReduce1.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.waitForCompletion(true);
    GlobalVriable = conf.get("var1"); // I am getting NULL here
}

public static void MR2(String inputPath, String outputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "This is job2");
    ...
}

public static class MyReduce1 extends
        Reducer<Text, FloatWritable, Text, FloatWritable> {

    public void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float s = 0;
        for (FloatWritable val : values) {
            s += val.get();
        }
        String sum = Float.toString(s);
        context.getConfiguration().set("var1", sum);
    }
}
As you can see, I need to iterate the entire program multiple times. Job1 computes a single number from the input. Since it is just a single number and there are a lot of iterations, I don't want to write it to HDFS and read it back. Is there a way to share the value computed in MyReduce1 and use it in MyReduce2?
UPDATE: I have tried passing the value using conf.set and conf.get. The value is not being passed.
Here's how to pass back a float value via a counter ...
First, in the first reducer, transform the float value into a long by multiplying by 1000 (to maintain 3 digits of precision, for example) and putting the result into a counter:
public void cleanup(Context context) {
    long result = (long) (floatValue * 1000);
    context.getCounter("Result", "Result").increment(result);
}
In the driver class, retrieve the long value and transform it back to a float:
public static void MR1(String inputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "This is job1");
    job.setJarByClass(MyProgram.class);
    job.setMapperClass(MyMapper1.class);
    job.setReducerClass(MyReduce1.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.waitForCompletion(true);

    long result = job.getCounters().findCounter("Result", "Result").getValue();
    float value = ((float) result) / 1000;
}
You could use ZooKeeper for this. It's great for any inter-job coordination or message passing like this.
Can't you just change the return type of MR1 to int (or whatever data type is appropriate) and return the number you computed:
int myNumber = MyProgram.MR1(inputPath);
Then add a parameter to MR2 and call it with your computed number:
MyProgram.MR2(inputPath, outputPath, myNumber);
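A minimal sketch of that idea (names are illustrative, not from the original posts): MR2 takes the extra parameter and publishes it through the second job's Configuration, so Job2's mappers and reducers can read it.

public static void MR2(String inputPath, String outputPath, float myNumber)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    // make the Job1 result visible to Job2's tasks
    conf.setFloat("job1.result", myNumber);
    Job job = new Job(conf, "This is job2");
    // ... rest of the Job2 setup as before ...
}

// inside MyReduce2, the value can then be read back with:
// float job1Result = context.getConfiguration().getFloat("job1.result", 0f);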

Writing text output from map function hadoop

Input :
a,b,c,d,e
q,w,34,r,e
1,2,3,4,e
In the mapper, I grab the value of the last field from each line and want to emit (e, (a,b,c,d)), i.e. (key, (rest of the fields from the line)).
Help appreciated.
Current code:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString(); // reads the input line by line
        String[] attr = line.split(","); // extract each attribute values from the csv record
        context.write(attr[argno-1], line); // gives error seems to like only integer? how to override this?
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // further process , loads the chunk into 2d arraylist object for processing
    }
}

public static void main(String[] args) throws Exception {
    String line;
    String arguements[];
    Configuration conf = new Configuration();

    // compute the total number of attributes in the file
    FileReader infile = new FileReader(args[0]);
    BufferedReader bufread = new BufferedReader(infile);
    line = bufread.readLine();
    arguements = line.split(","); // split the fields separated by comma
    conf.setInt("argno", arguements.length); // saving that attribute value

    Job job = new Job(conf, "nb");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class); /* The method setMapperClass(Class<? extends Mapper>) in the type Job is not applicable for the arguments (Class<Map>) */
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Please note the errors (see the comments) I face.
So this is simple. First parse your string to get the key and pass the rest of the line as the value. Then use the identity reducer, which will combine all the values for the same key into a list as your output. It should be in the same format.
So your map function will output:
e, (a,b,c,d,e)
e, (q,w,34,r,e)
e, (1,2,3,4,e)
Then after the identity reduce it should output:
e, {a,b,c,d,e; q,w,34,r,e; 1,2,3,4,e}
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private Text rest = new Text();
    private int argno;

    @Override
    protected void setup(Context context) {
        // attribute count stored in the Configuration by main()
        argno = context.getConfiguration().getInt("argno", 1);
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();   // reads the input line by line
        String[] attr = line.split(","); // extract each attribute value from the csv record
        word.set(attr[argno - 1]);        // key: the last field
        rest.set(line);                   // value: the whole line
        context.write(word, rest);        // write Text, Text (not String), matching the declared output types
    }
}

public static void main(String[] args) throws Exception {
    String line;
    String arguements[];
    Configuration conf = new Configuration();

    // compute the total number of attributes in the file
    FileReader infile = new FileReader(args[0]);
    BufferedReader bufread = new BufferedReader(infile);
    line = bufread.readLine();
    arguements = line.split(","); // split the fields separated by comma
    conf.setInt("argno", arguements.length); // saving that attribute count

    Job job = new Job(conf, "nb");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class);
    // no setReducerClass: the default identity reducer is used
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Found alternate logic. Implemented, tested, and verified.

HADOOP - Mapreduce - I obtain same value for all keys

I have a problem with MapReduce. Given as input a list of songs ("Songname"#"UserID"#"boolean"), I must produce as a result a song list that specifies how many different users listened to each song, so an output of ("Songname", "timeslistening").
I used a Hashtable to keep only one (user, song) pair.
With short files it works well, but when I give as input a list of about 1,000,000 records it returns the same value (20) for all records.
This is my mapper:
public static class CanzoniMapper extends Mapper<Object, Text, Text, IntWritable> {
    private IntWritable userID = new IntWritable(0);
    private Text song = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] caratteri = value.toString().split("#");
        if (caratteri[2].equals("1")) {
            song.set(caratteri[0]);
            userID.set(Integer.parseInt(caratteri[1]));
            context.write(song, userID);
        }
    }
}
This is my reducer:
public static class CanzoniReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        Hashtable<IntWritable, Text> doppioni = new Hashtable<IntWritable, Text>();
        for (IntWritable val : values) {
            doppioni.put(val, key);
        }
        result.set(doppioni.size());
        doppioni.clear();
        context.write(key, result);
    }
}
and main:
Configuration conf = new Configuration();
Job job = new Job(conf, "word count");
job.setJarByClass(Canzoni.class);
job.setMapperClass(CanzoniMapper.class);
//job.setCombinerClass(CanzoniReducer.class);
//job.setNumReduceTasks(2);
job.setReducerClass(CanzoniReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Any idea???
Maybe I solved it. It's an input problem: there were too many records compared to the number of songs, so in this record list every song was listened to at least once by every user.
In my test I had 20 different users, so naturally the result gives me 20 for each song.
I must increase the number of different songs.
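A side note not raised in the thread: Hadoop reuses the IntWritable instance handed out by the values iterator, so collections that store the Writable objects themselves can behave unexpectedly. A defensive variant of the reducer (a sketch, assuming the goal is counting distinct listeners per song) copies the primitive value out first:

public static class CanzoniReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        java.util.Set<Integer> distinctUsers = new java.util.HashSet<Integer>();
        for (IntWritable val : values) {
            distinctUsers.add(val.get()); // copy the int out of the reused Writable
        }
        result.set(distinctUsers.size());
        context.write(key, result);
    }
}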

Resources