Write MapReduce code to search for a pattern - Hadoop

customer_id, server_id, code
12342344, 1232, 3
12345456, 1433, 2
16345436, 2343, 4
12245456, 1434, 3
11145456, 1436, 2
If I'm running this query on Hive:
select * from table where code=3;
what would the equivalent MapReduce code be, and where can I find it?
In other words, how can I write a MapReduce job that gives the same results as the query above?
Thanks

You can run EXPLAIN select * from table where code=3; from Hive to see Hive's query execution plan, but Hive doesn't output any MapReduce code for a query. Also check YSmart - Another SQL-to-MapReduce Translator - to see MapReduce jobs generated for SQL. I never used it, but you can try it out.
This is a sample; you can modify it for other queries to get an idea. Basically, you filter the data in the map class and run any aggregations in the reduce class, but in this example you don't need a reduce class since it is a simple filter over the data. You should go through WordCount (the hello-world example of Hadoop) first in order to learn how to run this program, where to find the output, and so on.
Output:
12245456, 1434, 3
12342344, 1232, 3
Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class customers{
public static class customersMapper extends Mapper<Object, Text, Text, Text> {
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// the code column could also be parsed as an int - try it for yourself (see the variant below)
String code = value.toString().split(",")[2].trim();
// this comparison changes if the code variable is parsed as an int
if (code.equals("3")){ // only emit records where code=3
context.write(value, new Text(""));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "get-customer-with-code=3");
job.setJarByClass(customers.class);
job.setMapperClass(customersMapper.class);
job.setNumReduceTasks(0); // map-only job, no reducer needed
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
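As the comments in the mapper suggest, the code column can also be parsed as an integer; here is a minimal sketch of that variant of the map method (same class and job setup as above assumed):

@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // parse the third column as an int instead of comparing strings;
    // a header line such as "customer_id, server_id, code" would have to be skipped
    // (or the parse wrapped in a try/catch) to avoid a NumberFormatException
    int code = Integer.parseInt(value.toString().split(",")[2].trim());
    if (code == 3) { // only emit records where code=3
        context.write(value, new Text(""));
    }
}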

Related

How do I return the output of a Hadoop MapReduce job as value/key instead of key/value?

For example, the typical WordCount mapreduce might return an output that reads:
hello 3
world 4
again 1
I want to format the output slightly differently so that it would show this instead:
3 hello
4 world
1 again
I've read a lot of posts about sorting by the value, and the answers suggested a second MapReduce job on the output of the first one. However, I don't need to sort by the value, and it's possible that multiple keys have the same value; I don't want them to be lumped together.
Is there an easy way to simply switch the order the key/values are printed? It seems like it should be simple.
Two options to consider in order of ease are:
Switch the Key/Value in the Reduce
Modify the output from the reduce to switch the key and value. For example, the reducer in Hadoop's example WordCount job would change to:
public static class IntSumReducer extends Reducer<Text, IntWritable, IntWritable, Text> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(result, key);
}
}
Here context.write(result, key); has been changed to switch the key and value, and the reducer's output type parameters are swapped to match.
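Since the reducer now emits (IntWritable, Text), the driver's output type declarations have to be swapped as well. A minimal sketch, assuming the standard WordCount driver:

// map output types still differ from the final output types, so declare them explicitly
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// the job's final output types must match the reducer's new output types
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);

Also note that the stock WordCount driver registers IntSumReducer as a combiner; with the swapped output types that combiner would no longer match the map output types, so it would have to be removed or kept as a separate class with the original (Text, IntWritable) outputs.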
Use a second Map only job
You can use the InverseMapper provided by Hadoop to run a map-only job (0 reducers) to switch the key and value. So you would just have a second job and only need to write the driver, which would look something like this:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Switch inputs");
job.setJarByClass(WordCount.class);
job.setMapperClass(InverseMapper.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Note that you would want the first job to write its output using SequenceFileOutputFormat, and use SequenceFileInputFormat as the input format of the second job.
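A minimal sketch of that wiring, where job1 is the first (WordCount-style) job, job2 is the InverseMapper driver above, and intermediateDir is an illustrative path for the intermediate data:

// first job: write its output as a sequence file
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job1, intermediateDir);
// second job: read that sequence file back in
job2.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job2, intermediateDir);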

Partitioner is not working correctly

I am trying to code a MapReduce scenario in which I have created some user clickstream data in the form of JSON. Then I wrote a Mapper class to fetch the required data from the file. My mapper code is:
private final static String URL = "u";
private final static String Country_Code = "c";
private final static String Known_User = "nk";
private final static String Session_Start_time = "hc";
private final static String User_Id = "user";
private final static String Event_Id = "event";
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String aJSONRecord = value.toString();
try {
JSONObject aJSONObject = new JSONObject(aJSONRecord);
StringBuilder aOutputString = new StringBuilder();
aOutputString.append(aJSONObject.get(User_Id).toString()+",");
aOutputString.append(aJSONObject.get(Event_Id).toString()+",");
aOutputString.append(aJSONObject.get(URL).toString()+",");
aOutputString.append(aJSONObject.get(Known_User)+",");
aOutputString.append(aJSONObject.get(Session_Start_time)+",");
aOutputString.append(aJSONObject.get(Country_Code)+",");
context.write(new Text(aOutputString.toString()), key);
System.out.println(aOutputString.toString());
} catch (JSONException e) {
e.printStackTrace();
}
}
}
And my reducer code is :-
public void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
String aString = key.toString();
context.write(new Text(aString.trim()), new Text(""));
}
And my partitioner code is :-
public int getPartition(Text key, LongWritable value, int numPartitions) {
String aRecord = key.toString();
if(aRecord.contains(Country_code_Us)){
return 0;
}else{
return 1;
}
}
And here is my driver code
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Click Stream Analyzer");
job.setNumReduceTasks(2);
job.setJarByClass(ClickStreamDriver.class);
job.setMapperClass(ClickStreamMapper.class);
job.setReducerClass(ClickStreamReducer.class);
job.setPartitionerClass(ClickStreamPartitioner.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here I am trying to partition my data on the basis of country code, but it's not working: every record is being sent to a single reducer file (I think the file other than the one created for the US partition).
One more thing: when I look at the output of the mappers, it shows some extra space added at the end of each record.
Please suggest if I am making any mistake here.
Your problem with the partitioning is due to the number of reducers. If it is 1, all your data will be sent to it, regardless of what you return from your partitioner. Thus, setting mapred.reduce.tasks to 2 will solve this issue. Or you can simply write:
job.setNumReduceTasks(2);
in order to have the 2 reducers you want.
Unless you have a very specific requirement, you can set the number of reducers as a job parameter:
mapred.reduce.tasks (in 1.x) or mapreduce.job.reduces (in 2.x)
or
job.setNumReduceTasks(2), as per mark91's answer.
But you can also leave the decision to the Hadoop framework by using the API below; the framework will then decide the number of reducers based on the file and block sizes.
job.setPartitionerClass(HashPartitioner.class);
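For example, a minimal sketch of setting the reducer count through the configuration rather than through the Job object (the 2.x property name is assumed here):

Configuration conf = new Configuration();
// equivalent to calling job.setNumReduceTasks(2) on a job created from this conf
conf.setInt("mapreduce.job.reduces", 2);
Job job = Job.getInstance(conf, "Click Stream Analyzer");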
I have used NullWritable and it works. Now I can see the records getting partitioned into different files. Since I was using LongWritable as a null value instead of NullWritable, an extra space was added at the end of each line; because of this, US was listed as "US " and the partitioner was not able to divide the records.
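A minimal sketch of what that change looks like for the partitioner (class and constant names are illustrative); the mapper's value type and the driver's map output value class are switched to NullWritable in the same way:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ClickStreamPartitioner extends Partitioner<Text, NullWritable> {
    private static final String COUNTRY_CODE_US = "US"; // illustrative constant

    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions) {
        // records whose key contains the US country code go to reducer 0, everything else to reducer 1
        return key.toString().contains(COUNTRY_CODE_US) ? 0 : 1;
    }
}
// in the mapper: context.write(new Text(aOutputString.toString()), NullWritable.get());
// in the driver: job.setMapOutputValueClass(NullWritable.class);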

If 2 mappers output the same key, what will the input to the reducer be?

I have the following doubt while learning MapReduce; it would be of great help if someone could answer it.
I have two mappers working on the same file - I configured them using MultipleInputs.
Mapper 1 - expected output [after extracting a few columns of the file]:
a - 1234
b - 3456
c - 1345
Mapper 2 - expected output [after extracting a few columns of the same file]:
a - Monday
b - Tuesday
c - Wednesday
And there is a reducer function that just outputs the key and value pair that it gets as input.
So I expected the output to be as below, since I know that identical keys are shuffled together to form a list:
a - [1234,Monday]
b - [3456, Tuesday]
c - [1345, Wednesday]
But I am getting some weird output. I guess only one mapper is being run.
Should this not be expected? Will the output of each mapper be shuffled separately? Will both mappers run in parallel?
Excuse me if it's a lame question; please understand that I am new to Hadoop and MapReduce.
Below is the code
//Mapper1
public class numbermapper extends Mapper<Object, Text, Text, Text>{
public void map(Object key,Text value, Context context) throws IOException, InterruptedException {
String record = value.toString();
String[] parts = record.split(",");
System.out.println("***Mapper number output "+parts[0]+" "+parts[1]);
context.write(new Text(parts[0]), new Text(parts[1]));
}
}
//Mapper2
public class weekmapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String record = value.toString();
String[] parts = record.split(",");
System.out.println("***Mapper week output "+parts[0]+" "+parts[2]);
context.write(new Text(parts[0]), new Text(parts[2]));
}
}
//Reducer
public class rjoinreducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Text values, Context context)
throws IOException, InterruptedException {
context.write(key, values);
}
}
//Driver class
public class driver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Reduce-side join");
job.setJarByClass(numbermapper.class);
job.setReducerClass(rjoinreducer.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class, numbermapper.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class, weekmapper.class);
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
And this is the output I got:
a Monday
b Tuesday
c Wednesday
Dataset used
a,1234,Monday
b,3456,Tuesday
c,1345,Wednesday
MultipleInputs was just taking one file and running one mapper on it, because I had given the same path for both mappers.
When I copied the dataset to a different file and ran the same program with two different files (same content, but different file names), I got the expected output.
So I now understand that the output from different mapper classes is also combined based on key, not just the output from the same mapper class.
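For reference, a minimal sketch of the corrected MultipleInputs calls, assuming the copied dataset is passed as a second command-line argument and the output path moves to args[2]:

MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, numbermapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, weekmapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));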
Thanks for trying to help....!!!

Is it possible to have multiple output files for a map-reduce?

In my input file I have a country column. Now, my task is to place the records for each country into a separate file named after that country. Is this possible to do in MapReduce?
Please share your ideas regarding this.
Yes it is. In Hadoop you can use MultipleOutputFormat to do exactly that, using its generateFileNameForKeyValue method.
Using your country names as keys and the records as values, this should work exactly as you need it to.
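A minimal sketch of that approach with the old (mapred) API, assuming the reducer emits the country name as a Text key and the record as a Text value (CountryOutputFormat is an illustrative name):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CountryOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // use the country name as the output file name instead of the default part-NNNNN
        return key.toString();
    }
}
// in the driver (old API): conf.setOutputFormat(CountryOutputFormat.class);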
If you are using the new API, you should look at the MultipleOutputs class. There is an example inside this class.
Usage pattern for job submission:
Job job = new Job();
FileInputFormat.setInputPaths(job, inDir);
FileOutputFormat.setOutputPath(job, outDir);
job.setMapperClass(MOMap.class);
job.setReducerClass(MOReduce.class);
...
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
// Defines additional sequence-file based output 'sequence' for the job
MultipleOutputs.addNamedOutput(job, "seq",
SequenceFileOutputFormat.class,
LongWritable.class, Text.class);
...
job.waitForCompletion(true);
...
Usage in Reducer:
String generateFileName(K k, V v) {
return k.toString() + "_" + v.toString();
}
public class MOReduce extends Reducer<WritableComparable, Writable, WritableComparable, Writable> {
private MultipleOutputs mos;
public void setup(Context context) {
...
mos = new MultipleOutputs(context);
}
public void reduce(WritableComparable key, Iterable<Writable> values, Context context)
throws IOException, InterruptedException {
...
mos.write("text", key, new Text("Hello"));
mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
...
}
public void cleanup(Context context) throws IOException, InterruptedException {
mos.close();
...
}
}

How do multiple reducers output only one part-file in Hadoop?

In my map-reduce job, I use 4 reducers to implement the reduce work. By doing this, the final output generates 4 part-files: part-0000, part-0001, part-0002, part-0003.
My question is: how can I set the Hadoop configuration to output only one part-file, even though Hadoop uses 4 reducers for the work?
This isn't the behaviour expected from Hadoop, but you may use MultipleOutputs to your advantage here.
Create one named output and use it in all your reducers to get the final output in one file. Its javadoc suggests the following:
JobConf conf = new JobConf();
FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);
conf.setMapperClass(MOMap.class);
conf.setReducerClass(MOReduce.class);
...
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
...
JobClient jc = new JobClient();
RunningJob job = jc.submitJob(conf);
...
Job configuration usage pattern is:
public class MOReduce implements Reducer<WritableComparable, Writable, WritableComparable, Writable> {
private MultipleOutputs mos;
public void configure(JobConf conf) {
...
mos = new MultipleOutputs(conf);
}
public void reduce(WritableComparable key, Iterator<Writable> values,
OutputCollector output, Reporter reporter)
throws IOException {
...
mos.getCollector("text", reporter).collect(key, new Text("Hello"));
...
}
public void close() throws IOException {
mos.close();
...
}
}
If you are using the new mapreduce API then see here.
MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
Is text here an output directory, or a single large file named text?
