Partitioner is not working correctly - hadoop

I am trying to code one MapReduce scenario in which i have created some User ClickStream data in the form of JSON. After that i have written Mapper class to fetch the required data from the file my mapper code is :-
private final static String URL = "u";
private final static String Country_Code = "c";
private final static String Known_User = "nk";
private final static String Session_Start_time = "hc";
private final static String User_Id = "user";
private final static String Event_Id = "event";
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String aJSONRecord = value.toString();
try {
JSONObject aJSONObject = new JSONObject(aJSONRecord);
StringBuilder aOutputString = new StringBuilder();
aOutputString.append(aJSONObject.get(User_Id).toString()+",");
aOutputString.append(aJSONObject.get(Event_Id).toString()+",");
aOutputString.append(aJSONObject.get(URL).toString()+",");
aOutputString.append(aJSONObject.get(Known_User)+",");
aOutputString.append(aJSONObject.get(Session_Start_time)+",");
aOutputString.append(aJSONObject.get(Country_Code)+",");
context.write(new Text(aOutputString.toString()), key);
System.out.println(aOutputString.toString());
} catch (JSONException e) {
e.printStackTrace();
}
}
}
And my reducer code is :-
public void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
String aString = key.toString();
context.write(new Text(aString.trim()), new Text(""));
}
And my partitioner code is :-
public int getPartition(Text key, LongWritable value, int numPartitions) {
String aRecord = key.toString();
if(aRecord.contains(Country_code_Us)){
return 0;
}else{
return 1;
}
}
And here is my driver code
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Click Stream Analyzer");
job.setNumReduceTasks(2);
job.setJarByClass(ClickStreamDriver.class);
job.setMapperClass(ClickStreamMapper.class);
job.setReducerClass(ClickStreamReducer.class);
job.setPartitionerClass(ClickStreamPartitioner.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here i am trying to partition my data on the basis of country code. But its not working, it is sending each and every record in a single reducer file i think file other then the one created for US reduce.
One more thing when i see the output of mappers it shows some extra space added at the end of each record.
Please suggest if i am making any mistake here.

Your problem with the partitioning is due to the number of reducers. If it is 1, all your data will be sent to it, independently to what you return from your partitioner. Thus, setting mapred.reduce.tasks to 2 will solve this issue. Or you can simply write:
job.setNumReduceTasks(2);
In order to have 2 reducers as you want.

Unless you have very specific requirement, you can set reducers as below for job parameters.
mapred.reduce.tasks (in 1.x) & mapreduce.job.reduces(2.x)
Or
job.setNumReduceTasks(2) as per mark91 answer.
But leave the job to Hadoop fraemork by using below API. Framework will decide number of reducers as per the file & block sizes.
job.setPartitionerClass(HashPartitioner.class);

I have used NullWritable and it works. Now i can see records are getting partitioned in different files. Since i was using longwritable as a null value instead of null writable , space is added in the last of each line and due to this US was listed as "US " and partition was not able to divide the orders.

Related

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String id = context.getConfiguration().get("mapreduce.task.partition");
context.write(new Text("Test"), new Text(id + "_" + value.toString()));
}
}
public static class TestReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text value : values) {
context.write(key, value);
}
}
}
public static void main(String[] args) throws Exception {
if(args.length != 2) {
System.err.println("Usage: Test <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Test.class);
job.setJobName("Test");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key which is LongWritable tells you the offset of the line being processed. Unless your lines are exactly the same length, you won't be able to calculate the line number from an offset, but it does allow you to determine ordering if thats useful.
The second option is to just track it in the mapper. You could change your code to something like:
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
private long currentLineNum = 0;
private Text test = new Text("Test");
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(test, new Text(currentLineNum + "_" + value));
currentLineNum++;
}
}
You could also represent your matrix as lines of tuples and include the row and col on every tuple so when you're reading in the file, you have that information. If you use a file that is just space or comma seperated values that make up a 2D array, it'll be extremely hard to figure out what line (row) you are currently working on in the mapper

Not understanding the path in distributed path

From the below code I didn't understand 2 things:
DistributedCache.addcachefile(new URI ('/abc.dat'), job.getconfiguration())
I didn't understand URI path has to be present in the HDFS. Correct me if I am wrong.
And what is p.getname().equals() from the below code:
public class MyDC {
public static class MyMapper extends Mapper < LongWritable, Text, Text, Text > {
private Map < String, String > abMap = new HashMap < String, String > ();
private Text outputKey = new Text();
private Text outputValue = new Text();
protected void setup(Context context) throws
java.io.IOException, InterruptedException {
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
for (Path p: files) {
if (p.getName().equals("abc.dat")) {
BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
String line = reader.readLine();
while (line != null) {
String[] tokens = line.split("\t");
String ab = tokens[0];
String state = tokens[1];
abMap.put(ab, state);
line = reader.readLine();
}
}
}
if (abMap.isEmpty()) {
throw new IOException("Unable to load Abbrevation data.");
}
}
protected void map(LongWritable key, Text value, Context context)
throws java.io.IOException, InterruptedException {
String row = value.toString();
String[] tokens = row.split("\t");
String inab = tokens[0];
String state = abMap.get(inab);
outputKey.set(state);
outputValue.set(row);
context.write(outputKey, outputValue);
}
}
public static void main(String[] args)
throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MyDC.class);
job.setJobName("DCTest");
job.setNumReduceTasks(0);
try {
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());
} catch (Exception e) {
System.out.println(e);
}
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The idea of Distributed Cache is to make some static data available to the task node before it starts its execution.
File has to be present in HDFS ,so that it can then add it to the Distributed Cache (to each task node)
DistributedCache.getLocalCacheFile basically gets all the cache files present in that task node. By if (p.getName().equals("abc.dat")) { you are getting the appropriate Cache File to be processed by your application.
Please refer to the docs below:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)
DistributedCache is an API which is used to add a file or a group of files in the memory and will be available for every data-nodes whether the map-reduce will work. One example of using DistributedCache is map-side joins.
DistributedCache.addcachefile(new URI ('/abc.dat'), job.getconfiguration()) will add the abc.dat file in the cache area. There can be n numbers of file in the cache and p.getName().equals("abc.dat")) will check the file which you required. Every path in HDFS will be taken under Path[] for map-reduce processing. For example :
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
The first Path(args[0]) is the first argument
(input file location) you pass while Jar execution and Path(args[1]) is the second argument which the output file location. Everything is taken as Path array.
In the same way when you add any file to cache it will get arrayed in the Path array which you shud be retrieving using the below code.
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
It will return all the files present in the cache and you will your file name by p.getName().equals() method.

Is it possible to have multiple output files for a map-reduce?

In my input file i have a column as country. Now, my task is to place records of a particular country into a separate file naming with that country. Is this possible to do in Map-reduce.!
Please share your ideas regarding this.
Yes it is, in hadoop you can use MultipleOutputFormat to do exactly that, using its generateFileNameForKeyValue method.
Using your country names as keys and the records as values this should work exactly as you need it to.
If you are using the new API , you should look at the MultipleOutputs class. There is an example inside this class.
Usage pattern for job submission:
Job job = new Job();
FileInputFormat.setInputPath(job, inDir);
FileOutputFormat.setOutputPath(job, outDir);
job.setMapperClass(MOMap.class);
job.setReducerClass(MOReduce.class);
...
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
// Defines additional sequence-file based output 'sequence' for the job
MultipleOutputs.addNamedOutput(job, "seq",
SequenceFileOutputFormat.class,
LongWritable.class, Text.class);
...
job.waitForCompletion(true);
...
Usage in Reducer:
String generateFileName(K k, V v) {
return k.toString() + "_" + v.toString();
}
public class MOReduce extends
Reducer {
private MultipleOutputs mos;
public void setup(Context context) {
...
mos = new MultipleOutputs(context);
}
public void reduce(WritableComparable key, Iterator values,
Context context)
throws IOException {
...
mos.write("text", , key, new Text("Hello"));
mos.write("seq", LongWritable(1), new Text("Bye"), "seq_a");
mos.write("seq", LongWritable(2), key, new Text("Chau"), "seq_b");
mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
...
}
public void cleanup(Context) throws IOException {
mos.close();
...
}
}

Read/Write Hbase without reducer, exceptions

I want to read and write hbase without using any reducer.
I followed the example in "The Apache HBaseâ„¢ Reference Guide", but there are exceptions.
Here is my code:
public class CreateHbaseIndex {
static final String SRCTABLENAME="sourceTable";
static final String SRCCOLFAMILY="info";
static final String SRCCOL1="name";
static final String SRCCOL2="email";
static final String SRCCOL3="power";
static final String DSTTABLENAME="dstTable";
static final String DSTCOLNAME="index";
static final String DSTCOL1="key";
public static void main(String[] args) {
System.out.println("CreateHbaseIndex Program starts!...");
try {
Configuration config = HBaseConfiguration.create();
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes(SRCCOLFAMILY), Bytes.toBytes(SRCCOL1));//info:name
HBaseAdmin admin = new HBaseAdmin(config);
if (admin.tableExists(DSTTABLENAME)) {
System.out.println("table Exists.");
}
else{
HTableDescriptor tableDesc = new HTableDescriptor(DSTTABLENAME);
tableDesc.addFamily(new HColumnDescriptor(DSTCOLNAME));
admin.createTable(tableDesc);
System.out.println("create table ok.");
}
Job job = new Job(config, "CreateHbaseIndex");
job.setJarByClass(CreateHbaseIndex.class);
TableMapReduceUtil.initTableMapperJob(
SRCTABLENAME, // input HBase table name
scan, // Scan instance to control CF and attribute selection
HbaseMapper.class, // mapper
ImmutableBytesWritable.class, // mapper output key
Put.class, // mapper output value
job);
job.waitForCompletion(true);
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
System.out.println("Program ends!...");
}
public static class HbaseMapper extends TableMapper<ImmutableBytesWritable, Put> {
private HTable dstHt;
private Configuration dstConfig;
#Override
public void setup(Context context) throws IOException{
dstConfig=HBaseConfiguration.create();
dstHt = new HTable(dstConfig,SRCTABLENAME);
}
#Override
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
// this is just copying the data from the source table...
context.write(row, resultToPut(row,value));
}
private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
Put put = new Put(key.get());
for (KeyValue kv : result.raw()) {
put.add(kv);
}
return put;
}
#Override
protected void cleanup(Context context) throws IOException, InterruptedException {
dstHt.close();
super.cleanup(context);
}
}
}
By the way, "souceTable" is like this:
key name email
1 peter a#a.com
2 sam b#b.com
"dstTable" will be like this:
key value
peter 1
sam 2
I am a newbie in this field and need you help. Thx~
You are correct that you don't need a reducer to write to HBase, but there are some instances where a reducer might help. If you are creating an index, you might run into situations where two mappers are trying to write the same row. Unless you are careful to ensure that they are writing into different column qualifiers, you could overwrite one update with another due to race conditions. While HBase does do row level locking, it won't help if your application logic is faulty.
Without seeing your exceptions, I would guess that you are failing because you are trying to write key-value pairs from your source table into your index table, where the column family doesn't exist.
In this code you are not specifying the output format. You need to add the following code
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE,
DSTTABLENAME);
Also, we are not supposed to create new configuration in the set up, we need to use the same configuration from context.

Any suggestions for reading two different dataset into Hadoop at the same time?

Dear hadooper:
I'm new for hadoop, and recently try to implement an algorithm.
This algorithm needs to calculate a matrix, which represent the different rating of every two pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in my HDFS. Let's call this dataset A (size=160G)
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profile first(which is 5G size), let call this dataset B, and then calculate use the dataset A.
But now I don't know how to read the two dataset from a single hadoop program. Or can I read the dataset B into RAM then do the calculation?( I guess I can't, because the HDFS is a distribute system, and I can't read the dataset B into a single machine's memory).
Any suggestions?
You can use two Map function, Each Map Function Can process one data set if you want to implement different processing. You need to register your map with your job conf. For eg:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollectoroutput, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input outpur file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf();
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
Hadoop allows you to use different map input formats for different folders. So you can read from several datasources and then cast to specific type in Map function i.e. in one case you got (String,User) in other (String,SongSongRating) and you Map signature is (String,Object).
The second step is selection recommendation algorithm, join those data in some way so aggregator will have least information enough to calculate recommendation.

Resources