Any suggestions for reading two different dataset into Hadoop at the same time? - hadoop

Dear hadooper:
I'm new for hadoop, and recently try to implement an algorithm.
This algorithm needs to calculate a matrix, which represent the different rating of every two pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in my HDFS. Let's call this dataset A (size=160G)
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profile first(which is 5G size), let call this dataset B, and then calculate use the dataset A.
But now I don't know how to read the two dataset from a single hadoop program. Or can I read the dataset B into RAM then do the calculation?( I guess I can't, because the HDFS is a distribute system, and I can't read the dataset B into a single machine's memory).
Any suggestions?

You can use two Map function, Each Map Function Can process one data set if you want to implement different processing. You need to register your map with your job conf. For eg:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollectoroutput, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input outpur file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf();
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}

Hadoop allows you to use different map input formats for different folders. So you can read from several datasources and then cast to specific type in Map function i.e. in one case you got (String,User) in other (String,SongSongRating) and you Map signature is (String,Object).
The second step is selection recommendation algorithm, join those data in some way so aggregator will have least information enough to calculate recommendation.

Related

When to go for custom Input format for Map reduce jobs

When should we go for custom Input Format while using Map Reduce programming ?
Say I have a file which I need to read line by line and it has 15 columns delimited by pipe, should I go for custom Input Format ?
I can use a TextInput Format as well as Custom Input Format in this case.
CustomInputFormat can be written when you need to customize input
record reading. But in your case you need not have such an implementation.
see below example of CustomInputFormat out of many such...
Example : Reading Paragraphs as Input Records
If you are working on Hadoop MapReduce or Using AWS EMR then there might be an use case where input files consistent a paragraph as key-value record instead of a single line (think about scenarios like analyzing comments of news articles). So instead of processing a single line as input if you need to process a complete paragraph at once as a single record then you will need to customize the default behavior of **TextInputFormat** i.e. to read each line by default into reading a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires us to to create a custom record reader which can be done by implementing the class RecordReader. The next() method is where you would tell the record reader to fetch a paragraph instead of one line. See the following implementation, it’s self-explanatory:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
private LineRecordReader lineRecord;
private LongWritable lineKey;
private Text lineValue;
public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
lineRecord = new LineRecordReader(conf, split);
lineKey = lineRecord.createKey();
lineValue = lineRecord.createValue();
}
#Override
public void close() throws IOException {
lineRecord.close();
}
#Override
public LongWritable createKey() {
return new LongWritable();
}
#Override
public Text createValue() {
return new Text("");
}
#Override
public float getProgress() throws IOException {
return lineRecord.getPos();
}
#Override
public synchronized boolean next(LongWritable key, Text value) throws IOException {
boolean appended, isNextLineAvailable;
boolean retval;
byte space[] = {' '};
value.clear();
isNextLineAvailable = false;
do {
appended = false;
retval = lineRecord.next(lineKey, lineValue);
if (retval) {
if (lineValue.toString().length() > 0) {
byte[] rawline = lineValue.getBytes();
int rawlinelen = lineValue.getLength();
value.append(rawline, 0, rawlinelen);
value.append(space, 0, 1);
appended = true;
}
isNextLineAvailable = true;
}
} while (appended);
return isNextLineAvailable;
}
#Override
public long getPos() throws IOException {
return lineRecord.getPos();
}
}
With a ParagraphRecordReader implementation, we would need to extend TextInputFormat to create a custom InputFomat by just overriding the getRecordReader method and return an object of ParagraphRecordReader to override default behavior.
ParagrapghInputFormat will look like:
public class ParagrapghInputFormat extends TextInputFormat
{
#Override
public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter)throws IOException {
reporter.setStatus(split.toString());
return new ParagraphRecordReader(conf, (FileSplit)split);
}
}
Ensure that the job configuration to use our custom input format implementation for reading data into MapReduce jobs. It will be as simple as setting up inputformat type to ParagraphInputFormat as show below:
conf.setInputFormat(ParagraphInputFormat.class);
With above changes, we can read paragraphs as input records into MapReduce programs.
let’s assume that input file is as follows with paragraphs:
And a simple mapper code would look like:
#Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
System.out.println(key+" : "+value);
}
Yes u can use TextInputformat for you case.

Partitioner is not working correctly

I am trying to code one MapReduce scenario in which i have created some User ClickStream data in the form of JSON. After that i have written Mapper class to fetch the required data from the file my mapper code is :-
private final static String URL = "u";
private final static String Country_Code = "c";
private final static String Known_User = "nk";
private final static String Session_Start_time = "hc";
private final static String User_Id = "user";
private final static String Event_Id = "event";
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String aJSONRecord = value.toString();
try {
JSONObject aJSONObject = new JSONObject(aJSONRecord);
StringBuilder aOutputString = new StringBuilder();
aOutputString.append(aJSONObject.get(User_Id).toString()+",");
aOutputString.append(aJSONObject.get(Event_Id).toString()+",");
aOutputString.append(aJSONObject.get(URL).toString()+",");
aOutputString.append(aJSONObject.get(Known_User)+",");
aOutputString.append(aJSONObject.get(Session_Start_time)+",");
aOutputString.append(aJSONObject.get(Country_Code)+",");
context.write(new Text(aOutputString.toString()), key);
System.out.println(aOutputString.toString());
} catch (JSONException e) {
e.printStackTrace();
}
}
}
And my reducer code is :-
public void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
String aString = key.toString();
context.write(new Text(aString.trim()), new Text(""));
}
And my partitioner code is :-
public int getPartition(Text key, LongWritable value, int numPartitions) {
String aRecord = key.toString();
if(aRecord.contains(Country_code_Us)){
return 0;
}else{
return 1;
}
}
And here is my driver code
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Click Stream Analyzer");
job.setNumReduceTasks(2);
job.setJarByClass(ClickStreamDriver.class);
job.setMapperClass(ClickStreamMapper.class);
job.setReducerClass(ClickStreamReducer.class);
job.setPartitionerClass(ClickStreamPartitioner.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here i am trying to partition my data on the basis of country code. But its not working, it is sending each and every record in a single reducer file i think file other then the one created for US reduce.
One more thing when i see the output of mappers it shows some extra space added at the end of each record.
Please suggest if i am making any mistake here.
Your problem with the partitioning is due to the number of reducers. If it is 1, all your data will be sent to it, independently to what you return from your partitioner. Thus, setting mapred.reduce.tasks to 2 will solve this issue. Or you can simply write:
job.setNumReduceTasks(2);
In order to have 2 reducers as you want.
Unless you have very specific requirement, you can set reducers as below for job parameters.
mapred.reduce.tasks (in 1.x) & mapreduce.job.reduces(2.x)
Or
job.setNumReduceTasks(2) as per mark91 answer.
But leave the job to Hadoop fraemork by using below API. Framework will decide number of reducers as per the file & block sizes.
job.setPartitionerClass(HashPartitioner.class);
I have used NullWritable and it works. Now i can see records are getting partitioned in different files. Since i was using longwritable as a null value instead of null writable , space is added in the last of each line and due to this US was listed as "US " and partition was not able to divide the orders.

How do you use ORCFile Input/Output format in MapReduce?

I need to implement a custom I/O format based on ORCFile I/O format. How do I go about it?
Specifically I would need a way to include the ORCFile library in my source code (which is a custom Pig implementation) and use the ORCFile Output format to write data, and later use the ORCFile Input format to read back the data.
You need to create your subclass of InputFormat class (or FileInputFormat, depending on the nature of your files).
Just google for Hadoop InputFormat and you will find a plenty of articles and tutorials on how to create your own InputFormat class.
You can use HCatalog library to read write orc files in mapreduce.
Just wrote a sample code here. Hope it helps.
sample mapper code
public static class MyMapper<K extends WritableComparable, V extends Writable>
extends MapReduceBase implements Mapper<K, OrcStruct, Text, IntWritable> {
private StructObjectInspector oip;
private final OrcSerde serde = new OrcSerde();
public void configure(JobConf job) {
Properties table = new Properties();
table.setProperty("columns", "a,b,c");
table.setProperty("columns.types", "int,string,struct<d:int,e:string>");
serde.initialize(job, table);
try {
oip = (StructObjectInspector) serde.getObjectInspector();
} catch (SerDeException e) {
e.printStackTrace();
}
}
public void map(K key, OrcStruct val,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
System.out.println(val);
List<? extends StructField> fields =oip.getAllStructFieldRefs();
StringObjectInspector bInspector =
(StringObjectInspector) fields.get(B_ID).getFieldObjectInspector();
String b = "garbage";
try {
b = bInspector.getPrimitiveJavaObject(oip.getStructFieldData(serde.deserialize(val), fields.get(B_ID)));
} catch (SerDeException e1) {
e1.printStackTrace();
}
OrcStruct struct = null;
try {
struct = (OrcStruct) oip.getStructFieldData(serde.deserialize(val),fields.get(C_ID));
} catch (SerDeException e1) {
e1.printStackTrace();
}
StructObjectInspector cInspector = (StructObjectInspector) fields.get(C_ID).getFieldObjectInspector();
int d = ((IntWritable) cInspector.getStructFieldData(struct, fields.get(D_ID))).get();
String e = cInspector.getStructFieldData(struct, fields.get(E_ID)).toString();
output.collect(new Text(b), new IntWritable(1));
output.collect(new Text(e), new IntWritable(1));
}
}
Launcher code
JobConf job = new JobConf(new Configuration(), OrcReader.class);
// Specify various job-specific parameters
job.setJobName("myjob");
job.set("mapreduce.framework.name","local");
job.set("fs.default.name","file:///");
job.set("log4j.logger.org.apache.hadoop","INFO");
job.set("log4j.logger.org.apache.hadoop","INFO");
//push down projection columns
job.set("hive.io.file.readcolumn.ids","1,2");
job.set("hive.io.file.read.all.columns","false");
job.set("hive.io.file.readcolumn.names","b,c");
FileInputFormat.setInputPaths(job, new Path("./src/main/resources/000000_0.orc"));
FileOutputFormat.setOutputPath(job, new Path("./target/out1"));
job.setMapperClass(OrcReader.MyMapper.class);
job.setCombinerClass(OrcReader.MyReducer.class);
job.setReducerClass(OrcReader.MyReducer.class);
job.setInputFormat(OrcInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
JobClient.runJob(job);

Multiple Input Files Mapreduce Wordcount example done separately

I was going about Hadoop framework for Mapreduce model,and actually tried out basic examples like WordCount, Max_temperature so much so as to create a mapreduce task for my project .I only want to know how to process wordcount as one output file for each input file...as in let me give you an example on that :-
FILE_1 Dog Cat Dog Bull
FILE_2 Cow Ox Tiger Dog Cat
FILE_3 Dog Cow Ox Tiger Bull
should give 3 output files, 1 for each input file as follows:-
Out_1 Dog 2,Cat 1,Bull 1
Out_2 Cow 1,Ox 1,Tiger 1,Dog 1,Cat 1
Out_3 Dog 1,Cow 1,Ox 1,Tiger 1,Bull 1
I went through the answers posted here Hadoop MapReduce - one output file for each input but couldn't grasp it properly.
Help please! Thanks
Each Reducer outputs one output file.
The number of output files is dependent on number of Reducers.
(A)
Assuming you want to process all three input files in a single MapReduce Job.
At the very minimum - you must set number of Reducers equal to the Number of Output Files you want.
Since you are trying to do word-counts Per File. And not across Files.
You will have to ensure that all the file contents (of one file) are processed by a Single Reducer. Using a Custom Partitioner is one way to do this.
(B)
Another way is to simply run your MapReduce Job Three Times. Once for Each Input File. And have Reducer count as 1.
Even I am a newbie in hadoop and found this question very interesting. And this is how I resolved this.
public class Multiwordcnt {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job myJob = new Job(conf, "Multiwordcnt");
String[] userargs = new GenericOptionsParser(conf, args).getRemainingArgs();
myJob.setJarByClass(Multiwordcnt.class);
myJob.setMapperClass(MyMapper.class);
myJob.setReducerClass(MyReducer.class);
myJob.setMapOutputKeyClass(Text.class);
myJob.setMapOutputValueClass(IntWritable.class);
myJob.setOutputKeyClass(Text.class);
myJob.setOutputValueClass(IntWritable.class);
myJob.setInputFormatClass(TextInputFormat.class);
myJob.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(myJob, new Path(userargs[0]));
FileOutputFormat.setOutputPath(myJob, new Path(userargs[1]));
System.exit(myJob.waitForCompletion(true) ? 0 : 1 );
}
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
Text emitkey = new Text();
IntWritable emitvalue = new IntWritable(1);
public void map(LongWritable key , Text value, Context context) throws IOException, InterruptedException {
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
String filepathword = filePathString + "*" + tokenizer.nextToken();
emitkey.set(filepathword);
context.write(emitkey, emitvalue);
}
}
}
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
Text emitkey = new Text();
IntWritable emitvalue = new IntWritable();
private MultipleOutputs<Text,IntWritable> multipleoutputs;
public void setup(Context context) throws IOException, InterruptedException {
multipleoutputs = new MultipleOutputs<Text,IntWritable>(context);
}
public void reduce(Text key , Iterable <IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values){
sum = sum + value.get();
}
String pathandword = key.toString();
String[] splitted = pathandword.split("\\*");
String path = splitted[0];
String word = splitted[1];
emitkey.set(word);
emitvalue.set(sum);
System.out.println("word:" + word + "\t" + "sum:" + sum + "\t" + "path: " + path);
multipleoutputs.write(emitkey,emitvalue , path);
}
public void cleanup(Context context) throws IOException, InterruptedException {
multipleoutputs.close();
}
}
}

Weird error in Hadoop reducer

The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text,Text> {
public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
while(values.hasNext()){
Neighbourhood n = values.next();
cachedValues.add(n);
//correct output
//output.collect(new Text(n.source), new Text(n.neighbours));
}
for(Neighbourhood node:cachedValues){
//wrong output
output.collect(new Text(key.toString()), new Text(node.source+"\t\t"+node.neighbours));
}
}
}
TheNeighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key which has 19 values(of type Neighbourhood) assigned. When I output the source and neighbours inside the while loop, I get the actual values of 19 different values. However, if I output them after the while loop as shown in the code, I get 19 similar values. That is, one object gets output 19 times! It is very weired that what happens. Is there any idea on that?
Here is the code of the class Neighbourhood
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {
Text source ;
Text neighbours ;
public Neighbourhood(){
source = new Text();
neighbours = new Text();
}
public Neighbourhood (String s, String n){
source = new Text(s);
neighbours = new Text(n);
}
#Override
public void readFields(DataInput arg0) throws IOException {
source.readFields(arg0);
neighbours.readFields(arg0);
}
#Override
public void write(DataOutput arg0) throws IOException {
source.write(arg0);
neighbours.write(arg0);
}
#Override
public int compareTo(Neighbourhood o) {
return 0;
}
}
You're being caught out by a efficiency mechanism employed by Hadoop - Object reuse.
Your calls to values.next() is returning the same object reference each time, all Hadoop is doing behind the scenes is replaced the contents of that same object with the underlying bytes (deserialized using the readFields() method).
To avoid this you'll need to create deep copies of the object returned from values.next() - Hadoop actually has a utility class to do this for you called ReflectionUtils.copy. A simple fix would be as follows:
while(values.hasNext()){
Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
ReflectionUtils.copy(values.next(), n, conf);
You'll need to cache a version of the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
#Override
protected void configure(JobConf job) {
conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.

Resources