I am doing a secondary sort in Hadoop 2.6.0, following this tutorial:
https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
I have exactly the same code, but now I am trying to improve performance, so I have decided to add a combiner. I made two modifications:
Main file:
job.setCombinerClass(CombinerK.class);
Combiner file:
public class CombinerK extends Reducer<KeyWritable, KeyWritable, KeyWritable, KeyWritable> {

    public void reduce(KeyWritable key, Iterator<KeyWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<KeyWritable> it = values;
        System.err.println("combiner " + key);
        KeyWritable first_value = it.next();
        System.err.println("va: " + first_value);
        long sum = 0;
        while (it.hasNext()) {
            sum += it.next().getSs();
        }
        first_value.setS(sum);
        context.write(key, first_value);
    }
}
But it seems that it is not run, because I can't find any log file containing the word "combiner". When I looked at the counters after running, I saw:
Combine input records=4040000
Combine output records=4040000
So the combiner does seem to be executed, but it appears to receive one call per key, which is why the input record count equals the output record count.
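One likely cause, worth checking (this is my reading of the counters, not something confirmed in the post): Hadoop's Reducer dispatches to reduce(KEYIN key, Iterable<VALUEIN> values, Context context). A reduce method declared with an Iterator parameter merely overloads that method instead of overriding it, so the inherited identity reduce runs and input and output record counts match. A toy pure-Java illustration of the overload pitfall:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for Hadoop's Reducer: the framework always calls the
// Iterable-based reduce(). If a subclass declares reduce() with an
// Iterator parameter, it merely overloads the method, so the base
// (identity-style) implementation still runs.
class ToyReducer {
    protected String reduce(String key, Iterable<Integer> values) {
        return "identity"; // what the default reduce effectively does
    }
}

class BrokenCombiner extends ToyReducer {
    // Iterator != Iterable: this does NOT override ToyReducer.reduce
    public String reduce(String key, Iterator<Integer> values) {
        return "combined";
    }
}

public class OverrideDemo {
    public static void main(String[] args) {
        ToyReducer r = new BrokenCombiner();
        List<Integer> vals = Arrays.asList(1, 2, 3);
        // The framework-side call resolves to the Iterable version:
        System.out.println(r.reduce("k", vals)); // prints "identity"
    }
}
```

Adding @Override to the combiner's reduce method makes the compiler catch this kind of mismatch immediately.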
I hit a brick wall. I have the following files, which I generated with previous MapReduce jobs.
Product Scores (I have)
0528881469 1.62
0594451647 2.28
0594481813 2.67
0972683275 4.37
1400501466 3.62
where column 1 = product_id, and column 2 = product_rating
Related Products (I have)
0000013714 [0005080789,0005476798,0005476216,0005064341]
0000031852 [B00JHONN1S,B002BZX8Z6,B00D2K1M3O,0000031909]
0000031887 [0000031852,0000031895,0000031909,B00D2K1M3O]
0000031895 [B002BZX8Z6,B00JHONN1S,0000031909,B008F0SU0Y]
0000031909 [B002BZX8Z6,B00JHONN1S,0000031895,B00D2K1M3O]
where column 1 = product_id, and column 2 = array of also_bought products
The file I am trying to create now combines both of these files into the following:
Recommended Products (I need)
0000013714 [<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
0000031852 [<0005476798, 4.58>,<0005080789, 2.34>,<0005476216, 2.32>]
0000031887 [<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
0000031895 [<0005476216, 2.32>,<0005476798, 4.58>,<0005080789, 2.34>]
0000031909 [<0005476216, 2.32>,<0005080789, 2.34>,<0005476798, 4.58>]
where column 1 = product_id and column 2 = array of tuples of <also_bought product_id, product_rating>
I'm totally stuck at the moment. I thought I had a plan for this, but it turned out not to be a very good one and it didn't work.
Two approaches, depending on the size of your Product Scores data:
If your Product Scores file is not huge, you can load it into the Hadoop distributed cache (now available on the Job itself via Job.addCacheFile()).
Then process the Related Products file, fetch the necessary rating in the reducer, and write it out. Quick and dirty. But if Product Scores is a huge file, this is probably not the right way to go about the problem.
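The heart of this approach is a simple lookup join: build a map from product_id to rating out of the cached file, then decorate each also_bought id with its rating. A minimal plain-Java sketch of that logic (the in-memory map stands in for the distributed-cache file; the values are taken from the sample data above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class LookupJoinSketch {
    public static void main(String[] args) {
        // Stand-in for the Product Scores file loaded from the distributed cache
        Map<String, Double> scores = new HashMap<>();
        scores.put("0005080789", 2.34);
        scores.put("0005476798", 4.58);
        scores.put("0005476216", 2.32);

        // One Related Products record: product_id -> also_bought list
        String productId = "0000013714";
        List<String> alsoBought = List.of("0005080789", "0005476798", "0005476216");

        // For each related product, attach its rating if we have one
        StringJoiner out = new StringJoiner(",", "[", "]");
        for (String related : alsoBought) {
            Double rating = scores.get(related);
            if (rating != null) {
                out.add("<" + related + ", " + rating + ">");
            }
        }
        System.out.println(productId + "\t" + out);
        // 0000013714	[<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
    }
}
```

In a real job, the map would be populated in the Mapper's or Reducer's setup() method by reading the cached file.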
Reduce-side joins. Various examples are available; for instance, refer to this link to get an idea.
As you have already defined a schema, you can create Hive tables on top of it and get the output using queries. This would save you a lot of time.
Edit: Moreover, if you already have MapReduce jobs to create this file, you can add Hive jobs that create external Hive tables over the reducer outputs and then query them.
I ended up using a MapFile. I transformed both the ProductScores and RelatedProducts data sets into MapFiles and then wrote a Java program that pulls information out of them when needed.
MapFileWriter
public class MapFileWriter {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path inputFile = new Path(args[0]);
        Path outputFile = new Path(args[1]);
        Text txtKey = new Text();
        Text txtValue = new Text();
        try {
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream inputStream = fs.open(inputFile);
            // Note: the value class must be txtValue's class, not the key class twice
            MapFile.Writer writer = new MapFile.Writer(conf, fs, outputFile.toString(),
                    txtKey.getClass(), txtValue.getClass());
            writer.setIndexInterval(1);
            while (inputStream.available() > 0) {
                String strLineInInputFile = inputStream.readLine();
                String[] lstKeyValuePair = strLineInInputFile.split("\\t");
                txtKey.set(lstKeyValuePair[0]);
                txtValue.set(lstKeyValuePair[1]);
                writer.append(txtKey, txtValue);
            }
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
MapFileReader
public class MapFileReader {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Text txtKey = new Text(args[1]);
        Text txtValue = new Text();
        try {
            FileSystem fs = FileSystem.get(conf);
            MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
            reader.get(txtKey, txtValue);
            reader.close();
            // Print only after a successful lookup
            System.out.println("The value for key " + txtKey + " is " + txtValue);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I have an Adcampaign driver, a mapper, and a reducer class. The first two run fine. The reducer also runs, but its results are not correct. This is a sample project I downloaded from the internet to practice writing MapReduce programs.
A brief description of the program:
Problem statement:
For this article, let’s pretend that we are running an online advertising company. We run advertising campaigns for clients (like Pepsi, Sony) and the ads are displayed on popular websites such as news sites (CNN, Fox) and social media sites (Facebook). To track how well an advertising campaign is doing, we keep track of the ads we serve and ads that users click.
Scenario
Here is the sequence of events:
1. We serve the ad to the user
2. If the ad appears in the user's browser, i.e. the user saw the ad, we track this event as VIEWED_EVENT
3. If the user clicks on the ad, we track this event as CLICKED_EVENT
sample data:
293868800864,319248,1,flickr.com,12
1293868801728,625828,1,npr.org,19
1293868802592,522177,2,wikipedia.org,16
1293868803456,535052,2,cnn.com,20
1293868804320,287430,2,sfgate.com,2
1293868805184,616809,2,sfgate.com,1
1293868806048,704032,1,nytimes.com,7
1293868806912,631825,2,amazon.com,11
1293868807776,610228,2,npr.org,6
1293868808640,454108,2,twitter.com,18
Input log file format and description:
The log files are in the following format:
timestamp, user_id, view/click, domain, campaign_id.
E.g: 1262332801728, 899523, 1, npr.org, 19
◾timestamp : unix time stamp in milliseconds
◾user_id : each user has a unique id
◾action_id : 1=view, 2=click
◾domain : which domain the ad was served
◾campaign_id: identifies the campaign the ad was part of
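Given the field descriptions above, a single log line can be parsed like this (a plain-Java sketch, not code from the original post):

```java
public class LogLineParse {
    public static void main(String[] args) {
        String line = "1262332801728, 899523, 1, npr.org, 19";
        String[] f = line.split(",");
        long timestamp  = Long.parseLong(f[0].trim());   // unix time in ms
        long userId     = Long.parseLong(f[1].trim());
        int  actionId   = Integer.parseInt(f[2].trim()); // 1=view, 2=click
        String domain   = f[3].trim();
        int  campaignId = Integer.parseInt(f[4].trim());
        System.out.println(campaignId + " " + (actionId == 1 ? "view" : "click") + " on " + domain);
        // 19 view on npr.org
    }
}
```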
Expected output from the reducer:
campaign_id, total views, total clicks
Example:
12,3,2
13,100,23
14,23,12
I looked at the mapper's logs and its output is good. But the final output from the reducer is not.
Reducer class:
public class AdcampaignReducer extends Reducer<IntWritable, IntWritable, IntWritable, Text>
{
    // For every campaign, we receive all of its actions as an iterable list,
    // because all values for a key are grouped and sent to one reducer.
    // We iterate through the action_ids, count views and clicks, and write out the result.
    public void reduce(IntWritable key, Iterable<IntWritable> results, Context context) throws IOException, InterruptedException
    {
        int campaign = key.get();
        int clicks = 0;
        int views = 0;
        for (IntWritable i : results)
        {
            int action = i.get();
            if (action == 1)
                views = views + 1;
            else if (action == 2)
                clicks = clicks + 1;
        }
        String statistics = "Total Clicks = " + clicks + " and Views = " + views;
        context.write(new IntWritable(campaign), new Text(statistics));
    }
}
Mapper class:
public class AdcampaignMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private long numRecords = 0;

    @Override
    public void map(LongWritable key, Text record, Context context) throws IOException, InterruptedException {
        String[] tokens = record.toString().split(",");
        if (tokens.length != 5)
        {
            System.out.println("*** invalid record: " + record);
            return; // skip malformed lines instead of failing on tokens[2]
        }
        String actionStr = tokens[2];
        String campaignStr = tokens[4];
        try {
            System.out.println("actionStr = " + actionStr + " and campaign str = " + campaignStr);
            int actionid = Integer.parseInt(actionStr.trim());
            int campaignid = Integer.parseInt(campaignStr.trim());
            IntWritable outputKeyFromMapper = new IntWritable(actionid);
            IntWritable outputValueFromMapper = new IntWritable(campaignid);
            context.write(outputKeyFromMapper, outputValueFromMapper);
        }
        catch (Exception e) {
            System.out.println("*** there is an exception");
            e.printStackTrace();
        }
        numRecords = numRecords + 1;
    }
}
Driver program:
public class Adcampaign {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Adcampaign <input path> <output path>");
            System.exit(-1);
        }
        // Reads the default configuration of the cluster from the configuration XML files
        // https://www.quora.com/What-is-the-use-of-a-configuration-class-and-object-in-Hadoop-MapReduce-code
        Configuration conf = new Configuration();
        // Initializing the job with the default configuration of the cluster
        Job job = Job.getInstance(conf, "Adcampaign");
        // Location of the input dataset
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Location of the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // The InputFormat class parses the dataset into key/value pairs. It is responsible for three main tasks:
        //  a. Validate inputs, i.e. that the dataset exists at the specified location.
        //  b. Split the input files into logical InputSplits; each split is assigned a mapper.
        //  c. Provide a RecordReader implementation to extract logical records.
        job.setInputFormatClass(TextInputFormat.class);
        // The OutputFormat class writes the final key/value output of the MR framework to a text file on disk.
        // It does two main things:
        //  a. Validate output specifications, e.g. throw an error if the output directory already exists.
        //  b. Provide a RecordWriter implementation to write the job's output files.
        // Hadoop comes with several OutputFormat implementations.
        job.setOutputFormatClass(TextOutputFormat.class);
        // Driver, mapper, and reducer classes
        job.setJarByClass(Adcampaign.class);
        job.setMapperClass(AdcampaignMapper.class);
        job.setReducerClass(AdcampaignReducer.class);
        // Delete the output path from HDFS automatically so we don't have to delete it explicitly
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
You want the output per campaign_id, so campaign_id should be the key emitted from the mapper code. Then in the reducer code you check whether each action is a view or a click.
String actionStr = tokens[2];
String campaignStr = tokens[4];
int actionid = Integer.parseInt(actionStr.trim());
int campaignid = Integer.parseInt(campaignStr.trim());
IntWritable outputKeyFromMapper = new IntWritable(actionid);
IntWritable outputValueFromMapper = new IntWritable(campaignid);
Here outputKeyFromMapper should be campaignid, as the sorting (and grouping) is done on campaignid.
Please let me know if it helps.
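In other words, because the mapper emits (actionid, campaignid), the framework groups records by action rather than by campaign. A plain-Java simulation (not Hadoop code) of what the corrected grouping, with campaign_id as key and action_id as value, produces on a few of the sample lines:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CampaignCount {
    public static void main(String[] args) {
        List<String> lines = List.of(
            "1293868801728,625828,1,npr.org,19",
            "1293868802592,522177,2,wikipedia.org,16",
            "1293868804320,287430,2,sfgate.com,2",
            "1293868807776,610228,2,npr.org,6");

        // "Map phase": emit campaign_id -> action_id; the "shuffle" groups by key
        Map<Integer, int[]> byCampaign = new LinkedHashMap<>(); // value = {views, clicks}
        for (String line : lines) {
            String[] t = line.split(",");
            int actionId = Integer.parseInt(t[2].trim());
            int campaignId = Integer.parseInt(t[4].trim());
            int[] counts = byCampaign.computeIfAbsent(campaignId, k -> new int[2]);
            if (actionId == 1) counts[0]++; else if (actionId == 2) counts[1]++;
        }

        // "Reduce phase": one output line per campaign
        byCampaign.forEach((id, c) ->
            System.out.println(id + ", views=" + c[0] + ", clicks=" + c[1]));
        // 19, views=1, clicks=0
        // 16, views=0, clicks=1
        // 2, views=0, clicks=1
        // 6, views=0, clicks=1
    }
}
```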
The output key from your mapper should be campaignid and the value should be actionid.
If you want to count the number of records in the mapper, use counters.
Your mapper and reducer look fine.
Add the lines below to your driver class and give it a try:
job.setOutputKeyClass( IntWritable.class );
job.setOutputValueClass( Text.class );
Currently I have a data file which I process line by line. Most lines contain one record that I need, such as: id,name,total
But some lines contain more than one record, such as: id1,name1,total1,id2,name2,total2
I wrote my own load function and tried to return a tuple composed of a list of tuples, but I don't know how to process data like the following:
((id1,name1,total1),(id2,name2,total2))...
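Splitting one flattened line into fixed-size records is straightforward outside of Pig's Tuple plumbing; a plain-Java sketch (the three-fields-per-record assumption comes from the id,name,total layout above):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRecords {
    // Splits a flattened CSV line into records of 3 fields each: (id, name, total)
    static List<String[]> toRecords(String line) {
        String[] fields = line.split(",");
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i + 2 < fields.length; i += 3) {
            records.add(new String[] { fields[i], fields[i + 1], fields[i + 2] });
        }
        return records;
    }

    public static void main(String[] args) {
        List<String[]> recs = toRecords("id1,name1,total1,id2,name2,total2");
        for (String[] r : recs) {
            System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
        }
        // id1 | name1 | total1
        // id2 | name2 | total2
    }
}
```

Inside a LoadFunc, each inner String[] would become a Pig tuple in a bag or nested tuple.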
And another question about the load function: if I find a line containing an invalid value, should I return an empty tuple or just advance the reader to the next line?
Thanks.
I found a solution, which is to define my own load and store functions.
For the load, I define the file input.
For the store, I define the output in my putNext function, as below.
@Override
public void putNext(Tuple t) throws IOException {
    List<Object> all = t.getAll();
    for (Object o : all) {
        logger.info(o.getClass());
        Tuple tuple = (Tuple) o;
        try {
            recordWriter.write(null, new Text(tuple.toString()));
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
I am trying to play with MultipleOutputFormat.generateFileNameForKeyValue().
The idea is to create a directory for each of my keys.
This is the code:
static class MyMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String[] arr = key.toString().split("_");
        return arr[0] + "/" + name;
    }
}
This code works only if the number of emitted records is small. If I run it against my real input, it just hangs in the reducer at around 70%.
What might be the problem here, working for a small number of keys but not for many?
My requirement is to write a custom partitioner. I have N keys coming from the mapper, for example 'jsa', 'msa', 'jbac'. The length is not fixed; it can be any word, in fact. My requirement is to write a custom partitioner in such a way that it collects all the data for the same key into the same file. The number of keys is not fixed. Thank you in advance.
Thanks,
Sathish.
So you have multiple keys which the mapper is outputting, and you want a different reducer for each key, with a separate file per key.
First, writing a Partitioner is one way to achieve that. By default, Hadoop applies its own internal logic to keys and calls reducers based on it. So if you want to write a custom partitioner, you have to override that default behaviour with your own logic/algorithm. Unless you know exactly how your keys will vary, this logic won't be generic, and you will have to work it out from the variations you see.
I am providing a sample example here that you can refer to, but it is not generic:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        if (key.toString().contains("Key1")) {
            return 1;
        } else if (key.toString().contains("Key2")) {
            return 2;
        } else if (key.toString().contains("Key3")) {
            return 3;
        } else if (key.toString().contains("Key4")) {
            return 4;
        } else if (key.toString().contains("Key5")) {
            return 5;
        } else {
            return 7;
        }
    }
}
This should solve your problem; just replace Key1, Key2, etc. with your key names.
In case you don't know the key names, you can write your own logic along these lines:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        // first character of the key, modulo the number of reducers
        return (key.toString().charAt(0)) % numReduceTasks;
    }
}
In the above partitioner, just to illustrate how you can write your own logic, I have shown that if you take the first character of the key and take it modulo the number of reducers, you get a number between 0 and numReduceTasks - 1, so different reducers get called by default and their output goes to different files. But with this approach you have to make sure that two different keys do not map to the same partition number, or they will end up in the same file.
That was the customized-partitioner approach.
Another solution can be to override the MultipleOutputFormat class methods, which enables you to do the job in a generic way. Using this approach you can also generate customized file names for the reducer output files in HDFS.
NOTE: Make sure you use the same libraries. Don't mix the mapred and mapreduce libraries: org.apache.hadoop.mapred is the older API and org.apache.hadoop.mapreduce is the new one.
Hope this helps.
I imagine the best way to do this, since it will give a more even split, would be:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        // mask off the sign bit so the partition index is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
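One caveat with hashCode()-based partitioning: Java hash codes can be negative, so it is safer to mask off the sign bit before taking the modulo (the same idiom Hadoop's built-in HashPartitioner uses). A quick plain-Java check on a few keys, including the classic String whose hash code is Integer.MIN_VALUE:

```java
public class PartitionDemo {
    public static void main(String[] args) {
        int numReduceTasks = 4;
        String[] keys = {"jsa", "msa", "jbac", "polygenelubricants"};
        for (String key : keys) {
            // Masking with Integer.MAX_VALUE clears the sign bit,
            // so the partition index is always in [0, numReduceTasks).
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            System.out.println(key + " -> partition " + partition);
        }
    }
}
```

Without the mask, a key with a negative hash code would produce a negative partition index and the job would fail with an "Illegal partition" error.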