Hadoop MultipleOutputFormat.generateFileNameForKeyValue with many keys - hadoop

I am experimenting with MultipleOutputFormat.generateFileNameForKeyValue().
The idea is to create a directory for each of my keys.
This is the code:
static class MyMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
@Override
protected String generateFileNameForKeyValue(Text key, Text value, String name) {
String[] arr = key.toString().split("_");
return arr[0]+"/"+name;
}
}
This code works only when few records are emitted. If I run it against my real input, the job hangs in the reducer at around 70%.
What might be the problem here? It works with a small number of keys but not with many.

Related

Building a multi-record flat file

I'm struggling to find a proper solution for generating a flat file.
Here are some criteria I need to take care of:
The file has a header with summary of its following records
there could be multiple Collection Header Records with multiple Batch Header Records which contain multiple records of different types.
All records within a Batch have a checksum which has to be added to a batch checksum. This one has to be added to the collection Header checksum and that again to the file checksum. Also each entry in the file has a counter value.
So my plan was to create a class for each record. But what now? I have the records and the "summary records"; the next step would be to bring them all in order, compute the sums, and then set the counters.
How should I proceed from here, should I put everything in a big SortedList? If so, how do I know where to add the latest record (It has to be added to its representing batch summary)?
My first idea was to do something like this:
SortedList<HeaderSummary, SortedList<BatchSummary, SortedList<string, object>>>();
But it is hard to navigate through the HeaderSummaries and BatchSummaries to add an object to the inner sorted list, bearing in mind that I may have to create and add a HeaderSummary / BatchSummary first.
Having several separate ArrayLists (one for headers, one for batches, and one for the rest) causes problems when combining them into a flat file, because I have to keep the order while the counters are still unset.
Do you have any clever solution for such a flat file?
Consider using classes to represent levels of your tree structure.
interface iBatch {
int checksum { get; set; } // interface members take no access modifier
void WriteRecord(); // called by BatchSummary.WriteBatch()
}
class BatchSummary {
int batchChecksum;
List<iBatch> records;
public void WriteBatch() {
WriteBatchHeader();
foreach (var record in records)
record.WriteRecord();
}
public void Add(iBatch rec) {
records.Add(rec); // or however you find the appropriate batch
}
}
class CollectionSummary {
int collectionChecksum;
List<BatchSummary> batches;
public void WriteCollection() {
WriteCollectionHeader();
foreach (var batch in batches)
batch.WriteBatch();
}
public void Add(int whichBatch, iBatch rec) {
batches[whichBatch].Add(rec); // or however you find the appropriate batch
}
}
class FileSummary {
// ... file summary info
int fileChecksum;
List<CollectionSummary> collections;
public void WriteFile() {
WriteFileHeader();
foreach (var collection in collections)
collection.WriteCollection();
}
public void Add(int whichCollection, int whichBatch, iBatch rec) {
collections[whichCollection].Add(whichBatch, rec); // or however you find the appropriate collection
}
}
Of course, you could use a common Summary class to be more DRY, if not necessarily more clear.
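The checksum roll-up from the question fits the same tree shape. Below is a minimal sketch, written in Java rather than the C# of the answer, and assuming that a parent checksum is simply the sum of its children's checksums (the question only says they are "added"). The names Node, Rec and Group are hypothetical and do not come from the original post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: every level of the file reports the sum of its children's checksums.
class ChecksumRollup {
    interface Node {
        int checksum();
    }

    // A leaf record carrying its own checksum value.
    static final class Rec implements Node {
        private final int value;
        Rec(int value) { this.value = value; }
        public int checksum() { return value; }
    }

    // Batch, collection and file all behave alike here: they sum their children.
    static final class Group implements Node {
        private final List<Node> children = new ArrayList<>();
        Group add(Node child) { children.add(child); return this; }
        public int checksum() {
            int sum = 0;
            for (Node child : children) {
                sum += child.checksum();
            }
            return sum;
        }
    }

    // Builds file -> collection -> two batches -> records and returns the file checksum.
    static int demo() {
        Group batch1 = new Group().add(new Rec(3)).add(new Rec(4));
        Group batch2 = new Group().add(new Rec(10));
        Group collection = new Group().add(batch1).add(batch2);
        Group file = new Group().add(collection);
        return file.checksum();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 3 + 4 + 10 = 17
    }
}
```

Because the sums are computed on demand, a record can be added to any batch at any time without the parent levels having to be updated by hand.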

Mapreduce - reducer class results not correct

I have Adcampaign driver, mapper, and reducer classes. The first two run fine. The reducer also runs, but its results are not correct. This is a sample project I downloaded from the internet to practice MapReduce programming.
Brief description of the program:
Problem statement:
For this article, let’s pretend that we are running an online advertising company. We run advertising campaigns for clients (like Pepsi, Sony) and the ads are displayed on popular websites such as news sites (CNN, Fox) and social media sites (Facebook). To track how well an advertising campaign is doing, we keep track of the ads we serve and ads that users click.
Scenario
Here is the sequence of events:
1. We serve the ad to the user
2. If the ad appears in the user's browser (i.e. the user saw the ad), we track this event as VIEWED_EVENT
3. If the user clicks on the ad, we track this event as CLICKED_EVENT
sample data:
293868800864,319248,1,flickr.com,12
1293868801728,625828,1,npr.org,19
1293868802592,522177,2,wikipedia.org,16
1293868803456,535052,2,cnn.com,20
1293868804320,287430,2,sfgate.com,2
1293868805184,616809,2,sfgate.com,1
1293868806048,704032,1,nytimes.com,7
1293868806912,631825,2,amazon.com,11
1293868807776,610228,2,npr.org,6
1293868808640,454108,2,twitter.com,18
Input Log files format and description:
Log Files: The log files are in the following format:
timestamp, user_id, view/click, domain, campaign_id.
E.g: 1262332801728, 899523, 1, npr.org, 19
◾timestamp : unix time stamp in milliseconds
◾user_id : each user has a unique id
◾action_id : 1=view, 2=click
◾domain : which domain the ad was served
◾campaign_id: identifies the campaign the ad was part of
Expected output from the reducer:
campaignid, total views, total clicks
Example:
12, 3,2
13,100,23
14, 23,12
I looked at the logs of Mapper. The output is good. But the final output from Reducer is not good.
Reducer class:
public class AdcampaignReducer extends Reducer<IntWritable, IntWritable, IntWritable, Text>
{
// Key/value : IntWritable/List of IntWritables for every campaign, we are getting all actions for that
// campaign as an iterable list. We are iterating through action_ids and calculating views and click
// Once we are done calculating, we write out the results. This is possible because all actions for a campaign are grouped and sent to one reducer.
//Text k= new Text();
public void reduce(IntWritable key, Iterable<IntWritable> results, Context context) throws IOException, InterruptedException
{
int campaign = key.get();
//k = key.get();
int clicks = 0;
int views = 0;
for(IntWritable i:results)
{
int action = i.get();
if (action ==1)
views = views+1;
else if (action == 2)
clicks = clicks + 1;
}
String statistics = "Total Clicks =" +clicks + "and Views =" + views;
context.write(new IntWritable(campaign), new Text(statistics));
}
}
Mapper class:
public class AdcampaignMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
private long numRecords = 0;
@Override
public void map(LongWritable key, Text record, Context context) throws IOException, InterruptedException {
String[] tokens = record.toString().split(",");
if (tokens.length !=5)
{
System.out.println("*** invalid record : " + record);
}
String actionStr = tokens[2];
String campaignStr = tokens[4];
try{
//System.out.println("during parseint"); //used to debug
System.out.println("actionStr =" + actionStr + "and campaign str = " + campaignStr);
int actionid = Integer.parseInt(actionStr.trim());
int campaignid = Integer.parseInt(campaignStr.trim());
//System.out.println("during intwritable"); //used to debug
IntWritable outputKeyFromMapper = new IntWritable(actionid);
IntWritable outputValueFromMapper = new IntWritable(campaignid);
context.write(outputKeyFromMapper, outputValueFromMapper);
}
catch(Exception e){
System.out.println("*** there is exception");
e.printStackTrace();
}
numRecords = numRecords+1;
}
}
Driver program:
public class Adcampaign {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxClosePrice <input path> <output path>");
System.exit(-1);
}
//reads the default configuration of cluster from the configuration xml files
// https://www.quora.com/What-is-the-use-of-a-configuration-class-and-object-in-Hadoop-MapReduce-code
Configuration conf = new Configuration();
//Initializing the job with the default configuration of the cluster
Job job = new Job(conf, "Adcampaign");
//first argument is job itself
//second argument is location of the input dataset
FileInputFormat.addInputPath(job, new Path(args[0]));
//first argument is the job itself
//second argument is the location of the output path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//Defining input Format class which is responsible to parse the dataset into a key value pair
//Configuring the input/output path from the filesystem into the job
// InputFormat is responsible for 3 main tasks.
// a. Validate inputs - meaning the dataset exists in the location specified.
// b. Split up the input files into logical input splits. Each input split will be assigned a mapper.
// c. Recordreader implementation to extract logical records
job.setInputFormatClass(TextInputFormat.class);
//Defining output Format class which is responsible to parse the final key-value output from MR framework to a text file into the hard disk
//OutputFomat does 2 mains things
// a. Validate output specifications. Like if the output directory already exists? If the directory exist, it will throw an error.
// b. Recordwriter implementation to write output files of the job
//Hadoop comes with several output format implemenations.
job.setOutputFormatClass(TextOutputFormat.class);
//Assigning the driver class name
job.setJarByClass(Adcampaign.class);
//Defining the mapper class name
job.setMapperClass(AdcampaignMapper.class);
//Defining the Reducer class name
job.setReducerClass(AdcampaignReducer.class);
//setting the second argument as a path in a path variable
Path outputPath = new Path(args[1]);
//deleting the output path automatically from hdfs so that we don't have delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
///exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You want the output grouped by campaign_id, so campaign_id should be the key emitted by the mapper. The reducer then checks whether each value is a view or a click.
String actionStr = tokens[2];
String campaignStr = tokens[4];
int actionid = Integer.parseInt(actionStr.trim());
int campaignid = Integer.parseInt(campaignStr.trim());
IntWritable outputKeyFromMapper = new IntWritable(actionid);
IntWritable outputValueFromMapper = new IntWritable(campaignid);
Here outputKeyFromMapper should be campaignid, because the shuffle sorts and groups records by the mapper's output key.
Please let me know if it helps.
The output key from your mapper should be campaignid and the value should be actionid.
If you want to count the number of records in the mapper, use counters.
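To see why the key has to be the campaign id, here is a minimal plain-Java sketch (no Hadoop involved) that mimics the map step with campaign_id as key and action_id as value, followed by a per-key count. The class name CampaignCounts and the use of a TreeMap in place of the shuffle are assumptions made for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the map + reduce steps, grouping by campaign_id.
class CampaignCounts {
    // Returns campaign_id -> {views, clicks}.
    static Map<Integer, int[]> count(String[] lines) {
        Map<Integer, int[]> totals = new TreeMap<>();
        for (String line : lines) {
            String[] tokens = line.split(",");
            if (tokens.length != 5) {
                continue; // skip malformed records instead of parsing them
            }
            int action = Integer.parseInt(tokens[2].trim());
            int campaign = Integer.parseInt(tokens[4].trim());
            // "map": key = campaign, value = action; "reduce": count per key
            int[] counts = totals.computeIfAbsent(campaign, k -> new int[2]);
            if (action == 1) {
                counts[0]++; // view
            } else if (action == 2) {
                counts[1]++; // click
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] sample = {
            "1293868801728,625828,1,npr.org,19",
            "1262332801728, 899523, 1, npr.org, 19",
            "1293868802592,522177,2,wikipedia.org,16"
        };
        for (Map.Entry<Integer, int[]> e : count(sample).entrySet()) {
            System.out.println(e.getKey() + ", " + e.getValue()[0] + ", " + e.getValue()[1]);
        }
    }
}
```

With action_id as the key, as in the posted mapper, there would only ever be two groups (1 and 2), which is why the reducer output looks wrong.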
Your mapper and reducer look fine.
Add the lines below to your Driver class and give it a try:
job.setOutputKeyClass( IntWritable.class );
job.setOutputValueClass( Text.class );

Hadoop KeyComposite and Combiner

I am doing a secondary sort in Hadoop 2.6.0, following this tutorial:
https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
I have the exact same code, but now I am trying to improve performance so I have decided to add a combiner. I have added two modifications:
Main file:
job.setCombinerClass(CombinerK.class);
Combiner file:
public class CombinerK extends Reducer<KeyWritable, KeyWritable, KeyWritable, KeyWritable> {
public void reduce(KeyWritable key, Iterator<KeyWritable> values, Context context) throws IOException, InterruptedException {
Iterator<KeyWritable> it = values;
System.err.println("combiner " + key);
KeyWritable first_value = it.next();
System.err.println("va: " + first_value);
while (it.hasNext()) {
sum += it.next().getSs();
}
first_value.setS(sum);
context.write(key, first_value);
}
}
But it does not seem to run, because I can't find any log file containing the word "combiner". When I looked at the counters after the run, I saw:
Combine input records=4040000
Combine output records=4040000
So the combiner seems to be executed, but it appears to be called once per key, which would explain why the number of input records equals the number of output records.
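One likely explanation, not confirmed in this thread: Reducer.reduce in the org.apache.hadoop.mapreduce API takes an Iterable of values, while CombinerK declares an Iterator parameter. That makes it an overload rather than an override, so the framework keeps calling the default reduce, which passes every record through unchanged. The plain-Java sketch below reproduces the effect without Hadoop; BaseReducer, WrongCombiner and RightCombiner are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class OverloadDemo {
    // Stand-in for Hadoop's Reducer: the framework only ever calls reduce(Iterable).
    static class BaseReducer {
        String reduce(Iterable<Integer> values) {
            return "identity";
        }
    }

    // Same shape as CombinerK: an Iterator parameter creates a separate overload.
    static class WrongCombiner extends BaseReducer {
        String reduce(Iterator<Integer> values) {
            return "custom";
        }
    }

    // Matching the Iterable signature replaces the default; @Override makes the compiler check it.
    static class RightCombiner extends BaseReducer {
        @Override
        String reduce(Iterable<Integer> values) {
            return "custom";
        }
    }

    // Framework-style call through the base type.
    static String run(BaseReducer reducer) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        return reducer.reduce(values);
    }

    public static void main(String[] args) {
        System.out.println(run(new WrongCombiner())); // identity
        System.out.println(run(new RightCombiner())); // custom
    }
}
```

Adding @Override to the combiner's reduce method turns this kind of mismatch into a compile error, which would also explain the missing "combiner" log lines.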

Custom Partitioner : N number of keys to N different files

My requirement is to write a custom partitioner. I have N keys coming from the mapper, for example ('jsa', 'msa', 'jbac'). Their length is not fixed; a key can in fact be any word. The partitioner should collect all data for the same key into the same file. The number of keys is not fixed. Thank you in advance.
Thanks,
Sathish.
So your mapper outputs multiple keys, and you want a different reducer for each key and a separate file for each key.
Writing a Partitioner is one way to achieve that. By default, Hadoop applies its own internal logic to the keys to decide which reducer receives each one. A custom partitioner overrides that default behaviour with your own logic. Unless you know exactly how your keys will vary, this logic won't be generic; you have to work it out based on the variations.
Here is a sample you can refer to, but it is not generic.
public class CustomPartitioner extends Partitioner<Text, Text>
{
@Override
public int getPartition(Text key, Text value, int numReduceTasks)
{
if(key.toString().contains("Key1"))
{
return 1;
}else if(key.toString().contains("Key2"))
{
return 2;
}else if(key.toString().contains("Key3"))
{
return 3;
}else if(key.toString().contains("Key4"))
{
return 4;
}else if(key.toString().contains("Key5"))
{
return 5;
}else
{
return 7;
}
}
}
This should solve your problem. Just replace Key1, Key2, etc. with your key names. Note that every returned value must be less than numReduceTasks.
If you don't know the key names in advance, you can write your own logic along these lines:
public class CustomPartitioner extends Partitioner<Text, Text>
{
public int getPartition(Text key, Text value,int numReduceTasks)
{
return (key.toString().charAt(0)) % numReduceTasks;
}
}
The partitioner above only illustrates how to write your own logic: it takes the first character of the key and applies % with the number of reducers, which yields a number between 0 and numReduceTasks - 1, so different reducers get called and write their output to different files. With this approach, however, two different keys can map to the same value and therefore end up in the same file, so you have to make sure that does not happen.
That was the custom partitioner.
Another solution is to override the methods of the MultipleOutputFormat class, which lets you do the job in a generic way. With this approach you can also generate customized file names for the reducer output files in HDFS.
NOTE: Make sure you use the same libraries throughout. Don't mix the mapred and mapreduce libraries: org.apache.hadoop.mapred is the older API and org.apache.hadoop.mapreduce is the newer one.
Hope this helps.
I imagine the best way to do this, since it gives a more even split, would be:
public class CustomPartitioner extends Partitioner<Text, Text>
{
public int getPartition(Text key, Text value, int numReduceTasks)
{
// Mask the sign bit: hashCode() can be negative, and a negative partition index is invalid.
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
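A hash-based partition index has to stay within 0 and numReduceTasks - 1, and a plain hashCode() % numReduceTasks can go negative. The sketch below shows the usual sign-bit mask in plain Java. It uses String.hashCode() as a stand-in for Text.hashCode(), which computes a different value, and the class and method names are hypothetical.

```java
// Hypothetical helper showing a non-negative, hash-based partition index.
class PartitionIndex {
    // Clears the sign bit before the modulo so the result is always in [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"jsa", "msa", "jbac"}) {
            System.out.println(key + " -> " + partitionFor(key, 4));
        }
    }
}
```

The same key always lands in the same partition, but different keys can still share one, so this alone does not give one file per key unless there are at least as many reducers as keys and no collisions.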

PIG UDF handle multi-lined tuple split into different mapper

I have a file where each tuple spans multiple lines, for example:
START
name: Jim
phone: 2128789283
address: 56 2nd street, New York, USA
END
START
name: Tom
phone: 6308789283
address: 56 5th street, Chicago, 13611, USA
END
.
.
.
So the above are 2 tuples in my file. I wrote a UDF whose getNext() function checks each line: if it is START, I initialize my tuple; if it is END, I return the tuple (built from a string buffer); otherwise I just append the line to the string buffer.
It works well for files smaller than the HDFS block size, which is 64 MB (on Amazon EMR), but fails for larger files. I googled around and found this blog post. Raja's explanation is easy to understand and he provided sample code, but the code implements the RecordReader part instead of getNext() for a Pig LoadFunc. Does anyone have experience handling the multi-line Pig tuple split problem? Should I go ahead and implement a RecordReader in Pig? If so, how?
Thanks.
You may preprocess your input as Guy mentioned, or apply the other tricks described here.
I think the cleanest solution would be to implement a custom InputFormat (along with its RecordReader) that creates one record per START-END block. Pig's LoadFunc sits on top of Hadoop's InputFormat, so you can define which InputFormat your LoadFunc will use.
A raw skeleton implementation of a custom LoadFunc would look like:
import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class CustomLoader extends LoadFunc {
private RecordReader reader;
private TupleFactory tupleFactory;
public CustomLoader() {
tupleFactory = TupleFactory.getInstance();
}
@Override
public InputFormat getInputFormat() throws IOException {
return new MyInputFormat(); //custom InputFormat
}
@Override
public Tuple getNext() {
Tuple result = null;
try {
if (!reader.nextKeyValue()) {
return null;
}
//value can be a custom Writable containing your name/value
//field pairs for a given record
Object value = reader.getCurrentValue();
result = tupleFactory.newTuple();
// ...
//append fields to tuple
}
catch (Exception e) {
// ...
}
return result;
}
@Override
public void prepareToRead(RecordReader reader, PigSplit pigSplit)
throws IOException {
this.reader = reader;
}
@Override
public void setLocation(String location, Job job) throws IOException {
FileInputFormat.setInputPaths(job, location);
}
}
After the LoadFunc initializes the InputFormat and its RecordReader, it locates the input location of your data and obtains records from the RecordReader, creating the resulting tuples (getNext()) until the input has been fully read.
Some remarks on the custom InputFormat:
I'd create a custom InputFormat in which the RecordReader is a modified version of org.apache.hadoop.mapreduce.lib.input.LineRecordReader. Most of the methods would remain the same except initialize(), which would call a custom LineReader (based on org.apache.hadoop.util.LineReader).
The InputFormat's key would be the line offset (Long) and the value would be a custom Writable that holds the fields of a record (i.e. the data between START and END) as a list of key-value pairs. Each time your RecordReader's nextKeyValue() is called, the LineReader writes the record into the custom Writable. The gist of the whole thing is how you implement LineReader.readLine().
Another, probably easier, approach would be to change the delimiter of TextInputFormat (it is configurable in Hadoop 0.23, see textinputformat.record.delimiter)
to one that is appropriate for your data structure (if that is possible). In this case your data arrives as Text, which you need to split to extract the KV pairs and put them into tuples.
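The split-and-extract step could look roughly like the plain-Java sketch below. It assumes every field line has the form "name: value", as in the question's sample, and that the block may still contain the START or END marker. The class RecordBlockParser is hypothetical and not part of Pig or Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns the text of one START..END block into ordered fields.
class RecordBlockParser {
    static Map<String, String> parse(String block) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : block.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.equals("START") || trimmed.equals("END")) {
                continue; // markers and blank lines carry no field
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" line
            }
            // Split on the first colon only, so values may contain colons or commas.
            fields.put(trimmed.substring(0, colon).trim(),
                       trimmed.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String block = "\nname: Jim\nphone: 2128789283\n"
                + "address: 56 2nd street, New York, USA\nEND\n";
        System.out.println(parse(block));
    }
}
```

Each entry of the returned map can then be appended to the tuple in field order.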
If you can take START as your delimiter, the code below probably works without a UDF:
SET textinputformat.record.delimiter 'START';
a = load '<input path>' as (data:chararray);
dump a;
the output would look like:
(
name: Jim
phone: 2128789283
address: 56 2nd street, New York, USA
END
)
(
name: Tom
phone: 6308789283
address: 56 5th street, Chicago, 13611, USA
END
)
Now both are separated into two tuples.

Resources