MapReduce - reducer class results not correct - Hadoop

I have an Adcampaign driver, a mapper, and a reducer class. The first two classes run fine. The reducer class also runs without errors, but its results are not correct. This is a sample project I downloaded from the internet to practice writing MapReduce programs.
Brief description of this program:
Problem statement:
For this article, let’s pretend that we are running an online advertising company. We run advertising campaigns for clients (like Pepsi, Sony) and the ads are displayed on popular websites such as news sites (CNN, Fox) and social media sites (Facebook). To track how well an advertising campaign is doing, we keep track of the ads we serve and ads that users click.
Scenario
Here is the sequence of events:
1. We serve the ad to the user
2. If the ad appears in the user's browser (i.e., the user saw the ad), we track this event as VIEWED_EVENT
3. If the user clicks on the ad, we track this event as CLICKED_EVENT
sample data:
293868800864,319248,1,flickr.com,12
1293868801728,625828,1,npr.org,19
1293868802592,522177,2,wikipedia.org,16
1293868803456,535052,2,cnn.com,20
1293868804320,287430,2,sfgate.com,2
1293868805184,616809,2,sfgate.com,1
1293868806048,704032,1,nytimes.com,7
1293868806912,631825,2,amazon.com,11
1293868807776,610228,2,npr.org,6
1293868808640,454108,2,twitter.com,18
Input Log files format and description:
Log Files: The log files are in the following format:
timestamp, user_id, view/click, domain, campaign_id.
E.g: 1262332801728, 899523, 1, npr.org, 19
◾timestamp : unix time stamp in milliseconds
◾user_id : each user has a unique id
◾action_id : 1=view, 2=click
◾domain : which domain the ad was served
◾campaign_id: identifies the campaign the ad was part of
Expected output from the reducer:
campaignid, total views, total clicks
Example:
12, 3,2
13,100,23
14, 23,12
I looked at the logs of the mapper and its output is good, but the final output from the reducer is not.
Reducer class:
public class AdcampaignReducer extends Reducer<IntWritable, IntWritable, IntWritable, Text> {

    // Key/value: IntWritable / list of IntWritables. For every campaign we receive all actions
    // for that campaign as an iterable list. We iterate through the action_ids and count views
    // and clicks. Once we are done counting, we write out the results. This is possible because
    // all actions for a campaign are grouped and sent to one reducer.

    //Text k = new Text();

    public void reduce(IntWritable key, Iterable<IntWritable> results, Context context)
            throws IOException, InterruptedException {
        int campaign = key.get();
        //k = key.get();
        int clicks = 0;
        int views = 0;

        for (IntWritable i : results) {
            int action = i.get();
            if (action == 1)
                views = views + 1;
            else if (action == 2)
                clicks = clicks + 1;
        }

        String statistics = "Total Clicks = " + clicks + " and Views = " + views;
        context.write(new IntWritable(campaign), new Text(statistics));
    }
}
Mapper class:
public class AdcampaignMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

    private long numRecords = 0;

    @Override
    public void map(LongWritable key, Text record, Context context) throws IOException, InterruptedException {
        String[] tokens = record.toString().split(",");
        if (tokens.length != 5) {
            System.out.println("*** invalid record : " + record);
        }

        String actionStr = tokens[2];
        String campaignStr = tokens[4];

        try {
            //System.out.println("during parseint"); //used to debug
            System.out.println("actionStr = " + actionStr + " and campaign str = " + campaignStr);
            int actionid = Integer.parseInt(actionStr.trim());
            int campaignid = Integer.parseInt(campaignStr.trim());

            //System.out.println("during intwritable"); //used to debug
            IntWritable outputKeyFromMapper = new IntWritable(actionid);
            IntWritable outputValueFromMapper = new IntWritable(campaignid);
            context.write(outputKeyFromMapper, outputValueFromMapper);
        } catch (Exception e) {
            System.out.println("*** there is exception");
            e.printStackTrace();
        }
        numRecords = numRecords + 1;
    }
}
Driver program:
public class Adcampaign {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Adcampaign <input path> <output path>");
            System.exit(-1);
        }

        // Reads the default configuration of the cluster from the configuration XML files.
        // https://www.quora.com/What-is-the-use-of-a-configuration-class-and-object-in-Hadoop-MapReduce-code
        Configuration conf = new Configuration();

        // Initializing the job with the default configuration of the cluster.
        Job job = new Job(conf, "Adcampaign");

        // First argument is the job itself, second argument is the location of the input dataset.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // First argument is the job itself, second argument is the location of the output path.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Defining the InputFormat class, which is responsible for parsing the dataset into key/value pairs.
        // InputFormat is responsible for 3 main tasks:
        //   a. Validate inputs - i.e., the dataset exists in the specified location.
        //   b. Split the input files into logical input splits. Each input split is assigned a mapper.
        //   c. Provide a RecordReader implementation to extract logical records.
        job.setInputFormatClass(TextInputFormat.class);

        // Defining the OutputFormat class, which writes the final key/value output of the MR framework to a text file on disk.
        // OutputFormat does 2 main things:
        //   a. Validate output specifications, e.g., if the output directory already exists, it throws an error.
        //   b. Provide a RecordWriter implementation to write the output files of the job.
        // Hadoop comes with several OutputFormat implementations.
        job.setOutputFormatClass(TextOutputFormat.class);

        // Assigning the driver class name.
        job.setJarByClass(Adcampaign.class);

        // Defining the mapper class name.
        job.setMapperClass(AdcampaignMapper.class);

        // Defining the reducer class name.
        job.setReducerClass(AdcampaignReducer.class);

        // Setting the second argument as the output path.
        Path outputPath = new Path(args[1]);

        // Deleting the output path automatically from HDFS so that we don't have to delete it explicitly.
        outputPath.getFileSystem(conf).delete(outputPath);

        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Exit with 0 if the job succeeds, 1 otherwise.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You want the output per campaign_id, so campaign_id should be the key emitted from the mapper. Then, in the reducer, you check whether each value is a view or a click.
String actionStr = tokens[2];
String campaignStr = tokens[4];
int actionid = Integer.parseInt(actionStr.trim());
int campaignid = Integer.parseInt(campaignStr.trim());
IntWritable outputKeyFromMapper = new IntWritable(actionid);
IntWritable outputValueFromMapper = new IntWritable(campaignid);
Here outputKeyFromMapper should be the campaignid, because grouping and sorting are done on the mapper's output key.
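For example, a minimal sketch of the corrected emit (keeping the rest of the map() method as it is):
// Corrected mapper output: campaign_id becomes the key, action_id the value.
IntWritable outputKeyFromMapper = new IntWritable(campaignid);   // key = campaign_id
IntWritable outputValueFromMapper = new IntWritable(actionid);   // value = action_id (1 = view, 2 = click)
context.write(outputKeyFromMapper, outputValueFromMapper);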
Please let me know if this helps.

The output key from your mapper should be the campaignid and the value should be the actionid.
If you want to count the number of records in the mapper, use counters.
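For instance, a minimal sketch of the counter suggestion inside map() (the group and counter names here are only illustrative):
// Increment a custom counter once per input record; the totals appear in the job counters.
context.getCounter("AdCampaign", "MAPPER_RECORDS").increment(1);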

Your mapper and reducer look fine.
Add the lines below to your driver class and give it a try:
job.setOutputKeyClass( IntWritable.class );
job.setOutputValueClass( Text.class );
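For reference, a minimal sketch of how these fit together with the existing map-output declarations in your driver (the map output types differ from the final reducer output types, so both pairs are needed):
// Mapper emits IntWritable/IntWritable; the reducer emits IntWritable/Text.
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);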

Related

Hadoop Map Reduce: How to create a reduce function for this?

I hit a brick wall. I have the following files I've generated from previous MR functions.
Product Scores (I have)
0528881469 1.62
0594451647 2.28
0594481813 2.67
0972683275 4.37
1400501466 3.62
where column 1 = product_id, and column 2 = product_rating
Related Products (I have)
0000013714 [0005080789,0005476798,0005476216,0005064341]
0000031852 [B00JHONN1S,B002BZX8Z6,B00D2K1M3O,0000031909]
0000031887 [0000031852,0000031895,0000031909,B00D2K1M3O]
0000031895 [B002BZX8Z6,B00JHONN1S,0000031909,B008F0SU0Y]
0000031909 [B002BZX8Z6,B00JHONN1S,0000031895,B00D2K1M3O]
where column 1 = product_id, and column 2 = array of also_bought products
The file I am trying to create now combines both of these files into the following:
Recommended Products (I need)
0000013714 [<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
0000031852 [<0005476798, 4.58>,<0005080789, 2.34>,<0005476216, 2.32>]
0000031887 [<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
0000031895 [<0005476216, 2.32>,<0005476798, 4.58>,<0005080789, 2.34>]
0000031909 [<0005476216, 2.32>,<0005080789, 2.34>,<0005476798, 4.58>]
where column 1 = product_id and column 2 = array of <related_product_id, product_rating> tuples
I'm totally stuck at the moment; I thought I had a plan for this, but it turned out not to be a very good plan and it didn't work.
Two approaches, depending on the size of your Product Scores data:
1. If your Product Scores file is not huge, you can load it into the Hadoop distributed cache (now available on the Job itself via Job.addCacheFile()).
Then process the Related Products file, fetch the necessary rating in the reducer, and write it out. Quick and dirty. But if Product Scores is a huge file, this is probably not the right way to go about the problem. (See the sketch after this list.)
2. Reduce-side joins. Various examples are available; e.g., refer to this link to get an idea.
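A minimal sketch of the first (distributed-cache) approach follows; it is not your actual code, and the file path, link name, and class names are illustrative assumptions (imports omitted, as in the other snippets):
// In the driver (hypothetical path): ship the small Product Scores file with the job.
// job.addCacheFile(new URI("/user/hadoop/product_scores.txt#scores"));

public class RelatedProductsMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Loaded once per task in setup() and reused for every record.
    private final Map<String, String> scores = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The cached file is available under its link name ("scores") in the task's working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("./scores"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+"); // product_id <whitespace> rating
                scores.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the Related Products line here, look up each also-bought product's
        // rating from the in-memory "scores" map, and emit the combined result.
    }
}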
As you already have defined a schema, you can create Hive tables on top of it and get the output using queries. This would save you a lot of time.
Edit: Moreover, if you already have MapReduce jobs to create this file, you can add Hive jobs that create external Hive tables on top of these reducer outputs and then query them.
I ended up using a MapFile. I transformed both the ProductScores and RelatedProducts data sets into two MapFiles and then made a Java program that pulled information out of these MapFiles when needed.
MapFileWriter
public class MapFileWriter {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path inputFile = new Path(args[0]);
        Path outputFile = new Path(args[1]);

        Text txtKey = new Text();
        Text txtValue = new Text();

        try {
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream inputStream = fs.open(inputFile);

            MapFile.Writer writer = new MapFile.Writer(conf, fs, outputFile.toString(),
                    txtKey.getClass(), txtValue.getClass());
            writer.setIndexInterval(1);

            while (inputStream.available() > 0) {
                String strLineInInputFile = inputStream.readLine();
                String[] lstKeyValuePair = strLineInInputFile.split("\\t");
                txtKey.set(lstKeyValuePair[0]);
                txtValue.set(lstKeyValuePair[1]);
                writer.append(txtKey, txtValue);
            }
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
MapFileReader
public class MapFileReader {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        FileSystem fs;
        Text txtKey = new Text(args[1]);
        Text txtValue = new Text();
        MapFile.Reader reader;

        try {
            fs = FileSystem.get(conf);
            try {
                reader = new MapFile.Reader(fs, args[0], conf);
                reader.get(txtKey, txtValue);
            } catch (Exception e) {
                e.printStackTrace();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("The value for Key " + txtKey.toString() + " is " + txtValue.toString());
    }
}

How to repeat Job with Partitioner when data is dynamic with Spring Batch?

I am trying to develop a batch process using Spring Batch + Spring Boot (Java config), but I have a problem doing so. I have software that has a database and a Java API, and I read records from there. The batch process should retrieve all the documents whose expiration date is earlier than a certain date, update the date, and save them again in the same database.
My first approach was to read the records 100 at a time: the ItemReader retrieves 100 records, I process them one by one, and finally I write them again. In the reader, I put this code:
public class DocumentItemReader implements ItemReader<Document> {

    public List<Document> documents = new ArrayList<>();

    @Override
    public Document read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        if (documents.isEmpty()) {
            getDocuments(); // This method retrieves 100 documents and stores them in the "documents" list.
            if (documents.isEmpty()) return null;
        }

        Document doc = documents.get(0);
        documents.remove(0);
        return doc;
    }
}
So, with this code, the reader reads from the database until no records are found. When the "getDocuments()" method doesn't retrieve any documents, the list is empty and the reader returns null (so the job finishes). Everything worked fine here.
However, the problem appears when I want to use several threads. In this case, I started using the Partitioner approach instead of multi-threading. The reason for doing that is that I read from the same database, so if I repeat the full step with several threads, all of them will find the same records, and I cannot use pagination (see below).
Another problem is that the database records are updated dynamically, so I cannot use pagination. For example, let's suppose I have 200 records, all of them going to expire soon, so the process is going to retrieve them. Now imagine I retrieve 10 with one thread, and before anything else, that thread processes one and updates it in the same database. The next thread cannot retrieve records 11 to 20, because the first record no longer appears in the search (it has been processed, its date has been updated, and it no longer matches the query).
It is a little difficult to understand, and some things may sound strange, but in my project:
I am forced to use the same database to read and write.
I can have millions of documents, so I cannot read all the records at the same time. I need to read them 100 by 100, or 500 by 500.
I need to use several threads.
I cannot use pagination, as the query to the database will retrieve different documents each time it is executed.
So, after hours of thinking, I believe the only possible solution is to repeat the job until the query retrieves no documents. Is this possible? I want to do something like the step does: do something until null is returned, i.e., repeat the job until the query returns zero records.
If this is not a good approach, I will appreciate other possible solutions.
Thank you.
Maybe you can add a partitioner to your step that will:
Select all the ids of the data that need to be updated (and other columns if needed)
Split them into x partitions (x = gridSize parameter) and write them to temporary files (one per partition)
Register the file name to read in the executionContext
Then your reader no longer reads from the database but from the partitioned file.
It seems complicated, but it's not that bad. Here is an example which handles millions of records using a JDBC query, but it can easily be transposed to your use case:
public class JdbcToFilePartitioner implements Partitioner {

    /** number of records by database fetch */
    private int fetchSize = 100;

    /** working directory */
    private File tmpDir;

    /** limit the number of items to select */
    private Long nbItemMax;

    @Override
    public Map<String, ExecutionContext> partition(final int gridSize) {
        // Create contexts for each partition
        Map<String, ExecutionContext> executionsContexte = createExecutionsContext(gridSize);
        // Fill partitions with the ids to handle
        getIdsAndFillPartitionFiles(executionsContexte);
        return executionsContexte;
    }

    /**
     * @param gridSize number of partitions
     * @return map of execution contexts, one for each partition
     */
    private Map<String, ExecutionContext> createExecutionsContext(final int gridSize) {
        final Map<String, ExecutionContext> map = new HashMap<>();
        for (int partitionId = 0; partitionId < gridSize; partitionId++) {
            map.put(String.valueOf(partitionId), createContext(partitionId));
        }
        return map;
    }

    /**
     * @param partitionId id of the partition to create the context for
     * @return created executionContext
     */
    private ExecutionContext createContext(final int partitionId) {
        final ExecutionContext context = new ExecutionContext();
        String fileName = tmpDir + File.separator + "partition_" + partitionId + ".txt";

        context.put(PartitionerConstantes.ID_GRID.getCode(), partitionId);
        context.put(PartitionerConstantes.FILE_NAME.getCode(), fileName);

        if (contextParameters != null) {
            for (Entry<String, Object> entry : contextParameters.entrySet()) {
                context.put(entry.getKey(), entry.getValue());
            }
        }
        return context;
    }

    private void getIdsAndFillPartitionFiles(final Map<String, ExecutionContext> executionsContexte) {
        List<BufferedWriter> fileWriters = new ArrayList<>();
        try {
            // One BufferedWriter per partition
            for (int i = 0; i < executionsContexte.size(); i++) {
                BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(
                        executionsContexte.get(String.valueOf(i)).getString(PartitionerConstantes.FILE_NAME.getCode())));
                fileWriters.add(bufferedWriter);
            }

            // Fetching the data
            ScrollableResults results = runQuery();

            // Get the results and fill the files
            int currentPartition = 0;
            int nbWriting = 0;
            while (results.next()) {
                fileWriters.get(currentPartition).write(results.get(0).toString());
                fileWriters.get(currentPartition).newLine();
                currentPartition++;
                nbWriting++;

                // If we have already written to all partitions, start again at the first one
                if (currentPartition >= executionsContexte.size()) {
                    currentPartition = 0;
                }
                // If we reach the maximum number of items to read, stop
                if (nbItemMax != null && nbItemMax != 0 && nbWriting >= nbItemMax) {
                    break;
                }
            }

            // closing
            results.close();
            session.close();
            for (BufferedWriter bufferedWriter : fileWriters) {
                bufferedWriter.close();
            }
        } catch (IOException | SQLException e) {
            throw new UnexpectedJobExecutionException("Error writing partition file", e);
        }
    }

    private ScrollableResults runQuery() {
        ...
    }
}
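To complete the picture, here is a minimal, hypothetical sketch of a step-scoped reader that consumes the partition file written above, assuming Spring Batch 4's FlatFileItemReaderBuilder. It also assumes the execution-context key produced by PartitionerConstantes.FILE_NAME.getCode() is the string "fileName"; adapt it to whatever your constant resolves to.
@Bean
@StepScope
public FlatFileItemReader<Long> partitionFileReader(
        @Value("#{stepExecutionContext['fileName']}") String fileName) {
    // Each line of the partition file is one id written by the partitioner above.
    return new FlatFileItemReaderBuilder<Long>()
            .name("partitionFileReader")
            .resource(new FileSystemResource(fileName))
            .lineMapper((line, lineNumber) -> Long.valueOf(line))
            .build();
}
The processor of the partitioned step can then load and update the actual document by its id.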

How to handle duplicate messages using Kafka streaming DSL functions

My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic, using the Kafka Streams DSL API.
There is a possibility of the source system sending duplicate messages to the INPUT topic in case of any failures.
FLOW -
Source System --> INPUT Topic --> Kafka Streaming --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out of the payload, but flatMap is stateless, so I am not able to avoid processing duplicate messages received from the INPUT topic.
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/value pairs before sending them to the OUTPUT topic.
I thought the exactly-once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it doesn't seem to work; I probably did not understand the usage of exactly-once.
Could you please shed some light on this?
My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic, using the Kafka Streams DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
// In this example, we assume that the record value as-is represents a unique event ID by
// which we can perform de-duplication. If your records are different, adapt the extractor
// function as needed.
() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
storeName);
deduplicated.to(outputTopic);
and
/**
 * @param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
 *                                     ID), during the time of which any incoming duplicates of
 *                                     the event will be dropped, thereby de-duplicating the
 *                                     input.
 * @param idExtractor extracts a unique identifier from a record by which we de-duplicate input
 *                    records; if it returns null, the record will not be considered for
 *                    de-duping but forwarded as-is.
 */
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
    if (maintainDurationPerEventInMs < 1) {
        throw new IllegalArgumentException("maintain duration per event must be >= 1");
    }
    leftDurationMs = maintainDurationPerEventInMs / 2;
    rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
    this.idExtractor = idExtractor;
}

@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}

public KeyValue<K, V> transform(final K key, final V value) {
    final E eventId = idExtractor.apply(key, value);
    if (eventId == null) {
        return KeyValue.pair(key, value);
    } else {
        final KeyValue<K, V> output;
        if (isDuplicate(eventId)) {
            output = null;
            updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
        } else {
            output = KeyValue.pair(key, value);
            rememberNewEvent(eventId, context.timestamp());
        }
        return output;
    }
}

private boolean isDuplicate(final E eventId) {
    final long eventTime = context.timestamp();
    final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
            eventId,
            eventTime - leftDurationMs,
            eventTime + rightDurationMs);
    final boolean isDuplicate = timeIterator.hasNext();
    timeIterator.close();
    return isDuplicate;
}

private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
    eventIdStore.put(eventId, newTimestamp, newTimestamp);
}

private void rememberNewEvent(final E eventId, final long timestamp) {
    eventIdStore.put(eventId, timestamp, timestamp);
}

@Override
public void close() {
    // Note: The store should NOT be closed manually here via `eventIdStore.close()`!
    // The Kafka Streams API will automatically close stores when necessary.
}
}
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/value pairs before sending them to the OUTPUT topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
I thought the exactly-once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it doesn't seem to work; I probably did not understand the usage of exactly-once.
As Matthias J. Sax mentioned in his answer, from Kafka's perspective these "duplicates" are not duplicates from the point of view of its exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out of the box for upstream data sources, which are a black box for Kafka.
Exactly-once can be used to ensure that consuming and processing an input topic does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates, but two regular input messages.
To remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input record, you first check if you find the corresponding key in the store. If not, you add it to the store and forward the message. If you find it in the store, you drop the input as a duplicate. Note that this will only work with a 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try to deduplicate, Kafka Streams could re-introduce duplicates in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a punctuation to remove old data from the store if you are sure that no further duplicates can be in the input topic. One way to do this would be to also store the record timestamp (or maybe offset) in the store. This way, you can compare the current time with the stored record time within punctuate() and delete old records (i.e., you would iterate over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or you could also merge your flatMap() code into transform() directly).
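A rough sketch of the punctuation-based cleanup described above (not from the example repository; the store name, schedule interval, and retention period are assumptions), using a plain KeyValueStore that maps each seen key to the timestamp at which it was last observed:
private ProcessorContext context;
private KeyValueStore<String, Long> seenKeys;
private static final long RETENTION_MS = Duration.ofHours(1).toMillis();

@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    this.seenKeys = (KeyValueStore<String, Long>) context.getStateStore("dedup-store");

    // Periodically scan the store and drop entries older than the retention period,
    // so the store does not grow without bound.
    context.schedule(Duration.ofMinutes(10), PunctuationType.WALL_CLOCK_TIME, now -> {
        try (final KeyValueIterator<String, Long> it = seenKeys.all()) {
            while (it.hasNext()) {
                final KeyValue<String, Long> entry = it.next();
                if (now - entry.value > RETENTION_MS) {
                    seenKeys.delete(entry.key);
                }
            }
        }
    });
}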
It's achievable with the DSL only as well, using a SessionWindows changelog without caching:
Wrap the value with a duplicate flag
Turn the flag to true in reduce() within the time window
Filter out values with a true flag
Unwrap the original key and value
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
.stream("input-topic", Consumed.with(keySerde, valueSerde))
.mapValues(v -> new DedupValue<>(v, false))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
.reduce(
(value1, value2) -> new DedupValue<>(value1.value(), true),
Materialized
.<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
.withCachingDisabled()
)
.toStream()
.filterNot((wk, dv) -> dv == null || dv.duplicate())
.selectKey((wk, dv) -> wk.key())
.mapValues(DedupValue::value)
.to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {

    public DedupValueSerde(Serde<V> vSerde) {
        super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
    }

    private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
        @Override
        public byte[] serialize(String topic, DedupValue<V> data) {
            byte[] vBytes = vSerializer.serialize(topic, data.value());
            return ByteBuffer
                    .allocate(vBytes.length + 1)
                    .put(data.duplicate() ? (byte) 1 : (byte) 0)
                    .put(vBytes)
                    .array();
        }
    }

    private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
        @Override
        public DedupValue<V> deserialize(String topic, byte[] data) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            boolean duplicate = buffer.get() == (byte) 1;
            int remainingSize = buffer.remaining();
            byte[] vBytes = new byte[remainingSize];
            buffer.get(vBytes);
            V value = vDeserializer.deserialize(topic, vBytes);
            return new DedupValue<>(value, duplicate);
        }
    }
}

Loading Files in UDF

I have a requirement to populate a field based on the evaluation of a UDF. The input to the UDF would be some other fields in the input as well as a CSV file. Presently, the approach I have taken is to load the CSV file, GROUP it ALL, and then pass it as a bag to the UDF along with the other required parameters. However, it is taking a very long time to complete the process (roughly 3 hours) for source data of 170k records and a CSV of about 150k records.
I'm sure there must be a much more efficient way to handle this, and hence I need your input.
source_alias = LOAD 'src.csv' USING
PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray);
csv_alias = LOAD 'csv_file.csv' USING
PigStorage(',') AS (c1:chararray,c2:chararray,c3:chararray);
grpd_csv_alias = GROUP csv_alias ALL;
final_alias = FOREACH source_alias GENERATE f1 AS f1,
myUDF(grpd_csv_alias, f2) AS derived_f2;
Here is my UDF at a high level.
public class myUDF extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String f2Response = "N";
        DataBag csvAliasBag = (DataBag) input.get(0);
        String f2 = (String) input.get(1);

        try {
            Iterator<Tuple> bagIterator = csvAliasBag.iterator();
            while (bagIterator.hasNext()) {
                Tuple localTuple = (Tuple) bagIterator.next();
                String col1 = ((String) localTuple.get(1)).trim().toLowerCase();
                String col2 = ((String) localTuple.get(2)).trim().toLowerCase();
                String col3 = ((String) localTuple.get(3)).trim().toLowerCase();
                String col4 = ((String) localTuple.get(4)).trim().toLowerCase();

                <Custom logic to populate f2Response based on the value of f2 as well as col1, col2, col3 and col4>
            }
            return f2Response;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
I believe the process is taking too long because the csv_alias bag is built and passed to the UDF for each row in the source file.
Is there any better way to handle this?
Thanks
For small files, you can put them on the distributed cache. This copies the file to each task node as a local file; you then load it yourself. Here's an example from the Pig docs UDF section. I would not recommend parsing the file each time, however. Store your results in a class variable and check to see if it's been initialized. If the CSV is on the local file system, use getShipFiles(). If the CSV you're using is on HDFS, use the getCacheFiles() method. Notice that for HDFS there's a file path followed by a # and some text. To the left of the # is the HDFS path, and to the right is the name you want it to be called when it's copied to the local file system.
public class Udfcachetest extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String concatResult = "";

        FileReader fr = new FileReader("./smallfile1");
        BufferedReader d = new BufferedReader(fr);
        concatResult += d.readLine();

        fr = new FileReader("./smallfile2");
        d = new BufferedReader(fr);
        concatResult += d.readLine();

        return concatResult;
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/user/pig/tests/data/small#smallfile1"); // this is an HDFS file
        return list;
    }

    public List<String> getShipFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/home/hadoop/pig/smallfile2"); // this is a local file
        return list;
    }
}
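Putting the "parse once, keep it in a class variable" advice together with your use case, here is a rough sketch (the HDFS path, link name, column positions, and return values are assumptions, not your actual logic):
public class MyLookupUDF extends EvalFunc<String> {

    // Parsed once per task the first time exec() is called, then reused for every record.
    private Map<String, String> lookup;

    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            lookup = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader("./csvlookup"));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                // Key the map by whatever column(s) your custom logic matches against.
                lookup.put(cols[0].trim().toLowerCase(), cols[1].trim().toLowerCase());
            }
            reader.close();
        }
        String f2 = (String) input.get(0);
        String match = lookup.get(f2.trim().toLowerCase());
        return match != null ? "Y" : "N";
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/user/hadoop/csv_file.csv#csvlookup"); // HDFS path to the left of '#', local name to the right
        return list;
    }
}
With this, the Pig script no longer needs the GROUP ... ALL step; the FOREACH can call the UDF with only f2 (and any other source fields it needs).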

Hadoop KeyComposite and Combiner

I am doing a secondary sort in Hadoop 2.6.0; I am following this tutorial:
https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
I have the exact same code, but now I am trying to improve performance, so I have decided to add a combiner. I have made two modifications:
Main file:
job.setCombinerClass(CombinerK.class);
Combiner file:
public class CombinerK extends Reducer<KeyWritable, KeyWritable, KeyWritable, KeyWritable> {

    public void reduce(KeyWritable key, Iterator<KeyWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<KeyWritable> it = values;
        System.err.println("combiner " + key);

        KeyWritable first_value = it.next();
        System.err.println("va: " + first_value);

        while (it.hasNext()) {
            sum += it.next().getSs();
        }

        first_value.setS(sum);
        context.write(key, first_value);
    }
}
But it seems that it is not being run, because I can't find any log files containing the word "combiner". When I looked at the counters after the run, I saw:
Combine input records=4040000
Combine output records=4040000
The combiner seems to be executed, but it appears to receive a call for each key, and for this reason it has the same number of input records as output records.
