Issue while writing multiple output files in MapReduce (Hadoop)

I have a requirement to split my input file into 2 output files based on a filter condition. My output directories should look like this:
/hdfs/base/dir/matched/YYYY/MM/DD
/hdfs/base/dir/notmatched/YYYY/MM/DD
I am using the MultipleOutputs class to split my data in my map function.
In my driver class I set the output path like this:
FileOutputFormat.setOutputPath(job, new Path("/hdfs/base/dir"));
and in the Mapper I write like this:
mos.write(key, value, fileName); // the file name is generated based on the filter criteria
This program works fine for a single day, but on the second day it fails with:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://nameservice1/hdfs/base/dir already exists
I cannot use a different base directory for the second day.
How can I handle this situation?
Note: I don't want to read the input twice to create 2 separate files.
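For context, here is a minimal sketch of how the MultipleOutputs wiring described above might look on the mapper side (the new mapreduce API is assumed, and the YYYY/MM/DD part of the path is left as a placeholder to be derived from the record or the job date):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "matched"/"notmatched" is the filter decision; replace YYYY/MM/DD with the real date
        String subDir = matches(value) ? "matched" : "notmatched";
        String baseOutputPath = subDir + "/YYYY/MM/DD/part";
        mos.write(key, value, baseOutputPath); // relative to FileOutputFormat.setOutputPath(...)
    }

    private boolean matches(Text value) {
        return value.toString().contains("X"); // placeholder filter condition
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // make sure the extra outputs are flushed
    }
}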

Create a custom output format class like the one below:
package com.visa.util;
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
public class CustomOutputFormat<K, V> extends SequenceFileOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(JobContext arg0) throws IOException {
        // intentionally empty: skip the check that the output directory does not already exist
    }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext arg0) throws IOException {
        return super.getOutputCommitter(arg0);
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext arg0) throws IOException, InterruptedException {
        return super.getRecordWriter(arg0);
    }
}
and use it in the driver class:
job.setOutputFormatClass(CustomOutputFormat.class);
This will skip the check for the existence of the output directory.

You can have a flag column in your output value. Later you can process the output and split it by the flag column.
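If you go that route, the mapper becomes a plain pass-through that tags each record; a rough sketch of the idea (the flag values and filter condition are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlagColumnMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Prepend a flag column; a follow-up job (or a simple filter) splits on it later
        String flag = matches(value) ? "MATCHED" : "NOTMATCHED";
        context.write(NullWritable.get(), new Text(flag + "\t" + value.toString()));
    }

    private boolean matches(Text value) {
        return value.toString().contains("X"); // placeholder filter condition
    }
}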

Related

How to skip blank lines in CSV using FlatFileItemReader and chunks

I am processing CSV files using FlatFileItemReader.
Sometimes I get blank lines within the input file.
When that happens, the whole step stops. I want to skip those lines and proceed normally.
I tried adding an exception handler to the step in order to catch the exception instead of having the whole step stop:
@Bean
public Step processSnidUploadedFileStep() {
    return stepBuilderFactory.get("processSnidFileStep")
            .<MyDTO, MyDTO>chunk(numOfProcessingChunksPerFile)
            .reader(snidFileReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(manualUploadAsyncItemProcessor())
            .writer(manualUploadAsyncItemWriter())
            .listener(logProcessListener)
            .throttleLimit(20)
            .taskExecutor(infrastructureConfigurationConfig.taskJobExecutor())
            .exceptionHandler((context, throwable) -> logger.error("Skipping record on file. cause=" + ((FlatFileParseException) throwable).getCause()))
            .build();
}
Since I am processing in chunks, when a blank line arrives and the exception is caught, the whole chunk is skipped (the chunk might contain valid lines from the CSV file, and they are skipped as well).
Any idea how to do this right when processing the file in chunks?
Thanks,
ray.
After editing my code, it is still not skipping:
public Step processSnidUploadedFileStep() {
    SimpleStepBuilder<MyDTO, MyDTO> builder = new SimpleStepBuilder<MyDTO, MyDTO>(stepBuilderFactory.get("processSnidFileStep"));
    return builder
            .<MyDTO, MyDTO>chunk(numOfProcessingChunksPerFile)
            .faultTolerant().skip(FlatFileParseException.class)
            .reader(snidFileReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(manualUploadAsyncItemProcessor())
            .writer(manualUploadAsyncItemWriter())
            .listener(logProcessListener)
            .throttleLimit(20)
            .taskExecutor(infrastructureConfigurationConfig.taskJobExecutor())
            .build();
}
We created a custom SimpleRecordSeparatorPolicy which tells the reader to skip blank lines. That way, when we read 100 records of which 3 are blank lines, those are ignored without an exception and 97 records are written.
Here is the code:
package com.my.package;
import org.springframework.batch.item.file.separator.SimpleRecordSeparatorPolicy;
public class BlankLineRecordSeparatorPolicy extends SimpleRecordSeparatorPolicy {

    @Override
    public boolean isEndOfRecord(final String line) {
        return line.trim().length() != 0 && super.isEndOfRecord(line);
    }

    @Override
    public String postProcess(final String record) {
        if (record == null || record.trim().length() == 0) {
            return null;
        }
        return super.postProcess(record);
    }
}
And here is the reader:
package com.my.package;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.stereotype.Component;
@Component
@StepScope
public class CustomReader extends FlatFileItemReader<CustomClass> {

    @Override
    public void afterPropertiesSet() throws Exception {
        setLineMapper(new DefaultLineMapper<CustomClass>() {
            {
                // configuration of the line mapper
            }
        });
        setRecordSeparatorPolicy(new BlankLineRecordSeparatorPolicy());
        super.afterPropertiesSet();
    }
}
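For reference, the same policy can also be plugged into a plain FlatFileItemReader bean instead of subclassing the reader. A minimal sketch, assuming a MyDTO whose fields match the CSV columns; the bean wiring, column names, and job parameter are illustrative:

package com.my.package;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class ReaderConfig {

    @Bean
    @StepScope
    public FlatFileItemReader<MyDTO> snidFileReader(
            @Value("#{jobParameters['filePath']}") String filePath) { // parameter name is illustrative
        FlatFileItemReader<MyDTO> reader = new FlatFileItemReader<MyDTO>();
        reader.setResource(new FileSystemResource(filePath));
        // the policy from above: blank lines never reach the line mapper
        reader.setRecordSeparatorPolicy(new BlankLineRecordSeparatorPolicy());

        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames(new String[] {"fieldA", "fieldB"}); // illustrative column names

        BeanWrapperFieldSetMapper<MyDTO> fieldSetMapper = new BeanWrapperFieldSetMapper<MyDTO>();
        fieldSetMapper.setTargetType(MyDTO.class);

        DefaultLineMapper<MyDTO> lineMapper = new DefaultLineMapper<MyDTO>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        reader.setLineMapper(lineMapper);
        return reader;
    }
}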

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (from the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
    // ... use lineVal, e.g. write it to a file ...
    return 0;
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write it as part of your main job output (the keys and values that you emit from your job).
Or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back to your driver, simply read from HDFS. For example you can store your name/values to the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
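The mapper-side counterpart of this pattern, writing name/value pairs through MultipleOutputs as Text keys and values, could look roughly like the sketch below. The named output "props" and the way the feed name is derived are made up for illustration; the driver would also need to register the named output with MultipleOutputs.addNamedOutput(job, "props", TextOutputFormat.class, Text.class, Text.class).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class FeedNameMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;
    private boolean feedNameWritten = false;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (!feedNameWritten) {
            // TextOutputFormat writes "feedName<TAB>value", which Properties.load() can parse
            mos.write("props", new Text("feedName"), new Text(deriveFeedName(value)));
            feedNameWritten = true;
        }
        context.write(key, value); // the normal job output is untouched
    }

    private String deriveFeedName(Text value) {
        return value.toString().split(",")[0]; // placeholder derivation
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}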
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
Using the configuration you can only do it the other way around.
You can set values in the Driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
    // ... configure and submit the job ...
    return 0;
}
and get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file, store it in HDFS, and then access it in the Driver class. These files can be treated as "intermediate files".
Just try it and see.
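A bare-bones sketch of reading such an intermediate file back in the driver (the path and the single-line format are assumptions for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IntermediateFileReader {

    // Reads the first line of a small "intermediate" file from HDFS
    public static String readFirstLine(Configuration conf, String pathStr) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(pathStr); // e.g. one of the job's output files
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }
}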

Hadoop: Getting the input file name in the mapper only once

I am new to Hadoop and currently working with it. I have a small query.
I have around 10 files in the input folder which I need to pass to my MapReduce program. I want the file name in my mapper, as the file name contains the time at which the file was created. I have seen people using FileSplit to get the file name in the mapper. But if, say, my input files contain millions of lines, then every time the map code is called it will get the file name and extract the time from it, which is obviously a repeated, time-consuming operation for the same file. Once I have the time in the mapper, I should not have to derive it from the file name again and again.
How can I achieve this?
You could use the Mapper's setup() method to get the filename, as setup() is guaranteed to run only once, before map() is called, like this:
public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {

    String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // process each key value pair
    }
}
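Building on that, the time embedded in the name can also be parsed once in setup(), so map() never touches it again. A sketch, assuming a hypothetical file name like feed_201501011230.txt (the naming pattern and date format are assumptions):

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TimedFileMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String filename;
    private Date fileTime;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
        // Assumes a name like "feed_201501011230.txt"; keep only the digits and parse them
        String timePart = filename.replaceAll("\\D", "");
        try {
            fileTime = new SimpleDateFormat("yyyyMMddHHmm").parse(timePart);
        } catch (ParseException e) {
            throw new IOException("Unexpected file name format: " + filename, e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // fileTime is already available here without re-parsing it per record
    }
}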

Hadoop Not Finding Map Class

I am using hadoop-1.2.1 and trying to run a simple RowCount HBase job using ToolRunner. However, no matter what I try, Hadoop cannot find the map class. The jar file is being copied correctly into HDFS, but I can't figure out where it is going wrong. Please help!
Here is the code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class HBaseRowCountToolRunnerTest extends Configured implements Tool {

    // What to copy.
    public static final String JAR_NAME = "myJar.jar";
    public static final String LOCAL_JAR = <path_to_jar> + JAR_NAME;
    public static final String REMOTE_JAR = "/tmp/" + JAR_NAME;

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        //All connection configs set here -- omitted to post the code
        config.set("tmpjars", REMOTE_JAR);

        FileSystem dfs = FileSystem.get(config);
        System.out.println("pathString = " + (new Path(LOCAL_JAR)).toString() + " \n");

        // Copy jar file to remote.
        dfs.copyFromLocalFile(new Path(LOCAL_JAR), new Path(REMOTE_JAR));

        // Get rid of jar file when we're done.
        dfs.deleteOnExit(new Path(REMOTE_JAR));

        // Run the job.
        System.exit(ToolRunner.run(config, new HBaseRowCountToolRunnerTest(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new RowCountJob(getConf(), "testJob", "myLittleHBaseTable");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class RowCountJob extends Job {
        RowCountJob(Configuration conf, String jobName, String tableName) throws IOException {
            super(conf, RowCountJob.class.getCanonicalName() + "_" + jobName);
            setJarByClass(getClass());

            Scan scan = new Scan();
            scan.setCacheBlocks(false);
            scan.setFilter(new FirstKeyOnlyFilter());

            setOutputFormatClass(NullOutputFormat.class);
            TableMapReduceUtil.initTableMapperJob(tableName, scan,
                    RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, this);
            setNumReduceTasks(0);
        }
    } //end public static class RowCountJob extends Job

    //Mapper that runs the count
    //TableMapper -- TableMapper<KEYOUT, VALUEOUT> (*OUT by type)
    public static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result> {

        //Counter enumeration to count the actual rows
        public static enum Counters {ROWS}

        /**
         * Maps the data.
         *
         * @param row The current table row key.
         * @param values The columns.
         * @param context The current context.
         * @throws IOException When something is broken with the data.
         * @see org.apache.hadoop.mapreduce.Mapper#map(KEYIN, VALUEIN,
         *      org.apache.hadoop.mapreduce.Mapper.Context)
         */
        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            // Count every row containing data times 2, whether it's in qualifiers or values
            context.getCounter(Counters.ROWS).increment(2);
        }
    } //end public static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result>
} //end public class HBaseRowCountToolRunnerTest
OK, I found a workaround to the problem and thought I would share it for all others having similar issues...
As it turns out, I abandoned the tmpjars configuration option and just added the jar file directly to the DistributedCache from the code itself. Here is what it looks like:
// Copy jar file to remote.
FileSystem dfs = FileSystem.get(conf);
dfs.copyFromLocalFile(new Path(LOCAL_JAR), new Path(REMOTE_JAR));
// Get rid of jar file when we're done.
dfs.deleteOnExit(new Path(REMOTE_JAR));
//Place it in the distributed cache
DistributedCache.addFileToClassPath(new Path(REMOTE_JAR), conf, dfs);
Perhaps it doesn't solve what is going on with tmpjars, but it does work.
I got the same problem today. Finally, I found it was because I forgot to add the following line in the driver class...
job.setJarByClass(HBaseTestDriver.class);

PIG doesn't read my custom InputFormat

I have a custom MyInputFormat that is supposed to deal with the record boundary problem for multi-line inputs. I put MyInputFormat into my UDF load function as follows:
import org.apache.hadoop.mapreduce.InputFormat;

public class EccUDFLogLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() {
        System.out.println("I am in getInputFormat function");
        return new MyInputFormat();
    }
}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyInputFormat extends TextInputFormat {
    public RecordReader createRecordReader(InputSplit inputSplit, JobConf jobConf) throws IOException {
        System.out.println("I am in createRecordReader");
        //MyRecordReader is supposed to handle record boundaries
        return new MyRecordReader((FileSplit) inputSplit, jobConf);
    }
}
For each mapper, it prints out I am in getInputFormat function but not I am in createRecordReader. I am wondering if anyone can provide a hint on how to hook up my custom MyInputFormat with PIG's UDF loader? Many thanks.
I am using PIG on Amazon EMR.
Your signature doesn't match that of the parent class (you're missing the Reporter argument), try this:
@Override
public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit inputSplit, JobConf jobConf, Reporter reporter)
        throws IOException {
    System.out.println("I am in createRecordReader");
    //MyRecordReader is supposed to handle record boundaries
    return new MyRecordReader((FileSplit) inputSplit, jobConf);
}
EDIT: Sorry, I didn't spot this earlier; as you note, you need to use the new API signature instead:
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException {
    System.out.println("I am in createRecordReader");
    //MyRecordReader is supposed to handle record boundaries; its constructor
    //needs to accept the new-API types (e.g. the split and the TaskAttemptContext)
    return new MyRecordReader((FileSplit) split, context);
}
And your MyRecordReader class needs to extend the org.apache.hadoop.mapreduce.RecordReader class.
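For completeness, the new-API org.apache.hadoop.mapreduce.RecordReader that MyRecordReader would extend has this shape (a bare skeleton only, not the actual record-boundary logic):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MyRecordReaderSkeleton extends RecordReader<LongWritable, Text> {

    private LongWritable currentKey = new LongWritable();
    private Text currentValue = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // open the split and seek to its start, honouring record boundaries
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // read the next (possibly multi-line) record; return false at the end of the split
        return false;
    }

    @Override
    public LongWritable getCurrentKey() {
        return currentKey;
    }

    @Override
    public Text getCurrentValue() {
        return currentValue;
    }

    @Override
    public float getProgress() {
        return 0.0f;
    }

    @Override
    public void close() throws IOException {
        // release any streams opened in initialize()
    }
}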
