Hadoop Chain Jobs - Skip second Job if input file does not exist - hadoop

I have two Hadoop jobs. The first job saves its output to an HDFS file, and the second job takes that file as input. When the file does not exist, I get an error. How can I skip the second job if the first job's output file does not exist?

Use this test, but with the path created by the first job:
FileSystem fs = FileSystem.get(conf);
String inputDir= "HDFS file path";
if (fs.exists(new Path(inputDir))) {
// this block gets executed only if the file path inputDir exists
}
The code inside the block would contain the configuration and execution code for the second job.
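As a rough sketch of that control flow, the helper below decides whether the second job has anything to read. It is only an illustration: java.nio.file on a local directory stands in for the Hadoop FileSystem API, and the class name SecondJobGate and the "part-" file prefix check are assumptions, not part of the original answer. On a cluster the same checks would go through FileSystem.exists and FileSystem.listStatus.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: decides whether the second job has any input to read.
// java.nio.file stands in for the Hadoop FileSystem API in this sketch.
final class SecondJobGate {

    // True only if the directory exists and contains at least one "part-" file.
    static boolean hasInput(Path firstJobOutput) throws IOException {
        if (!Files.isDirectory(firstJobOutput)) {
            return false; // the first job wrote nothing: skip the second job
        }
        try (DirectoryStream<Path> parts =
                 Files.newDirectoryStream(firstJobOutput, "part-*")) {
            return parts.iterator().hasNext();
        }
    }
}
```

The driver would call hasInput after the first job finishes and configure and submit the second job only when it returns true.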

Related

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop that allows multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like this: a stub HDFS file is created at the beginning of the job, then each reducer creates, as output, a variable number of data blocks and assigns them to this file in a certain order.
The answer is no, that would be an unnecessary complication for a rare use case.
What you should do
option 1 - add some code at the end of your hadoop command
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
// ls job output directory, collect part-r-XXXXX file names
// create HDFS readers for files
// merge them in a single file in whatever way you want
}
All of the required methods are present in the Hadoop FileSystem API.
option 2 - add job to merge files
You can create a generic Hadoop job that accepts a directory name as input and passes every record as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline after your main job.
This would work faster for big inputs.
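Option 1 can be sketched as follows. This is a minimal illustration on the local file system, with java.nio.file standing in for the Hadoop FileSystem calls (listStatus, open, create); the class name PartFileMerger is hypothetical. It lists the part-r-XXXXX files, sorts them by name, and appends their bytes to a single file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: concatenates part-r-XXXXX files in name order.
// java.nio.file stands in for the Hadoop FileSystem API in this sketch.
final class PartFileMerger {

    static void merge(Path jobOutputDir, Path mergedFile) throws IOException {
        // Collect the reducer output files of the finished job.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(jobOutputDir, "part-r-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // part-r-00000, part-r-00001, ...

        // Append each part's bytes to the merged file, in order.
        try (OutputStream out = Files.newOutputStream(mergedFile)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

The merged file name should not itself match part-r-*, otherwise a second run would pick it up as an input.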
If you want the merged output file on the local file system, you can use the hadoop getmerge command to combine the reduce task output files into a single local file. The command is:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

MapReduce - Is there a way to overwrite an output directory through a CustomOutputFormat class?

I would like to overwrite an output directory in MapReduce, but it throws a "FileAlreadyExists" exception. Is there a way to overwrite an output directory by creating a custom output format class?
The easiest way is to check whether the output directory exists and, if it does, delete it and all its contents.
To do this, use the FileSystem class in your driver class.
Path outputPath = new Path("/user/foor/jobOutput");
Job job = new Job();
FileSystem fs = FileSystem.get(outputPath.toUri(),job.getConfiguration());
fs.delete(outputPath, true);
FileOutputFormat.setOutputPath(job, outputPath);
The output of a MapReduce job is stored in HDFS. HDFS follows a write-once, read-many model, so you cannot overwrite an output directory. You have to delete it and write it again through MapReduce.

Oozie generate set of files in directory

I'm trying to ingest log files into Hadoop.
I'd like to use Oozie to trigger my ingestion task (written in Spark), and have Oozie pass the filenames to my task.
I expect the log files to be laid out as:
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.2.log
(etc).
So, now I have two problems:
1. How to get Oozie to generate all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/ and pass them to my app; and
2. How to get Oozie to generate, in parallel, all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/ and pass them to a second invocation of my task.
The date-time-based file name can be created by a small Java program, which can be called from the Oozie workflow.xml,
something like
String processedDateString = (new SimpleDateFormat("yyyyMMddHHmmss")).format(new Date(timeInMilis));
and when calling the same jar in the workflow
<main-class>NameFile.jar</main-class>
<arg>Path=${output_path}</arg>
<arg>Name=${name}</arg>
<arg>processedDate=${(wf:actionData('Rename')['ProcessedDate'])}</arg>
For copying/moving, you can use the same Java program with a copy action.
The Log1 and Log2 locations can be specified in job.properties.
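To make the naming step concrete, here is a minimal sketch of building the directory from the question for a given run time. The class and method names (LogPaths, logDir) are hypothetical, and how Oozie hands the run time to the program is not shown; the sketch only assumes the time is available as a LocalDateTime.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds /example/YEAR-MONTH-DAY-HOUR:MINUTE/<logName>/
final class LogPaths {

    // HH is the 24-hour clock, so afternoon runs do not collide with morning runs.
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH:mm");

    static String logDir(LocalDateTime when, String logName) {
        return "/example/" + STAMP.format(when) + "/" + logName + "/";
    }
}
```

Calling logDir once with "Log1" and once with "Log2" yields the two directories that the two invocations of the task would read.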

Hadoop job just ends

I'm having a rather strange problem with Hadoop.
I wrote a MR job that ends just like that, without executing map or reduce code. It produces the output folder, but that folder is empty. I see no reason for such a behavior.
I'm even trying out this with default Mapper and Reducer, just to find the problem, but I get no exception, no error, the job just finishes and produces an empty folder. Here's the simplest driver:
Configuration conf = new Configuration();
//DistributedCache.addCacheFile(new URI(firstPivotsInput), conf);
Job pivotSelection = new Job(conf);
pivotSelection.setJarByClass(Driver.class);
pivotSelection.setJobName("Silhoutte");
pivotSelection.setMapperClass(Mapper.class);
pivotSelection.setReducerClass(Reducer.class);
pivotSelection.setMapOutputKeyClass(IntWritable.class);
pivotSelection.setMapOutputValueClass(Text.class);
pivotSelection.setOutputKeyClass(IntWritable.class);
pivotSelection.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(pivotSelection, new Path("/home/pera/WORK/DECOMPRESSION_RESULT.csv"));
FileOutputFormat.setOutputPath(pivotSelection, new Path("/home/pera/WORK/output"));
pivotSelection.setNumReduceTasks(1);
pivotSelection.waitForCompletion(true);
What could be the problem in such a simple example?
The simplest explanation is that the input Path ("/home/pera/WORK/DECOMPRESSION_RESULT.csv") does not contain anything on HDFS. You can verify that by the value of the MAP_INPUT_RECORDS counter. You can also check the size of this file on HDFS with hadoop dfs -ls /home/pera/WORK, or you can even see the first few lines of this file with hadoop dfs -cat /home/pera/WORK/DECOMPRESSION_RESULT.csv | head (use -text instead of -cat if it is compressed).
Another problem could be that the reducer has a special (if) condition that fails for every mapper's output, but this should not hold in the case of identity mapper and reducer, so I believe the case is the former one.

How to overwrite/reuse the existing output path for Hadoop jobs again and again

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily.
Actually the output directory will store summarized output of each day's job run results.
If I specify the same output directory it gives the error "output directory already exists".
How to bypass this validation?
What about deleting the directory before you run the job?
You can do this via shell:
hadoop fs -rmr /path/to/your/output/
or via the Java API:
// configuration should contain reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
// true stands for recursively deleting the folder you gave
fs.delete(new Path("/path/to/your/output"), true);
Jungblut's answer is your direct solution. Since I never trust automated processes to delete stuff (me personally), I'll suggest an alternative:
Instead of trying to overwrite, I suggest you make the output name of your job dynamic, including the time in which it ran.
Something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep your old job output around in case you ever need to revisit it. In my system, which runs 10+ daily jobs, we structure the output to be: /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, etc.
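A layout like the one above could be generated along these lines. This is only a sketch: the class name DatedOutput and the method pathFor are hypothetical, and the run date is assumed to be passed in by the driver.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds /output/<job>/<yyyy>/<MM>/<dd>/<job>out
final class DatedOutput {

    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd");

    static String pathFor(String jobName, LocalDate runDate) {
        return "/output/" + jobName + "/" + DAY.format(runDate) + "/" + jobName + "out";
    }
}
```

The returned string would be wrapped in a Path and given to FileOutputFormat.setOutputPath, so each daily run writes to a directory that does not exist yet.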
Hadoop's TextOutputFormat (which I guess you are using) does not allow overwriting an existing directory, probably to spare you the pain of finding out you mistakenly deleted something you (and your cluster) worked very hard on.
However, if you are certain you want your output folder to be overwritten by the job, I believe the cleanest way is to change TextOutputFormat a little, like this:
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V>
{
    public RecordWriter<K, V>
    getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
    {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
        CompressionCodec codec = null;
        String extension = "";
        if (isCompressed)
        {
            Class<? extends CompressionCodec> codecClass =
                getOutputCompressorClass(job, GzipCodec.class);
            codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            extension = codec.getDefaultExtension();
        }
        Path file = getDefaultWorkFile(job, extension);
        FileSystem fs = file.getFileSystem(conf);
        FSDataOutputStream fileOut = fs.create(file, true);
        if (!isCompressed)
        {
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        }
        else
        {
            return new LineRecordWriter<K, V>(new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
Now you are creating the FSDataOutputStream (fs.create(file, true)) with overwrite=true.
Hadoop already supports the effect you seem to be trying to achieve by allowing multiple input paths to a job. Instead of trying to have a single directory of files to which you add more files, have a directory of directories to which you add new directories. To use the aggregate result as input, simply specify the input glob as a wildcard over the subdirectories (e.g., my-aggregate-output/*). To "append" new data to the aggregate as output, simply specify a new unique subdirectory of the aggregate as the output directory, generally using a timestamp or some sequence number derived from your input data (e.g. my-aggregate-output/20140415154424).
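The directory-of-directories scheme comes down to two strings per run: a glob over all existing subdirectories for input, and a fresh subdirectory for output. A minimal sketch, assuming a timestamp is used as the unique suffix; the class name AggregateDirs and its methods are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for the "directory of directories" layout.
final class AggregateDirs {

    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Glob covering every run's subdirectory; use it as the job's input path.
    static String inputGlob(String aggregateRoot) {
        return aggregateRoot + "/*";
    }

    // New, unique subdirectory for this run; use it as the job's output path.
    static String newOutputDir(String aggregateRoot, LocalDateTime runTime) {
        return aggregateRoot + "/" + STAMP.format(runTime);
    }
}
```

Because each run gets its own subdirectory, the "output directory already exists" check never fires, and nothing has to be deleted.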
If one is loading the input file (with, e.g., appended entries) from the local file system to the Hadoop Distributed File System like this:
hdfs dfs -put /mylocalfile /user/cloudera/purchase
then one can also overwrite/reuse the existing target with -f. There is no need to delete or re-create the folder:
hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
Hadoop follows the philosophy of write once, read many times. Thus, when you try to write to the directory again, it assumes it has to make a new one (write once), but it already exists, and so it complains. You can delete it via hadoop fs -rmr /path/to/your/output/. It's better to create a dynamic directory (e.g., based on a timestamp or hash value) in order to preserve data.
You can create an output subdirectory for each execution by time. For example, let's say you expect the output directory from the user and set it as follows:
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Change this by the following lines:
String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
I had a similar use case, and I used MultipleOutputs to resolve it.
For example, say I want different MapReduce jobs to write to the same directory /outputDir/. Job 1 writes to /outputDir/job1-part1.txt, job 2 writes to /outputDir/job1-part2.txt (without deleting existing files).
In the main method, set the output directory to a random one (it can be deleted before a new job runs):
FileOutputFormat.setOutputPath(job, new Path("/randomPath"));
In the reducer/mapper, use MultipleOutputs and set the writer to write to the desired directory:
// keep the writer in a field so map()/reduce() can use it; close it in cleanup()
private MultipleOutputs mos;
public void setup(Context context) {
mos = new MultipleOutputs(context);
}
and:
mos.write(key, value, "/outputDir/fileOfJobX.txt");
However, my use case was a bit more complicated than that. If you just need to write to the same flat directory, you can write to a different directory and run a script to migrate the files, like: hadoop fs -mv /tmp/* /outputDir
In my use case, each MapReduce job writes to different sub-directories based on the value of the message being written. The directory structure can be multi-layered, like:
/outputDir/
    messageTypeA/
        messageSubTypeA1/
            job1Output/
                job1-part1.txt
                job1-part2.txt
                ...
            job2Output/
                job2-part1.txt
                ...
        messageSubTypeA2/
            ...
    messageTypeB/
        ...
Each MapReduce job can write to thousands of sub-directories, and the cost of writing to a tmp dir and then moving each file to the correct directory is high.
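For a layout like this, the path handed to mos.write has to be computed per record. A minimal sketch of that computation follows; the class name OutputLayout, the method basePath, and the way the message type and sub-type are obtained are all assumptions made for illustration.

```java
// Hypothetical helper: builds the per-record output path for the layout above,
// e.g. /outputDir/messageTypeA/messageSubTypeA1/job1Output/job1-part1
final class OutputLayout {

    static String basePath(String root, String type, String subType,
                           String jobName, int part) {
        return root + "/" + type + "/" + subType + "/"
            + jobName + "Output/" + jobName + "-part" + part;
    }
}
```

The resulting string would be passed as the third argument of mos.write, so each record lands directly in its final sub-directory and no move step is needed afterwards.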
I encountered this exact problem; it stems from the exception raised in checkOutputSpecs in the class FileOutputFormat. In my case, I wanted to have many jobs adding files to directories that already exist, and I guaranteed that the files would have unique names.
I solved it by creating an output format class which overrides only the checkOutputSpecs method and swallows (ignores) the FileAlreadyExistsException that is thrown when it checks whether the directory already exists.
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        try {
            super.checkOutputSpecs(job);
        } catch (FileAlreadyExistsException ignored) {
            // Swallow the exception: the directory is allowed to exist
        }
    }
}
And then in the job configuration, I used LazyOutputFormat and also MultipleOutputs.
LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
You need to add this setting in your main class:
// Configure the job's output path from the command-line argument
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
// Automatically delete the output directory (recursively) if it already exists
outputPath.getFileSystem(conf).delete(outputPath, true);
