DistributedCache Hadoop - FileNotFound - hadoop

I'm trying to place a file in the distributed cache. In order to do this I invoke my driver class using the -files option, something like:
hadoop jar job.jar my.driver.class -files MYFILE input output
The getCacheFiles() and the getLocalCacheFiles() return arrays of URIs/Paths containing MYFILE.
(E.g.: hdfs://localhost/tmp/hadoopuser/mapred/staging/knappy/.staging/job_201208262359_0005/files/histfile#histfile)
Unfortunately, when trying to retrieve MYFILE in the map task, it throws a FileNotFoundException.
I tried this in standalone(local) mode as well as in pseudo-distributed mode.
Do you know what might be the cause ?
UPDATE:
The following three lines:
System.out.println("cache files:"+ctx.getConfiguration().get("mapred.cache.files"));
uris = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
for(Path uri: uris){
System.out.println(uri.toString());
System.out.println(uri.getName());
if(uri.getName().contains(Constants.PATH_TO_HISTFILE)){
histfileName = uri.getName();
}
}
print out this:
cache files:file:/home/knappy/histfile#histfile
/tmp/hadoop-knappy/mapred/local/archive/-7231_-1351_105/file/home/knappy/histfile
histfile
So, the file seems to be listed in the job.xml mapred.cache.files property and the local file seems to be present. Still, the FileNotFoundException is thrown.

First check mapred.cache.files in your job's xml to see whether the file is in the cache.
The you can retrieve it in your mapper:
...
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
File myFile = new File(files[0].getName());
//read your file content
...

Related

How to get the file name in hadoop from input file path out side mapper and reducer i.e driver class

To get file path in mapper or reducer we use
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String filename = fileSplit.getPath().getName();
System.out.println("File name "+filename);
System.out.println("Directory and File name"+fileSplit.getPath().toString());
process(key,value);
But In input folder i had five different kind of files so need to get file name such that i can set different mappers for different files .
example in args[0] my input folder /cloudera/test contains test.txt,dev.txt,rev.txt
if file name contains dev I should set mapper1
file name contains test I should set mapper 2
..........
You have to use MultipleInputs and mappers i think i got a good link for you help that helped me too when i practiced it long back.
MultipleInput Usage
You can use something like this:
FileInputFormat.addInputPaths(job, String.valueOf(args[0]+","+args[1]));
Where, you can mention the paths of individual files in args[0] and args[1].

passing URI as a runtime variable to distributed cache in mapreduce hadoop

I am using distributed cache in my mapreduce program and I am passing three variables to this mapreduce program input file, output dir and config file.
I want to add the third argument i.e config file to the Distributed Cache.
I am setting the parameter as follows in run() method of the MapReduce Driver:-
conf.set("CONF_XML", args[2]);
How to add this file into distributed cache in the same method. how do I do that ?
Usually we add using URI(new (file path));
DistributedCache.addCacheFile(new URI(file_path), conf); << here how to pass the argument parameter?
Pass the file path argument to the DistributedCache API as URI
DistributedCache.addCacheFile(new Path(args[2]).toUri(),job.getConfiguration());

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<<String>String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<<String>String,String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way, the input reader fails to read the text from the gz file, but rather reads the binary Gibberish.
So, I would like to know if I can set it to somehow read the text, or alternatively access the file name using sc.textFile(...)
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat which cannot read gzipped files because they are not splittable (source proving it):
override def createRecordReader(
split: InputSplit,
context: TaskAttemptContext): RecordReader[String, String] = {
new CombineFileRecordReader[String, String](
split.asInstanceOf[CombineFileSplit],
context,
classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with wholefileinputformat (not built into hadoop but all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class possibly extending WholeFileInputFormat to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZipInputStream
UPDATE 2: If you have access to the directory name like in the OP's comment below you can get all the files like this.
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); //just uses the default one
FileStatus [] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
I faced the same issue while using spark to connect to S3.
My File was a gzip csv with no extension .
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned currupted values
I solved it by using the the below code :
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL , spark automatically picked the file and read it like gz file .(Seems a wrong approach but solved my problem .

How to overwrite/reuse the existing output path for Hadoop jobs again and agian

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily.
Actually the output directory will store summarized output of each day's job run results.
If I specify the same output directory it gives the error "output directory already exists".
How to bypass this validation?
What about deleting the directory before you run the job?
You can do this via shell:
hadoop fs -rmr /path/to/your/output/
or via the Java API:
// configuration should contain reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
// true stands for recursively deleting the folder you gave
fs.delete(new Path("/path/to/your/output"), true);
Jungblut's answer is your direct solution. Since I never trust automated processes to delete stuff (me personally), I'll suggest an alternative:
Instead of trying to overwrite, I suggest you make the output name of your job dynamic, including the time in which it ran.
Something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep around your old job output in case you ever need to revisit in. In my system, which runs 10+ daily jobs, we structure the output to be: /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, etc.
Hadoop's TextInputFormat (which I guess you are using) does not allow overwriting an existing directory. Probably to excuse you the pain of finding out you mistakenly deleted something you (and your cluster) worked very hard on.
However, If you are certain you want your output folder to be overwritten by the job, I believe the cleanest way is to change TextOutputFormat a little like this:
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V>
{
public RecordWriter<K, V>
getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
{
Configuration conf = job.getConfiguration();
boolean isCompressed = getCompressOutput(job);
String keyValueSeparator= conf.get("mapred.textoutputformat.separator","\t");
CompressionCodec codec = null;
String extension = "";
if (isCompressed)
{
Class<? extends CompressionCodec> codecClass =
getOutputCompressorClass(job, GzipCodec.class);
codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
extension = codec.getDefaultExtension();
}
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, true);
if (!isCompressed)
{
return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
}
else
{
return new LineRecordWriter<K, V>(new DataOutputStream(codec.createOutputStream(fileOut)),keyValueSeparator);
}
}
}
Now you are creating the FSDataOutputStream (fs.create(file, true)) with overwrite=true.
Hadoop already supports the effect you seem to be trying to achieve by allowing multiple input paths to a job. Instead of trying to have a single directory of files to which you add more files, have a directory of directories to which you add new directories. To use the aggregate result as input, simply specify the input glob as a wildcard over the subdirectories (e.g., my-aggregate-output/*). To "append" new data to the aggregate as output, simply specify a new unique subdirectory of the aggregate as the output directory, generally using a timestamp or some sequence number derived from your input data (e.g. my-aggregate-output/20140415154424).
If one is loading the input file (with e.g., appended entries) from the local file system to hadoop distributed file system as such:
hdfs dfs -put /mylocalfile /user/cloudera/purchase
Then one could also overwrite/reuse the existing output directory with -f. No need to delete or re-create the folder
hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
Hadoop follows the philosophy Write Once, Read Many times. Thus when you try to write to the directory again, it assumes it has to make a new one (Write once) but it already exists, and so it complains. You can delete it via hadoop fs -rmr /path/to/your/output/. It's better to create a dynamic directory (eg,based on timestamp or hash value) in order to preserve data.
You can create an output subdirectory for each execution by time. For example lets say you are expecting output directory from user and then set it as follows:
FileOutputFormat.setOutputPath(job, new Path(args[1]);
Change this by the following lines:
String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
I had a similar use case, I use MultipleOutputs to resolve this.
For example, if I want different MapReduce jobs to write to the same directory /outputDir/. Job 1 writes to /outputDir/job1-part1.txt, job 2 writes to /outputDir/job1-part2.txt (without deleting exiting files).
In the main, set the output directory to a random one (it can be deleted before a new job runs)
FileInputFormat.addInputPath(job, new Path("/randomPath"));
In the reducer/mapper, use MultipleOutputs and set the writer to write to the desired directory:
public void setup(Context context) {
MultipleOutputs mos = new MultipleOutputs(context);
}
and:
mos.write(key, value, "/outputDir/fileOfJobX.txt")
However, my use case was a bit complicated than that. If it's just to write to the same flat directory, you can write to a different directory and runs a script to migrate the files, like: hadoop fs -mv /tmp/* /outputDir
In my use case, each MapReduce job writes to different sub-directories based on the value of the message being writing. The directory structure can be multi-layered like:
/outputDir/
messageTypeA/
messageSubTypeA1/
job1Output/
job1-part1.txt
job1-part2.txt
...
job2Output/
job2-part1.txt
...
messageSubTypeA2/
...
messageTypeB/
...
Each Mapreduce job can write to thousands of sub-directories. And the cost of writing to a tmp dir and moving each files to the correct directory is high.
I encountered this exact problem, it stems from the exception raised in checkOutputSpecs in the class FileOutputFormat. In my case, I wanted to have many jobs adding files to directories that already exist and I guaranteed that the files would have unique names.
I solved it by creating an output format class which overrides only the checkOutputSpecs method and suffocates (ignores) the FileAlreadyExistsException that's thrown where it checks if the directory already exists.
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
#Override
public void checkOutputSpecs(JobContext job) throws IOException {
try {
super.checkOutputSpecs(job);
}catch (FileAlreadyExistsException ignored){
// Suffocate the exception
}
}
}
And the in the job configuration, I used LazyOutputFormat and also MultipleOutputs.
LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
you need to add the setting in your main class:
//Configuring the output path from the filesystem into the job
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//auto_delete output dir
OutputPath.getFileSystem(conf).delete(OutputPath);

How to create and read directories in Hadoop - Mapreduce Job working directory

I want to create a directory inside the working directory of a MapReduce job in Hadoop.
For example by using:
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
in my mapper class to write some intermediate files in it. Is it the right way to do it?.
Also after completion of the job how will I access this directory again if I wish so?
Please advice.
If you are using java, you can override the setup method and open the file handler there ( and close it in cleanup ) . This handle will be available to all mappers.
I am assuming that you are not writing all the map output here but some debug/stats. With this handler you can read and write as it is show in this example ( http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample )
if you want to read the whole directory, check out this example https://sites.google.com/site/hadoopandhive/home/how-to-read-all-files-in-a-directory-in-hdfs-using-hadoop-filesystem-api
remember that you will not be able to depend on the the order of data written to the files.
You can override setupReduce() in reducer class, use mkdirs() to create folder and use create() to create file for outputstream.
#Override
protected void setupReduce(Context context) throws IOException {
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
fs.mkdirs(new Path("your_path_here"));
}

Resources