Adding output files to an existing output directory in MapReduce - Hadoop

I want to add the output files of a MapReduce program to the same directory every time I run the job, appending a timestamp to the end of the file name.
Currently I am able to append the timestamp to the end of the output file name, but I cannot find out how to add files to the same output directory instead of overwriting it every time.

You can write the output files to a temporary folder and move them to the target folder after the job ends. Here is an example of a method that moves all files from one folder to another:
public static void moveFiles(Path from, Path to, Configuration conf) throws IOException {
    FileSystem fs = from.getFileSystem(conf);         // get the file system
    for (FileStatus status : fs.listStatus(from)) {   // list all files in the 'from' folder
        Path file = status.getPath();                  // path of a file in the 'from' folder
        Path dst = new Path(to, file.getName());       // destination path with the same file name
        fs.rename(file, dst);                          // move the file into the 'to' folder
    }
}
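A minimal sketch of how this could be wired into the driver, assuming the job writes to a temporary path and the final names get a timestamp suffix; the paths /data/tmp, /data/output and the timestamp format are illustrative, not part of the original answer:
// run the job against a temporary directory, then move the results into the
// permanent directory with a timestamp suffix appended to each file name
Path tmpOutput = new Path("/data/tmp/run-" + System.currentTimeMillis());
Path finalOutput = new Path("/data/output");           // directory that accumulates results across runs
FileOutputFormat.setOutputPath(job, tmpOutput);

if (job.waitForCompletion(true)) {
    String ts = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    FileSystem fs = tmpOutput.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(tmpOutput)) {
        Path src = status.getPath();
        if (src.getName().startsWith("_")) continue;   // skip _SUCCESS and other marker files
        // e.g. part-r-00000 becomes part-r-00000-20190319160000 in the shared directory
        fs.rename(src, new Path(finalOutput, src.getName() + "-" + ts));
    }
    fs.delete(tmpOutput, true);                        // remove the now-empty temporary directory
}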

The output can also be controlled from the reduce method, so you could try putting that logic in the reducer.
Please note that the number of reducers equals the number of output files.
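As a small illustration of that last point (a hedged sketch, not from the original answer), the reducer count is set in the driver and each reducer produces one part file:
// five reducers yield five output files: part-r-00000 .. part-r-00004
job.setNumReduceTasks(5);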

Related

How to copy newly created files from multiple directories into a single directory using SSIS or cmd?

I have multiple folders in a single directory, and each folder (named per SiteId) contains .orig files. I want to copy the most recently created .orig file from each of these directories into a single directory.
The files are received via FTP into each directory on a weekly basis.
Filename format: PSAN{SiteId}_20190312_TO_20190318_AT_201903191600.txt
I tried to copy the files using the Windows command:
for /R "source" %f in (*.orig) do copy %f "destination"
With this command I am able to copy all the .orig files from multiple folders into a single folder, but it fetches every file.
Can the command be modified to fetch only the latest file?
How can this task be done using an SSIS package or cmd?
Create two package variables called InputFileNameWithPath and OutputFolderPath. The former will be set dynamically in the package; the latter should be hardcoded to your destination folder path.
Create a ForEach Loop container with a File Enumerator. Set the path to the root folder and select the check box to traverse sub-folders. You may want to set the FileMask to *.orig so that only the .orig files are moved (assuming there are other files in the directories that you do not want to move). Assign the fully qualified path to the package variable InputFileNameWithPath.
Then add a File System Task inside the ForEach Loop and have it move the file from InputFileNameWithPath to OutputFolderPath.
You can use a Script Task to achieve that: store the folders in an array, loop over them, get the latest file created in each, and copy it to the destination directory (you can also achieve this without SSIS). For example:
using System.Linq;

string[] sourcefolders = new string[] { @"C:\New Folder1", @"C:\New Folder2" };
string destinationfolder = @"D:\Destination";

foreach (string srcFolder in sourcefolders)
{
    // latest .orig file in this folder by last write time
    string latestfile = System.IO.Directory.GetFiles(srcFolder, "*.orig")
        .OrderByDescending(x => System.IO.File.GetLastWriteTime(x))
        .FirstOrDefault();

    if (!String.IsNullOrWhiteSpace(latestfile))
    {
        System.IO.File.Copy(latestfile, System.IO.Path.Combine(destinationfolder, System.IO.Path.GetFileName(latestfile)));
    }
}
An example C# Script Task for this is below. It uses SSIS variables for the source and destination folders; make sure to add these to the ReadOnlyVariables field if that's how you're storing them, and that you have the references listed below as well. If a file with the same name already exists it will be overwritten to avoid an error, but I'd recommend confirming that this meets your requirements. The third parameter of File.Copy controls this and can be omitted, as it's not a required parameter.
using System.Linq;
using System.IO;
using System.Collections.Generic;

string sourceFolder = Dts.Variables["User::SourceDirectory"].Value.ToString();
string destinationFolder = Dts.Variables["User::DestinationDirectory"].Value.ToString();

Dictionary<string, string> files = new Dictionary<string, string>();
DirectoryInfo sourceDirectoryInfo = new DirectoryInfo(sourceFolder);

foreach (DirectoryInfo di in sourceDirectoryInfo.EnumerateDirectories())
{
    // get the most recently created .orig file in this sub-directory
    FileInfo fi = di.EnumerateFiles().Where(e => e.Extension == ".orig")
                    .OrderByDescending(f => f.CreationTime).First();

    // add the full file path (FullName) as the key and the name only (Name) as the value
    files.Add(fi.FullName, fi.Name);
}

foreach (var v in files)
{
    // copy using the key (full path) to the destination folder,
    // combining the destination folder path with the file name;
    // true = overwrite if a file with the same name already exists
    File.Copy(v.Key, Path.Combine(destinationFolder, v.Value), true);
}

Flutter: how to move a file

I want to move certain files. How do I move an image file from one directory to another,
for example the file img.jpg from /storage/emulated/0/Myfolder to /storage/emulated/0/Urfolder?
File.rename works only if the source file and the destination path are on the same file system; otherwise you will get a FileSystemException (OS Error: Cross-device link, errno = 18). Therefore it should only be used for moving a file if you are sure the source file and the destination path are on the same file system.
For example, trying to move a file under /storage/emulated/0/Android/data/ to a new path under /data/user/0/com.my.app/cache/ will fail with a FileSystemException.
Here is a small utility function for moving a file safely:
Future<File> moveFile(File sourceFile, String newPath) async {
  try {
    // prefer using rename as it is probably faster
    return await sourceFile.rename(newPath);
  } on FileSystemException catch (e) {
    // if rename fails, copy the source file and then delete it
    final newFile = await sourceFile.copy(newPath);
    await sourceFile.delete();
    return newFile;
  }
}
await File('/storage/emulated/0/Myfolder/img.jpg').rename('/storage/emulated/0/Urfolder/img.jpg');
If the files are on different file systems you need to create a new destination file, read the source file and write the content into the destination file, then delete the source file.

Listing files in an HDFS directory

Currently I am getting the list of files in an HDFS directory as below:
FileSystem fs = DistributedFileSystem.get(URI.create(projectDir), conf);
for (FileStatus status : fs.listStatus(inputDirPath)) {
    // Do something
}
The problem is that this directory has a very large number of files, so this fills up memory. Is there a way to get a filtered file list, e.g. only files created after a particular day?
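No answer is recorded here, but as a hedged sketch, assuming filtering by modification time is acceptable (HDFS tracks modification time, not creation time): the RemoteIterator API streams the listing instead of materializing it all in memory. The cutoff date below is illustrative.
// iterate lazily instead of loading the whole directory listing into memory
long cutoff = new GregorianCalendar(2014, Calendar.JANUARY, 1).getTimeInMillis(); // illustrative date
FileSystem fs = FileSystem.get(URI.create(projectDir), conf);
RemoteIterator<LocatedFileStatus> it = fs.listFiles(inputDirPath, false);
while (it.hasNext()) {
    LocatedFileStatus status = it.next();
    // only keep files modified on or after the cutoff
    if (status.getModificationTime() >= cutoff) {
        // Do something with status.getPath()
    }
}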

Writing output to different folders in Hadoop

I want to write two different types of output from the same reducer into two different directories.
I am able to use the MultipleOutputs feature in Hadoop to write to different files, but they both go to the same output folder.
I want to write each file from the same reducer to a different folder.
Is there a way of doing this?
If I try putting, for example, "hello/testfile" as the second argument, it reports an invalid argument, so I am not able to write to different folders.
If the above is not possible, is it possible for the mapper to read only specific files from an input folder?
Please help me.
Thanks in advance!
Thanks for the reply. I am able to read a file successfully using the above method, but in distributed mode I am not able to do so. In the reducer (using MultipleOutputs) I have set:
mos.getCollector("data", reporter).collect(new Text(str_key), new Text(str_val));
and in the job conf I tried using
FileInputFormat.setInputPaths(conf2, "/home/users/mlakshm/opchk285/data-r-00000*");
as well as
FileInputFormat.setInputPaths(conf2, "/home/users/mlakshm/opchk285/data*");
But it gives the following error:
cause: org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://mentat.cluster:54310/home/users/mlakshm/opchk295/data-r-00000* matches 0 files
Question 1: Writing output files to different directories - you can do this using the following approaches:
1. Using the MultipleOutputs class:
It's great that you are already able to create multiple named output files using MultipleOutputs. As you know, this needs to be registered in your driver code:
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);
The API provides two overloaded write methods to achieve this.
multipleOutputs.write("OutputFileName", new Text(Key), new Text(Value));
Now, to write the output file to separate output directories, you need to use an overloaded write method with an extra parameter for the base output path.
multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);
Please remember to vary baseOutputPath within your implementation as needed.
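A minimal reducer sketch putting those two pieces together (the named output "OutputFileName", the Text key/value classes, and the base path "set1/part" are illustrative assumptions, registered in the driver as shown above):
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // the fourth argument is the base output path, resolved relative to the job's output directory
            multipleOutputs.write("OutputFileName", key, value, "set1/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();   // flush and close the named output writers
    }
}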
2. Rename/Move the files in the driver class:
This is probably the easiest hack for writing output to multiple directories. Use MultipleOutputs and write all the output files to a single output directory, but give the files a different name for each category.
Assume you want to create 3 different sets of output files; the first step is to register the named outputs in the driver:
MultipleOutputs.addNamedOutput(job, "set1", OutputFormatClass, keyClass, valueClass);
MultipleOutputs.addNamedOutput(job, "set2", OutputFormatClass, keyClass, valueClass);
MultipleOutputs.addNamedOutput(job, "set3", OutputFormatClass, keyClass, valueClass);
Also, create the different output directories or the directory structure you want in the driver code, along with the actual output directory:
Path set1Path = new Path("/hdfsRoot/outputs/set1");
Path set2Path = new Path("/hdfsRoot/outputs/set2");
Path set3Path = new Path("/hdfsRoot/outputs/set3");
The final important step is to move the output files into those directories based on their names once the job has completed successfully:
FileSystem fileSystem = FileSystem.get(new Configuration());
if (jobStatus == 0) {
    // Get the output files from the actual output path
    FileStatus outputfs[] = fileSystem.listStatus(outputPath);
    // Iterate over all the files in the output path
    for (int fileCounter = 0; fileCounter < outputfs.length; fileCounter++) {
        // Based on each file name, rename (move) the file into its set directory
        if (outputfs[fileCounter].getPath().getName().contains("set1")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set1Path + "/" + anyNewFileName));
        } else if (outputfs[fileCounter].getPath().getName().contains("set2")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set2Path + "/" + anyNewFileName));
        } else if (outputfs[fileCounter].getPath().getName().contains("set3")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set3Path + "/" + anyNewFileName));
        }
    }
}
Note: this will not add any significant overhead to the job, because we are only moving files from one directory to another. Which approach to choose depends on the nature of your implementation.
In summary, this approach writes all the output files with different names to the same output directory, and once the job has completed successfully, we rename them and move them into the different output directories.
Question 2: Reading specific files from an input folder(s):
You can definitely read specific input files from a directory using the MultipleInputs class.
Based on your input path/file names, you can pass the input files to the corresponding Mapper implementation.
Case 1: If all the input files ARE IN a single directory:
FileStatus inputfs[] = fileSystem.listStatus(inputPath);
for (int fileCounter = 0; fileCounter < inputfs.length; fileCounter++) {
    if (inputfs[fileCounter].getPath().getName().contains("set1")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set1Mapper.class);
    } else if (inputfs[fileCounter].getPath().getName().contains("set2")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set2Mapper.class);
    } else if (inputfs[fileCounter].getPath().getName().contains("set3")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set3Mapper.class);
    }
}
Case 2: If all the input files ARE NOT IN a single directory:
We can basically use the same approach as above even if the input files are in different directories: iterate over the base input path and check each file path name against the matching criteria.
Or, if the files are in completely different locations, the simplest way is to add each input path individually:
MultipleInputs.addInputPath(job, Set1_Path, TextInputFormat.class, Set1Mapper.class);
MultipleInputs.addInputPath(job, Set2_Path, TextInputFormat.class, Set2Mapper.class);
MultipleInputs.addInputPath(job, Set3_Path, TextInputFormat.class, Set3Mapper.class);
Hope this helps! Thank you.
Copy the MultipleOutputs code into your code base and loosen the restriction on allowable characters. I can't see any valid reason for the restrictions anyway.
Yes, you can specify that an input format only processes certain files:
FileInputFormat.setInputPaths(job, "/path/to/folder/testfile*");
If you do amend the code, remember that the _SUCCESS file should be written to both folders upon successful job completion. While this isn't a requirement, it is a mechanism by which someone can determine whether the output in that folder is complete and not 'truncated' because of an error.
Yes, you can do this. All you need to do is generate the file name for a particular key/value pair coming out of the reducer.
If you override the appropriate method (generateFileNameForKeyValue in MultipleTextOutputFormat, per the linked page), you can return a file name depending on which key/value pair you get. Here is a link that shows how to do that:
https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
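A brief sketch of that technique under the old mapred API (a hedged example, not the linked page's exact code; the rule of bucketing by key is illustrative):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// route each record to a file under a subdirectory named after its key,
// e.g. <output>/<key>/part-00000
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // 'name' is the default leaf file name (e.g. part-00000); prefix it with the key
        return key.toString() + "/" + name;
    }
}
In the driver (old API), this would be set with conf.setOutputFormat(KeyBasedOutputFormat.class) on the JobConf.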

How to overwrite/reuse the existing output path for Hadoop jobs again and again

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily.
Actually the output directory will store summarized output of each day's job run results.
If I specify the same output directory it gives the error "output directory already exists".
How to bypass this validation?
What about deleting the directory before you run the job?
You can do this via shell:
hadoop fs -rmr /path/to/your/output/
or via the Java API:
// configuration should contain reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
// true stands for recursively deleting the folder you gave
fs.delete(new Path("/path/to/your/output"), true);
Jungblut's answer is your direct solution. Since I never trust automated processes to delete stuff (me personally), I'll suggest an alternative:
Instead of trying to overwrite, I suggest you make the output name of your job dynamic, including the time in which it ran.
Something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep your old job output around in case you ever need to revisit it. In my system, which runs 10+ daily jobs, we structure the output as /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, etc.
Hadoop's TextOutputFormat (which I guess you are using) does not allow writing into an existing directory, probably to spare you the pain of finding out you mistakenly overwrote something you (and your cluster) worked very hard on.
However, if you are certain you want your output folder to be overwritten by the job, I believe the cleanest way is to change TextOutputFormat a little, like this:
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
        CompressionCodec codec = null;
        String extension = "";
        if (isCompressed) {
            Class<? extends CompressionCodec> codecClass =
                getOutputCompressorClass(job, GzipCodec.class);
            codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            extension = codec.getDefaultExtension();
        }
        Path file = getDefaultWorkFile(job, extension);
        FileSystem fs = file.getFileSystem(conf);
        FSDataOutputStream fileOut = fs.create(file, true);
        if (!isCompressed) {
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
            return new LineRecordWriter<K, V>(
                new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
Now you are creating the FSDataOutputStream (fs.create(file, true)) with overwrite=true.
Hadoop already supports the effect you seem to be trying to achieve by allowing multiple input paths to a job. Instead of trying to have a single directory of files to which you add more files, have a directory of directories to which you add new directories. To use the aggregate result as input, simply specify the input glob as a wildcard over the subdirectories (e.g., my-aggregate-output/*). To "append" new data to the aggregate as output, simply specify a new unique subdirectory of the aggregate as the output directory, generally using a timestamp or some sequence number derived from your input data (e.g. my-aggregate-output/20140415154424).
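A short driver sketch of that pattern (a hedged example; the aggregate root name my-aggregate-output, the timestamp format, and the consumerJob variable are illustrative assumptions):
// to "append" this run's data, write to a fresh, uniquely named subdirectory of the aggregate
String runId = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
FileOutputFormat.setOutputPath(job, new Path("my-aggregate-output/" + runId));

// to consume the whole aggregate in a later job, glob over the subdirectories
FileInputFormat.setInputPaths(consumerJob, new Path("my-aggregate-output/*"));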
If one is loading the input file (with, e.g., appended entries) from the local file system to the Hadoop distributed file system as such:
hdfs dfs -put /mylocalfile /user/cloudera/purchase
then one can also overwrite/reuse the existing output directory with -f. There is no need to delete or re-create the folder:
hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
Hadoop follows the philosophy of Write Once, Read Many times. Thus when you try to write to the directory again, it assumes it has to make a new one (write once), but it already exists, so it complains. You can delete it via hadoop fs -rmr /path/to/your/output/. It's better to create a dynamic directory (e.g., based on a timestamp or hash value) in order to preserve the data.
You can create a time-based output subdirectory for each execution. For example, let's say you are expecting the output directory from the user and set it as follows:
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Change this to the following lines:
String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
I had a similar use case, and I use MultipleOutputs to resolve it.
For example, I want different MapReduce jobs to write to the same directory /outputDir/: job 1 writes to /outputDir/job1-part1.txt, job 2 writes to /outputDir/job2-part1.txt (without deleting existing files).
In the main method, set the job's output directory to a throwaway one (it can be deleted before a new job runs):
FileOutputFormat.setOutputPath(job, new Path("/randomPath"));
In the reducer/mapper, use MultipleOutputs and make the writer write to the desired directory:
private MultipleOutputs mos;

public void setup(Context context) {
    mos = new MultipleOutputs(context);
}
and:
mos.write(key, value, "/outputDir/fileOfJobX.txt");
However, my use case was a bit more complicated than that. If it's just about writing to the same flat directory, you can write to a different directory and run a script to migrate the files, like: hadoop fs -mv /tmp/* /outputDir
In my use case, each MapReduce job writes to different sub-directories based on the value of the message being written. The directory structure can be multi-layered, like:
/outputDir/
    messageTypeA/
        messageSubTypeA1/
            job1Output/
                job1-part1.txt
                job1-part2.txt
                ...
            job2Output/
                job2-part1.txt
                ...
        messageSubTypeA2/
            ...
    messageTypeB/
        ...
Each MapReduce job can write to thousands of sub-directories, and the cost of writing to a tmp dir and moving each file to the correct directory is high.
I encountered this exact problem, it stems from the exception raised in checkOutputSpecs in the class FileOutputFormat. In my case, I wanted to have many jobs adding files to directories that already exist and I guaranteed that the files would have unique names.
I solved it by creating an output format class which overrides only the checkOutputSpecs method and suffocates (ignores) the FileAlreadyExistsException that's thrown where it checks if the directory already exists.
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        try {
            super.checkOutputSpecs(job);
        } catch (FileAlreadyExistsException ignored) {
            // Suffocate the exception
        }
    }
}
And in the job configuration, I used LazyOutputFormat and also MultipleOutputs.
LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
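For context, a hedged sketch of how the two might sit together in the driver (the named output "data" and the Text key/value classes are illustrative assumptions, not from the original answer):
// LazyOutputFormat only creates output files when records are actually written,
// so no empty default part files appear; records are emitted via MultipleOutputs
LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
MultipleOutputs.addNamedOutput(job, "data", OverwriteTextOutputFormat.class, Text.class, Text.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));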
You need to add this setting in your main class:
// configure the output path for the job
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
// automatically delete the output dir if it already exists (true = recursive)
outputPath.getFileSystem(conf).delete(outputPath, true);

Resources