How to count similar files in HDFS with Groovy code? - hadoop

I want to count files with similar filenames in an HDFS directory, but I don't want to use the File System API. Can you recommend something that would give me the same result as the Groovy code below?
Is there an equivalent of the file system API that I can use for this task?
String filename = "2017-02-03";
def count = new File("file path ").listFiles()
    .findAll { it.name.substring(0, 10) == filename }
    .size();
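For HDFS, the usual counterpart of java.io.File is Hadoop's org.apache.hadoop.fs.FileSystem class. If that class is acceptable (rather than being the API you want to avoid), a minimal Java sketch of the same count could look like this; the class name HdfsFileCounter and the path /file/path/ are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileCounter {
    public static void main(String[] args) throws Exception {
        String filename = "2017-02-03";
        // FileSystem.get picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        // Glob for files whose names start with the date prefix and count the matches.
        FileStatus[] matches = fs.globStatus(new Path("/file/path/" + filename + "*"));
        int count = (matches == null) ? 0 : matches.length;
        System.out.println(count);
    }
}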

Related

Get list of files containing string(s) or pattern(s)

Is there a Gradle pattern for retrieving the list of files in a folder or set of folders that contain a given string, set of strings, or pattern?
My project produces RPMs and is using the Nebula RPM type (great package!). There are a couple of different kinds of sets of files that need post-processing. I am trying to generate the list of files that contain the strings that are the indicators for post-processing. For example, files that contain "#doc" need to be processed by the doc generator script. Files that contain "#HOSTNAME#" and "#HOSTFQDN#" need to be processed by sed to replace the strings with the actual host name or host fqdn.
The search root in the package will be src\main\resources. With the result, the build script sets up the post-install script commands, something like:
postInstall('/opt/product/bin/postprocess.sh ' + join(filesContainingDocs, " "))
postInstall('/bin/sed -i -e "s/#HOSTNAME#/$(hostname -s)/" -e "s/#HOSTFQDN#/$(hostname)/" ' + join(filesContainingHostname, " "))
I can figure out the postinstall syntax. I'm having difficulty finding the filter for any of the regular Gradle 'things' (i.e., FileTree) that operate on contents of files rather than names of files. How would I populate filesContainingDocs and filesContainingHostname - something along the lines of:
filesContainingDocs = FileTree('src/main/resources', { contents.matches('#doc') })
filesContainingHostname = FileTree('src/main/resources', { contents.matches('#(HOSTNAME|HOSTFQDN)#') })
While the post-process script could simply do the grep, the several RPMs in our product overlay each other and each RPM should only post-process the files it provides, so a general grep over the final installed folder is not workable - it would catch files provided by other RPMs. It seems to me that I ought to be able to, at build time, produce the correct static list of files from the bigger set of source files that comprise the given RPM's project.
It doesn't have to be FileTree - running a command like findstr /s /m /c:"#doc" src\main\resources\*.conf (alas, the build platform is Windows) produces the answer in stdout but I'm not sure how to get that result into an object Gradle can use to expand the result. (I also suspect there is a 'more Gradle way' to do this.)
The set of files, and the contents of those files, is generally fairly small.
I'm having difficulty finding the filter for any of the regular Gradle 'things' (i.e., FileTree) that operate on contents of files rather than names of files.
You can apply any filter you can imagine on a Gradle file tree, in the end it is just Groovy (or Kotlin) code running in the JVM. Each Gradle FileTree is nothing more than a (lazily evaluated) collection of Java File objects. To filter those File objects, you can read their content, e.g. in the same way you would read them in Java. Groovy even provides a JDK enhancement for the Java class File that includes the simple method getText() for this purpose. Now you can easily filter for files that contain a certain string:
filesContainingDocs = fileTree('src/main/resources').filter { file ->
    file.text.contains('#doc')
}
Using Groovy, you can call getters like .getText() in the same way as accessing fields (.text in this case).
If a simple contains check is not enough, the Groovy JDK enhancements even provide the method matches(Pattern pattern) on CharSequence/String instances to perform a regular expression check:
filesContainingDocs = fileTree('src/main/resources').filter { file ->
    file.text.replace('\r\n', '\n').matches('.*some regex.*')
}

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way the input reader fails to read the text from the gz files and instead reads binary gibberish.
So, I would like to know if I can somehow set it to read the text, or alternatively access the file name when using sc.textFile(...).
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {
  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with a WholeFileInputFormat (not built into Hadoop, but found all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class, possibly extending WholeFileInputFormat, to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZIPInputStream.
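If you go the manual-decompression route, a minimal Java sketch might look like the following (assuming an existing JavaSparkContext sc, that binaryFiles is available in your Spark version, and a placeholder input path):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.input.PortableDataStream;

// Read each .gz file as raw bytes, keeping the file name as the key.
JavaPairRDD<String, PortableDataStream> rawFiles = sc.binaryFiles("hdfs:///path/to/folder");

// Decompress each file's bytes with GZIPInputStream and rebuild the text content.
JavaPairRDD<String, String> filesAndContent = rawFiles.mapValues(stream -> {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new GZIPInputStream(stream.open()), StandardCharsets.UTF_8))) {
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            content.append(line).append('\n');
        }
        return content.toString();
    }
});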
UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can get all the files like this:
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); // just uses the default configuration
FileStatus[] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) {
    paths.add(fileStatus.getPath());
}
I faced the same issue while using Spark to connect to S3.
My file was a gzipped CSV with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the code below:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile + ".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (This seems like a wrong approach, but it solved my problem.)

Hadoop read files with following name patterns

This may sound very basic but I have a folder in HDFS with 3 kinds of files.
eg:
access-02171990
s3.Log
catalina.out
I want my map/reduce job to read only the files whose names begin with access-. How do I do that programmatically, or by specifying the input directory path?
Please help.
You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access*"));
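If a glob is not flexible enough, another option is a custom PathFilter, registered via the same old mapred API used with jobConf above; a minimal sketch (the class name AccessLogFilter is made up for illustration):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: accept only paths whose names start with "access-".
public class AccessLogFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return path.getName().startsWith("access-");
    }
}

In the driver, point the job at the directory and register the filter:
FileInputFormat.addInputPath(jobConf, new Path("/your/path"));
FileInputFormat.setInputPathFilter(jobConf, AccessLogFilter.class);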

Writing output to different folders hadoop

I want to write two different types of output from the same reducer, into two different directories.
I am able to use the MultipleOutputs feature in Hadoop to write to different files, but they both go to the same output folder.
I want to write each file from the same reduce to a different folder.
Is there a way for doing this?
If I try putting, for example, "hello/testfile" as the second argument, it shows an invalid-argument error, so I'm not able to write to different folders.
If the above is not possible, is it possible for the mapper to read only specific files from an input folder?
Please help me.
Thanks in advance!
Thanks for the reply. I am able to read a file successfully using the above method, but in distributed mode I am not able to do so. In the reducer I have set:
mos.getCollector("data", reporter).collect(new Text(str_key), new Text(str_val));
(using MultipleOutputs). In the JobConf I tried using
FileInputFormat.setInputPaths(conf2, "/home/users/mlakshm/opchk285/data-r-00000*");
as well as
FileInputFormat.setInputPaths(conf2, "/home/users/mlakshm/opchk285/data*");
But, it gives the following error:
cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://mentat.cluster:54310/home/users/mlakshm/opchk295/data-r-00000* matches 0 files
Question 1: Writing output files to different directories - you can do it using the following approaches:
1. Using MultipleOutputs class:
It's great that you are able to create multiple named output files using MultipleOutputs. As you know, you need to add this in your driver code:
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);
The API provides two overloaded write methods. The basic one writes to the named output inside the job's output directory:
multipleOutputs.write("OutputFileName", new Text(Key), new Text(Value));
Now, to write the output file to separate output directories, you need to use an overloaded write method with an extra parameter for the base output path.
multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);
Please remember to change the baseOutputPath in each of your write calls.
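For reference, a minimal reducer sketch of this approach might look like the following (the class name RoutingReducer and the key-based routing rule are made up for illustration; "OutputFileName" matches the named output registered in the driver above):
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RoutingReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // baseOutputPath is relative to the job's output directory, so records
            // end up under <output>/set1/ or <output>/set2/ depending on the key.
            String baseOutputPath = key.toString().startsWith("A") ? "set1/part" : "set2/part";
            multipleOutputs.write("OutputFileName", key, value, baseOutputPath);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}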
2. Rename/Move the file in driver class:
This is probably the easiest hack to write output to multiple directories. Use MultipleOutputs and write all the output files to a single output directory, but make the file names different for each category.
Assume that you want to create 3 different sets of output files, the first step is to register named output files in the driver:
MultipleOutputs.addNamedOutput(job, "set1", OutputFormatClass, keyClass, valueClass);
MultipleOutputs.addNamedOutput(job, "set2", OutputFormatClass, keyClass, valueClass);
MultipleOutputs.addNamedOutput(job, "set3", OutputFormatClass, keyClass, valueClass);
Also, create the different output directories or the directory structure you want in the driver code, along with the actual output directory:
Path set1Path = new Path("/hdfsRoot/outputs/set1");
Path set2Path = new Path("/hdfsRoot/outputs/set2");
Path set3Path = new Path("/hdfsRoot/outputs/set3");
The final important step is to rename the output files based on their names. If the job is successful:
FileSystem fileSystem = FileSystem.get(new Configuration());
if (jobStatus == 0) {
    // Get the output files from the actual output path
    FileStatus outputfs[] = fileSystem.listStatus(outputPath);
    // Iterate over all the files in the output path
    for (int fileCounter = 0; fileCounter < outputfs.length; fileCounter++) {
        // Based on each file name, rename the path.
        if (outputfs[fileCounter].getPath().getName().contains("set1")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set1Path + "/" + anyNewFileName));
        } else if (outputfs[fileCounter].getPath().getName().contains("set2")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set2Path + "/" + anyNewFileName));
        } else if (outputfs[fileCounter].getPath().getName().contains("set3")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set3Path + "/" + anyNewFileName));
        }
    }
}
Note: This will not add any significant overhead to the job because we are only moving files from one directory to another. Choosing a particular approach depends on the nature of your implementation.
In summary, this approach basically writes all the output files using different names to the same output directory and when the job is successfully completed, we rename the base output path and move files to different output directories.
Question 2: Reading specific files from an input folder(s):
You can definitely read specific input files from a directory using MultipleInputs class.
Based on your input path/file names you can pass the input files to the corresponding Mapper implementation.
Case 1: If all the input files ARE IN a single directory:
FileStatus inputfs[] = fileSystem.listStatus(inputPath);
for (int fileCounter = 0; fileCounter < inputfs.length; fileCounter++) {
    if (inputfs[fileCounter].getPath().getName().contains("set1")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set1Mapper.class);
    } else if (inputfs[fileCounter].getPath().getName().contains("set2")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set2Mapper.class);
    } else if (inputfs[fileCounter].getPath().getName().contains("set3")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set3Mapper.class);
    }
}
Case 2: If all the input files ARE NOT IN a single directory:
We can basically use the same approach above even if the input files are in different directories. Iterate over the base input path and check the file path name for a matching criteria.
Or, if the files are in completely different locations, the simplest way is to add each input path individually:
MultipleInputs.addInputPath(job, Set1_Path, TextInputFormat.class, Set1Mapper.class);
MultipleInputs.addInputPath(job, Set2_Path, TextInputFormat.class, Set2Mapper.class);
MultipleInputs.addInputPath(job, Set3_Path, TextInputFormat.class, Set3Mapper.class);
Hope this helps! Thank you.
Copy the MultipleOutputs code into your code base and loosen the restriction on allowable characters. I can't see any valid reason for the restrictions anyway.
Yes, you can specify that an input format only processes certain files:
FileInputFormat.setInputPaths(job, "/path/to/folder/testfile*");
If you do amend the code, remember that the _SUCCESS file should be written to both folders upon successful job completion - while this isn't a requirement, it is a mechanism by which someone can determine whether the output in that folder is complete and not 'truncated' because of an error.
Yes, you can do this. All you need to do is generate the file name for a particular key/value pair coming out of the reducer.
If you override the right method, you can return the file name depending on what key/value pair you get, and so on. Here is a link that shows you how to do that:
https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
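The method in question is generateFileNameForKeyValue on the old-API MultipleTextOutputFormat. A minimal sketch (the class name KeyBasedTextOutputFormat and the key-to-directory rule are made up for illustration):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Records with different keys land in different sub-directories of the output path.
public class KeyBasedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // e.g. key "hello" and default part name "part-00000" -> "hello/part-00000"
        return key.toString() + "/" + name;
    }
}

In the driver (old mapred API): jobConf.setOutputFormat(KeyBasedTextOutputFormat.class);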

Ruby - Search and collect files in all directories

I'm trying to search for a certain file type within all directories on my unix system using a ruby script. I understand the following code will search all files ending with .pdf within the current directory:
my_pdfs = Dir['*pdf']
As well as:
my_pdfs = Dir.glob('*.pdf').each do |f|
puts f
end
But how about searching all directories and sub-directories for files with the .pdf extension?
Check out the Find module:
http://www.ruby-doc.org/stdlib-1.9.3/libdoc/find/rdoc/Find.html
Using Dir.glob is less than ideal, since globbing doesn't handle recursion nearly as well as something like Find.
Also, if you're on a *nix box, try using the find command. It's pretty amazingly useful for one-liners.
Maybe something like this?
pdfs = Dir['/**/*.pdf']
I'm not using Linux right now, so I don't know if that will work. The ** syntax implies recursive listing.
