Hadoop distributed cache archive unarchiving in working directory - hadoop

I’m sending an archive to the distributed cache via -Dmapred.cache.archives=hdfs://host:port/path/archive.zip#foldername -Dmapred.create.symlink=yes and it creates a new folder in the working directory and unarchives the files there. The problem is I need those files directly in the working directory, and I’ve already tried using . and ./ as the folder name as well as sending an empty one. Any ideas on how to solve this other than moving the files explicitly in my Java code?

What's the specific need for the files to be in the working directory (so I can understand, and suggest some alternatives)?
Anyway, it looks like archives in the distributed cache will always be unpacked into a directory, so I don't think you can solve this using archives. However, depending on the number of files you wish to place in the working directory, you can use files in the DistributedCache instead.
For example, using the GenericOptionsParser parameters you can specify individual files and folders to include, which are then available in the working directory:
public static class DistCacheMapper extends
        Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    public void run(Context context) throws IOException,
            InterruptedException {
        Configuration conf = context.getConfiguration();
        System.err.println("Local Files:");
        // recursively list the contents of the task's working directory
        listFiles(new File("."), "");
    }

    private void listFiles(File dir, String ident) {
        for (File f : dir.listFiles()) {
            System.out.println(ident + (f.isDirectory() ? "d" : "-") + "\t"
                    + f.getName());
            if (f.isDirectory()) {
                listFiles(f, ident + " ");
            }
        }
    }
}
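If it helps to see the whole picture, here's a minimal sketch of a driver that could run that mapper. This is not the answerer's actual DistCacheJob; the class name, job name, and argument handling are made up for illustration. The key point is that the job goes through ToolRunner, which is what makes the GenericOptionsParser options like -files and -archives work:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DistCacheDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // map-only job whose only purpose is to dump the task working directory
        Job job = new Job(getConf(), "dist-cache-listing"); // Job.getInstance(...) on newer Hadoop
        job.setJarByClass(DistCacheDriver.class);
        job.setMapperClass(DistCacheMapper.class); // the static nested mapper shown above (adjust the reference to its enclosing class)
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // any small input file will do
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner hands the arguments to GenericOptionsParser first, which strips and
        // applies the generic options (-files, -archives, -D...) before run() is called
        System.exit(ToolRunner.run(new Configuration(), new DistCacheDriver(), args));
    }
}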
For example, running hadoop jar myjar.jar -files pom.xml,.project,.classpath,src dummy.txt gives the following on stderr (where you can see it has taken the src folder):
- .classpath
- .project
d tmp
- pom.xml
d src
 d test
  d resources
  d java
 d main
  d resources
  d java
   d csw
    d sandbox
     - DistCacheJob.java
     - .DistCacheJob.java.crc
- job.jar
- .job.jar.crc
So the long and the short of it is: you're going to have to list all the files you want in the working directory as Distributed Cache files; subdirectories can be shipped either as archives or as files too.
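If you'd rather not pass everything on the command line, the same thing can be done from code. Here's a minimal sketch assuming the classic org.apache.hadoop.filecache.DistributedCache API; the HDFS paths and fragment names are illustrative:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheFilesSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // symlink each cached file into the task's working directory
        DistributedCache.createSymlink(conf);

        // add individual files (not an archive) so they land directly in the working
        // directory under the name given after the '#' fragment; paths are illustrative
        DistributedCache.addCacheFile(new URI("hdfs://host:port/path/one.txt#one.txt"), conf);
        DistributedCache.addCacheFile(new URI("hdfs://host:port/path/two.txt#two.txt"), conf);

        // ... then submit the job with this Configuration
    }
}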

Related

relative path to folder won't work when jpackaged java

My project is here: https://github.com/Potat-OS1/project_thingo and I started the project from a template.
The champPortrait section is where I'm having my problem. When I run it in the IDE the path works; as I understand it, it's a path relative to the build folder. Does it not use this path when it's packaged? What path should I be using?
I can getResourceAsStream the contents of the folder, but in this particular case I need the folder itself so I can put the names of all the files inside the folder into a list.
When the application is bundled with jpackage, all classes and resources are packaged in a jar file. So what you are trying to do is read all the entries in a particular package from a jar file. There is no nice way to do that.
Since the contents of the jar file can't be changed after deployment, the easiest solution is probably just to create a text resource listing the files. You just have to make sure that you update the text file at development time if you change the contents of that resource folder.
So, e.g., if in your source hierarchy you have
resources
|
|--- images
|
|--- img1.png
|--- img2.png
|--- img3.png
I would just create a text file resources/images/imageList.txt with the content
img1.png
img2.png
img3.png
Then in code you can do:
List<Image> images = new ArrayList<>();
String imageBase = "/images/";
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(getClass().getResourceAsStream("/images/imageList.txt")))) {
    br.lines().forEach(imageName -> {
        URL imageUrl = getClass().getResource(imageBase + imageName);
        Image image = new Image(imageUrl.toExternalForm());
        images.add(image);
    });
} catch (Exception exc) {
    exc.printStackTrace();
}
As mentioned, you will need to keep the text file in sync with the contents of the resource folder before building. If you're feeling ambitious, you could look into automating this as part of your build with your build tool (Gradle/Maven etc.).
The Java resource API does not provide a supported way to list the resources in a given package. If you aren't using a framework that provides its own solution (e.g., Spring), then probably the easiest and sufficiently robust solution is to do what @James_D demonstrates: create another resource that simply lists the names of the resources in the current package. Then you can read that resource to get the names of the other resources.
For a relatively small number of resources, where the number doesn't change often, creating the "name list" resource manually is probably sufficient. But you've tagged this question with gradle, so another option is to have the build tool create these "name list" resources for you. This can be done in a plugin, or you could do it directly in your build script.
Example
Here's an example of creating the "plugin" in your build script.
Sources
Source structure:
\---src
\---main
+---java
| \---sample
| Main.java
|
\---resources
\---sample
bar.txt
baz.txt
foo.txt
qux.txt
Where each *.txt file in src/main/resources/sample contains a single line which says Hello from <filename>!.
build.gradle.kts (Kotlin DSL):
plugins {
    application // implicitly applies the Java Plugin as well
}

application {
    mainClass.set("sample.Main")
}

// gets the 'processResources' task and augments it to add the desired
// behavior. This task processes resources in the "main" source set.
tasks.processResources {
    // 'doLast' means everything inside happens at the end, or at least
    // near the end, of this task
    doLast {
        /*
         * Get the "main" source set. By default, this essentially
         * represents the files under 'src/main'. There is another
         * source set added by the Java Plugin named "test", which
         * represents the files under 'src/test'.
         */
        val main: SourceSet by sourceSets
        /*
         * Gets *all* the source directories in the main source set
         * used for resources. By default, this will only include
         * 'src/main/resources'. If you add other resource directories
         * to the main source set, then those will be included here as well.
         */
        val source: Set<File> = main.resources.srcDirs
        /*
         * Gets the build output directory for the resources in the
         * main source set. By default, this will be the
         * 'build/resources/main' directory. The '!!' bit at the end
         * of this line of code is a Kotlin language thing, which
         * basically says "I know this won't be null, but fail if it is".
         */
        val target: File = main.output.resourcesDir!!
        /*
         * This calls the 'createResourceListFiles' function for every
         * resource directory in 'source'.
         */
        for (root in source) {
            // the last argument is 'root' because the first package is
            // the so-called "unnamed/default package", which are resources
            // under the "root"
            createResourceListFiles(root, target, root)
        }
    }
}
/**
 * Recursively traverses the package hierarchy of the given resource root and creates
 * a `resource-list.txt` resource in each package containing the absolute names of every
 * resource in that package, with each name on its own line. If a package does not have
 * any resources, then no `resource-list.txt` resource is created for that package.
 *
 * The `resourceRoot` and `targetDir` arguments will never change. Only the `packageDir`
 * argument changes for each recursive call.
 *
 * @param resourceRoot the root of the resources
 * @param targetDir the output directory for resources; this is where the
 *                  `resource-list.txt` resource will be created
 * @param packageDir the current package directory
 */
fun createResourceListFiles(resourceRoot: File, targetDir: File, packageDir: File) {
    // get all non-directories in the current package; these are the resources
    val resourceFiles: List<File> = listFiles(packageDir, File::isFile)
    // only create a resource-list.txt file if there are resources in this package
    if (resourceFiles.isNotEmpty()) {
        /*
         * Determine the output file path for the 'resource-list.txt' file. This is
         * computed by getting the path of the current package directory relative
         * to the resource root. And then resolving that relative path against
         * the output directory, and finally resolving the filename 'resource-list.txt'
         * against that directory.
         *
         * So, if 'resourceRoot' was 'src/main/resources', 'targetDir' was 'build/resources/main',
         * and 'packageDir' was 'src/main/resources/sample', then 'targetFile' will be resolved
         * to 'build/resources/main/sample/resource-list.txt'.
         */
        val targetFile: File = targetDir.resolve(packageDir.relativeTo(resourceRoot)).resolve("resource-list.txt")
        // opens a BufferedWriter to 'targetFile' and will close it when
        // done (that's what 'use' does; it's like try-with-resources in Java)
        targetFile.bufferedWriter().use { writer ->
            // prints the absolute name of each resource on their own lines
            for (file in resourceFiles) {
                /*
                 * Prepends a forward slash to make the name absolute. Gets the rest of the name
                 * by getting the relative path of the resource file from the resource root. Replaces
                 * any backslashes with forward slashes because Java's resource-lookup API uses forward
                 * slashes (needed on e.g., Windows, which uses backslashes for filename separators).
                 *
                 * So, a resource at 'src/main/resources/sample/foo.txt' would result in
                 * '/sample/foo.txt' being written to the 'resource-list.txt' file.
                 */
                writer.append("/${file.toRelativeString(resourceRoot).replace("\\", "/")}")
                writer.newLine()
            }
        }
    }
    /*
     * Gets all the child directories of the current package directory, as these
     * are the "sub packages", and recursively calls this function for each
     * sub package.
     */
    for (packageSubDir in listFiles(packageDir, File::isDirectory)) {
        createResourceListFiles(resourceRoot, targetDir, packageSubDir)
    }
}

/**
 * @param directory the directory to list the children of
 * @param predicate the filter function; only children for which this function
 *                  returns `true` are included in the list
 * @return a possibly empty list of files which are the children of `dir`
 */
fun listFiles(directory: File, predicate: (File) -> Boolean): List<File> =
    directory.listFiles()?.filter(predicate) ?: emptyList()
Main.java:
package sample;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

public class Main {

    public static void main(String[] args) throws IOException {
        for (var resource : resources()) {
            System.out.printf("Contents of '%s':%n", resource);
            try (var reader = openResource(resource)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.printf(" %s%n", line);
                }
                System.out.println();
            }
        }
    }

    public static List<String> resources() throws IOException {
        try (var input = openResource("/sample/resource-list.txt")) {
            return input.lines().toList();
        }
    }

    public static BufferedReader openResource(String name) throws IOException {
        var input = Main.class.getResourceAsStream(name);
        return new BufferedReader(new InputStreamReader(input));
    }
}
Output
After the processResources task runs, you'll have the following /sample/resource-list.txt file in your build output:
/sample/bar.txt
/sample/baz.txt
/sample/foo.txt
/sample/qux.txt
And running the application (./gradlew clean run) will give the following output:
> Task :run
Contents of '/sample/bar.txt':
Hello from bar.txt!
Contents of '/sample/baz.txt':
Hello from baz.txt!
Contents of '/sample/foo.txt':
Hello from foo.txt!
Contents of '/sample/qux.txt':
Hello from qux.txt!
BUILD SUCCESSFUL in 2s
4 actionable tasks: 4 executed
Notes
Note that the resource-list.txt resource(s) will only exist in your build output/deployment. It does not exist in your source directories. Also, the way I implemented this, it will only list resources in your source directories. Any resources generated by, for example, an annotation processor will not be included. You could, of course, modify the code to fix that if it becomes an issue for you.
The above will only run for production resources, not test resources (or any other source set). You can modify the code to change this as needed.
If a package does not have any resources, then the above will not create a resource-list.txt resource for that package.
Each name listed in resource-list.txt is the absolute name. It has a leading /. This will work with Class#getResource[AsStream](String), but I believe to call the same methods on ClassLoader (if you need to for some reason) you'll have to remove the leading / (in code).
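For illustration, a short sketch of that last point; ResourceNameDemo is a made-up class, and the resource name is the one from the example above:
import java.io.InputStream;

public class ResourceNameDemo {
    public static void main(String[] args) {
        // Class#getResourceAsStream accepts the absolute name with the leading '/'
        InputStream viaClass = Main.class.getResourceAsStream("/sample/resource-list.txt");

        // ClassLoader#getResourceAsStream expects the same name without the leading '/'
        String name = "/sample/resource-list.txt";
        InputStream viaLoader = Main.class.getClassLoader()
                .getResourceAsStream(name.startsWith("/") ? name.substring(1) : name);

        System.out.println("via Class: " + (viaClass != null));
        System.out.println("via ClassLoader: " + (viaLoader != null));
    }
}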
Finally, I wrote the Kotlin code in the build script rather quickly. There may be more efficient, or at least less verbose, ways to do the same thing. And if you want to apply this to multiple projects, or even multiple subprojects of the same project, you can create a plugin. Though it may be that some plugin already exists for this, if you're willing to search for one.

Unable to add UDF in hive

I have to add the following UDF in Hive:
package com.hadoopbook.hive;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Strip extends UDF {

    private Text result = new Text();

    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
This is an example from the book "Hadoop: The Definitive Guide".
I created the .class file of the above Java file using the following command:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ javac Strip.java
Then I created the jar file using the following command:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ jar cvf Strip.jar Strip Strip.class
Strip : no such file or directory
added manifest
adding: Strip.class(in = 915) (out= 457)(deflated 50%)
I added the generated jar file to an HDFS directory with:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ hadoop dfs -copyFromLocal /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar /user/hduser/input
I tried to create a UDF using the following command:
hive> create function strip as 'com.hadoopbook.hive.Strip' using jar 'hdfs://localhost/user/hduser/input/Strip.jar';
But I got the following error:
converting to local hdfs://localhost/user/hduser/input/Strip.jar Added
[/tmp/hduser_resources/Strip.jar] to class path Added resources:
[hdfs://localhost/user/hduser/input/Strip.jar] Failed to register
default.strip using class com.hadoopbook.hive.Strip FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
I also tried to create a temporary function.
So I first added the jar file to Hive using:
hive> add jar hdfs://localhost/user/hduser/input/Strip.jar;
converting to local hdfs://localhost/user/hduser/input/Strip.jar
Added [/tmp/hduser_resources/Strip.jar] to class path
Added resources: [hdfs://localhost/user/hduser/input/Strip.jar]
Then I tried to add the temporary function:
hive> create temporary function strip as 'com.hadoopbook.hive.Strip';
But I got the following error :
FAILED: Class com.hadoopbook.hive.Strip not found FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
The jar file was successfully created and added to Hive, yet it still says the class was not found.
Can anyone please tell me what is wrong with it?
Yes, using an IDE like Eclipse is easier than building the jar from the CLI.
To create the jar file from the command line you have to follow these steps:
First make the project dirs under the project dir ch17-hive:
bin - will store the .class (Strip.class) files
lib - will store the required external jars
target - will store the jars that you will create
[ch17-hive]$ mkdir bin lib target
[ch17-hive]$ ls
bin lib src target
Copy the required external jars to the ch17-hive/lib dir:
[ch17-hive]$ cp /usr/lib/hive/lib/hive-exec.jar lib/.
[ch17-hive]$ cp /usr/lib/hadoop/hadoop-common.jar lib/.
Now compile the Java source from the directory where your class com.hadoopbook.hive.Strip resides; in your case that is ch17-hive/src/main/java:
[java]$ pwd
/home/cloudera/ch17-hive/src/main/java
[java]$ javac -d ../../../bin -classpath ../../../lib/hive-exec.jar:../../../lib/hadoop-common.jar com/hadoopbook/hive/Strip.java
Create the manifest file as:
[ch17-hive]$ cat MANIFEST.MF
Main-Class: com.hadoopbook.hive.Strip
Class-Path: lib/hadoop-common.jar lib/hive-exec.jar
Create the jar as:
[ch17-hive]$ jar cvfm target/strip.jar MANIFEST.MF -C bin .
added manifest
adding: com/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/hive/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/hive/Strip.class(in = 915) (out= 456)(deflated 50%)
Now your project structure should look like:
[ch17-hive]$ ls *
MANIFEST.MF
bin:
com
lib:
hadoop-common.jar hive-exec.jar
src:
main
target:
strip.jar
Copy the created jar to HDFS:
hadoop fs -put /home/cloudera/ch17-hive/target/strip.jar /user/cloudera/.
Use it in Hive:
hive> create function strip_new as 'com.hadoopbook.hive.Strip' using jar 'hdfs:/user/cloudera/strip.jar';
converting to local hdfs:/user/cloudera/strip.jar
Added [/tmp/05a13d23-8051-431f-a354-793abac66160_resources/strip.jar] to class path
Added resources: [hdfs:/user/cloudera/strip.jar]
OK
Time taken: 0.071 seconds
hive>

Gradle copying and rename files

I have to write a Gradle task that copies files. Files are stored at tests/[Name]/test.txt, and for each Name I want to create a numbered directory /tested/test00/, /tested/test01/, etc., and each directory should contain one file (test.txt from the source folder, renamed to test00, test01, etc.).
I have the code, but the behavior is strange...
It creates the correct directories /tested/test00 etc., but all files in every directory have the same name: test06. So the number in the directory name is correct, but in the file name it isn't.
My code is:
int copyTaskIterator = 0
int testIterator = 0
...
sources.each { mySource ->
    task "myCopyTask$copyTaskIterator"(type: Copy)
    nameSuffix = String.format("%02d", testIterator)
    fromPath = 'tests/' + mySource + '/test.txt'
    toPath = "tested/test" + nameSuffix
    tasks."myCopyTask$copyTaskIterator".from fromPath
    tasks."myCopyTask$copyTaskIterator".into toPath
    tasks."myCopyTask$copyTaskIterator".rename { fileName ->
        fileName.replace '.txt', nameSuffix
    }
    preBuild.dependsOn tasks."myCopyTask$copyTaskIterator"
    copyTaskIterator++
    testIterator++
}
The problem is that nameSuffix is evaluated too late: the rename closure isn't called until the copy task actually executes, and by then the loop has finished, so every closure sees the final value of nameSuffix. Sadly, no documentation explains whether the closure runs at configuration time or at execution time.
Just try to use rename(java.util.regex.Pattern, java.lang.String)
tasks."myCopyTask$copyTaskIterator".rename("\\.txt", nameSuffix)

copy tree with gradle and change structure?

Can gradle alter the structure of the tree while copying?
original
mod/a/src
mod/b/src
desired
dest/mod-a/source
dest/mod-b/source
dest/mod-c/source
I'm not sure where I should create a closure and override the copy-tree logic.
I'd like to do the Gradle equivalent of Ant's globmapper functionality:
<property name="from.dir" location=".."/>
<property name="to.dir" location="dbutil"/>
<copy>
    <fileset dir="${from.dir}" ... />
    <globmapper from="${from.dir}/*/db" to="${to.dir}"/>
</copy>
Thanks
Peter
When changing the file name, rename seems a good approach. When changing the path, you can hook into eachFile and modify the destination path.
This works pretty well.
copy {
    from("${sourceDir}") {
        include 'modules/**/**'
    }
    into(destDir)
    eachFile { details ->
        // Top Level Modules
        def targetPath = rawPathToModulesPath(details.path)
        details.path = targetPath
    }
}
....
def rawPathToModulesPath(def path) {
    // Standard case modules/name/src -> module-name/src
    def modified = path.replaceAll('modules/([^/]+)/.*src/(java/)?(.*)', {"module-${it[1]}/src/${it[3]}"})
    return modified
}
Please see the sample below. Gradle 4.3 does not have rename/move methods for this, so we do the renaming on the fly.
What happens:
Load the file tree into memory. I used a zip file from the dependencies in my example.
Filter the items which are in the target folder.
All resulting items will have the same prefix: if we filter files from directory "A/B/C/", then all files will be like "A/B/C/file.txt" or "A/B/C/D/file.txt", i.e. all of them start with the same prefix.
In the last statement, eachFile, we change the final name by cutting off the directory prefix (e.g. we cut "A/B/C").
Important: use the task type Copy, which has optimizations for incremental builds. Gradle will not copy files if all of the items below are true:
The input (in my case, all dependencies in the "nativeDependenciesScope" configuration) is the same as in the previous build
Your function returned the same items as in the previous build
The destination folder has the same file hashes as in the previous build
task copyNativeDependencies(type: Copy) {
    includeEmptyDirs = false

    def subfolderToUse = "win32Subfolder"
    def nativePack = configurations.nativeDependenciesScope.singleFile // result - single dependency file
    def nativeFiles = zipTree(nativePack).matching { include subfolderToUse + "/*" } // result - filtered file tree

    from nativeFiles
    into 'build/native_libs'
    eachFile {
        print(it.path)
        // we filtered this folder above, e.g. all files will start from the same folder name
        it.path = it.path.replaceFirst("$subfolderToUse/", "")
    }
}

// and don't forget to link this task to something mandatory
test.dependsOn(copyNativeDependencies)
run.dependsOn(copyNativeDependencies)
The following works, but is there a more gradle-ish way to do this?
ant.copy(todir: destDir) {
    fileset(dir: "${srcDir}/module", includes: '**/src/**')
    regexpmapper(from: '^(.*)/src/(.*)$', to: /module-\1\/src\/\2/)
}

how do we compare a localfile and hdfs file for consistency

public String getDirs() throws IOException {
    fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/private/tmp/as"), new Path("/test"));

    LocalFileSystem lfs = LocalFileSystem.getLocal(conf);
    // System.out.println(new LocalFileSystem().ge (conf.getLocalPath("/private/tmp/as")));
    System.out.println("Local Path : " + lfs.getFileChecksum(new Path("/private/tmp/as")));
    System.out.println("HDFS PATH : " + fs.getFileChecksum(new Path("/test/as")));
    return "done";
}
Output is
Local Path : null
HDFS PATH : MD5-of-0MD5-of-512CRC32:a575c5e99b2e08605dc7c6723889519c
Not sure why the checksum is null for local file
Hadoop relies on the FileSystem to have a checksum ready to match against. It does not generate one on-the-fly.
By default, the LocalFileSystem (or the specific implementation used for file:// paths) does not create/store checksums for all files created through it. You can toggle this behavior via the FileSystem#setWriteChecksum API call, and subsequently retrieving the checksum post-write will then work.
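To make that concrete, here's a minimal sketch of what the answer describes, reusing the paths from the question (the class name and the copied path are made up for illustration; whether getFileChecksum returns a value for local files still depends on your Hadoop version and on the file having been written through the checksummed local file system):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class LocalChecksumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        LocalFileSystem lfs = FileSystem.getLocal(conf);
        // ask the local FileSystem to record a .crc sidecar for files written through it
        lfs.setWriteChecksum(true);

        // re-write the file through the checksummed local file system so a CRC exists
        Path original = new Path("/private/tmp/as");
        Path rewritten = new Path("/private/tmp/as-with-crc");
        lfs.copyFromLocalFile(original, rewritten);

        FileChecksum localChecksum = lfs.getFileChecksum(rewritten);
        System.out.println("Local checksum: " + localChecksum);
    }
}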
