Unable to add UDF in hive - hadoop

I have to add the following UDF in Hive:
package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
    private Text result = new Text();

    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
This is an example from the book "Hadoop: The Definitive Guide".
I compiled the Java file to a .class file using the following command:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ javac Strip.java
Then I created the jar file using the following command:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ jar cvf Strip.jar Strip Strip.class
Strip : no such file or directory
added manifest
adding: Strip.class(in = 915) (out= 457)(deflated 50%)
I added the generated jar file to an HDFS directory with:
hduser@nb-VPCEH35EN:~/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive$ hadoop dfs -copyFromLocal /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar /user/hduser/input
I tried to create a UDF using the following command:
hive> create function strip as 'com.hadoopbook.hive.Strip' using jar 'hdfs://localhost/user/hduser/input/Strip.jar';
But I got the following error:
converting to local hdfs://localhost/user/hduser/input/Strip.jar
Added [/tmp/hduser_resources/Strip.jar] to class path
Added resources: [hdfs://localhost/user/hduser/input/Strip.jar]
Failed to register default.strip using class com.hadoopbook.hive.Strip
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
I also tried to create a temporary function.
So I first added the jar file to Hive using:
hive> add jar hdfs://localhost/user/hduser/input/Strip.jar;
converting to local hdfs://localhost/user/hduser/input/Strip.jar
Added [/tmp/hduser_resources/Strip.jar] to class path
Added resources: [hdfs://localhost/user/hduser/input/Strip.jar]
Then I tried to add the temporary function:
hive> create temporary function strip as 'com.hadoopbook.hive.Strip';
But I got the following error:
FAILED: Class com.hadoopbook.hive.Strip not found
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
The jar file was successfully created and added to Hive. Still it says the class was not found.
Can anyone please tell me what is wrong?

Yes, using an IDE like Eclipse is easier than making the jar from the CLI. To create the jar file from the command line, follow these steps:
First, make the project dirs under the project dir ch17-hive:
bin - will store the .class files (Strip.class)
lib - will store required external jars
target - will store the jars that you will create
[ch17-hive]$ mkdir bin lib target
[ch17-hive]$ ls
bin lib src target
Copy the required external jars to the ch17-hive/lib dir:
[ch17-hive]$ cp /usr/lib/hive/lib/hive-exec.jar lib/.
[ch17-hive]$ cp /usr/lib/hadoop/hadoop-common.jar lib/.
Now compile the Java source from the directory where the package for com.hadoopbook.hive.Strip starts; in your case that is ch17-hive/src/main/java:
[java]$ pwd
/home/cloudera/ch17-hive/src/main/java
[java]$ javac -d ../../../bin -classpath ../../../lib/hive-exec.jar:../../../lib/hadoop-common.jar com/hadoopbook/hive/Strip.java
Create a manifest file as:
[ch17-hive]$ cat MENIFEST.MF
Main-Class: com.hadoopbook.hive.Strip
Class-Path: lib/hadoop-common.jar lib/hive-exec.jar
Create the jar as:
[ch17-hive]$ jar cvfm target/strip.jar MENIFEST.MF -C bin .
added manifest
adding: com/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/hive/(in = 0) (out= 0)(stored 0%)
adding: com/hadoopbook/hive/Strip.class(in = 915) (out= 456)(deflated 50%)
Now your project structure should look like:
[ch17-hive]$ ls *
MENIFEST.MF
bin:
com
lib:
hadoop-common.jar hive-exec.jar
src:
main
target:
strip.jar
Copy the created jar to HDFS:
hadoop fs -put /home/cloudera/ch17-hive/target/strip.jar /user/cloudera/.
Use it in Hive:
hive> create function strip_new as 'com.hadoopbook.hive.Strip' using jar 'hdfs:/user/cloudera/strip.jar';
converting to local hdfs:/user/cloudera/strip.jar
Added [/tmp/05a13d23-8051-431f-a354-793abac66160_resources/strip.jar] to class path
Added resources: [hdfs:/user/cloudera/strip.jar]
OK
Time taken: 0.071 seconds
hive>
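With the function registered, a quick sanity check from the Hive shell (the table name below is only a placeholder; recent Hive versions also accept a SELECT without a FROM clause):
hive> select strip_new('  bee  ') from some_table limit 1;
bee
hive> select strip_new('banana', 'ab') from some_table limit 1;
nan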


relative path to folder won't work when jpackaged java

So my project is here: https://github.com/Potat-OS1/project_thingo and I started the project from a template.
Under the champPortrait section is where I'm having my problem. When I run it in the IDE the path works; as I understand it, it is the relative path of the build folder. Does it not use this path when it's packaged? What path should I be using?
I can getResourceAsStream the contents of the folder, but in this particular case I need the folder itself, so I can put the names of all the files inside the folder into a list.
When the application is bundled with jpackage, all classes and resources are packaged in a jar file. So what you are trying to do is read all the entries in a particular package from a jar file. There is no nice way to do that.
Since the contents of the jar file can't be changed after deployment, the easiest solution is probably just to create a text resource listing the files. You just have to make sure that you update the text file at development time if you change the contents of that resource.
So, e.g., if in your source hierarchy you have
resources
|
|--- images
|
|--- img1.png
|--- img2.png
|--- img3.png
I would just create a text file resources/images/imageList.txt with the content
img1.png
img2.png
img3.png
Then in code you can do:
List<Image> images = new ArrayList<>();
String imageBase = "/images/";
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(getClass().getResourceAsStream("/images/imageList.txt")))) {
    br.lines().forEach(imageName -> {
        URL imageUrl = getClass().getResource(imageBase + imageName);
        Image image = new Image(imageUrl.toExternalForm());
        images.add(image);
    });
} catch (Exception exc) {
    exc.printStackTrace();
}
As mentioned, you will need to keep the text file in sync with the contents of the resource folder before building. If you're feeling ambitious, you could look into automating this as part of your build with your build tool (gradle/Maven etc.).
The Java resource API does not provide a supported way to list the resources in a given package. If you aren't using a framework that provides its own solution (e.g., Spring), then probably the easiest and sufficiently robust solution is to do what @James_D demonstrates: create another resource that simply lists the names of the resources in the current package. Then you can read that resource to get the names of the other resources.
For a relatively small number of resources, where the number doesn't change often, creating the "name list" resource manually is probably sufficient. But you've tagged this question with gradle, so another option is to have the build tool create these "name list" resources for you. This can be done in a plugin, or you could do it directly in your build script.
Example
Here's an example of creating the "plugin" in your build script.
Sources
Source structure:
\---src
\---main
+---java
| \---sample
| Main.java
|
\---resources
\---sample
bar.txt
baz.txt
foo.txt
qux.txt
Where each *.txt file in src/main/resources/sample contains a single line which says Hello from <filename>!.
build.gradle.kts (Kotlin DSL):
plugins {
application // implicitly applies the Java Plugin as well
}
application {
mainClass.set("sample.Main")
}
// gets the 'processResources' task and augments it to add the desired
// behavior. This task processes resources in the "main" source set.
tasks.processResources {
// 'doLast' means everything inside happens at the end, or at least
// near the end, of this task
doLast {
/*
* Get the "main" source set. By default, this essentially
* represents the files under 'src/main'. There is another
* source set added by the Java Plugin named "test", which
* represents the files under 'src/test'.
*/
val main: SourceSet by sourceSets
/*
* Gets *all* the source directories in the main source set
* used for resources. By default, this will only include
* 'src/main/resources'. If you add other resource directories
* to the main source set, then those will be included here as well.
*/
val source: Set<File> = main.resources.srcDirs
/*
* Gets the build output directory for the resources in the
* main source set. By default, this will be the
* 'build/resources/main` directory. The '!!' bit at the end
* of this line of code is a Kotlin language thing, which
* basically says "I know this won't be null, but fail if it is".
*/
val target: File = main.output.resourcesDir!!
/*
* This calls the 'createResourceListFiles' function for every
* resource directory in 'source'.
*/
for (root in source) {
// the last argument is 'root' because the first package is
// the so-called "unnamed/default package", which are resources
// under the "root"
createResourceListFiles(root, target, root)
}
}
}
/**
* Recursively traverses the package hierarchy of the given resource root and creates
* a `resource-list.txt` resource in each package containing the absolute names of every
* resource in that package, with each name on its own line. If a package does not have
* any resources, then no `resource-list.txt` resource is created for that package.
*
* The `resourceRoot` and `targetDir` arguments will never change. Only the `packageDir`
* argument changes for each recursive call.
*
* @param resourceRoot the root of the resources
* @param targetDir the output directory for resources; this is where the
* `resource-list.txt` resource will be created
* @param packageDir the current package directory
*/
fun createResourceListFiles(resourceRoot: File, targetDir: File, packageDir: File) {
// get all non-directories in the current package; these are the resources
val resourceFiles: List<File> = listFiles(packageDir, File::isFile)
// only create a resource-list.txt file if there are resources in this package
if (resourceFiles.isNotEmpty()) {
/*
* Determine the output file path for the 'resource-list.txt' file. This is
* computed by getting the path of the current package directory relative
* to the resource root. And then resolving that relative path against
* the output directory, and finally resolving the filename 'resource-list.txt'
* against that directory.
*
* So, if 'resourceRoot' was 'src/main/resources', 'targetDir' was 'build/resources/main',
* and 'packageDir' was 'src/main/resources/sample', then 'targetFile' will be resolved
* to 'build/resources/main/sample/resource-list.txt'.
*/
val targetFile: File = targetDir.resolve(packageDir.relativeTo(resourceRoot)).resolve("resource-list.txt")
// opens a BufferedWriter to 'targetFile' and will close it when
// done (that's what 'use' does; it's like try-with-resources in Java)
targetFile.bufferedWriter().use { writer ->
// prints the absolute name of each resource on their own lines
for (file in resourceFiles) {
/*
* Prepends a forward slash to make the name absolute. Gets the rest of the name
* by getting the relative path of the resource file from the resource root. Replaces
* any backslashes with forward slashes because Java's resource-lookup API uses forward
* slashes (needed on e.g., Windows, which uses backslashes for filename separators).
*
* So, a resource at 'src/main/resources/sample/foo.txt' would result in
* '/sample/foo.txt' being written to the 'resource-list.txt' file.
*/
writer.append("/${file.toRelativeString(resourceRoot).replace("\\", "/")}")
writer.newLine()
}
}
}
/*
* Gets all the child directories of the current package directory, as these
* are the "sub packages", and recursively calls this function for each
* sub package.
*/
for (packageSubDir in listFiles(packageDir, File::isDirectory)) {
createResourceListFiles(resourceRoot, targetDir, packageSubDir)
}
}
/**
* @param directory the directory to list the children of
* @param predicate the filter function; only children for which this function
* returns `true` are included in the list
* @return a possibly empty list of files which are the children of `dir`
*/
fun listFiles(directory: File, predicate: (File) -> Boolean): List<File>
= directory.listFiles()?.filter(predicate) ?: emptyList()
Main.java:
package sample;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
public class Main {
public static void main(String[] args) throws IOException {
for (var resource : resources()) {
System.out.printf("Contents of '%s':%n", resource);
try (var reader = openResource(resource)) {
String line;
while ((line = reader.readLine()) != null) {
System.out.printf(" %s%n", line);
}
System.out.println();
}
}
}
public static List<String> resources() throws IOException {
try (var input = openResource("/sample/resource-list.txt")) {
return input.lines().toList();
}
}
public static BufferedReader openResource(String name) throws IOException {
var input = Main.class.getResourceAsStream(name);
return new BufferedReader(new InputStreamReader(input));
}
}
Output
After the processResources task runs, you'll have the following /sample/resource-list.txt file in your build output:
/sample/bar.txt
/sample/baz.txt
/sample/foo.txt
/sample/qux.txt
And running the application (./gradlew clean run) will give the following output:
> Task :run
Contents of '/sample/bar.txt':
Hello from bar.txt!
Contents of '/sample/baz.txt':
Hello from baz.txt!
Contents of '/sample/foo.txt':
Hello from foo.txt!
Contents of '/sample/qux.txt':
Hello from qux.txt!
BUILD SUCCESSFUL in 2s
4 actionable tasks: 4 executed
Notes
Note that the resource-list.txt resource(s) will only exist in your build output/deployment. It does not exist in your source directories. Also, the way I implemented this, it will only list resources in your source directories. Any resources generated by, for example, an annotation processor will not be included. You could, of course, modify the code to fix that if it becomes an issue for you.
The above will only run for production resources, not test resources (or any other source set). You can modify the code to change this as needed.
If a package does not have any resources, then the above will not create a resource-list.txt resource for that package.
Each name listed in resource-list.txt is the absolute name. It has a leading /. This will work with Class#getResource[AsStream](String), but I believe to call the same methods on ClassLoader (if you need to for some reason) you'll have to remove the leading / (in code).
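To illustrate that last point, a minimal sketch of the two lookup styles (reusing the Main class from above):
// Class-relative lookup: a leading slash makes the name absolute.
InputStream viaClass = Main.class.getResourceAsStream("/sample/resource-list.txt");
// ClassLoader lookup: names are always treated as absolute, so drop the leading slash.
InputStream viaLoader = Main.class.getClassLoader().getResourceAsStream("sample/resource-list.txt");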
Finally, I wrote the Kotlin code in the build script rather quickly. There may be more efficient, or at least less verbose, ways to do the same thing. And if you want to apply this to multiple projects, or even multiple subprojects of the same project, you can create a plugin. Though it may be that some plugin already exists for this, if you're willing to search for one.

How to replace in log4j2.xml with Gradle?

I want to replace a value in our log4j2.xml with Gradle during build. I found a way to do that:
task reaplaceInLogFile {
    String apiSuffix = System.properties['apiSuffix'] ?: ''
    println "In my task"
    String contents = file('src/main/resources/log4j2.xml').getText('UTF-8')
    println "File found"
    contents = contents.replaceAll("svc0022_operations", "svc0022_operations${apiSuffix}")
    new File('src/main/resources/log4j2.xml').write(contents, 'UTF-8')
}
However, this also changes the source file permanently, and I do not want that. I want to change only the log4j2.xml that is included in the build zip. I know I can use something like this:
tasks.withType(com.mulesoft.build.MuleZip) { task ->
    String apiSuffix = System.properties['apiSuffix'] ?: ''
    task.eachFile {
        println name
        if (name == 'mule-app.properties') {
            println "Expanding properties for API Version suffix: ${apiSuffix}"
            filter { String line ->
                line.startsWith("${serviceId}.api.suffix") ? "${serviceId}.api.suffix=${apiSuffix}" : line
            }
        }
    }
}
But I do not know what the type for the log4j2 file is. If there is another way to do this, I would be thankful!
We are using the Mule Gradle plugin.
The type is not the type of the log4j2 file, but the type of the task that creates the ZIP (or whatever archive your log4j2 file is packaged into). If the log4j2 file is included in the ZIP that is generated by a MuleZip task, then you can simply add another if-branch for the log4j2 file.
But actually it is probably better to just configure the concrete task that packages the log4j2 file into the archive, instead of all tasks of the same type.
Besides that, I think you should be able to use filesMatching instead of eachFile with an if.
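For illustration, a rough sketch of that combination. This assumes the Mule plugin's archive task behaves like a regular Gradle copy/archive task (so filesMatching and filter are available on it); the token being replaced is taken from your own snippet:
tasks.withType(com.mulesoft.build.MuleZip) { task ->
    String apiSuffix = System.properties['apiSuffix'] ?: ''
    task.filesMatching('**/log4j2.xml') {
        // runs while the file is copied into the archive; the source file on disk is untouched
        filter { String line ->
            line.replace('svc0022_operations', "svc0022_operations${apiSuffix}")
        }
    }
}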

read file reference storm from hdfs

Hi, I want to ask something. I've started to learn about Apache Storm. Is it possible for Storm to read a data file from HDFS?
Example: I have a txt data file in the directory /user/hadoop on HDFS. Is it possible for Storm to read that file? Thanks in advance.
I ask because when I try to run the Storm topology I get an error message saying the file does not exist. When I run it reading the file from my local storage, it is successful.
>>> Starting to create Topology ...
---> Read Class: FileReaderSpout , 1387957892009
---> Read Class: Split , 1387958291266
---> Read Class: Identity , 247_Identity_1387957902310_1
---> Read Class: SubString , 1387964607853
---> Read Class: Trim , 1387962789262
---> Read Class: SwitchCase , 1387962333010
---> Read Class: Reference , 1387969791518
File /reff/test.txt .. not exists ..
Of course! Here is an example of how to read a file from HDFS:
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// ...stuff...
FileSystem fs = FileSystem.get(URI.create("hdfs://prodserver"), new Configuration());
String hdfspath = "/apps/storm/conf/config.json";
Path path = new Path(hdfspath);
if (fs.exists(path)) {
    InputStreamReader reader = new InputStreamReader(fs.open(path));
    // do stuff with the reader
} else {
    LOG.info("Does not exist {}", hdfspath);
}
This doesn't use anything specific to storm, just the Hadoop API (hadoop-common.jar).
The error you're getting looks like it is because your file path is incorrect.
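For completeness, a minimal sketch of the "do stuff with the reader" part, reading the file line by line (plain Java I/O, nothing Storm-specific; how you emit each line from your spout is up to you):
// continuing from the snippet above; also needs: import java.io.BufferedReader;
try (BufferedReader br = new BufferedReader(reader)) {
    String line;
    while ((line = br.readLine()) != null) {
        // e.g. hand each line to your spout/bolt logic here
        System.out.println(line);
    }
}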

Gradle copying and rename files

I have to write a Gradle task that copies files. Files are stored at tests/[Name]/test.txt, and for each Name I want to create a numbered directory /tested/test00/, /tested/test01/, etc., and each directory should contain one file (test.txt from the source folder, renamed to test00, test01, etc.).
I have the code, but the behavior is strange...
It creates the correct directories (/tested/test00, etc.), but the files in every directory get the same name: test06. So the number in the directory name is correct, but in the file name it isn't.
My code is:
int copyTaskIterator = 0
int testIterator = 0
...
sources.each { mySource ->
    task "myCopyTask$copyTaskIterator"(type: Copy)
    nameSuffix = String.format("%02d", testIterator)
    fromPath = 'tests/' + mySource + '/test.txt'
    toPath = "tested/test" + nameSuffix
    tasks."myCopyTask$copyTaskIterator".from fromPath
    tasks."myCopyTask$copyTaskIterator".into toPath
    tasks."myCopyTask$copyTaskIterator".rename { fileName ->
        fileName.replace '.txt', nameSuffix
    }
    preBuild.dependsOn tasks."myCopyTask$copyTaskIterator"
    copyTaskIterator++
    testIterator++
}
The problem is that nameSuffix is evaluated too late: the rename closure only runs at execution time, by which point the loop has finished and the shared nameSuffix holds its last value, so every file gets the same suffix. Sadly the documentation does not spell out whether the closure runs at configuration or execution time.
Just try to use rename(java.util.regex.Pattern, java.lang.String), which takes the replacement as a plain string that is evaluated immediately:
tasks."myCopyTask$copyTaskIterator".rename("\\.txt", nameSuffix)
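Alternatively, a minimal sketch (untested) of capturing the suffix in a local variable inside the loop, so each rename closure sees its own value rather than the shared script-level one:
int copyTaskIterator = 0
int testIterator = 0

sources.each { mySource ->
    // locals are captured per iteration, unlike the shared script-level nameSuffix
    String suffix = String.format("%02d", testIterator)
    task "myCopyTask$copyTaskIterator"(type: Copy) {
        from "tests/${mySource}/test.txt"
        into "tested/test${suffix}"
        rename { fileName -> fileName.replace('.txt', suffix) }
    }
    preBuild.dependsOn tasks."myCopyTask$copyTaskIterator"
    copyTaskIterator++
    testIterator++
}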

Hadoop distributed cache archive unarchiving in working directory

I'm sending an archive to the distributed cache via -Dmapred.cache.archives=hdfs://host:port/path/archive.zip#foldername -Dmapred.create.symlink=yes, and it creates a new folder in the working directory and unarchives the files there. The problem is I need those files in the working directory itself, and I've already tried using . and ./ as the folder name, as well as sending an empty one. Any ideas on how to solve this, other than moving the files explicitly in my Java code?
What's the specific need for the files to be in the working directory (so I can understand, and suggest some alternatives)?
Anyway, it looks like archives in the distributed cache will always be unpacked to a directory, so I don't think you can resolve this using archives - however, depending on the number of files you wish to place in the working directory, you can use files in the DistributedCache.
For example, using the GenericOptionsParser parameters you can specify files and folders to include which are then available in the working directory:
public static class DistCacheMapper extends
        Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    public void run(Context context) throws IOException,
            InterruptedException {
        Configuration conf = context.getConfiguration();
        System.err.println("Local Files:");
        listFiles(new File("."), "");
    }

    private void listFiles(File dir, String ident) {
        for (File f : dir.listFiles()) {
            System.out.println(ident + (f.isDirectory() ? "d" : "-") + "\t"
                    + f.getName());
            if (f.isDirectory()) {
                listFiles(f, ident + " ");
            }
        }
    }
}
For example, running hadoop jar myjar.jar -files pom.xml,.project,.classpath,src dummy.txt gives the following on stderr (which, as you can see, has picked up the src folder):
- .classpath
- .project
d tmp
- pom.xml
d src
d test
d resources
d java
d main
d resources
d java
d csw
d sandbox
- DistCacheJob.java
- .DistCacheJob.java.crc
- job.jar
- .job.jar.crc
So the long and the short of it is: you're going to have to list all the files you want in the working directory via the distributed cache -files option; subdirectories can be distributed either as archives or as files too.
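For a single file that must end up directly in the working directory, one pattern worth trying is the -files option with a URI fragment, which names the symlink created in the task's working directory (the job and class names below are placeholders; the mapred.create.symlink flag you already use may be unnecessary on newer releases when going through GenericOptionsParser):
hadoop jar myjar.jar MyJob \
  -Dmapred.create.symlink=yes \
  -files hdfs://host:port/path/config.txt#config.txt \
  input output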
