Alternative ways to start a Hadoop streaming job

I can successfully kick off a Hadoop streaming job from the terminal, but I am looking for ways to start streaming jobs via an API, Eclipse, or some other means.
The closest I found was this post https://stackoverflow.com/questions/11564463/remotely-execute-hadoop-streaming-job but it has no answers!
Any ideas or suggestions would be welcome.

Interesting question. I found a way to do this; hopefully it will help you too.
The first method should work on Hadoop 0.22:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://xxxxx:9000");
conf.set("mapred.job.tracker", "hdfs://xxxxx:9001");

StreamJob sj = new StreamJob();
try {
    // Same arguments as on the command line; StreamJob is the streaming driver class.
    ToolRunner.run(conf, sj, new String[] {
        "-D", "stream.tmpdir=c:\\",
        "-mapper", "/path/to/mapper.py",
        "-reducer", "/path/to/reducer.py",
        "-input", "/path/to/input",
        "-output", "/path/to/output" });
} catch (Exception e) {
    e.printStackTrace();
}
I also found this Java wrapper which you should be able to run.

Take a look at Apache Oozie: once you have defined your job via XML, you can launch it via an HTTP POST to the Oozie server.
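For illustration only, here is a minimal sketch of such a submission using plain HttpURLConnection. The Oozie host/port, user name, and workflow path are placeholder assumptions, and the workflow itself must already be deployed to HDFS:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie endpoint; action=start submits and starts the job in one call.
        URL url = new URL("http://oozie-host:11000/oozie/v1/jobs?action=start");

        // Job properties are sent as a Hadoop configuration XML document.
        String config =
            "<configuration>"
          + "<property><name>user.name</name><value>training</value></property>"
          + "<property><name>oozie.wf.application.path</name>"
          + "<value>hdfs://namenode:9000/user/training/streaming-wf</value></property>"
          + "</configuration>";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/xml");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(config.getBytes(StandardCharsets.UTF_8));
        }
        // On success Oozie answers with the new job id in a JSON body.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}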

When the Hadoop streaming job is run as
hadoop jar /home/training/Installations/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar -input input4 -output output4 -mapper /home/training/Code/Streaming/max_temperature_map.rb -reducer /home/training/Code/Streaming/max_temperature_reduce.rb
then org.apache.hadoop.streaming.HadoopStreaming is executed. This class is named as the main class in the MANIFEST.MF of hadoop-streaming-1.0.3.jar. Check the code of the org.apache.hadoop.streaming.HadoopStreaming class for the API details.
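As a rough sketch (assuming hadoop-streaming-1.0.3.jar and its dependencies are on the classpath), the same entry point can be invoked directly from Java, with the argument array mirroring the command line above:

import org.apache.hadoop.streaming.HadoopStreaming;

public class StreamingEntryPoint {
    public static void main(String[] args) throws Exception {
        // Same arguments as the command line above, minus the jar itself.
        HadoopStreaming.main(new String[] {
            "-input", "input4",
            "-output", "output4",
            "-mapper", "/home/training/Code/Streaming/max_temperature_map.rb",
            "-reducer", "/home/training/Code/Streaming/max_temperature_reduce.rb"
        });
    }
}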

Related

How to handle service application failures developed by Hive JDBC?

I'm developing a service application based on the Hive jdbc-examples project, but I fail to run the application.
I normally use Huawei's big data products for Hive JDBC sample development, and the application always fails to run.
To solve this problem, try one of the following methods:
Method 1
Put the core-site.xml configuration file on the classpath, as in the Hive jdbc-examples sample project.
Method 2
In the code, explicitly load the core-site.xml file:
// Load core-site.xml from a conf directory under the application's working directory.
Configuration conf = new Configuration();
String userdir = System.getProperty("user.dir") + File.separator + "conf" + File.separator;
conf.addResource(new Path(userdir + "core-site.xml"));
Method 3
In the code, set hadoop.security.authentication to kerberos:
CONF = new Configuration();
CONF.set("hadoop.security.authentication", "kerberos");

How to set -javaagent in spark-submit

I have used aspect-oriented programming to do the logging in a Java Maven project.
While running it through eclipse I have to initialize javaagent in vmargs, as follows:
-javaagent:lib/aspectjweaver-1.9.1.jar
Now I want to submit the produced jar to a Spark worker. I have written a shell script to do it. I am able to run the job, but I am unable to initialize the javaagent.
export SPARK_PATH=/xyz
export SPARK_URL=spark://abc:0000
export JAVA_OPTS="$JAVA_OPTS -javaagent:../aspectweaver-1.9.1.jar"
$SPARK_PATH/spark-submit --master $SPARK_URL --jars --class com.main.index ../index-0.0.1-SNAPSHOT.jar
I have tried a number of approaches, such as setting JAVA_OPTS and CATALINA_OPTS and creating spark-env.sh, but none of them worked. I have been struggling with this for the last 3 days.
I checked a few similar questions on Stack Overflow, but none of them were helpful for setting the javaagent.
Thanks.
EDIT:
I am checking whether the javaagent is initialized using the code below:
try {
    org.aspectj.weaver.loadtime.Agent.getInstrumentation();
} catch (NoClassDefFoundError | UnsupportedOperationException e) {
    System.out.println(e);
}
I get the NoClassDefFoundError, which means the javaagent is not set.
I found the answer to this: I had to use "--driver-java-options". Below is the updated script.
$SPARK_PATH/spark-submit --master $SPARK_URL --driver-java-options "-javaagent:../aspectjweaver-1.9.1.jar" --class com.main.index ../index-0.0.1-SNAPSHOT.jar "$1"

Adding some hadoop configuration to a spark application at runtime (through spark-submit)?

I want to send a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But I want to be able to send this property through spark-submit at runtime. Would this be possible?
Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which let you specify such arbitrary configuration options. You can then read the values you pass with SparkConf.get:
// The property is passed to spark-submit with a "spark." prefix, e.g.
// --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true,
// then read from SparkConf and copied into the Hadoop configuration
// under its real, unprefixed key.
val conf = new SparkConf()
val mrRecursive = conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
Spark-submit/spark-shell --help:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html
Without code modification, such an approach can also be used:
Hadoop's Configuration reads the file "core-default.xml" during creation; the description is here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and include the directory containing that file in the classpath via the spark-submit "--driver-class-path" parameter, it can work.

Can anyone provide me mapping of hadoop file shell commands to java classes?

I am new to Hadoop and am working with the Java API to manage files in HDFS, and I am looking for a mapping (of sorts) of the hadoop file shell commands to Java classes.
Are you looking for this mapping info?
The Java class for each shell command is registered in org.apache.hadoop.fs.shell.CopyCommands:

class CopyCommands {
    public static void registerCommands(CommandFactory factory) {
        factory.addClass(Merge.class, "-getmerge");
        factory.addClass(Cp.class, "-cp");
        factory.addClass(CopyFromLocal.class, "-copyFromLocal");
        factory.addClass(CopyToLocal.class, "-copyToLocal");
        factory.addClass(Get.class, "-get");
        factory.addClass(Put.class, "-put");
        factory.addClass(AppendToFile.class, "-appendToFile");
    }
}
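If the goal is to manage files from Java rather than to call the shell command classes, the underlying org.apache.hadoop.fs.FileSystem API is usually the better fit. A minimal sketch (the namenode URI and paths are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; normally picked up from core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Roughly equivalent to -put and -get
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/training/local.txt"));
        fs.copyToLocalFile(new Path("/user/training/local.txt"), new Path("/tmp/copy.txt"));

        // Roughly equivalent to -ls
        for (FileStatus status : fs.listStatus(new Path("/user/training"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}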

HBase ERROR: hbase-default.xml file seems to be for and old version of HBase (null)

I am trying to write a program to connect to HBase. However, when I execute the following command,
HBaseConfiguration.create(); I get the following error:
"hbase-default.xml file seems to be for and old version of HBase (null), this version is 0.92.1-cdh4.1.2.
When I dig deeper and debug, I observe the following:
class HBaseConfiguration
private static void checkDefaultsVersion(Configuration conf) {
    if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE)) return;
    String defaultsVersion = conf.get("hbase.defaults.for.version");
    String thisVersion = VersionInfo.getVersion();
    if (!thisVersion.equals(defaultsVersion)) {
        throw new RuntimeException(
            "hbase-default.xml file seems to be for and old version of HBase (" +
            defaultsVersion + "), this version is " + thisVersion);
    }
}
In my case HBase returns the defaults version as null. I am not sure why it returns null, because I checked the corresponding entry in the hbase-default.xml packaged with the HBase jar and it has the correct entry.
When I try the same thing from a standalone program it works as expected.
Please let me know if you have any questions.
Thanks in advance,
Rohit
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hbase.defaults.for.version.skip</name>
    <value>true</value>
  </property>
</configuration>
Add this to an hbase-default.xml and put the file on the classpath or in the resources folder. I got this error when I ran from within a Spring Hadoop environment. By adding the above file to the resources folder of the job jar, I was able to solve it.
I finally found the workaround to this problem...
The problem is that hbase-default.xml is not included in your classpath.
I added hbase-default.xml to target/test-classes (it will vary in your case); you can just add hbase-default.xml to various folders and see what works for you.
NOTE: This is just a workaround, not the solution.
The solution would be to load the proper jars (which I haven't figured out yet).
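As a quick sanity check (a sketch, not part of the original answer), you can ask the classloader where hbase-default.xml is being picked up from, which tells you whether it is on the classpath at all:

// Prints the jar or folder that hbase-default.xml resolves to, or null if it is missing.
java.net.URL res = HBaseConfiguration.class.getClassLoader().getResource("hbase-default.xml");
System.out.println("hbase-default.xml resolved to: " + res);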
I've been getting this error using HBase 1.1.1.
I created a simple HBase client and it worked fine. Then I built a simple RMI service, and that worked fine. But when I put my simple HBase query code into the RMI service, I started getting this error on the HBaseConfiguration.create() call. After playing around a bit, I found that the HBaseConfiguration.create() call works OK if placed before the security manager setup in my main(). I get the error if the call is placed after the block of code containing the security manager calls...
Configuration conf = HBaseConfiguration.create(); // This works

if (System.getSecurityManager() == null) {
    System.setSecurityManager(new SecurityManager());
} // End if

// Configuration conf = HBaseConfiguration.create(); // This fails
I get the error if the create() call happens in main() after that security manager block, or in code within the class that is instantiated by main(). I don't get the error if create() is called within a static{ } block in my RMI service class (which I believe gets called before main()), or in main() before the security manager block, as shown.
BTW, the jar files that I include in my class path in order to get a minimal client to run are the following:
commons-codec-1.9.jar,
commons-collections-3.2.1.jar,
commons-configuration-1.6.jar,
commons-lang-2.6.jar,
commons-logging-1.2.jar,
guava-12.0.1.jar,
hadoop-auth-2.5.1.jar,
hadoop-common-2.5.1.jar,
hbase-client-1.1.1.jar,
hbase-common-1.1.1.jar,
hbase-hadoop2-compat-1.1.1.jar,
hbase-it-1.1.1-tests.jar,
hbase-protocol-1.1.1.jar,
htrace-core-3.1.0-incubating.jar,
log4j-1.2.17.jar,
netty-all-4.0.23.Final.jar,
protobuf-java-2.5.0.jar,
slf4j-api-1.7.7.jar,
slf4j-log4j12-1.7.5.jar
Had a similar problem where the error was
java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (0.98.3-hadoop2), this version is Unknown
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:70)
at org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:102)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:113)
In my case I had the same set of jar files at two different levels of the classpath; removing them from one level made it work fine.
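A sketch of how such duplicates can be spotted: listing every copy of hbase-default.xml visible to the classloader should print more than one entry when the jars are present at two classpath levels.

// Each URL printed is a separate copy of hbase-default.xml on the classpath.
// (getResources throws IOException; handle or declare it in real code.)
java.util.Enumeration<java.net.URL> copies =
    HBaseConfiguration.class.getClassLoader().getResources("hbase-default.xml");
while (copies.hasMoreElements()) {
    System.out.println(copies.nextElement());
}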
In my case the issue was caused by an old Java version (1.5), which was the default on the server. It works fine with 1.7.
In my code, I used this to work around the error:
val config = HBaseConfiguration.create() // throws the hbase-default.xml version error
val config = new Configuration()         // works, but note that hbase-default.xml and hbase-site.xml are then not loaded
