How to set a system environment variable from a Hadoop Mapper? - hadoop

The problem below the line is solved, but I am now facing another problem.
I am doing this:
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(
    new URI("hdfs:/user/hadoop/harsh/libnative1.so"),
    job.getConfiguration());
and in the mapper :
System.loadLibrary("libnative1.so");
(I also tried System.loadLibrary("libnative1"); and System.loadLibrary("native1");)
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I am totally clueless about what I should set java.library.path to.
I tried setting it to /home and copied every .so from the distributed cache to /home/, but it still didn't work.
Any suggestions/solutions please?
--------------------------------------------------------------------------------
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried:
Runtime run = Runtime.getRuntime();
Process pr = run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws an IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using Hadoop version 0.20.2, which has Job & Configuration instead of JobConf.
I am unable to find any such variable there, and in any case this is not a Hadoop-specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance.

Why don't you export this variable on all nodes of the cluster?
Anyway, use the Configuration class as below when submitting the job:
Configuration conf = new Configuration();
conf.set("mapred.map.child.env", <string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2
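For example, here is a minimal sketch of that approach combined with the distributed-cache setup from the question (the class name, library path and env value are illustrative assumptions, not from the thread); note that System.loadLibrary() expects the short name "native1" for a file called libnative1.so:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class NativeLibJob {                       // illustrative class name
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Prepend the task working directory (where cache symlinks appear)
        // to LD_LIBRARY_PATH for every map task; value format is k1=v1,k2=v2.
        conf.set("mapred.map.child.env", "LD_LIBRARY_PATH=./:/usr/local/lib");

        Job job = new Job(conf, "native-lib-example");
        job.setJarByClass(NativeLibJob.class);

        // Ship the .so and symlink it as libnative1.so in each task directory.
        DistributedCache.createSymlink(job.getConfiguration());
        DistributedCache.addCacheFile(
            new URI("hdfs:/user/hadoop/harsh/libnative1.so#libnative1.so"),
            job.getConfiguration());

        // Inside the Mapper, load it by its short name:
        //   System.loadLibrary("native1");   // not "libnative1.so"
        job.waitForCompletion(true);
    }
}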

Related

how to change spark.r.backendConnectionTimeout value in RStudio?

I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. According to Spark's documentation on SparkR (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is 6000s. I would like to change this value to something large enough that my connection doesn't time out after the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
A similar post is around, but that involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
IMPORTANT: restart the Hadoop and YARN services, then try connecting to Spark with SparkR normally:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the setting took effect at http://localhost:4040/environment/
I hope this is useful for other people.

how to trigger spark job from java code with SparkSubmit.scala

I saw that Oozie uses
List<String> sparkArgs = new ArrayList<String>();
sparkArgs.add("--master");
sparkArgs.add("yarn-cluster");
sparkArgs.add("--class");
sparkArgs.add("com.sample.spark.HelloSpark");
...
SparkSubmit.main(sparkArgs.toArray(new String[sparkArgs.size()]));
But when I ran this on the cluster, I always got
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
I think that is because my program cannot find HADOOP_CONF_DIR. But how do I pass that setting to SparkSubmit from Java code?
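No answer is recorded in this thread, but as a hedged sketch of one common alternative: instead of calling SparkSubmit.main() in-process, org.apache.spark.launcher.SparkLauncher launches spark-submit as a child process and lets you hand it environment variables such as HADOOP_CONF_DIR explicitly (all paths below are placeholders):
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.launcher.SparkLauncher;

public class SubmitFromJava {
    public static void main(String[] args) throws Exception {
        // Environment for the spark-submit child process.
        Map<String, String> env = new HashMap<String, String>();
        env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf");   // placeholder path

        Process spark = new SparkLauncher(env)
            .setSparkHome("/opt/spark")                   // placeholder path
            .setAppResource("/path/to/sample-spark.jar")  // placeholder jar
            .setMainClass("com.sample.spark.HelloSpark")
            .setMaster("yarn-cluster")
            .launch();

        spark.waitFor();
    }
}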

Setting elasticsearch properties in spark-submit

I'm trying to launch Spark jobs that use Elasticsearch input from the command line using spark-submit, as described in http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
I'm setting the properties in a file, but when launching spark-submit it gives the following warnings:
~/spark-1.0.1-bin-hadoop1/bin/spark-submit --class Main --properties-file spark.conf SparkES.jar
Warning: Ignoring non-spark config property: es.resource=myresource
Warning: Ignoring non-spark config property: es.nodes=mynode
Warning: Ignoring non-spark config property: es.query=myquery
...
Exception in thread "main" org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed
My config file looks like (with correct values):
es.nodes nodeip:port
es.resource index/type
es.query query
Setting the properties in the Configuration object in the code works, but I need to avoid this workaround.
Is there a way to set those properties via command line?
I don't know if you resolved your issue (if so, how?), but I found this solution:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "spark/docs", Map("es.nodes" -> "10.0.5.151"))
Bye
When you pass a config file to spark-submit, it only loads configs that start with 'spark.'
So, in my config I simply use
spark.es.nodes <es-ip>
and in the code itself I have to do
val conf = new SparkConf()
conf.set("es.nodes", conf.get("spark.es.nodes"))

Error when running Hadoop Map Reduce for map-only job

I want to run a map-only job in Hadoop MapReduce; here's my code:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("import");
job.setMapperClass(Map.class);//Custom Mapper
job.setInputFormatClass(TextInputFormat.class);
job.setNumReduceTasks(0);
TextInputFormat.setInputPaths(job, new Path("/home/jonathan/input"));
But I get the error:
13/07/17 18:22:48 ERROR security.UserGroupInformation: PriviledgedActionException
as: jonathan cause:org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Then I tried to use this:
job.setOutputFormatClass(org.apache.hadoop.mapred.lib.NullOutputFormat.class);
But it gives me a compilation error:
java: method setOutputFormatClass in class org.apache.hadoop.mapreduce.Job
cannot be applied to given types;
required: java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
found: java.lang.Class<org.apache.hadoop.mapred.lib.NullOutputFormat>
reason: actual argument java.lang.Class
<org.apache.hadoop.mapred.lib.NullOutputFormat> cannot be converted to
java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
by method invocation conversion
What am I doing wrong?
Map-only jobs still need an output location specified. As the error says, you're not specifying this.
I think you mean that your job produces no output at all. Hadoop still wants you to specify an output location, even though nothing need be written there.
You want org.apache.hadoop.mapreduce.lib.output.NullOutputFormat, not org.apache.hadoop.mapred.lib.NullOutputFormat, which is what the second error indicates, though it's subtle.
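Putting that together, a corrected map-only driver might look roughly like this (the input path is taken from the question; the class name is illustrative, and Mapper.class stands in for the question's custom Map class):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ImportJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("import");
        job.setJarByClass(ImportJob.class);

        job.setMapperClass(Mapper.class);   // replace with the custom Map class
        job.setNumReduceTasks(0);           // map-only job

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.setInputPaths(job, new Path("/home/jonathan/input"));

        // New-API NullOutputFormat: nothing is written, but the
        // "Output directory not set" check is satisfied.
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}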

Hadoop : ClassNotFound Error at MapReduce

Just to state my setup before posing the question:
Hadoop version: 1.0.3
The default WordCount example runs fine. But then I created a new WordCount program according to this page: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
I compiled it and jarred it in a similar fashion to the tutorial. But when I ran it using:
/usr/local/hadoop$ bin/hadoop jar wordcount.jar org.myorg.WordCount ../Space/input/ ../Space/output
I got the following error,
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.myorg.WordCount$Map
The whole error log has been pasted here : http://pastebin.com/GNbsfpg3
Where did I go wrong?
There are some clues in the error messages:
12/07/14 18:09:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/07/14 18:09:38 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
You'll need to share your driver code with us (where you create and configure the job), but it appears you are not configuring the 'job jar'. That is to say, the job client is not given a hint as to where your code is bundled into a jar, so the classes cannot be found when the map instances actually run.
You probably want something like
jobConf.setJarByClass(org.myorg.WordCount.class);
I had exactly the same problem and got it fixed by adding the following to the main code:
jobConf.setJarByClass(org.myorg.WordCount.class);
Here you can find the full main function:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(org.myorg.WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
I also got the above error.
What I did was copy the jar files to all nodes of the cluster and set the classpath so that every slave node can access the jar.
This worked for me; it might help you.
Before runJob, set the conf like this:
conf.setJar("Your jar file name");
It may work, try it!
