Spark workers unable to find JAR on EC2 cluster - amazon-ec2

I'm using spark-ec2 to run some Spark code. When I set master to
"local", then it runs fine. However, when I set master to $MASTER,
the workers immediately fail, with java.lang.NoClassDefFoundError for
the classes. The workers connect to the master, and show up in the UI, and try to run the task; but immediately raise that exception as soon as it loads its first dependency class (which is in the assembly jar).
I've used sbt-assembly to make a jar with the classes, confirmed using
jar tvf that the classes are there, and set SparkConf to distribute
the classes. The Spark Web UI indeed shows the assembly jar to be
added to the classpath:
http://172.x.x.x:47441/jars/myjar-assembly-1.0.jar
It seems that, even though myjar-assembly contains the classes and is being
added to the cluster, it's not reaching the workers. How do I fix this?
(Do I need to manually copy the jar file? If so, to which directory?
I thought the point of adding jars via SparkConf was to do this automatically.)
My attempts at debugging have shown:
The assembly jar is being copied to /root/spark/work/app-xxxxxx/1/
(determined by SSHing to a worker and searching for the jar)
However, that path doesn't appear on the worker's classpath
(determined from the logs, which show the java -cp command but omit that file)
So, it seems like I need to tell Spark to add the path to the assembly
jar to the worker's classpath. How do I do that? Or is there another culprit? (I've spent hours trying to debug this but to no avail!)

NOTE: This is an EC2-specific answer, not a general Spark answer. I'm just trying to round out an answer to a question asked a year ago, one that has the same symptom but often different causes and trips up a lot of people.
If I am understanding the question correctly, you are asking, "Do I need to manually copy the jar file? If so, to which dir?" You say "and set SparkConf to distribute the classes", but it's not clear whether this is done via spark-env.sh or spark-defaults.conf. So I'm making some assumptions, the main one being that you are running in cluster mode, meaning your driver runs on one of the workers and you don't know which one in advance... then...
The answer is yes, copy it to the directory named in the classpath. In EC2 the only persistent data storage is /root/persistent-hdfs, but I don't know if that's a good idea.
In the Spark docs on EC2 I see this line:
To deploy code or data within your cluster, you can log in and use
the provided script ~/spark-ec2/copy-dir, which, given a directory
path, RSYNCs it to the same location on all the slaves.
SPARK_CLASSPATH
I wouldn't use SPARK_CLASSPATH because it's deprecated as of Spark 1.0, so a good idea is to use its replacement in $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.extraClassPath /path/to/jar/on/worker
This should be the option that works. If you need to do this on the fly, rather than in a conf file, the recommendation is "./spark-submit with --driver-class-path to augment the driver classpath" (from the Spark docs about spark.executor.extraClassPath; see the end of this answer for another source on that).
BUT ... you are not using spark-submit ... I don't know how that works in EC2; looking at the script, I couldn't figure out where EC2 lets you supply these parameters on the command line. You mention you already do this when setting up your SparkConf object, so stick with that if it works for you.
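For what it's worth, here is a minimal sketch of doing both things from the SparkConf itself (this is my own illustration, not your code; the /root path is an assumption, and it presumes you've already rsync'd the assembly jar to that same path on every worker, e.g. with ~/spark-ec2/copy-dir):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ClasspathSketch {
    public static void main(String[] args) {
        // Assumed location of the assembly jar on every worker node.
        String assemblyJar = "/root/myjar-assembly-1.0.jar";

        SparkConf conf = new SparkConf()
                .setAppName("MyApp")
                .setMaster(System.getenv("MASTER"))
                // ships the jar to executors for the task classloader
                .setJars(new String[] { assemblyJar })
                // puts the worker-local copy on the executor JVM's classpath
                .set("spark.executor.extraClassPath", assemblyJar);

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... your job ...
        sc.stop();
    }
}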
I see that in Spark-years this is a very old question, so I wonder how you resolved it? I hope this helps someone; I learned a lot researching the specifics of EC2.
I must admit, as a limitation on this, that it confuses me that the Spark docs say, for spark.executor.extraClassPath:
Users typically should not need to set this option
I assume they mean most people will get the classpath out through a driver config option. I know most of the docs for spark-submit make it sound like the script handles moving your code around the cluster, but I think that's only in "standalone client mode", which I assume you are not using; I assume EC2 must be in "standalone cluster mode."
MORE BACKGROUND ON THE SPARK_CLASSPATH DEPRECATION:
More background that leads me to think SPARK_CLASSPATH is deprecated: this archived thread, and this one, which crosses the other thread, and this one about a WARN message when using SPARK_CLASSPATH:
14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:
/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath

You are required to register the jar with the Spark cluster when submitting your app. To make that possible, you can edit your code as follows:
val jars = new Array[String](1)
jars(0) = "/usr/local/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar"
val conf: SparkConf = new SparkConf()
  .setAppName("Busigence App")
  .setMaster(sparkMasterUrl)
  .setSparkHome(sparkHome)
  .setJars(jars)

Related

How to get the scheduler of an already finished job in Yarn Hadoop?

So I'm in this situation, where I'm modifying mapred-site.xml and the scheduler-specific configuration files for Hadoop, and I just want to make sure that the modifications I have made to the default scheduler (FIFO) have actually taken place.
How can I check which scheduler was applied to a job, or to a queue of jobs already submitted to Hadoop, using the job ID?
Sorry if this doesn't make that much sense, but I've looked around quite extensively to wrap my head around it, and read many documentations, and yet I still cannot seem to find this fundamental piece of information.
I'm simply trying word count as a job, changing scheduler settings in mapred-site.xml and yarn-site.xml.
For instance, I'm changing the property "yarn.resourcemanager.scheduler.class" to "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler" based on this link: see this
I'm also moving appropriate jar files specific to the schedulers to the correct directory.
For your reference, I'm using the "yarn" runtime mode, and Cloudera and Hadoop 2.
Thanks a ton for your help
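One rough way to check, sketched below (this is my own illustration, not from the thread; the host and port are placeholders): the ResourceManager's REST API reports the active scheduler under /ws/v1/cluster/scheduler, so a quick GET shows whether the CapacityScheduler actually took effect.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SchedulerCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder address: point this at your ResourceManager's web UI host/port.
        URL url = new URL("http://resourcemanager-host:8088/ws/v1/cluster/scheduler");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // the JSON names the scheduler type, e.g. capacityScheduler
            }
        }
    }
}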

Running a hadoop job using java command

I have a simple Java program that sets up an MR job. I could successfully execute this on Hadoop infrastructure (Hadoop 2.x) using 'hadoop jar '. But I want to achieve the same thing using the java command as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
Just as you run your 'hadoop jar' command with the other parameters, you can run it the same way using java.
Check that this command evaluates to the Hadoop classpath:
$ hadoop classpath
Then add your custom jar to the classpath:
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar className
I am able to get mine working with this on my Hadoop cluster.
I don't think you can find documentation on this. The hadoop command is a script, and a lot of classes are used in it, e.g. FsShell for accessing the filesystem, RunJar for running a jar, etc. Adding the Hadoop-related libraries and configuration files to the classpath is handled in the hadoop command itself.
You'd better take a look at the hadoop script.
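To directly address "How can I pass hadoop configuration to this className?": a common pattern (a sketch of my own, not from the answers above; the class and job names are placeholders) is to have the driver implement Tool, so ToolRunner/GenericOptionsParser parse -conf, -D, -fs and -jt arguments for you, much like hadoop jar does.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already populated from -conf/-D arguments
        Job job = Job.getInstance(conf, "my job");
        job.setJarByClass(MyDriver.class);
        // ... set mapper/reducer classes and input/output paths from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}

You would then launch it with something like: java -cp `hadoop classpath`:myjob.jar MyDriver -conf /path/to/extra-config.xml (the jar and config file names here are made up).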
How can you do that? Any jar file execution means it has to execute in a distributed environment where all the daemons work together to complete the execution.
We are not running locally or on the local file system, so it needs to be executed according to the norms of HDFS; I don't think we can run it the way we do on a local file system.
Hadoop is a framework that simplifies distributed computing. Before Hadoop, programmers already knew about parallel processing and multi-threading concepts, but when you deal with multiple machines you need to know how to:
Communicate between machines
Handle network processing
Handle fault tolerance when a machine fails
and much more, which is a huge amount of work; that's where Hadoop simplifies your job. It takes care of all the operating-level stuff so you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer: your program doesn't work just by passing parameters. You would need to write a lot of library code to deal with distributed computing. If you want to explore that, I would suggest going ahead and reading the Hadoop source code.
http://hadoop.apache.org/version_control.html

Run Pig with Lipstick on AWS EMR

I'm running an AWS EMR Pig job using script-runner.jar as described here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
Now, I want to hook up Netflix' Lipstick to monitor my scripts. I set up the server, and in the wiki here: https://github.com/Netflix/Lipstick/wiki/Getting-Started I can't quite figure out how to do the last step:
hadoop jar lipstick-console-[version].jar -Dlipstick.server.url=http://$LIPSTICK_URL
Should I substitute script-runner.jar with this?
Also, after following the build process in the wiki, I ended up with 3 different console jars:
lipstick-console-0.6-SNAPSHOT.jar
lipstick-console-0.6-SNAPSHOT-withHadoop.jar
lipstick-console-0.6-SNAPSHOT-withPig.jar
What is the purpose of the latter two jars?
UPDATE:
I think I'm making progress, but it still does not seem to work.
I set the pig.notification.listener parameter as described here, along with the Lipstick server URL. There is more than one way to do it in EMR; since I am using the Ruby API, I had to specify a step:
hadoop_jar_step:
  jar: 's3://elasticmapreduce/libs/script-runner/script-runner.jar'
  properties:
    - pig.notification.listener.arg: com.netflix.lipstick.listeners.LipstickPPNL
    - lipstick.server.url: http://pig_server_url
Next, I added lipstick-console-0.6-SNAPSHOT.jar to the Hadoop classpath. For this, I had to create a bootstrap action as follows:
bootstrap_actions:
  - name: copy_lipstick_jar
    script_bootstrap_action:
      path: #s3 path to bootstrap_lipstick.sh
where the contents of bootstrap_lipstick.sh are:
#!/bin/bash
hadoop fs -copyToLocal s3n://wp-data-west-2/load_code/java/lipstick-console-0.6-SNAPSHOT.jar /home/hadoop/lib/
The bootstrap action copies the lipstick jar to cluster nodes, and /home/hadoop/lib/ is already in hadoop classpath (EMR takes care of that).
It still does not work, but I think I am missing something really minor ... Any ideas appreciated.
Thanks!
Currently Lipstick's Main class is a drop-in replacement for Pig's Main class. This is a hack (and far from ideal) to get access to the logical and physical plans for your script, before and after optimization, which are simply not accessible otherwise. As such, it unfortunately won't work to just register the LipstickPPNL class as a PPNL for Pig. You've got to run Lipstick Main as though it were Pig.
I have not tried to run lipstick on EMR but it looks like you're going to need to use a custom jar step, not a script step. See the docs here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-launch-custom-jar-cli.html
The jar name would be the lipstick-console-0.6-SNAPSHOT-withHadoop.jar. It contains all the necessary dependencies to run Lipstick. Additionally the lipstick.server.url will need to be set.
Alternatively, you might take a look at https://www.mortardata.com/ which runs on EMR and has lipstick integration built-in.

Storm topology configuration

How do you provide a custom configuration to a Storm topology? For example, if I have a topology that I built that connects to a MySQL cluster, and I want to be able to change which servers I need to connect to without recompiling, how would I do that? My preference would be to use a config file, but my concern is that the file itself is not deployed to the cluster and therefore won't be read (unless my understanding of how a cluster works is flawed). The only way I've seen so far to pass configuration options into a Storm topology at runtime is via command-line parameters, but that gets messy when you have a good number of parameters.
One thought I did have is to leverage a shell script to read the file into a variable and pass the contents of that variable in as a string to the topology, but I'd like something a little cleaner if possible.
Has anyone else encountered this? If so, how did you solve it?
EDIT:
It appears I need to provide more clarification. My scenario is that I have a topology that I want to be able to deploy in different environments without having to recompile it. Normally, I'd create a config file that contains things like database connection parameters and have that passed in. I'd like to know how to do something like that in Storm.
You can specify a configuration (via a yaml file, typically) which you submit with your topology. The way we manage this in our own project is to have separate config files, one for development and one for production, and inside them we store our server, Redis and DB IPs, ports, etc. Then, when we run our command to build the jar and submit the topology to Storm, it includes the correct config file depending on the deployment environment. The bolts and spouts simply read the configuration they require from the stormConf map which is passed to them in the bolt's prepare() method.
From http://storm.apache.org/documentation/Configuration.html :
Every configuration has a default value defined in defaults.yaml in the Storm codebase. You can override these configurations by defining a storm.yaml in the classpath of Nimbus and the supervisors. Finally, you can define a topology-specific configuration that you submit along with your topology when using StormSubmitter. However, the topology-specific configuration can only override configs prefixed with "TOPOLOGY".
Storm 0.7.0 and onwards lets you override configuration on a per-bolt/per-spout basis.
You'll also see on http://nathanmarz.github.io/storm/doc/backtype/storm/StormSubmitter.html that submitJar and submitTopology are passed a map called conf.
Hope this gets you started.
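For concreteness, here is a bare-bones sketch of that pattern (my own illustration; the key names and values are made up, and the packages are the pre-1.0 backtype.storm ones that the links above use): put your own entries into the Config you submit, then read them back from the stormConf map in prepare().

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class ConfiguredTopology {

    // The bolt reads its settings back out of the stormConf map in prepare().
    public static class DbBolt extends BaseRichBolt {
        private String dbHost;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.dbHost = (String) stormConf.get("mysql.host");
            // ... open the MySQL connection to dbHost here ...
        }

        @Override
        public void execute(Tuple input) { /* ... */ }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { /* no output streams */ }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... builder.setSpout(...) and builder.setBolt("db", new DbBolt(), 2) as usual ...

        Config conf = new Config();
        // Per-environment value, e.g. read from a dev or prod yaml/properties file.
        conf.put("mysql.host", args.length > 0 ? args[0] : "db.dev.internal");

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}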
I solved this problem by just providing the config in code:
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS, SOME_OPTS);
I tried to provide a topology-specific storm.yaml, but it doesn't work. Correct me if you have managed to make a storm.yaml work.
Update:
For anyone who wants to know what SOME_OPTS is, this is from Satish Duggana on the Storm mailing list:
Config.TOPOLOGY_WORKER_CHILDOPTS: Options which can override
WORKER_CHILDOPTS for a topology. You can configure any java options
like memory, gc etc
In your case it can be
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx1g");
What might actually serve you best is to store the configuration in a mutable key value store (s3, redis, etc.) and then pull that in to configure a database connection that you then use (I assume here you are already planning to limit how often you talk to the database so that the overhead of getting this config is not a big deal). This design allows you to change the database connection on-the-fly, with no need to even redeploy the topology.
The idea is that when you build your topology, you create instances of your spouts and bolts (among other things) and these instances are serialized and distributed to the right places in the cluster. If you want to configure the behavior of a spout or bolt, you do so when creating the topology before submitting it and you do so by setting instance variables on the bolt or spout that, in turn, drive the configurable behavior you want.
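As a small sketch of that idea (the class and field names are mine, again with the backtype.storm packages): the value is baked into the bolt instance when you build the topology, serialized with it, and available on whichever worker the bolt lands on.

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class MySqlWriterBolt extends BaseRichBolt {
    private final String jdbcUrl;             // fixed at topology-build time
    private transient OutputCollector collector;

    public MySqlWriterBolt(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Open the database connection here on the worker, using jdbcUrl.
    }

    @Override
    public void execute(Tuple input) {
        // ... write the tuple to the database ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* no downstream streams */ }
}

The topology builder would then do something like builder.setBolt("writer", new MySqlWriterBolt(urlForThisEnvironment), 2).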
I also faced the same issue. I solved it by configuring NFS in my cluster and putting the configuration file in a shared location so that it would be available to all cluster machines. It's very easy to configure NFS on a Linux system (link).
I faced the same problem as you did, and here is my somewhat hacky solution:
Use a simple Java file as the config file, say topo_config.java; it looks like this:
package com.xxx;

public class topo_config {
    public static String zk_host = "192.168.10.60:2181";
    public static String kafka_topic = "my_log_topic";
    public static int worker_num = 2;
    public static int log_spout_num = 4;
    // ...
}
This file is put in my config folder. Then I write a script, say compile.sh, which copies it into the right package and does the compilation; it looks like:
cp config/topo_config.java src/main/java/com/xxx/
mvn package
The configuration is then accessed directly:
Config conf = new Config();
conf.setNumWorkers(topo_config.worker_num);
We have seen the same issue and solved it by adding the below per topology:
config.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx4096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70 -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true");
We also verified it using the Nimbus UI, where it shows up like below:
topology.worker.childopts -Xmx4096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70 -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true

how to get multipleOutput in hadoop

I'm new to Hadoop and now have to process an input file. I want to process each line, and the output should be one file for each line.
I surfed the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place, and I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs as well as interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will be run in a single JVM. With this in mind, I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (neither in 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; the Javadoc page has some usage notes for MultipleOutputs).
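For what it's worth, here is a rough sketch of the MultipleOutputs route with the new API (the types and names are placeholders, and it assumes a Hadoop version where the new-API MultipleOutputs is available): the write() overload that takes a baseOutputPath lets you steer each key into its own file under the job's output directory.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerKeyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // The third argument is the base output path, so each key gets its own file.
        mos.write(key, new IntWritable(sum), key.toString() + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

You may also want LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver so the default empty part files aren't created.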

Resources