Spark Streaming to ElasticSearch - elasticsearch

I'm trying to replicate this example Streamlining Search Indexing using Elastic Search by Holden Karau using the Spark Java API. I've successfully made it work as a normal Java application with some changes in the code. Instead of using saveAsHadoopDataset method I'm sending my tweets with:
JavaEsSpark.saveToEs(rdd,"/test/collection");
and running my code with:
java -cp ./target/hbase-spark-playground-1.0-SNAPSHOT.jar spark.examples.SparkToElasticSearchStreaming local[2] collection-name
My current problem is how to execute it on a Yarn Cluster. A code snippet of what I'm doing can be found here:
https://gist.github.com/IvanFernandez/b3a3e25397f8b402256b
and running my class this way:
spark.examples.SparkToElasticSearchStreaming --master yarn-cluster --executor-memory 400m --num-executors 1 ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name
I think that the main problem is that I don't have any elasticSearch configuration in the foreach transformation so I can't reach my elasticSearch master. Any ideas?

The es cluster or other configuration information should be set in the SparkConf, which is already done in your code snippet as args[2] set as es.nodes. In your yarn command the third argument with es host is missing, also I believe that your command is using spark-submit to submit the application.

Can you please try setting the spark.es.nodes and es.port properties in as shown below in SparkConf:
sparkConf.set("spark.es.nodes", args[2]);
sparkConf.set("es.port", args[3]); // HTTP Port of elastic search
And use the below command to run the app on yarn:
spark-submit --class spark.examples.SparkToElasticSearchStreaming --master yarn-cluster --executor-memory 400m --num-executors 1 ./target/hbase-spark-playground-1.0-SNAPSHOT.jar yarn-cluster collection-name localhost 9200

Related

how to switch between cluster types in Apache Spark

I'm trying to switch cluster manager from standalone to 'YARN' in Apache Spark that I've installed for learning.
I read following thread to understand which cluster type should be chosen
However, I'd like to know the steps/syntax to change the cluster type.
Ex: from Standalone to YARN or from YARN to Standalone.
In spark there is one function name as --master that can helps you to execute your script on yarn Cluster mode or standalone mode.
Run the application on local mode or standalone used this with spark-submit command
--master Local[*]
or
--master spark://192.168.10.01:7077 \
--deploy-mode cluster \
Run on a YARN cluster
--master yarn
--deploy-mode cluster
For more information kindly visit this link.
https://spark.apache.org/docs/latest/submitting-applications.html
If you are not running through command line then you can directly set this master on SparkConf object.
sparkConf.setMaster(http://path/to/master/url:port) in cluster mode
or
sparkConf.setMaster(local[*]) in client/local mode

Can't see Yarn Job when doing Spark-Submit on Yarn Cluster

I am using spark-submit for my job with the command below:
spark-submit script_test.py --master yarn --deploy-mode cluster
spark-submit script_test.py --master yarn-cluster --deploy-mode cluster
The job is working fine. I can see it under the Spark History Server UI. However, I cannot see it under the RessourceManager UI ( YARN).
I have the feeling that my job is not sent to the cluster but it is running only in one node. However, I see nothing wrong on the way I use the Spark-submit command.
Am-i wrong? How can I check it? Or send the job to yarn cluster?
When you are using --master yarn means that in some place you have configured the yarn-site with hosts, ports, and so on.
Maybe the machine where you are using the spark-submit doesn't know where is the Yarn master.
You could check your hadoop/yarn/spark config files, specially the yarn-site.xml to check if the host of the Resource Manager is correct or not.
Those files are in different folders depending on which distribution of Hadoop you are using. In HDP I guess they are in /etc/hadoop/conf
Hope it helps.

Running a Spark job with spark-submit across the whole cluster

I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.
I can run pyspark, and submit jobs with spark-submit.
However, when I create a standalone job, like job.py, I create a SparkContext, like so:
sc=SparkContext("local", "App Name")
This doesn't seem right, but I'm not sure what to put there.
When I submit the job, I am sure it is not utilizing the whole cluster.
If I want to run a job against my entire cluster, say 4 processes per slave, what do I have to
a.) pass as arguments to spark-submit
b.) pass as arguments to SparkContext() in the script itself.
You can create spark context using
conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf)
and you have to submit the program to spark-submit using the following command for spark standalone cluster
./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
For Mesos cluster
./bin/spark-submit --master mesos://207.184.161.138:7077 code.py
For YARN cluster
./bin/spark-submit --master yarn --deploy-mode cluster code.py
For YARN master, the configuration would be read from HADOOP_CONF_DIR.

Missing java system properties when running spark-streaming on Mesos cluster

I submit a spark app to mesos cluster(running in cluster mode), and pass java system property through "--drive-java-options=-Dkey=value -Dkey=value", however these system properties are not available at runtime, seems they are not set. --conf "spark.driver.extraJavaOptions=-Dkey=value" doesn't work either
More details:
the command is
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --driver-java-options "-Dconfiguration.http=http://10.3.101.119:9090/application.conf" --conf "spark.executor.extraJavaOptions=-Dconfiguration.http=http://10.3.101.119:9090/application.conf" ${jar file}
I have a two-node mesos cluster, one node both runs master and slave, and the other runs slave only. I submit the spark application on master node.
Internally, the application hopes to read a configuration file from java system property "configuration.http", if the property is not available, the application will load a default file from the root of the classpath. When I submit the application, from the logs, i saw the default configuration file is loaded.
And the actual command to run the application is
"sh -c '/home/ubuntu/spark-1.6.0/bin/spark-submit --name ${appName} --master mesos://zk://10.3.101.184:2181/mesos/grant --driver-cores 1.0 --driver-memory 1024M --class ${classname} ./${jar file} '"
from here you can see the system property is lost
You might have a look at this blog post which recommends using an external properties file for this purpose:
$ vi app.properties
spark.driver.extraJavaOptions -Dconfiguration.http=http://10.3.101.119:9090/application.conf
spark.executor.extraJavaOptions –Dconfiguration.http=http://10.3.101.119:9090/application.conf
Then try to run this via
bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} —-properties-file app.properties ${jar file}
See
How to pass -D parameter or environment variable to Spark job?
Separate logs from Apache spark

How to report JMX from Spark Streaming on EC2 to VisualVM?

I have been trying to get a Spark Streaming job, running on a EC2 instance to report to VisualVM using JMX.
As of now I have the following config file:
spark/conf/metrics.properties:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
And I start the spark streaming job like this:
(the -D bits I have added afterwards in the hopes of getting remote access to the ec2's jmx)
terminal:
spark/bin/spark-submit --class my.class.StarterApp --master local --deploy-mode client \
project-1.0-SNAPSHOT.jar \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false
There are two issues with the spark-submit command line:
local - you must not run Spark Standalone with local master URL because there will be no threads to run your computations (jobs) and you've got two, i.e. one for a receiver and another for the driver. You should see the following WARN in the logs:
WARN StreamingContext: spark.master should be set as local[n], n > 1
in local mode if you have receivers to get data, otherwise Spark jobs
will not get resources to process the received data.
-D options are not picked up by the JVM as they're given after the Spark Streaming application and effectively became its command-line arguments. Put them before project-1.0-SNAPSHOT.jar and start over (you have to fix the above issue first!)
spark-submit --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8090 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"/path/example/src/main/python/pi.py 10000
Notes:the configurations format : --conf "params" . tested under spark 2.+

Resources