Spark-submit job performance - hadoop

I am currently running spark-submit on the following environment:
Single node (RAM: 40GB, VCores: 8, Spark Version: 2.0.2, Python: 3.5)
My pyspark program basically reads one 450MB unstructured file from HDFS, then loops through each line, extracts the necessary data, and places it into a list. Finally it uses createDataFrame and saves the data frame into a Hive table.
My pyspark program code snippet:
sparkSession = (SparkSession
.builder
.master("yarn")
.appName("FileProcessing")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate())
lines = sparkSession.read.text('/user/test/testfiles').collect()
for line in lines:
    # perform some data extraction and place it into rowList and colList using normal Python operations
df = sparkSession.createDataFrame(rowList, colList)
df.registerTempTable("tempTable")
sparkSession.sql("create table test as select * from tempTable");
My spark-submit command is the following:
spark-submit --master yarn --deploy-mode cluster --num-executors 2 --driver-memory 4g --executor-memory 8g --executor-cores 3 --files /usr/lib/spark-2.0.2-bin-hadoop2.7/conf/hive-site.xml FileProcessing.py
It took around 5 minutes to complete the processing. Is that performance considered good? How can I tune it, in terms of setting executor memory and executor cores, so that the process completes within 1-2 minutes? Is that possible?
Appreciate your response. Thanks.

For tuning your application you need to know a few things:
1) You need to monitor your application to see whether your cluster is under-utilized and how many resources are used by the application you have created.
Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.
2) Based on your observations about CPU and memory usage, you can get a better idea of what kind of tuning is needed for your application.
From Spark's point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and even change the garbage collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
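If you prefer not to change spark-defaults.conf globally, the same properties can also be passed per job with --conf flags on spark-submit. A minimal sketch reusing the values above (adjust them to whatever your monitoring shows):
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.memory=5g \
  --conf spark.executor.memory=3g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  FileProcessing.py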
For more details, refer to http://spark.apache.org/docs/latest/tuning.html
Hope this helps!

Related

Spark Performance tuning / optimization

I have a pretty standard use case and need suggestions on how to improve the Spark (2.4) job:
Dataframe1 (df1) = 10M records and
Dataframe2 (df2) = 50M records
then: join df1 & df2
use windowing functions etc.
Result dataframe (df3) = 2B records
further processing, i.e. filter and generate 5 different datasets from the prior df3 (this is where the issue starts); a rough sketch of the pipeline is below.
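A minimal PySpark sketch of the kind of pipeline described above (the paths, key and column names, and filters are placeholders, not the actual job):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("TuningExample").getOrCreate()

df1 = spark.read.parquet("/path/to/df1")  # ~10M records (placeholder path)
df2 = spark.read.parquet("/path/to/df2")  # ~50M records (placeholder path)

# join the two inputs on a shared key
joined = df1.join(df2, on="key", how="inner")

# apply a windowing function over the joined result
w = Window.partitionBy("key").orderBy("event_time")
df3 = joined.withColumn("row_num", F.row_number().over(w))  # ~2B records

# derive several filtered datasets from df3 (the step where the slowdown starts)
out1 = df3.filter(F.col("row_num") == 1)
out2 = df3.filter(F.col("category") == "A")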
The issue I face is that the initial few steps work fine in the notebook, but as soon as I reach df3, further processing gets really slow and fails or gets killed.
What would be the best way to optimize this processing? So far I have tried using:
an r4.xlarge cluster, and also an r5.16xlarge (500 GB memory) cluster (should I try anything else, like M4 or C4 clusters, or what would you suggest for this kind of processing?)
Spark conf used:
spark.conf.set("spark.executor.memory", "64g")
spark.conf.set("spark.driver.memory", "64g")
spark.conf.set("spark.executor.memoryOverHead", "24g")
spark.conf.set("spark.driver.memoryOverHead", "24g")
spark.conf.set("spark.executor.cores", "8")
spark.conf.set("spark.paralellism", 100)
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.sql.broadcastTimeout", "7200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
using cache on df1, df2, df3.
Once memory is used up, I see disk spill, so I tried tuning GC using:
spark.conf.set("spark.driver.extraJavaOptions", "XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
spark.conf.set("spark.executor.extraJavaOptions", "XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
The above steps didn't help much. Please suggest what config, memory and cluster settings might help,
or
what other optimization techniques can be used here?

Setting JVM options when configuring Elasticsearch

I'm configuring JVM options for an Elasticsearch cluster, and I wonder which JVM heap size
would be best for my use case.
The machine has 16 GB memory and will be dedicated to a single node of Elasticsearch.
The default value is 1 GB, and I'm not familiar with Java/JVM, but I feel like this is too small.
Any help would be appreciated.
If you use Windows, you can press Windows + R, run systempropertiesadvanced, and then set an environment variable, for example:
ES_JAVA_OPTS
-Xms2g -Xmx2g
(You can increase the value as you want; 2 is the number, g means gigabytes, m means megabytes.)
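On Linux or macOS, the equivalent is to export the same variable before starting the node, or to put the heap flags in config/jvm.options. A minimal sketch, assuming a tarball install and following the common guidance of giving the heap roughly half of the machine's 16 GB RAM (the exact value is yours to tune):
export ES_JAVA_OPTS="-Xms8g -Xmx8g"
./bin/elasticsearch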
Reference document: https://www.elastic.co/guide/en/elasticsearch/reference/master/advanced-configuration.html#set-jvm-options
https://www.javadevjournal.com/java/jvm-parameters/

Mesos task resources - CPU & Mem

I use Mesos for batch jobs. Jobs run as Docker containers launched by the framework. There are 2 slaves running on each VM. The resources for each job were set to:
CPUS - 0.1
MEM - 1G
It's a 4-core machine, and Mesos was counting it as 8 cores since there are 2 slaves on each VM. So it overloaded the VM by submitting too many tasks, literally up to 80 jobs ((4+4)/0.1 = 80), and during peak load the VM used to crash.
I tried changing the CPU to 0.5 so that the VM would not be overloaded ((4+4)/0.5 = 20). But it looks like CPU usage still goes up to 95%. The tasks are not CPU-intensive, so I am not sure why they try to consume 95%.
Is it the case that tasks will use the resources regardless of whether they actually require them? Does it allocate 0.5 by default, or at most 0.5 only when required?
Having two agents on the same host/VM is more like an antipattern. If you want to oversubscribe on resources, have a look at the Mesos docs at http://mesos.apache.org/documentation/latest/oversubscription/

This isn't normal, right? Required AM memory (471859200+47185920 MB) is above the max threshold (2048 MB)

I have read a lot about solving this kind of problem by setting yarn.scheduler.maximum-allocation-mb, which I have set to 2 GB, as I am currently running select count(*) from <table>, which isn't a heavy computation, I guess. But what is Required AM memory (471859200+47185920 MB) supposed to mean? Other questions mention the problem with values like (1024+2048) or something similar.
I am setting up on a single machine, i.e. my desktop, which has 4 GB RAM and 2 cores. Is this spec too low to run Spark as the Hive execution engine?
Currently I am running this job from Java, and my setup is:
Connection connect = DriverManager.getConnection("jdbc:hive2://saurab:10000/default", "hiveuser", "hivepassword");
Statement state = connect.createStatement();
state.execute("SET hive.execution.engine=spark");
state.execute("SET spark.executor.memory=1g");
state.execute("SET spark.yarn.executor.memoryOverhead=512m");
yarn-site.xml
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3g</value>
</property>
And a simple query
String query = "select count(*) from sales_txt";
ResultSet res = state.executeQuery(query);
if (res.next()) {
System.out.println(res.getString(1));
}
Also, what are those two memory numbers (A+B)?
AM stands for Application Master, used when running Spark on YARN. A good explanation is here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-applicationmaster.html
It's not clear why you need to run YARN on your single machine to test this out. You could run this in standalone mode to remove the YARN overhead and test your Spark application code.
https://spark.apache.org/docs/latest/
The spark.*.memory and spark.yarn.executor.memoryOverhead properties need to be set when you deploy the Spark application; they cannot be set in those statements. For example, they could go into spark-defaults.conf as sketched below.
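A minimal sketch of setting those properties up front instead, assuming Hive on Spark picks them up from spark-defaults.conf in your Spark configuration directory (the values are only illustrative for a 4 GB machine):
spark.executor.memory 1g
spark.yarn.executor.memoryOverhead 512
spark.driver.memory 1g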

JVM tuning for better Solr performance

We are currently using Solr 1.4 in master/slave mode and want to improve performance for slave queries.
The biggest issue for us is that the index file is about 30 GB.
The slave server config is as below:
Dell PC Server: 48G memory and 2 CPU;
RedHat 64 Linux;
JDK64 1.6.0_22;
Tomcat 6.18.
Our current JAVA_OPTS is "-Xms2048M -Xmx20480M -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=20 -XX:SurvivorRatio=2"
Do you have more suggestion for JAVA_OPTS?
The JAVA_OPTS seem fine. Quite a few questions:
Is your max of 20 GB RAM peaking out? Can you check the memory stats to see what the maximum utilized is?
Is there any heavy processing happening on the slave? CPU stats?
What are the queries like? Are you using highlighting?
What is the number of results you are returning for a single query?
What do your cache stats say? Are the caches utilized properly?
Is your index optimized?
Do you use warming queries to improve performance of the slow-running queries?
If the above seems fine, you can consider enabling HTTP caching.
Also use the following opts:
-XX:+UseCompressedOops
(this will help in reducing the heap size)
-XX:+DoEscapeAnalysis
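Putting that together, the adjusted JAVA_OPTS could look something like the line below (only a sketch combining the existing options with the two suggested ones; -Xmx is written with an explicit unit, and the sizes still need to be validated against your actual memory stats):
JAVA_OPTS="-Xms2048M -Xmx20480M -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=20 -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+DoEscapeAnalysis"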
