We are currently using Solr 1.4 in master/slave mode and want to improve query performance on the slave.
Our biggest issue is that the index is about 30 GB.
The slave server configuration is as follows:
Dell PC Server: 48G memory and 2 CPU;
RedHat 64 Linux;
JDK64 1.6.0_22;
Tomcat 6.18.
Our current JAVA_OPTS is "-Xms2048M -Xmx20480M -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=20 -XX:SurvivorRatio=2"
Do you have any further suggestions for JAVA_OPTS?
The JAVA_OPTS seem fine. A few questions:
Is the 20 GB maximum heap actually being reached? Can you check the memory stats to see the peak usage?
Is there any heavy processing happening on the slave? What do the CPU stats show?
What do the queries look like? Are you using highlighting?
How many results are you returning for a single query?
What do your cache stats say? Are the caches being used effectively?
Is your index optimized?
Do you use warming queries to improve performance on the slow-running queries?
If all of the above looks fine, can you consider enabling HTTP caching?
You can also use the following options:
-XX:+UseCompressedOops
(This helps reduce heap usage.)
-XX:+DoEscapeAnalysis
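As a quick sanity check on flags like these, the sketch below parses JVM size values the way HotSpot reads them (a bare number is bytes; k, m, g and t suffixes scale by powers of 1024) and checks whether a maximum heap is small enough for compressed oops, which by default only apply to heaps below roughly 32 GB. The function names and the cutoff constant are illustrative assumptions, not part of Solr or the JVM.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

# Assumed cutoff: with the default 8-byte object alignment, compressed oops
# can address heaps below about 32 GiB.
COMPRESSED_OOPS_LIMIT = 32 * 1024**3


def parse_jvm_size(value):
    """Convert a JVM size such as '2048M' or '20g' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgGtT]?)", value.strip())
    if match is None:
        raise ValueError("not a JVM size: %r" % value)
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def compressed_oops_possible(xmx):
    """True if a maximum heap of this size is below the assumed cutoff."""
    return parse_jvm_size(xmx) < COMPRESSED_OOPS_LIMIT


if __name__ == "__main__":
    for size in ("2048M", "20480M", "20480", "48g"):
        print(size, parse_jvm_size(size), compressed_oops_possible(size))
```

Note that a value written without a unit, such as 20480, is read as bytes rather than megabytes, so it is worth checking that every size flag carries the intended suffix.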
I have a pretty standard use case and need suggestions on how to improve a Spark (2.4) job:
DataFrame 1 (df1) = 10M records and
DataFrame 2 (df2) = 50M records
then: join df1 and df2
use windowing functions etc.
Result DataFrame (df3) = 2B records
Further processing, i.e. filter and generate 5 different datasets from df3 (this is where the issue starts).
The issue I face is that the first few steps work fine in the notebook, but as soon as I reach df3, further processing gets really slow and then fails or is killed.
What would be the best way to optimize this processing? So far I have tried:
an r4.xlarge cluster and also an r5.16xlarge (500 GB memory) cluster (should I try others such as M4 or C4, or what would you suggest for this kind of processing?)
Spark conf used:
spark.conf.set("spark.executor.memory", "64g")
spark.conf.set("spark.driver.memory", "64g")
spark.conf.set("spark.executor.memoryOverhead", "24g")
spark.conf.set("spark.driver.memoryOverhead", "24g")
spark.conf.set("spark.executor.cores", "8")
spark.conf.set("spark.default.parallelism", 100)
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.sql.broadcastTimeout", "7200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
I am using cache on df1, df2 and df3.
Once memory is used up I see disk spill, so I tried tuning GC using:
spark.conf.set("spark.driver.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
spark.conf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
The steps above didn't help much. Please suggest what config, memory and cluster settings might help,
or
what other optimization techniques can be used here?
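For sizing questions like this one, it can help to work out how many executors of the requested size actually fit on one node. The sketch below is only back-of-the-envelope arithmetic using the settings from the question (64g executor memory plus 24g overhead). The node memory figures are assumptions (500 GB as stated in the question for r5.16xlarge, and an assumed 30.5 GB for r4.xlarge), and the calculation ignores memory reserved for the OS and the cluster manager.

```python
def executors_per_node(node_memory_gb, executor_memory_gb, overhead_gb):
    """How many executor containers of the given size fit in node memory."""
    container_gb = executor_memory_gb + overhead_gb
    return int(node_memory_gb // container_gb)


if __name__ == "__main__":
    # Settings from the question: 64g heap + 24g overhead = 88g per executor.
    print(executors_per_node(500, 64, 24))   # r5.16xlarge, memory as stated
    print(executors_per_node(30.5, 64, 24))  # r4.xlarge, assumed 30.5 GB
```

A result of 0 means an executor of that size cannot be placed on such a node at all, so the requested sizes and the instance type need to be chosen together.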
We are having an issue with virtual servers (VMs) running out of native memory. These VMs are running:
Red Hat Enterprise Linux 7.2 (Maipo)
Wildfly 9.0.1
Java 1.8.0_151 (different JVMs have different heap sizes, ranging from 0.5G to 2G)
The JVM args are:
-XX:+UseG1GC
-XX:SurvivorRatio=1
-XX:NewRatio=2
-XX:MaxTenuringThreshold=15
-XX:-UseAdaptiveSizePolicy
-XX:G1HeapRegionSize=16m
-XX:MaxMetaspaceSize=256m
-XX:CompressedClassSpaceSize=64m
-javaagent:/<path to new relic.jar>
After about a month, sometimes longer, the VMs start to use all of their swap space, and eventually the OOM killer notices that Java is using too much memory and kills one of our JVMs.
The amount of memory used by the Java process is larger than heap + Metaspace + compressed class space, as revealed by -XX:NativeMemoryTracking=detail.
Are there tools that could tell me what is in this native memory (like a heap dump, but not for the heap)?
Are there any tools, other than jemalloc, that can map Java heap usage to native memory usage (outside the heap)? I have used jemalloc to try to achieve this, but the graph it draws contains only hex values and not human-readable class names, so I can't really get anything out of it. Maybe I'm doing something wrong, or perhaps I need another tool.
Any suggestions would be greatly appreciated.
You can use jcmd.
Start the application with -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail.
Use jcmd to monitor NMT (Native Memory Tracking):
jcmd <pid> VM.native_memory baseline       # take the baseline
jcmd <pid> VM.native_memory detail.diff    # show the change in native memory since the baseline; use as needed for further analysis
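To compare NMT readings over weeks, it may be easier to turn the text output into numbers. The following is a minimal sketch that parses the category lines of a summary report into a dictionary. It assumes the Java 8 summary layout ("- Category (reserved=...KB, committed=...KB)"); the function name parse_nmt_summary and the sample text are made up for illustration and are not real measurements.

```python
import re

# Matches category lines of the NMT summary, e.g.
# "-                 Java Heap (reserved=2097152KB, committed=2097152KB)"
_CATEGORY = re.compile(r"^-\s+(.+?) \(reserved=(\d+)KB, committed=(\d+)KB\)")


def parse_nmt_summary(text):
    """Return {category: (reserved_kb, committed_kb)} from NMT summary text."""
    result = {}
    for line in text.splitlines():
        match = _CATEGORY.match(line)
        if match:
            name, reserved, committed = match.groups()
            result[name.strip()] = (int(reserved), int(committed))
    return result


# Made-up sample for illustration only, not real measurements.
SAMPLE = """\
Total: reserved=3500000KB, committed=2300000KB
-                 Java Heap (reserved=2097152KB, committed=2097152KB)
-                     Class (reserved=1100000KB, committed=100000KB)
-                    Thread (reserved=60000KB, committed=60000KB)
"""

if __name__ == "__main__":
    for name, (reserved, committed) in parse_nmt_summary(SAMPLE).items():
        print(name, reserved, committed)
```

The input text would come from saving the output of the jcmd summary command at regular intervals; comparing the committed values per category then shows which area is growing.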
I have read a lot about solving this kind of problem by setting yarn.scheduler.maximum-allocation-mb, which I have set to 2 GB, as I am currently running select count(*) from <table>, which isn't a heavy computation, I guess. But what is Required AM memory (471859200+47185920 MB) supposed to mean? Other questions mention problems with something like (1024+2048).
I am setting this up on a single machine, i.e. my desktop, which has 4 GB of RAM and 2 cores. Is this spec too low to run Spark as the Hive execution engine?
Currently I am running this job from Java and my setup is:
Connection connect = DriverManager.getConnection("jdbc:hive2://saurab:10000/default", "hiveuser", "hivepassword");
Statement state = connect.createStatement();
state.execute("SET hive.execution.engine=spark");
state.execute("SET spark.executor.memory=1g");
state.execute("SET spark.yarn.executor.memoryOverhead=512m");
yarn-site.xml
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3g</value>
</property>
And a simple query
String query = "select count(*) from sales_txt";
ResultSet res = state.executeQuery(query);
if (res.next()) {
System.out.println(res.getString(1));
}
Also, what are those two memory numbers (A+B)?
AM stands for Application Master, which is used when running Spark on YARN. There is a good explanation here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-applicationmaster.html
It's not clear why you need to run YARN on your single machine to test this out. You could run in standalone mode to remove the YARN overhead and test your Spark application code.
https://spark.apache.org/docs/latest/
The spark.*.memory and spark.yarn.executor.memoryOverhead properties need to be set when you deploy the Spark application. They cannot be set in those statements.
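On the two numbers in the message: in Spark's YARN client the message has the form "Required AM memory (A+B MB)", where A is the Application Master memory and B is its memory overhead. A quick calculation suggests that the values in the question are byte counts rather than megabytes. This is an inference from the arithmetic, not something the thread confirms.

```python
MIB = 1024 * 1024


def bytes_to_mib(n):
    """Convert a byte count to mebibytes."""
    return n / MIB


if __name__ == "__main__":
    am_memory = 471859200   # A, from the error message in the question
    am_overhead = 47185920  # B, from the error message in the question
    print(bytes_to_mib(am_memory))    # 450.0
    print(bytes_to_mib(am_overhead))  # 45.0
    print(am_overhead / am_memory)    # 0.1
```

Read that way, the request would be 450 MB plus a 10% overhead of 45 MB, while the text of the message treats the raw byte counts as megabytes.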
I am currently running spark-submit on the following environment:
Single node (RAM: 40GB, VCores: 8, Spark Version: 2.0.2, Python: 3.5)
My PySpark program basically reads one 450 MB unstructured file from HDFS. It then loops through each line, grabs the necessary data and places it in a list. Finally it uses createDataFrame and saves the DataFrame into a Hive table.
My pyspark program code snippet:
from pyspark.sql import SparkSession

sparkSession = (SparkSession
    .builder
    .master("yarn")
    .appName("FileProcessing")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate())

# collect() pulls every line of the file onto the driver
lines = sparkSession.read.text('/user/test/testfiles').collect()
for line in lines:
    # perform some data extraction and place it into rowList and colList
    # using normal Python operations (omitted here)
    pass
df = sparkSession.createDataFrame(rowList, colList)
df.registerTempTable("tempTable")
sparkSession.sql("create table test as select * from tempTable")
My spark-submit command is as follows:
spark-submit --master yarn --deploy-mode cluster --num-executors 2 --driver-memory 4g --executor-memory 8g --executor-cores 3 --files /usr/lib/spark-2.0.2-bin-hadoop2.7/conf/hive-site.xml FileProcessing.py
It took around 5 minutes to complete the processing. Is this performance considered good? How can I tune the executor memory and executor cores so that the process completes within 1-2 minutes? Is that possible?
Appreciate your response. Thanks.
To tune your application you need to know a few things:
1) You need to monitor your application to see whether your cluster is underutilized and how many resources your application is using.
Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.
2) Based on your observations of CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
From Spark's point of view:
In spark-defaults.conf you can specify what kind of serialization is used and how much driver memory and executor memory your application needs; you can even change the garbage collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
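The same properties can also be passed per job with spark-submit's --conf key=value option instead of editing spark-defaults.conf. The helper below is a hypothetical sketch (defaults_to_conf_args is not part of Spark) that converts lines in the format above into such arguments, assuming each non-comment line is a key followed by whitespace and a value.

```python
def defaults_to_conf_args(text):
    """Turn spark-defaults.conf style lines into spark-submit --conf arguments."""
    args = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, value = line.split(None, 1)  # key, then the rest of the line
        args += ["--conf", "%s=%s" % (key, value.strip())]
    return args


EXAMPLE = """\
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
"""

if __name__ == "__main__":
    print(defaults_to_conf_args(EXAMPLE))
```

Returning a list rather than one string keeps values that contain spaces, such as the extraJavaOptions entry, as a single argument when the list is handed to a process launcher.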
For more details, refer to http://spark.apache.org/docs/latest/tuning.html
Hope this helps!
I would like to tweak the Postgres config for use on a Windows server. Here is my current postgresql.conf file: http://pastebin.com/KpSi2zSd
I would like to increase work_mem and maintenance_work_mem, but if I raise the values above 1GB I get this error when starting the service:
Nothing is added to the log files (at least not in data\pg_log). How can I figure out what is causing the issue (e.g. by increasing logging)? Could this have anything to do with issues in memory management between Windows and Postgres?
Here are my server specs:
Windows Server 2012 R2 Datacenter (64 bit)
Intel CPU E5-2670 v2 @ 2.50 GHz
512 GB RAM
PostgreSQL 9.3
Under Windows the value for work_mem is limited to 2GB (even on a 64bit system) - there is no workaround as far as I know.
I don't know why you couldn't set it to 1GB though. Maybe the sum of work_mem and maintenance_work_mem has another limit I am not aware of.
Setting work_mem that high by default is usually not a good idea. With 512GB RAM and just 10 users this might work, but keep in mind that the amount of work_mem is requested by a statement for every sort, group or hash operation in a single query. So you could have a statement requesting this amount of memory 15 or 20 times.
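To make that concrete, here is a rough upper-bound calculation. It is only a sketch: it assumes every session runs one statement at the same time and that every sort, group or hash operation in it uses the full work_mem, which is a pessimistic simplification. The function name is made up for illustration.

```python
def worst_case_work_mem_gb(sessions, operations_per_query, work_mem_gb):
    """Upper bound if every session runs one query and every sort/hash
    operation in it uses the full work_mem at the same time."""
    return sessions * operations_per_query * work_mem_gb


if __name__ == "__main__":
    # 10 users and up to 20 operations per statement come from the text above;
    # 2 GB is the work_mem ceiling being discussed.
    print(worst_case_work_mem_gb(10, 20, 2))  # 400
```

In that worst case the sorts and hashes alone would ask for 400 GB of the 512 GB, before shared_buffers and the filesystem cache are counted.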
You don't need to change this in postgresql.conf - this can be changed dynamically if you know that the following query will benefit from a large work_mem, by running:
set session work_mem='2097151';
If you use a higher number, you'll get an error message telling you the limit:
ERROR: 2097152 is outside the valid range for parameter "work_mem" (64 .. 2097151)
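The numbers in that message are in kB, the unit work_mem uses when none is given. A quick check, as a small sketch with a made-up helper name, shows that the upper bound is exactly one kB short of 2 GB:

```python
def kb_to_gib(kb):
    """Convert a size in kB (1024 bytes) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    limit_kb = 2097151  # upper bound from the error message
    print(limit_kb + 1 == 2 * 1024 * 1024)  # True: the limit is 2 GiB minus 1 kB
    print(kb_to_gib(limit_kb))
```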
Even if Postgres isn't using all the memory, it still benefits from it. Postgres (unlike e.g. Oracle) relies heavily on the filesystem cache rather than doing all the caching itself. Values for shared_buffers beyond roughly 8GB rarely show any benefit.
What you do need to tell Postgres is how much memory the operating system usually uses for caching, by setting effective_cache_size to the appropriate value. Postgres doesn't use that for caching, but it influences the planner's choice to e.g. prefer an index scan over a seq scan if the index is likely to be in the file system cache.
You can see the current size of the file system cache in the Windows task manager (or e.g. ProcessExplorer)
As described above, on Windows it is more beneficial to rely on the OS cache.
If you use RAMMap from Sysinternals (Microsoft), you can see exactly which Postgres files are held in the OS cache, and hence how much of the data is actually cached.