I have a pretty standard use case and need suggestions on how to improve this Spark (2.4) job:
Dataframe1 (df1) = 10M records and
Dataframe2 (df2) = 50M records
then: join df1 & df2
use window functions, etc.
Result Dataframe (df3) = 2B records
further process, i.e. filter and generate 5 different datasets from df3 (this is where the issue starts).
The issue I face: the initial few steps work fine in the notebook, but as soon as I reach df3, further processing gets really slow and eventually fails or gets killed.
What would be the best way to optimize this processing? So far I have tried:
an r4.xlarge cluster, and also an r5.16xlarge (500 GB memory) cluster (should I try other instance types like M4 or C4, or what would you suggest for this kind of processing?)
Spark conf used:
spark.conf.set("spark.executor.memory", "64g")
spark.conf.set("spark.driver.memory", "64g")
spark.conf.set("spark.executor.memoryOverhead", "24g")
spark.conf.set("spark.driver.memoryOverhead", "24g")
spark.conf.set("spark.executor.cores", "8")
spark.conf.set("spark.default.parallelism", 100)
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.sql.broadcastTimeout", "7200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
using cache() on df1, df2, and df3.
Once memory is used up I see disk spill, so I tried tuning GC using:
spark.conf.set("spark.driver.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
spark.conf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
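One thing to keep in mind: executor-level settings such as spark.executor.memory, spark.executor.cores and the memory overheads only take effect if they are supplied when the application is launched; calling spark.conf.set on an already-running notebook session leaves them unchanged. A minimal sketch of passing the same values at submit time (the script name is just a placeholder):
spark-submit \
  --driver-memory 64g \
  --executor-memory 64g \
  --executor-cores 8 \
  --conf spark.executor.memoryOverhead=24g \
  --conf spark.driver.memoryOverhead=24g \
  --conf spark.default.parallelism=100 \
  my_job.py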
The above steps didn't help much. Please suggest what config, memory, and cluster settings might help, or what other optimization techniques can be used here.
I have a parquet file of position data for vehicles that is indexed by vehicle ID and sorted by timestamp. I want to read the parquet file, do some calculations on each partition (not aggregations) and then write the output directly to a new parquet file of similar size.
I organized my data and wrote my code (below) to use Dask's map_partitions, as I understood this would perform the operations one partition at a time, saving each result to disk sequentially and thereby minimizing memory usage. I was surprised to find that it exceeded my available memory, whereas if I instead loop over the partitions one at a time and append each output to the new parquet file (see the second code block below), it easily fits within memory.
Is there something incorrect in the original way I used map_partitions? If not, why does it use so much more memory? What is the proper, most efficient way of achieving what I want?
Thanks in advance for any insight!!
Original (memory hungry) code:
import dask.dataframe as dd

ddf = dd.read_parquet(input_file)
meta_dict = ddf.dtypes.to_dict()
(
    ddf
    .map_partitions(my_function, meta=meta_dict)
    .to_parquet(
        output_file,
        append=False,
        overwrite=True,
        engine='fastparquet'
    )
)
Awkward looped (but more memory friendly) code:
ddf = dd.read_parquet(input_file)
for partition in range(ddf.npartitions):
    partition_df = ddf.partitions[partition]
    (
        my_function(partition_df)
        .to_parquet(
            output_file,
            append=True,
            overwrite=False,
            engine='fastparquet'
        )
    )
More hardware and data details:
The total input parquet file is around 5GB and is split into 11 partitions of up to 900MB. It is indexed by ID with divisions so I can do vehicle grouped operations without working across partitions. The laptop I'm using has 16GB RAM and 19GB swap. The original code uses all of both, while the looped version fits within RAM.
As #MichaelDelgado pointed out, by default Dask will spin up multiple workers/threads according to what is available on the machine. With the size of the partitions I have, this maxes out the available memory when using the map_partitions approach. To avoid this, I limited the number of workers and the number of threads per worker to prevent automatic parallelization using the code below, and the task fit in memory.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=1)
client = Client(cluster)
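Another knob that can help with this kind of memory pressure (a sketch, not part of the original post; the 200MB target is arbitrary) is to repartition into smaller pieces before mapping, so each task holds less data at once. Note that this can split one vehicle's rows across partitions, so it only fits if my_function does not rely on whole-vehicle partitions:
ddf = dd.read_parquet(input_file)
# break the ~900MB partitions into smaller chunks so each task's
# working set (its input plus my_function's output) stays small
ddf = ddf.repartition(partition_size="200MB")
(
    ddf
    .map_partitions(my_function, meta=ddf.dtypes.to_dict())
    .to_parquet(output_file, engine='fastparquet')
)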
I have read a lot about solving this kind of problem by setting yarn.scheduler.maximum-allocation-mb, which I have set to 2 GB, since I am currently only running select count(*) from <table>, which isn't a heavy computation, I guess. But what is Required AM memory (471859200+47185920 MB) supposed to mean? Other questions mention the same problem with values like (1024+2048).
I am setting it up on a single machine, i.e. my desktop, which has 4 GB RAM and 2 cores. Is this too low a spec to run Spark as the Hive execution engine?
Currently I am running this job from Java and my setup is:
Connection connect = DriverManager.getConnection("jdbc:hive2://saurab:10000/default", "hiveuser", "hivepassword");
Statement state = connect.createStatement();
state.execute("SET hive.execution.engine=spark");
state.execute("SET spark.executor.memory=1g");
state.execute("SET spark.yarn.executor.memoryOverhead=512m");
yarn-site.xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- this value is specified in MB; "3g" is not valid here -->
  <value>3072</value>
</property>
And a simple query
String query = "select count(*) from sales_txt";
ResultSet res = state.executeQuery(query);
if (res.next()) {
    // count(*) comes back as the first (and only) column
    System.out.println(res.getString(1));
}
Also, what are those two memory numbers (A+B)?
AM stands for Application Master, used when running Spark on YARN. There is a good explanation here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/yarn/spark-yarn-applicationmaster.html
It's not clear why you need to run YARN on your single machine to test this out. You could run in standalone mode to remove the YARN overhead and test your Spark application code.
https://spark.apache.org/docs/latest/
The spark.*.memory and spark.yarn.executor.memoryOverhead settings need to be set when you deploy the Spark application; they cannot be set in those statements.
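One place to put them instead (a rough sketch, with the values copied from the question) is hive-site.xml, so they are picked up when Hive launches the Spark application:
<property>
  <name>spark.executor.memory</name>
  <value>1g</value>
</property>
<property>
  <name>spark.yarn.executor.memoryOverhead</name>
  <!-- overhead is specified in MB -->
  <value>512</value>
</property>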
I am currently running spark-submit on the following environment:
Single node (RAM: 40GB, VCores: 8, Spark Version: 2.0.2, Python: 3.5)
My PySpark program basically reads one 450 MB unstructured file from HDFS, then loops through each line, grabs the necessary data, and places it into a list. Finally it uses createDataFrame and saves the data frame into a Hive table.
My pyspark program code snippet:
from pyspark.sql import SparkSession

sparkSession = (SparkSession
    .builder
    .master("yarn")
    .appName("FileProcessing")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate())

# collect() pulls every line of the file back to the driver,
# so the loop below runs single-threaded on the driver
lines = sparkSession.read.text('/user/test/testfiles').collect()

for line in lines:
    # perform some data extracting and place it into rowList and colList
    # using normal Python operations
    ...

df = sparkSession.createDataFrame(rowList, colList)
df.registerTempTable("tempTable")
sparkSession.sql("create table test as select * from tempTable")
My spark-submit command is as the following:
spark-submit --master yarn --deploy-mode cluster --num-executors 2 --driver-memory 4g --executor-memory 8g --executor-cores 3 --files /usr/lib/spark-2.0.2-bin-hadoop2.7/conf/hive-site.xml FileProcessing.py
It took around 5 minutes to complete the processing. Is that performance considered good? How can I tune the executor memory and executor cores so that the process completes within 1-2 minutes? Is that possible?
Appreciate your response. Thanks.
For tuning your application you need to know a few things:
1) You need to monitor your application: whether the cluster is under-utilized, and how many resources the application you created actually uses.
Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory, and network usage.
2) Based on those observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
From the Spark point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and even change the garbage collection algorithm.
Below are a few examples; tune these parameters based on your requirements.
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
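The same settings can also be passed per job on the spark-submit command line instead of globally in spark-defaults.conf (a sketch reusing the values above with the question's script name):
spark-submit \
  --driver-memory 5g \
  --executor-memory 3g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=2G -XX:+UseG1GC" \
  FileProcessing.py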
For more details, refer to http://spark.apache.org/docs/latest/tuning.html
Hope this helps!!
I have a Neo4j Enterprise database running on a DigitalOcean VPS with 8 GB RAM and an 80 GB SSD.
The performance of the Neo4j instance is awful at the moment:
match (n) where n.gram='0gram' AND n.word=~'a.' return n.word LIMIT 5 # 349ms
match (n) where n.gram='0gram' AND n.word=~'a.*' return n.word LIMIT 25 # 1588ms
I understand regexes are expensive, but on similar queries where I replace the 'a.' or 'a.*' part with any other letter, Neo4j simply crashes. I can see a huge build-up in memory before that (towards 90%), and the CPU skyrocketing.
My Neo4j is populated as follows:
Number Of Relationship Type Ids In Use: 1,
Number Of Node Ids In Use: 172412046,
Number Of Relationship Ids In Use: 172219328,
Number Of Property Ids In Use: 344453742
The VPS only runs Neo4j (on Debian 7/amd64). I use the NUMA+parallelGC flags as they're supposed to be faster. I've been tweaking my RAM settings, and although it doesn't crash as often now, I have a feeling there are still gains to be made:
neostore.nodestore.db.mapped_memory=1024M
neostore.relationshipstore.db.mapped_memory=2048M
neostore.propertystore.db.mapped_memory=6144M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
# caching
cache_type=hpc
node_cache_array_fraction=7
relationship_cache_array_fraction=5
# node_cache_size=3G
# relationship_cache_size=1G --> these throw a not-enough-heap-mem error
The data is essentially a series of trees, where on node0 only a full-text search is needed; the nodes below it are searched by a property with floating-point values.
node0 -REL-> node0.1 -REL-> node0.1.1 ... node0.1.1.1.1
\
-REL-> node0.2 -REL-> node0.2.1 ... node0.2.1.1
There are approx. 5,000 top nodes like node0.
Should I reconfigure my memory/cache usage, or should I just add more RAM?
--- Edit on Indexes ---
Because all trees of nodes are always 4 levels deep, each level has a label for quick lookup. In this case all node0 nodes have a label (called 0gram). The n.gram='0gram' should use the index coupled to that label.
--- Edit on new Config ---
I upgraded the VPS to 16 GB. On the SSD the nodeStore takes 2.3 GB (11%), the propertyStore 13.8 GB (64%), and the relationshipStore 5.6 GB (26%).
On this basis I created a new config (detailed above).
I'm waiting for the full set of queries and will do some additional testing in the meantime.
Yes, you need to create an index. What's your label called? Imagine it's called :NGram:
create index on :NGram(gram);
match (n:NGram) where n.gram='0gram' AND n.word=~'a.' return n.word LIMIT 5
match (n:NGram) where n.gram='0gram' AND n.word=~'a.*' return n.word LIMIT 25
What you're doing is not a graph search but just a lookup via a full scan plus a property comparison with a regexp, which is not a very efficient operation. What you need is full-text search (which is not supported by the new schema indexes, but still is by the legacy indexes).
Could you run this query (after you created the index) and say how many nodes it returns?
match (n:NGram) where n.gram='0gram' return count(*)
which is the equivalent to
match (n:NGram {gram:'0gram'}) return count(*)
I wrote a blog post about it a few days ago, please read it and see if it applies to your case.
How big is your Neo4j database on disk?
What is the configured heap size? (in neo4j-wrapper.conf?)
As you can see, you use more RAM than your machine has (not even counting the OS or filesystem caches).
So you would have to reduce the mmio sizes, e.g. to 500M for nodes, 2G for relationships, and 1G for properties.
Look at your store-file sizes and set mmio accordingly.
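In neo4j.properties the reduced sizes suggested above would look roughly like this (a sketch; scale the numbers to your actual store-file sizes):
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=2048M
neostore.propertystore.db.mapped_memory=1024M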
Depending on the number of nodes having n.gram='0gram', you might benefit a lot from setting a label on them and creating an index on the gram property. With that in place, an index lookup will directly return all 0gram nodes and apply the regex matching only to those. Your current statement loads each and every node from the db and inspects its properties.
We are now using Solr 1.4 in master/slave mode and want to improve query performance on the slave.
The biggest issue for us is that the index file is about 30 GB.
The slave server config is as follows:
Dell PC server: 48 GB memory and 2 CPUs;
RedHat 64-bit Linux;
64-bit JDK 1.6.0_22;
Tomcat 6.18.
Our current JAVA_OPTS is "-Xms2048M -Xmx20480M -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=20 -XX:SurvivorRatio=2"
Do you have more suggestion for JAVA_OPTS?
The JAVA_OPTS seem fine. Quite a few questions:
Is your 20 GB max heap peaking out? Can you check the memory stats to see what the maximum utilized is?
Is there any heavy processing happening on the slave? CPU stats?
What are the queries like? Are you using highlighting?
What is the number of results you are returning for a single query?
What do your cache stats say? Are they being utilized properly?
Is your index optimized?
Do you use warming queries to improve performance of the slow-running queries?
If the above seems fine, can you consider enabling HTTP caching? (A sketch follows below.)
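A rough solrconfig.xml sketch of what enabling HTTP caching looks like (the attribute values are illustrative, not tuned for this setup):
<requestDispatcher handleSelect="true">
  <httpCaching never304="false" lastModifiedFrom="openTime" etagSeed="Solr">
    <!-- let clients and proxies cache responses for 30 seconds -->
    <cacheControl>max-age=30, public</cacheControl>
  </httpCaching>
</requestDispatcher>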
Also, use the following JVM opts:
-XX:+UseCompressedOops
(This will help in reducing the heap size)
-XX:+DoEscapeAnalysis
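Folded into the existing JAVA_OPTS from the question, that would look something like this (heap sizes left as they were):
JAVA_OPTS="-Xms2048M -Xmx20480M -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:ParallelGCThreads=20 -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+DoEscapeAnalysis"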