Spark SQL performance with Simple Scans - performance

I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), and I am running my queries through Spark SQL using hive context.
After checking the performance tuning documents on the spark page and some clips from latest spark summit, I decided to set the following configs in my spark-env:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
(As my tasks tend to be long so the overhead of starting multiple JVMs, one per worker is much less than the total query times). As I monitor the job progress, I realized that while the Worker memory is 2.5GB, the executors (one per worker) have max memory of 512MB (which is default). I enlarged this value in my application as:
conf.set("spark.executor.memory", "2.5g");
Trying to give max available memory on each worker to its only executor, but I observed that my queries are running slower than the prev case (default 512MB). Changing 2.5g to 1g improved the performance time, it is close to but still worse than 512MB case. I guess what I am missing here is what is the relationship between the "WORKER_MEMORY" and 'executor.memory'.
Isn't it the case that WORKER tries to split this memory among its executors (in my case its only executor) ? Or there are other stuff being done worker which need memory ?
What other important parameters I need to look into and tune at this point to get the best response time out of my HW ? (I have read about Kryo serializer, and I am about trying that - I am mainly concerned about memory related settings and also knobs related to parallelism of my jobs). As an example, for a simple scan-only query, Spark is worse than Hive (almost 3 times slower) while both are scanning the exact same table & file format. That is why I believe I am missing some params by leaving them as defaults.
Any hint/suggestion would be highly appreciated.

Spark_worker_cores is shared across the instances. Increase the cores to say 8 - then you should see the kind of behavior (and performance) that you had anticipated.

Related

On DSE 5.0.5 (Cassandra 3.0.11) which settings would reduce read latencies with constant writes in the background?

We have a 5 node DSE cassandra cluster and an application whose job is to write asynchronously to keyspace A (which is based on a HDD), and read synchronously from keyspace B (which is on an SSD). Reads from table
Additional info:
The table on A is using TWCS with 48h windows, while the table on keyspace B is using LCS with default settings
Spark jobs partition reads in chunks of 20h at most
Both tables are using TDE with AES256 keys and 1KB chunks
Azul Zing is being used as the JVM with default settings apart from heap sizing and GC logging
With this scenario alone the read latencies from keyspace B are fine throughout the day, but everyday we have a spark job that will read from keyspace A and write to B. The moment the spark executors "attack" keyspace A, read latencies from keyspace B suffer a bit (99th percentil goes from 8-12ms to 130ms for a few seconds).
My question is, which cassandra.yaml properties would likely help the most on reducing the read latencies on keyspace B just for this moment the spark job starts? I've been trying different memtable/commitlog settings, but haven't been able to lower the read latency to acceptable levels
It’s hard to generalize without knowing why your latency hurts, if we could we’d bake those defaults into the database
However, I’ll try to guess
Throttle down concurrent reads so there are fewer concurrent requests - this will trade throughout for more consistent performance
if your disk is busy, consider smaller compression chunk sizes
if you’re seeing GC pauses, consider tuning your jvm - the Cassandra-8150 jira has some good suggestions
if your sstables-per-read is more than a few, reconsider your data model to keep your partitions from spanning multiple TWCS windows
make sure your key cache is enabled. If you can spare the heap, raise it, it may help.
Jeff's answer should be your starting point but if that doesn't solve it, consider changing your spark job to off-peak time. Keep in mind that LCS is optimized for read-heavy tables, but from the moment that spark starts to "migrate" the data, that table using LCS, will for some time (until the spark job finishes) become a write-heavy table. This would be an anti-pattern for LCS utilization. I can't know for sure without looking into servers details, but I would say that due to the sheer number of SSTables that are created during the spark job, LCS is not able to keep up with the compaction to maintain the standard read latency.
If you can't schedule the spark job at an off-peak time, then you should consider changing the compaction strategy in the keyspace B to STCS.

hbase skip region server to read rows directly from hfile

Am attempting to dump over 10 billion records into hbase which will
grow on average at 10 million per day and then attempt a full table
scan over the records. I understand that a full scan over hdfs will
be faster than hbase.
Hbase is being used to order the disparate data
on hdfs. The application is being built using spark.
The data is bulk-loaded onto hbase. Because of the various 2G limits, region size was reduced to 1.2G from an initial test of 3G (Still requires a bit more detail investigation).
scan cache is 1000 and cache blocks is off
Total hbase size is in the 6TB range, yielding several thousand regions across 5 region servers (nodes). (recommendation is low hundreds).
The spark job essentially runs across each row and then computes something based on columns within a range.
Using spark-on-hbase which internally uses the TableInputFormat the job ran in about 7.5 hrs.
In order to bypass the region servers, created a snapshot and used the TableSnapshotInputFormat instead. The job completed in abt 5.5 hrs.
Questions
When reading from hbase into spark, the regions seem to dictate the
spark-partition and thus the 2G limit. Hence problems with
caching Does this imply that region size needs to be small ?
The TableSnapshotInputFormat which bypasses the region severs and
reads directly from the snapshots, also creates it splits by Region
so would still fall into the region size problem above. It is
possible to read key-values from hfiles directly in which case the
split size is determined by the hdfs block size. Is there an
implementation of a scanner or other util which can read a row
directly from a hfile (to be specific from a snapshot referenced hfile) ?
Are there any other pointers to say configurations that may help to boost performance ? for instance the hdfs block size etc ? The main use case is a full table scan for the most part.
As it turns out this was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations for an ip address, namely InetAddress took a significant amount to resolve an ip address. We resolved to using the raw bytes to extract whatever we needed. This itself made the job finish in about 2.5 hours.
A modelling of the problem as a Map Reduce problem and a run on MR2 with the same above change showed that it could finish in about 1 hr 20 minutes.
The iterative nature and smaller memory footprint helped the MR2 acheive more parallelism and hence was way faster.

MemSQL performance issues

I have a single node MemSQL install with one master aggregator and two leaves (all on a single box). The machine has 2 cores, 16Gb RAM, and MemSQL columnstore data is ~7Gb (coming from 21Gb CSV). When running queries on the data, memory usage caps at ~2150Mb (11Gb sitting free). I've configured both leaves to have maximum_memory = 7000 in the memsql.cnf files for both nodes (memsql-optimize does similar). During query execution, the master aggregator sits at 100% CPU, with the leaves 0-8% CPU.
This does not seems like an efficient use of system resources, but I'm not sure what I can do to configure the system or MemSQL to make more efficient use of CPU or memory. Any help would be greatly appreciated!
If during query execution your machine is at 100% cpu (on all cores), it doesn't really matter which MemSQL node it is, your workload throughput is still bottlenecked on cpu. However for most queries you wouldn't expect most of the cpu use to be on the aggregator, so you may want to take a look at EXPLAIN or PROFILE of your queries.
Columnstore data is cached in memory as part of the OS file cache - it isn't counted as memory reserved by MemSQL, which is why your memory usage is less than the size of the columnstore data.
My database was coming from some other place than the current memsql install (perhaps an older cluster configuration) despite there only being a single memsql cluster on the machine. Looking at the Databases section in the Web UI was displaying no databases/tables, but my queries were succeeded with the expected answers.
drop database/reload from CSV managed to remedy the situation. All core threads are now used during query.

Spark Executors hang after Out of Memory

I have a spark application running on EMR (16 nodes, 1 master, 15 core, r3.2xlarge instances). For spark executor configuration, we use dynamic Allocation.
While loading the data into the RDD, I see that sometimes when there's a huge amount of data (700 Gb), then Spark runs Out of Memory, but it does not fail the App. Rather the app sits there hung. I'm not sure why this happens but here is my theory :-
We use dataframes which might be caching things.
The spark flag spark.dynamicAllocation.cachedExecutorIdleTimeout is set to infinity
My theory is that it might be caching things while creating dataframes but the cache is never relinquished and this leads to a Spark hang.
There are two solutions
Increase cluster size (worse case)
Figure out a way to add a timeout to Spark app.
Programatically kill the EMR step (could not find an API which does this)
Any leads about how to go about it ?
There could be two other possibilities. Either the partitions are too big, or you have sever skewness (size of partitions varies a lot).
Try to increase the number of partitions (anf hence, reduce their size) using repartition. This will randomly reshuffle the data throughout your executors (good to reduce skewness, but slow). Ideally, I like my partitions to be around 64Mo, depending on your machines.

Running parallel queries in Spark

How does spark handle concurrent queries? I have read a bit about spark and underlying RDD's but I am unable to understand how concurrent queries would be handled?
For example if I run a query which loads the data in memory and the entire available memory is consumed and at the same time someone else runs a query involving another set of data, how would spark allocate the memory to both the queries? Also what would be the impact if the priorities are taken into account.
Also can running lots of parallel queries would result in the machines hanging ?
Firstly Spark doesn't take the in-memory (RAM) more than threshold limit.
Spark tries to allocate the default in-memory to every job.
If there is insufficient memory for a new job then it tries to spill the in-memory content of LeastRecentlyUsed (LRU) RDD to disk and then allocates to new job.
Optionally you can also specify the storage of RDD like IN-MEMORY only, DISK only, MEMORY AND DISK etc..
Scenario: consider a low in-memory machine with huge no of jobs, then most of the RDDs will be placed in disk only, as per the above approach.
So, the jobs will continue to run but it will not take the advantage of Spark in-memory processing.
Spark does the memory allocation very intelligently.
If Spark used on top-of YARN then Resource manager also takes place in the resource allocation.

Resources