Spark cache inefficiency - caching

I have quite powerful cluster with 3 nodes each 24 cores and 96gb RAM = 288gb total. I try to load 100gb of tsv files into Spark cache and do series of simple computation over data, like sum(col20) by col2-col4 combinations. I think it's clear scenario for cache usage.
But during Spark execution I found out that cache NEVER load 100% of data despite plenty of RAM space. After 1 hour of execution I have 70% of partitions in cache and 75gb cache usage out of 170gb available. It's looks like Spark somehow limit number of blocks/partitions it adds to cache instead to add all at very first action and have a great performance from very beginning.
I use MEMORY_ONLY_SER / Kryo ( cache size appr. 110% of on-disk data size )
Does someone have similar experience or know some Spark configs / environment conditions that could cause this caching behaviour ?

So, "problem" was solved with further reducing of split size. With mapreduce.input.fileinputformat.split.maxsize set to 100mb I got 98% cache load after 1st action finished, and 100% at 2nd action.
Other thing that worsened my results was spark.speculation=true - I try to avoid long-running tasks with that, but speculation management creates big performance overhead, and is useless for my case. So, just left default value for spark.speculation ( false )
My performance comparison for 20 queries are as following:
- without cache - 160 minutes ( 20 times x 8 min, reload each time 100gb from disk to memory )
- cache - 33 minutes total - 10m to load cache 100% ( during first 2 queries ) and 18 queries x 1.5 minutes each ( from in-memory Kryo-serialized cache )

Related

How to determine what causes ES's query API instability

Normally, my ES query API takes less than 1s.But sometimes these queries get slow.
cluster consists of three 32G machines (16G allocated to ES).The index consists of 20 primaries and 1 replica, 303,000,000 dos count and 500gb primaries storage size and 1tb storage size.
Here's kibana's monitoring data:
`
Personally, I think it's the result of GC. I want to add machines.But I need to find a reason to convince my leader.
Yes it could be a GC problem. But can you be more specific? What do you mean by slow?
Anyway it seems the allocated heap is way too large for your needs. You have a collection when the heap is at 12Go ( 75% of 16go ) and it goes back to 5go every time. Its generate huge garbage collection.
You should try to lower the heap to like 10Go and check the impact on performance GC count and GC duration.
I recommands you too read this article https://www.elastic.co/blog/a-heap-of-trouble especially the "Together We Can Prevent Forest Fires" part.

How to increase greenplum concurrency and # query per sec

We have a fairly big Greenplum v4.3 cluster. 18 hosts, each host has 3 segment nodes. Each host has approx 40 cores and 60G memory.
The table we have is 30 columns wide, which has 0.1 billion rows. The query we are testing has 3-10 secs response time when there is no concurrency pressure. As we increase the # of queries we fired in parallel, the latency is decreasing from avg 3 secs to 50ish secs as expected.
But we've found that regardless how many queries we fired in parallel, we only have like very low QPS(query per sec), almost just 3-5 queries/sec. We've set the max_memory=60G, memory_limit=800MB, and active_statments=100, hoping the CPU and memory can be highly utilized, but they are still poorly used, like 30%-40%.
I have a strong feeling, we tried to feed up the cluster in parallel badly, hoping to take the best out of the CPU and Memory utilization. But it doesn't work as we expected. Is there anything wrong with the settings? or is there anything else I am not aware of?
There might be multiple reasons for such behavior.
Firstly, every Greenplum query uses no more than one processor core on one logical segment. Say, you have 3 segments on every node with 40 physical cores. Running two parallel queries will utilize maximum 2 x 3 = 6 cores on every node, so you will need about 40 / 6 ~= 6 parallel queries to utilize all of your CPUs. So, maybe for your number of cores per node its better to create more segments (gpexpand can do this). By the way, are the tables that used in the queries compressed?
Secondly, it may be a bad query. If you will provide a plan for the query, it may help to understand. There some query types in Greenplum that may have master as a bottleneck.
Finally, that might be some bad OS or blockdev settings.
I think this document page Managing Resources might help you mamage your resources
You can use Resource Group limit/controll your resource especialy concurrency attribute(The maximum number of concurrent transactions, including active and idle transactions, that are permitted in the resource group).
Resouce queue help limits ACTIVE_STATEMENTS
Note: The ACTIVE_STATEMENTS will be the total statement current running, when you have 50s cost queries and next incoming queries, this could not be working, mybe 5 * 50 is better.
Also, you need config memory/CPU settings to enable your query can be proceed.

How much load can cassandra handle on m1.xlarge instance?

I setup 3 nodes of Cassandra (1.2.10) cluster on 3 instances of EC2 m1.xlarge.
Based on default configuration with several guidelines included, like:
datastax_clustering_ami_2.4
not using EBS, raided 0 xfs on ephemerals instead,
commit logs on separate disk,
RF=3,
6GB heap, 200MB new size (also tested with greater new size/heap values),
enhanced limits.conf.
With 500 writes per second, the cluster works only for couple of hours. After that time it seems like not being able to respond because of CPU overload (mainly GC + compactions).
Nodes remain Up, but their load is huge and logs are full of GC infos and messages like:
ERROR [Native-Transport-Requests:186] 2013-12-10 18:38:12,412 ErrorMessage.java (line 210) Unexpected exception during request java.io.IOException: Broken pipe
nodetool shows many dropped mutations on each node:
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 7
BINARY 0
READ 2
MUTATION 4072827
_TRACE 0
REQUEST_RESPONSE 1769
Is 500 wps too much for 3-node cluster of m1.xlarge and I should add nodes? Or is it possible to further tune GC somehow? What load are you able to serve with 3 nodes of m1.xlarge? What are your GC configs?
Cassandra is perfectly able to handle tens of thousands small writes per second on a single node. I just checked on my laptop and got about 29000 writes/second from cassandra-stress on Cassandra 1.2. So 500 writes per second is not really an impressive number even for a single node.
However beware that there is also a limit on how fast data can be flushed to disk and you definitely don't want your incoming data rate to be close to the physical capabilities of your HDDs. Therefore 500 writes per second can be too much, if those writes are big enough.
So first - what is the average size of the write? What is your replication factor? Multiply number of writes by replication factor and by average write size - then you'll approximately know what is required write throughput of a cluster. But you should take some safety margin for other I/O related tasks like compaction. There are various benchmarks on the Internet telling a single m1.xlarge instance should be able to write anywhere between 20 MB/s to 100 MB/s...
If your cluster has sufficient I/O throughput (e.g. 3x more than needed), yet you observe OOM problems, you should try to:
reduce memtable_total_space_mb (this will cause C* to flush smaller memtables, more often, freeing heap earlier)
lower write_request_timeout to e.g. 2 seconds instead of 10 (if you have big writes, you don't want to keep too many of them in the incoming queues, which reside on the heap)
turn off row_cache (if you ever enabled it)
lower size of the key_cache
consider upgrading to Cassandra 2.0, which moved quite a lot of things off-heap (e.g. bloom filters and index-summaries); this is especially important if you just store lots of data per node
add more HDDs and set multiple data directories, to improve flush performance
set larger new generation size; I usually set it to about 800M for a 6 GB heap, to avoid pressure on the tenured gen.
if you're sure memtable flushing lags behind, make sure sstable compression is enabled - this will reduce amount of data physically saved to disk, at the cost of additional CPU cycles

UDF optimization in Hadoop

I testing my UDF on Windows virtual machine with 8 cores and 8 GB RAM. I have created 5 files of 2 GB about and run the pig script after modifying "mapred.tasktracker.map.tasks.maximum".
The following runtime and statistics:
mapred.tasktracker.map.tasks.maximum = 2
duration = 20 min 54 sec
mapred.tasktracker.map.tasks.maximum = 4
duration = 13 min 38 sec and about 30 sec for task
35% better
mapred.tasktracker.map.tasks.maximum = 8
duration = 12 min 44 sec and about 1 min for task
only 7% better
Why such a small improvement when changing settings? any ideas? Job was divided into 145 tasks.
![4 slots][1]
![8 slots][2]
Couple of observations:
I imagine your windows machine only has a single disk backing this VM - so there is a limit to how much data you can read off disk at any one time (and write back for the spills). By increasing the task slots, your effectively driving up the read / write demands on your disk (and a more disk thrashing too potentially). If you have multiple disks backing your VM (and not virtual disks all on the same physical disk, i mean virtual disks backed by different physical disks), you would probably see a performance increase over what you've already seen.
By adding more map slots, you've reduced the amount of assignment waves that the Job Tracker needs to do - and each wave has a polling overhead (TT polling the jobs, JT polling the TTs and assigning new tasks to free slots). A 2 slot TT vs 8 slot TT will mean that you have 145/2=~73 assignment waves (if all tasks ran in equal time - obviously not realistic) vs 145/8=~19 waves - thats a ~3x increase in the amount of polling needed to be done (and it all adds up).
mapred.tasktracker.map.tasks.maximum configures the maximum number of map tasks that will be run simultaneously by a task tracker. There is a practical hardware limit to how many tasks a single node can run at a time. So there will be diminishing returns when you keep increasing this number.
For example, say the tasktracker node has 8 cores. Say 4 cores are being used by processes other than the tasktracker. That leaves 4 cores for the mapred tasks. So your task time will improve from mapred.tasktracker.map.tasks.maximum = 1 to 4, but after that, it would just remain static because the other tasks will just be waiting. In fact, if you increase it too much, the contention and context switching might make it slower. The recommended value for this parameter is the No. of CPU cores - 1

MySQL query caching: limited to a maximum cache size of 128 MB?

My application is very database intensive so I've tried really hard to make sure the application and the MySQL database are working as efficiently as possible together.
Currently I'm tuning the MySQL query cache to get it in line with the characteristics of queries being run on the server.
query_cache_size is the maximum amount of data that may be stored in the cache and query_cache_limit is the maximum size of a single resultset in the cache.
My current MySQL query cache is configured as follows:
query_cache_size=128M
query_cache_limit=1M
tuning-primer.sh gives me the following tuning hints about the running system:
QUERY CACHE
Query cache is enabled
Current query_cache_size = 128 M
Current query_cache_used = 127 M
Current query_cache_limit = 1 M
Current Query cache Memory fill ratio = 99.95 %
Current query_cache_min_res_unit = 4 K
However, 21278 queries have been removed from the query cache due to lack of memory
Perhaps you should raise query_cache_size
MySQL won't cache query results that are larger than query_cache_limit in size
And mysqltuner.pl gives the following tuning hints:
[OK] Query cache efficiency: 31.3% (39K cached / 125K selects)
[!!] Query cache prunes per day: 2300654
Variables to adjust:
query_cache_size (> 128M)
Both tuning scripts suggest that I should raise the query_cache_size. However, increasing the query_cache size over 128M may reduce performance according to mysqltuner.pl (see http://mysqltuner.pl/).
How would you tackle this problem? Would you increase the query_cache_size despite mysqltuner.pl's warning or try to adjust the querying logic in some way? Most of the data access is handled by Hibernate, but quite a lot of hand-coded SQL is used in the application as well.
The warning issued by mysqltuner.py is actually relevant even if your cache has no risk of being swapped.
It is well-explained in the following:
http://blogs.oracle.com/dlutz/entry/mysql_query_cache_sizing
Basically MySQL spends more time grooming the cache the bigger the cache is and since the cache is very volatile under even moderate write loads (queries gets cleared often), putting it too large will have an adverse effect on your application performance. Tweak the query_cache_size and query_cache_limit for your application, try finding a breaking point where you have most hits per insert, a low number of lowmem_prunes and keep a close eye on your database servers load while doing so too.
Usually "too big cache size" warnings are issued under assumption that you have few physical memory and the cache itself well need to be swapped or will take resources that are required by the OS (like file cache).
If you have enough memory, it's safe to increase query_cache size (I've seen installations with 1GB query cache).
But are you sure you are using the query cache right? Do have lots of verbatim repeating queries? Could you please post the example of a typical query?
You should be easy on increasing your cache, it is not only a "not that much available mem" thing!
Reading for instance the manual you get this quote:
Be cautious about sizing the query cache excessively large, which increases the overhead required to maintain the cache, possibly beyond the benefit of enabling it. Sizes in tens of megabytes are usually beneficial. Sizes in the hundreds of megabytes might not be.
There are various other sources you can check out!
A non-zero prune rate may be an indication that you should increase the size of your query cache. However, keep in mind that the overhead of maintaining the cache is likely to increase with its size, so do this in small increments and monitor the result. If you need to dramatically increase the size of the cache to eliminate prunes, there is a good chance that your workload is not a good match for the query cache.
So don't just put as much as you can in that query cache!
The best thing, would be to gradually increase the query cache and measure performance on your site. It's some sort of default in performance questions, but in cases like this 'testing' is one of the best things you can do.
Be careful with setting the query_cache_size and limit to high. MySQL only uses a single thread to read from the query cache.
With the query_cache_size set to 4G and query_cache_limit 12M we had a query cache rate of 85% but noticed a recurring spikes in connections.
After changing the query_cache_size to 256M with 64K query_cache_limit the query cache ratio dropped to 50% but the overall performance increased.
Overhead for Query cache is around 10% so I would disable query caching. Usually if you can't get your hit rate over 40 or 50 % maybe query cache isn't right for your database.
I've blog about this topic... Mysql query_cache_size performance here.
Query Cache gets invalidated/flush every time there is an insert, Use InnoDB/cache and avoid query cache or set it to a very small value.

Resources