Hadoop YARN - Job consuming more resources than user limit factor

I've got a default queue in YARN with the following configuration:
Capacity: 70%
Max Capacity: 90%
User Limit Factor: 0.08
Minimum User Limit: 8%
Maximum Applications: Inherited
Maximum AM Resource: Inherited
Priority: 0
Ordering Policy: Fair
Maximum Application Lifetime: -1
Default Application Lifetime: -1
Enable Size Based Weight Ordering: Disabled
Maximum Allocation Vcores: Inherited
Maximum Allocation Mb: Inherited
But even with the User Limit Factor of 0.08 and Minimum User Limit of 8%, there are jobs running with more than 8% of the resources of the queue, as you can see below:
How is this even possible? Is the User Limit Factor/Minimum User Limit not working? Is there any other configuration I should be aware of?

A minimum user limit of 8% means that if 12 users submit applications at the same time, each of them is guaranteed roughly 8% of the queue's resources so that they all get an equal share; it is a guaranteed floor when users compete, not a per-job cap.
However, in your case the job start times suggest that the applications submitted first simply had more cluster resources available to them, because the jobs did not arrive at the same time. A good explanation of these parameters can be found here and here.
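For reference, both knobs live in capacity-scheduler.xml. A minimal sketch, assuming the queue path is root.default (an assumption; adjust it to your queue hierarchy) and simply mirroring the values quoted in the question:
<!-- capacity-scheduler.xml (sketch; queue path root.default is an assumption) -->
<property>
  <name>yarn.scheduler.capacity.root.default.minimum-user-limit-percent</name>
  <value>8</value>   <!-- minimum share guaranteed to each active user, as a percent of the queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>0.08</value>   <!-- multiple of the queue capacity a single user may consume -->
</property>
After editing the file, the queue configuration is typically reloaded with yarn rmadmin -refreshQueues.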

Related

HBase: Why are there evicted blocks before the max size of the BlockCache is reached?

I am currently using a stock configuration of Apache HBase, with RegionServer heap at 4G and BlockCache sizing at 40%, so around 1.6G. No L2/BucketCache configured.
Here are the BlockCache metrics after ~2K requests to the RegionServer. As you can see, some blocks were already evicted, probably leading to some of the misses.
Why were they evicted when we aren't even close to the limit?
Size: 2.1 M - Current size of block cache in use (bytes)
Free: 1.5 G - The total free memory currently available to store more cache entries (bytes)
Count: 18 - Number of blocks in block cache
Evicted: 14 - The total number of blocks evicted
Evictions: 1,645 - The total number of times an eviction has occurred
Mean: 10,984 - Mean age of blocks at eviction time (seconds)
StdDev: 5,853,922 - Standard deviation for age of blocks at eviction time
Hits: 1,861 - Number of requests that were cache hits
Hits Caching: 1,854 - Cache hit block requests, but only requests set to cache the block on a miss
Misses: 58 - Block requests that were cache misses but set to cache missed blocks
Misses Caching: 58 - Block requests that were cache misses, but only requests set to use the block cache
Hit Ratio: 96.98% - Hit count divided by total request count
What you are seeing is the effect of the LRU treating blocks with three levels of priority: single-access, multi-access, and in-memory. For the default L1 LruBlockCache class their share of the cache can be set with (default values in brackets):
hbase.lru.blockcache.single.percentage (25%)
hbase.lru.blockcache.multi.percentage (50%)
hbase.lru.blockcache.memory.percentage (25%)
For your example of a 4 GB heap with 40% set aside for the cache, you have a 1.6 GB cache, which is further divided into 400 MB, 800 MB, and 400 MB for each priority level, based on the above percentages.
When a block is loaded from storage it is usually flagged as single-access, unless the column family it belongs to is configured with IN_MEMORY = true, which sets its priority to in-memory. If a single-access block is requested again by another read, it is promoted to multi-access priority.
The LruBlockCache has an internal eviction thread that runs every 10 seconds and checks whether the blocks for each level together exceed their allowed percentage. Now, if you scan a larger table once, and assuming the cache was completely empty, all of the blocks are tagged single-access. If the table was 1 GB in size, you have loaded 1 GB into a 400 MB cache space, which the eviction thread then reduces in due course. In fact, depending on how long the scan takes, the eviction thread's 10-second interval will elapse while the scan is still running, and it will start evicting blocks as soon as you exceed the 25% threshold.
The eviction will first evict blocks from the single-access area, then the multi-access area, and finally, if there is still pressure on the heap, from the in-memory area. That is also why you should make sure your working set for in-memory flagged column families is not exceeding the configured cache area.
What can you do? If you have mostly single-access blocks, you could tweak the above percentages to give more to the single-access area of the LRU.
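A minimal sketch of how those ratios could be shifted in hbase-site.xml; the 0.50/0.25/0.25 split below is purely illustrative, not a recommendation, and the three values should still sum to about 1.0:
<!-- hbase-site.xml (illustrative values; defaults are 0.25 / 0.50 / 0.25) -->
<property>
  <name>hbase.lru.blockcache.single.percentage</name>
  <value>0.50</value>   <!-- give more room to single-access blocks, e.g. for scan-heavy workloads -->
</property>
<property>
  <name>hbase.lru.blockcache.multi.percentage</name>
  <value>0.25</value>
</property>
<property>
  <name>hbase.lru.blockcache.memory.percentage</name>
  <value>0.25</value>
</property>
RegionServers typically need a restart to pick up the change.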

Cassandra key cache optimization

I want to optimize the key cache in Cassandra. I know about key_cache_size_in_mb: the capacity in megabytes of all key caches on the node. While increasing it, which stats do I need to look at in order to determine whether the increase is actually benefiting the system?
Currently with the default settings I am getting
Key Cache : entries 20342, size 17.51 MB, capacity 100 MB, 4806 hits, 29618 requests, 0.162 recent hit rate, 14400 save period in seconds.
I have opsCenter up and running too.
Thanks
Look at the number of hits and the recent hit rate.
http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra
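A rough sketch of the workflow, assuming you simply raise key_cache_size_in_mb in cassandra.yaml (the 200 MB figure is only an example) and then watch the same stats you quoted:
# cassandra.yaml (example value, not a recommendation)
key_cache_size_in_mb: 200

# after a restart and some normal traffic, re-check the cache line with:
#   nodetool info
If the recent hit rate barely moves while the entry count stays well below capacity, a larger key cache probably isn't what the system needs.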

How much load can cassandra handle on m1.xlarge instance?

I set up a 3-node Cassandra (1.2.10) cluster on 3 EC2 m1.xlarge instances.
It is based on the default configuration with several guidelines applied, such as:
datastax_clustering_ami_2.4
not using EBS; RAID 0 xfs on ephemeral disks instead,
commit logs on a separate disk,
RF=3,
6GB heap, 200MB new size (also tested with larger new size/heap values),
enhanced limits.conf.
With 500 writes per second, the cluster works for only a couple of hours. After that it seems unable to respond because of CPU overload (mainly GC + compactions).
Nodes remain Up, but their load is huge and the logs are full of GC info and messages like:
ERROR [Native-Transport-Requests:186] 2013-12-10 18:38:12,412 ErrorMessage.java (line 210) Unexpected exception during request java.io.IOException: Broken pipe
nodetool shows many dropped mutations on each node:
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 7
BINARY 0
READ 2
MUTATION 4072827
_TRACE 0
REQUEST_RESPONSE 1769
Is 500 wps too much for a 3-node cluster of m1.xlarge instances, and should I add nodes? Or is it possible to tune GC further somehow? What load are you able to serve with 3 m1.xlarge nodes? What are your GC configs?
Cassandra is perfectly able to handle tens of thousands of small writes per second on a single node. I just checked on my laptop and got about 29000 writes/second from cassandra-stress on Cassandra 1.2. So 500 writes per second is not really an impressive number, even for a single node.
However, beware that there is also a limit on how fast data can be flushed to disk, and you definitely don't want your incoming data rate to be close to the physical capabilities of your HDDs. Therefore 500 writes per second can be too much, if those writes are big enough.
So first: what is the average size of a write? What is your replication factor? Multiply the number of writes by the replication factor and by the average write size, and you'll know approximately what write throughput your cluster requires. You should also leave some safety margin for other I/O-related tasks like compaction. Various benchmarks on the Internet suggest a single m1.xlarge instance should be able to write anywhere between 20 MB/s and 100 MB/s...
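As a purely hypothetical example (the write sizes are assumptions, not measurements from the question): 500 writes/s x RF 3 x 1 KB per write is roughly 1.5 MB/s for the whole cluster, about 0.5 MB/s per node, which is trivial for an m1.xlarge. With 100 KB writes, the same rate becomes roughly 150 MB/s cluster-wide (about 50 MB/s per node), which is already in the range where flushes and compaction can start to fall behind.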
If your cluster has sufficient I/O throughput (e.g. 3x more than needed), yet you observe OOM problems, you should try the following (a sketch of where these settings live appears after the list):
reduce memtable_total_space_in_mb (this will cause C* to flush smaller memtables, more often, freeing heap earlier)
lower write_request_timeout to e.g. 2 seconds instead of 10 (if you have big writes, you don't want to keep too many of them in the incoming queues, which reside on the heap)
turn off row_cache (if you ever enabled it)
lower size of the key_cache
consider upgrading to Cassandra 2.0, which moved quite a lot of things off-heap (e.g. bloom filters and index-summaries); this is especially important if you just store lots of data per node
add more HDDs and set multiple data directories, to improve flush performance
set a larger new generation size; I usually set it to about 800M for a 6 GB heap, to avoid pressure on the tenured generation.
if you're sure memtable flushing lags behind, make sure sstable compression is enabled - this will reduce the amount of data physically saved to disk, at the cost of additional CPU cycles
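A sketch of where those knobs live for Cassandra 1.2; every value below is illustrative rather than a recommendation, and should be sized against your own write volume and heap:
# cassandra.yaml (illustrative values)
memtable_total_space_in_mb: 1024        # flush smaller memtables more often, freeing heap earlier
write_request_timeout_in_ms: 2000       # fail fast instead of queueing big writes on the heap
row_cache_size_in_mb: 0                 # keep the row cache off
key_cache_size_in_mb: 50                # shrink the key cache if heap is tight
data_file_directories:                  # one entry per physical disk improves flush throughput
    - /mnt/disk1/cassandra/data         # hypothetical paths
    - /mnt/disk2/cassandra/data

# cassandra-env.sh
# MAX_HEAP_SIZE="6G"
# HEAP_NEWSIZE="800M"                   # larger new gen to reduce promotion into the tenured gen
Sstable compression (the last item in the list) is set per table rather than in cassandra.yaml.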

UDF optimization in Hadoop

I am testing my UDF on a Windows virtual machine with 8 cores and 8 GB RAM. I created 5 files of about 2 GB each and ran the Pig script after modifying "mapred.tasktracker.map.tasks.maximum".
The runtimes and statistics were as follows:
mapred.tasktracker.map.tasks.maximum = 2
duration = 20 min 54 sec
mapred.tasktracker.map.tasks.maximum = 4
duration = 13 min 38 sec and about 30 sec per task
35% better
mapred.tasktracker.map.tasks.maximum = 8
duration = 12 min 44 sec and about 1 min per task
only 7% better
Why such a small improvement when changing the setting? Any ideas? The job was divided into 145 tasks.
[screenshots: job statistics with 4 map slots and with 8 map slots]
A couple of observations:
I imagine your Windows machine has only a single disk backing this VM, so there is a limit to how much data you can read off disk (and write back for the spills) at any one time. By increasing the task slots, you're effectively driving up the read/write demands on that disk, and potentially causing more disk thrashing. If you had multiple disks backing your VM (not virtual disks all on the same physical disk; I mean virtual disks backed by different physical disks), you would probably see a performance increase over what you've already seen.
By adding more map slots, you've also reduced the number of assignment waves that the JobTracker needs to do, and each wave has a polling overhead (TTs polling for jobs, the JT polling the TTs and assigning new tasks to free slots). A 2-slot TT vs an 8-slot TT means roughly 145/2 = ~73 assignment waves (if all tasks ran in equal time, which is obviously not realistic) vs 145/8 = ~19 waves; that's roughly 4x as much polling for the 2-slot configuration, and it all adds up.
mapred.tasktracker.map.tasks.maximum configures the maximum number of map tasks that will be run simultaneously by a task tracker. There is a practical hardware limit to how many tasks a single node can run at a time. So there will be diminishing returns when you keep increasing this number.
For example, say the TaskTracker node has 8 cores and 4 of them are being used by processes other than the TaskTracker's tasks. That leaves 4 cores for the mapred tasks. So your task time will improve as you go from mapred.tasktracker.map.tasks.maximum = 1 to 4, but beyond that it will just remain static, because the extra tasks will simply be waiting for a core. In fact, if you increase it too much, the contention and context switching may make it slower. A commonly recommended value for this parameter is the number of CPU cores minus 1.
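A minimal sketch of where this is set on a classic MRv1 TaskTracker; the value 7 simply illustrates the cores-minus-one rule of thumb for an 8-core node and should be tuned down if the disk or other processes can't keep up:
<!-- mapred-site.xml on the TaskTracker node (illustrative value) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>   <!-- 8 cores - 1, per the rule of thumb above -->
</property>
The TaskTracker only reads this at startup, so it typically has to be restarted for a new slot count to take effect.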

Hadoop CapacityScheduler not using excess capacity

I'm running the Hadoop CapacityScheduler with multiple queues and multiple users. I have three queues with capacities 70%, 20% and 10% respectively e.g.
mapred.capacity-scheduler.queue.default.capacity=70
For all the queues I have
mapred.capacity-scheduler.queue.default.maximum-capacity=100
I was surprised to find that the queues hardly ever seemed to use their excess capacity (they would all "max out" at their queue-specific capacity) even though excess capacity was available. I later discovered that the queues would make use of excess capacity only if they contained jobs from multiple users.
I.e. any number of jobs submitted to a queue by a single user will never make use of excess capacity. Only if a second job is submitted by a different user will the excess capacity be used.
I would like a single user to use all cluster resources if there are no other jobs taking up any resources.
I have studied the CapacityScheduler documentation thoroughly and played around with the properties with no success.
If anyone knows how to do this, please let me know.
You may take a look at the property "mapred.capacity-scheduler.queue.queue-name.user-limit-factor" in http://hadoop.apache.org/common/docs/r1.0.3/capacity_scheduler.html.
By default, this value is set to 1 which ensures that a single user can never take more than the queue's configured capacity irrespective of how idle the cluster is. You can set it to be a larger number to achieve what you want.
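A minimal sketch of the corresponding entry in capacity-scheduler.xml, using the same property style as above; the factor of 2 is only an illustration (for the default queue with capacity 70 and maximum-capacity 100, any factor of about 1.5 or more already lets a single user reach the 100% maximum on an otherwise idle cluster):
<!-- capacity-scheduler.xml (MRv1); the value 2 is illustrative -->
<property>
  <name>mapred.capacity-scheduler.queue.default.user-limit-factor</name>
  <value>2</value>   <!-- allow one user up to 2x the queue's capacity, still bounded by maximum-capacity -->
</property>
Depending on the Hadoop version, the JobTracker may need a restart or a queue refresh to pick this change up.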
