What is HBase compaction-queue-size at all?

Does anyone know what the regionserver compaction queue size actually means?
By the doc's definition:
9.2.5. hbase.regionserver.compactionQueueSize Size of the compaction queue. This is the number of stores in the region that have been
targeted for compaction.
So it is the number of Stores (or StoreFiles? I have heard both versions) in a regionserver that need to be major compacted.
I have a job writing data in a hotspot style using a sequential key (not distributed),
and in the metric history I saw that at one point a compaction-queue-size of 4 occurred.
That seems theoretically impossible, since with a sequential key I only ever have one Store to write to at any time.
Then I dug into the logs and found no hint of a queue size > 0:
every major compaction says "This selection was in queue for 0sec":
2013-11-26 12:28:00,778 INFO
[regionserver60020-smallCompactions-1385440028938]
regionserver.HStore: Completed major compaction of 3 file(s) in f1 of
myTable.key.md5....
into md5....(size=607.8 M), total size for
store is 645.8 M. This selection was in queue for 0sec, and took 39sec
to execute.
Even more confusing: wasn't multi-threaded compaction enabled in an earlier version, with each compaction job allocated to its own thread? If so, why does a compaction queue exist at all?
Too bad there's no detailed explanation in the HBase docs.

I don't fully understand your question, but let me attempt to answer it to the best of my abilities.
First, let's talk about some terminology for HBase (source):
Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)
A Region in HBase is defined as the rows between two row keys. If you have more than one ColumnFamily in your Table, you will get one Store per ColumnFamily per Region. Every Store will have a MemStore and 0 or more StoreFiles.
StoreFiles are created when the MemStore is flushed. Every so often, a background thread will trigger a compaction to keep the number of files in check. There are two types of compactions: major and minor. When a Store is targeted for a minor compaction, it will also pick up some adjacent StoreFiles and rewrite them as one. A minor compaction will not remove deleted/expired data. If a minor compaction picks up all StoreFiles in a Store, it is promoted to a major compaction. In a major compaction, all StoreFiles of a Store are rewritten as one StoreFile.
Ok... so what is a Compaction Queue?
It is the number of Stores in a RegionServer that have been targeted for compaction. Similarly, a Flush Queue is the number of MemStores that are awaiting a flush.
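To make the "targeted for compaction" part concrete, here is a minimal sketch using the HBase 1.x+ client API (the table name is a placeholder, not from the question). majorCompact only queues a request; a RegionServer compaction thread executes it later, and the queue it waits in is what the compactionQueueSize metric counts.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CompactionQueuePeek {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin
    val table = TableName.valueOf("myTable") // placeholder table name

    // Reports whether a MINOR or MAJOR compaction is currently running
    // for any region of the table.
    println(admin.getCompactionState(table))

    // Asynchronous: the request is only queued here; a compaction thread
    // on the RegionServer picks it up later.
    admin.majorCompact(table)

    admin.close()
    connection.close()
  }
}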
As to the question of why there is a queue when you can do it asynchronously, I have no idea. This would be a great question to ask on the HBase mailing list. It tends to have faster response times.
EDIT: The compaction queue is there so that compactions do not take up 100% of a RegionServer's resources.

Related

Does ClickHouse take the amount of free disk space into account when scheduling background merges?

I have a ClickHouse cluster (three nodes) that contains a MergeTree table, an AggregatingMergeTree table, and a materialized view that fills the AggregatingMergeTree with the data we insert into the MergeTree. All tables are present on each node (see the full schema in this gist here).
I recently increased the storage size (from 4 TB per node to 4.5 TB) and I noticed that right after that ClickHouse seemed to become more aggressive at running background merges. It seems to run longer merges with a higher rows-merged-per-second rate, to the point that some merges impact the IO bandwidth of the servers, with negative effects on the insertion rate.
I noticed this setting here. It mentions that ClickHouse will schedule a merge if there are enough free resources in the background pool.
Does anybody know if that takes the amount of free disk space into account? More space -> more likely to run merges that would create bigger partitions? The value we use for that parameter is the default one. And I did indeed notice that the biggest active partitions we have are around 150 GB, though I cannot say how big they were before adding storage.
Please let me know if there is any additional context needed.
Thanks
Yes, the ClickHouse merge scheduler takes the amount of free disk space into account.
A 150 GB merge is only able to start if 300 GB+ of free disk space is available.
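As a rough illustration of that rule (this is not ClickHouse's actual scheduler code, just the headroom check described above expressed as a sketch):

object MergeHeadroom {
  // Sizes in bytes. The merged part is written alongside the source parts,
  // which are only dropped after the merge commits, so roughly twice the
  // merge size must be free before the merge is allowed to start.
  def canScheduleMerge(sourcePartsBytes: Long, freeDiskBytes: Long): Boolean =
    freeDiskBytes > 2L * sourcePartsBytes

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024 * 1024
    println(canScheduleMerge(150 * gib, 310 * gib)) // true:  310 GB free > 2 x 150 GB
    println(canScheduleMerge(150 * gib, 250 * gib)) // false: not enough headroom
  }
}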

How does partition size affect read/write performance in Cassandra?

I can partition my table into a small number of bigger partitions or many smaller partitions, but in my use case the big partition is still small in size; it will never exceed 100 MB. There will be millions of users reading from this table, so is there a risk of congestion when so many users read from a single partition?
I can imagine that splitting read queries between several physical nodes is faster than reading from a single physical node, but does splitting read queries between several virtual nodes improve performance in the same way? The number of big partitions will exceed the number of physical nodes, so will spreading the data further across the virtual nodes with smaller partitions improve read performance? Is the answer any different for updating partitions of counter tables?
So basically, what I need to know is whether millions of users reading from the same partition (that is below 100 MB in size) will introduce congestion. This is the answer that actually matters for my project. But I also want to know if spreading the data further (for regular and counter tables), beyond the number of physical nodes through smaller partitions, will increase the read/write performance.
Any reference links would be extremely appreciated since I'll be writing a report and referencing an article, journal or documentation is always preferred.
In my opinion, accessing the same partition (we are actually talking about a "row" in Cassandra 3.0) is not a problem. If the load on your cluster increases, you just need to add more nodes; this is the no-single-point-of-failure principle. Each node in your cluster is able to fulfil user requests (depending on your replication factor and read consistency level).
Also, if you know that a partition key is going to be accessed a lot, you can play with the key cache and row cache functionality of your table so that you avoid disk access entirely.
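For example, something along these lines enables a per-partition row cache on the table (a hedged sketch: the keyspace/table names and contact point are placeholders, the CQL syntax is for Cassandra 3.x, and row caching also needs row_cache_size_in_mb > 0 in cassandra.yaml to take effect):

import com.datastax.driver.core.Cluster

object EnableRowCache {
  def main(args: Array[String]): Unit = {
    // Placeholder contact point; uses the DataStax Java driver 3.x API.
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // Cache all partition keys and up to 100 rows per partition.
    session.execute(
      "ALTER TABLE my_ks.my_table " +
      "WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}")

    cluster.close()
  }
}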

On DSE 5.0.5 (Cassandra 3.0.11) which settings would reduce read latencies with constant writes in the background?

We have a 5-node DSE Cassandra cluster and an application whose job is to write asynchronously to keyspace A (which is on an HDD) and read synchronously from keyspace B (which is on an SSD). Reads from table
Additional info:
The table on A is using TWCS with 48h windows, while the table on keyspace B is using LCS with default settings
Spark jobs partition reads in chunks of 20h at most
Both tables are using TDE with AES256 keys and 1KB chunks
Azul Zing is being used as the JVM with default settings apart from heap sizing and GC logging
With this scenario alone the read latencies from keyspace B are fine throughout the day, but every day we have a Spark job that reads from keyspace A and writes to B. The moment the Spark executors "attack" keyspace A, read latencies from keyspace B suffer a bit (the 99th percentile goes from 8-12 ms to 130 ms for a few seconds).
My question is: which cassandra.yaml properties would likely help the most in reducing the read latencies on keyspace B just for the moment the Spark job starts? I've been trying different memtable/commitlog settings, but haven't been able to lower the read latency to acceptable levels.
It's hard to generalize without knowing why your latency hurts; if we could, we'd bake those defaults into the database.
However, I'll try to guess:
Throttle down concurrent reads so there are fewer concurrent requests - this will trade throughput for more consistent performance
If your disk is busy, consider smaller compression chunk sizes
If you're seeing GC pauses, consider tuning your JVM - the CASSANDRA-8150 JIRA has some good suggestions
If your sstables-per-read is more than a few, reconsider your data model to keep your partitions from spanning multiple TWCS windows
Make sure your key cache is enabled. If you can spare the heap, raise it; it may help.
Jeff's answer should be your starting point, but if that doesn't solve it, consider moving your Spark job to an off-peak time. Keep in mind that LCS is optimized for read-heavy tables, but from the moment Spark starts to "migrate" the data, the table using LCS will, for some time (until the Spark job finishes), become a write-heavy table. This is an anti-pattern for LCS. I can't know for sure without looking into the servers' details, but I would say that due to the sheer number of SSTables created during the Spark job, LCS is not able to keep up with compaction and maintain the usual read latency.
If you can't schedule the Spark job at an off-peak time, then you should consider changing the compaction strategy of the table in keyspace B to STCS.
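That change is a table-level ALTER; a minimal sketch via the DataStax Java driver (3.x API; the keyspace/table name and contact point are placeholders):

import com.datastax.driver.core.Cluster

object SwitchToSTCS {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // Switch the table in keyspace B from LCS to STCS; existing SSTables are
    // re-grouped by the new strategy over subsequent compactions.
    session.execute(
      "ALTER TABLE ks_b.my_table " +
      "WITH compaction = {'class': 'SizeTieredCompactionStrategy'}")

    cluster.close()
  }
}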

Spark internals - Does repartition load all partitions in memory?

I couldn't find anywhere how repartition is performed on an RDD internally. I understand that you can call the repartition method on an RDD to increase the number of partitions, but how is it performed internally?
Assume that initially there were 5 partitions and they had:
1st partition - 100 elements
2nd partition - 200 elements
3rd partition - 500 elements
4th partition - 5000 elements
5th partition - 200 elements
Some of the partitions are skewed because they were loaded from HBase and the data was not correctly salted in HBase, which caused some of the region servers to have too many entries.
In this case, when we repartition to 10, will it load all the partitions first and then do the shuffling to create the 10 partitions? What if the full data can't be loaded into memory, i.e. all partitions can't be loaded into memory at once? If Spark does not load all partitions into memory, then how does it know the count, and how does it make sure that the data is correctly partitioned into 10 partitions?
From what I have understood, repartition will certainly trigger a shuffle. From the Job Logical Plan document, the following can be said about repartition:
- for each partition, every record is assigned a key which is an increasing number.
- hash(key) leads to a uniform distribution of records across the new partitions.
If Spark can't load all the data into memory, then an out-of-memory error will be thrown. So Spark's default processing is all done in memory, i.e. there should always be sufficient memory for your data.
The persist option can be used to tell Spark to spill your data to disk if there is not enough memory.
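A minimal sketch of that (standalone data stands in for the HBase-backed RDD from the question; MEMORY_AND_DISK is the storage level that allows spilling):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch").setMaster("local[*]"))

    // Stand-in for the skewed HBase-backed RDD: 5 input partitions.
    val skewed = sc.parallelize(1 to 6000, numSlices = 5)

    // MEMORY_AND_DISK lets Spark spill cached blocks to disk instead of
    // failing when a partition does not fit in memory.
    skewed.persist(StorageLevel.MEMORY_AND_DISK)

    // repartition is a full shuffle: each task streams its records into
    // shuffle files on local disk, so all 5 inputs never need to be held
    // in memory at once.
    val balanced = skewed.repartition(10)

    // Print the per-partition element counts after the shuffle.
    println(balanced.glom().map(_.length).collect().mkString(", "))

    sc.stop()
  }
}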
Jacek Laskowski also explains repartitioning.
Understanding Your Apache Spark Application Through Visualization should be sufficient for you to test and find out for yourself.

HBase: skip the region server to read rows directly from HFiles

I am attempting to dump over 10 billion records into HBase, which will grow on average by 10 million per day, and then attempt a full table scan over the records. I understand that a full scan over HDFS will be faster than over HBase.
HBase is being used to order the disparate data on HDFS. The application is being built using Spark.
The data is bulk-loaded into HBase. Because of the various 2 GB limits, the region size was reduced to 1.2 GB from an initial test of 3 GB (this still requires a bit more detailed investigation).
Scan caching is 1000 and cache blocks is off.
Total HBase size is in the 6 TB range, yielding several thousand regions across 5 region servers (nodes), while the recommendation is in the low hundreds.
The spark job essentially runs across each row and then computes something based on columns within a range.
Using spark-on-hbase, which internally uses TableInputFormat, the job ran in about 7.5 hours.
In order to bypass the region servers, I created a snapshot and used TableSnapshotInputFormat instead. The job completed in about 5.5 hours.
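For reference, the usual way that is wired up in Spark looks roughly like this (a sketch assuming the HBase 1.x mapreduce APIs; the snapshot name and restore directory are placeholders):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil, TableSnapshotInputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object SnapshotScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snapshot-scan"))
    val hbaseConf = HBaseConfiguration.create()

    // Serialize the Scan into the config (caching 1000, cache blocks off,
    // as in the question).
    val scan = new Scan()
    scan.setCaching(1000)
    scan.setCacheBlocks(false)
    hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    // Point the input format at the snapshot instead of the live region
    // servers; the restore dir is a writable temporary HDFS location.
    val job = Job.getInstance(hbaseConf)
    TableSnapshotInputFormat.setInput(job, "myTable_snapshot", new Path("/tmp/snapshot_restore"))

    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count())
    sc.stop()
  }
}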
Questions
When reading from HBase into Spark, the regions seem to dictate the Spark partitions and thus the 2 GB limit, hence the problems with caching. Does this imply that the region size needs to be small?
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits by region, so it would still fall into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other util which can read a row directly from an HFile (to be specific, from a snapshot-referenced HFile)?
Are there any other pointers, say configurations, that may help to boost performance? For instance the HDFS block size, etc.? The main use case is a full table scan for the most part.
As it turns out, this was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations for an IP address: namely, InetAddress took a significant amount of time to resolve an IP address. We resorted to using the raw bytes to extract whatever we needed. This alone made the job finish in about 2.5 hours.
Modelling the problem as a MapReduce problem and running it on MR2 with the same change showed that it could finish in about 1 hour 20 minutes.
The iterative nature and smaller memory footprint helped MR2 achieve more parallelism, and hence it was much faster.
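As an illustration of the raw-byte approach (not the original job's code): formatting an IPv4 address straight from its 4 bytes avoids going through java.net.InetAddress, whose resolution path was the bottleneck described above.

object RawIpv4 {
  // Build the dotted-quad string directly from the 4 raw bytes.
  def ipv4FromBytes(bytes: Array[Byte]): String = {
    require(bytes.length == 4, "expected 4 bytes for an IPv4 address")
    bytes.map(b => b & 0xff).mkString(".")
  }

  def main(args: Array[String]): Unit = {
    val raw = Array(10, 0, 0, 1).map(_.toByte)
    println(ipv4FromBytes(raw)) // prints 10.0.0.1
  }
}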
