How does HBase partition a table across regionservers? - parallel-processing

Please tell me how HBase partitions a table across regionservers.
For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers.
Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ..., and the tenth 9M - 10M?
I would like my row key to be a timestamp, but in that case most queries would apply to the latest dates, so would all queries be processed by only one regionserver? Is that true?
Or maybe this data would be spread differently?
Or can I somehow create more regions than I have regionservers, so that (in the example above) server 1 would hold keys 0 - 0.5M and 3M - 3.5M? That way my data would be spread more evenly. Is this possible?
Update
I just found that there is an option hbase.hregion.max.filesize. Do you think this will solve my problem?

WRT partitioning, you can read Lars' blog post on HBase's architecture or Google's Bigtable paper, which HBase "clones".
If your row key is only a timestamp, then yes, the region with the biggest keys will always be hit with new requests (since a region is only served by a single region server).
Do you want to use timestamps in order to do short scans? If so, consider salting your keys (search Google for how Mozilla did it with Socorro).
Can you prefix the timestamp with any ID? For example, if you only request data for specific users, then prefix the ts with that user ID and it will give you a much better load distribution.
If not, then use UUIDs or anything else that will randomly distribute your keys.
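As an illustration only, here is a minimal Python sketch of both ideas (a salt prefix and a user-ID prefix); the bucket count, key layout and helper names are hypothetical, not anything HBase itself provides:
import hashlib

SALT_BUCKETS = 16  # hypothetical; pick roughly the number of regions you want to spread writes over

def salted_key(timestamp_ms):
    # Derive a deterministic salt from the timestamp so consecutive timestamps
    # land in different key ranges (a time-range scan then needs one scan per bucket).
    salt = int(hashlib.md5(str(timestamp_ms).encode()).hexdigest(), 16) % SALT_BUCKETS
    return "%02d-%013d" % (salt, timestamp_ms)

def user_prefixed_key(user_id, timestamp_ms):
    # Prefix with a user ID instead: one user's rows stay contiguous (good for
    # short scans), while different users spread across regions.
    return "%s-%013d" % (user_id, timestamp_ms)

print(salted_key(1364212200000))                   # prints something like '<salt>-1364212200000'
print(user_prefixed_key("user42", 1364212200000))  # 'user42-1364212200000'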
About hbase.hregion.max.filesize
Setting the maxfilesize on that table (which you can do with the shell) doesn't mean that each region will be exactly X MB big (where X is the value you set). So let's say your row keys are all timestamps, which means that each new row key is bigger than the previous one. This means that it will always be inserted in the region with the empty end key (the last one). At some point, one of the files will grow bigger than maxfilesize (through compactions), and that region will be split around the middle. The lower keys will be in their own region, the higher keys in another one. But since your new row key is always bigger than the previous one, you will only ever write to that new region (and so on).
tl;dr: even if you have more than 1,000 regions, with this schema the region with the biggest row keys will always get the writes, which means that the hosting region server will become a bottleneck.

The option hbase.hregion.max.filesize (256 MB by default) sets the maximum region size; after reaching this limit, the region is split. This means that my data will be stored in multiple regions of 256 MB and possibly one smaller region.
So
I would like my row key to be a timestamp, but in that case most queries would apply to the latest dates, so would all queries be processed by only one regionserver? Is that true?
This is not true, because the latest data will also be split into regions of 256 MB and stored on different regionservers.

Related

Hive partition scenario and how it impacts performance

I want to ask about the number of Hive partitions and how they will impact performance.
Let me illustrate this with a real example:
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions, and with a 1-year retention that will reach around 91K, which I suppose is a huge number. When I checked, Hive can go up to 10K, but after that the performance is going to be bad (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table break down roughly as follows:
50% of them will use the exact order of partition columns.
25% will use only the first 1-3 partition columns and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with a 1-month retention? Or would partitioning only by the start date be enough, assuming a normal distribution across the other 4 columns (let's say 500M / 250 partitions, which gives about 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up the HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
Since the 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, which provides the motivation to keep that number low and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.
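As a hedged sketch only, this is the kind of layout the 3-partition-column recommendation implies; the table, column and partition names, the host, and the use of PyHive as the client are all assumptions, not details from the question (the same statements can be submitted from beeline or the Hive CLI):
from pyhive import hive

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (event_date STRING, source STRING, event_type STRING)
STORED AS ORC
LOCATION '/warehouse/events'
"""

# Filtering on the leading partition columns lets the PartitionPruner touch
# only the matching partitions instead of every partition in the metastore.
query = """
SELECT count(*)
FROM events
WHERE event_date = '2023-01-15' AND source = 'web'
"""

conn = hive.connect(host='hive-server', port=10000)  # hypothetical HiveServer2 host
cur = conn.cursor()
cur.execute(ddl)
cur.execute(query)
print(cur.fetchall())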

Google datastore - index a date created field without having a hotspot

I am using Google Datastore and will need to query it to retrieve some entities. These entities will need to be sorted from newest to oldest. My first thought was to have a date_created property which contains a timestamp. I would then index this field and sort on it. The problem with this approach is that it will cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices).
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously, sorting data by date is probably the most common sort performed on a database. If I can't index timestamps, is there another way I can sort my queries from newest to oldest without hotspots?
As you note, indexing monotonically increasing values doesn't scale and can lead to hotspots. Whether you are potentially impacted by this depends on your particular usage.
As a general rule, the hotspotting point of this pattern is 500 writes per second. If you know you're definitely going to stay under that, you probably don't need to worry.
If you do need more than 500 writes per second but have an upper limit in mind, you could attempt a sharded approach. Basically, if your upper bound on writes per second is x, then n = ceiling(x/500), where n is the number of shards. When you write your timestamp, prepend random(1, n) to the key. This creates n random key ranges that can each sustain up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client-side merging of the result streams.
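A minimal sketch of that sharding scheme, assuming the google-cloud-datastore Python client; the kind name, property names and the 2,000 writes/second ceiling are illustrative assumptions:
import random
import heapq
from datetime import datetime, timezone
from google.cloud import datastore

MAX_WRITES_PER_SEC = 2000                   # your expected upper bound (x)
N_SHARDS = -(-MAX_WRITES_PER_SEC // 500)    # n = ceiling(x / 500) -> 4

client = datastore.Client()

def write_event(payload):
    entity = datastore.Entity(key=client.key('Event'))
    entity.update({
        'shard': random.randint(1, N_SHARDS),        # spreads the index writes over n key ranges
        'date_created': datetime.now(timezone.utc),
        'payload': payload,
    })
    client.put(entity)

def latest_events(limit=50):
    # One query per shard (needs a composite index on shard + descending
    # date_created), then a client-side merge of the sorted result streams.
    per_shard = []
    for shard in range(1, N_SHARDS + 1):
        q = client.query(kind='Event')
        q.add_filter('shard', '=', shard)
        q.order = ['-date_created']
        per_shard.append(list(q.fetch(limit=limit)))
    merged = heapq.merge(*per_shard, key=lambda e: e['date_created'], reverse=True)
    return list(merged)[:limit]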

How can I pre-split a table in HBase

I am storing data in HBase with 5 region servers, using the MD5 hash of the URL as my row key. Currently all the data is getting stored in one region server only, so I want to pre-split the regions so that the data goes uniformly across all region servers.
I want the table split into five regions by the first character of the rowkey, so that rowkeys starting with 0-2 go to the 1st region, 3-5 to the 2nd, 6-8 to the 3rd, 9-c to the 4th, and d-f to the 5th. How can I do it?
You can provide a SPLITS property when creating the table.
create 'tableName', 'cf1', {SPLITS => ['3','6','9','d']}
The 4 split points will generate 5 regions.
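If you want to compute split points programmatically (for example, finer-grained splits over the MD5 keyspace), a small sketch like this could generate them; the function name and region count are just examples, and the resulting values go into the SPLITS list shown above:
def hex_split_points(num_regions, width=2):
    # Return num_regions - 1 boundaries that cut the hex keyspace into
    # roughly even ranges; width is the number of leading hex characters used.
    space = 16 ** width
    step = space // num_regions
    return ['{:0{}x}'.format(i * step, width) for i in range(1, num_regions)]

print(hex_split_points(5))   # ['33', '66', '99', 'cc'] -> 5 roughly even regions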
Please note that HBase's DefaultLoadBalancer doesn't guarantee a 100% even distribution between regionservers; it could happen that a regionserver hosts multiple regions from the same table.
For more information about how it works take a look at this:
public List<RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)
Generate a global load balancing plan according to the specified map
of server information to the most loaded regions of each server. The
load balancing invariant is that all servers are within 1 region of
the average number of regions per server. If the average is an integer
number, all servers will be balanced to the average. Otherwise, all
servers will have either floor(average) or ceiling(average) regions.
HBASE-3609 Modeled regionsToMove using Guava's MinMaxPriorityQueue so
that we can fetch from both ends of the queue. At the beginning, we
check whether there was empty region server just discovered by Master.
If so, we alternately choose new / old regions from head / tail of
regionsToMove, respectively. This alternation avoids clustering young
regions on the newly discovered region server. Otherwise, we choose
new regions from head of regionsToMove. Another improvement from
HBASE-3609 is that we assign regions from regionsToMove to underloaded
servers in round-robin fashion. Previously one underloaded server
would be filled before we move onto the next underloaded server,
leading to clustering of young regions. Finally, we randomly shuffle
underloaded servers so that they receive offloaded regions relatively
evenly across calls to balanceCluster(). The algorithm is currently
implemented as such:
1. Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).
2. Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions. Order the regions to move from most recent to least.
3. Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN. If so we end up with a number of regions required to do so, neededRegions. It is also possible that we were able to fill each underloaded but ended up with regions that were unassigned from overloaded servers but that still do not have assignment. If neither of these conditions hold (no regions needed to fill the underloaded servers, no regions leftover from overloaded servers), we are done and return. Otherwise we handle these cases below.
4. If neededRegions is non-zero (still have underloaded servers), we iterate the most loaded servers again, shedding a single server from each (this brings them from having MAX regions to having MIN regions).
5. We now definitely have more regions that need assignment, either from the previous step or from the original shedding from overloaded servers. Iterate the least loaded servers filling each to MIN. If we still have more regions that need assignment, again iterate the least loaded servers, this time giving each one (filling them to MAX) until we run out.
6. All servers will now either host MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of the balancing. This ensures the minimal number of regions possible are moved.
TODO: We can at-most reassign the number of regions away from a
particular server to be how many they report as most loaded. Should we
just keep all assignment in memory? Any objections? Does this mean we
need HeapSize on HMaster? Or just careful monitor? (current thinking
is we will hold all assignments in memory)
If all the data has already been stored, I recommend you just move some regions to other region servers manually using the hbase shell.
hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'
Move a region. Optionally specify target regionserver else we choose
one at random. NOTE: You pass the encoded region name, not the region
name so this command is a little different to the others. The encoded
region name is the hash suffix on region names: e.g. if the region
name were
TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396.
then the encoded region name portion is
527db22f95c8a9e0116f0cc13c680396 A server name is its host, port plus
startcode. For example: host187.example.com,60020,1289493121758
If you are using Apache Phoenix to create tables in HBase, you can specify SALT_BUCKETS in the CREATE statement. The table will be pre-split into as many regions as the number of buckets specified. Phoenix calculates a hash of the rowkey (most probably a numeric hash % SALT_BUCKETS) and assigns the row to the appropriate region.
CREATE TABLE IF NOT EXISTS us_population (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city)) SALT_BUCKETS=3;
This will pre-split the table into 3 regions.
Alternatively, the default HBase UI allows you to split regions accordingly.
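For intuition only, here is a rough Python sketch of what that salting amounts to; Phoenix actually prepends a single salt byte computed with its own hash function, so the hash choice and key layout below are assumptions, not Phoenix internals:
import zlib

SALT_BUCKETS = 3

def salted_rowkey(state, city):
    # Simplified stand-in for the composite primary key (state, city).
    logical_key = ('%s\x00%s' % (state, city)).encode()
    salt = zlib.crc32(logical_key) % SALT_BUCKETS    # stand-in for Phoenix's hash
    # The leading salt byte routes the row to one of the 3 pre-split regions.
    return bytes([salt]) + logical_key

print(salted_rowkey('CA', 'San Jose'))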

Smart chunking from a huge table

I have a huge table in a data warehouse (Vertica). I am accessing this table in chunks for optimization purposes. The way I am deciding my chunks is pretty straightforward: I have a primary key column, say A, and I take MAX(A). I have a chunk size of 20,000, so I create (MAX(A)/20000)+1 chunks, frame a query for each chunk, and retrieve the data.
The problems with this approach are as follows:
The number of chunks depends on MAX(A), and since MAX(A) is growing very fast, the number of chunks increases with it.
I decided on the number 20,000 because that is what gives me optimal performance. But the distribution of primary keys within the chunks of 20,000 is very scattered. For example, the range 0-20000 might contain only 3 elements while the range 20000-40000 might contain 500 elements, and no range comes close to 20,000.
I am trying to figure out whether there is a good approximation algorithm for this problem that minimizes the number of chunks and fills each chunk with close to 20,000 primary keys.
Any pointers towards a solution are appreciated.
I'm not sure what "optimization purposes" means here, but I think the best approach would be to create a timestamp column, or use an eligible existing timestamp column, to partition on. You could then partition on a larger frame of reference so there isn't a wide range between the partitions.
If the table is partitioned, it will be able to benefit from partition pruning. This means that Vertica can eliminate the storage containers during query execution which do not match on the timestamp predicate.
Otherwise, you can look at the segmentation clause and use the max/min from the storage containers. This could be slightly more complicated.
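As a hedged sketch of the partitioning suggestion above (not the asker's actual schema), the DDL below partitions a hypothetical table by month and then reads it in time-window chunks; the table and column names, the vertica-python client, and the monthly partition expression are all assumptions:
import vertica_python

ddl = """
CREATE TABLE IF NOT EXISTS big_table (
  a          INT NOT NULL,
  created_at TIMESTAMP NOT NULL,
  payload    VARCHAR(1000)
)
PARTITION BY EXTRACT(year FROM created_at) * 100 + EXTRACT(month FROM created_at)
"""

conn = vertica_python.connect(host='vertica-host', port=5433,
                              user='dbadmin', password='secret',
                              database='warehouse')
cur = conn.cursor()
cur.execute(ddl)
# Chunk by time window instead of by primary-key range: partition pruning lets
# Vertica skip every storage container outside the requested window.
cur.execute("SELECT a, payload FROM big_table "
            "WHERE created_at >= '2023-01-01' AND created_at < '2023-01-08'")
rows = cur.fetchall()
conn.close()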

Cassandra Wide Vs Skinny Rows for large columns

I need to insert 60 GB of data into Cassandra per day.
This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance, am I better off using:
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
stuff_id text,
stuff_column text,
value blob,
PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.100000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=39600 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4KB value is a set of 1000 4 byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes some of the data is updated, within a logical row within the set of columns, a random set of levels within the packed string will be updated. Instead of updating in place, the new levels are written to a new logical row combined with other new data to avoid rewriting all of the data which is still valid. This leads to fragmentation as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read from the same column across 1 - 5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface, with the Pycassa batch insert operator. Each insert writes multiple columns to a single row; insert sizes are limited to 12 MB, and the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
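For reference, roughly what such a Pycassa write path could look like; the keyspace name, host, batch sizing by mutation count (rather than the 12 MB data cap described above) and timing code are illustrative assumptions, not the author's actual harness:
import os
import time
import pycassa

pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'stuff')

ROWS_PER_SET = 100        # e.g. the "100 rows per set, 1,500 keys per row" layout
COLS_PER_ROW = 1500
VALUE = os.urandom(4096)  # 4 KB blob per column

start = time.time()
batch = cf.batch(queue_size=1000)   # auto-flushes every 1,000 queued mutations
for r in range(ROWS_PER_SET):
    row_key = 'set42_row%04d' % r
    for c in range(COLS_PER_ROW):
        batch.insert(row_key, {'col%06d' % c: VALUE}, ttl=86400)
batch.send()
elapsed = time.time() - start

bytes_written = ROWS_PER_SET * COLS_PER_ROW * len(VALUE)
print('%.1f MBps' % (bytes_written / (elapsed * 10**6)))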
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using TTL is a good idea to have an auto-cleaning process.

Resources