How can I pre-split a table in HBase

I am storing data in HBase, which has 5 region servers. I am using the MD5 hash of the URL as my row key. Currently all the data is being stored on one region server only, so I want to pre-split the table so that data goes uniformly to all region servers.
I want to split the table into five regions by the first character of the row key, so that rows whose keys start with 0-2 go to the 1st region server, 3-5 to the 2nd, 6-8 to the 3rd, 9-c to the 4th, and d-f to the 5th. How can I do it?

You can provide a SPLITS property when creating the table.
create 'tableName', 'cf1', {SPLITS => ['3','6','9','d']}
The 4 split points will generate 5 regions.
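If you'd rather create the pre-split table programmatically, here is a minimal sketch using the HBase 2.x Java client API with the same split points (the table and column family names are just placeholders):
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Same split points as the shell example: 4 split keys -> 5 regions
            byte[][] splitKeys = {
                Bytes.toBytes("3"), Bytes.toBytes("6"), Bytes.toBytes("9"), Bytes.toBytes("d")
            };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("tableName"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                    .build(),
                splitKeys);
        }
    }
}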
Note that HBase's DefaultLoadBalancer doesn't guarantee a 100% even distribution between region servers; it can happen that a region server hosts multiple regions from the same table.
For more information about how it works, take a look at the Javadoc of its balanceCluster method:
public List<RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)
Generate a global load balancing plan according to the specified map
of server information to the most loaded regions of each server. The
load balancing invariant is that all servers are within 1 region of
the average number of regions per server. If the average is an integer
number, all servers will be balanced to the average. Otherwise, all
servers will have either floor(average) or ceiling(average) regions.
HBASE-3609 Modeled regionsToMove using Guava's MinMaxPriorityQueue so
that we can fetch from both ends of the queue. At the beginning, we
check whether there was empty region server just discovered by Master.
If so, we alternately choose new / old regions from head / tail of
regionsToMove, respectively. This alternation avoids clustering young
regions on the newly discovered region server. Otherwise, we choose
new regions from head of regionsToMove. Another improvement from
HBASE-3609 is that we assign regions from regionsToMove to underloaded
servers in round-robin fashion. Previously one underloaded server
would be filled before we move onto the next underloaded server,
leading to clustering of young regions. Finally, we randomly shuffle
underloaded servers so that they receive offloaded regions relatively
evenly across calls to balanceCluster(). The algorithm is currently
implemented as such:
1. Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).
2. Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions. Order the regions to move from most recent to least.
3. Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN; if so, we end up with a number of regions still required, neededRegions. It is also possible that we were able to fill each underloaded server but ended up with regions that were shed from overloaded servers and still have no assignment. If neither of these conditions holds (no regions needed to fill the underloaded servers, no regions leftover from overloaded servers), we are done and return. Otherwise we handle these cases below.
4. If neededRegions is non-zero (we still have underloaded servers), we iterate the most loaded servers again, shedding a single region from each (this brings them from having MAX regions to having MIN regions).
5. We now definitely have more regions that need assignment, either from the previous step or from the original shedding from overloaded servers. Iterate the least loaded servers, filling each to MIN. If we still have more regions that need assignment, iterate the least loaded servers again, this time giving each one more (filling them to MAX), until we run out.
6. All servers will now host either MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of the balancing. This ensures the minimal possible number of regions is moved.
TODO: We can at-most reassign the number of regions away from a
particular server to be how many they report as most loaded. Should we
just keep all assignment in memory? Any objections? Does this mean we
need HeapSize on HMaster? Or just careful monitor? (current thinking
is we will hold all assignments in memory)

If all the data has already been stored, I recommend just moving some regions to other region servers manually, using the HBase shell:
hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'
Move a region. Optionally specify target regionserver else we choose
one at random. NOTE: You pass the encoded region name, not the region
name so this command is a little different to the others. The encoded
region name is the hash suffix on region names: e.g. if the region
name were
TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396.
then the encoded region name portion is
527db22f95c8a9e0116f0cc13c680396. A server name is its host, port plus
startcode. For example: host187.example.com,60020,1289493121758
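The same move can be done from the Java client. A rough sketch, assuming the classic byte[]-based Admin.move overload (newer HBase releases also offer a ServerName-based one), reusing the example names from the help text above:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class MoveRegion {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Encoded region name (the hash suffix), destination server as host,port,startcode
            admin.move(Bytes.toBytes("527db22f95c8a9e0116f0cc13c680396"),
                       Bytes.toBytes("host187.example.com,60020,1289493121758"));
        }
    }
}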

In case you are using Apache Phoenix for creating tables on top of HBase, you can specify SALT_BUCKETS in the CREATE TABLE statement, and the table will be pre-split into as many regions as there are buckets. Phoenix computes a hash of the row key and prepends hash(rowkey) % SALT_BUCKETS as a leading salt byte, which determines the region a given row lands in.
CREATE TABLE IF NOT EXISTS us_population (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city)) SALT_BUCKETS=3;
This will pre-split the table into 3 regions.
Alternatively, the HBase web UI allows you to split regions manually.

Related

How to make Apache Ignite scale linearly with an increase in the number of nodes?

I'm running some tests and found that 1 node is faster and produces more results than 2 or 4 nodes. I'm not able to understand why this is happening.
I'm using partition_aware=True and lazy=True while writing and querying data to Ignite.
Here are some of the results I got after running some queries; they are for a cross join of two 100k-row tables.
Different result sets for different Ignite topologies are an implicit indicator that your affinity collocation configuration is incorrect. You need to distribute your entries across the cluster in a particular way that allows tables to be joined locally. Make sure that leads and products have the same affinity key column, and use it for your join. This concept is called a collocated join; it helps to avoid additional network hops.
For this particular case it seems you are trying to calculate Levenshtein distance, and the only way to do that is a cross join, which is basically a Cartesian product of the tables. It means that for each row from the left table you'll need to traverse all the records from the right table (there are some possible optimisations though). The only way to achieve that here is to leverage non-collocated joins. But keep in mind that this implies additional network activity. Here's a rough estimation of how much we actually need.
Assume we want to compute the cross join of tables A and B. Let's also assume that table A contains n rows and table B contains m rows. In that case, for a cluster with k nodes (we are not taking backups into account; they don't take part in SQL), we come up with the following estimation in terms of network data transfer.
There are roughly n/k rows of table A on every node on average. For every node-local row in A there are approximately m*(k-1)/k rows of B (residing on the other nodes) to fetch through the network. Having k nodes in total, the required network activity is proportional to k * (n/k) * m*(k-1)/k = n*m*(k-1)/k. With a growing number of nodes this creeps up towards n*m (the entire dataset squared). And that's not really good. Having a smaller number of nodes actually decreases the network load in this scenario.
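To make that concrete with the asker's numbers (two 100k-row tables, so n = m = 100,000): on a 4-node cluster the cross join implies on the order of 100,000 * 100,000 * 3/4 ≈ 7.5 billion remote row fetches, while on a single node every row of B is already local and nothing crosses the network, which is consistent with 1 node outperforming 2 or 4.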
In a nutshell:
try enabling distributed joins; it will fix the result set size (see the sketch below)
it's difficult to say what's going on without profiling and query execution plans
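A minimal sketch of enabling non-collocated (distributed) joins from Ignite's Java API; the cache, table and column names here are only illustrative assumptions:
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class DistributedJoinExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        IgniteCache<?, ?> leads = ignite.cache("leads");
        // setDistributedJoins(true) lets the engine join rows that are not collocated,
        // at the cost of extra network round-trips; setLazy(true) fetches result pages lazily.
        SqlFieldsQuery qry = new SqlFieldsQuery(
                "SELECT l.id, p.id FROM leads l CROSS JOIN products p")
                .setDistributedJoins(true)
                .setLazy(true);
        List<List<?>> rows = leads.query(qry).getAll();
        System.out.println("rows: " + rows.size());
    }
}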

Choosing a safe number of members for a CP Subsystem

Tried scouring the documentation, but I'm still uncertain about the CP Subsystem setup for my current situation.
We have a Hazelcast cluster spread across 2 data centers, each data center having an even number of members, say 4, though that can double during a rollout.
The boxes in each data center are configured to be part of a separate partition group => 2 data centers - 2 partition groups, with 4-8 members each at a snapshot in time.
What would be the best number to set as CP Subsystem member count, considering that one data center might be decoupled as part of BAU?
I initially thought of setting the count to 5, to enforce having at least one box from each data center in the Raft consensus in the general case (rollover happens only for a short amount of time during redeployment, so maybe it is not that big of a deal), but that might mean that consensus will not be possible when one data center is decoupled. On the other hand, if I set a value smaller than the box count in one DC, say 3, what would happen if all the boxes in the consensus group were assigned to the same DC and that DC went away abruptly due to network conditions? These are mostly assumptions, since CP is a relatively new topic for me, so please correct me if I am wrong.
We prefer three datacenters, but sometimes a third is not available.
My team was faced with this same decision several years ago when expanding into a new jurisdiction. There were a lot of options, here are some. In all of these scenarios we did extensive testing for how the system behaved with network partitions.
Make a primary datacenter and a secondary datacenter
This is the option we ended up going with. We put 2/3 of the hosts in one datacenter and 1/3 in the secondary data-center. As much as possible, we weighted client traffic towards the primary datacenter. We also communicated with our customers about this preference so they could do the same if they wanted.
If the datacenter had multiple rooms, we made sure to have hosts spread across the different rooms to help mitigate power/network outages within the datacenter. At the minimum, we ensured the hosts are on different racks.
We also had multiple clusters and for each cluster we usually switched which datacenter was the primary and which was the secondary. We didn't do this in some jurisdictions with notorious power troubles.
Split half and half
It's up to the gods what happens when a datacenter goes down. This is why we chose the first option: we wanted the choice of what happens when each datacenter goes down.
Have a tie-breaker in a different region
Put a host in an entirely different region from the two datacenters. Most of the time the latency will be too high for this host to fully participate in making consensus decisions, but in the case of a network partition it can help move the majority to one of the partitions.
The tie-breaker host must be a part of the quorum and cannot be kicked out because of latency delays.
Build a new datacenter
These things are very expensive, but it makes the durability story much nicer. Not always an option.
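For reference, a minimal sketch of pinning the CP Subsystem member count in the Java config (assuming the Hazelcast 4.x/5.x API; the value 5 is only an illustration, matching the count discussed in the question):
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class CpSubsystemSetup {
    public static void main(String[] args) {
        Config config = new Config();
        // The CP Subsystem is enabled by setting a member count of at least 3;
        // the group size must be an odd number (at most 7) and <= the member count.
        config.getCPSubsystemConfig().setCPMemberCount(5);
        config.getCPSubsystemConfig().setGroupSize(5);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}
Note that with only two data centers, whichever DC ends up hosting the majority of CP members is the only one whose loss the Raft groups can survive, which is essentially the primary/secondary trade-off described above.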

Why can the max parallelism of a Flink job not be changed without losing state?

I just read that the maximum parallelism (defined by setMaxParallelism) of a Flink job cannot be changed without losing state. This surprised me a bit, and it is not that hard to imagine a scenario where one starts running a job, only to find out the load is eventually 10x larger than expected (or perhaps the efficiency of the code is below expectations) resulting in a desire to increase parallelism.
I could not find many reasons for this, other than some references to key groups. The most tangible statement I found here:
The max parallelism mustn't change when scaling the job, because it would destroy the mapping of keys to key groups.
However, this still leaves me with the questions:
Why is it hard/impossible to let a job change its max parallelism?
Based on the above, the following conceptual solution came to mind:
In the state, keep track of the last used max parallelism
When starting a job, indicate the desired max parallelism
Given that both settings are known, it should be possible to infer how the mappings would need to change to remain valid initially.
If needed a new state could be defined based on the old state with the new maxparallelism, to 'fit' the new job.
I am not saying this conceptual solution is ideal, or that it would be trivial to implement. I just wonder if there is more to the very rigid nature of the maximum parallelism. And trying to understand whether it is just a matter of 'this flexibility is not implemented yet' or 'this goes so much against the nature of Flink that one should not want it'.
Every key is assigned to exactly one key group by computing a hash of the key modulo the number of key groups. So changing the number of key groups affects the assignment of keys to key groups. Each parallel instance of a keyed operator is responsible for one or more key groups, and the number of key groups is the same as the maximum parallelism.
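As a rough sketch of that mapping (mirroring the idea behind Flink's KeyGroupRangeAssignment; the hash-spreading step below is a simplified stand-in for the murmur hash Flink applies internally):
import java.util.Objects;

public class KeyGroupSketch {
    // Simplified stand-in for Flink's internal hash spreading.
    static int spread(int h) {
        h ^= (h >>> 16);
        return h * 0x85ebca6b;
    }

    // Key -> key group: hash(key) modulo the number of key groups (= max parallelism).
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return Math.floorMod(spread(Objects.hashCode(key)), maxParallelism);
    }

    // Key group -> parallel operator instance, for the current parallelism.
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;  // number of key groups, baked into every snapshot
        int parallelism = 4;       // may change between runs without breaking the mapping
        int kg = assignToKeyGroup("user-42", maxParallelism);
        System.out.println("key group " + kg + " -> subtask "
                + operatorIndexForKeyGroup(maxParallelism, parallelism, kg));
        // With a different maxParallelism the same key would land in a different key
        // group, so state written for maxParallelism=128 cannot be read back as 256.
    }
}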
The reason this number is painful to change is that it is effectively baked into the state snapshots (checkpoints and savepoints). These snapshots are indexed by key group, so that on start-up each parallel instance can efficiently load just the state it requires.
There are in-memory data structures that scale up significantly as the number of key groups rises, which is why the max parallelism doesn't default to some rather large value (the default is 128).
The State Processor API can be used to rewrite state snapshots, should you need to change the number of key groups, or migrate between state backends.

Retrieving sequential numbers in a distributed system

System consists of tens of peer servers (none of them is leader/master).
To create an entity, a service should acquire the next sequential number based on some group key: there is a different sequence for each group key.
Let's say that to create an instance of entity A, the service has to get a sequence number with group key A, while to create entity B, it has to get a sequence number with group key B.
Getting the same number twice is prohibited. Missing numbers are allowed.
Currently I have implemented a solution with an RDBMS, having a record for each group key and updating its current sequence value in a transaction:
UPDATE SEQUENCES SET SEQ_ID=SEQ_ID + 1 WHERE KEY = ?
However, this approach only achieves 200-300 queries per second because of locking and synchronisation.
Another approach I am considering is to have a local buffer of sequence numbers on each node. Once the buffer is empty, the service queries the DB for the next batch of ids and stores them locally: UPDATE SEQUENCES SET SEQ_ID=SEQ_ID + 1000 WHERE KEY = ? if the batch size is 1000. This may help to lower contention. However, if a node goes down it loses all of its acquired sequence numbers, which, if it happens frequently, can lead to overflowing the maximum value of the sequence (e.g. max int).
I don't know in advance, how many sequence numbers will be needed.
I don't want to introduce additional dependencies between servers and have one of them generate sequence numbers and serve them to the others.
What are the general ways to solve similar problems?
Which other RDBMS-based approaches can be considered?
Which other NOT RDBMS-based approaches can be considered?
What other problems can happen with local buffer solution?
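For illustration, here is a minimal sketch of the local-buffer approach described above, assuming a JDBC DataSource and the SEQUENCES(KEY, SEQ_ID) table from the question (class and method names are made up):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class SequenceBuffer {
    private final DataSource ds;
    private final String groupKey;
    private final int batchSize;
    private long next;  // next value to hand out
    private long max;   // last value reserved for this node (inclusive)

    public SequenceBuffer(DataSource ds, String groupKey, int batchSize) {
        this.ds = ds;
        this.groupKey = groupKey;
        this.batchSize = batchSize;
        this.next = 1;
        this.max = 0;   // empty buffer forces a refill on first use
    }

    public synchronized long nextId() throws SQLException {
        if (next > max) {
            refill();
        }
        return next++;
    }

    // Reserve a whole batch in one transaction; numbers lost on a crash are simply skipped.
    private void refill() throws SQLException {
        try (Connection c = ds.getConnection()) {
            c.setAutoCommit(false);
            long current;
            try (PreparedStatement sel = c.prepareStatement(
                     "SELECT SEQ_ID FROM SEQUENCES WHERE KEY = ? FOR UPDATE")) {
                sel.setString(1, groupKey);
                try (ResultSet rs = sel.executeQuery()) {
                    rs.next();
                    current = rs.getLong(1);
                }
            }
            try (PreparedStatement upd = c.prepareStatement(
                     "UPDATE SEQUENCES SET SEQ_ID = SEQ_ID + ? WHERE KEY = ?")) {
                upd.setLong(1, batchSize);
                upd.setString(2, groupKey);
                upd.executeUpdate();
            }
            c.commit();
            next = current + 1;
            max = current + batchSize;
        }
    }
}
A row for each group key must already exist; SELECT ... FOR UPDATE serializes concurrent refills on the same key, so contention now happens once per batch instead of once per id, and any numbers held in memory when a node crashes are simply skipped.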

How does HBase partition a table across regionservers?

Please tell me how HBase partitions a table across regionservers.
For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers.
Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ..., the tenth 9M - 10M?
I would like my row key to be a timestamp, but in case most queries apply to the latest dates, would all queries be processed by only one regionserver? Is that true?
Or maybe this data would be spread differently?
Or maybe I can somehow create more regions than I have region servers, so that (according to the given example) server 1 would have keys 0 - 0.5M and 3M - 3.5M; this way my data would be spread more equally. Is this possible?
update
I just found that there's an option hbase.hregion.max.filesize; do you think this will solve my problem?
WRT partitioning, you can read Lars' blog post on HBase's architecture or Google's Bigtable paper, which HBase "clones".
If your row key is only a timestamp, then yes, the region with the biggest keys will always be hit with new requests (since a region is only served by a single region server).
Do you want to use timestamps in order to do short scans? If so, consider salting your keys (search Google for how Mozilla did it with Socorro).
Can you prefix the timestamp with any ID? For example, if you only request data for specific users, then prefix the ts with that user ID and it will give you a much better load distribution.
If not, then use UUIDs or anything else that will randomly distribute your keys.
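A minimal sketch of the salting/prefixing idea (a hypothetical helper; the bucket count, key layout and user-id prefix are only illustrative):
import java.nio.charset.StandardCharsets;

public class SaltedKey {
    // Prefix the timestamp-based key with a salt bucket derived from a stable hash,
    // so consecutive timestamps spread across several regions instead of hitting one.
    static byte[] saltedRowKey(String userId, long timestamp, int buckets) {
        int bucket = Math.floorMod(userId.hashCode(), buckets);
        String key = String.format("%02d-%s-%013d", bucket, userId, timestamp);
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] rowKey = saltedRowKey("user42", System.currentTimeMillis(), 10);
        System.out.println(new String(rowKey, StandardCharsets.UTF_8));
        // Reading a time range back then requires one scan per bucket.
    }
}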
About hbase.hregion.maxfilesize
Setting maxfilesize on that table (which you can do with the shell) doesn't mean that each region will be exactly X MB big (where X is the value you set). Let's say your row keys are all timestamps, which means that each new row key is bigger than the previous one. This means that it will always be inserted in the region with the empty end key (the last one). At some point, one of the files will grow bigger than maxfilesize (through compactions), and that region will be split around the middle. The lower keys will be in their own region, the higher keys in another one. But since your new row keys are always bigger than the previous ones, you will only ever write to that new region (and so on).
tl;dr even though you have more than 1,000 regions, with this schema the region with the biggest row keys will always get the writes, which means that the hosting region server will become a bottleneck.
The option hbase.hregion.max.filesize (256 MB by default) sets the maximum region size; after reaching this limit the region is split. This means that my data will be stored in multiple regions of 256 MB and possibly one smaller one.
So
I would like my row key to be a timestamp, but in case most queries apply to the latest dates, would all queries be processed by only one regionserver?
This is not true, because the latest data will also be split into regions of 256 MB and stored on different regionservers.
