I have two HBase tables and I want to force each of them onto a different region server. Is there a way to tell HBase to do this?
You can move a region to another region server using the hbase shell move command:
hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'
Move a region. Optionally specify target regionserver else we choose
one at random. NOTE: You pass the encoded region name, not the region
name so this command is a little different to the others. The encoded
region name is the hash suffix on region names: e.g. if the region
name were
TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396.
then the encoded region name portion is
527db22f95c8a9e0116f0cc13c680396. A server name is its host, port plus
startcode. For example: host187.example.com,60020,1289493121758
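As a rough illustration of the naming rules in the help text above, the following Python sketch derives the encoded region name from a full region name and assembles a server name, then prints the resulting shell command. The helper names encoded_region_name and server_name are hypothetical and are not part of HBase; the sample values are the ones from the help text.

```python
def encoded_region_name(region_name: str) -> str:
    """Return the hash suffix that sits between the last two dots of a region name."""
    trimmed = region_name.rstrip(".")      # drop the trailing dot
    return trimmed.rsplit(".", 1)[1]       # keep everything after the last remaining dot


def server_name(host: str, port: int, startcode: int) -> str:
    """A server name is host, port and startcode joined by commas."""
    return f"{host},{port},{startcode}"


region = "TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396."
encoded = encoded_region_name(region)
target = server_name("host187.example.com", 60020, 1289493121758)

# Build the command text to paste into the hbase shell.
print(f"move '{encoded}', '{target}'")
```

The sketch only formats the command text; the move itself still has to be issued in the hbase shell.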
More shell commands here
Though if both tables are large they can have regions on every RegionServer in a cluster, so I'm not sure what you are going to accomplish with that.
In most cases, Geode allocates one partitioned region for each data
structure. For example, each Sorted Set is allocated its own
partitioned region, in which the key is the user data and the value is
the user-provided score, and entries are indexed by score. The two
exceptions to this design are data types String and HyperLogLog. All
Strings are allocated to a single partitioned region.
For WAN replication, we create a gateway-sender and then assign this sender to a particular region for replication. With the Redis adapter, we only have two regions by default, as described above, because a region for a "set" data structure is created only when we add a key for it. How can we replicate those regions with the Redis adapter?
https://cwiki.apache.org/confluence/display/GEODE/GemFire+Multi-site+%28WAN%29+Architecture
The steps I followed for WAN replication:
start locator --name=dc1 --properties-file=gemfire.properties
start server --name=redis --redis-port=11211 --J=-Dgemfireredis.regiontype=REPLICATE
create gateway-sender --id=dc1 --remote-distributed-system-id=3
create gateway-receiver
Now, I list regions which are currently available.
Cluster-1 gfsh>list regions
List of regions
---------------
ReDiS_HlL
ReDiS_StRiNgS
Assign both the regions to the gateway-sender
alter region --name=ReDiS_StRiNgS --gateway-sender-id=dc1
It is able to replicate the strings but not other data structures.
gemfire.properties
mcast-port=0
locators=1dc1[10334]
distributed-system-id=1
remote-locators=dc2[10334]
I have run the same commands on dc2.
Before creating the region for other data structures, the Redis adapter implementation looks at the cache.xml to see if the region is defined. So, in your case, you can define a region with a gateway-sender in cache.xml when starting the server. Please see this reference for creating the cache.xml file; this hierarchy information will also be useful. Once you have the file, you can run the following command:
gfsh>start server --cache-xml-file=/path/to/cache.xml
I have a question regarding HBase databases. We access data by specifying first a row key, then a column family, and finally a column qualifier.
My question is: will HBase store all column families with the same row key together on one node or not?
UPDATE: As an example, I want to multiply val1 and val2 in a map/reduce job, where val1 and val2 are stored in the database like this: Row=00000 Column Family:M, m000001_1234567=val1, Row=00000 Column Family: R, r000001_1234567=val2. Can I make sure that I have access to both val1 and val2 on the same node running the map?
As you might be aware, it's actually the HFile that stores the key-value data, and HFiles are distributed across the data nodes. ZooKeeper, the HLog, and the MemStore help in locating and retrieving the row key data.
The key-value storage is grouped by key range and stored on each node: say keys [A-L] go to one node and the rest [M-z] to another node, considering a 2-node scenario.
Question 1: Will HBase store all column families with the same row key together in one node?
Yes, but there are a few special cases.
The recommended way to set up an HBase cluster is the collocated (or co-located) configuration: use the same machines for HDFS Data Nodes and HBase Region Servers (in contrast to dedicating each machine to only one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a Region Server saves data to HDFS, the first replica of the data will always get saved to the local disk. However, the placement of any further replicas is not consistent: different parts may be placed on different nodes. This means that if a machine dies, no data will get lost, but the data of that region will no longer be found on any single machine; it will be scattered all around the cluster instead. Even in this case, a single row will probably still be stored on a single Data Node, but it won't be local to the new Region Server any more.
This is not the only way data locality can get lost. Previously, even restarting HBase had this effect. A lot of older posts mention this, but it has since been fixed in HBASE-2896.
Even if data locality gets lost, the next major compaction will restore it.
Sources and recommended reading:
How Scaling Really Works in Apache HBase
HBase and data locality
HBase File Locality in HDFS
Major compaction and data locality
Question 2: When reading an HBase table from a MapReduce job, does each mapper run on the node where the data it uses is stored?
My understanding is that apart from the special case mentioned above, the answer is yes, but I couldn't find this explicitly mentioned anywhere.
Sources and recommended reading:
Understanding Map Reduce on HTable
The MapReduce Integration section of Tutorial: HBase
I need to process some data in MR and load it into an external system that sits on the same physical machines as my MR nodes. Right now I run the job and read the output from HDFS and re-route individual records back out onto the desired nodes.
Is it possible to define some mapping such that records with key X always go straight to the desired node Y? Simply put, I want to control where Hadoop routes post-sorted partitioned groups.
Not easily. The only way I know of to affect the physical location of a block of data on the fly is to implement a custom BlockPlacementPolicy. I'll just throw out some ideas for your use case.
A custom BlockPlacementPolicy can route blocks based on the file name
The file name of a partition can be modified using MultipleOutputs in MapReduce
Keys can be routed to specific partitions using a custom Partitioner
It seems like you can get the result you're looking for, but it won't be pretty.
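To make the Partitioner idea a bit more concrete, here is a minimal sketch of the routing logic in plain Python. A real Hadoop Partitioner is a Java class, so this only mimics its contract (return a partition index between 0 and the number of partitions). The prefix table and node names are hypothetical and chosen purely for illustration.

```python
import zlib

# Hypothetical routing table: first character of the key -> desired node.
NODE_FOR_PREFIX = {"a": "node1", "b": "node2"}
# Hypothetical node list; the position in this list is used as the partition index.
NODES = ["node1", "node2", "node3"]


def get_partition(key: str, num_partitions: int) -> int:
    """Return a partition index in [0, num_partitions) for the given key."""
    node = NODE_FOR_PREFIX.get(key[:1])
    if node is not None:
        return NODES.index(node) % num_partitions
    # Keys without an explicit rule fall back to a stable hash.
    # (Python's built-in hash() for str is salted per process, so crc32 is used instead.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


print(get_partition("apple", len(NODES)))   # routed by the explicit rule for "a"
print(get_partition("banana", len(NODES)))  # routed by the explicit rule for "b"
```

Note that a partition index only decides which reducer receives a key. Tying that partition to a physical machine is the part that would still need the file naming and block placement pieces listed above.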
Given that HBase is a database with its files stored in HDFS, how does it enable random access to a singular piece of data within HDFS? By which method is this accomplished?
From the Apache HBase Reference Guide:
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.
Scanning both chapters didn't reveal a high-level answer for this question.
So how does HBase enable random access to files stored in HDFS?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example, a table may contain 10 TB of data, but the table is broken up into regions of 4 GB each. Each region has a start and end key. The client can get the list of regions for a table and determine which region holds the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, version. If you know the starting key of each block, you can determine which file to access and at which byte offset (block) to start reading at each step of the binary search.
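The two-level lookup described above can be sketched in Python with the standard bisect module. This is only an illustration of the idea, not HBase's actual code: the region start keys, the block index, and the byte offsets are made-up values, and only one region's block index is filled in.

```python
from bisect import bisect_right


def find_by_start_key(start_keys, key):
    """Binary search: index of the last entry whose start key is <= key."""
    idx = bisect_right(start_keys, key) - 1
    if idx < 0:
        raise KeyError(f"{key!r} sorts before the first start key")
    return idx


# Hypothetical sorted region start keys; "" means "from the beginning of the table".
region_start_keys = ["", "g", "n", "t"]

# Hypothetical block index for region 1 only: (first key in block, byte offset in the file).
block_index = {
    1: [("g", 0), ("i", 65536), ("k", 131072)],
}


def locate(key):
    """Return (region index, byte offset of the block that may contain key)."""
    region = find_by_start_key(region_start_keys, key)
    blocks = block_index[region]
    block = find_by_start_key([first for first, _ in blocks], key)
    return region, blocks[block][1]


print(locate("jack"))  # region with start key "g", block starting at key "i"
```

Both steps are the same search over sorted start keys, which is why knowing only the start key of each region and each block is enough to jump to the right byte offset.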
HBase accesses HDFS files through the HFile format. See this URL for details: http://hbase.apache.org/book/hfilev2.html
I have a problem writing data to HBase. I have 4 region servers. When I write data using random keys, the data goes to various regions, but those regions are all on one region server. One server is busy while the other three are idle. How can I spread writes evenly across all region servers?
HBase partitions its tables across region servers. See:
How HBase partitions table across regionservers?
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
I am not sure how random or far apart your random key should be to be able to write to different partitions.
See discussions on hbase.hregion.max.filesize and base.hregion.maxfilesize, which suggest that tables are split into new regions when the appropriate data size has been reached.