In most cases, Geode allocates one partitioned region for each data
structure. For example, each Sorted Set is allocated its own
partitioned region, in which the key is the user data and the value is
the user-provided score, and entries are indexed by score. The two
exceptions to this design are the String and HyperLogLog data types: all
Strings are allocated to a single partitioned region, as are all HyperLogLogs.
For WAN replication, we create a gateway-sender and then assign this sender to a particular region for replication. With the Redis adapter, we only have the two default regions described above, since a region for a "set" data structure is created only when we first add a key to it. How can we replicate those dynamically created regions with the Redis adapter?
https://cwiki.apache.org/confluence/display/GEODE/GemFire+Multi-site+%28WAN%29+Architecture
Steps I followed for WAN replication:
start locator --name=dc1 --properties-file=gemfire.properties
start server --name=redis --redis-port=11211 --J=-Dgemfireredis.regiontype=REPLICATE
create gateway-sender --id=dc1 --remote-distributed-system-id=3
create gateway-receiver
Now, I list regions which are currently available.
Cluster-1 gfsh>list regions
List of regions
---------------
ReDiS_HlL
ReDiS_StRiNgS
Assign both regions to the gateway-sender:
alter region --name=ReDiS_StRiNgS --gateway-sender-id=dc1
alter region --name=ReDiS_HlL --gateway-sender-id=dc1
It is able to replicate the strings but not other data structures.
gemfire.properties
mcast-port=0
locators=1dc1[10334]
distributed-system-id=1
remote-locators=dc2[10334]
I have run the same commands on dc2.
Before creating the region for the other data structures, the Redis adapter implementation looks at the cache.xml to see if the region is already defined. So, in your case, you can define a region with a gateway-sender in cache.xml when starting the server. Please see this reference for creating the cache.xml file; this hierarchy information will also be useful. Once you have the file, you can run the following command:
gfsh>start server --cache-xml-file=/path/to/cache.xml
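For illustration, a cache.xml along these lines should work. This is a minimal sketch, not taken from the Geode docs: adjust the XML header to the DTD/XSD of your GemFire/Geode version, and note that ReDiS_SET_myset is a hypothetical placeholder for whatever region name the adapter derives from your key:

<?xml version="1.0" encoding="UTF-8"?>
<cache>
  <!-- Mirrors the sender created in gfsh above -->
  <gateway-sender id="dc1" remote-distributed-system-id="3"/>
  <!-- "ReDiS_SET_myset" is a hypothetical name; use the region name the
       adapter would derive for your data structure's key -->
  <region name="ReDiS_SET_myset">
    <region-attributes data-policy="replicate" gateway-sender-ids="dc1"/>
  </region>
</cache>

The data-policy here matches the REPLICATE region type passed via -Dgemfireredis.regiontype when the server was started.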
Related
I have 2 Hbase tables and I want to force each of them to a different region server. Is there a way to tell HBase to do this?
You can move a region to another region server using hbase shell move command:
hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'
Move a region. Optionally specify target regionserver else we choose
one at random. NOTE: You pass the encoded region name, not the region
name so this command is a little different to the others. The encoded
region name is the hash suffix on region names: e.g. if the region
name were
TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396.
then the encoded region name portion is
527db22f95c8a9e0116f0cc13c680396. A server name is its host, port plus
startcode. For example: host187.example.com,60020,1289493121758
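Putting the two together, a concrete invocation using the sample values from the help text above (hypothetical values, not from a live cluster) would be:
hbase> move '527db22f95c8a9e0116f0cc13c680396', 'host187.example.com,60020,1289493121758'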
More shell commands here
Though if both tables are large they can have regions on every RegionServer in a cluster, so I'm not sure what you are going to accomplish with that.
I have a question regarding HBase databases. We access the data first by defining a row key, then a column family, and lastly a column qualifier.
My question is will HBase store all column families with the same row key together in one node or not?
UPDATE: As an example, I want to multiply val1 and val2 in a map/reduce job, while val1 and val2 are stored in the database like this: Row=00000, Column Family=M, m000001_1234567=val1; Row=00000, Column Family=R, r000001_1234567=val2. Can I make sure that I have access to both val1 and val2 in the same node running the map?
As you might be aware, it's actually the HFile that has the actual key-value data stored, and it is distributed across the data nodes. ZooKeeper, the HLog, and the Memstore help in locating the row key data and retrieving it.
The key-value storage is grouped and stored by node; say keys [A-L] go to one node and the rest [M-Z] to another, considering a 2-node scenario.
Question 1: Will HBase store all column families with the same row key together in one node?
Yes, but there are a few special cases.
The recommended way to set up an HBase cluster is the collocated (or co-located) configuration: use the same machines for HDFS Data Nodes and HBase Region Servers (in contrast to dedicating the machines to specifically one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a Region Server saves data to HDFS, the first replica of the data will always get saved to the local disk. However, the placement of any further replicas is not consistent - different parts may be placed on different nodes. This means that if a machine dies, no data will get lost, but the data of that region will not be found on any single machine any more, but will be scattered all around the cluster instead. Even in this case, a single row will probably still be stored on a single Data Node, but it won't be local to the new Region Server any more.
This is not the only way data locality can get lost; previously, even restarting HBase had this effect. A lot of older posts mention this, but it has actually been fixed since then in HBASE-2896.
Even if data locality gets lost, the next major compaction will restore it.
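If you don't want to wait for the next scheduled compaction, you can trigger one yourself from the hbase shell; a sketch, with 'TestTable' as a placeholder table name:
hbase> major_compact 'TestTable'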
Sources and recommended reading:
How Scaling Really Works in Apache HBase
HBase and data locality
HBase File Locality in HDFS
Major compaction and data locality
Question 2: When reading an HBase table from a MapReduce job, does each mapper run on the node where the data it uses is stored?
My understanding is that, apart from the special case mentioned above, the answer is yes, but I couldn't find this explicitly mentioned anywhere; the wiring sketch after the reading list below shows where that locality comes from.
Sources and recommended reading:
Understanding Map Reduce on HTable
The MapReduce Integration section of Tutorial: HBase
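For concreteness, here is a minimal sketch of the standard HBase MapReduce wiring (table name and mapper are placeholders, not from the sources above). TableInputFormat produces one input split per region and reports the region's host as the split's location, which is what lets the scheduler run each mapper next to its data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class ScanJob {

    // Placeholder mapper: receives one row per call, scheduled on (or near)
    // the node that hosts the row's region.
    static class MyMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx) {
            // process the row here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-mytable");
        job.setJarByClass(ScanJob.class);
        TableMapReduceUtil.initTableMapperJob(
                "mytable",      // placeholder table name
                new Scan(),     // full-table scan; one split per region
                MyMapper.class,
                NullWritable.class,
                NullWritable.class,
                job);
        job.setNumReduceTasks(0); // map-only job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}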
Given that HBase is a database with its files stored in HDFS, how does it enable random access to a single piece of data within HDFS? By which method is this accomplished?
From the Apache HBase Reference Guide:
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.
Scanning both chapters didn't reveal a high-level answer for this question.
So how does HBase enable random access to files stored in HDFS?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example: a table may contain 10 TB of data. But, the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, version. If you know what the starting key is for each block, you can determine one file to access, and what the byte-offset (block) is to start reading to see where you are in the binary search.
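From the client's point of view, all of that machinery is hidden behind a simple point lookup. A minimal sketch using the standard HBase client API (the table name reuses the example above; row key, column family "cf", and qualifier "q" are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("TestTable"))) {
            // The client locates the region for this key via the meta table,
            // then the region server binary-searches its blocks for the row.
            Get get = new Get(Bytes.toBytes("0094429456"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}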
HBase accesses HDFS files by using HFiles. You can check this URL for the details: http://hbase.apache.org/book/hfilev2.html
I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
Out of the large amount of data we have in Cassandra, we want to fetch data only for particular row keys, but we are unable to do so because of RandomPartitioner - there is an assertion in the code.
Can anyone please guide me on how I should filter data based on row key at the Cassandra level itself (I know the data is distributed across nodes using a hash of the row key)?
Would using secondary indexes (still trying to understand how they work) solve my problem, or is there some other way around it?
I want to use Cassandra MR to calculate some KPIs on the data which is stored in Cassandra continuously, so fetching the whole data set from Cassandra every time seems like an overhead to me. The row key I'm using is like "(timestamp/60000)_otherid"; this CF contains references to the row keys of the actual data stored in another CF. So to calculate a KPI, I will work on a particular minute, fetch the data from the other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns, not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
// Restrict each input row to the column range [start, end], regardless of row key.
SlicePredicate predicate = new SlicePredicate().setSlice_range(
        new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end), false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
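A minimal sketch of that second option, assuming the old Cassandra 1.x Hadoop API (ColumnFamilyInputFormat), where the mapper receives the row key as a ByteBuffer; START_KEY and END_KEY are placeholders for your desired range:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class RangeFilterMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, NullWritable, NullWritable> {

    private static final String START_KEY = "startKey"; // placeholder
    private static final String END_KEY = "endKey";     // placeholder

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        String rowKey = ByteBufferUtil.string(key);
        // Drop any row whose key falls outside the desired range.
        if (rowKey.compareTo(START_KEY) < 0 || rowKey.compareTo(END_KEY) >= 0) {
            return;
        }
        // ...process the columns of the matching row here...
    }
}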
I am unfamiliar with the Cassandra Hadoop integration, but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astyanax, etc.) and ask how to query by row keys with it.
Querying by the row key is a very common operation in Cassandra.
Essentially, if you want to keep using RandomPartitioner and still have the ability to do range slices, you will need to create a reverse index (a.k.a. inverted index). I have answered a similar question here that involved timestamps.
Having the ability to generate your row keys programmatically allows you to emulate a range slice on row keys. To do this you must write your own InputFormat class and generate your splits manually.
I have a problem while writing data in HBase. I have 4 region servers. When I write data using random keys, the data is written to various regions, but they are all on one region server. One server is busy, the other three are free. How do I write evenly across all region servers?
HBase partitions its tables across region servers. See:
How HBase partitions table across regionservers?
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
I am not sure how random or far apart your random keys need to be to end up in different partitions.
See the discussions on hbase.hregion.max.filesize, which suggest that tables are split into new regions when the appropriate data size has been reached.
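If the goal is to spread writes evenly, one common technique (a sketch, not from the links above; the bucket count and class name are placeholders) is to salt the row key with a hash-derived prefix and pre-split the table into the same number of regions:

import java.util.Arrays;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
    // One bucket per region server in the 4-server cluster described above.
    private static final int BUCKETS = 4;

    // Prefix the row key with a one-byte bucket so that writes spread
    // across regions (pre-split the table on bucket boundaries 0,1,2,3).
    public static byte[] salt(byte[] rowKey) {
        byte bucket = (byte) ((Arrays.hashCode(rowKey) & Integer.MAX_VALUE) % BUCKETS);
        return Bytes.add(new byte[] { bucket }, rowKey);
    }
}

Note that readers then have to check all four bucket prefixes when scanning, since a logical key range is split across buckets.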