Log Structured Merge Tree in Hbase - hadoop

I am working on Hbase. I have query regarding how Hbase store the data in sorted order with LSM.
As per my understanding, Hbase use LSM Tree for data transfer in large scale data processing. when Data comes from client, it store in-memory sequentially first and than sort and store as B-Tree as Store file. Than it is merging the Store file with Disk B-Tree(of key). is it correct ? Am I missing something ?
If Yes, than in cluster env. there are multiple RegionServers who take the client request. On that case, How all the Hlogs (of each regionServer) merge with disk B-Tree(as existing key spread across the all dataNode disk) ?
Is it like Hlog only merge the data with Hfile of same regionServer ?

You can take a look at this two articles that describe exactly what you want
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
http://blog.cloudera.com/blog/2012/06/hbase-write-path/
In brief:
The client send data to the region server that is responsible to handle the key
(.META. contains key ranges for each region)
The user operation (e.g. put) is written to the Write-Ahead-Log (WAL, the HLog)
(The log is used just for "safety" if the region server crash the log is replayed to recover data not written to disk)
After writing to the log, data is also written to the MemStore
...once the memstore reach a threshold (conf property)
The memstore is flushed on disk, creating a single hfile
...when the number of hfiles grows too much (conf property) the compaction kicks in (merge)
In terms of on disk data structure:
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
The article above cover the hfile format...
it's an append only format, and can be seen like a b+tree. (Keeping in mind that this b+tree cannot be modified in place)
The HLog is only used for "safety", once the data is written to the hfiles, the logs can be thrown away

According to LSM-tree model in HBase the data consists of two parts - in-memory tree which contains most recent updates upon the data and disk store tree which arranges the rest part of the data into a form of immutable sequential B-tree located on the hard drive. From time to time HBase service decides that it has enough changes in memory to flush them into file storage. In that case it performs the rolling merge of data from the virtual space to disc, executing an operation similar to merge step of Merge sort algorithm.
In HBase infrastructure such data model is based on several components which organize all data across the cluster as a collections of LSM-trees located on slave servers and driven by the main master service. The system is driven by the following components:
HMaster - primary HBase service which maintains the correct state of slave Region Server nodes by managing and balancing the data among them. Besides it drives the changes of metadata information in the storage, like table or column creations and updates.
Zookeeper - represents a distributed cache used by HBase services and its clients to store reconciled up-to-date information about naming and configurations.
Regional servers - HBase worker nodes which perform the management and storage of pieces of the information in LSM-tree fashion
HDFS - used by Regional servers behind the scene for the actual storage of the data
From Low-level the most part of HBase functionality is located within Regional server which performs the read-write work upon the tables. Every table technically can be distributed across different Regional servers as a collection of of separate pieces called HRegions. Single Regional server node can hold several HRegions of one table. Each HRegion holds a certain range of rows shared between the memory and disc space and sorted by key attribute. These ranges do not intersect between different regions so we can relay on their sequential behavior across the cluster. Individual Regional server HRegion includes following parts:
Write Ahead Log (WAL) file - the first place when data is been persisted on every write operation before getting into Memory. As I've mentioned earlier the first part of the LSM-tree is kept in memory, which means that it can be affected by some external factors like power lose from example. Keeping the log file of such operations in a separate place would allow to restore this part easily without any looses.
Memstore - keeps a sorted collection of most recent updates of the information in the memory. It is the actual implementation of the first part of LMS-tree structure, described earlier. Periodically performs rolling merges into the store files called HFiles on the local hard drives
HFile - represents a small pieces of date received from the Memstore and saved in HDFS. Each HFile contains sorted KeyValues collection and B-Tree+ index which allows to seek the data without reading the whole file. Periodically HBase performs merge sort operations upon these files to make them fit the configured size of standard HDFS block and avoid small files problem
You can walk through these elements manually by pushing the data and passing it through the whole LSM-tree process. I described how to do it in my recent article:
https://oyermolenko.blog/2017/02/21/hbase-as-primary-nosql-hadoop-storage/

Related

Apache Ignite NearCaches Vs CachMode

Apache Ignite has two concepts, one of them is NearCache, and another one is the CacheMode enumaration.
What is the main difference between two concepts?
Near cache is the local hot cache that keeps often accessed data. It significantly speeds up data processing, saving time on network round-trips.
CacheMode defines how your data will be stored. It could be LOCAL for single node, which means data are not distributed in grid. Other two PARTITIONED and REPLICATED means respectively: cache data divided between nodes on some equal parts (called partitions) or each node keeps full data from that cache.
PARTITIONED allows you to keep in grid more data than available in separate machine, REPLICATED gives 100% data survivorability (if all nodes crashed except one - you will not loose your data).
More details you can find in documentation https://apacheignite.readme.io/docs/near-caches and https://apacheignite.readme.io/docs/cache-modes

HBase on Hadoop, data locality deep diving

I have read multiple articles about how HBase gain data locality i.e link
or HBase the Definitive guide book.
I have understood that when re-writing HFile, Hadoop would write the blocks on the same machine which is actually the same Region Server that made compaction and created bigger file on Hadoop. everything is well understood yet.
Questions:
Assuming a Region server has a region file (HFile) which is splitted on Hadoop to multiple block i.e A,B,C. Does that means all block (A,B,C) would be written to the same region server?
What would happen if HFile after compaction has 10 blocks (huge file), but region server doesn't have storage for all of them? does it means we loose data locality, since those blocks would be written on other machine?
Thanks for the help.
HBase uses HDFS API to write data to the distributed file sytem (HDFS). I know this will increase your doubt on the data locality.
When a client writes data to HDFS using the hdfs API, it ensures that a copy of the data is written to the local datatnode (if applicable) and then go for replication.
Now I will answer your questions,
Yes. HFile(blocks) written by a specific RegionServer(RS) resides in the local datanode until it is moved for load balancing or recovery by the HMaster(will be back on major compaction). So the blocks A,B,C would be there in the same region server.
Yes. This may happen. But we can control the same by configuring region start and end key for each regions for HBase tables at creation time, which allows the data to be equally distributed in the cluster.
Hope this helps.

Cache huge data in-memory

I am looking for an in-memory cache solution which can handle big data (<5GB). For a user inputted search term, the database (elasticsearch) will return a large amount of data which the tool will analyze and show via different webpages of the tool. Now my problem is that I want to cache this big data temporarily till the user session gets over so that I don't have to fetch it again from elasticsearch every time the user opens a new page. It will have to be in-memory because disk based will take over a minute which would be very slow.
I initially thought memcached but it has a max limit of 128MB. After reading quite a bit, Redis seems suitable but it is unclear to me whether a bunch of Redis nodes can work in tandem or not. Is it possible to set up a pool of many Redis nodes so that a suitable node will be automatically chosen on SET and the data returned upon GET without me having to specify the node?
TL;DR
Problem: Cache big data (<5GB) in an in-memory cache
Possible solution: Redis
Question: Can I pool a bunch of Redis nodes so that I can fetch a key stored in any of them without specifying a particular node. I don't need to distribute my data since data for a single user will fit into the RAM of a single node.
A Redis Cluster sounds like a good fit for your usecase!
Redis cluster provides a mechanism for data sharding by means of hash slots. These slots are equally distributed over the nodes in your cluster when setting it up.
Whenever you store a value in the cluser, the corresponding hash slot for the given key is calculated and the data is forwarded to the responsible node. And the same way you can afterwards query your data again. So the answer to your question is certainly yes.
However, the max value size per key is 512MB. I'm not sure if I got your storage requirement correctly. I assume 5GB is the estimated total amount over all users.
Checkout the redis cluster tutorial.
You can also look into NCache(.net) / Tayzgrid(java) by Alachisoft,
Both of these solutions provide distributed caching with dynamic clustering which allows to add or remove nodes in cluster at runtime with out losing any data. Also intelligent client makes sure to refer to appropriate node to fetch/store a record against any key.

where is data stored in hadoop

Though I have understood the architecture of hadoop a bit , I have some void in understanding of where the data is exactly situated.
My question is like " Suppose I have large data of some random books .. is the data of books stored in multiple Nodes previously using HDFS and we perform MapReduce on each node and get the result in our system ?
'OR'
Do we store data some where in large database and whenever we want to perform the MapReduce operation, we take the chunks and store them in multiple Nodes for performing operation ?
Either is possible, it really depends on your use case and needs. However, generally Hadoop MapReduce runs against data stored in HDFS. The system is designed around data locality which requires the data be in HDFS. That is the Map tasks run on the same piece of hardware where the data is stored in order to improve performance.
That said if for some reason your data must be stored outside of HDFS and then processed using MapReduce it can be done but is a bit more work and is not as efficient as processing data in HDFS locally.
So lets take two use cases. Start with log files. Log files as they are are not particularly accessible. They just need to be stuck somewhere and stored for later analysis. HDFS is perfect for this. If you really need a log back out you can get it but generally people will be looking for the output of the analytics. So store your logs in HDFS and process them normally.
However, data in the format ideal for HDFS and Hadoop Map Reduce (many records in a single large flat file) is not what I would consider highly accessible. Hadoop Map Reduce expects to have input files that are multi megabytes in size with many records per file. The more you diverge from this case, the more your performance will decline. Sometimes your data is needed online at all times, and HDFS is not ideal for this. For instance we will use your book example. If these books are used in an application that needs the content accessible in an online fashion, I.E. editting and annotating, you may choose to store them in a database. Then when you need to run batch analytics you use a custom InputFormat to retrieve the records from the database and process them in MapReduce.
I am currently doing this with a web crawler that stores the web pages individually in Amazon S3. Web pages are too small to serve as a single efficient input to MapReduce, so I have a custom InputFormat that feeds each mapper several files. The output of this MapReduce job is eventually written back to S3, and because I am using Amazon EMR, the Hadoop cluster goes away.

cassandra and hadoop - realtime vs batch

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop---Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx
Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.
What are the differences in the architecture/implementation of Cassandra and Hadoop which account for this sort of difference in usage. (in lay software professional terms)
I wanted to add, because I think there might be a misleading statement here saying Cassandra might perform good for reads.
Cassandra is not very good at random reads either, it's good compared to other solutions out there in how can you read randomly over a huge amount of data, but at some point if the reads are truly random you can't avoid hitting the disk every single time which is expensive, and it may come down to something useless like a few thousand hits/second depending on your cluster, so planning on doing lots of random queries might not be the best, you'll run into a wall if you start thinking like that. I'd say everything in big data works better when you do sequential reads or find a way to sequentially store them. Most cases even when you do real time processing you still want to find a way to batch your queries.
This is why you need to think beforehand what you store under a key and try to get the most information possible out of a read.
It's also kind of funny that statement says transaction and Cassandra in the same sentence, cause that really doesn't happen.
On the other hand hadoop is meant to be batch almost by definition, but hadoop is a distributed map reduce framework, not a db, in fact, I've seen and used lots of hadoop over cassandra, they're not antagonistic technologies.
Handling your big data in real time is doable but requires good thinking and care about when and how you hit the database.
Edit: Removed secondary indices example, as last time I checked that used random reads (though I've been away from Cassandra for more than a year now).
The Vanilla hadoop consists of a Distributed File System (DFS) at the core and libraries to support Map Reduce model to write programs to do analysis. DFS is what enables Hadoop to be scalable. It takes care of chunking data into multiple nodes in a multi node cluster so that Map Reduce can work on individual chunks of data available nodes thus enabling parallelism.
The paper for Google File System which was the basis for Hadoop Distributed File System (HDFS) can be found here
The paper for Map Reduce model can be found here
For a detailed explanation on Map Reduce read this post
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional database but is more like Hashtable or HashMap which stores a key/value pair. Cassandra works on top of HDFS and makes use of it to scale. Both Cassandra and HBase are implementations of Google's BigTable. Paper for Google BigTable can be found here.
BigTable makes use of a String Sorted Table (SSTable) to store key/value pairs. SSTable is just a File in HDFS which stores key followed by value. Furthermore BigTable maintains a index which has key and offset in the File for that key which enables reading of value for that key using only a seek to the offset location. SSTable is effectively immutable which means after creating the File there is no modifications can be done to existing key/value pairs. New key/value pairs are appended to the file. Update and Delete of records are appended to the file, update with a newer key/value and deletion with a key and tombstone value. Duplicate keys are allowed in this file for SSTable. The index is also modified with whenever update or delete take place so that offset for that key points to the latest value or tombstone value.
Thus you can see Cassandra's internal allow fast read/write which is crucial for real time data handling. Whereas Vanilla Hadoop with Map Reduce can be used to process batch oriented passive data.
Hadoop consists of two fundamental components: distributed datastore (HDFS) and distributed computation framework (MapReduce). It reads a bunch of input data then writes output from/to the datastore. It needs distributed datastore since it performs parallel computing with the local data on cluster of machines to minimize the data loading time.
While Cassandra is the datastore with linear scalability and fault-tolerance ability. It lacks of the parallel computation ability provided by MapReduce in Hadoop.
The default datastore (HDFS) of Hadoop can be replaced with other storage backend, such as Cassandra, Glusterfs, Ceph, Amazon S3, Microsoft Azure's file system, MapR’s FS, and etc. However, each alternatives has its pros and cons, they should be evaluated based on the needs.
There are some resources that help you integrate Hadoop with Cassandra: http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configHadoop.html

Resources