I am trying to understand the HBase architecture. I can see that two different terms are used for the same purpose.
The Write Ahead Log (WAL) and the MemStore both seem to store new data that hasn't yet been persisted to permanent storage.
What's the difference between WAL and MemStore?
Update:
WAL - is used to recover not-yet-persisted data in case a server crashes.
MemStore - stores updates in memory as sorted KeyValues.
It seems like a lot of data duplication before the data is written to disk.
WAL is for recovery, NOT for data duplication. (Further, see my answer here.)
Please go through the details below to understand more.
An HBase Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family of a table for a given region.
The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck.
The WAL can be disabled to remove this performance bottleneck.
This is done by calling the HBase client method
Mutation.writeToWAL(false)
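Roughly, here is a minimal Java sketch of skipping the WAL for a single write. It assumes a newer HBase client API, where the older writeToWAL(false)/setWriteToWAL(false) style call has been superseded by setDurability(Durability.SKIP_WAL); the table name 't1', family 'f1', row key and value below are placeholders, not anything from the original question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SkipWalPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("t1"))) { // placeholder table
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("q"), Bytes.toBytes("value"));
                // Skip the WAL for this mutation: faster, but the edit is lost
                // if the RegionServer crashes before the MemStore is flushed.
                put.setDurability(Durability.SKIP_WAL);
                table.put(put);
            }
        }
    }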
General note: it is common practice to disable the WAL while bulk-loading data to gain speed. The side effect is that if you disable the WAL, you cannot replay the data if a crash wipes out what was in memory.
Moreover, if you use Solr + HBase + Lily (i.e. Lily Morphline NRT indexing with HBase), the indexer works off the WAL. If you disable the WAL for performance reasons, Solr NRT indexing won't work, since Lily relies on the WAL.
Please have a look at the HBase architecture section.
Related
Here is the situation:
HDFS is known to be append-only (no updates per se).
Hive writes data to its warehouse, which is located in HDFS.
Updates can be performed in Hive.
This implies that new data is written, and the old data must somehow be marked as deprecated and wiped out at some point.
I searched but did not manage to find any information about this so far.
Since HBase is built on top of HDFS, which has a replication policy for fault tolerance, does this mean HBase is inherently fault tolerant and that data stored in HBase will always be accessible thanks to the underlying HDFS? Or does HBase implement a replication policy of its own (e.g. table replication over regions)?
Yes, you can create replicas of regions in HBase, as mentioned here. However, note that HBase high availability is for reads only; it is not highly available for writes. If a region server goes down, you will not be able to write until its regions are assigned to a new region server.
To enable read replicas, you need to enable async WAL replication by setting hbase.region.replica.replication.enabled to true. You also need to enable high availability for the table at creation time by specifying a REGION_REPLICATION value greater than 1, as in the docs:
create 't1', 'f1', {REGION_REPLICATION => 2}
More details can be found here.
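As a rough illustration, here is a minimal Java client sketch of reading from those replicas. It assumes the table from the create statement above ('t1' with family 'f1'); the row key and qualifier are placeholders. Consistency.TIMELINE is what allows the read to be served by a secondary replica, and Result.isStale() reports whether that happened:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReplicaRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("t1"))) {
                Get get = new Get(Bytes.toBytes("row1"));
                // TIMELINE consistency lets the read be served by a secondary
                // replica if the primary is slow or down; the result may be stale.
                get.setConsistency(Consistency.TIMELINE);
                Result result = table.get(get);
                byte[] v = result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("q"));
                System.out.println("stale=" + result.isStale()
                        + " value=" + (v == null ? "null" : Bytes.toString(v)));
            }
        }
    }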
The concept of replication in HBase is different from HDFS replication; the two apply in different contexts. HDFS is the file system and replicates data blocks for fault tolerance and high availability at the file level. HBase replication, on the other hand, is mainly about fault tolerance, high availability, and data integrity from a database system perspective.
Of course, HBase uses the HDFS replication capability for file-level replication. Along with it, HBase also maintains copies of its metadata on backup nodes (which are again replicated by default by HDFS).
HBase also has backup processes to monitor and recover from failure, such as primary and secondary region servers. But data loss within a region server is protected against by HDFS replication only.
Hence, HBase replication is mainly about recovering from failure and maintaining data integrity as a database engine, just like any other robust database system such as Oracle.
Why can't the metadata be stored in HDFS with a replication factor of 3? Why is it stored on the local disk?
Because it would take the NameNode more time for resource allocation due to the extra I/O operations involved. So it is better to keep the metadata in the NameNode's memory.
There are multiple reasons:
If it were stored on HDFS, there would be network I/O, which would be slower.
The NameNode would have a dependency on DataNodes for its metadata.
Again, metadata would be required to tell the NameNode where the metadata itself is on HDFS.
Metadata is data about the data, such as which rack a block is stored on, so that the block can be located. If the metadata were stored in HDFS and the DataNodes holding it failed, you would lose all your data, because you would no longer know how to reach the blocks where your data was stored.
Even if you kept a higher replication factor, for each change on the DataNodes the change would have to be made in the DataNode replicas as well as in the NameNode's edit log.
Now, since we would have 3 replicas of the NameNode metadata, every change on a DataNode would first have to be applied to:
1. Its own replica blocks
2. The NameNode and the replicas of the NameNode (the edit log is edited 3 times)
This would mean writing more data than before. But data storage is not the only, or even the main, problem; the main problem is the time required to do all these operations.
Therefore the NameNode metadata is backed up on a remote disk, so that even if your whole cluster fails (the chances are low) you can always recover your data.
To protect against NameNode failure, Hadoop comes with:
Primary NameNode -> holds the namespace image and edit logs.
Secondary NameNode -> merges the namespace image and edit logs so that the edit logs don't become too large.
We have an HDFS cluster with a capacity of 900 TB. As the stored data keeps growing, it is difficult to keep track of what is useful and what could be deleted.
I want to analyze HDFS usage for the following patterns so that the capacity can be used optimally:
What is the frequently accessed data.
Data not being touched/accessed for long time (Possible candidate for deletion)
Data usage distribution by users.
Active users.
You can derive that data from:
(1) HDFS audit log (access patterns per user/ip)
(2) fsimage (access times per file, data not accessed)
(1) Do you have HDFS audit log enabled? Read more here.
(2) To get started with the fsimage, read this - there is an example of how to find data not touched/accessed for a long time (see also the sketch at the end of this answer).
You may also want to consider HAR to archive the data (instead of deleting it) - this reduces both storage usage and precious memory on the NameNode.
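If you prefer to poke at the live namespace instead of (or before) processing the fsimage, here is a minimal Java sketch using the Hadoop FileSystem API to list files whose access time is older than some cutoff; the root path and the 180-day threshold are placeholder assumptions, not anything from the question. Note that recursively walking a 900 TB namespace this way puts load on the NameNode (which is exactly why the offline fsimage approach above is usually preferred at this scale), and that access times are only recorded if dfs.namenode.accesstime.precision is not set to 0:

    import java.time.Duration;
    import java.time.Instant;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ColdDataFinder {
        public static void main(String[] args) throws Exception {
            // Root path and age threshold are example values; adjust for your cluster.
            Path root = new Path(args.length > 0 ? args[0] : "/user");
            long cutoff = Instant.now().minus(Duration.ofDays(180)).toEpochMilli();

            FileSystem fs = FileSystem.get(new Configuration());
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true); // recursive listing
            long coldBytes = 0;
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                // Access times are only as precise as dfs.namenode.accesstime.precision allows.
                if (status.getAccessTime() < cutoff) {
                    coldBytes += status.getLen();
                    System.out.println(status.getPath() + "\t" + status.getLen());
                }
            }
            System.out.println("Total bytes not accessed in ~180 days: " + coldBytes);
        }
    }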
I am trying to execute a performance test on a Cloudera Hadoop cluster. However, since Impala appears to use a cache to store previous queries, how can I empty the cache?
Does Impala use caching?
Impala does not cache data but it does cache some table and file metadata. Although queries might run faster on subsequent iterations because the data set was cached in the OS buffer cache, Impala does not explicitly control this.
Quoted from: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_faq.html#faq_performance_unique_1__faq_caching_unique_1
The file metadata caching is different from "query caching". It is just caching the locations of files and blocks in HDFS, which is something that most databases already know but Impala may not because it gets table/file metadata from Hive. The file metadata should be available to Impala in your tests.
Impala never caches queries, but file data may be cached in one of two ways:
You've enabled HDFS caching. I assume you're not doing this.
Some data read by HDFS may be in the OS buffer cache. Impala has no control over this. Some googling turns up guidance about clearing the Linux buffer caches, e.g. this unix.stackexchange.com answer.