I have a little experience with Cassandra, but I have one question regarding the Cassandra read process.
Suppose we have 7 SSTables for a given table in our Cassandra database, and we perform a read query for data that is not in the memtable, so Cassandra has to look into the SSTables. My question is:
During this process, will Cassandra load all 7 SSTables into memory, or will it just look into each SSTable and load only the relevant rows instead of the whole SSTables?
Thanks in advance!
And please do correct me if I have interpreted something wrong.
It would also be great if someone could explain this or point to good resources on how SSTables work.
During this process will Cassandra load all the SSTables (7)?
No, Cassandra would not load all 7 SSTables. Each SSTable has a Bloom filter (kept in memory) that indicates whether that SSTable might contain the requested data.
If the Bloom filter indicates that the SSTable might hold the data, Cassandra checks the partition key cache for the partition's index entry.
If the entry is found in the partition key cache, Cassandra goes straight to the compression offset map (in memory) to locate the compressed block on disk and reads that block (I/O) to get the data.
If it is not found, Cassandra consults the partition summary (in memory) to narrow down the location of the index entry, reads that part of the partition index on disk (I/O), and then continues with the compression offset map flow described above.
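To make the ordering concrete, here is a toy Java sketch of that decision flow. None of these classes exist in Cassandra; plain maps and predicates stand in for the Bloom filter, partition key cache, partition summary/index and compression offset map:

    import java.util.*;
    import java.util.function.Predicate;

    // Toy model of the read path: only the order of the checks matters here.
    class ReadPathSketch {

        static class SSTable {
            Predicate<String> bloomFilter = key -> true;              // in memory, may give false positives
            Map<String, Long> keyCache = new HashMap<>();             // partition key cache -> index entry
            NavigableMap<String, Long> indexOnDisk = new TreeMap<>(); // stands in for summary + partition index
            Map<Long, String> dataBlocks = new HashMap<>();           // stands in for offset map + data file
        }

        static List<String> read(String partitionKey, List<SSTable> sstables) {
            List<String> versions = new ArrayList<>();
            for (SSTable t : sstables) {
                if (!t.bloomFilter.test(partitionKey))       // 1. Bloom filter says "definitely absent"
                    continue;                                //    -> this SSTable is never touched
                Long entry = t.keyCache.get(partitionKey);   // 2. partition key cache
                if (entry == null) {
                    Map.Entry<String, Long> e = t.indexOnDisk.floorEntry(partitionKey); // 3. summary + index (I/O)
                    if (e == null || !e.getKey().equals(partitionKey))
                        continue;                            //    Bloom filter false positive
                    entry = e.getValue();
                }
                versions.add(t.dataBlocks.get(entry));       // 4. offset map -> read only that block (I/O)
            }
            return versions;                                 // merged with the memtable, newest timestamp wins
        }
    }

Only the relevant rows are pulled into memory this way; the SSTable files themselves are never loaded wholesale.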
The Cassandra Reads link should help here; it depicts the read path flow pictorially.
One more thing: there is also a row cache, which holds hot (frequently accessed) rows; if the requested row is found there, the SSTables are not hit or loaded at all.
Go through the row cache link to understand the row cache and the partition key cache.
Another great presentation, shared by Jeff Jirsa, is Understanding Cassandra Table Options. It is really worth going through.
On a different note, compaction runs periodically to reduce the number of SSTables and to purge rows marked for deletion by tombstones.
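For illustration, a toy Java sketch of what that merge conceptually does, assuming a simplified model in which each SSTable is just a sorted map from key to a timestamped cell (real compaction also honours gc_grace_seconds and the configured compaction strategy):

    import java.util.*;

    // Toy compaction: merge several sorted "SSTables", keep only the newest version
    // of each key, and drop the key entirely if that newest version is a tombstone.
    class CompactionSketch {
        record Cell(long timestamp, String value, boolean tombstone) {}

        static NavigableMap<String, Cell> compact(List<NavigableMap<String, Cell>> sstables) {
            NavigableMap<String, Cell> merged = new TreeMap<>();
            for (NavigableMap<String, Cell> table : sstables)
                table.forEach((key, cell) -> merged.merge(key, cell,
                        (a, b) -> a.timestamp() >= b.timestamp() ? a : b)); // newest version wins
            merged.values().removeIf(Cell::tombstone);  // purge deleted rows
            return merged;                              // becomes the single new SSTable
        }
    }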
Related
I am trying to understand how reads and writes happen in HBase and how HBase does its caching.
From various articles and videos, I found that a read merge happens when a read request is made to HBase. What I understood is:
Whenever a read request is made, the block cache is checked first for the data.
Then the memstore is checked. If the data is found in either the block cache or the memstore, it is sent to the client.
Otherwise it is fetched from the HFiles.
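In rough Java-style pseudo-code (my own simplified model, not HBase's real classes), this is how I currently picture that merge:

    import java.util.*;

    // Candidate cells from the block cache, the memstore and the HFiles are collected
    // and, per cell, the version with the newest timestamp is returned to the client.
    class MyReadMergeUnderstanding {
        static byte[] newestVersion(List<NavigableMap<Long, byte[]>> sources) {
            NavigableMap<Long, byte[]> merged = new TreeMap<>(Comparator.reverseOrder());
            sources.forEach(merged::putAll);             // timestamp -> value, per source
            return merged.isEmpty() ? null : merged.firstEntry().getValue();
        }
    }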
My doubts are:
Are both the block cache and the memstore always checked for the data, or is the memstore skipped if the data is found in the block cache?
If the memstore is not checked (because the data was found in the block cache), how will the client get the latest value if there was a more recent edit in the memstore?
I created a new table, added one row, and issued a get command to fetch the data. I got the data back, but I did not see any change in the cache hits and reads of the block cache. Why?
I know there are multiple questions here, but they are all linked to the read merge and HBase caching. I need clarity on these concepts and could not find it in the documentation.
I am trying to understand the HBase architecture. I see two different terms being used for what looks like the same purpose.
Write Ahead Log (WAL) and MemStore: both are used to store new data that hasn't yet been persisted to permanent storage.
What's the difference between WAL and MemStore?
Update:
WAL - used to recover data that has not yet been persisted if a server crashes.
MemStore - stores updates in memory as sorted key-values.
This looks like a lot of duplication of the data before it is written to disk.
The WAL is for recovery, NOT for data duplication (further, see my answer here).
Please go through the points below to understand more.
An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to one column family of a table in a given region.
The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be written sequentially. This can make the WAL a performance bottleneck.
The WAL can be disabled to work around this bottleneck.
This is done per mutation through the HBase client, e.g. via the (since deprecated) method
    Mutation.setWriteToWAL(false)
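In recent HBase versions the non-deprecated way is to set the durability per mutation; a minimal sketch using Put.setDurability(Durability.SKIP_WAL) (the table, row and column names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SkipWalExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("mytable"))) {

                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));

                // Skip the WAL for this mutation only: faster, but the edit is lost
                // if the RegionServer crashes before the MemStore is flushed.
                put.setDurability(Durability.SKIP_WAL);

                table.put(put);
            }
        }
    }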
General note: it is common practice to disable the WAL while bulk-loading data to gain speed. The side effect is that if you disable the WAL and a RegionServer crashes, the data still in the MemStore cannot be replayed and is lost.
Moreover, if you use Solr + HBase + Lily (i.e. Lily Morphline NRT indexing with HBase), the indexer works off the WAL; if you disable the WAL for performance reasons, Solr NRT indexing will not work, since Lily relies on the WAL.
Please have a look at the HBase architecture section.
I have a question regarding HBase. We access the data first by row key, then by column family, and finally by column qualifier.
My question is: will HBase store all column families with the same row key together on one node or not?
UPDATE: As an example, I want to multiply val1 and val2 in a map/reduce job, where val1 and val2 are stored like this: Row=00000, Column Family M, m000001_1234567=val1 and Row=00000, Column Family R, r000001_1234567=val2. Can I make sure that I have access to both val1 and val2 on the same node that runs the map?
As you may be aware, it is actually the HFile that stores the key-value data, and it is distributed across the DataNodes. ZooKeeper, the HLog (WAL) and the MemStore help in locating the row key's data and retrieving it.
The key-value data is grouped by key range and stored per node: say keys [A-L] go to one node and the rest [M-Z] go to another, in a two-node scenario.
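If you want to check which RegionServer currently serves a given row, the client API exposes this through RegionLocator; a minimal sketch (the row key comes from the question, the table name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WhereIsMyRow {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 RegionLocator locator = conn.getRegionLocator(TableName.valueOf("mytable"))) {

                // All column families of a row live in the same region, so one
                // location answers the question for both family M and family R.
                HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("00000"));
                System.out.println("Row 00000 is served by " + loc.getHostname());
            }
        }
    }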
Question 1: Will HBase store all column families with the same row key together in one node?
Yes, but there are a few special cases.
The recommended way to set up an HBase cluster is the collocated (or co-located) configuration: use the same machines for the HDFS DataNodes and the HBase RegionServers (in contrast to dedicating machines to just one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a RegionServer saves data to HDFS, the first replica of the data always gets saved to the local disk. However, the placement of the further replicas is not consistent: different parts may be placed on different nodes. This means that if a machine dies, no data is lost, but the data of that region will no longer be found on any single machine; it will be scattered all around the cluster instead. Even in this case, a single row will probably still be stored on a single DataNode, but it will not be local to the new RegionServer any more.
This is not the only way data locality can get lost; previously, even restarting HBase had this effect. A lot of older posts mention this, but it has since been fixed in HBASE-2896.
Even if data locality gets lost, the next major compaction will restore it.
Sources and recommended reading:
How Scaling Really Works in Apache HBase
HBase and data locality
HBase File Locality in HDFS
Major compaction and data locality
Question 2: When reading an HBase table from a MapReduce job, does each mapper run on the node where the data it uses is stored?
My understanding is that apart from the special case mentioned above, the answer is yes, but I couldn't find this explicitly mentioned anywhere.
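As a sketch of what such a job could look like for the val1 * val2 example from the question: TableMapReduceUtil creates one input split per region, and the framework tries to schedule each mapper on the node that serves that region (the table name, output handling and value encoding here are assumptions, not a tuned implementation):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class MultiplyJob {

        static class MultiplyMapper extends TableMapper<Text, DoubleWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result result, Context context)
                    throws IOException, InterruptedException {
                // Both column families of this row arrive in the same Result, on the same node.
                byte[] v1 = result.getValue(Bytes.toBytes("M"), Bytes.toBytes("m000001_1234567"));
                byte[] v2 = result.getValue(Bytes.toBytes("R"), Bytes.toBytes("r000001_1234567"));
                if (v1 != null && v2 != null) {
                    double product = Bytes.toDouble(v1) * Bytes.toDouble(v2);
                    context.write(new Text(row.get()), new DoubleWritable(product));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "multiply-val1-val2");
            job.setJarByClass(MultiplyJob.class);

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("M"));   // fetch only the two families we need
            scan.addFamily(Bytes.toBytes("R"));

            TableMapReduceUtil.initTableMapperJob("mytable", scan, MultiplyMapper.class,
                    Text.class, DoubleWritable.class, job);
            job.setNumReduceTasks(0);                           // map-only for this sketch
            job.setOutputFormatClass(NullOutputFormat.class);   // output handling omitted here
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }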
Sources and recommended reading:
Understanding Map Reduce on HTable
The MapReduce Integration section of Tutorial: HBase
Any file system should provide an API to access its files and directories, etc.
So, what is meant by "HDFS lacks random read and write access"?
So, supposedly, that is why we should use HBase.
The default HDFS block size is 128 MB, and HDFS is designed for streaming those large blocks sequentially; you cannot efficiently pick out one small record here and another there, and existing files cannot be updated in place. This is fine when you want to process a whole file, but it makes HDFS unsuitable for applications that need to use an index to look up small records.
HBase, on the other hand, is great for this: if you want to read a small record, you only read (roughly) that small record.
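For example, a single-row lookup through the HBase client touches only the HFile blocks that can contain that row, not whole HDFS blocks (a minimal sketch; the table, row and column names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PointLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("mytable"))) {

                // One small record: only the block(s) that can hold this row are read.
                Get get = new Get(Bytes.toBytes("user#42"));
                get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
            }
        }
    }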
HBase uses HDFS as its backing store. So how does it provide efficient record-based access?
HBase loads the tables from HDFS to memory or local disk, so most reads do not go to HDFS. Mutations are stored first in an append-only journal. When the journal gets large, it is built into an "addendum" table. When there are too many addendum tables, they all get compacted into a brand new primary table. For reads, the journal is consulted first, then the addendum tables, and at last the primary table. This system means that we only write a full HDFS block when we have a full HDFS block's worth of changes.
A more thorough description of this approach is in the Bigtable whitepaper.
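To make the write side of this concrete, here is a toy Java sketch of the journal / in-memory table / flush cycle described above. It is a deliberately simplified model, not HBase's or Bigtable's actual code:

    import java.util.*;

    // Toy write path: append to a journal first, buffer in a sorted in-memory table,
    // and flush to a new immutable "addendum" table once enough has accumulated.
    class WritePathSketch {
        final List<String> journal = new ArrayList<>();                // append-only log (WAL)
        final NavigableMap<String, String> memtable = new TreeMap<>(); // in memory, sorted
        final Deque<NavigableMap<String, String>> addenda = new ArrayDeque<>(); // flushed tables, newest first
        static final int FLUSH_THRESHOLD = 4;                          // tiny, for illustration only

        void put(String key, String value) {
            journal.add(key + "=" + value);           // 1. durable first, replayable after a crash
            memtable.put(key, value);                 // 2. immediately visible to reads
            if (memtable.size() >= FLUSH_THRESHOLD) { // 3. full buffer -> write one sorted "addendum"
                addenda.addFirst(new TreeMap<>(memtable));
                memtable.clear();
                journal.clear();                      //    journal no longer needed for these edits
            }
        }

        String get(String key) {                      // reads: memory first, then newest addendum
            if (memtable.containsKey(key)) return memtable.get(key);
            for (NavigableMap<String, String> t : addenda)
                if (t.containsKey(key)) return t.get(key);
            return null;                              // would fall through to the primary table
        }
    }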
In a typical database, where the data is stored in tables in RDBMS format, you can read or write any record of any table without having to know what is in the other records. This is called random reading/writing.
But in HDFS data is (generally) stored as files rather than tables, so reading or writing a specific record is not as easy as it is in an RDBMS.
I've been doing some research on HBase and I'm finding it challenging to understand how the HBase read path works. I have a basic understanding, but I don't have a clear picture of how it reads multiple HFiles while checking Bloom filters. What is the purpose of metablocks, and how does HBase use them when reading data? What is the purpose of the indexes in HFiles, and how are they used?
Hence I need your help in understanding these concepts.
Your time is much appreciated. Thanks
If there is more than one HFile at read time, HBase will check (per HFile) whether the row in question could be there. If it could be, HBase will read that row from all the relevant HFiles (and also from the MemStore), so that the client always gets the latest data. I'm sorry, I didn't quite get the "block filters" part. Could you please point me to the source where you read about this? That will help me give you a complete answer. (Do you mean Bloom filters?)
The purpose of a metablock is to hold a large amount of auxiliary data. Metablocks are used by the HFile to store, for example, the Bloom filter, and a string key is associated with each metablock. Metablocks are kept in memory until HFile.close() is called.
An index is also written for the metablocks and data blocks to make reads faster. These indices contain n records (where n is the number of blocks), each with block information (block offset, size and first key).
At the end, a fixed file trailer is written to the HFile. It contains the offsets and counts for all the HFile indices, the HFile version, the compression codec, etc. When a read starts, HFile.loadFileInfo() is called first and the file trailer written earlier is loaded into memory along with all the indices; this allows keys to be queried efficiently. Then, with the help of an HFileScanner, the client seeks to a specified key and iterates from it to read the data.
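To see why these indices make reads fast, here is a toy sketch of a block-index lookup: the index maps each block's first key to its offset, so a seek touches exactly one block (plain maps stand in for the real on-disk structures):

    import java.util.*;

    class BlockIndexSketch {
        // index: first key of each block -> block offset; blocks: offset -> sorted key/value data
        static String seek(NavigableMap<String, Long> index,
                           Map<Long, NavigableMap<String, String>> blocks,
                           String key) {
            Map.Entry<String, Long> entry = index.floorEntry(key); // the only block that could hold the key
            if (entry == null) return null;                        // key sorts before the first block
            return blocks.get(entry.getValue()).get(key);          // read just that one block
        }
    }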
I would like to point you to the links that helped me in understanding these things. Hopefully you'll find them useful.
Link 1: Apache HBase I/O – HFile (Cloudera)
Link 2: HBase I/O: HFile (th30z)
Link 3: Scanning in HBase
HTH