Read merge in HBase - caching

I am trying to understand how reads and writes happen in HBase and how HBase does its caching.
From various articles and videos, I found that a read merge happens when a read request is made to HBase. What I understood is:
Whenever a read request is made, the block cache is checked first for the data.
Then the MemStore is checked. If the data is found in either the block cache or the MemStore, it is sent to the client.
Otherwise it is fetched from the HFiles.
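
To show what I mean by a read merge, here is a toy model of how I picture it (plain Scala, not HBase's actual code; every name here is invented): candidate cells from the MemStore and from the store files (via the block cache) are merged, and the newest version wins.

// Toy model: a cell is a value plus a timestamp; a read merges all sources
// and returns the newest version, so a fresher edit in the MemStore always wins.
case class Cell(value: String, timestamp: Long)

def read(rowKey: String,
         memstore: Map[String, Cell],
         blockCacheOrHFiles: Map[String, Cell]): Option[Cell] = {
  val candidates = memstore.get(rowKey).toSeq ++ blockCacheOrHFiles.get(rowKey).toSeq
  if (candidates.isEmpty) None else Some(candidates.maxBy(_.timestamp))
}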
My doubts are:
Are both the block cache and the MemStore always checked for the data, or is the MemStore ignored if the data is found in the block cache?
If the MemStore is not checked (because the data was found in the block cache), how will the client get the latest value if there was an edit sitting in the MemStore?
I created a new table, added one row, and issued a get command to fetch the data. I got the data back, but I didn't see any change in the cache hits and reads of the block cache. Why?
I know these are multiple questions, but they are all linked to the read merge and HBase caching. I need clarity on these concepts and could not find it in the documentation.

Related

How does HBase update or invalidate the block cache?

I am trying to understand the read and write paths of HBase. When an update of a specific row is done via a put command, the data is written to the MemStore buffer. But let us say that, for that key, an old value was already present in the block cache.
At this point a value X is present in the block cache and the new value Y is present in the MemStore buffer. If I execute a read command, I get Y. But isn't X the expected value? As per my understanding, whenever a read comes in, the block cache is checked before the MemStore buffer.
Is my understanding wrong, or is there an intermediate step where the block cache gets updated or invalidated?
Most of the time this interaction is missing from the docs. As per my understanding, before the MemStore is updated, the block (if present in the block cache) is invalidated, to avoid the case you are highlighting.
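
Either way, one way to see the behaviour described above is a minimal sketch using the standard HBase client API from Scala (the table, family and qualifier names are made up):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("my_table"))

// Write X, then overwrite the same cell with Y.
table.put(new Put(Bytes.toBytes("row1"))
  .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("X")))
table.put(new Put(Bytes.toBytes("row1"))
  .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("Y")))

// Regardless of whether X has already been flushed to an HFile (and its block cached),
// the get returns the newest cell by timestamp, i.e. Y.
val result = table.get(new Get(Bytes.toBytes("row1")))
println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))))

table.close()
conn.close()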

What (really) happens in HDFS during a Hive Update?

Here is the situation:
HDFS is known to be append-only (no update per se).
Hive writes its data to its warehouse, which is located in HDFS.
Updates can be performed in Hive.
This implies that new data is written, and the old data should somehow be marked as deprecated and later wiped out at some point.
I searched but have not managed to find any information about this so far.
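
My tentative understanding is that, with transactional (ACID) tables, Hive writes the updated rows into new delta files next to the existing base files, and a background compaction later merges them and drops the superseded rows. A minimal way to peek at that layout from Scala with the Hadoop FileSystem API (the warehouse path and table name below are assumptions; adjust them to your environment):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical location of a transactional table inside the Hive warehouse.
val tableDir = new Path("/user/hive/warehouse/mydb.db/my_acid_table")
val fs = FileSystem.get(new Configuration())

// After an UPDATE you would expect to see base_* and delta_* directories here;
// a major compaction eventually rewrites everything into a fresh base_* directory.
fs.listStatus(tableDir).foreach(status => println(status.getPath.getName))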

Would a Spark DataFrame read from the external source on every action?

In a Spark shell I use the code below to read from a CSV file:
val df = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") // spark here is the Spark session
df.show()
Assume this displays 10 rows. If I add a new row to the CSV by editing the file, would calling df.show() again show the new row? If so, does that mean the DataFrame reads from the external source (in this case a CSV file) on every action?
Note that I am neither caching the DataFrame nor recreating it using the Spark session.
After each action, Spark forgets about the loaded data and any intermediate values you used along the way.
So, if you invoke 4 actions one after another, it computes everything from the beginning each time.
The reason is simple: Spark works by building a DAG, which describes the path of operations from reading the data to the action, and then it executes it.
That is why cache and broadcast variables are there. The onus is on the developer to cache the data or DataFrame if they know they are going to reuse it N times.
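
For example, a minimal sketch (same CSV path as in the question):

// Mark the DataFrame for caching so repeated actions reuse the in-memory copy
// instead of re-reading the CSV from disk each time.
val df = spark.read.option("header", "true").csv("/opt/person.csv")
df.cache()   // lazy: the cache is populated by the first action
df.show()    // first action reads the file and fills the cache
df.show()    // later actions are served from the cached data, not the file

Note that once it is cached, editing the CSV on disk will no longer be reflected in df.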
TL;DR: A DataFrame is no different from an RDD here. You can expect the same rules to apply.
With a simple plan like this the answer is yes. It will read the data on every show, although if the action doesn't require all the data (as is the case here) it won't read the complete file.
In the general case (complex execution plans), data can be accessed from the shuffle files.
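
You can check this yourself with df.explain(), which prints the physical plan: without caching it bottoms out in a scan of the CSV file, while after caching (plus one action to materialize it) the plan reads from an in-memory relation instead.

df.explain()   // the plan ends in a file scan of /opt/person.csv, so every action goes back to the file

df.cache()
df.show()      // materializes the cache
df.explain()   // the plan now reads from the cached in-memory relation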

How exactly does the Cassandra read procedure work?

I have a little experience with Cassandra, but I have one question regarding the Cassandra read process.
Suppose we have 7 SSTables for a given table in our Cassandra DB. Now, if we perform a read query whose data is not cached in the memtable, Cassandra will look into the SSTables. My question is:
During this process, will Cassandra load all the SSTables (7) into the memtable, or will it just look into all the SSTables and load only the relevant rows into memory instead of loading the SSTables whole?
Thanks in advance!
And please do correct me if I have interpreted something wrong.
It would also be great if someone could explain or point me to better resources on how SSTables work.
"During this process, will Cassandra load all the SSTables (7)?"
No. Cassandra would not load all 7 SSTables. Each SSTable has a Bloom filter (kept in memory) that indicates whether the data might be present in that SSTable.
If the Bloom filter indicates that the data might be in the SSTable, Cassandra looks into the partition key cache and, from there, uses the compression offset map (also in memory) to locate the compressed block that holds the data we are looking for.
If the key is found in the partition key cache, the compressed block is then read (I/O) to get the data.
If it is not found there, Cassandra looks into the partition summary to get the location of the index entry, reads that location (I/O) into memory, and then continues with the compression offset map flow described above.
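
As a rough mental model only (plain Scala, not Cassandra's real code; every name below is invented), the per-SSTable decision looks roughly like this:

// Toy model of the per-SSTable lookup: the Bloom filter lets us skip an SSTable
// entirely; otherwise the key cache (or the partition summary + index) gives an
// offset, and only that one block is read from disk.
case class SSTable(mightContain: String => Boolean,       // Bloom filter (in memory)
                   keyCache: Map[String, Long],           // partition key cache
                   indexLookup: String => Long,           // partition summary + index (extra I/O)
                   readBlockAt: Long => Option[String])   // disk read of a single block

def readPartition(key: String, sstables: Seq[SSTable]): Seq[String] =
  sstables.flatMap { sst =>
    if (!sst.mightContain(key)) None                      // skip this SSTable, no disk I/O at all
    else {
      val offset = sst.keyCache.getOrElse(key, sst.indexLookup(key))
      sst.readBlockAt(offset)                             // read only the relevant block
    }
  }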
To start with, this Cassandra Reads link should help; it depicts the read path pictorially.
And one more thing: there is also a row cache, which holds hot (frequently accessed) rows, and a hit there means the SSTables are not touched at all.
Go through this rowcache link to understand the row cache and the partition key cache.
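
If you want to experiment with those caches, the caching option can be set per table. A minimal sketch using the DataStax Java driver from Scala (the keyspace and table names are placeholders; the row cache also has to be enabled in cassandra.yaml):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Keep all partition keys in the key cache and cache up to 100 rows per partition
// in the row cache.
session.execute(
  "ALTER TABLE my_ks.my_table WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}")

session.close()
cluster.close()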
Another great presentation, shared by Jeff Jirsa, is Understanding Cassandra Table Options. It is really worth going through.
On a different note, there is also compaction, which happens periodically to reduce the number of SSTables and to delete rows based on tombstones.

Need help in understanding the HBase Read Path

I've been doing some research on HBase and I'm currently finding it challenging to understand how the HBase read path works. I have a basic understanding of how it works, but I don't have a clear understanding of how it reads multiple HFiles while checking Bloom filters. What is the purpose of metablocks, and how does HBase use them when reading data? What is the purpose of the indexes in HFiles, and how are they used?
Hence I need your help in understanding these concepts.
Your time is much appreciated. Thanks.
If there is more than one HFile at the time of a read, HBase will check whether the row in question is there or not. If it is, HBase will read that row from all of those HFiles (and also from the MemStore), so that the client always gets the latest data. I'm sorry, I didn't quite get the block filters part. Could you please point me to the source where you read about this? That will help me give you a complete answer. (Do you mean Bloom filters?)
The purpose of a metablock is to hold a large amount of data. Metablocks are used by the HFile to store a Bloom filter, and a string key is associated with each metablock. Metablocks are kept in memory until HFile.close() is called.
An index is written for the metablocks to make reads faster. These indices contain n records (where n is the number of blocks), each with block information (block offset, size and first key).
At the end, a fixed file trailer is written to the HFile. It contains offsets and counts for all the HFile indices, the HFile version, the compression codec, etc. Now, when a read starts, HFile.loadFileInfo() gets called first, and the file trailer that was written earlier is loaded into memory along with all the indices; this allows keys to be queried efficiently. Then, with the help of an HFileScanner, the client seeks to a specified key and iterates from there to read the data.
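
Since Bloom filters are what allow a get to skip HFiles that cannot contain the row, here is a small sketch of enabling a row-level Bloom filter on a column family using the HBase 2.x admin API from Scala (the table and family names are made up; this is illustrative, not a prescription):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.regionserver.BloomType
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = conn.getAdmin

// ROW Bloom filters let a get skip any HFile whose filter says the row key cannot be present,
// so fewer files (and fewer blocks) need to be touched per read.
val cf = ColumnFamilyDescriptorBuilder
  .newBuilder(Bytes.toBytes("cf"))
  .setBloomFilterType(BloomType.ROW)
  .build()

admin.createTable(
  TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table")).setColumnFamily(cf).build())

admin.close()
conn.close()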
I would like to point you to the links that helped me understand these things. Hopefully you'll find them useful.
Link 1: Apache HBase I/O – HFile (Cloudera)
Link 2: HBase I/O: HFile (th30z)
Link 3: Scanning in HBase
HTH
