Need help in understanding HBase read path - hadoop

I've been doing some research on HBase and I'm currently finding it challenging to understand how the HBase read path works. I have a basic understanding of it, but I don't have a clear understanding of how it reads multiple HFiles while checking Bloom filters. What is the purpose of metablocks, and how does HBase use them when reading data? What is the purpose of indexes in HFiles, and how are they used?
Hence I need your help in understanding this concept.
Your time is much appreciated. Thanks

If there is more than one HFile at the time of a read, HBase will check whether the row in question is there or not. If it is, HBase will read that row from all the HFiles (and also from the MemStore), so that the client always gets the latest data. I'm sorry, I didn't quite get the block filters thing. Could you please point me to the source where you read about this? That will help me give you a complete answer. (Do you mean Bloom filter?)
The purpose of a metablock is to keep a large amount of data. HFile uses metablocks to store a Bloom filter, and a string key is associated with each metablock. Metablocks are kept in memory until HFile.close() is called.
An index is written for the metablocks to make reads faster. These indices contain n records (where n is the number of blocks), each carrying block information (block offset, size, and first key).
At the end, a fixed file trailer is written to the HFile. It contains offsets and counts for all the HFile indices, the HFile version, the compression codec, etc. When a read starts, HFile.loadFileInfo() is called first, and the file trailer written earlier is loaded into memory along with all the indices, which allows keys to be queried efficiently. Then, with the help of HFileScanner, the client seeks to a specified key and iterates from there to read the data.
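To make the Bloom filter part a bit more concrete: Bloom filters are configured per column family, and every HFile flushed for that family then carries one, so a Get can skip HFiles that certainly do not contain the requested row. Below is a minimal sketch, assuming the HBase 1.x Java client API; the table name "orders" and family "d" are hypothetical placeholders.

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.regionserver.BloomType

object CreateTableWithBloomFilter {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin

    // "orders" and column family "d" are hypothetical names.
    val tableDesc = new HTableDescriptor(TableName.valueOf("orders"))
    val family = new HColumnDescriptor("d")
    // ROW Bloom filter: each flushed HFile stores a filter over row keys,
    // so a read can skip HFiles that cannot contain the row.
    family.setBloomFilterType(BloomType.ROW)
    tableDesc.addFamily(family)
    admin.createTable(tableDesc)

    admin.close()
    connection.close()
  }
}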
I would like to point you to the links that helped me understand these things. Hopefully you'll find them useful.
Link 1: Apache HBase I/O – HFile (Cloudera)
Link 2: HBase I/O: HFile (th30z)
Link 3: Scanning in HBase
HTH

Related

Data format and database choices with Spark/Hadoop

I am working on structured data (one value per field, the same fields for each row) that I have to put in a NoSQL environment with Spark (as the analysis tool) and Hadoop. However, I am wondering what format to use. I was thinking about JSON or CSV, but I'm not sure. What do you think, and why? I don't have enough experience in this field to decide properly.
Second question: I have to analyse this data (stored in HDFS). So, as far as I know, I have two possibilities to query it (before the analysis):
Direct reading and filtering. I mean that it can be done with Spark, for example:
data = sqlCtxt.read.json(path_data)
Use HBase/Hive to make a proper query and then process the data.
So I don't know what the standard way of doing all this is and, above all, which will be the fastest.
Thank you in advance!
Use Parquet. I'm not sure about CSV, but definitely don't use JSON. In my personal experience, JSON with Spark was extremely slow to read from storage; after switching to Parquet my read times were much faster (e.g. some small files that took minutes to load as compressed JSON now take less than a second as compressed Parquet).
On top of improving read speeds, compressed Parquet can be split by Spark when reading, whereas compressed JSON cannot. What this means is that Parquet can be loaded onto multiple cluster workers, whereas JSON will just be read onto a single node with one partition. That isn't a good idea if your files are large: you'll get out-of-memory exceptions. It also won't parallelise your computations, so you'll be executing on one node. That isn't the 'Sparky' way of doing things.
Final point: you can use SparkSQL to execute queries on stored parquet files, without having to read them into dataframes first. Very handy.
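A minimal sketch of that workflow, assuming Spark 2.x and placeholder HDFS paths and column names: read the JSON once, rewrite it as Parquet, then query the Parquet files in place with SparkSQL.

import org.apache.spark.sql.SparkSession

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

    // One-off conversion: read the JSON data and rewrite it as Parquet.
    val df = spark.read.json("hdfs:///data/orders_json")
    df.write.mode("overwrite").parquet("hdfs:///data/orders_parquet")

    // Query the Parquet files directly, without an explicit load step.
    spark.sql(
      "SELECT * FROM parquet.`hdfs:///data/orders_parquet` WHERE amount > 100"
    ).show()

    spark.stop()
  }
}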
Hope this helps :)

Spark Streaming: get pre-computed features

I am trying to use Spark Streaming to process an order stream, and I have some previously computed features for each buyer_id of the orders in the stream.
I need to get these features while the Spark Streaming job is running.
Right now I store the buyer_id features in a Hive table, load them into an RDD, and then use
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
Another way to deal with this might be to save the features into an HBase table and fire a Get for every buyer_id.
Which one is better? Or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, this operation will be repeated at each batch interval.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program into a broadcast variable (see the sketch below) or, even better, into a final variable. Either that, or create an RDD before the DStream and keep it as a reference (which looks like what you are doing now), but remember to cache it (as long as you have enough space).
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once in a foreachPartition and save it in a local variable.
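A rough sketch of the broadcast option under the assumptions above (the features fit in memory and are loaded once before the streaming context starts). The loader, the socket source, and the field layout of the order lines are all placeholders.

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OrderStreamEnrichment {

  // Stand-in for loading the pre-computed features from Hive: buyer_id -> features.
  def loadBuyerFeatures(sc: SparkContext): Map[String, Array[Double]] = Map.empty

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()                     // configuration omitted
    val ssc = new StreamingContext(sc, Seconds(10))

    // Load ONCE, before the streaming context starts, then broadcast.
    val featuresBc = sc.broadcast(loadBuyerFeatures(sc))

    val orderStream = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val enriched = orderStream.map { line =>
      val buyerId = line.split(",")(0)              // assumes buyer_id is the first field
      (buyerId, featuresBc.value.get(buyerId))      // local map lookup, no per-batch shuffle
    }
    enriched.print()

    ssc.start()
    ssc.awaitTermination()
  }
}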

How exactly does the Cassandra read procedure work?

I have a little experience with Cassandra, but I have one query regarding the Cassandra read process.
Suppose we have 7 SSTables for a given table in our Cassandra DB. Now, if we perform a read query whose data is not cached in the memtable, Cassandra will look into the SSTables. My question is:
During this process, will Cassandra load all the SSTables (7) into the memtable, or will it just look into all the SSTables and load only the relevant rows into the memtable instead of loading whole SSTables?
Thanks in advance!!
And please do correct me if I have interpreted something wrong.
It would also be great if someone could point me to better resources on how SSTables work.
During this process will cassandra load all the sstables(7)
No. Cassandra wouldn't load all 7 SSTables. Each SSTable has a Bloom filter (in memory) that indicates the possibility of the data being in that SSTable.
If the Bloom filter indicates that the data may be in the SSTable, Cassandra looks into the partition key cache and gets the compression offset map (in memory) to retrieve the compressed block that holds the data we are looking for.
If the key is found in the partition key cache, the compressed block is read (I/O) to get the data.
If it is not found, Cassandra looks into the partition summary to get the location of the index entry, reads that location (I/O) into memory, and then continues with the compression offset map flow described above.
To start with, I think this Cassandra Reads link should help; it depicts the read path pictorially.
One more thing: there is also a row cache, which contains hot (frequently accessed) rows; if the row is found there, the read does not hit or load the SSTables at all.
Go through this row cache link to understand the row cache and the partition key cache.
Another great presentation, shared by Jeff Jirsa, is Understanding Cassandra Table Options. It is really worth going through.
On a different note, compaction happens periodically to reduce the number of SSTables and to delete rows based on tombstones.
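To make the caching and Bloom filter knobs concrete, here is a small hedged sketch using the DataStax Java driver from Scala. The keyspace, table name, and values are placeholders; these table options simply tune how often a read has to touch SSTables at all.

import com.datastax.driver.core.Cluster

object TuneReadPath {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")    // hypothetical keyspace

    // Lower false-positive chance => bigger Bloom filter, fewer wasted SSTable reads.
    session.execute("ALTER TABLE orders WITH bloom_filter_fp_chance = 0.01")

    // Keep a few hot rows per partition in the row cache so repeated reads can be
    // served without touching any SSTable (the row cache itself must also be
    // enabled in cassandra.yaml).
    session.execute("ALTER TABLE orders WITH caching = {'keys': 'ALL', 'rows_per_partition': '10'}")

    cluster.close()
  }
}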

MapReduce: which underlying data structure is used?

I am wondering, if such large datasets are used in Hadoop MapReduce, what data structures are used by Hadoop. If possible, could somebody please provide a detailed view of the underlying data structures in Hadoop?
HDFS is the default underlying storage platform of Hadoop.
It's like any other file system in the sense that it does not care what structure the files have. It only ensures that files will be saved in a redundant fashion and available for quick retrieval.
So it is totally up to you, the user, to store files with whatever structure you like inside them.
A MapReduce program simply gets the file data fed to it as input: not necessarily the entire file, but parts of it, depending on the InputFormat, etc. The map program can then make use of the data in whatever way it wants.
Hive, on the other hand, deals with tables (columns/rows), and you can query them in a SQL-like fashion using HiveQL.
Thanks to all of you.
I got the answer to my question. The underlying HDFS uses blocks as the storage unit; a detailed description of this, along with the network streaming concepts, is given in the book below.
All the details are available in the third chapter of Hadoop: The Definitive Guide.
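If you want to see those blocks for yourself, here is a small sketch (assuming the HDFS configuration is on the classpath; the file path is a placeholder) that lists the blocks backing a file and the datanodes holding each replica.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ShowBlocks {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/data/somefile.txt")       // placeholder path
    val status = fs.getFileStatus(path)

    // Each entry is one block: its offset and length within the file,
    // plus the datanodes that hold a replica of it.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
    fs.close()
  }
}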

Running an MR job on a portion of an HDFS file

Imagine you have a big file stored in HDFS that contains structured data. Now the goal is to process only a portion of the data in the file, like all the lines where the second column value is between so and so. Is it possible to launch the MR job such that HDFS only streams the relevant portion of the file, versus streaming everything to the mappers?
The reason is that I want to speed up the job by working only on the portion that I need. One approach would probably be to run an MR job to create a new filtered file, but I am wondering if one can avoid that.
Please note that the goal is to keep the data in HDFS; I do not want to read from and write to a database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
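A small sketch of that mapper-side filter (new MapReduce API). The tab delimiter, column index, and range bounds are assumptions; rows outside the range are simply never emitted, even though every mapper still reads its whole split.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class RangeFilterMapper extends Mapper[LongWritable, Text, Text, Text] {
  private val lower = 100   // assumed bounds on the second column
  private val upper = 200

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    val cols = value.toString.split("\t")           // assumed tab-delimited rows
    if (cols.length > 1) {
      val v = cols(1).trim.toInt
      // Emit only the rows we are interested in; everything else is dropped here.
      if (v >= lower && v <= upper) context.write(new Text(cols(0)), value)
    }
  }
}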
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is to look at the way Hive solves this problem. The data is in "tables", which are really just metadata about files on disk. Hive allows you to set the columns on which a table is partitioned. This creates a separate folder for each partition, so if you were partitioning a table by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside the date directory would be your actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
The bottom line is that if you want functionality like this, you either want to move to a higher-level language (Hive/Pig) or you need to roll your own solution.
A big part of the processing cost is parsing the data to produce the key-value pairs fed to the mapper. We usually create one Java object per value, plus some container. This is costly both in terms of CPU and garbage collector pressure.
I would suggest a solution "in the middle". You can write an input format that reads the input stream and skips non-relevant data at an early stage (for example, by looking at the first few bytes of each line). As a result you will still read all the data, but actually parse and pass to the mapper only a portion of it.
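A rough sketch of that "in the middle" idea, assuming plain text lines and a placeholder prefix test: a record reader that discards non-relevant lines before the mapper ever sees them, wrapped in an input format you would then set with job.setInputFormatClass(classOf[SkippingInputFormat]).

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{LineRecordReader, TextInputFormat}

// Skips lines that fail a cheap prefix check, so they are never passed to the mapper.
class SkippingRecordReader extends LineRecordReader {
  override def nextKeyValue(): Boolean = {
    while (super.nextKeyValue()) {
      // Placeholder test: keep only lines starting with the wanted date.
      if (getCurrentValue.toString.startsWith("2011-12-01")) return true
    }
    false
  }
}

class SkippingInputFormat extends TextInputFormat {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    new SkippingRecordReader()
}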
Another approach I would consider is to use the RCFile format (or another columnar format), taking care that the relevant and non-relevant data sit in different columns.
If the files that you want to process have some unique attribute in their filename (like an extension or a partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and "_xxx" files/dirs, but you can extend this with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this, since it breaks the "one block per mapper" paradigm, but sometimes that is acceptable. It can sometimes take more effort to "do it right", especially if you're dealing with a small amount of data and the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.
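For completeness, a hedged sketch of that setInputPathFilter approach (new MapReduce API); the filename suffix and input directory are placeholders for whatever naming convention your files actually follow.

import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Only files whose names end with the wanted suffix become input splits at all.
class DatePathFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getName.endsWith("-2011-12-01")
}

object JobSetup {
  def configure(job: Job): Unit = {
    FileInputFormat.addInputPath(job, new Path("/mytable"))           // placeholder input dir
    FileInputFormat.setInputPathFilter(job, classOf[DatePathFilter])
  }
}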
