Hadoop and HBase integration - hadoop

I am new to big data technologies and have a question about how HBase is integrated with Hadoop. What is meant by "HBase sits on top of HDFS"? My understanding is that HDFS is a collection of structured and unstructured data distributed across multiple nodes, while HBase holds structured data.
How is HBase integrated with Hadoop to provide real-time access to the underlying data? Do we have to write special jobs to build indexes and the like? In other words, is there an additional layer between HBase and HDFS that holds the data in a structure HBase understands?

HDFS is a distributed filesystem; one can do most regular FS operations on it, such as listing files in a directory, writing a regular file, or reading a part of a file. It's not simply "a collection of structured or unstructured data" any more than your EXT4 or NTFS filesystems are.
HBase is an in-memory key-value store that can persist to HDFS (this isn't a hard requirement; you can run HBase on any distributed filesystem). For any read request for a key, HBase first checks its runtime memory caches to see if it has the value cached, and otherwise visits its stored files on HDFS to seek to and read out the specific value. HBase offers various configurations to control how the cache is utilised, but its speed comes from a combination of caching and indexed persistence (fast, seek-based file reads).
HBase's file-based persistence on HDFS indexes the keys automatically as it writes, so no manual indexing is needed by its users. These files are regular HDFS files, but specialised in format for HBase's usage, known as HFiles.
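To make the write and read paths concrete, here is a minimal sketch using the standard HBase Java client; the table "users" and the column family "info" are assumptions and would need to exist already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Write: the row key is indexed automatically; the value lands in the memstore
            // and is later flushed to an HFile on HDFS.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@b.com"));
            table.put(put);

            // Read: HBase checks its in-memory caches first, then seeks into HFiles if needed.
            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        }
    }
}
```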
These articles are slightly dated, but are still very reflective of the architecture HBase uses: http://blog.cloudera.com/blog/2012/06/hbase-write-path/ and http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/, and should help if you want to dig deeper.

HDFS is a distributed file system, and HBase is a NoSQL database that depends on the HDFS filesystem to store its data.
You should read up on these technologies, since your structured/unstructured comparison is not correct.
Update
You should check out the Google File System, MapReduce, and Bigtable papers if you are interested in the origins of these technologies.
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37, No. 5. ACM, 2003.
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.

It's easy to understand:
HDFS is a distributed filesystem and provides reads and writes through an append-only model.
HBase is a NoSQL database built on top of the HDFS filesystem, on which it depends.
You can read more about this here: Apache HBase documentation

Related

Can you use HDFS as your principal storage?

Is it reliable to save your data in Hadoop and consume it using Spark/Hive, etc.?
What are the advantages of using HDFS as your main storage?
HDFS is only as reliable as the NameNode(s) that maintain the file metadata. You should set up NameNode HA, take frequent snapshots of the metadata, and store those externally, away from HDFS.
If all NameNodes are unavailable, or their metadata storage is corrupted, you'll be unable to read the HDFS DataNode data, even though those files themselves are intact and highly available.
Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports fail-over and HA. HDFS replicates data across several DataNodes according to the replication factor you have chosen. Hive is built on top of Hadoop and therefore stores its data in HDFS as well, leveraging the HA benefits of HDFS.
Hive utilizes predicate pushdown, which provides huge performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even further (again via predicate pushdown).
Hive provides very easy access to data via HQL (Hive Query Language), which is a SQL-like language.
Hive works very well with Spark, and you can combine the two, i.e. retrieve Hive data into DataFrames and save DataFrames back into Hive (see the sketch after these considerations).
HDFS/HBase:
Hive is a warehouse system used for data analysis, so Hive CRUD operations are relatively slower than direct access to HDFS files (or to HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it via Hive. If you need fast storage (or fast insert queries) and you don't do analysis on large datasets, then you should prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra layer of Hive and access the HDFS files directly.
Your team decides not to use SQL.
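On the Spark point above, here is a minimal sketch of reading from and writing to Hive through Spark's Java API. It assumes a Spark build with Hive support, a reachable Hive metastore, and a hypothetical sales.orders table.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveWithSparkSketch {
    public static void main(String[] args) {
        // Requires Spark compiled with Hive support and access to the Hive metastore.
        SparkSession spark = SparkSession.builder()
                .appName("hive-spark-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Read a (hypothetical) Hive table into a DataFrame; predicate pushdown applies
        // when the underlying format (e.g. Parquet/ORC) supports it.
        Dataset<Row> orders = spark.sql(
                "SELECT customer_id, amount FROM sales.orders WHERE amount > 100");

        // Save a DataFrame back into Hive as a table.
        orders.write().mode("overwrite").saveAsTable("sales.large_orders");

        spark.stop();
    }
}
```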
Related post:
When to use Hadoop, HBase, Hive and Pig?

Does Hadoop HBase support self-healing data blocks?

HDFS supports a mechanism called 'self-healing'. As far as I understand, this means that when a file (or rather a data block) is written into HDFS, the block is replicated over a cluster of DataNodes. HDFS verifies the consistency of the data blocks across all nodes and automatically detects inconsistent data, which is then replicated again into a new data block. This is a feature I am looking for.
Now, HBase is based on HDFS. As far as I understand, HBase is optimized for random access to 'smaller' datasets (of only a few MB). HBase also supports primary keys and a query language. This is also what I am looking for.
My question is: does HBase still support the 'self-healing' feature of HDFS, or is this lost because of its different, relational-database-like approach?

Hadoop vs. NoSQL-Databases

As I am new to big data and the related technologies, my question is, as the title implies:
When would you use Hadoop, and when would you use some kind of NoSQL database, to store and analyse massive amounts of data?
I know that Hadoop is a framework and that Hadoop and NoSQL differ.
But you can store lots of data with Hadoop on HDFS and also with NoSQL DBs like MongoDB, Neo4j...
So maybe the choice between Hadoop and a NoSQL database depends on whether you just want to analyse data or just want to store data?
Or is it just that HDFS can store, let's say, raw data, while a NoSQL DB is more structured (more structured than raw data and less structured than an RDBMS)?
Hadoop is an entire framework, one of whose components can be a NoSQL database.
Hadoop generally refers to a cluster of systems working together to analyze data. You can take data from a NoSQL store and process it in parallel using Hadoop.
HBase is a NoSQL database that is part of the Hadoop ecosystem. You can use other NoSQL databases too.
Your question is misleading: you are comparing Hadoop, which is a framework, to a database...
Hadoop contains a lot of features (including the NoSQL database HBase) in order to provide a big data environment. If you have a massive quantity of data you will probably use Hadoop (for the MapReduce functionality or the data-warehouse capabilities), but that's not certain; it depends on what you are processing and how you want to process it. If you are just storing a lot of data and don't need other features (batch data processing, data transformations, ...), a simple NoSQL database is enough.

Please clarify my understanding of Hadoop/HBase

I have been reading white papers and watching YouTube videos for half the day now, and I believe I have a proper understanding of the technology, but before I start my project I want to make sure it's right.
So with that, here's what I think I know.
As I understand the architecture of Hadoop and HBase, they pretty much model out like this:
----------------------------------------
|               MapReduce              |
----------------------------------------
| Hadoop | <-- hbase export -- | HBase |
|        | --- apache pig ---> |       |
----------------------------------------
|                 HDFS                 |
----------------------------------------
In a nutshell, HBase is a completely different DB engine, tuned for real-time updates and queries, that happens to run on HDFS and is compatible with MapReduce.
Now, assuming the above is correct, here is what else I think I know.
Hadoop is designed for big data from start to finish. The engine uses a distributed, append-only storage model, which means you cannot delete data once it has been inserted. To access the data you can use MapReduce, the HDFS shell, or the HDFS API.
Hadoop does not like small chunks and was never intended to be a real-time system. You would not want to store a single person and address per file; you would instead store a million people and addresses per file and insert the large file.
HBase, on the other hand, is a fairly typical NoSQL database engine that in spirit compares to CouchDB, RavenDB, etc. The notable difference is that it is built on Hadoop's HDFS, allowing it to scale reliably to sizes limited only by your wallet.
Hadoop is a combination of a file system (HDFS) and Java APIs to perform computation on HDFS. HBase is a NoSQL database engine that uses HDFS to efficiently store data across a cluster.
To build a MapReduce job that accesses data from both Hadoop and HBase, one would be best off using HBase export to push the HBase data into Hadoop and writing the job to process that data, but a MapReduce job can access both systems, one at a time.
You must be very careful when designing your HBase tables, as HBase does not natively support indexing arbitrary fields; it indexes only the primary (row) key. Many tips and tricks help work around this fact.
OK, so if I'm still accurate to this point, this would be a valid use case:
You build the site with HBase. You use HBase the same way you would any other NoSQL store or RDBMS to build out your functionality. Once that's done, you put your metrics logging points in the code to record your metrics in, say, log4j. You create a new appender in log4j with rules that say: when the log file reaches 1 GB in size, push it to the Hadoop cluster, delete it, create a new file, and go on with life.
Later, a MapReduce developer can write a routine that uses HBase export to grab a data set from HBase, say a list of user IDs, then go to the logs stored in Hadoop and find the breadcrumb trail for each user through the system for a given timespan.
OK, with all that said, now for the specific question: are statements 1-6 accurate?
Edit one:
I have updated my beliefs above based on the answers received.
You can access the files in HDFS directly via the HDFS shell or the HDFS API.
Correct.
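For completeness (not part of the original answer), a minimal sketch of reading a file with the HDFS Java API; the path is hypothetical and the cluster configuration is taken from the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/logs/app/metrics.log"))))) {  // hypothetical path
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // stream the file contents line by line
            }
        }
    }
}
```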
I am not familiar with CouchDB or RavenDB, but in HBase you cannot have a secondary index, so you must carefully design your row key to speed up your queries. There are a lot of HBase schema design tips on the internet you can Google for; a small example follows.
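A minimal sketch of the kind of row-key design those tips describe: a composite key of user ID plus reversed timestamp, so that a prefix scan can stand in for a secondary index. All names here are illustrative assumptions.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyDesignSketch {
    // Build a composite row key: <userId>#<reversed timestamp>, so all events for one
    // user are adjacent and the newest sort first.
    static byte[] eventRowKey(String userId, long eventTimeMillis) {
        long reversedTs = Long.MAX_VALUE - eventTimeMillis;
        return Bytes.add(Bytes.toBytes(userId + "#"), Bytes.toBytes(reversedTs));
    }

    // A prefix scan over one user's events stands in for the missing secondary index.
    static Scan scanUserEvents(String userId) {
        return new Scan().setRowPrefixFilter(Bytes.toBytes(userId + "#"));
    }

    public static void main(String[] args) {
        Put put = new Put(eventRowKey("user-42", System.currentTimeMillis()));
        put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type"), Bytes.toBytes("click"));
        System.out.println("Row key length: " + put.getRow().length);
    }
}
```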
I think it is more appropriate to call Hadoop a computing engine than a database engine. If you want to import HDFS data into HBase, you can use Apache Pig as stated in this post. If you want to export HBase data to HDFS, you can use the Export utility.
MapReduce is a component of the Hadoop framework, and it does not sit on top of HBase. You can access HBase data from a MapReduce job because HBase uses HDFS for its storage, but I don't think you want to access the HFiles directly from a MapReduce job: the raw file is encoded in a special format, it is not easy to parse, and it might change in future releases.
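As an aside, the usual way to read HBase from MapReduce is through the table API rather than the raw HFiles. Below is a minimal map-only sketch using TableMapReduceUtil; the "users" table, the "info:email" column, and the output path argument are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseTableMapReduceSketch {

    // Receives one (row key, Result) pair per HBase row; emits the row key and one cell value.
    static class DumpMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws java.io.IOException, InterruptedException {
            byte[] email = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            context.write(new Text(Bytes.toString(rowKey.get())),
                          new Text(email == null ? "" : Bytes.toString(email)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-table-scan-sketch");
        job.setJarByClass(HBaseTableMapReduceSketch.class);

        // The scan drives the job's input splits (one per region by default).
        Scan scan = new Scan();
        scan.setCaching(500);          // rows fetched per RPC
        scan.setCacheBlocks(false);    // don't pollute the block cache from a batch job

        TableMapReduceUtil.initTableMapperJob(
                "users", scan, DumpMapper.class, Text.class, Text.class, job);
        job.setNumReduceTasks(0);                 // map-only: dump straight to HDFS text files
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```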
Since HBase and Hadoop are different database engines, one cannot access the data in the other directly. For HBase to get something out of Hadoop it must go through MapReduce, and vice versa.
This is not true, since Hadoop is not a database engine. Hadoop is a combination of a file system (HDFS) and Java APIs to perform computation on HDFS.
Furthermore, MapReduce is not a technology; it is a model by which you can work in parallel on HDFS data.
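To illustrate that model, here is the classic word-count job written against the standard Hadoop MapReduce API; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel, one task per input split stored on HDFS.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive at the same reducer.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```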

Hadoop Ecosystem - What technological tool combination to use in my scenario? (Details Inside)

This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (RDBMS); consider it similar to Amazon's data, i.e., users -> what things they saw/clicked on -> what they bought
Task: Build a recommendation engine (like Amazon's) which shows the user "customers who bought this also bought this" -> "if you liked this, then you might like this" -> and also does some data mining to predict future buying habits. So on and so forth; basically a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user-habit data), I see Hadoop as the ultimate solution. Now the question is, what combination of technological tools to use? I.e.,
HDFS: Underlying file system
HBASE/HIVE/PIG: ?
Mahout: For running some algorithms, which I assume use MapReduce (genetic, clustering, data mining, etc.)
- What am I missing? What about loading RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, do I get a list of results (recommendations), or is there a way to query it directly and report it to the front end I build in .NET?
I think the answer to this question might just be a good discussion for many people like me who want to kick-start their Hadoop experimentation in the future.
For loading data from the RDBMS, I'd recommend looking into BCP (to export from SQL Server to flat files) and then the Hadoop command line for loading into HDFS. Sqoop is good for ongoing data, but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. Nevertheless, you cannot just load data into HDFS (in an arbitrary format) and query it from HBase, unless you write it in the HBase file format (HFile).
HBase has integration with MapReduce.
Pig and Hive also integrate with HBase.
As Chris mentioned, you can use Thrift to perform your queries (get, scan); since this extracts specific user info and not a massive data set, it is more suitable than using MapReduce.
