HDFS supports a mechanism called 'self-healing'. As far as I understand, this means that when a file (or rather a data block) is written into HDFS, the block is replicated across a cluster of datanodes. HDFS verifies the consistency of the data blocks across all nodes and, when it detects a corrupt or missing replica, automatically re-replicates the block. This is a feature I am looking for.
Now, HBase is based on HDFS. As far as I understand, HBase is optimized for random access to 'smaller' datasets (only a few MB). HBase also supports primary keys and a query language. This is also what I am looking for.
My question is: does HBase still support the 'self-healing' feature of HDFS, or is this lost because of its different, more relational-database-like approach?
HDFS is only as reliable as the Namenode(s) that maintain the file metadata. You should set up Namenode HA, take frequent snapshots of the metadata, and store those snapshots externally, away from HDFS.
If all Namenodes are unavailable, or their metadata storage is corrupted, you'll be unable to read the HDFS datanode data, even though the files themselves are intact and highly available.
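That said, the block-level self-healing you describe works regardless of which client writes the data. As a minimal sketch (the namenode address and file path below are placeholders, not anything from the question), you can watch it from the client side with the standard HDFS Java API:

```java
// Minimal sketch: inspect a file's replication factor and current block
// locations with the HDFS client API. Address and path are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // placeholder path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // Each block reports the datanodes currently holding a replica;
        // HDFS re-replicates automatically if a replica is lost or corrupted.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + String.join(", ", loc.getHosts()));
        }
    }
}
```

If a datanode holding one of those replicas dies, the Namenode notices the under-replicated block and schedules a new copy; rerunning the snippet would show the refreshed host list.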
Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports failover and high availability: it replicates the data across several datanodes based on the replication factor you have chosen. Hive is built on top of Hadoop and therefore stores its data in HDFS as well, inheriting HDFS's HA properties.
Hive utilizes predicate pushdown, providing huge performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even more (again through predicate pushdown).
Hive provides very easy access to data via HQL (Hive Query Language), an SQL-like language.
Hive works very well with Spark, and you can combine them both, i.e. read Hive data into DataFrames and save DataFrames back into Hive.
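As a small illustration of that round trip, here is a hedged sketch using Spark's Java API; the table names are placeholders, and it assumes a Spark build with Hive support and a reachable metastore:

```java
// Minimal sketch of the Hive/Spark round trip: read a Hive table into a
// DataFrame, then save a DataFrame back as a Hive table.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveSparkRoundTrip {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-spark-example")
                .enableHiveSupport() // lets Spark use the Hive metastore
                .getOrCreate();

        // Read a Hive table into a DataFrame...
        Dataset<Row> df = spark.sql("SELECT id, name FROM source_table"); // placeholder table

        // ...and save a DataFrame back as a Hive table.
        df.write().mode("overwrite").saveAsTable("target_table"); // placeholder table

        spark.stop();
    }
}
```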
HDFS/HBase:
Hive is a warehouse system used for data analysis, so Hive CRUD operations are relatively slower than direct access to HDFS files (or to HBase, which is built for fast CRUD operations; see the sketch after these points). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it to Hive. If you need fast storage (or fast insert queries) and you don't run analysis on large datasets, prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra layer of Hive and access the HDFS files (or HBase) directly.
Or the team may simply decide not to use SQL.
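To make the HBase side concrete, here is the sketch promised above: a minimal put/get round trip with the standard HBase client API. The table and column names are assumptions for illustration only:

```java
// Minimal sketch of the direct, low-latency writes and reads HBase is built for.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FastCrudExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) { // placeholder table

            // Insert: a single-row put, typically millisecond-scale.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes("hello"));
            table.put(put);

            // Read back: a point lookup by row key, no scan of the dataset.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload"))));
        }
    }
}
```

Both calls are single-row operations served by one region server, which is why they stay fast even on very large tables.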
Related post:
When to use Hadoop, HBase, Hive and Pig?
I am new to Big Data technologies and I have a question on how HBase is integrated with Hadoop. What does it mean that "HBase sits on top of HDFS"? My understanding is that HDFS is a collection of structured and unstructured data distributed across multiple nodes, while HBase holds structured data.
How is HBase integrated with Hadoop to provide real-time access to the underlying data? Do we have to write special jobs to build indexes and such? In other words, is there an additional layer between HBase and HDFS that holds the data in a structure HBase understands?
HDFS is a distributed filesystem; one can do most regular FS operations on it, such as listing files in a directory, writing a regular file, reading a part of a file, etc. It's not simply "a collection of structured or unstructured data" any more than your EXT4 or NTFS filesystems are.
HBase is an in-memory key-value store which may persist to HDFS (this isn't a hard requirement; you can run HBase on any distributed filesystem). For any key read requested of HBase, it first checks its runtime memory caches to see if it has the value cached, and otherwise visits its stored files on HDFS to seek out and read the specific value. There are various configurations in HBase for controlling how the cache is utilised, but HBase's speed comes from a combination of caching and indexed persistence (fast, seek-based file reads).
HBase's file-based persistence on HDFS does the key indexing automatically as it writes, so there is no manual indexing needed by its users. These files are regular HDFS files, but specialised in format for HBase's usage, known as HFiles.
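You can see for yourself that HFiles are ordinary HDFS files. The sketch below simply walks the HBase data directory; note that /hbase is only the default root (it is set by hbase.rootdir), so treat the path as an assumption:

```java
// Minimal sketch: list the HFiles HBase has persisted, as plain HDFS files.
// Layout is roughly /hbase/data/<namespace>/<table>/<region>/<family>/<hfile>.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListHFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        RemoteIterator<LocatedFileStatus> it =
                fs.listFiles(new Path("/hbase/data"), true); // recursive; default root assumed
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
        }
    }
}
```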
These articles are slightly dated, but are still very reflective of the architecture HBase uses: http://blog.cloudera.com/blog/2012/06/hbase-write-path/ and http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/, and should help if you want to dig deeper.
HDFS is a distributed file system, and HBase is a NoSQL database that depends on the HDFS filesystem to store its data.
You should read up on these technologies, since your structured/unstructured comparison is not correct.
Update
You should check out the Google File System, MapReduce, and Bigtable papers if you are interested in the origins of these technologies.
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003.
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
It's easy to understand:
HDFS is a distributed filesystem and provides reads and writes through an append-only model.
HBase is a NoSQL database that is built on the HDFS filesystem and depends on it.
You can read about this here: Apache HBase documentation
We use MapReduce to bulk-create HFiles that are then incrementally/bulk loaded into HBase. Something I have noticed is that the load is simply an HDFS move command (which does not physically move the blocks of the files).
Since we do a lot of HBase table scans and we have short-circuit reads enabled, it would be beneficial to have these HFiles localized to their respective regions' nodes.
I know that a major compaction can accomplish this, but that is inefficient when the HFiles are small compared to the region size.
HBase uses HDFS as its file system. HBase does not control the data locality of HDFS blocks.
When the HBase API is used to write data, the HBase RegionServer becomes a client to HDFS, and in HDFS, if the client node is also a datanode, a local replica is created. Hence the localityIndex is high when the HBase API is used for writes.
When bulk load is used, the HFiles are already present in HDFS, so HBase just makes those HFiles part of its regions. In this case data locality is not guaranteed.
If you really need high data locality, then rather than bulk loading I would recommend you use the HBase API for writes.
I have been using the HBase API to write to HBase from my MR jobs, and it has worked well so far.
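For reference, a minimal map-only sketch of that pattern is below. The table name, column family, and tab-separated input format are assumptions; the point is that mapper output goes through the normal HBase write path (and thus gets local replicas) rather than being bulk-loaded:

```java
// Minimal map-only MR sketch that writes Puts through the HBase API.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ApiWriteJob {
    static class LineToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2); // assumed: key<TAB>value input
            if (parts.length < 2) return;
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "api-writes");
        job.setJarByClass(ApiWriteJob.class);
        job.setMapperClass(LineToPutMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Routes mapper output straight to the table; a null reducer plus zero
        // reduce tasks makes this a map-only, write-through job.
        TableMapReduceUtil.initTableReducerJob("mytable", null, job); // placeholder table
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```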
I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure over the underlying files in HDFS, so that we can give users querying abilities over the HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but a bit more structured (column-oriented), again over the HDFS file system.
Aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with traditional RDBMSs (though not in every respect), and HQL syntax is almost identical to SQL, which is good for database programmers from a learning perspective. HBase, on the other hand, is completely different, in the sense that it can be queried only on the basis of its row key.
When you design a table in an RDBMS, you follow a structured approach to defining columns, concentrating on the attributes. In HBase the whole design is concentrated around the data, so the table is designed for the queries that will be run against it; the columns are dynamic and can change at runtime (a core feature of NoSQL).
You said: aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce? It is not that simple. When a Hive query is executed, a MapReduce job is created and triggered, and depending on the data size and complexity this can take time, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks like map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the indexed data based on the specified Scan or Get parameters. In other words, it simply acts as a database.
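To illustrate that row-key-driven access, here is a minimal sketch using the HBase 2.x client API; the table name, family, and key range are assumed for illustration:

```java
// Minimal sketch: a bounded Scan that touches only rows whose keys fall
// in [user-100, user-200). No MapReduce job is involved.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyAccess {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // placeholder table

            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user-100"))
                    .withStopRow(Bytes.toBytes("user-200"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```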
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
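As a hedged sketch of that metadata mapping, the snippet below creates an external Hive table over files that already sit in HDFS, via the Hive JDBC driver. The connection URL, schema, and path are placeholders:

```java
// Minimal sketch: define a Hive table that overlays existing HDFS files.
// EXTERNAL means Hive only records metadata; the files stay where they are.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveExternalTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default"); // placeholder URL
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS logs (" +
                "  ts STRING, level STRING, msg STRING) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                "LOCATION '/data/logs'"); // placeholder HDFS path
        }
    }
}
```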
I guess it is also important to note that in recent versions Hive is evolving beyond being an SQL-like way to write map/reduce jobs: with what Hortonworks calls the "Stinger initiative", they have added a dedicated file format (ORC) and are improving Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop, i.e. a relatively fast way to run analytic queries over data stored on Hadoop.
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that users get SQL-like querying abilities over existing HDFS files, with typical latencies of up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From HBase: The Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a column family storing the actual HTML code, the contents family, as well as others like anchor, which is used to store outgoing links, another one to store inbound links, and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies of the HTML, and is helpful when you want to analyze how often a page changes, for example. The timestamps used are the actual times when they were fetched from the crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster and guarantees redundant storage of the data, but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.
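For completeness, here is a minimal sketch (HBase 2.x client API, names assumed) of how an HBase table is created. Note there is no parameter anywhere to point it at an existing plain HDFS file, because HBase lays out its own HFiles under its root directory:

```java
// Minimal sketch: create an HBase table with one column family.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("webtable")) // placeholder
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("contents"))
                    .build());
        }
    }
}
```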
As per my understanding, both Hive and HBase use HDFS to store their data. When we integrate Hive and HBase:
How is the data moved between them? Or is it that the data doesn't move and is simply reflected? I am interested in two scenarios.
One: Table_1 has data and lives in Hive; Table_2 has data and lives in HBase. Now the integration happens (is this scenario even possible?). How does the data movement happen? Is it from HBase to Hive or from Hive to HBase?
Two: the same setup as scenario one. Where do newly inserted records go?
I am new to HBase and interested in understanding the data movement in detail, with an example.
Please improve the question if needed. Thanks in advance.
HDFS is a distributed file system that is well suited for the storage of large files but does not provide fast individual record lookups.
Hive is simply a SQL-like abstraction for interacting with the data in HDFS.
HBase is also built on top of HDFS. It provides fast reads and writes for large tables. HBase accomplishes this by storing your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
So in both cases, the data resides in HDFS. That's "where it goes."
As for the details of how they work, that's a big subject; you'll need to familiarize yourself with topics such as the Hive metastore, storage handlers, and the HBase API. I believe this tutorial (Part 1 and Part 2) can help you.
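As a small, hedged taste of the storage-handler approach (the table names, column mapping, and connection URL below are assumptions for illustration), this is roughly what mapping a Hive table onto an HBase table looks like through the Hive JDBC driver. No data is copied: Hive reads and writes the HBase table in place through the handler, which also answers the "newly inserted records" question, since they land in HBase.

```java
// Minimal sketch: a Hive table backed by an HBase table via the
// HBaseStorageHandler. Queries against hbase_users go to HBase directly.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveOverHBase {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default"); // placeholder URL
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE hbase_users (key STRING, name STRING) " +
                "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
                "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name') " +
                "TBLPROPERTIES ('hbase.table.name' = 'users')"); // assumed HBase table
        }
    }
}
```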