Can you use HDFS as your principal storage?

Is it reliable to save your data in Hadoop and consume it using Spark/Hive etc.?
What are the advantages of using HDFS as your main storage?

HDFS is only as reliable as the NameNode(s) that maintain the file metadata. You should set up NameNode HA, take frequent snapshots of the NameNode metadata, and store those snapshots externally, away from HDFS (for example, the fsimage can be copied off-cluster with hdfs dfsadmin -fetchImage).
If all NameNodes are unavailable, or their metadata storage is corrupted, you will be unable to read the HDFS DataNode data, even though the underlying blocks themselves are intact and highly available.
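As a rough, hedged sketch of what NameNode HA involves, the relevant hdfs-site.xml properties look roughly like this (the nameservice name "mycluster", the NameNode ids "nn1"/"nn2" and all hostnames are placeholders; consult the HDFS HA documentation for your version):

    <!-- Logical nameservice that clients address instead of a single NameNode -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>namenode1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>namenode2.example.com:8020</value>
    </property>
    <!-- JournalNodes hold the shared edit log so the standby can take over -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>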

Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports fail-over and HA. HDFS replicates the data across several DataNodes based on the replication factor you have chosen. Hive is built on top of Hadoop and therefore stores its data in HDFS as well, leveraging HDFS's HA properties.
Hive supports predicate pushdown, which provides large performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even further (again making use of predicate pushdown).
Hive provides very easy access to data via HQL (Hive Query Language), which is a SQL-like language.
Hive works very well with Spark: you can combine the two, reading Hive data into DataFrames and saving DataFrames back into Hive, for example:
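A minimal sketch of that round trip using the Spark SQL Java API (the database and table names sales_db.sales / sales_db.sales_totals are made up for illustration):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class HiveSparkExample {
        public static void main(String[] args) {
            // enableHiveSupport() lets Spark use the Hive metastore
            SparkSession spark = SparkSession.builder()
                .appName("hive-spark-example")
                .enableHiveSupport()
                .getOrCreate();

            // Read a Hive table into a DataFrame
            Dataset<Row> sales = spark.sql("SELECT customer_id, amount FROM sales_db.sales");

            // Do some work and write the result back as a Hive table
            Dataset<Row> totals = sales.groupBy("customer_id").sum("amount");
            totals.write().mode(SaveMode.Overwrite).saveAsTable("sales_db.sales_totals");

            spark.stop();
        }
    }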
HDFS/HBase:
Hive is a warehouse system intended for data analysis, so Hive CRUD operations are relatively slower than direct access to HDFS files (or HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than writing through Hive (see the HBase write sketch below). If you need fast storage (or fast insert queries) and you don't do any analysis on large datasets, then you should prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra layer of Hive and access the HDFS files directly.
Your team decides not to use SQL.
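For illustration, a minimal sketch of a fast, direct write path using the HBase client API (Java; the table name "events", the column family "d" and the row key/payload are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseFastWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {
                // One Put per incoming event; no MapReduce or SQL layer involved
                Put put = new Put(Bytes.toBytes("event-0001"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                              Bytes.toBytes("{\"temp\": 21.5}"));
                table.put(put);
            }
        }
    }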
Related post:
When to use Hadoop, HBase, Hive and Pig?


Does Hadoop HBase support self healing data blocks?

HDFS supports a mechanism called 'self-healing'. As far as I understand, this means that when a file (or rather a data block) is written into HDFS, the block is replicated across a cluster of DataNodes. HDFS verifies the consistency of the data blocks across all nodes and automatically detects inconsistent data, which is then replicated again into a new data block. This is a feature I am looking for.
Now, HBase is based on HDFS. As far as I understand, HBase is optimized for random access to 'smaller' datasets (of only a few MB). HBase also supports primary keys and a query language. This is also what I am looking for.
My question is: does HBase still support the 'self-healing' feature of HDFS, or is this lost because of its different, relational-database-like approach?

At what data volume should you choose Hadoop over a traditional database?

At what data volume should you choose Hadoop over a traditional database? What is the basic benchmark parameter for choosing a Hadoop system over a traditional database?
There is no specific "size" at which to move from an RDBMS to Hadoop. Two things to know:
They are very different (read on to know more).
The size of data that an RDBMS can handle depends on the capability of the database server.
Traditional databases are RDBMSs (Relational Database Management Systems), where we insert data as rows that get stored in the database. You may alter/query/update the database.
Hadoop is a framework for storing and processing data (large amounts of data). It has two parts: storage (the Hadoop Distributed File System, HDFS) and MapReduce (the processing framework).
Hadoop stores data as files on its filesystem, so updating/altering/querying it the way you would an RDBMS is not possible.
There are SQL wrappers over Hadoop, such as Hive or Impala, but they are not as performant as an RDBMS on (non-big) data.
Even so, many are considering moving from an RDBMS to Hadoop because RDBMSs under-perform with large data (big data). Hadoop can be used as a data store, and queries over it can be run using Hive/Impala. Updates are not readily supported on Hadoop.
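As a small illustration of "Hadoop stores data as files", here is a hedged sketch using the HDFS FileSystem API (Java; the path /data/example/records.csv and the file contents are made up):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFileExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a file: HDFS is write-once/append-only, there is no in-place UPDATE
            Path path = new Path("/data/example/records.csv");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back as a plain byte stream
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }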
There are many pros and cons of using Hadoop over an RDBMS. Read more here or here.

Localizing HFile blocks in HDFS

We use Mapreduce to bulk create HFiles that are then incrementally/bulk loaded into HBase. Something I have noticed is that the load is simply an HDFS move command (which does not physically move the blocks of the files).
Since we do a lot of HBase table scans and we have short circuit reading enabled, it would be beneficial to have these HFiles localized to their respective region's node.
I know that a major compaction can accomplish this, but compactions are inefficient when the HFiles are small compared to the region size.
HBase uses HDFS as its filesystem. HBase does not control the data locality of HDFS blocks.
When the HBase API is used to write data to HBase, the HBase RegionServer becomes a client of HDFS, and if the HDFS client node is also a DataNode, a local block replica is created. Hence, the locality index is high when the HBase API is used for writes.
When bulk load is used, the HFiles are already present in HDFS, so HBase just makes those HFiles part of the regions. In this case data locality is not guaranteed.
If you really need high data locality, then rather than bulk loading I would recommend using the HBase API for writes, for example directly from your MR job as sketched below.
I have been using the HBase API to write to HBase from my MR jobs, and it has worked well so far.
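A rough sketch of that pattern, assuming a text input where each line is "rowkey,value" (the table name "my_table", the column family "cf" and the input format are assumptions; adapt the parsing to your data):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class WriteToHBaseJob {

        // Mapper that turns each input line "rowkey,value" into an HBase Put
        public static class PutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable key, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split(",", 2);
                Put put = new Put(Bytes.toBytes(parts[0]));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(parts[1]));
                context.write(new ImmutableBytesWritable(Bytes.toBytes(parts[0])), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "write-to-hbase");
            job.setJarByClass(WriteToHBaseJob.class);
            job.setMapperClass(PutMapper.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Wires TableOutputFormat to the target table; a null reducer plus
            // zero reduce tasks makes this a map-only job that writes Puts directly
            TableMapReduceUtil.initTableReducerJob("my_table", null, job);
            job.setNumReduceTasks(0);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }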

HBase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure for the underlying files in HDFS, so that the user gets querying abilities on the HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but in a bit more structured (column-oriented) way, again over the HDFS filesystem.
Aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with a traditional RDBMS (but not entirely): HQL syntax is very close to SQL, which is good for database programmers from a learning perspective, whereas HBase is completely different in the sense that it can be queried only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach, defining columns and concentrating on attributes, while in HBase the whole design is concentrated around the data: you design the table according to the queries you need to run, and the columns are dynamic and can change at runtime (a core feature of NoSQL).
You asked "aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce". It is not that simple: when a Hive query is executed, a MapReduce job is created and triggered. Depending on the data size and complexity it may take time, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks such as map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the indexed data based on the specified Scan or Get parameters; it simply acts as a database.
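To make that concrete, a minimal sketch of row-key-based access with the HBase 2.x client API (Java; the table "users", the column family "info" and the row keys are made up for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseLookupExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Point lookup by row key: no MapReduce job is started
                Get get = new Get(Bytes.toBytes("user#42"));
                Result row = table.get(get);
                byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));

                // Range scan over a row-key interval
                Scan scan = new Scan().withStartRow(Bytes.toBytes("user#40"))
                                      .withStopRow(Bytes.toBytes("user#50"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }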
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving beyond being just a SQL-like way to write map/reduce jobs: with what Hortonworks calls the "Stinger initiative" they have added a dedicated file format (ORC) and are improving Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. a relatively fast way to run analytics queries on data stored in Hadoop).
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that the user gets SQL-like querying abilities on existing HDFS files, with typical latencies of up to minutes.
However, for best performance it's recommended to ETL your data into Hive's ORC format, for example:
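As a hedged illustration of both points: an external table over raw HDFS files, then an ETL into ORC, executed over Hive JDBC (Java; the HiveServer2 URL, HDFS path, schema and table names are assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveOrcEtlExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 JDBC endpoint (host, port, database and user are placeholders)
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // Expose existing HDFS files as a queryable table (no data is copied)
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS logs_raw (" +
                             "  ts STRING, level STRING, msg STRING) " +
                             "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                             "LOCATION '/data/logs'");

                // ETL the raw data into ORC for faster analytical queries
                stmt.execute("CREATE TABLE IF NOT EXISTS logs_orc STORED AS ORC " +
                             "AS SELECT * FROM logs_raw");
            }
        }
    }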
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster and it guarantees redundant storage of data, but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

How is data moved or reflected between Hive and HBase in Hive-HBase integration?

As per my understanding, both Hive and HBase use HDFS to store the data. When we integrate Hive and HBase:
How is the data moved between them? Or is it that the data doesn't move and is simply reflected? I am interested in two scenarios.
One: Table_1 has data and it's in Hive; Table_2 has data and it's in HBase. Now the integration happens (is this scenario even possible?).
How does the data movement happen? Is it from HBase to Hive, or from Hive to HBase?
Two: same setup as scenario one. Now, where do newly inserted records go?
I am new to HBase and interested in understanding the data movement in detail, with an example.
Please improve the question if needed. Thanks in advance.
HDFS is a distributed file system that is well suited for the storage of large files but does not provide fast individual record lookups.
Hive is simply a SQL-like abstraction for interacting with the data in HDFS.
HBase is also built on top of HDFS. It provides fast reads and writes for large tables. HBase accomplishes this by storing your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
So in both cases, data reside in HDFS. That's "where they go."
As for the details of how they work, that's a huge topic where you have to familiarize yourself with such topics as the Hive metastore and storage handlers and the HBase API. I believe this tutorial (Part 1 and Part 2) can help you.
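For a concrete (hedged) sketch of a storage handler, this is roughly what a Hive table mapped onto an existing HBase table looks like, created here over Hive JDBC (the HiveServer2 URL, the Hive table name, the column family "cf" and the HBase table name "Table_2" are assumptions for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveHBaseIntegrationExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Hive table mapped onto an existing HBase table: the data stays in
                // HBase's StoreFiles on HDFS, Hive only stores the column mapping
                stmt.execute(
                    "CREATE EXTERNAL TABLE hive_over_hbase (key STRING, value STRING) " +
                    "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
                    "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val') " +
                    "TBLPROPERTIES ('hbase.table.name' = 'Table_2')");

                // Rows inserted through this Hive table land in HBase, and rows
                // written through the HBase API become visible from Hive queries
                stmt.execute("INSERT INTO hive_over_hbase VALUES ('row1', 'hello')");
            }
        }
    }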

Resources