At what data volume should you choose Hadoop over a traditional database? - hadoop

At what data volume should you move from a traditional database to Hadoop? What is the basic benchmark parameter for choosing a Hadoop system over a traditional database?

There is no specific "size" at which you should move from an RDBMS to Hadoop. Two things to know:
They are very different (read on to know more).
The size of data an RDBMS can handle depends on the capability of the database server.
Traditional databases are RDBMSs (Relational Database Management Systems): data is inserted as rows, which are stored in the database, and you may alter/query/update it.
Hadoop is a framework for storing and processing large amounts of data. It has two parts: storage (the Hadoop Distributed File System, HDFS) and MapReduce (the processing framework).
Hadoop stores data as files on its filesystem, so updating/altering/querying it the way you would an RDBMS is not possible.
There are SQL wrappers over Hadoop such as Hive or Impala, but on small (non-big) data they aren't as performant as an RDBMS.
Even so, many consider moving from an RDBMS to Hadoop because an RDBMS under-performs with large data (big data). Hadoop can be used as a data store, and queries over it can be run using Hive/Impala; updates, however, are not readily supported.
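As a hedged sketch of that pattern (the path and column names below are invented, not from the question): files already sitting in HDFS can be exposed to Hive or Impala as an external table and then queried with SQL.

-- Expose CSV files already stored in HDFS as a Hive table (hypothetical path/columns).
CREATE EXTERNAL TABLE sales_raw (
  order_id BIGINT,
  product  STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales/';

-- Analytical queries work well; row-level updates are where Hadoop lags behind an RDBMS.
SELECT product, SUM(amount) AS total_amount
FROM sales_raw
GROUP BY product;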
There are many pros and cons of using Hadoop over an RDBMS.

Related

Can you use HDFS as your principal storage?

Is it reliable to save your data in Hadoop and consume it using Spark/Hive etc.?
What are the advantages of using HDFS as your main storage?
HDFS is only as reliable as the NameNode(s) that maintain the file metadata. You should set up NameNode HA, take frequent snapshots of the metadata, and store those snapshots externally, away from HDFS.
If all NameNodes are unavailable, or their metadata storage is corrupted, you'll be unable to read the HDFS DataNode data, even though the underlying blocks themselves are fine and highly available.
Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports failover and HA. HDFS replicates data across several DataNodes based on the replication factor you have chosen. Hive is built on top of Hadoop, so it stores its data in HDFS as well, leveraging HDFS's high availability.
Hive uses predicate pushdown, which provides large performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even further (see the sketch after this list).
Hive provides very easy access to data via HQL (Hive Query Language), an SQL-like language.
Hive works very well with Spark; you can combine the two, e.g. reading Hive data into DataFrames and saving DataFrames back into Hive.
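As a rough sketch of the predicate-pushdown point above (the table and column names are invented): storing the data as ORC lets Hive skip whole chunks of a table when a query filters on a column.

-- Hypothetical ORC-backed table; ORC keeps min/max statistics per stripe.
CREATE TABLE events_orc (
  event_time TIMESTAMP,
  user_id    BIGINT,
  event_type STRING
)
STORED AS ORC;

-- The filter on event_type can be pushed down into the ORC reader, so Hive can skip
-- stripes that contain no matching rows instead of scanning everything.
SELECT user_id, COUNT(*) AS clicks
FROM events_orc
WHERE event_type = 'click'
GROUP BY user_id;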
HDFS/HBase:
Hive is a warehouse system used for data analysis, so Hive CRUD operations are relatively slower than direct access to HDFS files (or HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it to Hive. If you need fast storage (or fast inserts) and you don't do any analysis on large datasets, then you should prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra layer of Hive and access HDFS files directly.
The team decides not to use SQL.
Related post:
When to use Hadoop, HBase, Hive and Pig?

Why Hive when HDFS already provides data storage?

I have started learning Hadoop. I understood that HDFS provides a distributed storage system and MapReduce is for data processing. Now I am reading about the Hadoop ecosystem.
From the definition of Hive, it is a data warehouse built on Hadoop that provides an SQL-like interface.
My question is: when Hadoop already provides HDFS, which is fault tolerant and distributed, then why Hive? Does Hive replace HDFS?
Does Hive provide only an SQL interface, or storage as well?
Hive does not replace HDFS. Hive provides an SQL-type interface to data that is stored in HDFS; it is basically used for querying and analysing the stored data. Hive in a sense eliminates a lot of boilerplate code that you would have to write if you were using MapReduce. For example, just think of how you would implement different types of joins (left, right, bucketed), a GROUP BY clause, or any other SQL clause in MapReduce, and you will get your answer (your lines of code will easily scale into the hundreds). Hive provides them out of the box; you don't need to write those lengthy MapReduce programs, Hive already does that for you.
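To make that concrete, here is a hedged sketch (the customers and orders tables are invented for illustration, not from the question):

-- Hypothetical tables: a left outer join plus a group by is a few lines of HiveQL,
-- whereas the equivalent hand-written MapReduce code (custom mappers, reducers and
-- join logic) easily runs to hundreds of lines.
SELECT c.customer_id,
       COUNT(o.order_id) AS orders_placed
FROM customers c
LEFT OUTER JOIN orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id;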
One thing to note is that Hive itself uses MapReduce behind the scenes, so any GROUP BY, COUNT, or JOIN is converted into MapReduce jobs. You can, however, switch the execution engine to Tez or Spark.
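For example (assuming a Hive version that ships the Tez/Spark engines; the docs table below is hypothetical), the engine is just a session setting, and EXPLAIN shows the plan Hive generates:

-- Show the MapReduce (or Tez/Spark) plan Hive generates for a query.
EXPLAIN SELECT word, COUNT(*) FROM docs GROUP BY word;

-- Switch the execution engine for the current session (mr, tez, or spark,
-- depending on what your Hive installation supports).
SET hive.execution.engine=tez;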
For your second question: Hive does not provide any storage of its own. It just uses a database (Derby by default; MySQL is a good choice if you want a different one) as a metastore to hold the metadata for the tables, partitions, views, buckets, etc. that you create with Hive (metadata such as table locations, the types of data stored in tables, partition info, created date, modified date, and so on).
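As a small illustration (the table name is hypothetical), you can ask Hive to show what the metastore keeps for a table:

-- Prints the metadata held in the metastore: HDFS location, column types,
-- table type (MANAGED/EXTERNAL), created time, and so on.
DESCRIBE FORMATTED my_table;

-- Lists the partitions registered in the metastore (for a partitioned table).
SHOW PARTITIONS my_table;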
To answer your question in the comments:
Hive can process structured data (CSV, TXT, etc.) and semi-structured data (XML, JSON, Parquet, etc.). It cannot process unstructured data such as audio or video.
Note: semi-structured data can be handled in DDLs (see the sketch below) and can also be loaded into Hive through Spark.
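A hedged sketch of such a DDL for JSON data; it assumes the JsonSerDe that ships with Hive's HCatalog module is on the classpath, and the table, fields and path are made up:

-- Expose JSON files in HDFS as a Hive table (hypothetical names/path);
-- requires the hive-hcatalog-core jar that provides this SerDe.
CREATE EXTERNAL TABLE tweets_json (
  id        BIGINT,
  user_name STRING,
  message   STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/tweets/';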
I encourage you to also learn about external and managed tables in Hive.
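Here is a minimal sketch of that difference (names and paths are made up): dropping a managed table deletes its data, while dropping an external table leaves the HDFS files in place.

-- Managed table: Hive owns the data under its warehouse directory;
-- DROP TABLE removes the files as well.
CREATE TABLE managed_logs (line STRING);

-- External table: Hive only registers metadata over files that already exist in HDFS;
-- DROP TABLE removes the metadata but keeps the files.
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/raw/logs/';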
Happy learning.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a MapReduce job on my data and now I want to query the output so I can put it on my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts the SQL statements into a bunch of MapReduce jobs with stages built in between. Also, Hive is a high-latency system, i.e. depending on your dataset sizes you are looking at minutes to hours (or even days) to process a complicated query.
So, if you want to serve the results of your MapReduce job output on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option to consider is a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data into and out of your HDFS cluster. The video is easy to watch and describes the advantages and disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to data stored in HDFS. Sits on top of HDFS and applies structure to the data. Allows random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes only on the row key. Does not use the MapReduce paradigm.
Impala - Similar to Hive, a high-performance SQL engine for querying vast amounts of data stored in HDFS. Does not use MapReduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. The schema is optional and can be defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit from it most if your data is already organized (for instance, in columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the count for the word "gloubiboulga". You can achieve this via the following code:
-- The table matches the two columns of wordcount.csv;
-- `count` is backticked because it clashes with the built-in function name.
CREATE TABLE data
(
  word STRING,
  `count` INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The file is already on HDFS, so the LOCAL keyword is not needed.
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;

SELECT word, `count` FROM data WHERE word = 'gloubiboulga';
Please note that while this language looks very much like SQL, you'll still have to learn a few things about it.

Hadoop vs. NoSQL-Databases

As I am new to Big Data and the related technologies my question is, as the title implies:
When would you use Hadoop and when would you use some kind of NoSQL-Databases to store and analyse massive amounts of data?
I know that Hadoop is a framework and that Hadoop and NoSQL differ.
But you can save lots of data with Hadoop on HDFS and also with NoSQL-DBs like MongoDB, Neo4j...
So maybe the choice between Hadoop and a NoSQL database depends on whether you want to analyse the data or just store it?
Or is it just that HDFS can store, let's say, raw data, while a NoSQL DB is more structured (more structured than raw data and less structured than an RDBMS)?
Hadoop is an entire framework, of which one of the components can be NoSQL.
Hadoop generally refers to a cluster of systems working together to analyze data. You can take data from NoSQL and process it in parallel using Hadoop.
HBase is a NoSQL store that is part of the Hadoop ecosystem. You can use other NoSQL stores too.
Your question is misleading: you are comparing Hadoop, which is a framework, to a database...
Hadoop contains a lot of features (including a NoSQL database named HBase) in order to provide you with a big data environment. If you have a massive quantity of data, you will probably use Hadoop (for the MapReduce functionality or the data warehouse capabilities), but that is not certain; it depends on what you're processing and how you want to process it. If you're just storing a lot of data and don't need other features (batch data processing, data transformations, ...), a simple NoSQL database is enough.

HBase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure over the underlying files in HDFS, so that we can give the user querying abilities on the HDFS files. Correct me if I'm wrong here?
HBase: again, we create a similar table structure, but in a bit more structured (column-oriented) way, again over the HDFS filesystem.
Aren't they both the same considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with a traditional RDBMS (but not entirely), and HQL syntax is very close to SQL, which is good for a database programmer from a learning perspective. HBase, on the other hand, is completely different in the sense that it can be queried only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach to defining columns, concentrating on attributes; in HBase the entire design is concentrated around the data and the queries you will run, and the columns are dynamic and can change at runtime (a core feature of NoSQL).
You asked whether they are both the same considering the type of job they do, except that Hive runs on MapReduce. It is not that simple: when a Hive query is executed, a MapReduce job is created and triggered. Depending on data size and complexity it may take time, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks such as map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the data indexed by row key based on the specified Scan or Get parameters. In that sense it just acts as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP-oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs.
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
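For the "can be files or HBase" part, here is a hedged sketch of mapping an existing HBase table into Hive via the Hive-HBase storage handler (the table name and column mapping are hypothetical, and the integration jars must be on the classpath):

-- Register an existing HBase table called 'users' as an external Hive table.
CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age")
TBLPROPERTIES ("hbase.table.name" = "users");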
It is also important to note that in recent versions Hive is evolving beyond being just an SQL way to write map/reduce jobs: with what Hortonworks calls the "Stinger initiative", they have added a dedicated file format (ORC) and improved Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. a relatively fast way to run analytic queries on data stored in Hadoop).
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that users get SQL-like querying abilities over existing HDFS files, with typical latencies of up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
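A minimal sketch of that ETL step (table names are hypothetical): copy a raw text-backed table into an ORC-backed one and query the ORC copy from then on.

-- Create an ORC copy of a text-backed table in one statement (CTAS).
CREATE TABLE logs_orc STORED AS ORC AS
SELECT * FROM logs_text;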
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages stored while crawling the Internet.
The row key is the reversed URL of the page, for example org.hbase.www. There is a column family storing the actual HTML code, the contents family, as well as others like anchor, which is used to store outgoing links, another one to store inbound links, and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies of the HTML, and is helpful when you want to analyze how often a page changes, for example. The timestamps used are the actual times when they were fetched from the crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster and guarantees redundant storage of data, but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already existing HDFS file?
No, you can't: internally, HBase stores data in its own HFile format.

Resources