Basic things about Hadoop and Hive

I have recently started working with Hadoop. There is a table named Checkout that I access through Hive. Below are the paths where the data lands in HDFS, along with some other info. What information can I get from reading the three lines below?
Path                                                 Size     Record Count    Date Loaded
/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00   1.13 TB  9,294,245,800   2012-07-05 07:26
/sys/edw/dw_checkout_trans/snapshot/2012/07/03/00   1.13 TB  9,290,477,963   2012-07-04 09:37
/sys/edw/dw_checkout_trans/snapshot/2012/07/02/00   1.12 TB  9,286,199,847   2012-07-03 07:08
So my questions are:
1) First, we load the data into HDFS and then query it through Hive to get the results back, right?
2) Second, looking at the paths above, the one thing that confuses me is: when I query through Hive, will I get data from all three paths above, or only from the most recent one at the top?
As I am new to all of this, I am having a lot of trouble. Can anyone explain where Hive gets its data from? Do we store all the data in HDFS and then use Hive or Pig to get it back out? It would also be great if someone could give a high-level overview of Hadoop and Hive.

I think you need to understand the difference between Hive's native (managed) tables and Hive's external tables.
A Hive native table means that you load the data into Hive, and Hive takes care of how it is stored in HDFS. We usually do not care about the directory structure in this case.
A Hive external table means that we put data in some directory (leaving partitioning aside for the moment) and tell Hive: this is the table's data, please treat it as such. Hive then lets us query it and join it with other external or regular tables. It is our responsibility to add the data, delete it, etc.
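For illustration, here is a minimal sketch of the two styles of DDL (the table and column names are made up for the example, not the real Checkout schema):

    -- Managed (native) table: Hive owns the layout under its warehouse directory.
    CREATE TABLE checkout_managed (
      trans_id BIGINT,
      amount   DOUBLE
    )
    STORED AS ORC;

    -- External table: we point Hive at a directory we manage ourselves,
    -- for example one of the snapshot paths from the question.
    CREATE EXTERNAL TABLE checkout_external (
      trans_id BIGINT,
      amount   DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00';

With an external table, only the data under the LOCATION you give is queried; if each daily snapshot should be queryable, you would typically declare one partition per day instead of a single LOCATION.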

Related

Why Hive when HDFS already provides data storage?

I have started learning Hadoop. I understand that HDFS provides a distributed storage system and MapReduce is for data processing. Now I am reading about the Hadoop ecosystem.
From its definition, Hive is a data warehouse built on Hadoop that provides an SQL-like interface.
My question is: when Hadoop already provides HDFS, which is fault-tolerant and distributed, why Hive? Does Hive replace HDFS?
Does Hive provide only an SQL interface, or storage as well?
Hive does not replace HDFS. Hive provides an SQL-type interface to data that is stored in HDFS. It is basically used for querying and analysis of the stored data. In a sense, Hive eliminates a lot of boilerplate code that you would otherwise have to write with MapReduce. For example, just think of how you would implement different types of joins (left, right, bucketed), a GROUP BY clause, or any other SQL clause in MapReduce, and you will get your answer (your lines of code will easily scale into the hundreds). Hive provides all of this out of the box; you do not need to write those lengthy MapReduce programs, Hive already does that for you.
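To make that concrete, a join plus an aggregation that would take a sizeable MapReduce program is only a few lines of HiveQL (the table and column names here are purely illustrative):

    SELECT c.customer_id,
           COUNT(*)      AS num_orders,
           SUM(o.amount) AS total_spent
    FROM   orders o
    JOIN   customers c ON o.customer_id = c.customer_id
    GROUP  BY c.customer_id;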
One thing to note is that Hive itself uses MapReduce behind the scenes, so any GROUP BY, COUNT, or JOIN is converted into MapReduce jobs. You can change the execution engine to Tez or Spark, though.
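For example, assuming Tez is installed on the cluster, switching engines is a one-line setting:

    -- accepted values include mr, tez, and spark
    SET hive.execution.engine=tez;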
For your second question: Hive does not provide any storage itself. It only uses a database (Derby by default; MySQL is a good choice if you want a different one) as a metastore, just to store the metadata related to the tables, partitions, views, buckets, etc. that you create with Hive (metadata such as the location of tables, the type of data stored in them, partition info, creation date, modification date, and so on).
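You can see the kind of metadata the metastore keeps for a table (location, input/output format, creation time, and so on) with a simple DESCRIBE ('my_table' here stands for any table you have created):

    DESCRIBE FORMATTED my_table;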
To answer your question in the comment:
Hive can process structured data (CSV, TXT, etc.) and semi-structured data (XML, JSON, Parquet, etc.). It cannot process unstructured data such as audio or video.
Note: semi-structured data can be handled directly in DDLs, and can also be pushed into Hive through Spark.
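As a small illustration, a Parquet-backed table can be declared directly in the DDL (the table name and path here are hypothetical):

    CREATE EXTERNAL TABLE clicks_parquet (
      user_id BIGINT,
      url     STRING
    )
    STORED AS PARQUET
    LOCATION '/data/clicks_parquet';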
I encourage you to also learn about external and managed tables in Hive.
Happy learning.

HDFS and HBase: how do they work?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test database and I use Talend Big Data (an ETL tool) for my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, will I never be able to query that data? I mean, do I have to read the entire file if I want to filter for the data I want? Is that right?
Thanks a lot for any help!
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
If you want to store your files on HDFS as they are and query them, you can create an external table on top of them using Hive.
The best option is to use Hive on top of the files that are in HDFS. You can use bucketing and partitioning in Hive for performance improvement.
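A minimal sketch of that approach, assuming delimited files already sitting in HDFS under one directory per day (the paths and columns are hypothetical):

    CREATE EXTERNAL TABLE events (
      event_id BIGINT,
      user_id  BIGINT,
      payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/events';

    -- Register an existing directory as one partition.
    ALTER TABLE events ADD PARTITION (event_date='2014-01-01')
      LOCATION '/data/events/2014-01-01';

    -- Thanks to partition pruning, only that directory is scanned.
    SELECT COUNT(*) FROM events WHERE event_date = '2014-01-01';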

HBase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure over the underlying files in HDFS, so that users get querying abilities on the HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but in a somewhat more structured (column-oriented) way, again over the HDFS file system.
Aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with traditional RDBMSs (but not entirely), and HQL syntax is almost identical to SQL, which is good for a database programmer from a learning perspective, whereas HBase is completely different in the sense that it can be queried only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach to defining columns, concentrating more on attributes, while in HBase the whole design is centered around the data: the table is designed according to the queries it has to serve, and the columns are dynamic and can change at runtime (a core feature of NoSQL).
You said they are both the same considering the type of job they do, except that Hive runs on MapReduce. It is not that simple, because when a Hive query is executed, a MapReduce job is created and triggered. Depending on data size and complexity it may take time, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks such as map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the data it has indexed, based on the specified Scan or Get parameters. In other words, it simply acts as a database.
Hive and HBase are completely different things.
Hive is a way to create MapReduce jobs for data that resides on HDFS (which can be plain files or HBase tables).
HBase is an OLTP-oriented key-value store that resides on HDFS and can be used in MapReduce jobs.
In order for Hive to work, it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving beyond being just an SQL way to write MapReduce jobs: with what Hortonworks calls the "Stinger initiative", a dedicated file format (ORC) has been added and Hive's performance is being improved (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop, i.e. a relatively fast way to run analytic queries over data stored on Hadoop.
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that users get SQL-like querying abilities on existing HDFS files, with typical latency of up to minutes.
However, for best performance it is recommended to ETL the data into Hive's ORC format.
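One common way to do that ETL step is a CREATE TABLE AS SELECT from the existing text-backed table (the table names here are assumptions):

    -- Copy an existing text-backed table into an ORC-backed one.
    CREATE TABLE pages_orc STORED AS ORC AS
    SELECT * FROM pages_text;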
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a column family storing the actual HTML code, the contents family, as well as others like anchor, which is used to store outgoing links, another one to store inbound links, and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies of the HTML, and is helpful when you want to analyze how often a page changes, for example. The timestamps used are the actual times when they were fetched from the crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster and guarantees redundant storage of the data, but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

How is data moved or reflected between Hive and HBase in Hive-HBase integration?

As per my understanding, both Hive and HBase use HDFS to store data. When we integrate Hive and HBase:
How is the data moved between them? Or does the data not move at all and simply get reflected? I am interested in two scenarios.
One: Table_1 has data and is in Hive, Table_2 has data and is in HBase. Now integration happens (is this scenario even possible?). How does the data movement happen? Is it from HBase to Hive or from Hive to HBase?
Two: same setup as scenario one, but now for newly inserted records. Where would they go?
I am new to HBase and interested in understanding the data movement in detail, with an example.
Please improve the question if needed. Thanks in advance.
HDFS is a distributed file system that is well suited for the storage of large files but does not provide fast individual record lookups.
Hive is simply a SQL-like abstraction for interacting with the data in HDFS.
HBase is also built on top of HDFS. It provides fast reads and writes for large tables. HBase accomplishes this by storing your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
So in both cases, the data resides in HDFS. That's "where it goes."
As for the details of how they work, that is a huge topic for which you will have to familiarize yourself with subjects such as the Hive metastore, storage handlers, and the HBase API. I believe this tutorial (Part 1 and Part 2) can help you.
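As a brief sketch of the "reflection" case: you can map a Hive table onto an existing HBase table with the HBase storage handler, and no data is copied at all; Hive simply reads the HBase table in place (the column family and column names below are placeholders, with Table_2 taken from the question):

    CREATE EXTERNAL TABLE table_2_hive (
      rowkey STRING,
      amount STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:amount')
    TBLPROPERTIES ('hbase.table.name' = 'Table_2');

    -- This query is answered from HBase; nothing is moved into Hive-owned HDFS files.
    SELECT * FROM table_2_hive LIMIT 10;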

Load data into Hive from flat files or an existing database

We are setting up Hadoop and Hive in our organization.
We will also have sample data created by a data generator tool. The data will be around 1 TB.
My question is: I have to load that data into Hive and Hadoop. What is the process I need to follow for this?
We will also have HBase installed alongside Hadoop.
We need to recreate in Hive the same database design that currently exists in SQL Server, because after this data is loaded into Hive we want to use Business Objects 4.1 as a front end to create reports.
The challenge is to load the sample data into Hive.
Please help, as we want to get all of this done as soon as possible.
First, ingest your data into HDFS.
Then use Hive external tables pointing to the location where you ingested the data, i.e. your HDFS directory.
You are all set to query the data from the tables you created in Hive.
Good luck.
For the first case you need to put the data in HDFS:
Transfer your data file(s) to a client node (app node).
Put your files into the distributed file system (hdfs dfs -put ...).
Create an external table pointing to the HDFS directory into which you uploaded those files. Your data must be structured in some way, for instance delimited by a semicolon.
Now you can operate on the data with SQL queries.
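For example (the column names and the HDFS path are assumptions for the sketch):

    CREATE EXTERNAL TABLE sample_data (
      id     BIGINT,
      name   STRING,
      amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    LOCATION '/user/appuser/sample_data';

    SELECT COUNT(*) FROM sample_data;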
For the second case, you can create another Hive table (using HBaseStorageHandler, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load it from the first table with an INSERT statement.
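A rough sketch of that second case, following the HBaseIntegration wiki (the column family and mapping are placeholders, and the table names continue the example above):

    CREATE TABLE sample_data_hbase (
      rowkey STRING,
      name   STRING,
      amount DOUBLE
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:amount');

    -- Load it from the external table created above.
    INSERT OVERWRITE TABLE sample_data_hbase
    SELECT CAST(id AS STRING), name, amount FROM sample_data;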
I hope this can help you.
