Hadoop Beginner - Data Ingestion & Analysis - hadoop

HDFS stores both structured and unstructured data. Hive and Impala let us write SQL queries, which are then converted into MapReduce jobs. How does the user come to know the schema in which the data is stored, or how are those tables formed from the data stored in HDFS?

If you use the Parquet file format, there are tools for inspecting the files directly. See this for instance. Most of the other Hadoop file formats have similar handy tools too, such as https://orc.apache.org/docs/tools.html for ORC files.
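As a rough sketch of what that looks like in practice (the exact invocation depends on which tool versions are installed on your cluster, and the paths and table name below are made-up examples):
# Paths and table name are examples only
# Print the schema embedded in a Parquet file
parquet-tools schema /data/events.parquet
# Print row-group metadata and column statistics as well
parquet-tools meta /data/events.parquet
# Dump the metadata (including the schema) of an ORC file
hive --orcfiledump /data/events.orc
# For data already registered in Hive, the metastore holds the schema
hive -e "DESCRIBE FORMATTED my_table"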

Related

Save and access table-like data structure in hadoop

I want to save and access a table-like data structure in HDFS with MapReduce programming. Part of this data structure is shown in the following picture. It has tens of thousands of columns and hundreds of rows, and all nodes should have access to it.
My question is: how can I save this data structure in HDFS and access it with MapReduce programming? Should I use arrays? (Or Hive tables? Or HBase?)
Thank you.
HDFS is a distributed file system which stores your big files across distributed servers.
You can copy your files from the local system to HDFS using the command
hadoop fs -copyFromLocal /source/local/path destination/hdfs/path
Once the copy is complete, an external Hive table can be created on destination/hdfs/path.
This table can then be queried using the Hive shell.
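As an illustration of such an external table (the table name, column names, types, and comma delimiter are assumptions made for this sketch, and it assumes the data landed at /destination/hdfs/path; adjust them to your data):
-- Table/column names and the delimiter are examples; adjust to your data.
-- Dropping an EXTERNAL table later leaves the underlying files untouched.
CREATE EXTERNAL TABLE my_table (
  id   INT,
  col1 STRING,
  col2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/destination/hdfs/path';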
Do consider Hive for this scenario. If you want to do table-type processing like a SAS dataset, an R data.frame/data.table, or a pandas DataFrame in Python, an equivalent is almost always possible in SQL. Hive provides a powerful SQL abstraction on top of the MapReduce and Tez engines. If you want to graduate to Spark at some point, you can read Hive tables into DataFrames. As @sumit pointed out, you just need to transfer your data from local to HDFS (using the HDFS copyFromLocal or put command) and define an external Hive table on it.
If you want to write some custom map-reduce on this data, access the underlying Hive table data (most likely under /user/hive/warehouse). After reading the data from stdin, parse it in the mapper (the field separator can be found using describe extended <hive_table>) and emit it in key-value pair format.
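A quick, hedged sketch of how to check that separator and peek at the raw files (my_table and the warehouse path are placeholders, and actual file names will vary):
# my_table and the warehouse path are placeholders
# Show the storage location and field delimiter recorded in the metastore
hive -e "DESCRIBE EXTENDED my_table;"
# List and peek at the raw files a custom MapReduce job would read
hadoop fs -ls /user/hive/warehouse/my_table
hadoop fs -cat /user/hive/warehouse/my_table/000000_0 | head   # example file name; actual names vary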

Does HBase have its own structured data (on HDFS) or can it execute on unstructured data on HDFS?

I am cutting my teeth on the Hadoop ecosystem and have fairly good knowledge of MR, YARN and HDFS.
I am exploring other parts of the ecosystem. I believe HiveQL can be run on HBase in SQL-like fashion, and in near real time. If that is so, I believe there is a need to transform unstructured data on HDFS into structured data so that relatively fast queries in HQL can be run. Does this mean data is in HDFS in unstructured form, and then replicated in structured form on HDFS for use by HBase and HQL?
Also, can HiveQL be run directly on unstructured data on HDFS in batch mode (hours, i.e. a similar time as Java running as an MR job)?
HBase is a key-value store. It doesn't support SQL.
Answer to your question 1: "I believe there is a need to transform unstructured data on HDFS into structured data so that relatively fast queries in HQL can be run"
Hive can process unstructured data by applying structure to it. It offers a simple way to impose structure on large amounts of unstructured data and then perform batch, SQL-like queries on that data.
Data can be read in a variety of formats, from unstructured flat files with comma- or space-separated text, to semi-structured JSON files, to structured HBase tables.
Look at this article on Log Analysis for how to convert an unstructured log file into structured data and process it.
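As a hedged sketch of that idea (the log layout, regular expression, paths, and table/column names below are illustrative only), Hive's RegexSerDe can impose columns on raw log lines at read time:
-- Illustrative names and regex; assume Apache-style access logs sit under /data/raw_logs (hypothetical path).
CREATE EXTERNAL TABLE access_logs (
  host    STRING,
  ts      STRING,
  request STRING,
  status  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d+) \\S+"
)
LOCATION '/data/raw_logs';
-- Ordinary batch SQL now works over the raw files.
SELECT status, COUNT(*) FROM access_logs GROUP BY status;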
Answer to your question 2: "Can HiveQL be run directly on unstructured data on HDFS in batch mode (hours, a similar time as Java running as an MR job)?"
HiveQL cannot run directly on unstructured data; the data should be given a structure before it is processed. Refer to the Log Analysis example above.
HiveQL cannot run on semi-structured data (data in more than one format): all the data behind a table on HDFS must be in the same format. That format is specified as metadata in a database used by Hive (the metastore), which Hive uses to figure out the structure of the data in HDFS. Queries are executed as MapReduce jobs over HDFS and are indeed long-running.
Pig is what is needed to process data on HDFS that comes in numerous formats; Hive cannot do it, while Pig can because of its programmatic style.
You can load semi-structured data into HBase using a MapReduce job, and then run Hive on HBase in near real time.
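One common way to get that "Hive over HBase" setup is Hive's HBase storage handler. The sketch below is hedged: the HBase table name, Hive table name, and column mapping are hypothetical.
-- Map an existing HBase table (here called 'events', with column family 'd') into Hive; names are hypothetical.
CREATE EXTERNAL TABLE hbase_events (
  rowkey  STRING,
  payload STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:payload")
TBLPROPERTIES ("hbase.table.name" = "events");
-- Queries go through Hive but read live HBase data.
SELECT COUNT(*) FROM hbase_events;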

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a MapReduce job on my data and now I want to query the output so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system which, under the hood, converts the SQL statements into a bunch of MapReduce jobs with stage builds in between. Also, Hive is a high-latency system, i.e. depending on your dataset size you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results from your MapReduce job output on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and then take it from there.
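A hedged sketch of such an export (the connection string, credentials, table and directory are placeholders):
# Push the MapReduce output (e.g. comma-separated files) into an existing MySQL table.
# Connection string, table and directory are placeholders.
sqoop export \
  --connect jdbc:mysql://dbhost/webapp \
  --username dbuser -P \
  --table results \
  --export-dir /user/me/job_output \
  --input-fields-terminated-by ','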
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option you could think of is a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data into and out of your HDFS cluster. The video is easy to watch and describes the advantages and disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to data. Allows random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes only on the row key. Does not use the MapReduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit from it most if your data is already organized (for instance, in columns).
Let's take an example: your map-reduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the occurrence of the word "gloubiboulga". You can achieve this simply via the following code:
CREATE TABLE data
(
word STRING,
count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, count from data where word = "gloubiboulga";
Please note that while this language looks a lot like SQL, you'll still have to learn a few things about it.

Data moving from RDBMS to Hadoop, using SQOOP and FLUME

I am in the process of learning Hadoop and am stuck on a few concepts about moving data from a relational database to Hadoop and vice versa.
I have transferred files from MySQL to HDFS using SQOOP import queries. The files I transferred were structured datasets, not server log data. I recently read that we usually use Flume for moving log files into Hadoop.
My question is:
1. Can we use SQOOP as well for moving log files?
2. If yes, which of SQOOP or FLUME is preferred for log files, and why?
1) Sqoop can be used to transfer data between any RDBMS and HDFS. To use Sqoop, the data has to be structured, usually specified by the schema of the database from which the data is being imported or exported. Log files are not always structured (depending on the source and type of log), so Sqoop is not used for moving log files.
2) Flume can collect and aggregate data from many different kinds of customizable data sources. It gives more flexibility in controlling which specific events to capture and in applying a user-defined workflow before storing in, say, HDFS.
I hope this clarifies the difference between Sqoop and Flume; a sample Sqoop import command is sketched below.
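A hedged illustration of such a structured import (the connection string, credentials, table and target directory are placeholders):
# Pull a structured MySQL table into HDFS as delimited files.
# Connection string, table and target directory are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4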
SQOOP is designed to transfer data from an RDBMS to HDFS, whereas FLUME is for moving large amounts of log data.
Both are different and specialized for different purposes.
For example,
you can use SQOOP to import data via JDBC (which you cannot do in FLUME),
and
you can use FLUME to say something like "I want to tail 200 lines of a log file from this server" (a sample configuration is sketched after the link below).
Read more about FLUME here
http://flume.apache.org/
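As a hedged sketch of that kind of setup (the agent name, file paths, and HDFS URI are all placeholders), a minimal Flume agent that tails a log file into HDFS might be configured like this and started with something like flume-ng agent --name agent --conf-file tail.conf:
# Agent name, file paths and HDFS URI are placeholders.
# Source: tail the application log; Channel: in-memory buffer; Sink: write to HDFS.
agent.sources = tail-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.tail-src.type = exec
agent.sources.tail-src.command = tail -F /var/log/app/server.log
agent.sources.tail-src.channels = mem-ch

agent.channels.mem-ch.type = memory

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/app-logs
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.channel = mem-ch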
SQOOP not only transfers data from an RDBMS but also from NoSQL databases like MongoDB. You can transfer the data directly to HDFS or Hive.
When transferring data to Hive you do not have to create the table beforehand; it takes the schema from the source database itself.
Flume is used to fetch log data or streaming data.

Hbase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure for the underlying files in
HDFS, so that we can enable the user to have querying abilities on the
HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but in a bit more
structured way (column oriented), again over the HDFS file system.
Aren't they both the same considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with a traditional RDBMS (but not entirely), and HQL syntax is almost the same as SQL, which is good for a database programmer from a learning perspective; HBase, on the other hand, is completely different in the sense that it can be queried only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach in defining columns, concentrating more on attributes. In HBase, the complete design is concentrated around the data: we design the table depending on the type of query to be used, and the columns are dynamic and can change at runtime (a core feature of NoSQL).
You said "aren't they both the same considering the type of job they do, except that Hive runs on MapReduce". It is not that simple: when a Hive query is executed, a MapReduce job is created and triggered. Depending on the data size and complexity it may take time, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks like map, combine, shuffle/sort, reduce, etc.
When we access HBase, by contrast, it directly looks up the data it has indexed, based on the specified Scan or Get parameters. In other words, it just acts as a database.
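A hedged illustration of that row-key-driven access from the HBase shell (the table, row keys and column names below are made up):
# Table, row keys and columns are made up
# Direct lookup of a single row by its key
get 'users', 'user#1001'
# Range scan bounded by row keys, limited to 10 rows
scan 'users', {STARTROW => 'user#1000', STOPROW => 'user#2000', LIMIT => 10}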
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving beyond being a SQL-like way to write map/reduce jobs: with what Hortonworks calls the "Stinger initiative" they have added a dedicated file format (ORC) and improved Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. a relatively fast way to run analytics queries on data stored in Hadoop).
Hive:
It just creates a tabular structure for the underlying files in HDFS, so that users have SQL-like querying abilities on existing HDFS files, with typical latencies of up to minutes.
However, for best performance it's recommended to ETL your data into Hive's ORC format.
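A hedged sketch of that ETL step (the table names are hypothetical; it assumes a delimited-text staging table already exists):
-- Table names are hypothetical; copy a text-backed staging table into an ORC-backed table in one statement.
CREATE TABLE events_orc STORED AS ORC
AS SELECT * FROM events_text;
-- Or, for an ORC table that already exists:
-- INSERT OVERWRITE TABLE events_orc SELECT * FROM events_text;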
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
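To make the webtable example concrete, here is a hedged HBase shell sketch (the column families, row keys and values are illustrative only):
# Column families, row keys and values are illustrative only
# Create the table with a versioned 'contents' family plus 'anchor' and 'metadata'
create 'webtable', {NAME => 'contents', VERSIONS => 3}, 'anchor', 'metadata'
# Store the crawled HTML and an outgoing link under the reversed-URL row key
put 'webtable', 'org.hbase.www', 'contents:html', '<html>...</html>'
put 'webtable', 'org.hbase.www', 'anchor:org.example.www', 'link text'
# Fetch up to three stored versions of the page contents
get 'webtable', 'org.hbase.www', {COLUMN => 'contents:html', VERSIONS => 3}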
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Resources