Please clarify my understanding of Hadoop/HBase - hadoop

I have been reading white papers and watching YouTube videos for half the day now and believe I have a proper understanding of the technology, but before I start my project I want to make sure it's right.
So with that, here's what I think I know.
As I understand the architecture of Hadoop and HBase, they pretty much model out like this:
---------------------------------------------------
|                    MapReduce                    |
---------------------------------------------------
|  Hadoop  | <---- HBase export ----- |   HBase   |
|          | ----- Apache Pig ------> |           |
---------------------------------------------------
|                       HDFS                      |
---------------------------------------------------
In a nutshell, HBase is a completely different DB engine tuned for real-time updates and queries that happens to run on HDFS and is compatible with MapReduce.
Now, assuming the above is correct, here is what else I think I know.
Hadoop is designed for big data from start to finish. The engine uses a distributed, append-only file system, which means you cannot modify data in place once it is written. To access the data you can use MapReduce, or the HDFS shell and HDFS API.
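For illustration, here is a minimal sketch of reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the namenode address and file path below are placeholders, and on a real cluster the configuration would normally be picked up from core-site.xml:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/logs/app.log"))))) {  // placeholder path
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

The same FileSystem class also exposes create(), delete() and copyFromLocalFile(), which mirror what the HDFS shell commands do.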
Hadoop does not handle small files well, and it was never intended to be a real-time system. You would not want to store a single person and address per file; you would instead store a million people and addresses per file and load the one large file.
HBase, on the other hand, is a pretty typical NoSQL database engine that in spirit compares to CouchDB, RavenDB, etc. The notable difference is that it's built on Hadoop's HDFS, allowing it to scale reliably to sizes limited only by your wallet.
Hadoop is a collection of a file system (HDFS) and Java APIs to perform computation on HDFS. HBase is a NoSQL database engine that uses HDFS to efficiently store data across a cluster.
To build a MapReduce job that accesses data from both Hadoop and HBase, one is best off using HBase export to push the HBase data into Hadoop and writing the job to process that data, though MapReduce can access either system one at a time.
You must be very careful when designing your HBase tables, as HBase does not natively support secondary indexes on other fields; it only indexes the row key. Many tips and tricks help work around this fact.
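For example, a common workaround is to encode the query dimension into the row key itself so that one prefix scan answers the question. A small sketch using the HBase 1.x client API; the "users" table, "info" family and the <country>|<userId> key layout are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table users = connection.getTable(TableName.valueOf("users"))) {
            // Row keys are designed as "<country>|<userId>", so all US users are adjacent.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("US|"));   // inclusive start of the prefix
            scan.setStopRow(Bytes.toBytes("US|~"));   // '~' sorts after the IDs, so this stops just past the prefix
            try (ResultScanner scanner = users.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                    System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(name));
                }
            }
        }
    }
}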
OK, so if I'm still accurate to this point, this would be a valid use case.
You build the site with HBase. You use HBase the same way you would any other NoSQL or RDBMS to build out your functionality. Once that's done, you put your metrics logging points in the code to record your metrics in, say, log4j. You create a new appender in log4j with rules that say: when the log file reaches 1 GB in size, push it to the Hadoop cluster, delete it, create a new file, and go on with life.
Later, a MapReduce developer can write a routine that uses HBase export to grab a data set from HBase, say a list of user IDs, then go to the logs stored in Hadoop and find the breadcrumb trail each user left through the system over a given timespan.
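A rough sketch of what the mapper and reducer of such a job could look like, assuming (purely for illustration) tab-separated log lines with the user ID in the second field; the class names and field positions are invented:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UserTrailJob {

    // Emit (userId, logLine) so that all of one user's events arrive at the same reducer.
    public static class TrailMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length > 1) {
                context.write(new Text(fields[1]), line);   // fields[1] assumed to be the user ID
            }
        }
    }

    // Write out one trail per user (time-sorting the events would need a secondary sort, omitted here).
    public static class TrailReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> events, Context context)
                throws IOException, InterruptedException {
            for (Text event : events) {
                context.write(userId, event);
            }
        }
    }
}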
OK, with all that said, now for the specific question: are statements 1-6 accurate?
Edit one: I have updated my beliefs above based on the answers received.

You can access the files in HDFS directly via the HDFS shell or the HDFS API.
Correct.
I am not familiar with CouchDB or RavenDB, but in HBase you cannot have secondary indexes, so you must carefully design your row key to speed up your queries. There are a lot of HBase schema design tips on the internet you can Google for.
I think it is more appropriate to say Hadoop is a computing engine rather than a database engine. If you want to import HDFS data into HBase, you can use Apache Pig as stated in this post. If you want to export HBase data to HDFS, you can use the export utility.
MapReduce is a component of the Hadoop framework, and it does not sit on top of HBase. You can access HBase data in a MapReduce job because HBase uses HDFS for its storage. However, I don't think you want to access the HFiles directly from a MapReduce job: the raw file is encoded in a special format, it is not easy to parse, and it might change in future releases.
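The supported route is HBase's MapReduce integration (TableInputFormat and TableMapper), which goes through the normal client API rather than the files on disk. A hedged sketch; the table name "users", family "info" and qualifier "city" are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseScanJob {

    // Count users per city with Hadoop counters; the totals show up in the job counters.
    public static class UserMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result result, Context context) {
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            if (city != null) {
                context.getCounter("cities", Bytes.toString(city)).increment(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan users table");
        job.setJarByClass(HBaseScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger batches per RPC for full-table scans
        scan.setCacheBlocks(false);  // don't pollute the region servers' block cache

        TableMapReduceUtil.initTableMapperJob(
                "users", scan, UserMapper.class, ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}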

Since HBase and Hadoop are different database engines, one cannot access the data in the other directly. For HBase to get something out of Hadoop, it must go through MapReduce, and vice versa.
This is not true, since Hadoop is not a database engine. Hadoop is a collection of a file system (HDFS) and Java APIs to perform computation on HDFS.
Furthermore, MapReduce is not a technology; it is a model whereby you can work in parallel on HDFS data.

Related

Processing very large dataset in real time in hadoop

I'm trying to understand how to architect a big data solution. I have 400 TB of historic data, and every hour 1 GB of new data is inserted.
Since the data is confidential, I'm describing a sample scenario. The data contains information on all activities in a bank branch. Every hour, when new data is inserted (no updates) into HDFS, I need to find how many loans were closed, loans created, accounts expired, etc. (around 1000 analytics to be performed). The analytics involve processing the entire 400 TB of data.
My plan was to use Hadoop + Spark, but it has been suggested that I use HBase. Reading through all the documents, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600 TB?
1. MR for analytics and impala/hive for query
2. Spark for analytics and query
3. HBase + MR for analytics and query
Thanks in advance
About HBase:
HBase is a database that is built on top of HDFS. HBase uses HDFS to store its data.
Basically, HBase allows you to update records and gives you versioning and deletion of single records. HDFS does not support file updates, so HBase introduces what you can consider "virtual" operations, and merges data from multiple sources (original files, delete markers) when you ask it for data. Also, HBase as a key-value store creates indices to support selecting by key.
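From the client's point of view those operations look perfectly ordinary. A small sketch (table, family and qualifier names are invented, and the column family is assumed to keep multiple versions):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdateExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table accounts = conn.getTable(TableName.valueOf("accounts"))) {

            byte[] row = Bytes.toBytes("account-42");
            byte[] fam = Bytes.toBytes("info");

            // "Update" = write a new version of the cell; older versions stay until compaction.
            Put put = new Put(row);
            put.addColumn(fam, Bytes.toBytes("status"), Bytes.toBytes("closed"));
            accounts.put(put);

            // Read back several versions of the same cell.
            Get get = new Get(row);
            get.setMaxVersions(3);
            Result result = accounts.get(get);
            System.out.println(result);

            // "Delete" = write a tombstone marker; the row disappears from reads immediately
            // and is physically removed from the HFiles at the next major compaction.
            accounts.delete(new Delete(row));
        }
    }
}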
Your problem:
When choosing the technology in such situations, you should look at what you are going to do with the data: a single query on Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark), while Spark will be faster in batch jobs where caching is involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I did not try Impala + HBase, so I can't say anything about performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark did the reports for pre-defined queries (after that, data was stored in objectFiles, not human-readable but very fast), and Impala handled custom queries.
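As a rough illustration of that batch-report pattern (not the author's actual code), here is a sketch in the Java Spark API; the input path and record layout are invented:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LoanReport {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("loan-report");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> events = sc.textFile("hdfs:///bank/events/*");   // placeholder path

            // Count "LOAN_CLOSED" events per branch; the branch id is assumed to be
            // field 0 and the event type field 1 of a comma-separated line.
            JavaPairRDD<String, Integer> closedPerBranch = events
                    .map(line -> line.split(","))
                    .filter(f -> f.length > 1 && f[1].equals("LOAN_CLOSED"))
                    .mapToPair(f -> new Tuple2<>(f[0], 1))
                    .reduceByKey(Integer::sum);

            // Persist for later serving; objectFile output is compact but not human-readable.
            closedPerBranch.saveAsObjectFile("hdfs:///reports/loans-closed-per-branch");
        }
    }
}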
Hope it helps at least a little.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a MapReduce job on my data and now I want to query the output so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts the SQL statements to a bunch of MapReduce jobs with staged builds in between. Also, Hive is a high-latency system, i.e. depending on your dataset sizes you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results of your MapReduce job output on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, then another option you could think of is using a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data into and out of your HDFS cluster. The video is easy to watch and describes the advantages/disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to data in HDFS. Sits on top of HDFS and applies structure to data. Allows for random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes only on the row key. Does not use the MapReduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit most from it if your data is already organized (for instance, into columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the number of occurrences of the word "gloubiboulga". You can achieve this via the following code:
-- Create a table matching the two columns of the CSV file
CREATE TABLE data
(
  word STRING,
  `count` INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

-- Load the file from HDFS into the table (use LOCAL INPATH for a local file instead)
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;

-- Query the count for a single word
SELECT word, `count` FROM data WHERE word = "gloubiboulga";
Please note that while this language looks a lot like SQL, you'll still have to learn a few things about it.

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single-node Hadoop cluster using the Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation, you have time-series data. Hadoop with HDFS by itself is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend filesystem. It is good for random access.
Also, for your need to parse and rearrange the data, you can make use of Hadoop's MapReduce. HBase has built-in support for this: HBase can be used as the input/output of a MapReduce job, as in the sketch below.
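A hedged sketch of such a map-only job, writing through HBase's TableOutputFormat via TableMapReduceUtil; the table name "logs", the column family "f", and the whitespace-separated field layout are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class LogLoader {

    // Split each "timestamp req-id level module-name message" line into columns of one HBase row.
    public static class ParseMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\\s+", 5);
            if (f.length < 5) return;                          // skip malformed lines
            byte[] rowKey = Bytes.toBytes(f[0] + "-" + f[1]);  // timestamp + req-id
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("level"), Bytes.toBytes(f[2]));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("module"), Bytes.toBytes(f[3]));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("message"), Bytes.toBytes(f[4]));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "load parsed logs into hbase");
        job.setJarByClass(LogLoader.class);
        job.setMapperClass(ParseMapper.class);
        FileInputFormat.addInputPath(job, new Path("/flume/logs"));   // placeholder input path
        TableMapReduceUtil.initTableReducerJob("logs", null, job);    // routes the Puts to the table
        job.setNumReduceTasks(0);                                     // map-only
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}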
You can get basic information from here. For a better understanding, try the HBase: The Definitive Guide and HBase in Action books.

Hbase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: It just creates a tabular structure for the underlying files in HDFS, so that we can enable the user to have querying abilities on the HDFS files. Correct me if I'm wrong here?
HBase: Again, we create a similar table structure, but in a bit more structured way (column-oriented), again over the HDFS file system.
Aren't they both the same considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure to a traditional RDBMS (but not entirely); HQL syntax is very similar to SQL, which is good for a database programmer from a learning perspective, whereas HBase is completely different in the sense that it can be queried only on the basis of its row key.
If you want to design a table in an RDBMS, you will follow a structured approach in defining columns, concentrating more on attributes, while in HBase the complete design is concentrated around the data. So depending on the type of query to be used, we design the table in HBase; the columns are dynamic and can change at runtime (a core feature of NoSQL).
You said: "aren't they both the same considering the type of job they do, except that Hive runs on MapReduce?" It is not that simple. When a Hive query is executed, a MapReduce job is created and triggered. Depending on the data size and complexity it may take a while, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks like map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the data it has indexed, based on the specified Scan or Get parameters. That means it just acts as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond being a SQL way to write map/reduce jobs: with what Hortonworks calls the "Stinger initiative", they have added a dedicated file format (ORC) and are improving Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. a relatively fast way to run analytics queries on data stored in Hadoop).
Hive:
It just creates a tabular structure for the underlying files in HDFS, so that we can give the user SQL-like querying abilities on existing HDFS files, with typical latency of up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
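Once a table has been defined over those files, an application can query it like any other database through Hive's JDBC driver (HiveServer2). A small sketch; the host, table and column names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 address and database.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hdfs", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT url, views FROM page_views WHERE dt = '2014-01-01' LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}

Remember that behind such a query Hive may still launch MapReduce jobs, so the latency stays in the seconds-to-minutes range rather than milliseconds.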
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a column family storing the actual HTML code, the contents family, as well as others like anchor, which is used to store outgoing links, another one to store inbound links, and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies of the HTML, and is helpful when you want to analyze how often a page changes, for example. The timestamps used are the actual times when they were fetched from the crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster and it guarantees redundant storage of the data, but it is not a feature in any other sense.
Also is it true that we can't create an HBase table over an already existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Hadoop Ecosystem - What technological tool combination to use in my scenario? (Details Inside)

This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (RDBMS); consider it similar to Amazon's data, i.e., users -> what things they saw/clicked to see -> what they bought.
Task: Make a recommendation engine (like Amazon's), which displays to the user "customers who bought this also bought this" -> "if you liked this, then you might like this" -> and also some data mining to predict future buying habits. So on and so forth; basically a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user habit data), I see Hadoop as the ultimate solution. Now the question is, what combination of technological tools to use? I.e.,
HDFS: Underlying file system
HBASE/HIVE/PIG: ?
Mahout: For running some algorithms, which I assume use MapReduce (genetic, clustering, data mining, etc.)
- What am I missing? What about loading the RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, do I get a list of results (recos), or is there a way to query it directly and report it to the front-end I build in .NET?
I think the answer to this question just might be a good discussion for many people like me who want to kick-start their Hadoop experimentation in the future.
For loading data from the RDBMS, I'd recommend looking into BCP (to export from SQL Server to flat files) and then the Hadoop command line for loading into HDFS. Sqoop is good for ongoing data, but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
HBase can fit your scenario.
HDFS is the underlying file system. Nevertheless, you cannot load data into HDFS in an arbitrary format and query it with HBase, unless you use the HBase file format (HFile).
HBase has integration with MR.
Pig and Hive also integrate with HBase.
As Chris mentioned, you can use Thrift to perform your queries (get, scan); since this will extract specific user info and not a massive data set, it is more suitable than using MR.
