The scenario is I need to process a file(Input) and for each records I need to check whether certain fields in input file are matching the fields stored in an Hadoop cluster.
We are in a thought of using MRJob to process the the input file and use HIVE to get data from hadoop cluster. I would like to know whether it is possible for me to connect HIVE inside a MRJob module. If so how to do that?
If not what would be the ideal approach to fulfill my requirement.
I am new to Hadoop, MRJob and Hive.
Please provide some suggestion.
"matching the fields stored in an Hadoop cluster." --> You mean that you need to search if the fields exists in this file too?
About how many files are there in total which you need to scan?
One solution is to load every single item in an HBase table and for every record in the input file, "GET"ing the record from the table. If the GET is successful then the record exists elsewhere in HDFS or else it doesn't. You would need a unique identifier for each HBase record and the same identifier should exist in your input file also.
You could connect to Hive also but the schema would need to be rigid in order for all your HDFS files to be able to be loaded into a single Hive table. HBase doesn't really care about columns (only ColumnFamilies needed). One more downside with MapReduce and Hive is that the speed will be low as compared to HBase (near real time).
Hope this helps.
Related
I'm new to Hive; so, I'm not sure how companies use Hive. Let me give you a scenario and see if I'm conceptually correct about the use of Hive.
Let's say my company wants to keep some web server log files and be able to always search through and analyze the logs. So, I create a table columns of which correspond to the columns in the log file. Then I load the log file into the table. Now, I can start query the data. So, as the data comes in at future dates, I just keep adding the data to this table, and thus I always have my log files as a table in Hive that I can search through and analyze.
Is that scenario above a common use? And if it is, then how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
You can use Hive, for analysis over static datasets, but if you have streaming logs, I really wouldn't suggest Hive for this. It's not a search engine and will take minutes just to find any reasonable data you're looking for.
HBase would probably be a better alternative if you must stay within the Hadoop ecosystem. (Hive can query Hbase)
Use Splunk, or the open source alternatives of Solr / Elasticsearch / Graylog if you want reasonable tools for log analysis.
But to answer your questions
how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
Use an EXTERNAL Hive table over an HDFS location for your logs. Use Flume to send log data to that path (or send your logs to Kafka, and from Kafka to HDFS, as well as a search/analytics system)
You only need to update the table if you're adding date partitions (which you should because that's how you get faster Hive queries). You'd use MSCK REPAIR TABLE to detect missing partitions on HDFS. Or run ALTER TABLE ADD PARTITION yourself on a schedule. Note: Confluent's HDFS Kafka Connect will automatically create Hive table partitions for you
If you must use Hive, you can improve the queries better if you convert the data into ORC or Parquet format
I have a data structure in Hadoop with 100 columns and few hundred rows. Most of the times I need to query 65% of columns. In this case which is better to use HBASE or HIVE? Please advice.
Just number of columns you are accessing is NOT the criteria for deciding hbase or hive.
HIVE (SQL) :
Use Hive when you have warehousing needs and you are good at SQL and don't want to write MapReduce jobs. One important point though, Hive queries get converted into a corresponding MapReduce job under the hood which runs on your cluster and gives you the result. Hive does the trick for you. But each and every problem cannot be solved using HiveQL. Sometimes, if you need really fine grained and complex processing you might have to take MapReduce's shelter.
Hbase (NoSQL database):
You can use Hbase to serve that purpose. If you have some data which you want to access real time, you could store it in Hbase.
hbase get 'rowkey' is powerful when you know your access pattern
Hbase follows CP of CAP Theorm
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of data)
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communicate lost) and data is lost (node lost)
also have a look at this
Its very difficult to answer the question in one line.
HBASE is NoSQL database: your data need to store denormalized data because HBASE is very bad for joi
ning tables.
Hive: You can store data in similar format (normalized) in Hive, but would only see benefits when doing batch processing.
I am very new to hadoop and i have requirement of scrubbing the file in which account no,name and address details and i need to change these name and address details with some other name and address which are existed in another file.
And am good with either Mapreduce or Hive.
Need help on this.
Thank you.
You can write simple Mapper only job (with reducer set to zero), update the information and store them on some other location. Verify the output of the your job, if it is as you expected, then remove the old files. Remember, HDFS does not support in-placing editing and over-write of files.
Hadoop - MapReduce Tutorial.
You can also use Hive to accomplish this task.
Write hive UDF based on your logic of scrubbing
Use above UDF for each column in hive table you want to scrub and store data in new Hive table.
3.You can remove old hive table.
Im trying to get a clear understanding on HBASE.
Hive:- It just create a Tabular Structure for the Underlying Files in
HDFS. So that we can enable the user to have Querying Abilities on the
HDFS file. Correct me if im wrong here?
Hbase- Again, we have create a Similar table Structure, But bit more
in Structured way( Column Oriented) again over HDFS File system.
aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce.
Also is that true that we cant create a Hbase table over an Already existing HDFS file?
Hive shares a very similar structures to traditional RDBMS (But Not all), HQL syntax is almost similar to SQL which is good for Database Programmer from learning perspective where as HBase is completely diffrent in the sense that it can be queried only on the basis of its Row Key.
If you want to design a table in RDBMS, you will be following a structured approach in defining columns concentrating more on attributes, while in Hbase the complete design is concentrated around the data, So depending on the type of query to be used we can design a table in Hbase also the columns will be dynamic and will be changing at Runtime (core feature of NoSQL)
You said aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce .This is not a simple thinking.Because when a hive query is executed, a mapreduce job will be created and triggered.Depending upon data size and complexity it may consume time, since for each mapreduce job, there are some number of steps to do by JobTracker, initializing tasks like maps,combine,shufflesort, reduce etc.
But in case we access HBase, it directly lookup the data they indexed based on specified Scan or Get parameters. Means it just act as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs and with what HortonWorks calls the "stinger initiative" they have added a dedicated file format (Orc) and import Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. relatively fast way to run analytics queries for data stored on Hadoop)
Hive:
It's just create a Tabular Structure for the Underlying Files in HDFS. So that we can enable the user to have SQL-like Querying Abilities on existing HDFS files - with typical latency up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the pageāfor example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also is that true that we cant create a Hbase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.
I have started working with Hadoop recently. There is table named Checkout that I access through Hive. And below is the path where the data goes to HDFS and other info. So what information I can get if I have to read the below three lines?
Path Size Record Count Date Loaded
/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00 1.13 TB 9,294,245,800 2012-07-05 07:26
/sys/edw/dw_checkout_trans/snapshot/2012/07/03/00 1.13 TB 9,290,477,963 2012-07-04 09:37
/sys/edw/dw_checkout_trans/snapshot/2012/07/02/00 1.12 TB 9,286,199,847 2012-07-03 07:08
So my question is-
1) Firstly, We are loading the data to HDFS and then through Hive I am querying it to get the result back? Right?
2) Secondly, When you look into the above path and other things, the only thing that I am confuse is, when I will be querying using Hive then I will be getting data from all the three paths above? or the most recent one at the top?
As I am new to these stuff, so I am having lot of problem. Can anyone explain me hive gets the data from where? And we store all the data in HDFS and then we use Hive or Pig to get data back from HDFS? And it will be great if some one give high level knowledge of Hadoop and Hive.
I think you need to get the difference between Hive's native table and Hive's external table.
Hive native table mean that you load data into hive, and it takes care how data is stored in the HDFS. We usually do not care what is directory structure in this case.
Hive External table mean that we put data in some directory (if we forget about partitioning for the moment) and tell to Hive - it is table's data. Please treat is as such. And hive enable us to query it, join with other external or regular table. And it is our responsibility to add data, delete it, etc