My understanding is that Hive is an SQL-like language that can perform database-related tasks by invoking underlying MapReduce programs. However, I have learned that some Hive commands do not invoke a MapReduce job. I am curious to know which commands these are and why they do not need to invoke a MapReduce job.
You are right, Hive uses MR jobs in the background to process the data.
When you fire a SQL-like query in Hive, it converts it into one or more MR jobs in the background and gives you the result.
Having said that, there are a few queries that don't need MR jobs.
For example:
SELECT * FROM table LIMIT 10;
If you look at the above query, we don't need any data processing. All we need is to read a few rows from the table.
So the above Hive query doesn't fire an MR job.
But if we slightly modify the above query:
SELECT COUNT(*) FROM table;
it will fire MR jobs, because we need to read all the data for this query, and an MR job will do that for us quickly (parallel processing).
Since a Hive table is stored as files in HDFS, Hive saves processing time and effort for operations like SELECT * and SELECT * ... LIMIT by avoiding MapReduce calls and directly fetching the whole file, or a part of it, from HDFS and displaying it to the user.
Anyway, this default behavior can be changed by modifying the hive.fetch.task.conversion property in hive-site.xml so that MapReduce programs are invoked for all operations.
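For example, you can also override the property per session from the Hive CLI; in newer Hive versions the default value is more, which is what allows simple SELECTs to skip MapReduce:
-- force every query, even a plain SELECT ... LIMIT, through MapReduce:
SET hive.fetch.task.conversion=none;
-- let simple SELECTs (optionally with filters and LIMIT) be served by a fetch task:
SET hive.fetch.task.conversion=more;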
Related
I have a table in Hive and I want to fetch all data from it. The problem is that
select * from tbl;
behaves very differently from
select count(*) from tbl;
Why is that? The second query seems to run a Hadoop MapReduce job, while the first does not; it simply returns the results. The table is not partitioned or bucketed, and it is stored in text (CSV) format.
When you submit a Hive query, Hive converts a query into one or more stages. Stages could be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or other possible tasks Hive needs to do.
select * from table_name;
This query simply scans the entire table and dumps the output to the screen, which is why you see different log output on the console.
select count(*) from table_name, on the other hand, just reads the Hive metastore statistics and returns the result from there (when those statistics are accurate), so it doesn't run a MapReduce job either.
You can run the command below on the Hive console and you will be able to see all of this information.
hive> describe formatted table_name;
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles xx
numRows xxxxxxxx
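Whether Hive is actually allowed to answer aggregates such as COUNT(*) straight from these statistics is governed, in newer Hive versions, by the hive.compute.query.using.stats setting; a hedged sketch of toggling it per session:
-- answer COUNT(*), MIN and MAX from metastore statistics when they are accurate:
SET hive.compute.query.using.stats=true;
-- force such aggregations to actually scan the data:
SET hive.compute.query.using.stats=false;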
In Hadoop, aggregation, conditional, and arithmetic operations require a processing engine to compute the result. Whenever you submit this type of query, it is internally translated into a MapReduce program; the MapReduce program is executed on behalf of the query, hands its result back to Hive, and Hive displays it on your screen. That is why you see different behavior on the console.
You can put the EXPLAIN keyword in front of the query to see the query plan and other information.
Please refer to Chapter 10 of the Programming Hadoop book to learn more about Hive's EXPLAIN feature.
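For example (table_name is just a placeholder here), comparing the plans of the two queries discussed above makes the difference visible: the aggregation plan contains a map/reduce stage, while the plain SELECT typically shows only a fetch stage.
hive> EXPLAIN SELECT * FROM table_name;
hive> EXPLAIN SELECT COUNT(*) FROM table_name;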
I had a case where we had a Hive table, and when we ran select * from table where <condition>; it gave results immediately without invoking an MR job. When I created a duplicate of the same table and tried to execute the same query, an MR job was invoked. What could be the possible reason for this?
I got the answer. The reason was that the Hive ANALYZE command had been issued on the table. Once you execute a Hive ANALYZE command, it stores the number of rows and the file size in the Hive metastore. So when you do select count(*) from table, it fetches the answer directly from the Hive metastore instead of invoking a MapReduce job.
You can also issue an ANALYZE command on columns; the general syntax, followed by a concrete example, is shown below.
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
Documentation link: https://cwiki.apache.org/confluence/display/Hive/StatsDev
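As a concrete illustration (tbl here is a placeholder table name), the two commands below compute basic table statistics and per-column statistics; once they have run, a query such as SELECT COUNT(*) FROM tbl can be answered from the metastore when the statistics are accurate:
ANALYZE TABLE tbl COMPUTE STATISTICS;
ANALYZE TABLE tbl COMPUTE STATISTICS FOR COLUMNS;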
Local mode (Hive not invoking MR) depends on several conditions (see HIVE-1408):
hive.exec.mode.local.auto=true/false - Lets Hive determine whether to run in local mode automatically.
hive.exec.mode.local.auto.input.size.max=1G - When hive.exec.mode.local.auto is true, input bytes should be less than this for local mode.
hive.exec.mode.local.auto.input.files.max=4 - When hive.exec.mode.local.auto is true, the number of tasks should be less than this for local mode.
If the tables have the same data, my guess is that there is a difference in the number of tasks that are spawned when querying the two tables, causing one query to run in local mode and the other to spawn an MR job.
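To experiment with this from the Hive CLI, you can set the properties per session using the names quoted above (the exact property names and defaults can vary between Hive versions, so treat this as illustrative and check the documentation for your version):
SET hive.exec.mode.local.auto=true;
-- 1 GB input-size threshold for local mode:
SET hive.exec.mode.local.auto.input.size.max=1073741824;
-- at most 4 tasks for local mode:
SET hive.exec.mode.local.auto.input.files.max=4;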
At my organization we are trying to use Hive or Pig as an alternative.
Primary goal: reduce process time.
Netezza process time: 90 min.
Target: finish the process within 30 min.
How does the process work:
The process maintains incremental history. There are two tables, history_table and new_table. The history table maintains the total history and new_table has the updated records, so every day the updated records are added to the history table. The process has very complex stored procedures (joins/deletion/insert/update).
The same process is applied to multiple tables. Every history table has billions of records.
Doubts I have:
Does Hive/Pig perform better than Netezza?
Is a UDF in Hive a good alternative to a stored procedure? I want to create a generic process for multiple tables (where I can pass the table name as an argument).
Which performs better, Hive or Pig, for really complex joins with multiple conditions, dynamically generated CREATE statements, and exception handling?
Use Impala, which is like Netezza on Hadoop. Try Kudu for real-time and batch, or use HBase for real time and Impala for batch; note that you can also query HBase using Impala.
If I query "SELECT * FROM table" will the order of the output always be the same whenever I run this query? This has been my observation so far, but I was curious if there was any guarantee of this behavior.
In this specific case I think there is a guarantee.
Some queries in Hive won't generate MR jobs; instead Hive reads the table directly, serially, from HDFS.
In your case, select * from table will not generate an MR job (unless the table is a view).
Reading the table with a single process goes from the first file to the last, and from the head of each file to its end. Hence, I believe the order of the output will be the same every time you run the query this way.
This is of course not guaranteed for MR jobs generated from the SQL.
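If you need an ordering you can rely on regardless of how the query executes, the usual approach is to make it explicit (some_column is a hypothetical column name here):
SELECT * FROM table ORDER BY some_column;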
I have a design question. In my CDH 4.1.2 (Cloudera) installation, daily rolling log data is dumped into HDFS. I have some reports to calculate the success and failure rates per day.
I have two approaches:
Load the daily log data into Hive tables and create a complex query.
Run a MapReduce job up front every day to generate the summary (which is essentially a few lines) and keep appending it to a common file which is a Hive table. Later, while running the report, I could use a simple select query to fetch the summary.
I am trying to understand which of the two would be the better approach, or whether there is a better one.
The second approach adds some complexity in terms of merging files. If the files are not merged I would have lots of very small files, which seems like a bad idea.
Your inputs are appreciated.
Thanks
Hive seems well suited to this kind of task, and it should be fairly simple to do:
Create an EXTERNAL table in Hive, partitioned by day. The goal is for the directory where you dump your data to sit directly under your Hive table's location. You can specify the field delimiter of your daily logs as shown below, where I use commas:
create external table mytable(...) partitioned by (day string) row format delimited fields terminated by ',' location '/user/hive/warehouse/mytable';
When you dump your data into HDFS, make sure you dump it into a subdirectory named day=<value> under that location so it can be recognized as a Hive partition, for example /user/hive/warehouse/mytable/day=2013-01-23.
You then need to let Hive know that this table has a new partition:
alter table mytable add partition (day='2013-01-23');
Now that the Hive metastore knows about your partition, you can run your summary query. Make sure you're querying only that partition by specifying ... where day='2013-01-23'.
You could easily script this to run daily from cron or something similar: get the current date (for example with the shell date command) and substitute it into a shell script that performs the steps above.
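As a hedged sketch of what the daily summary query itself might look like (the status column and its SUCCESS/FAILURE values are hypothetical, and the day value would be substituted by your script, e.g. hive --hivevar day=2013-01-23 -f daily_summary.hql):
-- daily_summary.hql: success/failure counts for a single day's partition
SELECT day,
       SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
       SUM(CASE WHEN status = 'FAILURE' THEN 1 ELSE 0 END) AS failures
FROM mytable
WHERE day = '${hivevar:day}'
GROUP BY day;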