One query runs map reduce, the other does not - hadoop

I have a table inside hive, I want to fetch all data from it. The problem is that:
select * from tbl;
Gives me very different results than:
select count(*) from tbl;
Why is that? The second query seems to be running hadoop map reduce, the first does not - it simply returns the results. The table is not partitioned or bucketed, it's in the text (csv) format.

When you submit a Hive query, Hive converts a query into one or more stages. Stages could be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or other possible tasks Hive needs to do.
select * from table_name;
This query simply scan the entire table and dump the output on screen, therefore you see the different log output on console.
While select count(*) from table_name just scan the Hive meta_information and put the result from their itself. It also don't run any MapReduce job.
You can run below command on Hive console and you will be able to see the entire information.
hive> describe formatted table_name;
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles xx
numRows xxxxxxxx
In hadoop, aggregation/conditional/arithmetical operations etc required a processing engine to process and execute the result and therefore whenever you submit this type of job, it internally get translated into a MapReduce program, the MapReduce program gets executed on behalf of the query and produce its result to hive and Hive display on your screen and therefore you see a different result.
You can put the EXPLAIN keyword in front of the query to see the query plan and other information.
Please refer Programming Hadoop Book, Chapter 10 to know more about use of Hive EXPLAIN features.

Related

Why is mapreduce not Executed for Hive queries?

I had one query where we have Hive table created and when we select * from table where=< condition>; ,it gives results immediately without invoking MR job.When I create a same duplicate table and try to execute a query then MR is invoked. What could be the possible reason for this?
I got the answer,The reason was Hive analyze command was issued on the table .Once you execute a hive analyze command it stores number of row,file size in hive metastore.So ,when u do select count(*) from table.It directly fetches it from the hive metastore instead of invoking a map reduce job.
You can also issue a Analyze command on column as well.
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
Documentation link :
https://cwiki.apache.org/confluence/display/Hive/StatsDev
Local mode (hive not invoking MR) depends on several conditions (see HIVE-1408):
hive.exec.mode.local.auto=true/false - Lets Hive determine whether to run in local mode automatically.
hive.exec.mode.local.auto.input.size.max=1G - When hive.exec.mode.local.auto is true, input bytes should be less than this for local mode.
hive.exec.mode.local.auto.input.files.max=4 - When hive.exec.mode.local.auto is true, the number of tasks should be less than this for local mode.
If the tables have the same data, my guess is that there is a difference in the number of tasks that are spawned when querying the two tables causing one query to run in local mode and another to spawn a MR job.

Hive count(*) query is not invoking mapreduce

I have external tables in hive, I am trying to run select count(*) from table_name query but the query returns instantaneously and gives result which is i think already stored. The result returned by query is not correct. Is there a way to force a map reduce job and make the query execute each time.
Note: This behavior is not followed for all external tables but some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some finding I have got a method that kicks off MR for counting number of records on orc table.
ANALYZE TABLE 'table name' PARTITION('partition columns') COMPUTE STATISTICS;
--OR
ANALYZE TABLE 'table name' COMPUTE STATISTICS;
This is not a direct alternative for count(*) but provides latest count of records in the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce for count(*) of an ORC file since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, you must add a non-deterministic clause e.g. "where MYKEY is null". Assuming that MYKEY is a String, otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
please try the below :
hive>set hive.fetch.task.conversion=none in your hive session and then trigger select count(*) operation in your hive session to mandate mapreduce

Tuning Hive Queries That Uses Underlying HBase Table

I've got a table in Hbase let's say "tbl" and I would like to query it using
Hive. Therefore I mapped a table to hive as follows:
CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");
Queries like:
select * from tbl", "select id from tbl", "select id, data
from tbl
are really fast.
But queries like
select id from tbl where substr(id, 0, 5) = "12345"
select id from tbl where data["777"] IS NOT NULL
are incredibly slow.
In the contrary when running from Hbase shell:
"scan 'tbl', {
COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or
"scan 'tbl', { COLUMNS=>'data', "FILTER" =>
FilterList.new([qualifierFilter('777')])}"
it is lightning fast!
When I looked into the mapred job generated by hive on jobtracker I
discovered that "map.input.records" counts ALL the items in Hbase table,
meaning the job makes a full table scan before it even starts any mappers!!
Moreover, I suspect it copies all the data from Hbase table to hdfs to
mapper tmp input folder before executuion.
So, my questions are - Why hbase storage handler for hive does not translate
hive queries into appropriate hbase functions? Why it scans all the records
and then slices them using "where" clause? How can it be improved?
Any suggestions to improve the performance of Hive queries(mapped to HBase Table).
Can we create secondary index on HBase tables?
We are using HBase and Hive integration and trying to tune the performance of Hive queries.
Lots of questions!, I'll try to answer all and give you a few performance tips:
The data is not copied to the HDFS, but the mapreduce jobs generated by HIVE will store their intermediate data in the HDFS.
Secondary indexes or alternative query paths are not supported by HBase (more info).
Hive will translate everything into MapReduce jobs which need time to be distributed & initialized, if you have a very small number of rows its possible that a simple SCAN operation in the Hbase shell is faster than a Hive query but on big datasets, distributing the job among the datanodes is a must.
The Hive HBase handler doesn't do a very good job when extracting the start & stop row keys from the query, queries like substr(id, 0, 5) = "12345" won't use start & stop row keys.
Before executing your queries, run a EXPLAIN [your_query]; command and check for the filterExpr: part, if you don't find it, your query will perform a full table scan. On a side note, all expresions within the Filter Operator: will be transformed into the appropiate filters.
EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
tbl
TableScan
alias: tbl
filterExpr:
expr: ((id>= '12345') and (id < '12346'))
type: boolean
Filter Operator
....
Fortunately, there is an easy way to make sure start & stop row keys are used when you're looking for row-key prefixes, just convert substr(id, 0, 5) = "12345" to a simpler query: id>="12345" AND id<"12346", it will be detected by the handler and start & stop row keys will be provided to the SCAN (12345, 12346)
Now, here are a few tips in order to speed up your queries (by a lot):
Make sure you set the following properties to take advantage of batching to reduce the number of RPC calls (the number depends on the size of your columns)
SET hbase.scan.cache=10000;
SET hbase.client.scanner.cache=10000;
Make sure you set the following properties to run a distributed job in your task trackers instead of running local job.
SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;
SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];
Reduce the amount of columns of your SELECT statement to the minimum. Try not to SELECT *
Whenever you want to use start & stop row keys to prevent full table scans, always provide key>=x and key<y expressions (don't use the BETWEEN operator)
Always EXPLAIN SELECT your queries before executing them.

regarding the Hive commands that do not invoke underlying MapReduce jobs

My understanding is that Hive is an SQL-like language that can perform database-related tasks by invoking underlying MapReduce programs. However, I learned that some Hive commands does not invoke MapReduce job. I am curious to know that what are these commands, and why they do not need to invoke MapReduce job.
You are right, Hive uses MR jobs on the background to process the data.
Wen you fire a SQL like query in hive, it converts it into various MR jobs on the background and gives you the result.
Having said that, There are very few queries that doesnt need MR jobs.
for e.g
SEKECT * FROM table LIMIT 10;
If you see in the above query we dont need any data processing. All we need is just to read a few rows from a table.
So the above hive query doesnt fire a MR job
But if we slightly modify the above query.
SELECT COUNT(*) FROM table;
It will fire MR jobs. Because we need to read all the data for this query and MR job will do it for us quickly(parallel processing)
Since hive table is stored in the form of a file in HDFS,processing time and effort are saved by hive for operations like 'Select *' , 'Select * limit' by avoiding mapreduce calls and directly fetching the whole file or a part of the file from hdfs and displaying to the user.
Anyway, this default behavior can also be changed by modifying hive-site.xml hive.fetch.task.conversion property to invoke map-reduce programs for all the operations.

Is there a default order in which results will be returned in Hive?

If I query "SELECT * FROM table" will the order of the output always be the same whenever I run this query? This has been my observation so far, but I was curious if there was any guarantee of this behavior.
In this specific case i think there is a guarantee.
Some queries in Hive won't generate MR jobs and instead will IO the table directly in a serial way.
In your case, querying select * from table will not generate a MR job (unless table is a view).
Reading the table with a single process, reads from the first file to the last and from the head of each file to the end. hence, I believe that the order of the output in this way will be the same whenever you'll run the query.
This is of course not right in the case of MR jobs generated from the SQL.

Resources