Hive count(*) query is not invoking mapreduce - hadoop

I have external tables in Hive. I am trying to run a select count(*) from table_name query, but the query returns instantaneously with a result that I think is already stored. The result returned by the query is not correct. Is there a way to force a MapReduce job so that the query actually executes each time?
Note: this happens for only some of the external tables, not all of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)

After some digging, I found a method that kicks off MR to count the number of records in an ORC table:
ANALYZE TABLE table_name PARTITION(partition_columns) COMPUTE STATISTICS;
--OR
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative to count(*), but it provides the latest count of records in the table.
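As a rough sketch of how the refreshed count can be read back (the table name web_logs is hypothetical), the statistics written by ANALYZE show up under Table Parameters in the DESCRIBE FORMATTED output:

```sql
-- web_logs is a hypothetical table name used only for illustration.
-- Recompute basic statistics (row count, file count, sizes); this launches a job.
ANALYZE TABLE web_logs COMPUTE STATISTICS;

-- Inspect the stored statistics; look for numRows under "Table Parameters".
DESCRIBE FORMATTED web_logs;
```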

Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce job for count(*) on an ORC file, since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility

From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- it takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, so you must add a non-deterministic clause, e.g. "where MYKEY is null". This assumes that MYKEY is a String; otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.

Please try the following:
hive> set hive.fetch.task.conversion=none;
Run this in your Hive session, then run the select count(*) query in the same session to force MapReduce.
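A minimal sketch of such a session, assuming a table called table_name. The second setting, hive.compute.query.using.stats, is not part of the suggestion above; it is the Hive property that controls whether count(*) may be answered from stored statistics, so disabling it as well may be needed when the instant result comes from the metastore rather than from a fetch task:

```sql
-- Disable fetch-task conversion so simple queries are not served without a job.
SET hive.fetch.task.conversion=none;
-- Assumption: the stale count comes from metastore statistics; stop using them.
SET hive.compute.query.using.stats=false;

SELECT count(*) FROM table_name;
```

Both settings apply only to the current session, so they have to be repeated (or placed in the script) each time.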

Related

One query runs map reduce, the other does not

I have a table inside hive, I want to fetch all data from it. The problem is that:
select * from tbl;
Gives me very different results than:
select count(*) from tbl;
Why is that? The second query seems to be running hadoop map reduce, the first does not - it simply returns the results. The table is not partitioned or bucketed, it's in the text (csv) format.
When you submit a Hive query, Hive converts a query into one or more stages. Stages could be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or other possible tasks Hive needs to do.
select * from table_name;
This query simply scans the entire table and dumps the output to the screen, which is why you see different log output on the console.
select count(*) from table_name, by contrast, may just read the table statistics stored in the Hive metadata and return the result from there. In that case it does not run any MapReduce job either.
You can run below command on Hive console and you will be able to see the entire information.
hive> describe formatted table_name;
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles xx
numRows xxxxxxxx
In Hadoop, aggregations, conditional and arithmetic operations, and the like require a processing engine. Whenever you submit this type of query, it is internally translated into a MapReduce program, which is executed on behalf of the query and returns its result to Hive for display. That is why you see a different result.
You can put the EXPLAIN keyword in front of the query to see the query plan and other information.
Please refer to the Programming Hadoop book, Chapter 10, to learn more about Hive's EXPLAIN feature.
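For instance, the two queries from the question can be compared as follows. The exact plan text depends on the Hive version and settings, so none is reproduced here; typically, a plan that consists only of a fetch stage means that no MapReduce job is launched:

```sql
-- tbl is the table from the question above.
-- Plan for the plain scan.
EXPLAIN SELECT * FROM tbl;

-- Plan for the aggregation; compare its stages with the plan above.
EXPLAIN SELECT count(*) FROM tbl;
```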

How Hive reads data even after dropping from hdfs?

I have an external table in Hive that points to an HDFS location. By mistake, I ran the job that loads the data into HDFS twice.
Even after deleting the duplicate file from HDFS, Hive still shows double the row count (i.e. it includes the rows of the deleted duplicate file).
select count(*) from tbl_name -- returns double the count
But,
select count(col_name) from tbl_name -- returns the actual count.
When I queried the same table from Impala after
INVALIDATE METADATA
I could see only the count of the data that is actually available in HDFS (no duplicates).
How can Hive return double the count even after the file has been deleted from its physical location (HDFS)? Does it read from statistics?
Hive is using statistics to compute count(*). You deleted the files manually (not through Hive), which is why the stats are wrong.
The solution is either to switch off statistics usage in such cases:
set hive.compute.query.using.stats=false;
or to re-analyze the table, as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;

Set ORC file name

I'm currently implementing an ETL (Talend) that loads monitoring data into HDFS and a Hive table.
I am now facing concerns about duplicates. In more detail: if we need to run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
The solution in an RDBMS would have been to store the input file name and to "DELETE WHERE file name=..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like your advice on how to handle this. I envisage two solutions:
Currently, the ETL puts CSV files into HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that, with this operation, I lose the file name, and the ORC file is named 00000. Is it possible to specify the name of the created ORC file? If so, I would be able to search the data by its file name and delete it before launching the ETL.
I'm not familiar with Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID in Hive? Will I be able to "DELETE WHERE" with it?
Feel free to propose any other solution you may have.
Bests,
Orlando
If the data volume in the target table is not too large, I would advise:
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
(SELECT 1
FROM trg x
WHERE x.key =src.key
AND <<additional filter on target to reduce data volume>>
)
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys in the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap fits in the RAM available for the Mappers' heap (check your default conf files, and increase it with a set command in the Hive script if necessary), the performance will be sub-optimal, but you can be pretty sure that you will not have any duplicates.
And in your actual use case you don't have to check each key but only a "batch ID", more precisely the original file name. The way I did it in my previous job was:
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
(SELECT DISTINCT 1
FROM trg x
WHERE x.original_file_name = src.INPUT__FILE__NAME
AND <<additional filter on target to reduce data volume>>
)
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters, so the overhead should stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would automatically do it at execution time, but Hive does not (not yet) so you have to force it. Note also the "1" is just a dummy value required because of "SELECT" semantics; again, a mature DBMS would allow a dummy "null" but some versions of Hive would crash (e.g. with Tez in V0.14) so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering my own question. I found a solution:
I partitioned my table by (date, input_file_name). (Note: I can get the input file name with SELECT INPUT__FILE__NAME in Hive.)
Once I did this, before running the ETL, I can send Hive an ALTER TABLE table_name DROP IF EXISTS PARTITION (file_name=...) so that the folder containing the input data is deleted if this input file has already been loaded into the ORC table.
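A minimal sketch of that drop-then-reload sequence, under stated assumptions: the table names monitoring_orc and staging_csv, the columns, the partition column event_date (standing in for the date partition) and the literal values are all hypothetical and not taken from the original post.

```sql
-- Hypothetical names and values throughout; adapt to the real schema.
-- 1. Remove the partition left by a previous run of the same input file, if any.
ALTER TABLE monitoring_orc
  DROP IF EXISTS PARTITION (event_date = '2015-06-01', input_file_name = 'batch_0001.csv');

-- 2. Reload the file's rows into the same partition (static partition insert).
INSERT INTO TABLE monitoring_orc
  PARTITION (event_date = '2015-06-01', input_file_name = 'batch_0001.csv')
SELECT host, metric, metric_value
FROM staging_csv;
```

Because the drop is guarded by IF EXISTS, the same two statements can be run on the first load and on every re-run of that file.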
Thank you everyone for your help.
Cheers,
Orlando

Few Hive Interview Questions

I have some questions which I faced recently in the interview with a company. As I am a newbie in Hadoop, can anyone please tell me the right answers?
Questions:
Difference between "Sort By" and "Group by" in Hive. How they work?
If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
How to optimize Hive Performance?
Difference between "Internal Table" and "External Table"
What is the main difference between Hive and SQL
Please provide me few useful resources, so that I can learn in the better way. Thanks
Please find the answers below:
1. Difference between "Sort By" and "Group by" in Hive. How they work?
Ans. SORT BY sorts the data per reducer; it provides ordering of the rows within a reducer. If there is more than one reducer, "sort by" may give partially ordered final results.
GROUP BY, on the other hand, groups records by the specified columns, which allows you to apply aggregate functions (such as SUM, COUNT, AVG, etc.) to the non-grouped columns.
2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
Ans. I think Reducer will work, because as per Hive documentation --
Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random.
SELECT * FROM t1 LIMIT 5
To pick rows at random, it needs the complete result output from the Reducer.
3. How to optimize Hive Performance?
Ans. These links should answer this:
5 WAYS TO MAKE YOUR HIVE QUERIES RUN FASTER
5 Tips for efficient Hive queries with Hive Query Language
4. Difference between "Internal Table" and "External Table"
Ans. An "Internal Table", also known as a Managed Table, is one that is managed by Hive. When you load data from HDFS into such a table, the data is moved to the Hive default location /user/hive/warehouse/. If such an internal table is then dropped, the data is deleted along with it.
An "External table", on the other hand, is user managed, and the data is not moved to the Hive default directory after loading, i.e. any custom location can be specified. Consequently, when you drop such a table, no data is deleted; only the table schema is dropped.
5. What is the main difference between Hive and SQL
Ans. Hive is a data warehousing layer on top of Hadoop that provides a SQL-like, row-and-table interface to users for analyzing the underlying data. It uses the HiveQL (HQL) language for this, which is loosely based on the SQL-92 standard.
SQL is a standard RDBMS language for accessing and manipulating databases.
I am new to Hadoop and Hive as well so I can't give you a complete answer.
From what I've read in the book "Hadoop The Definitive Guide" the key difference between Hive and SQL is that Hive (HiveQL) was created with MapReduce in mind. Hive's SQL dialect is supposed to make it easier for people to interact with Hadoop without needing to know a lot about Java (and SQL is well known by data professionals anyway).
As time has gone on, Hive has become more compliant with the SQL standard. It blends a mix of MySQL's and Oracle's SQL dialects with SQL-92.
The Main Difference
From what I've read, the biggest difference is that RDBMSs typically use schema on write. This means that data needs to conform to the schema when you load it into the database. Hive uses schema on read: it doesn't verify the data when it is loaded, only when it is queried.
Information obtained from Hadoop The Definitive Guide
Really good book and gives a good overview of all the technologies involved.
EDIT:
For external and internal tables, check out this response:
Difference between Hive internal tables and external tables?
Information regarding Sort By and Group By
Sort By:
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
Difference between Sort By and Order By
(Taken from the link provided maybe this will help with the difference between Group By and Sort By)
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field, whereas SORT BY, if there are multiple reducers, partitions randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified.
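To make the contrast concrete, here is a small sketch of the three clauses side by side. The pv_users table and its userid and gender columns are used purely for illustration:

```sql
-- pv_users(userid, gender) is an illustrative table.
-- ORDER BY: total order over the whole result.
SELECT userid, gender FROM pv_users ORDER BY userid;

-- SORT BY: each reducer's output is sorted, but with several reducers
-- the combined output may be only partially ordered.
SELECT userid, gender FROM pv_users SORT BY userid;

-- CLUSTER BY: rows with the same userid go to the same reducer and are sorted there
-- (shorthand for DISTRIBUTE BY userid SORT BY userid).
SELECT userid, gender FROM pv_users CLUSTER BY userid;
```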
Group By:
Group By is done using aggregation. It is pretty much done the same as you would normally in any other SQL dialect.
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
This query selects pv_users.gender and counts the distinct userids from the pv_users table. In order to count the users of each gender, you first have to group together all the users who have the same gender. (Query taken from the group by link below.)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy
Information on Optimizing Hive Performance
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Optimizing Joins
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919/
General Hive Performance Tips
https://streever.atlassian.net/wiki/display/HADOOP/Hive+Performance+Tips
Some extra resources
SQL to Hive Cheat Sheet
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Hive LIMIT Documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
Best of luck in your interview!
From Hive 0.10.0, a simple select statement such as select column_name from table_name LIMIT n can avoid MapReduce if the task conversion setting hive.fetch.task.conversion=more is set.
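A short sketch of that setting in use (table_name and column_name are placeholders, and the LIMIT value is arbitrary):

```sql
-- Allow simple select / filter / limit queries to be served by a fetch task
-- instead of a MapReduce job.
SET hive.fetch.task.conversion=more;

SELECT column_name FROM table_name LIMIT 10;
```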
1. Difference between "Sort By" and "Group by" in Hive. How they work?
SORT BY : It sorts the result within each reducer defined for the MapReduce job. The overall output is not necessarily in sorted order, but the output coming from each reducer is. (I ran my example on an 11-node cluster.)
GROUP BY : It helps in aggregating the data. sum(), count(), avg(), max(), min(), collect_list() and collect_set() all work with group by. It's like clubbing the results together based on the same features. Example: if there is a state column and a population column and we aggregate on the basis of states, there would be 29 distinct values with sum(population).
2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
select * from db.table limit 1 : this statement never involves reducers; you can check by using the explain statement.
select * from db.table order by column : this uses reducers, as does any query with an aggregation.
3. How to optimize Hive Performance?
Using Tez session
Using bucketing and Partitioning
Using Orc file format
Using vectorisation
Using CBO
4. Difference between "Internal Table" and "External Table"
Internal table : Both metadata and data are managed by Hive. If one drops the table, the entire schema and the data are deleted automatically.
External table : Only metadata is handled by Hive. Data is handled by the user. If one drops the table, only the schema is deleted; the data remains intact. To create an external table, one needs to use the external keyword in the create statement and also specify the location where the data is stored.
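The two kinds of table might be declared as follows. This is only a sketch: the table names, columns and HDFS path are hypothetical.

```sql
-- Hypothetical table names, columns and HDFS path, for illustration only.
-- Managed (internal) table: DROP TABLE removes both the metadata and the data.
CREATE TABLE sales_managed (id INT, amount DOUBLE);

-- External table: DROP TABLE removes only the metadata; files under LOCATION remain.
CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales';
```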
5. What is the main difference between Hive and SQL
Hive is a data warehouse tool designed to process structured data on Hadoop, while SQL is used to process structured data in an RDBMS.
Reducer will not run if we use limit in select clause.
select * from table_name limit 5;

Is there a default order in which results will be returned in Hive?

If I query "SELECT * FROM table" will the order of the output always be the same whenever I run this query? This has been my observation so far, but I was curious if there was any guarantee of this behavior.
In this specific case I think there is a guarantee.
Some queries in Hive don't generate MR jobs and instead read the table directly, in a serial way.
In your case, select * from table will not generate an MR job (unless table is a view).
A single process reads the table from the first file to the last, and each file from beginning to end. Hence, I believe the order of the output will be the same whenever you run the query.
This is of course not true for MR jobs generated from the SQL.

Resources