Hive/Impala count distinct on a partitioned column results in all data files being read? - hadoop

When querying a Hive table on one of its partitioning columns, it would seem logical that a simple
select count(distinct partitioned_column_name) from my_partitioned_table
would complete almost instantaneously.
But we are seeing that both Hive and Impala are unable to execute this query efficiently: they simply read the entire table!
What do we need to do to make the above query execute rapidly?

Just as a hack: since the column is a partitioning column, its values are by construction distinct subdirectories under the table's warehouse directory.
You could try something like:
hadoop fs -ls /<hive_warehouse_directory>/<database.db>/<table_name> | wc -l
Note that hadoop fs -ls prints a "Found N items" header line, so subtract one from the result. The Hive warehouse directory is usually /user/hive/warehouse.
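A cleaner alternative, which reads only the metastore and touches no data files, is to list the partitions directly (for a table partitioned on a single column, reusing the table name from the question):
show partitions my_partitioned_table;
Counting the lines of that output, e.g. hive -e "show partitions my_partitioned_table;" | wc -l, gives the number of distinct partition values without scanning the table.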

Related

How to validate a data transfer from an external database (Oracle) to HDFS

I have a job that transfers data from Oracle to HDFS. I need an efficient way to validate this transfer, to make sure that all the rows are properly transferred.
A simple approach is to take the row count from the source Oracle table:
select count(*) from tablename;
This gives the row count on the Oracle side.
On the HDFS side, count the total number of lines (rows) in the transferred files:
hadoop fs -cat /yourdestinationhdfsfiles/* | wc -l
Data validation strategy:
Create a (temporary) Hive table matching the Oracle table structure.
Take a few records from the target HDFS files, load them into the Hive table, and check that the records and structure match (a manual validation process); a minimal sketch follows below.
Note: this can also be done for the full data set, provided you have enough storage space and processing capacity.
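As a rough sketch of that temporary table, assuming the transferred files are comma-delimited text under the HDFS path used above, and assuming the Oracle table has just two columns (the table name, column names and delimiter here are illustrative assumptions):
CREATE EXTERNAL TABLE validation_tmp (
  id   INT,     -- assumed first Oracle column
  name STRING   -- assumed second Oracle column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/yourdestinationhdfsfiles/';

SELECT COUNT(*) FROM validation_tmp;  -- compare with select count(*) from tablename on Oracle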
Hope this Helps!!!..

Spark-Sql returns 0 records without repairing hive table

I'm doing the following:
Delete hive partition using ALTER TABLE ... DROP IF EXISTS PARTITION (col='val1')
hdfs dfs -rm -r path_to_remove
Run the ingestion program that re-creates this partition (col='val1') and writes Avro files under the HDFS folder
sqlContext.sql("select count(0) from table1 where col='val1'").show returns 0 until I run MSCK REPAIR TABLE.
Is it compulsory to do the repair step to see the data again in spark-sql? Please advise.
If it's an external table, yes, you need to repair the table. I don't think you need to do that with managed tables.
Spark SQL reads partition information from the Hive metastore; without the partition registered there, nothing can be counted, by Spark or by any other tool that uses the metastore.
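A minimal sketch, reusing the table and partition from the question; either repairing the whole table or registering just the re-created partition makes it visible again (both statements can also be issued through sqlContext.sql(...)):
MSCK REPAIR TABLE table1;
-- or, to register only the one partition that was re-created:
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (col='val1');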

How does Hive read data even after it has been dropped from HDFS?

I have an external table in Hive pointing to an HDFS location. By mistake I ran the job that loads the data into HDFS twice.
Even after deleting the duplicate file from HDFS, Hive shows the row count twice (i.e. it still includes the count from the deleted duplicate data file).
select count(*) from tbl_name -- returns double the actual count
But,
select count(col_name) from tbl_name -- returns the actual count.
When I tried the same table from Impala after
INVALIDATE METADATA
I could see only the count of the data that is actually in HDFS (without the duplicate).
How can Hive report a doubled count even after the file was deleted from the physical location (HDFS)? Does it read it from statistics?
Hive is using statistics to compute count(*). You deleted the files manually (not through Hive), which is why the stats are wrong.
The solution is either:
to switch off statistics usage in such cases:
set hive.compute.query.using.stats=false;
or to analyze the table, as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;
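To confirm whether the figure is coming from statistics rather than from the data, you can inspect the stored table (or partition) parameters; numRows and totalSize are the values Hive falls back on (the partition spec below is illustrative):
DESCRIBE FORMATTED tbl_name;
-- or, for a single partition:
DESCRIBE FORMATTED tbl_name PARTITION (a='x', b='y', c='z');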

How to select the last table from a list of hive tables?

I have a list of hive tables and want to select the last table for performing some query.
Here is what I use to get the list of similar hive tables.
show tables 'test_temp_table*';
It displays the below result
test_temp_table_1
test_temp_table_2
test_temp_table_3
test_temp_table_4
test_temp_table_5
test_temp_table_6
I need to run a query on test_temp_table_6. I can do this with a shell script by writing the output to a temp file and reading the last value from it, but is there a simpler way, using a Hive query, to get the last table, i.e. the one with the highest number at the end?
Using shell:
last_table=$(hive -e "show tables 'test_temp_table*';" | sort -r | head -n1)
You can actually run a select query against the Hive metastore database, filtering on table names (and then use regular SQL sorting with ORDER BY ... DESC and LIMIT 1), instead of using SHOW TABLES, by following the approach described here: Query Hive Metadata Store for table metadata
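A minimal sketch of such a metastore query, assuming a MySQL-backed metastore with the standard TBLS/DBS schema and the tables living in the default database (connection details omitted):
SELECT t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE d.NAME = 'default'
  AND t.TBL_NAME LIKE 'test_temp_table%'
ORDER BY t.TBL_NAME DESC
LIMIT 1;
Note that this is a plain string comparison, so test_temp_table_9 sorts higher than test_temp_table_10; both this and the sort -r approach above only work while the numeric suffix keeps the same number of digits.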

Hive count(*) query is not invoking mapreduce

I have external tables in Hive, and I am trying to run a select count(*) from table_name query, but the query returns instantaneously and gives a result that I think is already stored somewhere. The result returned by the query is not correct. Is there a way to force a MapReduce job and make the query execute each time?
Note: this behavior is not seen for all of the external tables, only for some of them.
Versions used: Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I found a method that kicks off a MapReduce job to count the number of records in an ORC table.
ANALYZE TABLE 'table name' PARTITION('partition columns') COMPUTE STATISTICS;
--OR
ANALYZE TABLE 'table name' COMPUTE STATISTICS;
This is not a direct alternative to count(*), but it refreshes the record-count statistics for the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. It would only work if the data were stored in a simple text format with one row per line.
Hive does not need to launch a MapReduce job for count(*) of an ORC table, since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
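For example (the file path here is hypothetical), the dump prints the row counts and stripe statistics recorded in the ORC file footer:
hive --orcfiledump /user/hive/warehouse/mydb.db/my_orc_table/000000_0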
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- it takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, and you must add a non-deterministic clause, e.g. "where MYKEY is null". This assumes that MYKEY is a String; otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the following in your Hive session to force MapReduce:
set hive.fetch.task.conversion=none;
Then trigger the select count(*) operation again.
