How to select the last table from a list of hive tables? - hadoop

I have a list of hive tables and want to select the last table for performing some query.
Here is what I use to get the list of similar hive tables.
show tables 'test_temp_table*';
It displays the below result
test_temp_table_1
test_temp_table_2
test_temp_table_3
test_temp_table_4
test_temp_table_5
test_temp_table_6
I need to run some queries on test_temp_table_6. I could do this with a shell script by writing the output to a temp file and reading the last value from it, but is there a simpler way, using a Hive query, to get the table with the highest number at the end?

Using shell:
# sort -V (GNU version sort) keeps test_temp_table_10 after test_temp_table_9;
# a plain lexicographic 'sort -r | head' would wrongly pick _9 over _10.
last_table=$(hive -e "show tables 'test_temp_table*';" | sort -V | tail -n1)
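Once the name is captured, it can be substituted straight into the next query; for example (the count is just a placeholder query):
hive -e "select count(*) from $last_table;"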

You can actually run a SELECT query against the Hive metastore database based on table names (and then use regular SQL sorting with ORDER BY ... DESC and LIMIT 1) instead of using SHOW TABLES, following the approach described here: Query Hive Metadata Store for table metadata
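A minimal sketch of that approach, assuming a MySQL-backed metastore and that the trailing number is everything after the last underscore (TBLS and DBS are the standard metastore tables; SUBSTRING_INDEX is MySQL-specific):
SELECT t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE d.NAME = 'default'
  AND t.TBL_NAME LIKE 'test\_temp\_table\_%'
ORDER BY CAST(SUBSTRING_INDEX(t.TBL_NAME, '_', -1) AS UNSIGNED) DESC
LIMIT 1;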

Related

hive, get the data location using an one-liner

I wonder if there is a way to get the data location from hive using a one-liner. Something like
select d.location from ( describe formatted table_name partition ( .. ) ) as d;
My current solution is to get the full output and then parse it.
Unlike a traditional RDBMS, Hive stores its metadata in a separate database, in most cases MySQL or Postgres. The metastore database details can be found in hive-site.xml. If you have access to the metastore database, you can run SELECT against the TBLS table to get details about tables, COLUMNS_V2 to get details about columns, and so on.
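For the location specifically, the storage descriptor lives in the SDS table, joined to TBLS by SD_ID; a hedged sketch against a MySQL-backed metastore:
SELECT t.TBL_NAME, s.LOCATION
FROM TBLS t
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'table_name';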
If you do not have access to the metastore, the only option is to describe each table to get the details. If you have a lot of databases and tables, you could write a shell script that gets the list of tables with SHOW TABLES and loops over them.
Two methods if you do not have access to the metadata.
Parse DESCRIBE TABLE output in the shell, as in this answer: https://stackoverflow.com/a/43804621/2700344
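A minimal shell sketch of that parsing approach (it assumes describe formatted prints a "Location:" line, as stock Hive does, and takes the URI from the last field of that line):
for t in $(hive -e "show tables;"); do
  loc=$(hive -e "describe formatted $t;" | grep 'Location:' | awk '{print $NF}')
  echo "$t $loc"
done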
Also, Hive has a virtual column, INPUT__FILE__NAME:
select INPUT__FILE__NAME from table_name;
This will output the location URL of each file.
You can split the URL by '/', pick the element you need, aggregate, etc.
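For example, stripping the file name off to leave just the containing directory could look like this (a sketch; regexp_replace is a standard Hive UDF and table_name is a placeholder):
select distinct regexp_replace(INPUT__FILE__NAME, '/[^/]+$', '') from table_name;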

Finding the total number of rows for all tables in CockroachDB

I’m curious how many total rows I have across all of the tables in my deployment. Does CockroachDB have a command to count the total number of rows in all of my tables?
We don't currently have anything better than running a SELECT COUNT(*) query against every table in your database, which will be really slow. Instead, we recommend using the data size in the admin UI as an approximation.
If the exact count of all rows is still desired, you can use a shell script to gather all the table names from information_schema and issue a COUNT(*) query for each of them.
For example, the following snippet will print out the row counts for every table in the database cats:
# 'sed 1,2d' strips the header rows from the CLI's table output
tables=$(cockroach sql -e "SELECT table_name FROM information_schema.tables WHERE table_schema='cats'" | sed 1,2d)
for table in $tables; do
cockroach sql -e "SELECT '$table', COUNT(*) FROM cats.$table"
done
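If you want a single grand total rather than per-table counts, a hedged variant (assuming a cockroach version that supports --format=csv, which makes the output easier to parse):
total=0
for table in $tables; do
  # csv output is one header line followed by the count
  n=$(cockroach sql --format=csv -e "SELECT COUNT(*) FROM cats.$table" | sed 1d)
  total=$((total + n))
done
echo "total rows: $total"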

Hive count(*) query is not invoking mapreduce

I have external tables in Hive. When I run a select count(*) from table_name query, it returns instantaneously with a result that I think comes from already-stored statistics, and that result is not correct. Is there a way to force a MapReduce job so that the query actually executes each time?
Note: this behavior is not seen for all external tables, only some of them.
Versions used : Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I found a method that kicks off an MR job to count the number of records in an ORC table.
ANALYZE TABLE table_name PARTITION(partition_columns) COMPUTE STATISTICS;
-- or, for an unpartitioned table:
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative for count(*), but it refreshes the record count stored for the table.
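Once the statistics are computed, the stored count can be read back without scanning any data; a sketch (numRows appears under Table Parameters in stock Hive):
hive -e "describe formatted table_name;" | grep numRows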
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. That would only work if the data were stored in a simple text format with one row per line.
Hive does not need to launch a MapReduce job for a count(*) over an ORC table, since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
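For example (hive --orcfiledump is the documented entry point; the file path below is just a placeholder for one of the table's ORC files):
hive --orcfiledump /user/hive/warehouse/table_name/000000_0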
From personal experience, COUNT(*) on an ORC table often returns wrong figures -- i.e. it returns the number of rows in the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- it takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, and you must add a non-deterministic clause, e.g. "where MYKEY is null". That assumes MYKEY is a String; otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the below in your Hive session to force MapReduce:
set hive.fetch.task.conversion=none;
select count(*) from table_name;

Using Hive in real world applications?

I am a newbie to the Hadoop stack. I have learned MapReduce and now Hive, but I am not sure about Hive's use.
In MapReduce we get one or more output files as our final result, but in Hive we select records using SQL-like queries (HQL) and the results are shown only on the terminal; there is no final output file.
Now my question is: how can these HQL SELECT results be used by some other analytics team?
There are lots of ways to extract or export Hive query results.
If you want the result in any RDBMS storage you can use Sqoop.
I suggest you go through what Sqoop is and what it does.
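A minimal sketch of such an export (the connection string, credentials, MySQL table, and warehouse path are all placeholders; --export-dir must point at the HDFS directory holding the data to push out):
sqoop export \
  --connect jdbc:mysql://dbhost/analytics \
  --username dbuser --password dbpass \
  --table results_table \
  --export-dir /user/hive/warehouse/mydb.db/results \
  --input-fields-terminated-by ','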
And if you want your query results in a file, there are lots of ways to do that too.
Hive supports exporting data from tables.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from table;
Another simple approach is to redirect your Hive query output to a file when running queries in the CLI:
hive -e "select * from table" > output.txt

Hive/Impala count distinct on a partitioned column results in all data files being read?

When querying a Hive table on a partitioning column, you would expect that a simple
select count(distinct partitioned_column_name) from my_partitioned_table
would complete almost instantaneously.
But we are seeing that both Hive and Impala are unable to execute this query efficiently: they just read the entire table!
What do we need to do to ensure the above command executes rapidly?
Just as a hack: if the column is partitioned, each of its values is bound to be a distinct subdirectory under the warehouse dir.
You could try something like:
# 'hadoop fs -ls' prints a "Found N items" header line, so drop it before counting
hadoop fs -ls /<hive_warehouse_directory>/<database.db>/<table_name> | tail -n +2 | wc -l
Usually, the Hive warehouse is kept at /user/hive/warehouse.
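A related metastore-only sketch: SHOW PARTITIONS lists the table's partitions without scanning any data, so counting its output lines approximates the distinct count (assuming a single partitioning column and no empty partitions):
hive -e "show partitions my_partitioned_table;" | wc -l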
