Get row count from all tables in hive - hql

How can I get the row count from all tables using Hive? I am interested in the database name, table name, and row count.

You will need to run
select count(*) from table
for every table.
To automate this, you can write a small bash script and a few shell commands.
First run
$hive -e 'show tables' | tee tables.txt
This stores all tables in the database in a text file tables.txt
Create a bash file (count_tables.sh) with the following contents.
while read -r line
do
  echo "$line"
  hive -e "select count(*) from $line"
done
Now run the following commands.
$chmod +x count_tables.sh
$./count_tables.sh < tables.txt > counts.txt
This creates a text file (counts.txt) with the counts of all the tables in the database.
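The two steps can also be combined into a single loop without the intermediate tables.txt. In this sketch, hive is stubbed out with a shell function (two fake tables, a fixed count) so the loop logic can be run and inspected without a cluster; drop the stub to use the real CLI:

```shell
#!/bin/bash
# hive() is a stub standing in for the real Hive CLI (assumption for
# illustration): it lists two fake tables and returns a fixed count.
hive() {
  if [ "$2" = 'show tables' ]; then
    printf 'tab1\ntab2\n'
  else
    echo 42   # pretend every table has 42 rows
  fi
}

# List the tables, then count each one, writing "table count" lines.
hive -e 'show tables' | while read -r t; do
  cnt=$(hive -e "select count(*) from $t")
  echo "$t $cnt"
done > counts.txt

cat counts.txt
```

With the real CLI this produces the same counts.txt as the two-step version, one `table count` line per table.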

A much faster way to get an approximate count of all rows in a table is to run EXPLAIN on it. One of the explain clauses shows row counts, like below:
TableScan [TS_0] (rows=224910 width=78)
The benefit is that you are not actually spending cluster resources to get that information.
The HQL command is explain select * from table_name; but note that if the table statistics have not been computed, the TableScan clause does not show row counts.
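To pull that number out programmatically, you can grep the EXPLAIN output for the rows= attribute. A sketch; hive is stubbed here with one representative TableScan line so the extraction can be tested offline:

```shell
#!/bin/bash
# Stub standing in for: hive -S -e 'explain select * from table_name;'
hive() { echo '      TableScan [TS_0] (rows=224910 width=78)'; }

# Extract the first rows=N attribute from the plan.
rows=$(hive -S -e 'explain select * from table_name;' |
       grep -o 'rows=[0-9]*' | head -n 1 | cut -d= -f2)
echo "$rows"
```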

select count(*) from table
I don't think there is a more efficient way.

You can collect statistics on the table by using the Hive ANALYZE command. Hive's cost-based optimizer makes use of these statistics to create an optimal execution plan.
Below is the example of computing statistics on Hive tables:
hive> ANALYZE TABLE stud COMPUTE STATISTICS;
Query ID = impadmin_20171115185549_a73662c3-5332-42c9-bb42-d8ccf21b7221
Total jobs = 1
Launching Job 1 out of 1
…
Table training_db.stud stats: [numFiles=5, numRows=5, totalSize=50, rawDataSize=45]
OK
Time taken: 8.202 seconds
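Once statistics exist, the stored numRows can also be read back without any scan, e.g. from the DESCRIBE FORMATTED output. A sketch; the hive call is stubbed with a representative table-parameters line so the extraction can be tested offline:

```shell
#!/bin/bash
# Stub standing in for: hive -S -e 'describe formatted stud;'
# (real output has a "numRows" entry in the Table Parameters section)
hive() { printf '\tnumRows             \t5                   \n'; }

numrows=$(hive -S -e 'describe formatted stud;' | grep numRows | awk '{print $2}')
echo "$numrows"
```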
Links:
http://dwgeek.com/apache-hive-explain-command-example.html/

You can also set the database in the same command, separating the statements with ;.
hive -e 'use myDatabase;show tables'

To automate this, put the following in a shell script and then run bash filename.sh:
hive -e "select count(distinct fieldid) from table1 where extracttimestamp < '2018-04-26'" > sample.out
hive -e "select count(distinct fieldid) from table2 where day = '26'" >> sample.out
lc=$(uniq sample.out | wc -l)
Note that the second query appends (>>) so that both counts end up in sample.out.
if [ $lc -eq 1 ]; then
echo "PASS"
else
echo "FAIL"
fi
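The pass/fail check reduces to: write both counts into one file, then test whether uniq collapses them to a single line. A minimal, self-contained version of just that comparison (the two counts are stand-ins for the hive query results):

```shell
#!/bin/bash
# Two stand-in counts (assume both hive queries returned 100).
printf '100\n100\n' > sample.out

# If uniq leaves a single line, the counts matched.
lc=$(uniq sample.out | wc -l)
if [ "$lc" -eq 1 ]; then
  result=PASS
else
  result=FAIL
fi
echo "$result"
```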

How do I specify the database it should use in the below snippet:
while read line
do
echo "$line "
eval "hive -e 'select count(*) from $line'"
done

Here's a solution I wrote that uses python:
import os

dictTabCnt = {}
print("=====Finding Tables=====")
tableList = os.popen("hive --outputformat=dsv --showHeader=false -e \"use [YOUR DB HERE]; show tables;\"").read().split('\n')
print("=====Finding Table Counts=====")
for i in tableList:
    if i != '':
        strTemp = os.popen("hive --outputformat=dsv --showHeader=false -e \"use [YOUR DB HERE]; SELECT COUNT(*) FROM {}\"".format(i)).read()
        dictTabCnt[i] = strTemp
print("=====Table Counts=====")
for table, cnt in dictTabCnt.items():
    print("{}: {}".format(table, cnt))

Thanks to @mukul_gupta for providing the shell script.
However, we encountered the below error with it:
"bash syntax error near unexpected token done"
Solution for this at below link
BASH Syntax error near unexpected token 'done'
Also, if anyone needs to select the DB name:
$hive -e 'use databasename;show tables' | tee tables.txt
To pass the DB name in the select statement, give the DB name in the tables list file itself.

Related

Keep Hive session open EMR

I’m running a bash script on AWS EMR that does something like:
for i in 'tab1' 'tab2' 'tab3' 'tab4'
do
nrow=$(hive -e "select count(*) from $i")
done
This takes time, as a new hive session has to be set up for each count.
Is there a way to keep the session open throughout the loop?
Do all counts in a single statement. You can also generate the SQL statement instead of hardcoding.
Something like this:
output=$(hive -S -e "select 'tab1', count(*) from tab1
union all
select 'tab2', count(*) from tab2
union all
select 'tab3', count(*) from tab3")
echo "$output" | while read TABLE_NAME COUNT
do
echo "$TABLE_NAME $COUNT"
done
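Generating that statement from a table list, rather than hardcoding it, might look like this (tab1..tab3 are placeholder names):

```shell
#!/bin/bash
tables="tab1 tab2 tab3"   # placeholder table list

# Build one UNION ALL query labelling each count with its table name.
sql=""
for t in $tables; do
  [ -n "$sql" ] && sql="$sql union all "
  sql="${sql}select '$t', count(*) from $t"
done
echo "$sql"
# The generated statement can then be run once: hive -S -e "$sql"
```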

passing multiple dates as parameters to a Hive query

I am trying to pass a list of dates as parameters to my hive query.
#!/bin/bash
echo "Executing the hive query - Get distinct dates"
var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
echo $var
echo "Executing the hive query - Get the parition data"
hive -hiveconf paritionvalue=$var -e 'SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in("${hiveconf:paritionvalue}");'
echo "Hive query - ends"
Output as:
Executing the hive query - Get distinct dates
2009-02-01 2009-04-01
Executing the hive query - Get the parition data
Logging initialized using configuration in file:/hive/conf/hive-log4j.properties
OK
Product1
Product1
Product1
Product1
Product1
Product1
Time taken: 0.523 seconds, Fetched: 6 row(s)
Hive query - ends
It's only taking the first date as input. I would like to pass my dates as ('2009-02-01','2009-04-01').
Note:TransactionMainHistoryTable is partitioned on tran_date column with string type.
Collect an array of distinct values using collect_set and concatenate it with the delimiter ','. This produces the list without outer quotes: 2009-02-01','2009-04-01. In the second script add the outer quotes ' as well, or add them in the first query. Also, when executing inline SQL (the -e option) you do not need to pass a hiveconf variable; direct shell variable substitution works. Use hiveconf when you are executing a script from a file (the -f option).
Working example (use your table instead of stack):
date_list=$(hive -S -e "select concat_ws('\',\'',collect_set(substr(dt,0,10))) from (select stack(2,'2017-01','2017-02') as dt)s;")
hive -e "select * from (select stack (2,'2017-01', '2017-02')as dt)s where dt in ('${date_list}');"
Returns:
OK
2017-01
2017-02
Time taken: 1.221 seconds, Fetched: 2 row(s)
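The same quoting trick can be sketched in pure shell: join the distinct dates with ',' and wrap the whole thing in outer quotes to get a usable IN (...) list. The two dates here stand in for whatever the first hive query returns:

```shell
#!/bin/bash
# Stand-in for the distinct dates returned by the first hive query.
dates='2009-02-01
2009-04-01'

# Join the lines with a comma, then turn each comma into ','
date_list=$(printf '%s\n' "$dates" | paste -sd, - | sed "s/,/','/g")
in_list="('${date_list}')"
echo "$in_list"
```

The resulting string can be substituted directly into the second query's `in (...)` clause.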

Assert on the postgres count output using bash

I would like to make an assertion on the output of a postgres query using bash. Concretely, I am writing a bash job that counts the number of rows and, if the count is not equal to zero, raises an alert.
$ psql MY_DATABASE -c "SELECT COUNT(*) WHERE foo=bar"
count
-------
0
(1 row)
In my script, I would like to assert that the output of above query is zero. However I am not sure where to begin because the output is not a number, but a formatted multi line string.
Is there an option in psql that makes it output a single number when counting, or could you think of any other approaches?
I would suggest using a temporary file to hold the redirected output. Once your work is done, delete the temp file.
psql your_database -c "SELECT COUNT(*) as Count from table_a where c1=something" -t > assert.tmp
line=$(head -n 1 assert.tmp)
if [ "$line" -gt 0 ]; then
echo "greater than 0 and value is $line"
fi
rm assert.tmp
Hope it works for you.
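A temp-file-free variant: psql's -t flag suppresses the header and footer, and -A turns off aligned padding, so a single COUNT(*) comes back as a bare number. In this sketch psql is stubbed to return 0 so the assertion logic runs standalone; swap in the real client, database, and query:

```shell
#!/bin/bash
# Stub standing in for the real client call (assumption for illustration):
#   psql MY_DATABASE -t -A -c "SELECT COUNT(*) FROM mytable WHERE foo = 'bar'"
psql() { echo 0; }

count=$(psql MY_DATABASE -t -A -c "SELECT COUNT(*) FROM mytable WHERE foo = 'bar'")
if [ "$count" -eq 0 ]; then
  msg="OK: no rows"
else
  msg="ALERT: $count rows"
fi
echo "$msg"
```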

Export results from DB2 to CSV including column names via bash

This question branches off a question already asked.
I want to make a csv file with the db2 results including column names.
EXPORT TO ...
SELECT 1 as id, 'COL1', 'COL2', 'COL3' FROM sysibm.sysdummy1
UNION ALL
(SELECT 2 as id, COL1, COL2, COL3 FROM myTable)
ORDER BY id
While this does work, I am left with an unwanted column and rows of 1 and 2's
Is there a way to do this via the db2 command or a full bash alternative without redundant columns while keeping the header at the top?
e.g.
Column 1 Column 2 Column 3
data 1 data 2 data3
... ... ...
instead of:
1 Column 1 Column 2 Column 3
2 data 1 data 2 data3
2 ... ... ...
All the answers I've seen use two separate export statements. The first generates the column headers:
db2 "EXPORT TO /tmp/header.csv of del
SELECT
SUBSTR(REPLACE(REPLACE(XMLSERIALIZE(CONTENT XMLAGG(XMLELEMENT(NAME c,colname)
ORDER BY colno) AS VARCHAR(1500)),'<C>',', '),'</C>',''),3)
FROM syscat.columns WHERE tabschema='${SCHEMA}' and tabname='${TABLE}'"
then the query body
db2 "EXPORT TO /tmp/body.csv of del
SELECT * FROM ${SCHEMA}.${TABLE}"
then
cat /tmp/header.csv /tmp/body.csv > ${TABLE}.csv
If you just want headers for the extracted data, want those headers always on top, and want to be able to rename them so the output is more user-friendly, all in a CSV file, you can do the following:
# Creates headers and new output file
HEADERS="ID,USERNAME,EMAIL,ACCOUNT DISABLED?"
echo "$HEADERS" > "$OUTPUT_FILE"
# Gets results from database
db2 -x "select ID, USERNAME, DISABLED FROM ${SCHEMA}.USER WHERE lcase(EMAIL)=lcase('$USER_EMAIL')" | while read ID USERNAME DISABLED ;
do
# Appends result to file
echo "${ID},${USERNAME},${USER_EMAIL},${DISABLED}" >> "$OUTPUT_FILE"
done
No temporary files or merging required.
Db2 for Linux/Unix/Windows lacks a (long overdue) simple option to the export command for this common requirement.
But using the bash shell you can run two separate exports (one for the column-headers, the other for the data ) and concat the results to a file via an intermediate named pipe.
Using an intermediate named pipe means you don't need two flat-file copies of the data.
It is ugly and awkward but it works.
Example fragment (you can initialize the variables to suit your environment):
mkfifo ${target_file_tmp}
(( $? != 0 )) && echo "ERROR: failed to create named pipe ${target_file_tmp}" && exit 1
db2 -v "EXPORT TO ${target_file_header} of del SELECT 'COL1', 'COL2', 'COL3' FROM sysibm.sysdummy1 "
cat ${target_file_header} ${target_file_tmp} >> ${target_file} &
(( $? > 0 )) && echo "Failed to append ${target_file} . Check permissions and free space" && exit 1
db2 -v "EXPORT TO ${target_file_tmp} of del SELECT COL1, COL2, COL3 FROM myTable ORDER BY 1 "
rc=$?
(( rc == 1 )) && echo "Export found no rows matching the query" && exit 1
(( rc == 2 )) && echo "Export completed with warnings, your data might not be what you expect" && exit 1
(( rc > 2 )) && echo "Export failed. Check the messages from export" && exit 1
This would work for your simple case
EXPORT TO ...
SELECT C1, C2, C3 FROM (
SELECT 1 as id, 'COL1' as C1, 'COL2' as C2, 'COL3' as C3 FROM sysibm.sysdummy1
UNION ALL
(SELECT 2 as id, COL1, COL2, COL3 FROM myTable)
)
ORDER BY id
Longer term, EXTERNAL TABLE support (already in Db2 Warehouse), which has the INCLUDEHEADER option, is (I guess) going to appear in Db2 at some point.
I wrote a stored procedure that extracts the header via the describe command. The names can be retrieved from a temporary table and exported to a file. The only thing that is still not possible is concatenating the files via SQL, so a cat of both files redirected into another file is necessary as a last step.
CALL DBA.GENERATE_HEADERS('SELECT * FROM SYSCAT.TABLES') #
EXPORT TO myfile_header OF DEL SELECT * FROM SESSION.header #
EXPORT TO myfile_body OF DEL SELECT * FROM SYSCAT.TABLES #
!cat myfile_header myfile_body > myfile #
The code of the stored procedure is at: https://gist.github.com/angoca/8a2d616cd1159e5d59eff7b82f672b72
More information at: https://angocadb2.blogspot.com/2019/11/export-headers-of-export-in-db2.html.

Assign query result to variable and access it from other file

I have two files namely file1.sh and file2.sh.
file1.sh contains a DB2 query that returns the total number of employees in the employee table.
Now I want to assign that total to a variable within file1.sh.
File 1:
#!/bin/bash
#database connection goes here
echo The total number employees:
db2 -x "select count(*) from employee"
When I run the above file, it displays the total number of employees.
But I want to store that total in a variable and access it from another file, file2.sh.
File 2:
#!/bin/bash
#Here i want to use total number of employees
#Variable to be accessed here
Using the following two scripts, driver.sh and child.sh:
driver.sh
#!/bin/bash
cnt=`./child.sh syscat.tables`
echo "Number of tables: ${cnt}"
cnt=`./child.sh syscat.columns`
echo "Number of columns: ${cnt}"
child.sh
#!/bin/bash
db2 connect to pocdb > /dev/null 2>&1
cnt=`db2 -x "select count(*) from ${1}"`
db2 connect reset > /dev/null 2>&1
db2 terminate > /dev/null 2>&1
echo ${cnt}
results
[db2inst1@dbms stack]$ ./driver.sh
Number of tables: 474
Number of columns: 7006
[db2inst1@dbms stack]$ ./child.sh syscat.columns
7006
