Is it possible to count the number of partitions? - hadoop

I am working on a test in which I must find out the number of partitions of a table and check that it is correct. If I use show partitions TableName I get all the partitions by name, but I want just the number of partitions, something along the lines of show count(partitions) TableName (which merely returns OK, so it is no help) that would give me, say, 12.
Is there any way to achieve this?

Using Hive CLI
$ hive --silent -e "show partitions <dbName>.<tableName>;" | wc -l
--silent enables silent mode, so only the partition names are printed, one per line
-e tells Hive to execute the quoted query string
wc -l then counts those lines, which equals the number of partitions

You could use:
select count(distinct <partition key>) from <TableName>;
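If the table is partitioned on more than one column, count the distinct key combinations instead. A minimal sketch, assuming a hypothetical table mydb.sales partitioned by (year, month); note that unlike show partitions this runs a real query against the table, so it can be slow:
# Counts distinct (year, month) partition-key combinations for a
# hypothetical table; the database, table, and column names are placeholders.
hive -S -e "select count(distinct year, month) from mydb.sales;"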

The command below lists all the partitions, and at the end it reports the number of fetched rows. That row count is the number of partitions.
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)];
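For example, on the Hive CLI the tail of the output looks like this (table and partition names invented here):
hive> SHOW PARTITIONS mydb.sales;
ds=2023-01-01
ds=2023-01-02
...
ds=2023-01-12
Time taken: 0.071 seconds, Fetched: 12 row(s)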

You can use the WebHCat interface to get information like this. This has the benefit that you can run the command from anywhere that the server is accessible. The result is JSON - use a JSON parser of your choice to process the results.
In this example, the WebHCat result is piped to Python, and only the number 24 is returned, representing the number of partitions for this table. (The server name is the name node.)
curl -s 'http://*myservername*:50111/templeton/v1/ddl/database/*mydatabasename*/table/*mytablename*/partition?user.name=*myusername*' | python -c 'import sys, json; print(len(json.load(sys.stdin)["partitions"]))'
24
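If jq is available, the same count can be computed without Python (same placeholder server, database, table, and user names as above):
# jq counts the elements of the "partitions" array in the JSON reply.
curl -s 'http://*myservername*:50111/templeton/v1/ddl/database/*mydatabasename*/table/*mytablename*/partition?user.name=*myusername*' | jq '.partitions | length'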

In Scala (e.g. in spark-shell with Hive support) you can do the following:
sql("show partitions <table_name>").count()

I used the following:
beeline -silent --showHeader=false --outputformat=csv2 -e 'show partitions <dbname>.<tablename>' | wc -l

Use the following syntax:
show create table <table name>;
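Note that show create table prints the table's DDL, so it shows the PARTITIONED BY columns rather than the number of partitions; a hypothetical excerpt:
hive> show create table sales;
CREATE TABLE `sales`(
  `id` bigint)
PARTITIONED BY (
  `ds` string)
...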

Related

How to retain last N partitions for a hive external table?

I need to retain, say, the last 7 partitions (and their data) of a given Hive external table.
This can be done either via a shell script or a Hive HQL script.
The table is partitioned by intgestion_date=YYYY-MM-DD.
What would be the best way to find the cutoff date (that of the 7th partition), which I can then use in the drop-partitions where clause to drop everything older than that?
Since it's an external table, I will have to change the table properties to make it internal before the drop and then revert it.
There are different possible approaches. Dropping all partitions older than 7 days is easy (shell):
hive -e "ALTER TABLE mytable DROP IF EXISTS PARTITION(intgestion_date < '$(date -d "7 days ago" '+%Y-%m-%d')')"
But it seems this is not exactly what you want: you need to get the 7th partition first and use it in the previous statement. Execute show partitions, then use sort, head and tail to get the 7th partition:
seventh_partition=$(hive -S -e "show partitions table_name" | sort -r | head -n 7 | tail -n 1)
#extract value
part_value=${seventh_partition#*=}
#Execute the drop for everything older than the 7th partition. Replace hive -e with echo first and check what it prints
hive -e "ALTER TABLE table_name DROP IF EXISTS PARTITION(intgestion_date < '$part_value')"

Filter Column in CSV and get the unique value

I have three columns in a CSV: Client Name, Save Set Name and Status. Some clients have two Status values, both Failed and Success. I want to keep only the clients whose status is only ever Failed; clients that have both a Failed and a Success entry should be omitted.
When I use the command listed below, it also gives me clients whose status was successful at some later point. I want only the clients that were only ever Failed, not successful even once:
cat "$pwd"/Daily-Failed.csv|egrep -i 'failed|Interrupted'|awk -F',' '{print $2,$3,$9}'|sort -u > "$pwd"/Final-Failed/Failed.csv
(edit) Or with newlines:
cat "$pwd"/Daily-Failed.csv|
egrep -i 'failed|Interrupted'|
awk -F',' '{print $2,$3,$9}'|
sort -u > "$pwd"/Final-Failed/Failed.csv
Please find the input and desired output below.
Input:
Client Name,Save Set,Status
Star,D:/,Failed
Star,C:/,Failed
Moon,C:/,Failed
Galaxy,D:/,Failed
Sun,D:/,Failed
Star,C:/,Success
Sun,D:/,Success
Output "Client Name","Save Set",Status
Galaxy,D:/,Failed
Moon,C:/,Failed
Star,D:/,Failed
I want to keep only the clients whose status is only ever Failed; clients that have both a Failed and a Success entry should be omitted.
Looking at your sample input (which really needs to be text in your question, not an image), I'm going to assume that both the Client Name and Save Set columns matter: you have (Star, C:/) with both success and failure rows, and (Star, D:/) with just a failure, and the latter shows up in your output; that's the only reading that makes sense given your stated goal. On the other hand, you also have two (Sun, D:/) rows, one success and one failure, and that pair shows up in your output even though it doesn't meet your criteria any way you look at it...
Anyway, this sort of grouping and filtering of tabular data screams database, and I like to script sqlite to make it do all the work in such cases:
#!/bin/sh
filename=Daily-Failed.csv
sqlite3 -batch -csv -header <<EOF
.import '${filename}' tbl
SELECT *
FROM tbl
GROUP BY "Client Name", "Save Set"
HAVING count(*) = 1 AND Status = 'Failed'
EOF
After taking the data in your image and turning it into a CSV file Daily-Failed.csv that looks like
Client Name,Save Set,Status
Star,D:/,Failed
Star,C:/,Failed
Moon,C:/,Failed
Galaxy,D:/,Failed
Sun,D:/,Failed
Star,C:/,Success
Sun,D:/,Success
that script will output
"Client Name","Save Set",Status
Galaxy,D:/,Failed
Moon,C:/,Failed
Star,D:/,Failed
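If pulling in sqlite feels heavy, a two-pass awk sketch gives the same rows under the same assumptions (plain CSV with no quoted commas; a group qualifies only if it has exactly one row and that row is Failed). Output order may differ from the sqlite version:
# Pass 1 (NR==FNR): count rows per (Client Name, Save Set) pair and
# remember pairs that ever had a status other than Failed.
# Pass 2: print the header line, then every Failed row whose pair
# occurred exactly once and never had another status.
awk -F',' '
  NR==FNR { seen[$1 FS $2]++; if ($3 != "Failed") bad[$1 FS $2] = 1; next }
  FNR==1 || ($3 == "Failed" && seen[$1 FS $2] == 1 && !(($1 FS $2) in bad))
' Daily-Failed.csv Daily-Failed.csv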

Show runtime of a query on monetdb

I am testing MonetDB as a columnar storage engine.
I have already installed and run the server,
but when I connect with the client and run a query, the response does not show the time taken to execute the query.
I am connecting as:
mclient -u monetdb -d voc
I have already tried connecting in interactive mode:
mclient -u monetdb -d voc -i
Output example:
sql>select count(*) from voc.regions;
+---------+
|      L3 |
+=========+
| 5570699 |
+---------+
1 tuple
As mkersten mentioned, I would read through the options of the mclient utility first.
To get server and client timing measurements, I used the --timer=performance option when starting mclient.
Inside mclient I would then discard the result output with \f trash, so that only the timings are reported.
Prepend trace to your query and you get your results like this:
sql>\f trash
sql>trace select count(*) from categories;
sql:0.000 opt:0.266 run:1.713 clk:5.244 ms
sql:0.000 opt:0.266 run:2.002 clk:5.309 ms
The first of the two lines shows you the server timings, the second one the overall timing including passing the results back to the client.
If you use the latest version MonetDB-Mar18 you have good control over the performance timers, which includes parsing, optimization, and runtime at server. See mclient --help.
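Putting it together, a hypothetical session (timings invented) might look like:
$ mclient -u monetdb -d voc --timer=performance
sql>\f trash
sql>trace select count(*) from voc.regions;
sql:0.001 opt:0.312 run:2.104 clk:6.010 ms
sql:0.001 opt:0.312 run:2.399 clk:6.088 ms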

Hive query in Shell Script

I have an external hive table on top of a parquet file.
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
I want to get the row count of the table using a shell script.
I tried the following command:
myVar=$(hive -S -e "select count(*) from parquet_test;")
echo $myVar
I added -S to run Hive in silent mode, but I still get the whole MapReduce log along with the count in the myVar variable. How can I get only the count?
I don't have access to any of the configuration files to enable or disable the logging level. Is there any other way?
Finally found a workaround.
First flush the query result into a file, then read the answer back from the file.
The file only contains the result of the query.
(hive -S -e " INSERT OVERWRITE LOCAL DIRECTORY '/home/test/result/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select count(*) from parquet_test;")
Then read the file into a variable (note the directory is on the local filesystem, since LOCAL was used above):
var=$(cat /home/test/result/*)
echo $var
Thank you
myVar=$(eval "hive -S -e 'select count(*) from parquet_test;' ")
echo $myVar
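If the MapReduce log lines are what pollutes the variable, another common approach is to discard stderr inside the command substitution, assuming (as is usual) that Hive writes its progress logging to stderr and only the query result to stdout:
# Keep stdout (the count), throw away the job log on stderr.
myVar=$(hive -S -e "select count(*) from parquet_test;" 2>/dev/null)
echo "$myVar"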

Search a table in all databases in hive

In Hive, how do we search for a table by name across all databases?
I am a Teradata user. Is there any counterpart to the system tables (present in Teradata) like dbc.tables and dbc.columns in Hive?
You can use a SQL LIKE pattern to search for a table.
Example:
I want to search for a table whose name starts with "benchmark", but I don't know the rest of it.
Input in HIVE CLI:
show tables like 'ben*'
Output:
+-----------------------+--+
|       tab_name        |
+-----------------------+--+
| benchmark_core_month  |
| benchmark_core_qtr    |
| benchmark_core_year   |
+-----------------------+--+
3 rows selected (0.224 seconds)
Or you can try the command below if you are using Beeline:
!tables
Note: it will work only with Beeline (the JDBC-based client)
More about beeline: http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/
You can also use HDFS to find a table across all databases.
The path of the Hive warehouse (where the databases live) is:
/apps/hive/warehouse/
So, using hdfs:
hdfs dfs -find /apps/hive/warehouse/ -name 't*'
You should query the metastore.
You can find the connection properties within hive-site.xml:
grep -A1 jdo "$HIVE_HOME/conf/hive-site.xml"
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
--
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
--
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
--
<name>javax.jdo.option.ConnectionPassword</name>
<value>cloudera</value>
Within the metastore you can use a query similar to the following (MySQL):
select *
from metastore.DBS as d
join metastore.TBLS as t
  on t.DB_ID = d.DB_ID
where t.TBL_NAME like '% ... put something here ... %'
order by d.NAME, t.TBL_NAME;
Searching for tables with a name containing infob across all Hive databases:
for i in `hive -e "show schemas"`; do echo "Hive DB: $i"; hive -e "use $i; show tables"|grep "infob"; done
Hive stores all its metadata information in the Metastore. The Metastore schema can be found at: https://issues.apache.org/jira/secure/attachment/12471108/HiveMetaStore.pdf
It has tables like DBS for databases, TBLS for tables, and COLUMNS for columns. You can use an appropriate join to find out table names or column names.
@hisi's answer is elegant. However, it induced an out-of-memory error during GC on our cluster, so here is another, less elegant approach that works for me.
Let foo be the table name to search for. Then:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/foo$'
If one does not remember the exact name of the table but only a substring bar of it, then the command is:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/[^/]\{1,\}$' | grep bar
That's an extension of Mantej Singh's answer: you can use pyspark to find tables across all Hive databases (not just one):
from functools import reduce
from pyspark import SparkContext
from pyspark.sql import DataFrame, HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Collect all database names, then union the per-database table lists.
dbnames = [row.databaseName for row in sqlContext.sql('SHOW DATABASES').collect()]
tnames = []
for dbname in dbnames:
    tnames.append(sqlContext.sql('SHOW TABLES IN {} LIKE "%your_pattern%"'.format(dbname)))
tables = reduce(DataFrame.union, tnames)
tables.show()
The way to do this is to iterate through the databases, searching each one for a table with the specified name.
