I need to retain, say, the last 7 partitions and their data for a given Hive external table.
This can be done via either a shell script or a Hive HQL script.
The table is partitioned by intgestion_date=YYYY-MM-DD.
What would be the best way to find the cutoff date (the 7th partition's value), which I can then use in the DROP PARTITION clause to drop everything older than that?
Since it's an external table, I will have to change the table properties to make it internal before the drop and then revert it, so that the underlying data is deleted as well.
There are different possible approaches. Dropping all partitions older than 7 days is easy (shell):
hive -e "ALTER TABLE mytable DROP IF EXISTS PARTITION(intgestion_date < '$(date -d "7 days ago" '+%Y-%m-%d')')"
But it seems this is not exactly what you want: you need to get the 7th partition first and use its value in the previous statement. Execute show partitions, then use sort, head and tail to get the 7th partition:
seventh_partition=$(hive -S -e "show partitions table_name" | sort -r | head -n 7 | tail -n 1)
# extract the value after '='
part_value=${seventh_partition#*=}
# Drop everything older than the 7th partition. Replace hive -e with echo first to check what it prints.
hive -e "ALTER TABLE table_name DROP IF EXISTS PARTITION(intgestion_date < '$part_value')"
I'm new to Hadoop etc. I connect via Beeline to HiveServer2. Then I create a table:
create table test02(id int, name string);
The table is created, and I try to insert values:
insert into test02(id, name) values (1, "user1");
And nothing happens. test02 and values__tmp__table__1 are created, but they are both empty. The Hadoop directory "/user/$username/warehouse/test02" is empty too.
0: jdbc:hive2://localhost:10000> insert into test02 values (1,"user1");
No rows affected (2.284 seconds)
0: jdbc:hive2://localhost:10000> select * from test02;
+------------+--------------+
| test02.id | test02.name |
+------------+--------------+
+------------+--------------+
No rows selected (0.326 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+------------------------+
| tab_name |
+------------------------+
| test02 |
| values__tmp__table__1 |
+------------------------+
2 rows selected (0.137 seconds)
Temp tables like these are created when Hive needs to manage intermediate data during an operation. Hive automatically deletes all temporary tables at the end of the Hive session in which they are created. If you close the session and open it again, you won't find the temp table.
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.0/bk_data-access/content/temp-tables.html
Insert data like this:
insert into test02 values (999, "user_new");
The data will be inserted into test02, and a temp table like values__tmp__table__1 will be created (the temp table will be gone after the Hive session ends).
I found a solution. I'm new to Hadoop & co., so the answer was not obvious to me.
First, I set Hive logging to level ERROR to see the problem:
Find hive-exec-log4j2.properties ({your hive directory}/conf/)
Find property.hive.log.level and set the value to ERROR (..log.level = ERROR)
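For reference, the edited line looks like this (a sketch; the surrounding defaults vary by distribution):
# {your hive directory}/conf/hive-exec-log4j2.properties
property.hive.log.level = ERROR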
Then, while executing the command insert into via Beeline, I saw all of the errors. The main error was:
There are 0 datanode(s) running and no node(s) are excluded in this operation
I found the same question elsewhere. The top answer, which helped me, was to delete all /tmp/* files (which stored all of my local HDFS data).
Then, like the first time, I initialized the namenode (-format) and Hive (ran my metahive script).
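In shell terms, the recovery looked roughly like this (a hedged sketch: the schematool call is my guess at what a metastore init script does, and paths differ per install):
rm -rf /tmp/*                          # clear the local data HDFS kept under /tmp
hdfs namenode -format                  # re-initialize the namenode, as on first setup
schematool -dbType derby -initSchema   # re-create the Hive metastore schema (assumption)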
The problem was solved, though it did expose another issue, which I'll need to look into: the insert into executes in 25+ seconds.
I want to add a prefix to some Hive tables, something like the following:
alter table sales_info rename to archived_sales_info;
except there are some 200 tables and I'd rather not do them all by hand. Is there any way to do this, either via Hive itself or perhaps a bash script?
You can create a shell script like the one below:
#!/bin/bash
hive -S -e "show tables" > table_list.txt
while read -r line
do
  hive -S -e "alter table $line rename to archived_$line;"
  echo "$line"
done < table_list.txt
Before :
> show tables;
OK
t1
t2
Time taken: 0.016 seconds, Fetched: 2 row(s)
After executing script :
> show tables;
OK
archived_t1
archived_t2
Time taken: 0.016 seconds, Fetched: 2 row(s)
I added the echo in the loop so that you can keep track of which tables have been changed; you can also redirect it to a file, like echo "$line" >> changed.txt.
You can modify the code as per your requirements, but it should solve your purpose without any changes.
As this is almost certainly a one-time thing, consider the following flow:
List tables
Copy to your favorite editor, for instance Excel
Use the list of tables to create a list of alter statements
Of course you can also run a script, but this is the fastest I could think of.
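For instance, step 3 can itself be a one-liner that generates the ALTER statements into a file for review before running them (a sketch; adjust for database-qualified names as needed):
hive -S -e "show tables" | awk '{print "alter table " $1 " rename to archived_" $1 ";"}' > rename_all.hql
# review rename_all.hql, then:
hive -f rename_all.hql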
I am trying to create a timestamp-based partition in Hive, but Hive is creating a date-based partition instead. Below is my code. Could someone please help?
cat test1.sh
dat=`date +'%Y%m%d %H:%M:%S'`
hive -f load.hql -hiveconf file_load_timestamp=$dat;
cat load.hql
INSERT OVERWRITE table perm.test partition(file_load_timestamp='${hiveconf:file_load_timestamp}')
SELECT a,b FROM work.temp;
dt=20180102/ is the HDFS path that is getting created.
dt=20180102 103455/ is the HDFS path I expect to be created.
When I tried the %Y%m%d_%H:%M:%S format (underscore instead of space), it worked as expected. But I need a space between the date and the time.
To create a folder name in HDFS with a space in it, you need to escape the space with \:
hadoop fs -mkdir test\ 123
creates a folder in HDFS named test 123.
Similarly, Hive maintains partitions in folders named after the partition value. That's why providing the date format %Y%m%d\ %H%M%S helps to create a folder with a space in it.
Below is tested and working:
INSERT OVERWRITE table person_details1 partition(datelocal='20180102\ 200128') select * from person_details;
(datelocal is a string column.)
Edit: I executed the code; below is a working version:
hduser@Amit:~$ cat test1.sh
#!/bin/sh
dat=`date +'%Y%m%d\ %H%M%S'`
hive -f load.hql -hiveconf datelocal="$dat";
hduser@Amit:~$ cat load.hql
INSERT OVERWRITE table amit.person_details1 partition(datelocal='${hiveconf:datelocal}') select * from amit.person_details;
I started an EC2 cluster on Amazon to install Cloudera. I got it installed and configured, and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named like this:
pagecounts-20090430-230000.gz
where 20090430 is the date and 230000 is the time.
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The directory
/user/hdfs
contains multiple pagecount files associated with different dates and times. How can one pull that information and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp (in the name) of each file. I was able to make it work in Hive; I hope you can adapt it (if needed) for Impala, as the query syntax is the same.
For me, this problem was not possible to solve using Hive alone, so I mixed bash with Hive scripting, and it works fine for me. This is how I wrapped it up:
Create table wikiPgvws with a partition column
Create table wikiTmp with the same fields as wikiPgvws, but without the partition
For each file:
i. Load the data into wikiTmp
ii. grep the timestamp from the file name
iii. Use sed to replace placeholders in a predefined hql script that loads the data into the actual table, then run it
Drop table wikiTmp and remove tmp.hql
The script is as follows:
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
echo "currentFile :$fileName"
dst=$(echo "$fileName" | grep -oE '[0-9]{8}-[0-9]{6}')
echo "currentStamp $dst"
sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
sed -i "s!targetPartition!$dst!" tmp.hql
hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The hql script consists of just two lines:
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them, this script is of no use to you. Also, the grep commands that fetch the file name and timestamp need to be modified according to your table location and stamp pattern. This is just one way to show how the job can be done; I couldn't find another.
Enhancement
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. I'm not sure whether hql script arguments can be used to define partition values, but you could look into that as a replacement for sed, as sketched below.
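A sketch of that idea, assuming -hiveconf substitution works inside LOAD DATA and PARTITION clauses (the variable names src and dts are made up):
# t.hql rewritten with hiveconf variables instead of sed placeholders:
#   LOAD DATA INPATH '${hiveconf:src}' OVERWRITE INTO TABLE wikiTmp;
#   INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts='${hiveconf:dts}') SELECT w.* FROM wikiTmp w;
# then, inside the loop, no sed or tmp.hql is needed:
hive -hiveconf src="$fileName" -hiveconf dts="$dst" -f t.hql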
I am working on a test in which I must find the number of partitions of a table and check that it is right. If I use show partitions TableName I get all the partitions by name, but I wish to get the number of partitions, something along the lines of show count(partitions) TableName (which just returns OK, by the way, so it's no good), and get back 12 (for example).
Is there any way to achieve this?
Using Hive CLI
$ hive --silent -e "show partitions <dbName>.<tableName>;" | wc -l
--silent enables silent mode
-e tells Hive to execute the quoted query string
You could use:
select count(distinct <partition key>) from <TableName>;
By using the command below, you will get all the partitions, and at the end it shows the number of fetched rows. That number of rows is the number of partitions.
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)];
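For example, in the Hive CLI (a sketch with made-up partition values; the trailing Fetched count is the number you're after):
hive> SHOW PARTITIONS mydb.mytable;
dt=20180101
dt=20180102
...
Time taken: 0.145 seconds, Fetched: 12 row(s)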
You can use the WebHCat interface to get information like this. This has the benefit that you can run the command from anywhere that the server is accessible. The result is JSON - use a JSON parser of your choice to process the results.
In this example of piping the WebHCat results to Python, only the number 24 is returned representing the number of partitions for this table. (Server name is the name node).
curl -s 'http://*myservername*:50111/templeton/v1/ddl/database/*mydatabasename*/table/*mytablename*/partition?user.name=*myusername*' | python -c 'import sys, json; print len(json.load(sys.stdin)["partitions"])'
24
In Scala (Spark) you can do the following:
sql("show partitions <table_name>").count()
I used the following:
beeline -silent --showHeader=false --outputformat=csv2 -e 'show partitions <dbname>.<tablename>' | wc -l
Use the following syntax:
show create table <table name>;