I want to add a prefix to some hive tables, something like the following:
alter table sales_info rename to archived_sales_info;
except there's some 200 tables and I'd rather not do them all by hand. Is there any way to do this either via hive itself or perhaps a bash script?
You can create a shell script like the one below:
#!/bin/bash
hive -S -e "show tables" > table_list.txt
while read -r line
do
  hive -S -e "alter table $line rename to archived_$line;"
  echo "$line"
done < table_list.txt
Before:
> show tables;
OK
t1
t2
Time taken: 0.016 seconds, Fetched: 2 row(s)
After executing the script:
> show tables;
OK
archived_t1
archived_t2
Time taken: 0.016 seconds, Fetched: 2 row(s)
I added the echo in the loop so that you can keep track of which tables have been changed; you can also redirect it to a file, like echo "$line" >> changed.txt.
You can modify the code as per your requirements, but it should solve your purpose without any changes.
As this is almost certainly a one-time thing, consider the following flow:
List tables
Copy to your favorite editor, for instance Excel
Use the list of tables to create a list of alter statements
Of course you can also run a script, but this is the fastest approach I could think of.
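If you prefer to skip the editor step, a shell one-liner can generate the statements for you. A minimal sketch, assuming the default database and using rename.hql as a hypothetical scratch file name:
# Generate the ALTER statements from the table list, review the file, then run it:
hive -S -e "show tables" | awk '{print "ALTER TABLE " $1 " RENAME TO archived_" $1 ";"}' > rename.hql
hive -f rename.hql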
I need to retain, say, the last 7 partitions and data of a given Hive external table.
This can be done either via a shell script or a Hive HQL script.
The table is partitioned by intgestion_date=YYYY-MM-DD.
What would be the best way to find the cutoff date (of the 7th partition), which I can then use in the drop-partitions where clause to drop everything older than that?
Since it's an external table, I will have to change the table properties to make it internal before the drop and then revert it.
There are different possible approaches. Dropping all partitions older than 7 days is easy (shell):
hive -e "ALTER TABLE mytable DROP IF EXISTS PARTITION(intgestion_date < '$(date -d "7 days ago" '+%Y-%m-%d')')"
But it seems this is not exactly what you want: you need to get the 7th partition first and use it in the previous statement. Execute show partitions, then use sort, head and tail to get the 7th partition:
seventh_partition=$(hive -S -e "show partitions table_name" | sort -r | head -n 7 | tail -n 1)
# Extract the value after the '='
part_value=${seventh_partition#*=}
# Drop everything older than the 7th partition. Replace hive -e with echo first to check what it prints
hive -e "ALTER TABLE table_name DROP IF EXISTS PARTITION(intgestion_date < '$part_value')"
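As the question notes, dropping partitions of an external table removes only metadata, not data. A sketch of the flip described in the question, using Hive's TBLPROPERTIES toggle (table_name is still a placeholder):
# Temporarily mark the table as managed so the drop deletes data, then revert
hive -e "ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE')"
hive -e "ALTER TABLE table_name DROP IF EXISTS PARTITION(intgestion_date < '$part_value')"
hive -e "ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE')"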
I have a .csv file with the following sample data format:
REFID|PARENTID|QTY|DESCRIPTION|DATE
AA01|1234|1|1st item|null
AA02|12345|2|2nd item|null
AA03|12345|3|3rd item|null
AA04|12345|4|4th item|null
To load the above file into a table, I am using the BCP command below:
/bcp $TABLE_NAME in $FILE_NAME -S $DB_SERVER -t "|" -F 1 -U $DB_USERNAME -d $DB_NAME
What I am trying to get is the result below (with sysdate added by bcp instead of null):
AA01|1234|1|1st item|3/16/2020
AA02|12345|2|2nd item|3/16/2020
AA03|12345|3|3rd item|3/16/2020
AA04|12345|4|4th item|3/16/2020
Update: I was able to exclude the header with @Jamie's answer via the -F 1 option, but I'm still looking for some help on inserting the date with bcp. I tried looking through some old Q&As, but no luck so far.
To exclude a single header record, you can use the -F option, which tells BCP which line in the file is the first line to begin loading from. For your sample, -F2 should work fine. However, your command has other issues; see the comments.
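For illustration, a hedged version of the corrected load command, assuming the same environment variables as the question and character-format data (the file is pipe-delimited text):
# -F2 starts loading at line 2, skipping the header; -c selects character format
bcp "$TABLE_NAME" in "$FILE_NAME" -S "$DB_SERVER" -c -t "|" -F2 -U "$DB_USERNAME" -d "$DB_NAME"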
There is no way to introduce new data using the BCP command as you stated: BCP cannot introduce a date value while copying data into your table. To accomplish this, I suggest either a default on your date column, or first loading the raw data into a table without the date column and then introducing the date value as you see fit in later processing.
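A minimal sketch of the staging approach. The names here are hypothetical: $STAGE_TABLE stands for a staging table (given a plain-text DATE column so the literal "null" loads without a format file), and sqlcmd stands in for whatever SQL client you use:
# 1) Bulk-load the raw file into the staging table
bcp "$STAGE_TABLE" in "$FILE_NAME" -S "$DB_SERVER" -c -t "|" -F2 -U "$DB_USERNAME" -d "$DB_NAME"
# 2) Copy into the real table, swapping the "null" text for today's date
sqlcmd -S "$DB_SERVER" -d "$DB_NAME" -U "$DB_USERNAME" -Q \
  "INSERT INTO $TABLE_NAME (REFID, PARENTID, QTY, DESCRIPTION, [DATE])
   SELECT REFID, PARENTID, QTY, DESCRIPTION, GETDATE() FROM $STAGE_TABLE;"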
I have an external hive table on top of a parquet file.
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
I want to get the count of the table using a shell script.
I tried the following command:
myVar=$(hive -S -e "select count(*) from parquet_test;")
echo $myVar
I added -S to run hive in silent mode, but I still get the whole MapReduce log along with the count in the myVar variable. How can I get only the count?
I don't have access to any of the configuration files to enable or disable the logging level. Is there any other way?
Finally found a workaround.
First flush the query result into a file on the local filesystem, then read the answer from that file.
The file contains only the result of the query.
hive -S -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/test/result/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select count(*) from parquet_test;"
Then read the file into a variable:
var=$(cat /home/test/result/*)
echo "$var"
Thank you
myVar=$(hive -S -e 'select count(*) from parquet_test;')
echo $myVar
I started an EC2 cluster on Amazon to install Cloudera. I got it installed and configured, and I loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named as follows, with the date and then the time embedded in the name:
pagecounts-20090430-230000.gz
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The dir /user/hdfs contains multiple pagecount files associated with different dates and times. How can one pull that information and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp (in the name) of each file. I was able to work this out in Hive; I hope you can do the needful (if any) for Impala, as the query syntax is the same.
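For instance, a sketch of the payoff (using the dts partition column and timestamp format from the script below): once the table is partitioned, a query can filter on a file's timestamp directly.
hive -e "SELECT page_name, page_views FROM wikiPgvws WHERE dts = '20090430-230000'"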
For me, this problem was not solvable using Hive alone, so I mixed bash with Hive scripting, and it works fine for me. This is how I wrapped it up:
Create table wikiPgvws with a partition
Create table wikiTmp with the same fields as wikiPgvws, except for the partition
For each file:
i. Load the data into wikiTmp
ii. grep the timestamp from the file name
iii. Use sed to replace the placeholders in a predefined hql script that loads the data into the actual table, then run it
Drop table wikiTmp & remove tmp.hql
The script is as follows:
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
echo "currentFile: $fileName"
dst=$(echo "$fileName" | grep -oE '[0-9]{8}-[0-9]{6}')
echo "currentStamp $dst"
sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
sed -i "s!targetPartition!$dst!" tmp.hql
hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The hql script consists of just two lines:
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them, this script is of no use to you. Also, the grep commands that fetch the fileName and timestamp need to be modified according to your table location and stamp pattern. It's just one way to show how the job can be done; I couldn't find another.
Enhancement
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. I'm not sure whether hql script arguments can be used to define partition values, but you could have a look at that as a replacement for sed.
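One candidate worth checking (an untested sketch): Hive's --hivevar substitution, available since Hive 0.8, could replace the sed step entirely; t.hql would then reference '${hivevar:sourceFile}' and '${hivevar:targetPartition}' instead of the bare placeholders:
# Pass the per-file values on the command line instead of rewriting the script
hive --hivevar sourceFile="$fileName" --hivevar targetPartition="$dst" -f t.hql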
Using a Unix shell script, I am doing a bcp out from a table on Server1 in NATIVE format to a file, XXXX.bcpdat, then a bcp in of that file to a table of the same structure on Server2.
The bcp commands we have are:
bcp "$dbname".."$tablename" out XXXX.bcpdat -n
bcp "$dbname".."$tablename" in XXXX.bcpdat -n -b10000
This bcp out and bcp in work as expected from/into the tables.
But I want to make an urgent change here:
I want to get the total number of rows (a row may have 120 or 30 or 40 records) in the bcp data file (XXXX.bcpdat).
But with the file in native format, I couldn't differentiate each row or see how rows are separated. If I run head -10 XXXX.bcpdat or tail -10 XXXX.bcpdat, it prints everything in the file. wc -l, awk, and cut are not helping me get the count of rows from the file. There is nothing marking where a row ends, as there is in a character-format bcp load. It would be great if someone could help me find how to get the total number of rows (not records) in the bcpdat file. Thanks a lot in advance.