sqoop import using exclude tables - sqoop

I have 100 tables in a MySQL database. Using sqoop import-all-tables I want to import only 50 of them to HDFS. With the exclude option, do we have to list the other 50 tables, or is there another way?

Yes, you are right. You can use Sqoop import-all-tables along with another parameter, --exclude-tables, with which you can exclude the tables in the database that you don't want to import.
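For example, a minimal sketch (the connection string, credentials, and excluded table names here are placeholders):
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password mypass \
  --warehouse-dir /user/hive/warehouse \
  --exclude-tables tbl51,tbl52,tbl53
Note that --exclude-tables takes a comma-separated list, so the 50 unwanted tables do have to be named, but only once and in one place.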
Another option you can try is a shell script, as below:
1) Prepare an input file that has a list of DBNAME.TABLENAME entries (a sample is shown after the script).
2) The shell script takes this file as input, iterates over it line by line, and executes a sqoop statement for each line:
while read line; do
  # split each DBNAME.TABLENAME entry into its two parts
  DBNAME=`echo $line | cut -d'.' -f1`
  tableName=`echo $line | cut -d'.' -f2`
  # double quotes (not single) so $JDBC_URL, $DBNAME, $USERNAME and $PASSWORD are expanded
  sqoop import -Dmapreduce.job.queuename=$QUEUE_NAME --connect "$JDBC_URL;databaseName=$DBNAME;username=$USERNAME;password=$PASSWORD" --table $tableName --target-dir $DATA_COLLECTOR/$tableName --fields-terminated-by '\001' -m 1
done < inputFile
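For illustration, inputFile might look like this (the database and table names are hypothetical):
salesdb.customers
salesdb.orders
hrdb.employees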

Related

Shell Hive - Identify a column in entire Hive Database

The below code does not enter the loop if I change the first line to hive -S -e 'show databases like 'abc_xyz%''.
Can you please help me fix this issue?
hive -S -e 'show databases' |
while read database
do
  eval "hive -S -e 'show tables in $database'" |
  while read line
  do
    if eval "hive -S -e 'describe $database.$line'" | grep -q "<column_name>"; then
      output="Required table name: $database.$line"'\n'
    else
      output=""'\n'
    fi
    echo -e "$output"
  done
done
For Hive < 4.0.0, wildcards in the show databases command pattern can only be '*' for any character(s) or '|' for a choice.
For example, like this:
show databases like 'abc_xyz*|bcd_xyz*'
The SQL-style patterns, '%' for any character(s) and '_' for a single character, only work since Hive 4.0.0.
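So, assuming the asker's database naming, the first line of the script above becomes (double quotes on the outside also avoid the nested single-quote problem in the original):
hive -S -e "show databases like 'abc_xyz*'" |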

shell read and pass multiple variables in for or while loop

Below is my input file and the code I am using
FILE :
cat $TESTFILE
2020-01-13,COST_CH_RPT
2018-04-19,LOSS_CH_RPT
CODE :
for i in `cat $TESTFILE`
do
export date=`cat $TESTFILE|cut -d',' -f1`
echo date=$date
export Name=`cat $TESTFILE|cut -d',' -f2`
echo Name=$Name
beeline --outputformat=csv2 --hivevar Name=$Name --hivevar date=$date -u ${beeURL} -f TEST.hql
done
The objective is to run the hql file for every line in the file. The above code does run twice for the two lines available, but the variables being passed are the same for both runs, taken from the first line of the file.
How can I differentiate the input variables for each run?
As noted in a comment, the current solution re-reads the whole TESTFILE inside the loop body, which is why every run gets the same values. A simpler alternative is to use read to loop through the lines:
while IFS=, read date Name ; do
  echo beeline --outputformat=csv2 --hivevar Name="$Name" --hivevar date="$date" -u "${beeURL}" -f TEST.hql
done < "$TESTFILE"
It simply iterates over the lines in the TESTFILE and prints the beeline command for each one (drop the echo once the output looks right, so the command actually executes). Quotes are suggested to protect against errors in the input file - in particular spaces, which would otherwise 'break' the command line.
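A slightly hardened sketch of the same loop; the -r flag and the blank-line guard are additions beyond the original answer:
while IFS=, read -r date Name; do
  # skip blank lines in the input file
  [ -z "$date" ] && continue
  beeline --outputformat=csv2 --hivevar Name="$Name" --hivevar date="$date" -u "${beeURL}" -f TEST.hql
done < "$TESTFILE"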

Dropping multiple tables with same prefix in Hive

I have a few tables in Hive that have the same prefix, like below:
temp_table_name
temp_table_add
temp_table_area
There are a few hundred tables like this in my database, along with many other tables.
I want to delete the tables whose names start with "temp_table".
Does any of you know a query that can do this work in Hive?
There is no such thing as regular expressions for a drop query in Hive (or at least I didn't find them). But there are multiple ways to do it, for example:
With a shell script :
hive -e "show tables 'temp_*'" | xargs -I '{}' hive -e 'drop table {}'
Or by putting your tables in a specific database and dropping the whole database.
CREATE TABLE temp.table_name ...;
DROP DATABASE temp CASCADE;
The above solutions are good, but if you have many tables to delete, then running 'hive -e drop table' once per table is slow. So I used this:
hive -e 'use db;show tables' | grep pattern > file.hql
Open file.hql in the vim editor and run the two commands below (the closing '!' delimiter makes the trailing space after 'drop table' explicit; without that space the table name would be glued to the keyword):
:%s!^!drop table !
:%s!$!;!
then run
hive -f file.hql
This approach will be much faster, since all the drop statements run in a single Hive session.
My solution has been to use a bash script with the following commands:
hive -e "SHOW TABLES IN db LIKE 'schema*';" | grep "schema" | sed -e 's/^/hive -e \"DROP TABLE db\./1' | sed -e 's/$/\"/1' > script.sh
chmod +x script.sh
./script.sh
I was able to delete all the tables using the following steps in Apache Spark with Scala:
// to drop only the tables matching a pattern
val df = sql("SHOW TABLES IN default LIKE 'invoice*'").select("tableName")
// or, to drop every table in the database:
// val df = sql("SHOW TABLES IN default").select("tableName")
val tableNameList: List[String] = df.as[String].collect().toList
val df2 = tableNameList.map(tableName => sql(s"drop table ${tableName}"))
As I had a lot of tables to drop, I used the following commands, inspired by the #HorusH answer:
hive -e "show tables 'table_prefix*'" | sed -e 's/^/DROP TABLE db_name./' | sed -e 's/$/;/' > script.hql
hive -f script.hql
Try this:
hive -e 'use sample_db;show tables' | xargs -I '{}' hive -e 'use sample_db;drop table {}'
The below command will also work.
hive -e 'show tables' | grep table_prefix | while read line; do hive -e "drop table $line"; done
The fastest solution is a single shell script that takes the table-name pattern as its argument:
drop_tables.sh pattern
Shell script content:
hive -e 'use db;show tables' | grep "$1" | sed 's/^/drop table db./' | sed 's/$/;/' > temp.hql
hive -f temp.hql
rm temp.hql

How to get all table definitions in a database in Hive?

I am looking to get all the table definitions in a Hive database. I know that for a single table definition I can use something like -
describe <<table_name>>
describe extended <<table_name>>
But I couldn't find a way to get all table definitions at once. Is there any table in the metastore similar to INFORMATION_SCHEMA in MySQL, or is there a command to get all table definitions?
You can do this by writing a simple bash script and some bash commands.
First, write all table names in a database to a text file using:
$hive -e 'show tables in <dbname>' | tee tables.txt
Then create a bash script (describe_tables.sh) to loop over each table in this list:
while read line
do
  echo "$line"
  eval "hive -e 'describe <dbname>.$line'"
done
Then execute the script:
$chmod +x describe_tables.sh
$./describe_tables.sh < tables.txt > definitions.txt
The definitions.txt file will contain all the table definitions.
The above process works; however, it will be slow because a new Hive connection is made for each query. Instead, you can do what I did for the same need, below.
Use one of the above methods to get your list of tables.
Then modify the list so it becomes a describe query for each table, as follows:
describe my_table_01;
describe my_TABLE_02;
So you will have a flat file with all your describe statements, one per line. Say this flat file is called my_table_description.hql; a one-liner to generate it is sketched below.
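For instance, assuming the tables.txt produced in the first answer above, a one-liner like this builds that file:
sed 's/^/describe /; s/$/;/' tables.txt > my_table_description.hql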
Get the output in one go as follows:
hive -f my_table_description.hql > my_table_description.output
It is super fast and gets the output in one shot.
Fetch the list of Hive databases:
hive -e 'show databases' > hive_databases.txt
Echo each table's desc:
cat hive_databases.txt | grep -v '^$' | while read LINE;
do
  echo "## TableName:" $LINE
  # list the tables of each database, dropping blank lines, log noise and the header
  eval "hive -e 'show tables in $LINE' | grep -v ^$ | grep -v Logging | grep -v tab_name | tee $LINE.tables.txt"
  cat $LINE.tables.txt | while read table
  do
    echo "### $LINE.$table" > $LINE.$table.desc.md
    eval "hive -e 'describe $LINE.$table'" >> $LINE.$table.desc.md
    # turn the tab-separated describe output into markdown table rows
    sed -i 's/\t/|/g' ./$LINE.$table.desc.md
    sed -i 's/comment/comment\n|:--:|:--:|:--:|/g' ./$LINE.$table.desc.md
  done
done

importing data from multiple databases using sqoop

I want to import certain tables from multiple SQL Server databases (100+) into HDFS using sqoop. Can someone guide me on how to do it? An automated script would do well.
This can be done with a shell script.
1) Prepare an input file that has a list of DBNAME.TABLENAME entries.
2) The shell script takes this file as input, iterates over it line by line, and executes a sqoop statement for each line:
while read line; do
  # split each DBNAME.TABLENAME entry into its two parts
  DBNAME=`echo $line | cut -d'.' -f1`
  tableName=`echo $line | cut -d'.' -f2`
  # double quotes (not single) so $JDBC_URL, $DBNAME, $USERNAME and $PASSWORD are expanded
  sqoop import -Dmapreduce.job.queuename=$RM_QUEUE_NAME --connect "$JDBC_URL;databaseName=$DBNAME;username=$USERNAME;password=$PASSWORD" --table $tableName --target-dir $DATA_COLLECTOR/$tableName --fields-terminated-by '\001' -m 1
done < inputFile
