Sqoop incremental export using HCatalog? - hadoop

Is there a way to use Sqoop to do incremental exports? I am using HCatalog integration for Sqoop. I tried using the --last-value and --check-column options that are used for incremental import, but Sqoop gave me an error saying the options were invalid.

I have not seen incremental arguments for sqoop export. An alternative you could try is to create a control_table in Hive where you keep a log of the table name and the timestamp each time it is exported.
create table if not exists control_table (
table_name string,
export_date timestamp
);
insert into table control_table select 'export_table1' as table_name, from_unixtime(unix_timestamp()) as export_date;
Here export_table1 is the table you want to export incrementally, assuming you have already executed the two statements above.
-- execute the statements below as one batch
-- get the timestamp when the table was last exported
create temporary table control_table_now as select table_name, max(export_date) as last_export_date from control_table group by table_name;
--get incremental rows
create table new_export_table1 as select field1, field2, field3, .... timestamp1 from export_table1 e, control_table_now c where c.table_name = 'export_table1' and e.timestamp1 >= c.last_export_date;
-- append the control_table for the next run
insert into table control_table select 'export_table1' as table_name, from_unixtime(unix_timestamp()) as export_date;
Now export the new_export_table1 table, which contains only the incremental rows, using the sqoop export command.
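For reference, a minimal sketch of that final step could look like the following; the JDBC URL, credentials, target table, and database names are placeholders for your environment:
sqoop export \
--connect jdbc:mysql://dbhost:3306/targetdb \
--username user --password pwd \
--table export_table1 \
--hcatalog-database default \
--hcatalog-table new_export_table1
After a successful export you could drop new_export_table1 so the next run can recreate it from the updated control_table timestamp.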

By default Sqoop does not support incremental mode with HCatalog integration; when we try it, Sqoop gives the following error:
Append mode for imports is not compatible with HCatalog. Please remove the parameter--append-mode
at org.apache.sqoop.tool.BaseSqoopTool.validateHCatalogOptions(BaseSqoopTool.java:1561)
You can use the --query option to make it work, as described in this Hortonworks document.
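For example, an HCatalog import driven by a free-form query that filters on a check column might look something like this; the connection details, table, column, and timestamp are made up, and $CONDITIONS is required by Sqoop whenever --query is used:
sqoop import \
--connect jdbc:mysql://dbhost:3306/sourcedb \
--username user --password pwd \
--query 'select * from source_table where updated_at > "2017-01-01 00:00:00" and $CONDITIONS' \
--split-by id \
--hcatalog-database default \
--hcatalog-table source_table
You would then manage the last-imported timestamp yourself, for instance with a control table like the one described above.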

Related

Hive one line command to catch SCHEMA + TABLE NAME info

Is there a way to catch all schema + table name info in a single command through Hive in a similar way to
SELECT * FROM information_schema.tables
from the PostgreSQL world?
Combining show databases and show tables in a loop [here an example] is one answer, but I'm looking for a more compact way to get the same result in a single command.
It's been a while since I worked on Hive queries, but as far as I remember you can use
hive> desc formatted tableName;
or
hive> describe formatted tableName;
It will give you all the relevant information about the table, such as the schema, partition info, and table type (managed table, etc.).
I am not sure if this is exactly what you are looking for.
Another way to query Hive tables is to write Hive scripts, which can be called from the Hadoop terminal rather than from the Hive terminal itself.
std]$ cat sample.hql or vi sample.hql
use dbName;
select * from tableName;
desc formatted tableName;
# this hql script can be called from outside the hive terminal
std]$ hive -f sample.hql
Or, without even having to write a script file, you can query Hive as:
std]$ hive -e "use dbName; select * from emp;" > text.txt   (or use >> to append)
At the database level, you can query as:
hive> use dbName;
hive> set hive.cli.print.current.db=true;
hive(dbName)> describe database dbName;
It will bring the metadata about the database from MySQL (the metastore).
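Alternatively, if you can reach the metastore database directly (MySQL in many setups), all database/table pairs can be pulled in a single query against the metastore schema. This is a sketch run in MySQL, not in Hive, and the metastore table names can vary slightly between versions:
select d.NAME as db_name, t.TBL_NAME as table_name
from DBS d
join TBLS t on t.DB_ID = d.DB_ID
order by d.NAME, t.TBL_NAME;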

Sqoop export of a hive table partitioned on an int column

I have a Hive table partitioned on an 'int' column.
I want to export the Hive table to MySql using Sqoop export tool.
sqoop export --connect jdbc:mysql://XXXX:3306/temp --username root --password root --table emp --hcatalog-database temp --hcatalog-table emp
I tried the above sqoop command but it failed with below exception.
ERROR tool.ExportTool: Encountered IOException running export job: java.io.IOException:
The table provided temp.emp uses unsupported partitioning key type for column mth_id : int.
Only string fields are allowed in partition columns in HCatalog
I understand that the partition on int column is not supported.
But would like to check whether this issue is fixed in any of the latest releases with an extra config/option.
As a workaround, I can create another table without a partition before exporting, but I would like to check whether there is a better way to achieve this.
Thanks in advance.
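As a rough sketch of the workaround mentioned in the question (table names are examples only), you could stage the partitioned table into an unpartitioned copy, since select * in a CTAS turns the partition column into a regular column, and export that copy instead:
create table emp_export as select * from emp;
sqoop export --connect jdbc:mysql://XXXX:3306/temp --username root --password root --table emp --hcatalog-database temp --hcatalog-table emp_export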

Sqoop export from hive to oracle with different col names, number of columns and order of columns

The scenario is this: I have a Hive table with 10 columns. I want to export the data from my Hive table to an Oracle table using Sqoop.
But the target Oracle table has 30 columns with different names than the Hive table columns. Also, the column positions in the Oracle table are not the same as in the Hive table.
Can anyone please suggest how can I write the Sqoop export command for this case?
Try the below. It is assumed that your Hive table is created as an external table, your data is located at /myhivetable/data/, fields are terminated by |, and lines are terminated by '\n'.
In your RDBMS table, the 20 columns that are not going to be populated from Hive/HDFS should have default values or allow null values.
Let us suppose your database columns are DC1, DC2, ..., DC20, your Hive columns are c1, c2, ..., c10, and your mapping is as below.
DC1 -- c8
DC2 -- c1
DC3 -- c2
DC4 -- c4
DC5 -- c3
DC6 -- c7
DC7 -- c10
DC8 -- c9
DC9 -- c5
DC10 -- c6
sqoop export \
--connect jdbc:postgresql://10.10.11.11:1234/db \
--table table1 \
--username user \
--password pwd \
--export-dir /myhivetable/data/ \
--columns "DC2,DC3,DC5,DC4,DC9,DC10,DC6,DC1,DC8,DC7" \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--input-null-string "\\\\N" \
--input-null-non-string "\\\\N"
First of all, you can't export data directly from Hive to Oracle.
You need to EXPORT the Hive table to HDFS.
sample command:
export table mytable to 'some_hdfs_location'
Or use HDFS data location of your hive table.
command to check the location
show create table mytable
So now you have the location of the data for your Hive table.
You can use the --columns option in the Sqoop export command to choose the column order and number.
There is no problem with different column names.
Let me take a simple example.
Say you have a Hive table with columns c1, c2, c3
and an Oracle table with columns col1, col2, col3, col4, col5.
I want to map c1 with col2, c2 with col5, c3 with col1.
I will use --columns "col2,col5,col1" in my sqoop command.
As per Sqoop docs,
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns "col1,col2,col3". Note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail.
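Putting that together, a sketch of the export for the c1 -> col2, c2 -> col5, c3 -> col1 mapping might look like this. Everything except --columns is a placeholder: the JDBC URL, credentials, table name, and export directory are examples, and '\001' is Hive's default field delimiter (adjust it if your table uses a different row format):
sqoop export \
--connect jdbc:oracle:thin:@dbhost:1521:orcl \
--username user --password pwd \
--table oracle_table \
--export-dir /user/hive/warehouse/mydb.db/hive_table \
--columns "col2,col5,col1" \
--input-fields-terminated-by '\001'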
There are 2 options:
First, as of now sqoop export is rather limited (I suspect because export is not its main intended use case); it only lets you specify an --export-dir, which is the table's warehouse directory, and it loads all columns. So you may need to load into a staging table and then load that into the original base table with the relevant column mapping.
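A sketch of that staging approach on the Oracle side, using the example column names from above (the staging table stg_hive_table and target_table are made-up names, with the staging table assumed to be loaded by a plain sqoop export of all the Hive columns):
insert into target_table (col2, col5, col1)
select c1, c2, c3 from stg_hive_table;
commit;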
Second, you can export the data from Hive using:
INSERT OVERWRITE DIRECTORY '/user/hive_exp/orders' select column1, column2 from hivetable;
Then use Oracle's native import tool. This gives more flexibility.
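For the second option, a minimal SQL*Loader sketch might look like the following, assuming you have copied the exported files out of HDFS (e.g. with hadoop fs -get) and that the directory was written with Hive's default \001 field delimiter; the file, table, and column names are examples:
-- orders.ctl
LOAD DATA
INFILE '000000_0'
APPEND INTO TABLE orders
FIELDS TERMINATED BY X'01'
(column1, column2)
Run it with something like: sqlldr userid=user/pwd@orcl control=orders.ctl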
Please update if you have a better solution.

how to append hadoop job id to hive query result file?

I have a Hive query that does an insert overwrite to the local file system. My query is the following:
insert overwrite local directory '/home/test/dds'
select col1, col2 from test_table where query_ymd='2011-05-15' or query_ymd='2011-05-16' or query_ymd='2011-05-17';
It generates 2 files:
.000000_0.crc
000000_0
I would like the output to be:
attempt_201303210330_19069_r_000000_0
attempt_201303210330_19069_r_000000_0.crc
How can I config the hive server or query?
One HQL query can launch several jobs, not just one, so you cannot do this.

Hive: writing column headers to local file?

The Hive documentation is lacking again:
I'd like to write the results of a query to a local file as well as the names of the columns.
Does Hive support this?
Insert overwrite local directory 'tmp/blah.blah' select * from table_name;
Also, a separate question: is Stack Overflow the best place to get Hive help? #Nija has been very helpful, but I don't want to keep bothering them...
Try
set hive.cli.print.header=true;
Yes you can. Put set hive.cli.print.header=true; in a .hiverc file in your home directory or in any of the other Hive user properties files.
Vague Warning: be careful, since this has crashed queries of mine in the past (but I can't remember the reason).
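With that setting in place, one way to get the header and the data into a local file is simply to redirect the CLI output instead of using insert overwrite local directory. A sketch, where the table name and output path are examples and the columns come out tab-separated:
hive -e 'set hive.cli.print.header=true; select * from table_name;' > /tmp/table_name.tsv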
Indeed, #nija's answer is correct - at least as far as I know. There isn't any way to write the column names when doing an insert overwrite [local] directory ... (whether you use local or not).
With regards to the crashes described by #user1735861, there is a known bug in hive 0.7.1 (fixed in 0.8.0) that, after doing set hive.cli.print.header=true;, causes a NullPointerException for any HQL command/query that produces no output. For example:
$ hive -S
hive> use default;
hive> set hive.cli.print.header=true;
hive> use default;
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:222)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:287)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Whereas this is fine:
$ hive -S
hive> set hive.cli.print.header=true;
hive> select * from dual;
c
c
hive>
Non-HQL commands are fine though (set, dfs, !, etc.).
More info here: https://issues.apache.org/jira/browse/HIVE-2334
Hive does support writing to the local directory. Your syntax looks right for it as well.
Check out the docs on SELECTS and FILTERS for additional information.
I don't think Hive has a way to write the names of the columns to a file for the query you're running . . . I can't say for sure it doesn't, but I do not know of a way.
I think the only place better than SO for Hive questions would be the mailing list.
I ran into this problem today and was able to get what I needed by doing a UNION ALL between the original query and a new dummy query that creates the header row. I added a sort column on each section and set the header to 0 and the data to a 1 so I could sort by that field and ensure the header row came out on top.
create table new_table as
select
field1,
field2,
field3
from
(
select
0 as sort_col, --header row gets lowest number
'field1_name' as field1,
'field2_name' as field2,
'field3_name' as field3
from
some_small_table --table needs at least 1 row
limit 1 --only need 1 header row
union all
select
1 as sort_col, --original query goes here
field1,
field2,
field3
from
main_table
) a
order by
sort_col --make sure header row is first
It's a little bulky, but at least you can get what you need with a single query.
Hope this helps!
Not a great solution, but here is what I do:
create table test_dat
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/tmp/test_dat' as select * from YOUR_TABLE;
hive -e 'set hive.cli.print.header=true;select * from YOUR_TABLE limit 0' > /tmp/test_dat/header.txt
cat header.txt 000* > all.dat
Here's my take on it. Note, I'm not very well versed in bash, so improvement suggestions are welcome :)
#!/usr/bin/env bash
# works like this:
# ./get_data.sh database.table > data.csv
INPUT=$1
TABLE=${INPUT##*.}
DB=${INPUT%.*}
HEADER=`hive -e "
set hive.cli.print.header=true;
use $DB;
INSERT OVERWRITE LOCAL DIRECTORY '$TABLE'
row format delimited
fields terminated by ','
SELECT * FROM $TABLE;"`
HEADER_WITHOUT_TABLE_NAME=${HEADER//$TABLE./}
echo ${HEADER_WITHOUT_TABLE_NAME//[[:space:]]/,}
cat $TABLE/*
