sqoop export commands for the data which has spaces before in hdfs - hadoop

i have data which has stored in hdfs, the data has space before and after of the value, when i try to export to mysql, it gives numberformat exception but when i create data without space, it has inserted into mysql successfully.
my question is can't we export the data which has space from hdfs to mysql usong sqoop export command?
The data which i used
1201, adi, sen manager, 30000, it
1201, pavan, jun manager, 5000, cs
1203, santhosh, junior, 60000, mech
i created table like
create table emp(id BIGINT,name varchar(20),desg varchar(20),salary BIGINT,dept varchar(20));
sqoop command -- sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp \
--m 1 \
--export-dir /mydir \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'
result: numberformatexception input string:'1201'
can't parse the data
i discussed in forum, they said trim the space but i wants to know that automatically trim the spaces while perform sqoop export.
can somebody give suggestions on this?

You can do 1 simple thing:
Create temporary table in MySQL with all VARCHAR
create table emp-temp(id BIGINT,name varchar(20),desg varchar(20),salary BIGINT,dept varchar(20));
Now create another with numeric fields after TRIM() and CAST()
create table emp as select CAST(TRIM(id) AS UNSIGNED), name, desg, CAST(TRIM(salary) AS UNSIGNED), dept FROM emp_temp;

Sqoop internally runs MapReduce job.
Simple solution is to run a Mapper and trim the spaces in your data and get the output in different file and run sqoop export on new file.

Related

Can Sqoop update record on Oracle RDBMS table that have different column structure with Hive table

I'm a Hadoop newcomer trying to export data from Hive to Oracle. Can Sqoop update data to Oracle table let say,
Oracle Table have column A,B,C,D,E
I stored data on Hive table as B,C,E
Can Sqoop export update(just update, not upsert) with B,C as update keys and update just the E column from Hive?
Pls mention --update-key Prim_key_col_in_table. Pls note --update-mode default is updateonly so you dont have to mention anything.
You can also add input-fields-terminated-by command if you want to.
here is a sample command -
sqoop export --connect jdbc:mysql://xxxxxx/mytable --username xxxxx --password xxxxx --table export_sqoop_mytable --update-key Prim_key_col_in_table --export-dir /user/ingenieroandresangel/datasets/mytable.txt -m 1

Sqoop Import of Data having new line character in avro format and then query using hive

My requirement is to load the data from RDBMS into HDFS (backed by CDH 5.9.X) via sqoop (1.4.6) in avro format and then use an external hive(1.1) table to query the data.
Unfortunately, the data in RDBMS has some new line characters.
We all know that hive can't parse new line character in the data and the data mapping fails when selected the whole data via hive. However, hive's select count(*) works fine.
I used below options during sqoop import and checked, but didn't work:
--hive-drop-import-delims
--hive-delims-replacement
The above options work for text format. But storing data in text format is not a viable option for me.
The above options are converted properly in the Sqoop generated (codegen) POJO class's toString method (obviously as text format is working as expected), so I feel this method is not at all used during avro import. Probably because avro has not problem dealing with new line character, where as hive has.
I am surprised, don't anyone face such a common scenario, a table which has remark, comment field is prone to this problem.
Can anyone suggest me a solution please?
My command:
sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:#Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
--P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String\
--hive-delims-replacement ' '
--null-string '\\N'
--null-non-string '\\N'
--fields-terminated-by '\001'
-m 1
This looks like an issue with avro serde. It is an open bug.
https://issues.apache.org/jira/browse/HIVE-14044.
Can you try the same in hive 2.0?
As mentioned by VJ, there is an open issue for new line character in avro.
What an alternate approach that you can try is
Sqoop the data into a hive staging table as a textfileformat.
Create an avro table.
Insert data from staging table to main avro table in hive.
As newline character is very well handled in textfileformat

Sqoop Support for Oracle Date Format

Using Sqoop I am importing from Oracle table into hdfs and Loading to Manage Table by giving the hdfs Path Location.Below is the sqoop Command
sqoop import \
--connect jdbcconnection \
--username user \
--password password \
--table EMPDETAILS \
--column "EMP_ID,EMP_NAME,EMP_DOB,EMP_DOJ" \
--target-dir hdfspath \
-m 1
This command executed successfully and when loading into hive table using hdfs location it is giving null for EMP_DOB(date type is Date)
create table EMP_TARGET(
empid int,
empname string,
empdob date,
empdoj timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
Location 'hdfspath';
When I execute the above query the empdob column in target hive is giving NULL but empdoj is giving Correct Value. When I checked the value in hdfs path for empdob it is 1980-01-01 00:00:00:0.
Kindly help to solve the issue.

Importing tables from multiple databases into Hadoop and Union

I have this specific scenario:
There are year-wise databases in SQL Server with naming like "FOOXXYY" where XXYY denotes fiscal year. Now I want to take a specific table "bar" from all these Databases, Union it into a single table in hive and store it into the HDFS.
What will be the best and fastest approach to go about it?
You need to create database, create partitioned table, add partitions, run 4 different sqoop commands to connect to each of the database and load data into partitions. Here are the sample code snippets.
Create database and then partition table like this;
CREATE TABLE `order_items`(
`order_item_id` int,
`order_item_order_id` int,
`order_item_order_date` string,
`order_item_product_id` int,
`order_item_quantity` smallint,
`order_item_subtotal` float,
`order_item_product_price` float)
PARTITIONED BY (
`order_month` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
You can then add partitions using these commands:
alter table order_items add partition (order_month=201301);
alter table order_items add partition (order_month=201302);
Once the table is created, you can run describe formatted order_items. It will give the path of the table and you can validate using dfs -ls command in hive.
From describe formatted
Location: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/retail_ods.db/order_items
dfs -ls /apps/hive/warehouse/retail_ods.db/order_items
You will get 2 directories, capture the path.
By now you have table and partitions for each of the year (for your case). Now you can use sqoop import command for each database to query from the table and copy into the respective partition.
You can find sample sqoop commands here. You can even pass the query as part of sqoop import commands (Google sqoop user guide).
sqoop import \
--connect "jdbc:mysql://sandbox.hortonworks.com:3306/retail_db" \
--username=retail_dba \
--password=hadoop \
--table order_items \
--target-dir /apps/hive/warehouse/retail_ods.db/order_items/order_month=201301 \
--append \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--outdir java_files

Sqoop creating insert statements containing multiple records

we are trying to load the data from sqoop to netezza. And we are facing the following issue.
java.io.IOException: org.netezza.error.NzSQLException: ERROR:
Example Input dataset is as shown below:
1,2,3
1,3,4
sqoop command is as shown below:
sqoop export --table <tablename> --export-dir <path>
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --connect
'jdbc:netezza://<host>/<db>' --driver org.netezza.Driver
--username <username> --password <passwrd>
The Sqoop is creating an insert statement in the following way:
insert into (c1,c2,c3) values (1,2,3),(1,3,4).
We are able to load one record but when we try to load the data to multiple records, the error is as said above.
Your help is highly appreciated.
Making sqoop.export.records.per.statement=1 will definitely help but this will make the export process extremely slow if your export record count is very large say "5 Million".
To solve this you need add following things:
1.) A properties file sqoop.properties, it must contain this property jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED (It avoids deadlock during exports)
also in the export command you need to specify this:
--connection-param-file /path/to/sqoop.properties
2.) Also sqoop.export.records.per.statement=100, making this will increase the speed of export.
3.) Third you have to add --batch, Use batch mode for underlying statement execution.
So you final export will look like this,
sqoop export -D sqoop.export.records.per.statement=100 --table <tablename> --export-dir <path>
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --connect
'jdbc:netezza://<host>/<db>' --driver org.netezza.Driver
--username <username> --password <passwrd>
--connection-param-file /path/to/sqoop.properties
--batch
Hope this will help.
You can customise the number of rows that will be used in one insert statement with property "sqoop.export.records.per.statement". For example for Netezza you will need to set it to 1:
sqoop export -Dsqoop.export.records.per.statement=1 --connect ...
I would recommend you to also take a look on Apache Sqoop Cookbook where this and many other tips are described.

Resources