Sqoop export of specific columns from HDFS to MySQL is not working properly - sqoop

My HDFS file contains 5 columns.
emp_no,birth_date,first_name,last_name,hire_date
I want to export it with only 3 columns:
emp_no,first_name,last_name
I am doing it with
sqoop export
--connect jdbc:mysql://mysql.example.com/sqoop
--username sqoop
--password sqoop
--table employees
--columns "emp_no,first_name,last_name"
--export-dir /user/dataset/employees
But I am getting emp_no, birth_date and first_name in the MySQL table.
I do get 3 columns in my table, but the HDFS column I want to skip (birth_date) is not being skipped by --columns in sqoop export.

I solved my problem. Actually, I had misunderstood the --columns option for export.
With the --columns option for export, we can select a subset of columns or control the ordering of the destination table's columns (e.g. the MySQL columns), not the HDFS columns.
This option decides how the HDFS source columns are bound to the destination columns listed in --columns.
For example, if I mention --columns "col2,col3,col1" in the sqoop command,
where col1, col2, col3 are the MySQL table's columns,
then it will bind col2 with the first column of the HDFS source, col3 with the second column of the HDFS source, and so on.
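To make that concrete, here is a hedged sketch of the binding described above (the table name mytable, its columns col1, col2, col3 and the export directory are illustrative, not from the original question; the connection details reuse the question's example). The first HDFS field goes to col2, the second to col3 and the third to col1; no HDFS field is skipped.
# mytable, col1/col2/col3 and the export dir are illustrative placeholders
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table mytable \
--columns "col2,col3,col1" \
--export-dir /user/dataset/mytable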

Related

Can Sqoop update records in an Oracle RDBMS table that has a different column structure than the Hive table

I'm a Hadoop newcomer trying to export data from Hive to Oracle. Can Sqoop update data in an Oracle table? Let's say:
the Oracle table has columns A, B, C, D, E
and I stored data in the Hive table as B, C, E.
Can Sqoop export do an update (just an update, not an upsert) with B, C as update keys and update just the E column from Hive?
Please mention --update-key Prim_key_col_in_table. Note that the default --update-mode is updateonly, so you don't have to mention it.
You can also add --input-fields-terminated-by if you want to.
Here is a sample command:
sqoop export --connect jdbc:mysql://xxxxxx/mytable --username xxxxx --password xxxxx --table export_sqoop_mytable --update-key Prim_key_col_in_table --export-dir /user/ingenieroandresangel/datasets/mytable.txt -m 1
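For the specific case in the question, a hedged sketch could look like the one below. It assumes the Hive data files contain the three columns in the order B, C, E, that the Oracle connection details, table name and export directory are placeholders, and that the data uses Hive's default '\001' field delimiter. --columns "B,C,E" tells Sqoop which table columns the three HDFS fields map to, and --update-key "B,C" makes updateonly mode update only the remaining exported column, E.
# dbhost/service, ORACLE_TABLE and the export dir are placeholders; '\001' is Hive's default delimiter
sqoop export --connect jdbc:oracle:thin:@//dbhost:1521/service --username xxxxx --password xxxxx --table ORACLE_TABLE --columns "B,C,E" --update-key "B,C" --export-dir /user/hive/warehouse/mydb.db/mytable --input-fields-terminated-by '\001' -m 1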

Incremental load using Sqoop from MySQL to Hive

I am new to Sqoop and Hive. Please help me with understanding the following.
The row counts of the MySQL table and the Hive table are different:
MySQL has 51 rows (the table has a primary key and no duplicates) and Hive has 38 rows, on the first run itself.
sqoop job --create mmod -- import --connect "jdbc:mysql://cxln2.c.thelab-240901.internal:3306/retail_db" --username sqoopuser --password-file /tmp/.mysql-pass.txt --table mod --compression-codec org.apache.hadoop.io.compress.BZip2Codec --hive-import --hive-database encry --hive-table mod2 --hive-overwrite --check-column last_update_date --incremental lastmodified --merge-key id --last-value 0 --target-dir /user/user_name/append1sqopp
It is not creating the target dir in the given location; instead it creates it in the warehouse location.
I am trying to schedule a Sqoop incremental job, and somehow I am making a mistake somewhere.
Command: the above command
2.1 New rows are added with the same date
2.2 Delete and update on a few rows
Output:
No new updates on the given table.
It is not updating the last value in the Sqoop job.
How do I choose the merge-key column in Sqoop?
Where condition in Sqoop:
--query "select * from reason where id>20 AND $CONDITIONS"
What is the use of $CONDITIONS, and do we need to pass the variable in Linux?
Is it possible to track rejected rows in a Sqoop job?

Sqoop export command for data which has leading spaces in HDFS

I have data stored in HDFS, and the data has spaces before and after the values. When I try to export it to MySQL, it gives a NumberFormatException, but when I create the data without spaces, it is inserted into MySQL successfully.
My question is: can't we export data which has spaces from HDFS to MySQL using the sqoop export command?
The data which I used:
1201, adi, sen manager, 30000, it
1201, pavan, jun manager, 5000, cs
1203, santhosh, junior, 60000, mech
I created the table like this:
create table emp(id BIGINT,name varchar(20),desg varchar(20),salary BIGINT,dept varchar(20));
Sqoop command:
sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp \
--m 1 \
--export-dir /mydir \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'
Result: NumberFormatException, input string: '1201'
It can't parse the data.
I discussed this in a forum and they said to trim the spaces, but I want to know whether the spaces can be trimmed automatically while performing the sqoop export.
Can somebody give suggestions on this?
You can do one simple thing:
Create a temporary table in MySQL with all VARCHAR columns (if the numeric columns stay BIGINT, Sqoop hits the same NumberFormatException while parsing the padded values):
create table emp_temp(id varchar(20), name varchar(20), desg varchar(20), salary varchar(20), dept varchar(20));
Now create another table with numeric fields after TRIM() and CAST():
create table emp as select CAST(TRIM(id) AS UNSIGNED) AS id, name, desg, CAST(TRIM(salary) AS UNSIGNED) AS salary, dept FROM emp_temp;
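A hedged sketch of the loading step, which is just the command from the question pointed at the staging table (same connection, directory and delimiters as above):
# emp_temp is the all-VARCHAR staging table created above
sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp_temp \
--m 1 \
--export-dir /mydir \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'
With VARCHAR target columns the padded values load as-is, and the TRIM()/CAST() step above cleans them up afterwards.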
Sqoop internally runs a MapReduce job.
Another simple solution is to run a mapper that trims the spaces in your data, write the output to a different file, and run sqoop export on the new file.
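A minimal sketch of that clean-up for a small data set, streaming it through the client instead of a real mapper (the /mydir_trimmed path is illustrative; for large data you would do the same trimming inside a MapReduce mapper):
# /mydir_trimmed is an illustrative path; strip spaces around commas and at line ends, then export the cleaned copy
hdfs dfs -mkdir -p /mydir_trimmed
hdfs dfs -cat /mydir/* | sed 's/ *, */,/g; s/^ *//; s/ *$//' | hdfs dfs -put -f - /mydir_trimmed/data.csv
sqoop export --connect jdbc:mysql://127.0.0.1/mydb --username root --table emp --m 1 --export-dir /mydir_trimmed --input-fields-terminated-by ','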

Sqoop export from Hive to Oracle with different column names, number of columns and order of columns

The scenario is: I have a Hive table with 10 columns. I want to export the data from my Hive table to an Oracle table using Sqoop.
But the target Oracle table has 30 columns with names different from the Hive table's columns. Also, the column positions in the Oracle table are not the same as in the Hive table.
Can anyone please suggest how I can write the Sqoop export command for this case?
Try the command below. It is assumed that your Hive table is created as an external table, your data is located at /myhivetable/data/, fields are terminated by '|' and lines are terminated by '\n'.
In your RDBMS table, the 20 columns which are not going to be populated from the Hive data in HDFS should have default values or allow null values.
Let us suppose your database columns are DC1, DC2, DC3, ..., DC30, your Hive columns are c1, c2, c3, ..., c10, and your mapping is as below.
DC1 -- c8
DC2 -- c1
DC3 -- c2
DC4 -- c4
DC5 -- c3
DC6 -- c7
DC7 -- c10
DC8 -- c9
DC9 -- c5
DC10 -- c6
sqoop export \
--connect jdbc:oracle:thin:@//10.10.11.11:1234/db \
--table table1 \
--username user \
--password pwd \
--export-dir /myhivetable/data/ \
--columns "DC2,DC3,DC5,DC4,DC9,DC10,DC6,DC1,DC8,DC7" \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--input-null-string "\\\\N" \
--input-null-non-string "\\\\N"
First of all, you can't export data directly from Hive to Oracle; Sqoop export works against files in HDFS.
You need to EXPORT the Hive table to HDFS.
Sample command:
export table mytable to 'some_hdfs_location'
Or use the HDFS data location of your Hive table.
Command to check the location:
show create table mytable
So now you have the location of the data for your Hive table.
You can use the --columns argument in the Sqoop export command to choose the column order and the number of columns.
There is no problem with different column names.
Let me take a simple example.
Say your Hive table has columns c1, c2, c3
and the Oracle table has col1, col2, col3, col4, col5,
and you want to map c1 to col2, c2 to col5 and c3 to col1.
Then you would use --columns "col2,col5,col1" in your sqoop command.
As per the Sqoop docs:
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns "col1,col2,col3". Note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail.
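A hedged sketch of that mapping (the connection details, table name and Hive data path are placeholders, and the field delimiter is assumed to be Hive's default '\001'; per the docs quoted above, col3 and col4 must allow NULLs or have default values):
# oraclehost/service, ORACLE_TABLE and the export dir are placeholders
sqoop export \
--connect jdbc:oracle:thin:@//oraclehost:1521/service \
--username user \
--password pwd \
--table ORACLE_TABLE \
--columns "col2,col5,col1" \
--export-dir /user/hive/warehouse/mydb.db/mytable \
--input-fields-terminated-by '\001'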
There are 2 options:
As of now, sqoop export is quite limited (presumably because this is not the primary intended direction of data movement): you can only point --export-dir at a directory of files, typically the table's warehouse directory, and it loads all of the file's columns. So you may need to load into a staging table and then load that into the original base table with the relevant column mapping.
You can export the data from Hive using:
INSERT OVERWRITE DIRECTORY '/user/hive_exp/orders' select column1, column2 from hivetable;
Then use Oracle's native import tool. This gives more flexibility.
Please update if you have a better solution.

Configuring Sqoop with MySQL?

I have successfully installed Sqoop. Now the problem is how to connect it to an RDBMS and how to load data from the RDBMS to HDFS using Sqoop.
Using Sqoop you can load data directly into Hive tables or store the data in some target directory in HDFS.
If you need to copy data from an RDBMS into some directory:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--columns col1,col2 {in case you need only specific columns}
--target-dir '/tmp/myfolder'
--boundary-query 'select min(id),max(id) from tablename'
--m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how do you want your data to look in target file}
Boundary query: this is something you can specify. If you do not specify it, then by default it is run as an inner query, which adds up to a more complex query.
If you specify it explicitly, it runs as a normal query, and hence performance is improved.
Also, you may want to restrict the number of rows, say based on a column ID; suppose you need the data for IDs 1 to 1000. Then, using the boundary query and split-by, you will be able to restrict the data you import:
--boundary-query "select 0,1000 from employee'
--split-by ID
Split-by: you use --split-by on a Sqoop import to specify the column on the basis of which the data is split across mappers. By default, if you do not specify it, Sqoop picks the table's primary key as the split-by column.
Each mapper reads its own range of the split-by column and writes its own part file under the target directory; by default the number of mappers is 4.
This may seem like a detail, but if you have a composite primary key or no primary key at all, Sqoop cannot pick a split column and the import errors out.
Note: you may not face any issue if you set the number of mappers to 1. In that case no split condition is used since there is only one mapper, so the query runs fine. This can be done using
--m 1
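Putting the boundary query and split-by together, a hedged sketch of a restricted, parallel import might look like this; it simply combines the fragments above with the sample MySQL connection string shown later in this answer (employee and ID are the illustrative table and column names already used above):
# employee, ID and the connection details are the illustrative values from this answer
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/test_database \
--username root \
--password pwd \
--table employee \
--boundary-query "select 0,1000 from employee" \
--split-by ID \
--target-dir '/tmp/myfolder' \
--fields-terminated-by ',' \
--m 5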
If you need to copy data from an RDBMS into a Hive table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query "select name from employees where name like '%s' and \$CONDITIONS"
--m 5 {set number of mappers to 5}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how do you want your data to look in target file}
You may notice the extra parameter $CONDITIONS. This is because this time you did not specify a table but an explicit query, so Sqoop cannot build its usual boundary query from a table and its primary key. The literal $CONDITIONS token is a placeholder that Sqoop replaces at run time with the condition selecting each map task's slice of the data, which is why it must appear in your WHERE clause. Note that when you import the results of a free-form query with more than one mapper, you must also supply --split-by with a column that appears in the query, or run with a single mapper (--m 1); and if you wrap the query in double quotes, escape the token as \$CONDITIONS so the shell does not expand it.
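For illustration only (the exact bounds are computed at run time, and id is just an assumed split column): with --split-by id and several mappers, each map task ends up issuing a query along these lines, with $CONDITIONS replaced by its own range:
select name from employees where name like '%s' and ( id >= 1 AND id < 201 )  -- id and the bounds are illustrative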
Checking if your connection is set up properly: for this you can just call list-databases, and if you see your databases listed, then your connection is fine.
$ sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password pwd
Connection String for Different Databases :
MYSQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//<host_name>:<port_number>/<service_name>
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive and HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
By using this command the data will be stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName --username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns from a table if you don't specify both the username and the table in correct case. Usually, specifying both in uppercase will resolve the issue.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do
