As a beginner in the Hadoop field, I was trying my hand at the Sqoop tool (version: Sqoop 1.4.6-cdh5.8.0).
I referred to various sites and forums, but I could not find a working solution for importing data with any delimiter other than ,.
Please find below the code that I have used:
--- Connecting to MySQL, creating the table, and inserting records that contain , within the string values.
mysql> create database GRHadoop;
Query OK, 1 row affected (0.00 sec)
mysql> use GRHadoop;
Database changed
mysql> Create table sitecustomer(Customerid int(10), Customername varchar(100),Productid int(4),Salary int(20));
Query OK, 0 rows affected (0.22 sec)
mysql> Insert into sitecustomer values(1,'Sohail',100,50000),(2,'Reshma',200,80000),(3,'Tom',200,60000);
Query OK, 3 rows affected (0.06 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> Insert into sitecustomer values(4,'Su,kama',300,50000),(5,'Ram,bha',100,80000),(6,'Suz',200,60000);
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0
Sqoop command:
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/GRHadoop \
--username root \
--password cloudera \
--table sitecustomer \
--input-fields-terminated-by '|' \
--lines-terminated-by "\n" \
--target-dir /user/cloudera/GR/Sqoop/sitecustomer_data \
--m 1;
Expected output:
1|Sohail|100|50000
2|Reshma|200|80000
3|Tom|200|60000
4|Su,kama|300|50000
5|Ram,bha|100|80000
6|Suz|200|60000
Actual output:
1,Sohail,100,50000
2,Reshma,200,80000
3,Tom,200,60000
4,Su,kama,300,50000
5,Ram,bha,100,80000
6,Suz,200,60000
Please guide me on where I am going wrong.
The --input-fields-terminated-by argument tells Sqoop how to parse input files during an export. You should be using --fields-terminated-by instead; that argument controls how the imported output is formatted.
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/GRHadoop \
--username root \
--password cloudera \
--table sitecustomer \
--fields-terminated-by '|' \
--lines-terminated-by "\n" \
--target-dir /user/cloudera/GR/Sqoop/sitecustomer_data \
--m 1;
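To verify the result, you can inspect the part files written to the target directory; with --fields-terminated-by '|' they should now contain the pipe-delimited records shown under "Expected output" above (a quick check, assuming the target directory from the command):
hdfs dfs -cat /user/cloudera/GR/Sqoop/sitecustomer_data/part-m-*
# expected first line: 1|Sohail|100|50000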
Related
I have a requirement:
I need to get data from a Teradata database a into Hadoop.
I have access only to a view in that database and need to pull data from that view. Since I don't have read/write access to that database, I cannot use the --split-by column option in my Sqoop import query.
So is there an option where I can tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
I need some help with Sqoop.
First of all, I'm sorry, my English isn't very good.
Using the following command:
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 --connection-manager "com.quest.oraoop.OraOopConnManager" --connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" --username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile --target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --verbose -P --m 1 --hive-table myTable
My table is already created, because I have to file a request to get a table created in my Hive database, so I can't create it dynamically from within the Sqoop command.
I have permission to create the directory in HDFS.
When I remove the directory, Sqoop logs an error saying that I have no create-table permissions, and when I create the directory beforehand, it returns a FileAlreadyExistsException.
What can I do to solve this?
Thanks from Brazil.
I am doing an incremental Sqoop import from Oracle to HDFS with a where condition like
(LST_UPD_TMST >TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD"T"HH24:MI"Z"')
AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD"T"HH24:MI"Z"'))
But it is not using the index. How can I force an index so that Sqoop can be faster by considering only the filtered records?
What is the best option for doing an incremental Sqoop import? The table size in Oracle is in TBs.
The table has billions of rows, and after the where condition it is down to a few million.
You can use --where, or --query with a where condition in the select, to filter the import results.
I am not sure about your full Sqoop command, but try it this way:
sqoop import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separeted" \
--where "(LST_UPD_TMST >TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"') AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"'))" \
--target-dir {hdfs/location/to/save/data/from/oracle} \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value {from Date/Timestamp to Sqoop in incremental}
Check the documentation for more details about Sqoop incremental loads.
Update
For incremental imports, a Sqoop saved job is recommended, as it maintains --last-value automatically.
sqoop job --create {incremental job name} \
-- import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separeted" \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value 0
Here --last-value 0 imports everything from the start on the first run; on subsequent invocations the latest
value will be passed automatically by the Sqoop job.
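A sketch of how the saved job would then be run and inspected (using the placeholder job name from the command above):
# run the incremental import; Sqoop updates the stored last value afterwards
sqoop job --exec {incremental job name}
# inspect the stored job parameters, including the updated last value
sqoop job --show {incremental job name}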
MySQL table with dept_id as the primary key:
| dept_id | dept_name |
| 2       | Fitness   |
| 3       | Footwear  |
| 4       | Apparel   |
| 5       | Golf      |
| 6       | Outdoors  |
| 7       | Fan Shop  |
Sqoop Query
sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--target-dir /user/cloudera/sqoop_import/departments;
Results in an error on the console:
When importing query results in parallel, you must specify --split-by
---Question point!---
Even though the table has a primary key and the splits could be distributed equally between 2 mappers, why is --split-by or -m 1 needed?
Please guide me on this.
Thanks.
The reason Sqoop import needs --split-by when you use --query is that when the source of the data is specified as a query, it is not possible for Sqoop to know or guess the primary key. In a query you can join two or more tables, which will have multiple keys and fields, so Sqoop cannot guess which of those keys to split by.
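For example, the command from the question should work once a splitting column is named explicitly (a sketch based on the question's command; dept_id is just the natural choice here):
sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--split-by dept_id \
--target-dir /user/cloudera/sqoop_import/departments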
It is not about the primary key versus --split-by as such; you are seeing the error because of the --query option. This option MUST be used with --split-by, --target-dir, and $CONDITIONS in the query.
From the free_form_query_imports documentation:
When importing a free-form query, you must specify a destination
directory with --target-dir.
If you want to import the results of a query in parallel, then each
map task will need to execute a copy of the query, with results
partitioned by bounding conditions inferred by Sqoop. Your query must
include the token $CONDITIONS which each Sqoop process will replace
with a unique condition expression. You must also select a splitting
column with --split-by.
You can use the --where option if you don't want to use --split-by and --query:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--where "department_id < 6"
If you use the --boundary-query option, then you don't need the --split-by and --query options:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--boundary-query "select 2, 6 from departments limit 1"
From the selecting_the_data_to_import documentation:
By default sqoop will use query select min(<split-by>),
max(<split-by>) from <table name> to find out boundaries for creating
splits. In some cases this query is not the most optimal so you can
specify any arbitrary query returning two numeric columns using
--boundary-query argument.
As per the Sqoop docs:
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
So you have to specify your primary key (or another splitting column) with the --split-by argument.
If you choose 1 mapper, Sqoop will not split the task in parallel and will perform the complete import with a single mapper, as in the sketch below.
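A sketch based on the question's command, with -m 1 and without --split-by:
sqoop import \
-m 1 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--target-dir /user/cloudera/sqoop_import/departments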
Check my other answer (if you want) to understand the need for $CONDITIONS and the number of mappers.
From SQL Server, I imported data and created a Hive table using the command below:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --hive-import --hive-table hivedb.demotable --create-hive-table --fields-terminated-by ','
The command was successful; it imported the data and created a table with 10,000 records.
I then inserted 10 new records in SQL Server and tried to append these 10 records to the existing Hive table using the --where clause:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --where "ID > 10000" --hive-import -hive-table hivedb.demotable
But the Sqoop job fails with the error:
ERROR tool.ImportTool: Error during import: Import job failed!
Where am I going wrong? Are there any other alternatives for inserting into the table using Sqoop?
EDIT:
After slightly changing the above command, I am able to append the new rows:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --where "ID > 10000" --hive-import -hive-table hivedb.demotable --fields-terminated-by ',' -m 1
Though this resolves the problem mentioned above, I can't insert the modified rows. Is there any way to insert the modified rows without using the
--incremental lastmodified parameter?
In order to append rows to the Hive table, use the same query you have been using before, just without --hive-overwrite.
I will share the 2 commands that I used to import into Hive, one for overwriting and one for appending; you can use the same for your imports:
To OVERWRITE the previous records
sqoop import -Dmapreduce.job.queuename=default --connect jdbc:teradata://database_connection_string/DATABASE=database_name,TMODE=ANSI,LOGMECH=LDAP --username z****** --password ******* --query "select * from ****** where \$CONDITIONS" --split-by "HASHBUCKET(HASHROW(key to split)) MOD 4" --num-mappers 4 --hive-table hive_table_name --boundary-query "select 0, 3 from dbc.dbcinfo" --target-dir directory_name --delete-target-dir --hive-import --hive-overwrite --driver com.teradata.jdbc.TeraDriver
To APPEND to the previous records
sqoop import -Dmapreduce.job.queuename=default --connect jdbc:teradata://connection_string/DATABASE=db_name,TMODE=ANSI,LOGMECH=LDAP --username ****** --password ****** --query "select * from **** where \$CONDITIONS" --split-by "HASHBUCKET(HASHROW(key to split)) MOD 4" --num-mappers 4 --hive-import --hive-table guestblock.prodrptgstrgtn --boundary-query "select 0, 3 from dbc.dbcinfo" --target-dir directory_name --delete-target-dir --driver com.teradata.jdbc.TeraDriver
Note that I am using 4 mappers; you can use more as well.
I am not sure whether you can give the --append option directly in Sqoop together with the --hive-import option; it is still not available, at least in version 1.4.
The default behavior is to append when --hive-overwrite and --create-hive-table are missing (at least in this context).
I go with nakulchawla09's answer, though remind yourself to keep the --split-by option. This will ensure that the split file names in the Hive data store are created appropriately; otherwise you will not like the default naming. You can ignore this comment if you don't care about the behind-the-scenes Hive warehouse naming and data store. When I tried it with the below command:
Before the append
beeline:hive2> select count(*) from geolocation;
+-------+--+
| _c0 |
+-------+--+
| 8000 |
+-------+--+
File in the Hive warehouse before the append:
-rwxrwxrwx 1 root hdfs 479218 2018-10-12 11:03 /apps/hive/warehouse/geolocation/part-m-00000
Sqoop command for appending an additional 8,000 records:
sqoop import --connect jdbc:mysql://localhost/RAWDATA --table geolocation --username root --password hadoop --target-dir /rawdata --hive-import --driver com.mysql.jdbc.Driver --m 1 --delete-target-dir
It created the files below. You can see that the file names are not great, because I did not give a --split-by option or a split hash (which can be a datetime or date).
-rwxrwxrwx 1 root hdfs 479218 2018-10-12 11:03 /apps/hive/warehouse/geolocation/part-m-00000
-rwxrwxrwx 1 root hdfs 479218 2018-10-12 11:10 /apps/hive/warehouse/geolocation/part-m-00000_copy_1
Hive records appended now:
beeline:hive2> select count(*) from geolocation;
+-------+--+
| _c0 |
+-------+--+
| 16000 |
+-------+--+
We can use this command:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --query 'select * from demotable where ID > 10000' --hive-import --hive-table hivedb.demotable --target-dir demotable_data
Use the --append option and -m 1, so it will be like below:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --hive-import --hive-table hivedb.demotable --append -m 1