Sqoop import query to transfer 1000 random records from a table? - hadoop

I have a table of around 100000 records and want to import 1000 random records from that table. Can someone help? :)
sqoop import \
--connect jdbc:mysql://localhost:3306/userdb \
--username root \
--table emp --m 1

Sqoop is just a tool that transfers data from MySQL to HDFS or from HDFS to MySQL, so there is no direct command for this, but you can do it with a free-form query like this:
Query:
--query "select * from my_table where \$CONDITIONS order by rand() limit 1000"
This will import 1000 random rows of the table.
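For example, a complete command along these lines (just a sketch reusing the connection details and emp table from the question; the target directory is an arbitrary example path) should import 1000 random rows. With --query and a single mapper, no --split-by is needed:
# the target dir below is only an example path; any empty HDFS directory works
sqoop import \
--connect jdbc:mysql://localhost:3306/userdb \
--username root \
--query "select * from emp where \$CONDITIONS order by rand() limit 1000" \
--target-dir /user/hadoop/emp_random \
-m 1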

There is no dedicated command for a random import, but you can limit the import to 1000 records using the --query option. Since you have a MySQL database, you can use the command below:
sqoop import --connect "$CONNECT_STRING" \
--query "select $source_column from $SOURCE_TABLE_NAME where \$CONDITIONS limit 1000" \
--username $USER_NAME --password $PASSWORD \
--target-dir $TARGET_DIRECTORY_NAME -m 1
You can also pass any custom query with the --query option.

Related

Sqoop import split by column and different database to split that data

I have a requirement: I need to get data from a Teradata database, a, into Hadoop.
I have access only to a view in that database and need to pull data from that view. Since I don't have read/write access to that database, I cannot use the --split-by column option in my Sqoop import query.
So is there an option where I can tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
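One possible workaround sketch (my own suggestion rather than anything from the original post, and it gives up parallelism): drop --split-by and run the same query with a single mapper, since --split-by is only required for free-form --query imports that use more than one mapper:
# single mapper: no --split-by needed, at the cost of running the import serially
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 1 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1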

How to pass column names having spaces to sqoop --map-column-java

I have to import data using Sqoop. My source column names have spaces in them, so when I add them to the --map-column-java parameter I get the error below.
Sample Sqoop import:
sqoop import --connect jdbc-con --username "user1" --query "select * from table where \$CONDITIONS" --target-dir /target/path/ -m 1 --map-column-java data col1=String, data col2=String, data col3=String --as-avrodatafile
Column names:
data col1,
data col2,
data col3
Error:
19/03/07 07:31:55 DEBUG sqoop.Sqoop: Malformed mapping. Column mapping should be the form key=value[,key=value]*
java.lang.IllegalArgumentException: Malformed mapping. Column mapping should be the form key=value[,key=value]*
at org.apache.sqoop.SqoopOptions.parseColumnMapping(SqoopOptions.java:1355)
at org.apache.sqoop.SqoopOptions.setMapColumnJava(SqoopOptions.java:1375)
at org.apache.sqoop.tool.BaseSqoopTool.applyCodeGenOptions(BaseSqoopTool.java:1363)
at org.apache.sqoop.tool.ImportTool.applyOptions(ImportTool.java:1011)
at org.apache.sqoop.tool.SqoopTool.parseArguments(SqoopTool.java:435)
at org.apache.sqoop.Sqoop.run(Sqoop.java:135)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Malformed mapping. Column mapping should be the form key=value[,key=value]*
I was able to resolve this issue:
1. Spaces issue:
sqoop import --connect jdbc-con --username "user1" --query "select * from table where \$CONDITIONS" --target-dir /target/path/ -m 1 --map-column-java "data col1=String, data col2=String, data col3=String" --as-avrodatafile
2. ERROR tool.ImportTool: Import failed: Cannot convert SQL type 2005:
Three columns in the source have SQL type 2005 (nvarchar); adding them to --map-column-java resolved this issue.
3. org.apache.avro.file.DataFileWriter$AppendWriteException: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 1****
This was caused by using * in the select query, so I modified the Sqoop query as:
sqoop import --connect jdbc-con --username "user1" --query "select [data col1],[data col2],[data col3] from table where \$CONDITIONS" --target-dir /target/path/ -m 1 --map-column-java "data col1=String, data col2=String, data col3=String" --as-avrodatafile
Instead of the above, you can use the following method; I have used it and it works.
Here I am casting the columns to String so that a timestamp does not get changed to an int. Keep that point in mind; it will help you build your mapping string properly.
Sqoop expects the mappings as a comma-separated list in the form 'column-name=new-type'. Placeholders used in the demos below:
address = localhost or your database server IP address
port = your database port number
database-name = your database name
user-name = your database user name
database-password = your database password
columns-name = the name of the timestamp or datetime column you want mapped to String
Demo to understand the command properly:
sqoop import --map-column-java "columns-name=String" --connect jdbc:postgresql://address:port/database-name --username user-name --password database-password --query "select * from demo where \$CONDITIONS;" -m 1 --target-dir /jdbc/star --as-parquetfile --enclosed-by '\"'
Demo for a single column:
sqoop import --map-column-java "date_of_birth=String" --connect jdbc:postgresql://192.168.0.1:1928/alpha --username postgres --password mysecretpass --query "select * from demo where \$CONDITIONS;" -m 1 --target-dir /jdbc/star --as-parquetfile --enclosed-by '\"'
Demo for multiple columns:
sqoop import --map-column-java "date_of_birth=String,create_date=String" --connect jdbc:postgresql://192.168.0.1:1928/alpha --username postgres --password mysecretpass --query "select * from demo where \$CONDITIONS;" -m 1 --target-dir /jdbc/star --as-parquetfile --enclosed-by '\"'

Unable to import data from MySql using Sqoop with different delimiter

As a beginner in the Hadoop field, I was trying my hand at the Sqoop tool (version: Sqoop 1.4.6-cdh5.8.0).
Though I referred to various sites and forums, I could not get a workable solution in which I could import data using any delimiter other than ,.
Please find below the code that I have used:
--- Connecting to MySQL, creating the table and inserting records with , in the strings.
mysql> create database GRHadoop;
Query OK, 1 row affected (0.00 sec)
mysql> use GRHadoop;
Database changed
mysql> Create table sitecustomer(Customerid int(10), Customername varchar(100),Productid int(4),Salary int(20));
Query OK, 0 rows affected (0.22 sec)
mysql> Insert into sitecustomer values(1,'Sohail',100,50000),(2,'Reshma',200,80000),(3,'Tom',200,60000);
Query OK, 3 rows affected (0.06 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> Insert into sitecustomer values(4,'Su,kama',300,50000),(5,'Ram,bha',100,80000),(6,'Suz',200,60000);
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0
Sqoop Command :
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/GRHadoop \
--username root \
--password cloudera \
--table sitecustomer \
--input-fields-terminated-by '|' \
--lines-terminated-by "\n" \
--target-dir /user/cloudera/GR/Sqoop/sitecustomer_data \
--m 1;
Expected Output :
1|Sohail|100|50000
2|Reshma|200|80000
3|Tom|200|60000
4|Su,kama|300|50000
5|Ram,bha|100|80000
6|Suz|200|60000
Actual output :
1,Sohail,100,50000
2,Reshma,200,80000
3,Tom,200,60000
4,Su,kama,300,50000
5,Ram,bha,100,80000
6,Suz,200,60000
Please guide me on where I am getting it wrong.
The --input-fields-terminated-by argument tells Sqoop how to parse input files during an export. You should be using --fields-terminated-by; that argument controls how the imported output is formatted.
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/GRHadoop \
--username root \
--password cloudera \
--table sitecustomer \
--fields-terminated-by '|' \
--lines-terminated-by "\n" \
--target-dir /user/cloudera/GR/Sqoop/sitecustomer_data \
--m 1;
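As a quick check after the corrected import (assuming the usual part-m-00000 naming for a single-mapper job), you can inspect the delimiter directly in HDFS:
# expect pipe-delimited rows such as 1|Sohail|100|50000
hdfs dfs -cat /user/cloudera/GR/Sqoop/sitecustomer_data/part-m-00000 | head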

Using sqoop import, How to append rows into existing hive table?

From SQL Server, I imported data and created a Hive table using the query below.
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --hive-import --hive-table hivedb.demotable --create-hive-table --fields-terminated-by ','
The command was successful; it imported the data and created a table with 10000 records.
I then inserted 10 new records in SQL Server and tried to append these 10 records to the existing Hive table using the --where clause:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --where "ID > 10000" --hive-import -hive-table hivedb.demotable
But the Sqoop job fails with the error:
ERROR tool.ImportTool: Error during import: Import job failed!
Where am I going wrong? Are there any other alternatives for inserting into the table using Sqoop?
EDIT:
After slightly changing the above command, I am able to append the new rows:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --where "ID > 10000" --hive-import -hive-table hivedb.demotable --fields-terminated-by ',' -m 1
Though this resolves the problem mentioned above, I can't insert modified rows. Is there any way to insert modified rows without using the --incremental lastmodified parameter?
In order to append rows to the Hive table, use the same query you have been using before; just remove --hive-overwrite.
I will share the two queries that I used to import into Hive, one for overwriting and one for appending; you can use the same for importing.
To OVERWRITE the previous records
sqoop import -Dmapreduce.job.queuename=default --connect jdbc:teradata://database_connection_string/DATABASE=database_name,TMODE=ANSI,LOGMECH=LDAP --username z****** --password ******* --query "select * from ****** where \$CONDITIONS" --split-by "HASHBUCKET(HASHROW(key to split)) MOD 4" --num-mappers 4 --hive-table hive_table_name --boundary-query "select 0, 3 from dbc.dbcinfo" --target-dir directory_name --delete-target-dir --hive-import --hive-overwrite --driver com.teradata.jdbc.TeraDriver
TO APPEND to the previous records
sqoop import -Dmapreduce.job.queuename=default --connect jdbc:teradata://connection_string/DATABASE=db_name,TMODE=ANSI,LOGMECH=LDAP --username ****** --password ****** --query "select * from **** where \$CONDITIONS" --split-by "HASHBUCKET(HASHROW(key to split)) MOD 4" --num-mappers 4 --hive-import --hive-table guestblock.prodrptgstrgtn --boundary-query "select 0, 3 from dbc.dbcinfo" --target-dir directory_name --delete-target-dir --driver com.teradata.jdbc.TeraDriver
Note that I am using 4 mappers; you can use more as well.
I am not sure if you can use the --append option directly in Sqoop with the --hive-import option. It's still not available, at least in version 1.4.
The default behavior is append when --hive-overwrite and --create-hive-table are missing (at least in this context).
I go with nakulchawla09's answer, though remind yourself to keep the --split-by option. This ensures the split file names in the Hive data store are created appropriately; otherwise you will not like the default naming. You can ignore this comment if you don't care about the behind-the-scenes Hive warehouse naming and data store. When I tried with the below command:
Before the append
beeline:hive2> select count(*) from geolocation;
+-------+--+
| _c0 |
+-------+--+
| 8000 |
+-------+--+
file in hive warehouse before the append
-rwxrwxrwx 1 root hdfs 479218 2018-10-12 11:03 /apps/hive/warehouse/geolocation/part-m-00000
Sqoop command for appending an additional 8k records:
sqoop import --connect jdbc:mysql://localhost/RAWDATA --table geolocation --username root --password hadoop --target-dir /rawdata --hive-import --driver com.mysql.jdbc.Driver --m 1 --delete-target-dir
It created the files below. You can see the file names are not great, because I did not give a --split-by option or split hash (which can be a datetime or date).
-rwxrwxrwx 1 root hdfs 479218 2018-10-12 11:03 /apps/hive/warehouse/geolocation/part-m-00000
-rwxrwxrwx 1 root hdfs 479218 2018-10-12 11:10 /apps/hive/warehouse/geolocation/part-m-00000_copy_1
Hive records appended now:
beeline:hive2> select count(*) from geolocation;
+-------+--+
| _c0 |
+-------+--+
| 16000 |
+-------+--+
We can use this command:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --query 'select * from demotable where ID > 10000 AND $CONDITIONS' --hive-import --hive-table hivedb.demotable --target-dir demotable_data -m 1
Use the --append option and -m 1, so it will be like below:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password paswd --table demotable --hive-import --hive-table hivedb.demotable --append -m 1

schedule and automate sqoop import/export tasks

I have a Sqoop job which requires importing data from Oracle to HDFS.
The Sqoop query I'm using is:
sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '1' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test1 --fields-terminated-by '\t'
I am re-running the same query again and again, changing partitionid from 1 to 96, so I have to execute the sqoop import command manually 96 times. The table 'ORDERS' contains millions of rows, and each row has a partitionid from 1 to 96. I need to import 10001 rows from each partitionid into HDFS.
Is there any way to do this? How can I automate the Sqoop job?
Run the script: $ ./script.sh 20   # for the 20th partition id
ramisetty@HadoopVMbox:~/ramu$ cat script.sh
#!/bin/bash
PART_ID=$1
TARGET_DIR_ID=$PART_ID
echo "PART_ID:" $PART_ID "TARGET_DIR_ID: "$TARGET_DIR_ID
sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '$PART_ID' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test/$TARGET_DIR_ID --fields-terminated-by '\t'
For all partitions 1 to 96 in a single shot:
ramisetty@HadoopVMbox:~/ramu$ cat script_for_all.sh
#!/bin/bash
for part_id in {1..96};
do
PART_ID=$part_id
TARGET_DIR_ID=$PART_ID
echo "PART_ID:" $PART_ID "TARGET_DIR_ID: "$TARGET_DIR_ID
sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '$PART_ID' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test/$TARGET_DIR_ID --fields-terminated-by '\t'
done
Use crontab for scheduling purposes; crontab documentation is in the man pages (man crontab in the terminal).
Add your sqoop import command to a shell script and execute that shell script using crontab.
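For example, a crontab entry like the one below (an illustrative schedule; the script and log paths are assumptions based on the prompts above) would run the whole import once a day at 2 AM:
# illustrative paths; adjust to wherever the script and log should live
0 2 * * * /home/ramisetty/ramu/script_for_all.sh >> /home/ramisetty/ramu/sqoop_import.log 2>&1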
