Sqoop export update only specified columns - hadoop

As far as I know, we can update a database using the "--update-key" argument, which updates the whole record for that key. We can either insert or update with "--update-mode allowinsert" or "--update-mode updateonly".
For example, I have a file consisting of a primary key and one column value that I have to update in a table which has other columns too. My question is: can we update that particular column without touching the other columns in the table? We must specify all the columns for the --update-key argument, right? Is there any solution or workaround for this?

Yes.
By using "--update-key" and "columns" arguments.
Example:
$ sqoop export --connect jdbc:mysql://localhost/TGL --username root --password root --table staging --export-dir /sqoop/DB1_Result -m 1 -input-fields-terminated-by ","
note: field specified in update-key must be in columns argument
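For instance, a sketch of the same export with the column list restricted (the column names id and val are hypothetical placeholders for the key column and the column being updated):
$ sqoop export --connect jdbc:mysql://localhost/TGL --username root --password root \
    --table staging --update-key id --update-mode updateonly --columns "id,val" \
    --export-dir /sqoop/DB1_Result -m 1 --input-fields-terminated-by ","
Only the columns named in --columns appear in the generated UPDATE statement, so the other columns in the table are left untouched.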

Related

Can Sqoop update records on an Oracle RDBMS table that has a different column structure from the Hive table?

I'm a Hadoop newcomer trying to export data from Hive to Oracle. Can Sqoop update data in an Oracle table? Say:
The Oracle table has columns A, B, C, D, E
I stored the data in a Hive table as B, C, E
Can Sqoop export do an update (just update, not upsert) with B, C as the update keys, updating just the E column from Hive?
Please mention --update-key Prim_key_col_in_table. Note that the default --update-mode is updateonly, so you don't have to mention it explicitly.
You can also add the --input-fields-terminated-by option if you want to.
Here is a sample command:
sqoop export --connect jdbc:mysql://xxxxxx/mytable --username xxxxx --password xxxxx --table export_sqoop_mytable --update-key Prim_key_col_in_table --export-dir /user/ingenieroandresangel/datasets/mytable.txt -m 1
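Mapping that onto the question, a sketch with B and C as a composite update key so that only E is updated (the Oracle connection string and export directory are placeholders):
sqoop export --connect jdbc:oracle:thin:@//dbhost:1521/ORCL --username xxxxx --password xxxxx \
    --table MYTABLE --update-key "B,C" --update-mode updateonly --columns "B,C,E" \
    --export-dir /user/hive/warehouse/mytable -m 1 --input-fields-terminated-by ','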

last-value in sqoop (incremental import)

sqoop import --connect \
jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value
We don't know the last value from the previous import. How can I write the command?
You can try two approaches to solve this:
1) Query the table for the maximum value of the check column and pass that as --last-value.
2) Create a job in Sqoop with the column set as the incremental check column; moving forward, your job will run on an incremental basis and keep track of the last value for you (see the sketch below).
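A sketch of the second approach, reusing the connection details from the question (the job name incimpjob is an assumption):
sqoop job --create incimpjob -- import \
    --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
    --check-column rank --incremental append --last-value 0
After the first run with sqoop job --exec incimpjob, the job stores incremental.last.value in the Sqoop metastore and reuses it on every subsequent execution, so you never have to supply --last-value by hand again.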
Go to your home directory:
cd ~/.sqoop
Open the file metastore.db.script using vi or your favourite editor and search for incremental.last.value.
It should be something like
INSERT INTO SQOOP_SESSIONS VALUES('incimpjob','incremental.last.value','2018-09-11 19:20:52.0','SqoopOptions')
Note: I am assuming that you have created a Sqoop Job. The 'incimpjob' is the name of my sqoop job.
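If you only need the stored value, a one-line sketch (assuming the default metastore location under ~/.sqoop):
grep incremental.last.value ~/.sqoop/metastore.db.script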

How to take updated records along with incremental import from RDBMS table to a Hive table?

I'm working with Sqoop incremental import, bringing the data into my Hive table every day. I have the following scenario:
I have an RDBMS table: empdata with columns
id name city
1 Sid Amsterdam
2 Bob Delhi
3 Sun Dubai
4 Rob London
I am importing the data into Hive using Sqoop incremental import through a cron job that runs a shell script to do the work.
#!/bin/bash
DATE=$(date +"%d-%m-%y")
while IFS=":" read -r server dbname tablename; do
sqoop import --connect jdbc:mysql://$server/$dbname --table $tablename --username root --password cloudera --hive-import --hive-table dynpart --hive-partition-key 'thisday' --hive-partition-value $DATE --target-dir '/user/hive/newimp5' --incremental append --check-column id --last-value $(hive -e "select max(id) from $tablename");
done < /home/cloudera/Desktop/MyScripts/tables.txt
The above script for incremental load is working fine. But now I have another requirement: I also need to check whether any previous records have been updated. For example, if the record
4 Rob London
is updated to
4 Rob NewYork
I need to pick up that updated record along with the incremental import, but only the updated value should be present in the Hive table so that I don't end up with duplicates. Could anyone tell me how I can accomplish this?
In Sqoop you cannot use two columns in --check-column, and even if you could (you can combine two fields in --check-column; see, for example, Sqoop Incremental Import multiple columns in check-column), you still cannot be sure whether the city will have a higher or lower value next time, so you cannot really use the city field as a check column. You now have the following options:
1) In your RDBMS, create a new table with an additional timestamp field that is set automatically, so that every insert or update carries the current timestamp. Then, after the incremental append, import this table again using --incremental lastmodified with --check-column on the timestamp field (e.g. ts_field) and the appropriate --last-value, and also pass --merge-key id in the sqoop import so that it can merge the updates on the basis of id (see the sketch after this answer).
2) a) First run your sqoop import with --check-column id --incremental append and a last value.
b) Then run the sqoop import again without --incremental, with the target dir set to some temporary folder.
c) Then use sqoop merge to merge the two datasets (the target dirs of steps a and b), with the new data being the target dir of step a, merged onto the target dir of step b, and --merge-key id.
Please let me know if you have any further questions.
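A minimal sketch of option 1, reusing the connection details from the script above and assuming the timestamp column is called ts_field and $LAST_TS holds the timestamp recorded after the previous run:
sqoop import --connect jdbc:mysql://$server/$dbname --table $tablename \
    --username root --password cloudera \
    --incremental lastmodified --check-column ts_field --last-value "$LAST_TS" \
    --merge-key id --target-dir /user/hive/newimp5 -m 1
With --merge-key, Sqoop runs a merge job after the import so that each id keeps only its most recent record in the target directory.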

What is the relevance of -m 1?

I am executing the Sqoop command below:
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' --username DRRM_DATALOADER --password ****** --table T_VND --hive-import --hive-table amitesh_db.amit_hive_test --as-textfile --target-dir amitesh_test_hive -m 1
I have two questions:
1) What is the relevance of -m 1? As far as I know, it is the number of mappers that I am assigning to the Sqoop job. If that is true, then the moment I assign -m 2, the execution starts throwing the error below:
ERROR tool.ImportTool: Error during import: No primary key could be found for table xxx. Please specify one with --split-by or perform a sequential import with '-m 1'
Now I am forced to rethink; I see it has something to do with the database primary key. Can somebody explain the logic behind this?
2) I have told the above sqoop command to save the output as a text file. But when I go to the location suggested by the execution, I find tbl_name.jar. Why? If --as-textfile is the wrong syntax, then what is the right one? Or is there some other location where I can find the file?
1) To set -m or --num-mappers to a value greater than 1, the table must either have a PRIMARY KEY or the sqoop command must be given a --split-by column. Sqoop divides the rows among mappers by splitting the range of the key (or --split-by) column, so it needs some column it can split on; the "Controlling Parallelism" section of the Sqoop documentation explains this in detail (see the sketch below).
2) The file format of the data imported into the Hive table amit_hive_test will be plain text (--as-textfile). Because this is a --hive-import, the data is first imported into the --target-dir and then loaded (LOAD DATA INPATH) into the Hive table. The resulting files will therefore be under the table's LOCATION and not in --target-dir.
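As a sketch, the same import could run with more than one mapper by naming a split column explicitly (VND_ID is a hypothetical, roughly evenly distributed numeric column in T_VND):
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' --username DRRM_DATALOADER --password ****** \
    --table T_VND --hive-import --hive-table amitesh_db.amit_hive_test --as-textfile \
    --target-dir amitesh_test_hive --split-by VND_ID -m 4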

Sqoop import converting TINYINT to BOOLEAN

I am attempting to import a MySQL table of NFL play results into HDFS using Sqoop. I issued the following command to achieve this:
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/nfl \
--username <username> -P \
--table play
Unfortunately, there are columns of type TINYINT, which are being converted to booleans upon import. For instance, there is a 'quarter' column indicating which quarter of the game the play occurred in. The value in this column is converted to 'true' if the play occurred in the first quarter and 'false' otherwise.
In fact, I did a sqoop import-all-tables, importing the entire NFL database I have, and it behaves like this uniformly.
Is there a way around this, or perhaps some argument for import or import-all-tables that prevents this from happening?
Add tinyInt1isBit=false to your JDBC connection URL. Something like:
jdbc:mysql://127.0.0.1:3306/nfl?tinyInt1isBit=false
Another solution would be to explicitly override the column mapping for the TINYINT(1) column. For example, if the column name is foo, pass the following option to Sqoop during import: --map-column-hive foo=tinyint. In the case of non-Hive imports to HDFS, use --map-column-java foo=Integer.
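Putting the first option together with the original import, a sketch (the URL is quoted so the shell treats it as a single argument):
sqoop import \
    --connect 'jdbc:mysql://127.0.0.1:3306/nfl?tinyInt1isBit=false' \
    --username <username> -P \
    --table play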
