Sqoop import to route records with null values in a particular column into another table - sqoop

I am trying to route records with null values in a particular column into one table, and non-null records into another, during a Sqoop import. I tried to explore this on Google, but there is not much beyond the --null-string and --null-non-string params, and those just replace nulls with the defined characters ...
I can think of the following ways to handle it:
once imported into Hive, run a dedup/filter to separate out the records, but this is something to be tried only in the worst case.
handling it at the Sqoop level itself (no clue on this)
Could any expert here help me with the above ask?
ENV details: it's a plain Apache Hadoop cluster, Sqoop version 1.4.6.

We can try making use of the --query option along with the sqoop import command:
--query 'select * from table where column is null and $CONDITIONS'
and in a similar way for the not-null condition as well.
There will be 2 sqoop import jobs here.
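The two-job approach could be sketched as below; the connection string, table and column names (mydb, src_table, col_x, id) are placeholders for your environment, not anything from the question:

```shell
# Job 1: rows where the column IS NULL go to one target directory
sqoop import \
  --connect jdbc:mysql://dbhost/mydb --username dbuser -P \
  --query 'SELECT * FROM src_table WHERE col_x IS NULL AND $CONDITIONS' \
  --target-dir /data/src_table_nulls \
  --split-by id

# Job 2: rows where the column IS NOT NULL go to another
sqoop import \
  --connect jdbc:mysql://dbhost/mydb --username dbuser -P \
  --query 'SELECT * FROM src_table WHERE col_x IS NOT NULL AND $CONDITIONS' \
  --target-dir /data/src_table_not_nulls \
  --split-by id
```

Note the single quotes around the query: they stop the shell from expanding $CONDITIONS before Sqoop sees it.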

Related

Oracle table incremental import to HDFS

I have an Oracle table of 520 GB, and insert, update and delete operations are performed on this table frequently. The table is partitioned on the ID column; however, there is no primary key defined and no timestamp column available.
Can you please let me know the best way to perform an incremental import of this table to HDFS.
This totally depends on what your "id" column is. If it is generated by an ordered sequence, that's easy: just load the table with --incremental append --check-column ID.
If the ID column is generated with a NOORDER sequence, allow for some overlap and filter it on the Hadoop side.
If ID is not unique, your only choice is a CDC tool: Oracle GoldenGate, Informatica PWX and so on. There are no open-source/free solutions that I'm aware of.
Also, you don't need an index to perform an incremental load with Sqoop, but an index will definitely help, as its absence will lead to full scan(s) of the source (and possibly very big) table.
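For the ordered-sequence case, the incremental append load could look like the following sketch; the connection details and names (ORCL, BIG_TABLE, ID) are placeholders, and the --last-value shown is an assumed starting point:

```shell
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username scott -P \
  --table BIG_TABLE \
  --target-dir /data/big_table \
  --incremental append \
  --check-column ID \
  --last-value 1000000
```

After the run, Sqoop prints the new high-water mark for ID, which you feed into the next run's --last-value (or let a saved Sqoop job track it for you).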
Your problem is not that hard to solve; just look for a few key things in your DB.
1. Check whether your ID column satisfies the conditions "not NULL and 1=1"; if so, then use Sqoop for your task with the following options:
--incremental append/lastmodified --check-column [id column]
--split-by [id column] // this is useful if there is no primary key; it also allows you to run multiple mappers. Without it, in the absence of a primary key, you have to specify -m 1 for one mapper only.
The preferred way is to do this task as a saved Sqoop job, using sqoop job --create.
For more information check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
Hope this helps!
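A saved Sqoop job for this could be sketched as below; the job name, connection string and table/column names are placeholders. The point of the saved job is that Sqoop stores the --last-value in its metastore and updates it after every run:

```shell
# Create the saved job once (note the space between "--" and "import")
sqoop job --create big_table_incr -- import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username scott -P \
  --table BIG_TABLE \
  --target-dir /data/big_table \
  --incremental append \
  --check-column ID \
  --last-value 0

# Execute it repeatedly; each run picks up only rows newer than the stored last-value
sqoop job --exec big_table_incr
```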

incremental sqoop to HIVE table

It is known that the --incremental sqoop import switch doesn't work for Hive import through Sqoop. But what is the workaround for that?
1) One thing I could come up with is that we can create a Hive table, bring the incremental data to HDFS through Sqoop, and then manually load it. But if we do it that way, each time we do the load the data would be overwritten. Please correct me if I am wrong.
2) How effective is --query when sqooping data into Hive?
Thank you
You can do a Sqoop incremental append to the Hive table, but there is no straight option; below is one way you can achieve it.
Store the incremental table as an external table in Hive.
It is more common to be importing incremental changes since the last time the data was updated and then merging it. In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
For the second part of your question:
--query is also a very useful argument you can leverage in a Sqoop import. It gives you the flexibility of basic joins on the RDBMS tables and the flexibility to play with date and time formats. If I were in your shoes I would do this: using the query, import the data the way I need it, then append it to the original table, and while loading from the temporary table to the main table I can work with the data some more. I would suggest using --query if the updates are not too frequent.
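The merge step after the incremental import above could be sketched like this, assuming the external table sits on the Sqoop target directory; all table and column names (main_table, incremental_table, id, payload, modified_date) and the schema are hypothetical placeholders. It keeps the newest row per id:

```shell
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS incremental_table (
  id INT, payload STRING, modified_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/incremental_table';

-- Union old and new data, keep the latest version of each id
INSERT OVERWRITE TABLE main_table
SELECT id, payload, modified_date FROM (
  SELECT u.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date DESC) AS rn
  FROM (SELECT * FROM main_table
        UNION ALL
        SELECT * FROM incremental_table) u
) t
WHERE rn = 1;
"
```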

how to deal with a primary key while exporting data from Hive to an RDBMS using sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export this data into a table named "sample" of the "test" database in MySQL. What happens if one column is a primary key in test.sample and the data in Hive (which we are exporting) has duplicate values under that key? Then obviously the job will fail, so how could I handle this kind of scenario?
Thanks in advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing duplicates on the primary key. Take distinct values on that primary key column and then export to MySQL.
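The dedup before export could be sketched in Hive as below; the table and column names (sample, sample_stg, id, col1, col2) are placeholders, and the ORDER BY inside the window decides arbitrarily which duplicate survives, so adjust it to your tie-breaking rule:

```shell
hive -e "
-- Keep exactly one row per primary-key value in a staging table
INSERT OVERWRITE TABLE sample_stg
SELECT id, col1, col2 FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
  FROM sample s
) t
WHERE rn = 1;
"
# Then point sqoop export's --export-dir at sample_stg instead of sample.
```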

Significance of $CONDITIONS while running a sqoop command

What's the importance of $CONDITIONS in every Sqoop import query?
ex:
select col1, col2 from test_table where \$CONDITIONS
What if I need to put my own WHERE condition in the query; will it still work?
The significance is explained in the Sqoop User Guide and in the Apache Sqoop Cookbook. To put it in a nutshell, Sqoop needs a placeholder in the query that it populates with generated slices to enable parallel import.
It acts as a placeholder that Sqoop replaces to make the query work in parallel. It is a must to put $CONDITIONS when you are importing query results. You can still place your own condition, as below:
select col1, col2 from table where $CONDITIONS and "your condition"
Note: you can see directly in the console output what condition Sqoop places at $CONDITIONS. E.g. the first thing you will see is the condition "where 1=0", used to fetch the metadata/schema from the source table.
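A related pitfall worth noting: in double-quoted shell strings, $CONDITIONS must be escaped, or the shell expands it (to an empty string) before Sqoop ever sees it; in single quotes no escape is needed. A minimal demo, with no Sqoop involved:

```shell
# Double quotes: the backslash stops shell expansion of $CONDITIONS
DQ="select col1 from t where \$CONDITIONS"
# Single quotes: no expansion happens, no escape needed
SQ='select col1 from t where $CONDITIONS'
echo "$DQ"
echo "$SQ"
```

Both lines print the literal text `select col1 from t where $CONDITIONS`, which is exactly what Sqoop must receive.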

How can I import from a "SQL view" to Hive with Sqoop

I can import SQL tables to Hive; however, when I try to import a SQL view, I am getting errors.
Any ideas ?
From Sqoop documentation:
If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
I think that's your case here. Provide an additional option like:
--split-by tablex_id
where tablex_id is a field in your view that can be used as a primary key or index.
There is no special command to import a view from an RDBMS.
Use the plain import command with --split-by, which Sqoop treats much like a primary key.
You can import an RDBMS view's data, but you have to pass -m 1 or --split-by in your command.
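Putting the above together, a view import could be sketched as below; the connection string and names (mydb, my_view) are placeholders, and tablex_id stands for whatever reasonably uniform column your view exposes:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/mydb --username dbuser -P \
  --table my_view \
  --target-dir /data/my_view \
  --split-by tablex_id
```

If no suitable split column exists, drop --split-by and add -m 1 to force a single mapper.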
