Oracle ROWID for Sqoop Split-By Column

I have a huge Oracle table (Transaction). The data is skewed on the "Customer id" column, so a few mappers take hours to finish while the other mappers finish in minutes. I couldn't see any other way to avoid the skew, as this is the only column the table can be split by. We could combine other columns (Customer ID, Batch ID, SEQ NUM) into a multi-column split, but as I understand it Sqoop doesn't support multiple columns in --split-by.
My objective is to pull the transaction data for a specific period (i.e. batch date unique for a month of data).
I tried the below options in sqoop with 10 mappers.
--split-by "my column name" //for example customer id
--where "my query condition" //for example batch date
Now I am thinking of using ROWID, which might split the rows evenly between the mappers. I thought of using a boundary query to get the MIN and MAX ROWID. Below is the Sqoop command I want to use.
sqoop import \
--table Transaction \
--split-by ROWID \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--boundary-query "SELECT MIN(ROWID) AS MIN, MAX(ROWID) AS MAXL FROM Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') GROUP BY CUSTOMERID, BATCHNO,BATCHSEQNO " \
--num-mappers 10 \
--target-dir /user/trans
I need advice on whether this is the right option or whether there is another way.
I would also like to know whether multiple columns can be used in --split-by by any chance.

Providing --boundary-query only saves time in evaluating the minimum and maximum values. The mappers will run the same range queries either way.
In your case, Sqoop will generate a boundary query like:
SELECT MIN(ROWID), MAX(ROWID) FROM (Select * From Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') ) t1
You can try this query and your custom boundary query on your JDBC client to check which one is faster and use that one.
Now, coming to the uneven mapper load.
Yes, you are right. Currently, Sqoop doesn't support multiple columns in --split-by; you have to choose one column. If ROWID is evenly distributed (I am assuming it is), you should use it.
So your query looks good. Just compare your --boundary-query against the generated one.
Edit
Sqoop has no proper Java type for Oracle's ROWID type.
Add --map-column-java ROWID=String to your import command to map it to Java's String.

Do you have an index on SEQ NUM? If so, you can use SEQ-NUM in --split-by (I am assuming that SEQ-NUM is not generated randomly but is populated incrementally for each transaction). Your sqoop command might then look like this:
sqoop import \
--table Transaction \
--split-by SEQ-NUM \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--num-mappers 10 \
--target-dir /user/trans

Related

How to specify multiple columns for incremental data in Sqoop?

I am using the following command to fetch incremental data in Sqoop:
bin/sqoop job --create JOB_NAME -- import --connect jdbc:oracle:thin:/system#HOST:PORT:ORACLE_SERVICE --username USERNAME --password-file /PASSWORD_FILE.txt --fields-terminated-by ',' --enclosed-by '"' --table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR -m 2 --incremental append --check-column NVL(UPDATE_DATE,INSERT_DATE) --last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY --direct
It throws an error because of the multiple columns in the --check-column parameter.
Is there any approach to specify multiple columns in the --check-column parameter?
If the UPDATE_DATE field contains a null value, I want the data to be fetched on the basis of the INSERT_DATE column instead.
I want to extract transaction records from a table that is updated daily. When a record is inserted for the first time, there is no value in the UPDATE_DATE column. That's why I need to compare both columns while fetching data from the table.
Any help regarding this would be highly appreciated.
As far as I understand, it is not possible to have two check columns in an incremental import, so the only way to get this done is with two separate imports:
Incremental import with the insert date as the check column, for first-time records
Incremental import with the update date as the check column, for updated records

Sqoop import to route records with null values in a particular column into another table

I am trying to move records with null values in a particular column to one table and the non-null records to another during a Sqoop import. I searched on Google, but there is not much beyond the --null-string and --null-non-string parameters, and those just replace nulls with the defined characters ...
I can think of the following ways to handle it:
After importing into Hive, run a dedup to filter out the records, but this is something to try only in the worst case.
Handle it at the Sqoop level itself (no clue how).
Could any expert here help me with this?
Environment: a plain Apache Hadoop cluster, Sqoop version 1.4.6.
You can use the --query option with the sqoop import command:
--query 'select * from table where column is null and $CONDITIONS'
Do the same with an "is not null" condition for the other table.
That makes two sqoop import jobs.
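The two jobs differ only in the predicate and the target directory, so they can be generated from one template. Below is a minimal Python sketch of that idea. The connection string, table name (orders), column name (ship_date) and target directories are all hypothetical placeholders, and the sketch only builds and prints the command lines; it does not run Sqoop. It uses a single mapper because a free-form --query import needs either --split-by or one mapper.

```python
import shlex

def build_import(connect, table, column, want_null, target_dir):
    """Return the argument list for one free-form-query Sqoop import."""
    predicate = f"{column} IS NULL" if want_null else f"{column} IS NOT NULL"
    # $CONDITIONS must stay in the query text; Sqoop fills it in at run time.
    query = f"SELECT * FROM {table} WHERE {predicate} AND $CONDITIONS"
    return [
        "sqoop", "import",
        "--connect", connect,
        "--query", query,
        "--num-mappers", "1",      # single mapper, so no --split-by is needed
        "--target-dir", target_dir,
    ]

# Hypothetical connection string, table, column and directories.
CONNECT = "jdbc:mysql://localhost/testdb"
null_job = build_import(CONNECT, "orders", "ship_date", True, "/user/demo/orders_null")
not_null_job = build_import(CONNECT, "orders", "ship_date", False, "/user/demo/orders_not_null")

for job in (null_job, not_null_job):
    # shlex.join single-quotes the query, so the shell leaves $CONDITIONS alone.
    print(shlex.join(job))
```

Each printed line can be pasted into a shell; the two target directories can then back two separate Hive tables.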

Significance of $conditions in Sqoop

What is the significance of $conditions clause in sqoop import command?
select col1, col2 from test_table where \$CONDITIONS
Sqoop performs highly efficient data transfers by inheriting Hadoop’s parallelism. To help Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the where clause of your query. Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task. While you could skip $CONDITIONS by forcing Sqoop to run only one job using the --num-mappers 1 parameter, such a limitation would have a severe performance impact.
For example:
If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. One mapper may execute "select bla from foo WHERE (id >=0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)" and so on.
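The substitution described above can be imitated in a few lines of Python. This is only an illustration of the idea, not Sqoop's actual code: the function name substitute_conditions and the list of boundaries are made up for the example, and every range is written as half-open.

```python
def substitute_conditions(query, column, boundaries):
    """Return one query per mapper, with $CONDITIONS replaced by a range predicate.

    boundaries is a sorted list of split points; each consecutive pair
    (lo, hi) becomes the predicate "column >= lo AND column < hi".
    """
    queries = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        predicate = f"({column} >= {lo} AND {column} < {hi})"
        queries.append(query.replace("$CONDITIONS", predicate))
    return queries

# Two mappers, matching the ranges quoted in the answer above.
for q in substitute_conditions("select bla from foo WHERE $CONDITIONS", "id", [0, 10000, 20000]):
    print(q)
```

With a single mapper there is only one slice, but the placeholder still has to be present in the query text for it to be replaced.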

Why is the default number of mappers 4 in Sqoop? Can we give more than 4 mappers in the -m parameter?

I am trying to understand the reason behind the default number of mappers in a Sqoop job. Can we set more than four mappers in a Sqoop job to achieve higher parallelism?
The default number of mappers is 4. It is strongly recommended that you always split on an integer column rather than a string/char/text column; see the code here for more explanation: https://github.com/apache/sqoop/blob/660f3e8ad07758aabf0a9b6ede3accdfac5fb1be/src/java/org/apache/sqoop/mapreduce/db/TextSplitter.java#L100
Yes, you can increase or decrease the parallelism by specifying -m.
From the Sqoop User Guide:
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
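To make the point about unbalanced tasks concrete, here is a small Python sketch. It is a simplified model, not Sqoop's splitter: the function names and the sample data are invented for illustration, and the last range is treated as inclusive of the maximum rather than written as (750, 1001) as in the guide. It cuts the range 0..1000 into four equal-width slices and counts how many rows land in each slice for a uniform column and for a skewed one.

```python
def split_ranges(lo, hi, num_splits):
    """Cut [lo, hi] into num_splits contiguous ranges of near-equal width."""
    width = (hi - lo) / num_splits
    points = [lo + round(i * width) for i in range(num_splits)] + [hi]
    return list(zip(points, points[1:]))

def rows_per_split(values, ranges):
    """Count values per range; ranges are half-open except the last, which includes its upper bound."""
    counts = [0] * len(ranges)
    last = len(ranges) - 1
    for v in values:
        for i, (a, b) in enumerate(ranges):
            if a <= v < b or (i == last and v == b):
                counts[i] += 1
                break
    return counts

ranges = split_ranges(0, 1000, 4)

# Uniform key: every id from 0 to 1000 appears once.
uniform = list(range(0, 1001))
# Skewed key (made-up data): 900 rows share the value 1, the rest are spread out.
skewed = [1] * 900 + list(range(0, 1001, 10))

print(ranges)
print(rows_per_split(uniform, ranges))
print(rows_per_split(skewed, ranges))
```

In the skewed case the first slice receives almost all of the rows even though all four slices cover the same width of key values, which is the situation where choosing a different --split-by column helps.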
My guess is that 4 is a default number that works well in practice for most use cases. Use the parameter --num-mappers if you want Sqoop to use a different number of mappers. For example, to use 8 concurrent tasks you would use the following sqoop command:
sqoop import \
--connect jdbc:mysql://mysql.example.com/testdb \
--username abcdef \
--password 123456 \
--table test \
--num-mappers 8
Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion. However, it will also increase the load on the database, as Sqoop will execute more concurrent queries. You might want to keep this in mind if you are pulling data from your production environment.
When we don't specify the number of mappers while transferring data from an RDBMS to HDFS, Sqoop uses the default of 4 mappers. Sqoop imports data in parallel from most database sources. Sqoop uses only mappers, as it does parallel import and export.
If you don't specify the number of mappers, Sqoop will by default use 4 mappers in parallel for the import. If you want to use more than 4 mappers, pass --num-mappers in your sqoop command with the number you want.
Also, if you are not sure whether the source table has a primary key, --autoreset-to-one-mapper comes in handy. If there is a primary key, Sqoop will use the specified number of mappers to execute the job; otherwise it will use just one mapper to import the table without a primary key.
sqoop-import \
--connect jdbc:mysql://localhost/databasename \
--username root \
--password xxxxxxxx \
--warehouse-dir /directory/path/from/home \
--autoreset-to-one-mapper \
--num-mappers 6
It also comes in handy when you run sqoop import-all-tables to import multiple tables: for all the tables with a primary key it will use the specified number of mappers, and for all the tables without a primary key it will reset the number of mappers to 1 without failing the job.
Note: Tables without a primary key use only one mapper for sqoop import unless you specify a --split-by column.

What is the purpose of split-by <column> --target-dir in Sqoop

What happens internally when we write --split-by in sqoop?
Example:
sqoop import --connect jdbc:mysql://localhost/test --username root --password training123 --query 'select * from transaction where $CONDITIONS' --split-by Txnid --target-dir input/transaction
Hadoop MapReduce is all about divide and conquer.
In order to partition data into multiple independent slices that will be transferred in a parallel manner, Sqoop needs to find the minimum and maximum value of the column specified in the --split-by parameter.
When using the split-by option, you should choose a column which contains values that are uniformly distributed.
In the command above, we are telling Sqoop that the data is evenly distributed on the column 'Txnid' and that it should use this column to make the splits.
--split-by : It is used to specify the column of the table used to generate splits for imports. This means that it specifies which column will be used to create the split while importing the data into your cluster. It can be used to enhance the import performance by achieving greater parallelism. Sqoop creates splits based on values in a particular column of the table which is specified by --split-by by the user through the import command. If it is not available, the primary key of the input table is used to create the splits.
Reason to use : Sometimes the primary key doesn't have an even distribution of values between the min and max values(which is used to create the splits if --split-by is not available). In such a situation you can specify some other column which has proper distribution of data to create splits for efficient imports.
--split-by <column-name> - Column of the table used to split work units
Reference: Sqoop User Guide
Sqoop creates splits based on values in a particular column of the table which is specified by --split-by by the user through the import command. If it is not available, the primary key of the input table is used to create the splits. We can choose the column through --split-by which can result in the best splitting and thus increasing parallelism and better performance.
--split-by in Sqoop is used to create input splits for the mappers. It is very useful for parallelism, as splitting the work lets the job run faster.
