Sqoop import all the table primary and non primary key tables at a time - sqoop

Can we import tables using sqoop having primary and non primary at a time.Eg i am hvaing 200 primary key tables and 200 non priamry key tables in database .How can we import 400 tables at a time?

In addition to Jamie's answer:
You can add --autoreset-to-one-mapper tag in your sqoop import-all-tables... command.
Say you are using 8 mappers (-m 8) in your command. Then using above tag tables with primary keys will split as per the number of mappers and tables without primary keys will be loaded using 1 mappers.
so, overall your efficiency will improve.
Check 1st point of sqoop documentation for details.

Yes, you can add the flag --m 1 to the import command of all your tables (both the 200 with primary key and the 200 without it).
By adding this option, Sqoop will only use one mapper to retrieve all the data from the tables, so your command would look like something like this:
sqoop import-all-tables --connect your-database --username user --password pwd --m 1

Related

Oracle table incremental import to HDFS

I have Oracle table of 520 GB and on this table insert, Update and delete operations are performed frequently.This table is partitioned on ID column however there is no primary key defined and also there is no timestamp column available.
Can you please let me know what is best way I can perform incremental import to HDFS on this table.
This totally depends on what is your "id" column. If it is generated by ordered sequence, that's easy, just load the table with --incremental append --check-column ID.
If ID column is generated with noorder sequence, allow for some overlap and filter it on hadoop side.
If ID is not unique, your only choice is a CDC tool. Oracle GG, Informatica PWX and so on. There are no opensource/free solitions that I'm aware of.
Also don't need any index to perform incremental load with sqoop but an index will definitely help as its absence will lead to fullscan(s) to the source (and possibly very big) table.
your problem is not that hard to solve, just look for some key things in you db.
1. is you column id run by conditions "not NULL and 1=1 ", if so then use sqoop for you task
using following sqoop tools
--incremental append/lastmodified -check-column [column id]
--split-by [column id] // this is useful if there is not primary key also allows you to run multiple mappers in case of no primary key, you have to specify -m 1 for one mapper only.
prefered way is to do this task using sqoop job using --create tool.
for more information check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
Hope this Helps !

incremental sqoop to HIVE table

It is known that --incremental sqoop import switch doesn't work for HIVE import through SQOOP. But what is the workaround for that?
1)One thing I could make up is that we can create a HIVE table, and bring incremental data to HDFS through SQOOP, and then manually load them. but if we are doing it , each time do that load, the data would be overwritten. Please correct me if I am wrong.
2) How effective --query is when sqooping data to HIVE?
Thank you
You can do the sqoop incremental append to the hive table, but there is no straight option, below is one of the way you can achieve it.
Store the incremental table as an external table in Hive.
It is more common to be importing incremental changes since the last time data was updated and then merging it.In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail —connection manager org.apache.sqoop.teradata.TeradataConnManager --username dbc -password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
second part of your question
Query is also a very useful argument you can leverage in swoop import, that will give you the flexibility of basic joins on the rdbms table and flexibility to play with the date and time formats. If I were in your shoes I would do this, using the query I will import the data in the way I need and than I will append it to my original table and while loading from temporary to main table I can play more with the data. I would suggest using query if the updates are not too frequent.

why is the default maximum mappers are 4 in Sqoop? can we give more than 4 mappers in -m parameter?

I am trying to understand the reason behind the default maximum mappers in a sqoop job. Can we set more than four mappers in a sqoop job to achieve higher parallelism.
If you are using integer column in your split-by then the default number of mappers are 4. And it is strongly recomonded that you always use integer column not the string/char/Text column. see the code here for more explaination. https://github.com/apache/sqoop/blob/660f3e8ad07758aabf0a9b6ede3accdfac5fb1be/src/java/org/apache/sqoop/mapreduce/db/TextSplitter.java#L100
Yes you can give increase/decrease the parallelism by specifying -m
from Sqoop Guide
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
My guess is that 4 is a default number that works well on practice for most use cases. Use the parameter --num-mappers if you want Sqoop to use a different number of mappers. For example, to use 8 concurrent tasks you would use the following sqoop command:
sqoop import \
--connect jdbc:mysql://mysql.example.com/testdb \
--username abcdef \
--password 123456 \
--table test \
--num-mappers 8
Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion. However, it will also increase the load on the database as Sqoop will execute more concurrent queries. You might want to keep this in mind of you are pulling data from your production environment.
when we don't mention the number of mappers while transferring the data from RDBMS to HDFS file system sqoop will use default number of mapper 4. Sqoop imports data in parallel from most database sources. Sqoop only uses mappers as it does parallel import and export.
If you're not mentioning number of mapper in sqoop it will by default use 4 mapper in parallel to do the sqoop import. If you want to use more then 4 mapper you can use --num-mappers in your sqoop command you can use any number of mapper
Also, if you are not sure you have primary key or not in source table --autoreset-to-one-mapper come handy in that case. If there is primary key it will use the mentioned number of mappers to execute the job or else it will just use one mapper to import the table without primary key
sqoop-import \
-- connect jdbc:mysql://localhost/databasename \
-- username root \
-- password xxxxxxxx \
-- warehouse-dir /directory/path/from/home \
-- autoreset-to-one-mapper \
-- num-mappers 6
Also, it comes handy when you doing sqoop import-all-tables to import multiple tables for all the tables with primary key it will use the mentioned number of mappers and for all the tables without primary key it will reset the number of mappers to 1 without failing the job.
Note: The tables without Primary Key use only one mapper for sqoop import until and unless you're not giving --split-by column

how to with deal primarykey while exporting data from Hive to rdbms using sqoop

Here is a my scenario i have a data in hive warehouse and i want to export this data into a table named "sample" of "test" database in mysql. What happens if one column is primary key in sample.test and and the data in hive(which we are exporting) is having duplicate values under that key ,then obviously the job will fail , so how could i handle this kind of scenario ?
Thanks in Advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test -table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Beforing performing export operation ,massage your data by removing duplicates from primary key. Take distinct on that primary column and then export to mysql.

How can I import from "SQL View" to Sqoop Hive

I can import SQL tables to Hive, however when I try to SQL view, I am getting errors.
Any ideas ?
From Sqoop documentation:
If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
I think it's your case here. Provide additional option like:
--split-by tablex_id
where tablex_id is a field in your view that can be used as a primary key or index.
There is no any special command to import View form RDBMS.
Use simple import command with --split-by that sqoop refers to same as primary key.
you can import the rdbms views data but u have to pass -m 1 or --split-by in your command.

Resources