How many database connections are created during a Sqoop import, and is there a maximum value for it? Also, please confirm whether the number of DB connections is equal to the number of mappers.
The number of database connections depends on the number of mapper tasks (parallel tasks) that are running while data is imported from the database.
Suppose you specify -m 1 or --num-mappers 1 in your Sqoop command; then only one database connection will be active for the entire import.
The maximum number of DB connections depends on how many connections are allowed for the user through which you access the database. Suppose you access the database as user "A" and this user is limited to 10 connections; then you won't be able to open more than 10 connections at a time. That means if you specify --num-mappers 11 in your Sqoop command, the job will fail. In short, the number of DB connections is equal to the number of mappers.
I want to import a big table from an Oracle database to HDFS using Sqoop.
As the table is huge and has a primary key, Sqoop can run multiple mappers in parallel.
I have some questions about this:
1) Suppose that, due to a bad record in the Oracle database, one mapper hits an exception while the others run fine. Will the whole job fail, or will all mappers except that one write their data to HDFS?
2) Is Sqoop intelligent enough to run parallel mappers if we give the -m option?
If we give -m 4, can Sqoop increase the number of mappers based on the table size, or will it run with 4 only?
Has anybody come across this kind of scenario?
Based on my knowledge:
If one mapper fails, the Sqoop process will try to kill the other mappers. The process won't delete the data from HDFS, so you may see some of the data already created in your HDFS location.
When we specify the number of mappers (using the -m x option), the program will use at most x mappers.
I am trying to understand the reason behind the default maximum number of mappers in a Sqoop job. Can we set more than four mappers in a Sqoop job to achieve higher parallelism?
The default number of mappers is 4. It is strongly recommended that you always use an integer column in your split-by, not a string/char/text column. See the code here for more explanation: https://github.com/apache/sqoop/blob/660f3e8ad07758aabf0a9b6ede3accdfac5fb1be/src/java/org/apache/sqoop/mapreduce/db/TextSplitter.java#L100
Yes, you can increase or decrease the parallelism by specifying -m.
From the Sqoop Guide:
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
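The range arithmetic in the guide's example can be sketched in a few lines of shell. This is a simplified illustration under stated assumptions, not Sqoop's actual splitter code: split_ranges is a hypothetical helper, and it only handles integer bounds. It prints the WHERE clause each map task would receive for the guide's example (minimum 0, maximum 1000, 4 tasks).

```shell
#!/bin/sh
# Simplified illustration of splitting [min, max] across N map tasks.
# split_ranges is a hypothetical helper, not Sqoop's actual splitter.
split_ranges() {
  min=$1; max=$2; tasks=$3
  step=$(( (max - min) / tasks ))
  i=0
  while [ "$i" -lt "$tasks" ]; do
    lo=$(( min + i * step ))
    if [ "$i" -eq $(( tasks - 1 )) ]; then
      hi=$(( max + 1 ))   # exclusive upper bound, so the last range includes max
    else
      hi=$(( lo + step ))
    fi
    echo "SELECT * FROM sometable WHERE id >= $lo AND id < $hi"
    i=$(( i + 1 ))
  done
}

split_ranges 0 1000 4
```

The ranges are computed from the minimum and maximum only, so every task gets an equally wide slice of the key space regardless of how many rows actually fall inside it.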
My guess is that 4 is a default number that works well in practice for most use cases. Use the parameter --num-mappers if you want Sqoop to use a different number of mappers. For example, to use 8 concurrent tasks you would use the following sqoop command:
sqoop import \
--connect jdbc:mysql://mysql.example.com/testdb \
--username abcdef \
--password 123456 \
--table test \
--num-mappers 8
Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion. However, it will also increase the load on the database, as Sqoop will execute more concurrent queries. You might want to keep this in mind if you are pulling data from your production environment.
When we don't specify the number of mappers while transferring data from an RDBMS to HDFS, Sqoop uses the default of 4 mappers. Sqoop imports data in parallel from most database sources. Sqoop jobs use only mappers, since that is how it performs parallel imports and exports.
If you don't specify the number of mappers, Sqoop will by default use 4 mappers in parallel for the import. If you want to use more than 4 mappers, you can pass --num-mappers in your Sqoop command with whatever number of mappers you need.
Also, if you are not sure whether the source table has a primary key, --autoreset-to-one-mapper comes in handy. If there is a primary key, Sqoop will use the specified number of mappers to execute the job; otherwise it will use just one mapper to import the table without a primary key.
sqoop-import \
--connect jdbc:mysql://localhost/databasename \
--username root \
--password xxxxxxxx \
--table tablename \
--warehouse-dir /directory/path/from/home \
--autoreset-to-one-mapper \
--num-mappers 6
It also comes in handy when you run sqoop import-all-tables to import multiple tables: for all the tables with a primary key it will use the specified number of mappers, and for all the tables without a primary key it will reset the number of mappers to 1 without failing the job.
Note: With this option, tables without a primary key use only one mapper for the Sqoop import unless you provide a --split-by column.
I just started with hands-on Sqoop and I have a question. Let's say I have 300 tables in a database and I want to perform an incremental load on those tables. I understand I can do incremental imports with either append mode or lastmodified mode.
But do I have to create 300 jobs if the only things that vary per job are the table name, the CDC column and the last value/updated value?
Has anyone tried using the same job and passing these values as parameters, read from a text file in a loop, to execute the same job for all the tables in parallel?
What are the industry standard and the recommendations?
Also, is there a way to truncate and reload the Hadoop tables that are very small, instead of performing CDC and merging the tables later?
There is import-all-tables ("Import tables from a database to HDFS").
However, it does not provide a way to use a different CDC column for each table.
Also see: sqoop import multiple tables
There is no truncate option, but the same effect can be achieved with the following:
--delete-target-dir ("Delete the import target directory if it exists")
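The idea from the question of driving one parameterised import from a text file can be sketched as below. This is a dry run under stated assumptions: build_import_cmds is a hypothetical helper, the input format (table, check column, last value per line), the JDBC URL and the /data/<table> target path are all placeholders, and credentials are left out. It only prints the commands, using the --incremental, --check-column and --last-value arguments of sqoop import.

```shell
#!/bin/sh
# Dry-run sketch: print one incremental import command per table.
# Input lines (stdin): <table> <check_column> <last_value>
# build_import_cmds, the JDBC URL and the target path are placeholders.
build_import_cmds() {
  connect=$1
  while read -r table column last_value; do
    [ -z "$table" ] && continue   # skip blank lines
    echo "sqoop import --connect $connect --table $table" \
         "--incremental append --check-column $column --last-value $last_value" \
         "--target-dir /data/$table"
  done
}

printf '%s\n' 'orders order_id 1000' 'customers customer_id 250' |
  build_import_cmds jdbc:mysql://mysql.example.com/testdb
```

For the small tables that are simply reloaded, the same loop could print a plain import with --delete-target-dir instead of the incremental arguments.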
I am importing data from Oracle to Hadoop using Sqoop. The Oracle table has approximately 2 million records with a primary key, which I am providing as the split-by field.
My Sqoop job completes and I get the correct data; the job runs for about 30 minutes. So far, all good.
When I check the output files, I see that the first file is around 1.4 GB, the second file is around 157.2 MB and the last file (the 20th) is around 10.4 MB, whereas all the other files, from the 3rd to the 19th, are 0 bytes.
I am setting -m 20 because I want to run 20 mappers for my job.
Here is the Sqoop command:
sqoop import --connect "CONNECTION_STRING" --query "SELECT * FROM WHERE AND \$CONDITIONS" --split-by .ID --target-dir /output_data -m 20
Note: My cluster is capable of handling 20 mappers, and the database is also capable of handling 20 requests at a time.
Any thoughts?
Dharmesh
From http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_controlling_parallelism...
If the actual values for the primary key are not uniformly distributed
across its range, then this can result in unbalanced tasks.
The --split-by argument can be used to choose a column with better distribution. Normally, this will vary by data type.
Try using a different --split-by field for better load balancing.
This is because the primary key (ID) is not uniformly distributed, so your mappers are not being used appropriately. You must split on some other field that is uniformly distributed.
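To see why a skewed key leaves most mappers with nothing to do, here is a small sketch that buckets a list of ids into evenly wide ranges and counts the rows in each. It is only an illustration: rows_per_split is a hypothetical helper, the ids are made up, and the bucketing is a simplification of what Sqoop does.

```shell
#!/bin/sh
# Count how many ids land in each evenly sized split range.
# rows_per_split is a hypothetical helper; ids are read from stdin, one per line.
rows_per_split() {
  awk -v min="$1" -v max="$2" -v tasks="$3" '
    BEGIN { step = (max - min + 1) / tasks }
    {
      b = int(($1 - min) / step)
      if (b >= tasks) b = tasks - 1
      count[b]++
    }
    END { for (i = 0; i < tasks; i++) printf "split %d: %d rows\n", i, count[i] + 0 }
  '
}

# Made-up ids: most are clustered at the low end, one outlier at the high end.
printf '%s\n' 1 2 3 4 5 6 7 1000 | rows_per_split 1 1000 4
```

With these made-up ids, the first split holds seven rows, the last holds one, and the two middle splits are empty, which is the same pattern as the 0-byte files in the question.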
I was trying to import a 1 TB table in MySQL to HDFS using sqoop. The command used was:
sqoop import --connect jdbc:mysql://xx.xx.xxx.xx/MyDB --username myuser --password mypass --table mytable --split-by rowkey -m 14
After executing the bounding vals query, all the mappers start, but after some time, the tasks get killed due to timeout (1200 seconds). This, I think, is because the time taken to execute the select query running in each mapper takes more than the time set for timeout (in sqoop it seems to be 1200 seconds); and hence it fails to report status, and the task subsequently gets killed. (I have also tried it for 100 gb data sets; it still failed due to timeout for multiple mappers.) For single mapper import, it works fine, as no filtered resultsets are needed. Is there any way to override the map task timeout (say set it to 0 or a very high value) while using multiple mappers in sqoop?
Sqoop uses a special thread to send status updates so that the map task won't get killed by the JobTracker. I would be interested in exploring your issue further. Would you mind sharing the Sqoop log, one of the map task logs and your table schema?
Jarcec