Fetching less data with Sqoop

I am running a Sqoop import against a database with a time range of 2 days, covering about 4 million rows, but at the end it reports that it fetched only 500k rows. Is there any way to increase the fetching volume in Sqoop? My thought is that either JDBC has an error or Sqoop's fetch size is the problem.

Related

Loading last 3 days incremental data from oracle to hdfs using sqoop

How do I import the last 3 days of incremental data from Oracle to HDFS using Sqoop?
Currently I have written a generic Sqoop command using a shell script to import data from multiple Oracle databases for multiple plants.
Can anyone help me write a Sqoop command to import only the last 3 days of data?
In your Sqoop job you can issue SQL, so you can add a date condition to the WHERE clause, assuming the table you are pulling from has a date column.
Example (the table and column names were stripped from the original post and are shown here as placeholders):
select <columns> from <table> where <date_column> >= (CURRENT_DATE - 3);
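The approach above can be sketched as a full command. All names here are invented for illustration: the connection string, the ORDERS table, the UPDATED_AT column, the split column and the target directory are assumptions, not from the post.

```shell
# Sketch only: table/column/connection names below are hypothetical.
# Compute the lower bound (3 days ago) in the shell, then pass it into
# a Sqoop free-form query; $CONDITIONS is required for parallel splits.
START_DATE=$(date -d '3 days ago' +%Y-%m-%d)

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --query "SELECT * FROM ORDERS WHERE UPDATED_AT >= TO_DATE('${START_DATE}','YYYY-MM-DD') AND \$CONDITIONS" \
  --split-by ORDERS.ID \
  --target-dir /data/orders_last3days \
  -m 4
```

Computing the boundary date in the shell (rather than in SQL) also makes it easy to log which window each scripted run actually imported.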

Sqoop: Understanding how num-mappers and fetch-size work together

I am trying to import a table from MySQL incrementally using the following configuration:
--split-by
date_format(updated_at, '%l')
--boundary-query
select 1, 12 from ${table}
--m
12
--incremental
lastmodified
--last-value
${lastValue}
--check-column
updated_at
--merge-key
id
When I run this, I get a Java Heap Space error. After searching a bit, I learned about another config, --fetch-size <n>, which defaults to 1000 and controls the number of rows Sqoop reads from the database at once.
The default container memory allocation is 1 GB, and the table I am pulling is around 100 GB.
I am trying to figure out why it's throwing a Java Heap Space error, since if it only pulls 1000 rows at a time, the data size of 1000 rows should not exceed 1 GB.
Is the fetch-size config being overridden by the split-by, boundary-query and mapper configs?
The idea behind this configuration was to ensure that the data distribution is not skewed, so that a few mappers don't end up pulling all the data. With this config, I split by hour in 12-hour format so that hour 1 and hour 13 get assigned to the same mapper.
Any guidance on this will be really helpful.
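One possibility worth ruling out (a sketch, not a verified fix): with MySQL the value of the fetch size can matter less than how the driver fetches. MySQL Connector/J buffers the entire result set on the client by default, largely regardless of a positive fetch size, which alone can exhaust a 1 GB heap; asking it to stream rows instead means a fetch size of Integer.MIN_VALUE. Assuming your Sqoop version passes a negative --fetch-size through to the driver (an assumption to verify), the import would look like:

```shell
# Sketch (unverified): same incremental import with an explicit fetch size.
# The connection string is a placeholder; ${table} and ${lastValue} are the
# variables from the original post. -2147483648 is Integer.MIN_VALUE, which
# tells MySQL Connector/J to stream rows instead of buffering the whole
# result set in the mapper's heap.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username user -P \
  --table "${table}" \
  --fetch-size -2147483648 \
  --split-by "date_format(updated_at, '%l')" \
  --boundary-query "select 1, 12 from ${table}" \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "${lastValue}" \
  --merge-key id \
  -m 12
```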

Partial data getting sqooped using Sqoop

I am trying to sqoop a Teradata table with 5 billion rows, and only 2.35 billion rows are getting sqooped.
The job completes successfully without any issues. I am using TeradataConnManager with 36 mappers. Can someone provide any pointers on what might be the issue? It is a simple table import into an HDFS directory.

Import to HDFS or Hive (directly)

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1
The source is an MS SQL database of around 1.6 TB, with around 25 tables
The ultimate objective is to check if the existing queries can run faster on the HDP
There isn't the luxury of time and availability to import the data several times; hence, the import has to be done once and the Hive tables, queries, etc. need to be experimented with afterwards. For example, first create a normal, partitioned table in ORC; if that doesn't suffice, try indexes, and so on. Possibly we will also evaluate the Parquet format.
As a solution to the above, I decided to first import the tables onto HDFS in Avro format, for example:
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose
Now I plan to create a Hive table but I have some questions mentioned here.
My question is: given all the points above, what is the safest approach (in terms of time and not messing up HDFS etc.)? First bring the data onto HDFS, create Hive tables and experiment, or import directly into Hive? (I don't know whether, if I later delete these tables and wish to start afresh, I have to re-import the data.)
For Loading, you can try these options
1) You can export the database to CSV files stored on your Linux file system as a backup, then do a distcp to HDFS.
2) As mentioned, you can do a Sqoop import and load the data to Hive table (parent_table).
For checking the performance using different formats and partitioned tables, you can use CTAS (Create Table As Select) queries, where you create new tables from the base table (parent_table). In CTAS you can specify the format, such as Parquet or Avro, and partitioning options are also available.
Even if you delete new tables created by CTAS, the base table will be there.
Based on my experience, Parquet + partitioning gives the best performance, but it also depends on your data.
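A sketch of that CTAS approach, with invented table and column names (parent_table is the Sqoop-loaded base; id, amount and load_date are placeholders). Note that older Hive versions, including the one shipped in HDP 2.3, do not allow PARTITIONED BY directly in a CTAS, so the partitioned copy is built in two steps:

```shell
hive -e "
-- Format comparison via plain CTAS (names are placeholders):
CREATE TABLE parent_table_orc STORED AS ORC AS SELECT * FROM parent_table;

-- Partitioned copy: create the table first, then load it with
-- dynamic partitioning, since old Hive CTAS can't create partitions.
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE parent_table_part (id BIGINT, amount DOUBLE)
  PARTITIONED BY (load_date STRING)
  STORED AS ORC;
INSERT INTO TABLE parent_table_part PARTITION (load_date)
  SELECT id, amount, load_date FROM parent_table;
"
```

Dropping parent_table_orc or parent_table_part later leaves the base table untouched, which is the point of experimenting via CTAS.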
I see that the connection and settings are all correct, but I didn't see --fetch-size in the query. By default --fetch-size is 1000, which would take forever in your case. If the number of columns is small, I would recommend increasing it to --fetch-size 10000. I have gone up to 50000 when there are fewer than 50 columns, and maybe 20000 with 100 columns. I would recommend checking the data size per row and then deciding: if there is one column holding more than 1 MB of data per row, I would not recommend anything above 1000.
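As a rough way to pick a value along these lines, a back-of-envelope check of how much memory one fetch batch occupies (the per-row size below is an assumed figure, not measured; substitute your own, e.g. table size divided by row count):

```shell
# Estimate the memory one fetch batch occupies before raising --fetch-size.
AVG_ROW_BYTES=2000   # assumption: ~2 KB per row (e.g. 100 columns x ~20 bytes)
FETCH_SIZE=10000
BATCH_MB=$(( AVG_ROW_BYTES * FETCH_SIZE / 1024 / 1024 ))
echo "one fetch batch ~ ${BATCH_MB} MB"   # ~19 MB, far below a 1 GB heap
```

If the estimate approaches the mapper heap size, lower the fetch size or raise the container memory.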

Sqoop is not using all the specified mappers

I am importing data from Oracle to Hadoop using Sqoop. The Oracle table has approximately 2 million records, with a primary key that I am providing as the split-by field.
My Sqoop job completes, I get correct data, and the job runs for about 30 minutes; so far, all good.
When I check the output files, the first file is around 1.4 GB, the second is around 157.2 MB and the last (20th) file is around 10.4 MB, whereas all the other files, from the 3rd to the 19th, are 0 bytes.
I am setting -m 20 because I want to run 20 mappers for my job.
here is the sqoop command :
sqoop import --connect "CONNECTION_STRING" --query "SELECT * FROM <table> WHERE <filter> AND \$CONDITIONS" --split-by <table>.ID --target-dir /output_data -m 20
Note: My cluster is capable enough to handle 20 mappers, and the database is also capable of handling 20 requests at a time.
Any thoughts?
Dharmesh
From http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_controlling_parallelism...
If the actual values for the primary key are not uniformly distributed
across its range, then this can result in unbalanced tasks.
The --split-by argument can be used to choose a column with better distribution. Normally, this will vary by data type.
Try using a different --split-by field for better load balancing.
This is because the primary key (ID) is not uniformly distributed; hence, your mappers are not being used appropriately. You should use some other field for splitting that is uniformly distributed.
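When no naturally uniform column exists, one Oracle-specific workaround is to split on a derived bucket. A sketch (all names invented; MYTABLE stands in for the table name elided in the original command): ORA_HASH(ID, 19) maps each row to a bucket 0..19, so the 20 mappers get near-equal slices regardless of how the ID values are distributed. The inner subquery is needed because $CONDITIONS ends up in a WHERE clause, which in Oracle cannot reference a column alias defined in the same SELECT.

```shell
# Sketch (hypothetical names): split on a derived hash bucket instead of
# the skewed primary key.
sqoop import \
  --connect "CONNECTION_STRING" \
  --query "SELECT * FROM (SELECT t.*, ORA_HASH(t.ID, 19) AS SPLIT_BUCKET FROM MYTABLE t) WHERE \$CONDITIONS" \
  --split-by SPLIT_BUCKET \
  --boundary-query "SELECT 0, 19 FROM dual" \
  --target-dir /output_data \
  -m 20
```

The explicit --boundary-query avoids an extra MIN/MAX scan, since the bucket range 0..19 is known by construction.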
