Sqoop import fetching more records from source

I recently encountered an issue with a Sqoop import.
For the same SELECT query, I observe the following:
--num-mappers 10 : Sqoop import fetches 10x records
--num-mappers 15 : Sqoop import fetches 15x records
--num-mappers 1 : Sqoop import fetches exactly x records
The SELECT query contains a LEFT OUTER JOIN which, when run directly on the database, returns x records, which is what I am trying to retrieve.
The query is:
SELECT table1.*,table2.* from table1 left outer join table2 on
(table1.tab1_id = table2.tab2_id);
As tab2_id is the primary key of table2, I am using it for the --split-by clause.
But I am unable to understand why Sqoop returns a different number of records when a different number of mappers is specified.

The issue was that the SQL query Sqoop internally submitted to the job was not yielding the correct result set.
Reason:
The Sqoop command looked like this:
sqoop import --connect <JDBC connection string> -m 10 \
--hive-drop-import-delims --fields-terminated-by '\001' \
--fetch-size=10000 --split-by <PK column> \
--query "SELECT table1.*,table2.* from table1 left outer join table2 on (table1.tab1_id = table2.tab2_id) AND \$CONDITIONS"
This $CONDITIONS token is internally replaced by the lower/upper boundary values for each split.
But because the query has no WHERE clause, the $CONDITIONS block has no effect on how the data is split across the mappers, so every mapper that is spawned ends up pulling the entire result set.
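For reference, a sketch of the corrected command (same placeholders as above); moving $CONDITIONS into a real WHERE clause lets each mapper's boundary predicate actually restrict the rows it pulls:
sqoop import --connect <JDBC connection string> -m 10 \
--hive-drop-import-delims --fields-terminated-by '\001' \
--fetch-size=10000 --split-by <PK column> \
--query "SELECT table1.*,table2.* FROM table1 LEFT OUTER JOIN table2 ON (table1.tab1_id = table2.tab2_id) WHERE \$CONDITIONS"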

Related

Explain and show an example of why we use $CONDITIONS in Sqoop

sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --query 'select * from table_name where $CONDITIONS'
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id \
--target-dir /user/foo/joinresults
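With --split-by a.id and, say, 2 mappers, Sqoop first runs a bounding query such as SELECT MIN(a.id), MAX(a.id) over the result set, then hands each mapper a copy of the query with $CONDITIONS replaced by that mapper's own range. An illustrative sketch (the boundary values below are invented; Sqoop computes them at runtime):
SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE ( a.id >= 1 ) AND ( a.id < 5001 )
SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE ( a.id >= 5001 ) AND ( a.id <= 10000 )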

Hadoop-Sqoop import without an integer value using split-by

I am importing data from MemSQL to HDFS using Sqoop. My source table in MemSQL doesn't have any integer column, so I added a new computed column 'test' alongside the existing columns.
The following is the query:
sqoop import --connect jdbc:mysql://XXXXXXXXX:3306/db_name --username XXXX --password XXXXX --query "select closed,extract_date,open,close,cast(floor(rand()*1000000 as int) as test from tble_name where \$CONDITIONS" --target-dir /user/XXXX --split-by test;
This query gave me the following error:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'as int) as test from table_name where (1 = 0)' at line 1
I tried it another way as well:
sqoop import --connect jdbc:mysql://XXXXX:3306/XXXX --username XXXX --password XXXX --query "select closed,extract_date,open,close,ceiling(rand()*1000000) as test from table_name where \$CONDITIONS" --target-dir /user/dfsdlf --split-by test;
With this query the job gets executed, but no data is transferred. Sqoop says the split-by column is of float type and must be strictly changed to an integer type.
Please help me change the split-by column from float type to integer type.
The problem mostly seems to be related to using an alias as the --split-by parameter.
If you need to split on that particular computed column, run the query
'select closed,extract_date,open,close,ceiling(rand()*1000000) from table_name' in the console, take the column name the database reports for that expression, and use it in --split-by 'complete_column_name_from_console' (here it should be --split-by 'ceiling(rand()*1000000)').
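A sketch of what the adjusted command might look like, keeping the column list from the question; CAST(... AS SIGNED) stands in for the invalid CAST(... AS INT), and the expression itself is repeated in --split-by instead of the alias:
sqoop import --connect jdbc:mysql://XXXXX:3306/db_name --username XXXX --password XXXX \
--query "select closed,extract_date,open,close,cast(floor(rand()*1000000) as signed) as test from table_name where \$CONDITIONS" \
--split-by 'cast(floor(rand()*1000000) as signed)' \
--target-dir /user/dfsdlf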

Is it possible to write a Sqoop incremental import with filters on the new file before importing?

My doubt is this: say I have a file A1.csv with 2000 records in a SQL Server table, and I import this data into HDFS. Later that day I add 3000 records to the same table on SQL Server.
Now I want to run an incremental import for that second chunk of data to be added to HDFS, but I do not want all 3000 records to be imported. I need only some of the data, according to my requirement, to be imported, e.g. 1000 records matching a certain condition, as part of the incremental import.
Is there a way to do that using the Sqoop incremental import command?
Please help, thank you.
You need a unique key or a timestamp field to identify the deltas, which are the new 1000 records in your case. Using that field you have two options to bring the data into Hadoop.
Option 1
Use Sqoop incremental append; below is an example of it:
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major
Arguments :
--check-column (col) #Specifies the column to be examined when determining which rows to import.
--incremental (mode) #Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) #Specifies the maximum value of the check column from the previous import.
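For a subsequent run, --last-value tells Sqoop where the previous import stopped; a sketch extending the example above, assuming the first import ended at student_id 2000 (the value is illustrative):
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
--last-value 2000 \
-m 4 \
--split-by major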
Option 2
Use the --query argument in Sqoop, where you can write native SQL for MySQL or whichever database you connect to.
Example :
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
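To get the filtered incremental load the question asks for, the extra condition can simply sit next to $CONDITIONS in the free-form query. A sketch, where id is the delta column and status is a hypothetical filter column standing in for your own condition:
sqoop import \
--connect <JDBC connection string> \
--query 'SELECT * FROM my_table WHERE id > 2000 AND status = 1 AND $CONDITIONS' \
--split-by id \
--target-dir /user/foo/my_table_delta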

sqoop free form query to import n records from a table

I'm trying to import 50 records from a single table using the following query:
sqoop import --connect jdbc:mysql://xxxxxxx/db_name --username yyyyy --query 'select * from table where (id <50) AND $CONDITIONS' --target-dir /user/tmp/ -P
I'm getting an error with this query.
Any ideas?
I removed the parentheses in the WHERE clause and it worked. When you do use two or more logical operators, wrap them in parentheses, otherwise it doesn't work.
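A sketch of the command with the parentheses dropped, as described above; table_name and id stand in for the real table and key column, and --split-by id is added because a free-form --query import needs either --split-by or -m 1:
sqoop import --connect jdbc:mysql://xxxxxxx/db_name --username yyyyy --query 'select * from table_name where id < 50 AND $CONDITIONS' --split-by id --target-dir /user/tmp/ -P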

Sqoop - can I bulk import multiple mysql tables to one HBase/Hive table

If I have multiple similar tables, e.g.:
table A: "users", columns: user_name, user_id, user_address, etc.
table B: "customers", columns: customer_name, customer_id, customer_address, etc.
table C: "employee", columns: employee_name, employee_id, employee_address, etc.
Is it possible to use Sqoop to import the three tables into one HBase or Hive table, so that after the import I have one HBase table containing all the records from tables A, B, and C?
It's definitely possible if the tables are somehow related. A free-form query can be used in Sqoop to do exactly that. In this case, the free-form query would be a join. For example, when importing into Hive:
sqoop import --connect jdbc:mysql:///mydb --username hue --password hue --query "SELECT * FROM users JOIN customers ON users.id=customers.user_id JOIN employee ON users.id = employee.user_id WHERE \$CONDITIONS" --split-by oozie_job.id --target-dir "/tmp/hue" --hive-import --hive-table hive-table
Similarly, for HBase:
sqoop import --connect jdbc:mysql:///mydb --username hue --password hue --query "SELECT * FROM users JOIN customers ON users.id=customers.user_id JOIN employee ON users.id = employee.user_id WHERE \$CONDITIONS" --split-by oozie_job.id --hbase-table hue --column-family c1
The key ingredient in all of this is the SQL statement being provided:
SELECT * FROM users JOIN customers ON users.id=customers.user_id JOIN employee ON users.id = employee.user_id WHERE $CONDITIONS
For more information on free-form queries, check out http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_free_form_query_imports.
