sqoop import --connect jdbc:mysql://localhost/retail_db --username root --password cloudera --query 'select * from table_name where $CONDITIONS'
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id \
--target-dir /user/foo/joinresults
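To make the substitution described above more concrete, here is a rough bash sketch that expands a query template once per simulated map task. The range arithmetic and the exact predicate text are assumptions for illustration only; Sqoop computes its own boundaries from the splitting column, and the function name expand_conditions is hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: expand a query template once per simulated map task.
# The exact predicate text Sqoop generates is an assumption here.

# expand_conditions TEMPLATE COLUMN MIN MAX MAPPERS
# Prints one query per simulated map task.
expand_conditions() {
  local template=$1 col=$2 lo=$3 hi=$4 n=$5
  local size=$(( (hi - lo + n) / n ))   # ceiling of (hi - lo + 1) / n
  local i start end cond
  for (( i = 0; i < n; i++ )); do
    start=$(( lo + i * size ))
    end=$(( start + size ))
    if (( i == n - 1 )); then
      # Last split is closed at the upper bound so the maximum is included.
      cond="( $col >= $start ) AND ( $col <= $hi )"
    else
      cond="( $col >= $start ) AND ( $col < $end )"
    fi
    # Replace the literal token $CONDITIONS with this task's predicate.
    printf '%s\n' "${template//\$CONDITIONS/$cond}"
  done
}

query='SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS'
expand_conditions "$query" a.id 1 100 4
```

Each printed line is the query one map task would run in this simplified model, which is why the token has to sit somewhere it can actually restrict the rows returned.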
I need to import data from a few different SQL Servers that have the same tables, table structure, and even primary key values. To uniquely identify a record ingested from a given SQL Server, say "S1", I want an extra column, say "serverName", in my Hive tables. How should I add this in my Sqoop free-form query?
All I want to do is pass a hardcoded value along with the list of columns, so that the hardcoded value gets stored in Hive. Once that works, I can take care of changing this value dynamically depending on the server.
sqoop import --connect "connDetails" --username "user" --password "pass" --query "select col1, col2, col3, 'S1' from table where \$CONDITIONS" --hive-import --hive-overwrite --hive-table stg.T1 --split-by col1 --as-textfile --target-dir T1 --hive-drop-import-delims
S1 is the hardcoded value here. I am thinking of it the SQL way: when you select a hardcoded value, it is returned as part of the query result. Any pointers on how to get this done?
Thanks in Advance.
SOLVED: Actually it just needed an alias for the hardcoded value. So the sqoop command executed is -
sqoop import --connect "connDetails" --username "user" --password "pass" --query "select col1, col2, col3, 'S1' as serverName from table where \$CONDITIONS" --hive-import --hive-overwrite --hive-table stg.T1 --split-by col1 --as-textfile --target-dir T1 --hive-drop-import-delims
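For the "dynamically changing this value" part, one possible approach is to generate the query text per server in a small wrapper script. This is only a sketch: the function name build_query and the server labels S1 and S2 are hypothetical, and the column and table names are the placeholders from the command above.

```shell
#!/usr/bin/env bash
# Sketch: build the free-form query with a per-server constant column.
# build_query and the server labels are hypothetical names.

build_query() {
  local server=$1
  # \$CONDITIONS stays a literal token for Sqoop to replace later.
  printf "select col1, col2, col3, '%s' as serverName from table where \$CONDITIONS\n" "$server"
}

for server in S1 S2; do
  printf 'query for %s: %s\n' "$server" "$(build_query "$server")"
  # To run it for real, pass the text unchanged, for example:
  #   sqoop import ... --query "$(build_query "$server")" ...
done
```

Note that with --hive-overwrite each run would replace the previous server's rows in the same Hive table, so that flag probably needs rethinking if several servers load into stg.T1.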
My question is this: say I have a file A1.csv with 2000 records in a SQL Server table, and I import this data into HDFS. Later that day, 3000 more records are added to the same SQL Server table.
Now I want to run an incremental import to add the second chunk of data to HDFS, but I do not want all 3000 records to be imported. I only need the records that meet a certain condition, for example 1000 of them, to be imported as part of the incremental import.
Is there a way to do that using the sqoop incremental import command?
Please help. Thank you.
You need a unique key or a timestamp field to identify the deltas, which are the new 1000 records in your case. Using that field, you have two options to bring the data into Hadoop.
Option 1
Use sqoop incremental append. An example is below:
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major
Arguments:
--check-column (col) # Specifies the column to be examined when determining which rows to import.
--incremental (mode) # Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) # Specifies the maximum value of the check column from the previous import.
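The example command above does not pass --last-value, so here is a small dry-run sketch of how the three arguments fit together. It only prints the command instead of running it. The helper name incremental_args, the LAST_VALUE variable, and the default of 2000 (taken from the 2000 rows in the question) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Dry run: assemble the incremental arguments and print the command.
# incremental_args and LAST_VALUE are hypothetical names for this sketch.

incremental_args() {
  local check_col=$1 last_value=$2
  printf '%s\n' "--incremental append --check-column $check_col --last-value $last_value"
}

# Highest student_id seen by the previous import (assumed to be 2000).
last_value=${LAST_VALUE:-2000}

echo "sqoop import --table STUDENT $(incremental_args student_id "$last_value")"
```

With these arguments only rows whose check column is greater than the last value should be picked up, so the value has to be updated after every successful run.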
Option 2
Use the --query argument in sqoop, where you can write native SQL for MySQL or whichever database you connect to.
Example :
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
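Applied to the question above, Option 2 could look roughly like the following: the query itself selects only rows newer than the last imported key and matching the extra condition. This is a sketch under assumptions: the STUDENT table and student_id column are borrowed from the Option 1 example, while build_delta_query and the filter major = 'CS' are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: free-form query that picks only new rows matching a filter.
# build_delta_query and the filter value are hypothetical.

build_delta_query() {
  local last_id=$1 filter=$2
  # The filter is wrapped in parentheses so it combines safely with
  # the other AND terms; \$CONDITIONS is left as a literal token.
  printf '%s\n' "SELECT * FROM STUDENT WHERE student_id > $last_id AND ($filter) AND \$CONDITIONS"
}

query=$(build_delta_query 2000 "major = 'CS'")
printf '%s\n' "$query"
# To run it, pass the text unchanged, for example:
#   sqoop import ... --query "$query" --split-by student_id --target-dir ...
```

The trade-off is that you have to track the last imported key yourself, since the filtering happens in your SQL rather than in Sqoop's incremental mode.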
I'm trying to import "50" records from a single table using the following query:
sqoop import --connect jdbc:mysql://xxxxxxx/db_name --username yyyyy --query 'select * from table where (id <50) AND $CONDITIONS' --target-dir /user/tmp/ -P
I'm getting an error with this query.
Any ideas?
I removed the parentheses in the WHERE clause and it worked. When using two or more logical operators, use parentheses; otherwise it doesn't work.
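To illustrate the point about multiple logical operators, the sketch below substitutes a made-up split predicate into two versions of a query that mixes OR and AND. The predicate text, the flag column, and the expand helper are all assumptions; the only purpose is to show what the database would see after substitution.

```shell
#!/usr/bin/env bash
# Sketch: what a query looks like after $CONDITIONS is substituted.
# The split predicate and the flag column are hypothetical.

expand() {
  # Replace the literal token $CONDITIONS in $1 with the predicate in $2.
  printf '%s\n' "${1//\$CONDITIONS/$2}"
}

cond='( id >= 0 ) AND ( id < 25 )'
unsafe='select * from table where id < 50 OR flag = 1 AND $CONDITIONS'
safe='select * from table where (id < 50 OR flag = 1) AND $CONDITIONS'

expand "$unsafe" "$cond"
expand "$safe" "$cond"
```

In the first output AND binds tighter than OR, so the split predicate only restricts the flag = 1 branch and every task would also return all rows with id < 50. In the second, the parentheses keep the whole filter together before the split predicate is applied.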
If I have multiple similar tables, e.g:
table A: "users", columns: user_name, user_id, user_address, etc etc
table B: "customers" columns: customer_name, customer_id, customer_address, etc etc
table C: "employee" columns: employee_name, employee_id, employee_address, etc etc
Is it possible to use Sqoop to import the three tables into one HBase or Hive table, so that after the import I have one HBase table containing all the records from tables A, B, and C?
It's definitely possible if the tables are somehow related. A free-form query can be used in Sqoop to do exactly that. In this case, the free-form query would be a join. For example, when importing into Hive:
sqoop import --connect jdbc:mysql:///mydb --username hue --password hue --query "SELECT * FROM users JOIN customers ON users.id=customers.user_id JOIN employee ON users.id = employee.user_id WHERE \$CONDITIONS" --split-by users.id --target-dir "/tmp/hue" --hive-import --hive-table hive-table
Similarly, for Hbase:
sqoop import --connect jdbc:mysql:///mydb --username hue --password hue --query "SELECT * FROM users JOIN customers ON users.id=customers.user_id JOIN employee ON users.id = employee.user_id WHERE \$CONDITIONS" --split-by users.id --hbase-table hue --column-family c1
The key ingredient in all of this is the SQL statement being provided:
SELECT * FROM users JOIN customers ON users.id=customers.user_id JOIN employee ON users.id = employee.user_id WHERE \$CONDITIONS
For more information on free-form queries, check out http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_free_form_query_imports.
I recently encountered an issue with SQOOP import.
When I specify the following:
--num-mappers 10, Sqoop import fetches 10x records
--num-mappers 15, Sqoop import fetches 15x records
--num-mappers 1, Sqoop import fetches the exact number of records
for the same select query.
The select query contains a LEFT OUTER JOIN which, when run on the DB, returns x records, which is what I am trying to retrieve.
The query is :
SELECT table1.*,table2.* from table1 left outer join table2 on
(table1.tab1_id = table2.tab2_id);
As tab2_id is the PK of table2, I am using it for the --split-by clause.
But I am unable to understand why Sqoop returns a different number of records when a different number of mappers is specified.
The issue was that the SQL query ultimately submitted by the Sqoop job did not yield the correct result set.
Reason:
The Sqoop command looked like:
sqoop import --connect <JDBC connection string> -m 10 \
--hive-drop-import-delims --fields-terminated-by '\001' --fetch-size=10000 \
--split-by <PK column> \
--query "SELECT table1.*,table2.* from table1 left outer join table2 on (table1.tab1_id = table2.tab2_id) AND \$CONDITIONS"
Sqoop internally replaces $CONDITIONS with the lower and upper boundary values for each mapper.
Because the query has no WHERE clause, $CONDITIONS here is attached to the ON condition of the join. In a LEFT OUTER JOIN, an ON condition does not remove any rows of table1, so $CONDITIONS has no effect on how the data is split: every mapper that is spawned receives the entire data set.
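A likely fix, going by the explanation above, is to move $CONDITIONS into a real WHERE clause. The sketch below shows the two query texts side by side together with a crude check that could be run before submitting a job. The helper has_where_conditions is hypothetical and only a text heuristic, not anything Sqoop provides.

```shell
#!/usr/bin/env bash
# Sketch: crude pre-flight check that $CONDITIONS follows a WHERE keyword.
# has_where_conditions is a hypothetical helper and only a text heuristic.

has_where_conditions() {
  printf '%s\n' "$1" | grep -qiE 'where[^;]*\$CONDITIONS'
}

broken='SELECT table1.*,table2.* from table1 left outer join table2 on (table1.tab1_id = table2.tab2_id) AND $CONDITIONS'
fixed='SELECT table1.*,table2.* from table1 left outer join table2 on (table1.tab1_id = table2.tab2_id) WHERE $CONDITIONS'

for q in "$broken" "$fixed"; do
  if has_where_conditions "$q"; then
    echo "ok:      $q"
  else
    echo "suspect: $q"
  fi
done
```

One more thing to consider, as an assumption rather than something tested here: once the range predicate is in the WHERE clause, rows of table1 with no match in table2 have a NULL tab2_id and may be filtered out, so splitting on table1.tab1_id might be the safer choice for a left outer join.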