What is the use of --split-by and --boundary-query in Sqoop?
Can anyone tell me what the use of --split-by and --boundary-query in Sqoop is?
sqoop import \
--connect jdbc:mysql://localhost/my \
--username user --password 1234 \
--query 'select * from table where id=5 AND $CONDITIONS' \
--split-by table.id \
--target-dir /dir
--split-by : Specifies the column of the table used to generate splits for the import, i.e. which column Sqoop uses to divide the data before loading it into your cluster. It can enhance import performance by achieving greater parallelism: Sqoop creates splits based on the values in the column named by --split-by in the import command. If --split-by is not given, the primary key of the input table is used to create the splits.
Reason to use : Sometimes the primary key doesn't have an even distribution of values between its min and max (which are used to create the splits when --split-by is absent). In such a situation you can specify another column that has a proper distribution of data, so the splits give you efficient imports.
--boundary-query : By default Sqoop uses the query select min(&lt;split column&gt;), max(&lt;split column&gt;) from &lt;table&gt; to find the boundaries for creating splits. In some cases this query is not optimal, so you can specify any arbitrary query that returns two numeric columns using the --boundary-query argument.
Reason to use : If --split-by is not giving you optimal performance, you can use this to improve it further.
--split-by is used to distribute the values from the table across the mappers uniformly, i.e. say you have 100 unique records (by primary key) and 4 mappers: --split-by on the primary-key column will distribute your data-set evenly among the mappers.
$CONDITIONS is used by the Sqoop process; Sqoop replaces it internally with a unique condition expression so that each mapper fetches its own slice of the data-set.
If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. E.g., one mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on.
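The substitution above can be sketched as a small shell function. This is purely illustrative, not Sqoop's actual code: the function name `compute_splits`, the column name `id`, and the boundary values are all assumptions.

```shell
# Illustrative sketch: divide [min, max) evenly among the mappers
# and emit the per-mapper WHERE clause Sqoop would substitute for $CONDITIONS.
compute_splits() {
  min=$1; max=$2; mappers=$3
  step=$(( (max - min) / mappers ))
  lo=$min
  i=1
  while [ "$i" -le "$mappers" ]; do
    # last mapper takes the remainder up to max
    if [ "$i" -eq "$mappers" ]; then hi=$max; else hi=$(( lo + step )); fi
    echo "select bla from foo WHERE (id >= $lo AND id < $hi)"
    lo=$hi
    i=$(( i + 1 ))
  done
}

compute_splits 0 30000 3
```

With min=0, max=30000 and 3 mappers this prints the three half-open ranges from the example above, one query per mapper.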
Sqoop allows you to import data in parallel, and --split-by and --boundary-query give you more control over that. If you're just importing a table, Sqoop will use the PRIMARY KEY; if you're doing a more advanced query, you'll need to specify the column on which to do the parallel split.
e.g.,
sqoop import \
--connect 'jdbc:mysql://.../...' \
--direct \
--username uname --password pword \
--hive-import \
--hive-table query_import \
--boundary-query 'SELECT 0, MAX(id) FROM a' \
--query 'SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND $CONDITIONS' \
--num-mappers 3 \
--split-by a.id \
--target-dir /data/import \
--verbose
--boundary-query lets you specify an optimized query to get the min and max; otherwise Sqoop will attempt MIN(a.id), MAX(a.id) on your --query statement.
The result (if min=0, max=30) is 3 queries that get run in parallel:
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 0 AND 10;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 11 AND 20;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 21 AND 30;
Split by:
Why is it used? To increase the speed of fetching data from the RDBMS into Hadoop.
How does it work? By default Sqoop runs 4 mappers, so the import works in parallel. The data is divided into equal partitions: Sqoop takes the primary-key column, finds its minimum and maximum values, and builds 4 ranges for the 4 mappers to work on.
E.g. with 1000 records in the primary-key column, max value = 1000 and min value = 0, Sqoop creates 4 ranges, (0-250), (250-500), (500-750), (750-1000), and depending on the column values the data is partitioned and given to the 4 mappers to store on HDFS.
So if the primary-key column is not evenly distributed, with --split-by you can choose a different column for even partitioning.
In short: it is used to partition the data to support parallelism and improve performance.
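The range arithmetic above can be checked with a few lines of shell. This mirrors the 1000-record example and is only a sketch of the idea, not Sqoop's implementation:

```shell
# Split the key range 0..1000 into 4 equal partitions, one per mapper,
# the way the example above describes.
min=0; max=1000; mappers=4
step=$(( (max - min) / mappers ))
lo=$min
i=1
while [ "$i" -le "$mappers" ]; do
  # the last mapper's range ends exactly at max
  if [ "$i" -eq "$mappers" ]; then hi=$max; else hi=$(( lo + step )); fi
  echo "mapper $i: ($lo-$hi)"
  lo=$hi
  i=$(( i + 1 ))
done
```

This prints the four ranges (0-250), (250-500), (500-750), (750-1000) from the example.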
Also, if we specify the --query value within double quotes (" "), we need to precede $CONDITIONS with a backslash (\):
--query "select * from table where id=5 AND \$CONDITIONS"
or else
--query 'select * from table where id=5 AND $CONDITIONS'
Related
Sqoop --where condition with multiple clauses
I'm trying to get data from Oracle and import it into a Hadoop table. I'm making changes in an existing Sqoop job, and I have to use --where to filter the records. For now we have a where date=somedate condition; now I need to add another condition like date = somedate and status = 'Active'. I have to make this change in --where. I'm not allowed to use --query 🥺. Can you guys help me on this?
You can try it like this: --query "select * from table where status = 'Active' AND date = somedate AND \$CONDITIONS"
Use the --where condition wrapped in double quotes, like below: --where "date = somedate and status = 'Active'". And the good news is you can add as many conditions as you need. In fact you can add a subquery as well; it just has to be syntactically correct for the database.
This syntax works for me: --query "select * from table where date = somedate AND status = 'Active' AND (\$CONDITIONS)". There is no need to use --where.
Issue in using Where Clause in SQOOP
I am trying to use the --where option to get conditional data by joining the orders table with the order_items table, using the command below:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id where \$CONDITIONS" \
--where "order_id between 10840 and 10850" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id
Now I don't know what's wrong with this, because when I run the same query in MySQL I get 41 records, which is correct, but when I run this command Sqoop dumps all 172198 records. I don't understand what's happening and what's going wrong.
When you run a parallel import, Sqoop uses the column given in --split-by to substitute the $CONDITIONS parameter and generate different queries (which are executed by different mappers). For instance, Sqoop first finds the minimum and maximum value of order_id and, depending on the number of mappers, executes your query against different subsets of the whole range of possible order_id values. That way, your query is translated internally into different parallel queries like these:
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id >= 0 AND order_id < 10000)
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id >= 10000 AND order_id < 20000)
...
So the --where clause you specified separately is not used, and you end up with all the records. But in your particular case you don't really need the --split-by flag, because you are only interested in a particular (and very limited) range of values. So you could use this instead:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850)" \
--target-dir /user/cloudera/order_join_conditional \
-m 1
Note also the -m 1 at the end, which (as pointed out by dev ツ) stands for --num-mappers and tells Sqoop to use just one mapper for the import process (therefore, no parallelism).
If the range of values were bigger, you could use --split-by and put your where condition in the free-form query, making use of the parallelism:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850) AND \$CONDITIONS" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id
Set a constant boundary query
I am using Sqoop to import MySQL tables to HDFS. To do that, I use a free-form query import:
--query "SELECT $query_select FROM $table where \$CONDITIONS"
This query is quite slow because of the min(id) and max(id) search. To improve performance, I've decided to use --boundary-query and specify the lower and upper bounds manually (https://www.safaribooksonline.com/library/view/apache-sqoop-cookbook/9781449364618/ch04.html):
--boundary-query "select 176862848, 172862848"
However, Sqoop ignores the specified values and again tries to find the minimum and maximum "id" by itself:
16/06/13 14:24:44 INFO tool.ImportTool: Lower bound value: 170581647
16/06/13 14:24:44 INFO tool.ImportTool: Upper bound value: 172909234
The complete sqoop command:
sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number \
--connect jdbc:mysql://$server:$port/$database --username $username --password $password \
--target-dir $destination_dir --boundary-query "select 176862848, 172862848" \
--incremental append --check-column $id_column_name --last-value $last_value \
--split-by $id_column_name --query "SELECT $query_select FROM $table where \$CONDITIONS" \
--fields-terminated-by , --escaped-by \\ --enclosed-by '\"'
Has anyone already met/solved this problem? Thanks
I've managed to solve this problem by deleting the following arguments: --incremental append --check-column $id_column_name --last-value $last_value. It seems there is a conflict between the --boundary-query, --check-column, --split-by and --incremental append arguments.
You are correct. We should not use --split-by together with the --boundary-query control argument.
Try it like this: --boundary-query "select 176862848, 172862848 from tablename limit 1"
Hive Joins query
I have two tables in Hive:
Table 1:
1,Nail,maher,24,6.2
2,finn,egan,23,5.9
3,Hadm,Sha,28,6.0
4,bob,hope,55,7.2
Table 2:
1,Nail,maher,24,6.2
2,finn,egan,23,5.9
3,Hadm,Sha,28,6.0
4,bob,hope,55,7.2
5,john,hill,22,5.5
6,todger,hommy,11,2.2
7,jim,cnt,99,9.9
8,will,hats,43,11.2
Is there any way in Hive to retrieve the new data in Table 2 that doesn't exist in Table 1? In other database tools you would use an inner left/right join, but that doesn't exist in Hive. Any suggestions how this could be achieved?
If you are using Hive version >= 0.13 you can try this query:
SELECT * FROM A WHERE (A.firstname, A.lastname, ...) IN (SELECT B.firstname, B.lastname, ... FROM B);
But I'm not sure if Hive supports multiple columns in the IN clause. If not, something like this could work:
SELECT * FROM A WHERE A.firstname IN (SELECT B.firstname FROM B) AND A.lastname IN (SELECT B.lastname FROM B) ...;
It might be wiser to concatenate the fields together before testing with NOT IN:
SELECT * FROM t2 WHERE CONCAT(t2.firstname, t2.lastname, CAST(t2.val1 AS STRING), CAST(t2.val2 AS STRING)) NOT IN (SELECT CONCAT(t1.firstname, t1.lastname, CAST(t1.val1 AS STRING), CAST(t1.val2 AS STRING)) FROM t1)
Performing sequential NOT IN sub-queries may give you erroneous results. From the above example, a new record with the values ('nail','egan',28,7.2) would not show up as new with sequential NOT IN statements.
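The pitfall can be reproduced with a small shell check. The data below is the Table 1 sample normalized to lowercase, and the candidate row is the ('nail','egan',28,7.2) example: every one of its fields exists somewhere in Table 1, but the combination does not.

```shell
# Table 1 rows, one CSV record per line (lowercased for the comparison).
t1='nail,maher,24,6.2
finn,egan,23,5.9
hadm,sha,28,6.0
bob,hope,55,7.2'

# Candidate row: each field occurs individually, the combination is new.
candidate='nail,egan,28,7.2'

whole_row=not-new
field_check=absent

# Whole-row (concatenated) comparison: correctly reports the row as new.
if ! printf '%s\n' "$t1" | grep -qxF "$candidate"; then
  whole_row=new
fi

# Field-by-field comparison (like sequential NOT IN sub-queries): the first
# name alone is found, so this check wrongly suggests the row is not new.
fname=${candidate%%,*}
if printf '%s\n' "$t1" | cut -d, -f1 | grep -qxF "$fname"; then
  field_check=present
fi

echo "whole-row check: $whole_row"
echo "field check: first name $field_check"
```

The concatenated check flags the row as new while the per-field check finds every value already present, which is exactly why sequential NOT IN sub-queries can drop genuinely new rows.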
Equivalent of 'IN' or 'NOT' operator in sqoop
I have a Sqoop job to run. The conditions include: WHERE cond1='' AND date = '2-12-xxxx' AND date = '3-12-xxxx' AND date = '3-12-xxxx'. Is there an IN conditional in Sqoop similar to SQL?
You can run a Sqoop import using --query and pass any query to fetch the data. With --where you pass conditions like this: --where "cond1='value' and cond2 in (&lt;comma separated values&gt;)". When you use a where condition on a table import, Sqoop applies it as select * from &lt;table&gt; where &lt;condition specified in --where&gt; to fetch the data, so you can pass any valid condition in --where.