I am trying to use the --where option to get conditional data by joining the orders table with the order_items table, using the command below:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id where \$CONDITIONS " \
--where "order_id between 10840 and 10850" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id
Now I don't know what's wrong with this, because when I run the same query in MySQL I get 41 records, which is correct. But when I run this command in Sqoop it dumps all 172198 records. I don't understand what's happening and what's going wrong.
When you run a parallel import, Sqoop uses the column specified in --split-by to substitute the $CONDITIONS placeholder and generate different queries (which will be executed by different mappers). For instance, Sqoop will first find the minimum and maximum values of order_id and, depending on the number of mappers, execute your query against different subsets of the whole range of possible values of order_id.
That way, your query would be translated internally to different parallel queries like these ones:
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
WHERE (order_id >=0 AND order_id < 10000)
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
WHERE (order_id >= 10000 AND order_id < 20000)
...
So in this case, the --where clause you specified separately will not be used, and you'll end up with all the records. But in your particular case you don't really need the --split-by flag, because you are only interested in a particular (and very limited) range of values. So you could use this instead:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850)" \
--target-dir /user/cloudera/order_join_conditional \
-m 1
Note also the -m 1 at the end, which (as pointed out by dev ツ) stands for --num-mappers and tells Sqoop to use just one mapper for the import process (therefore, no parallelism).
If the range of values were bigger, you could keep --split-by and put your where condition in the free-form query, making use of the parallelism:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850) AND \$CONDITIONS" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id
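With this form, Sqoop ANDs its generated split ranges into $CONDITIONS, so every mapper keeps your BETWEEN filter. Roughly (the split points are illustrative, assuming two mappers):
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
WHERE (order_id BETWEEN 10840 AND 10850) AND (order_id >= 10840 AND order_id < 10846)
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
WHERE (order_id BETWEEN 10840 AND 10850) AND (order_id >= 10846 AND order_id <= 10850)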
Related
I'm trying to get data from Oracle and import it into a Hadoop table. I'm making changes in an existing Sqoop job, and I have to use --where to filter the records. For now we have a where date=somedate condition; now I need to add another condition, like date = somedate and status ='Active'. I have to make this change in --where. I'm not allowed to use --query 🥺.
Can you guys help me with this?
You can try it like this:
--query "select * from table where status = 'Active' AND date=somedate AND $CONDITIONS"
Use the --where condition wrapped in double quotes, like below:
--where " date = somedate and status ='Active'"
And the good news is you can add as many conditions as you need. In fact, you can add a subquery as well; it just has to be syntactically correct for your database.
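For context, here is a sketch of how that fragment sits in a full table-based import (the connection string, table name, and target dir are placeholders, and somedate stands for your existing date literal):
sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username user -P \
--table my_table \
--where "date = somedate and status ='Active'" \
--target-dir /user/hadoop/my_table \
-m 1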
This syntax works for me:
--query "select * from table where date=somedate AND status ='Active' AND ($CONDITIONS)"
With it, there is no need to use --where.
I am running the following sqoop import from Teradata:
sqoop import --driver com.teradata.jdbc.TeraDriver \
--connect jdbc:teradata://telearg7/DATABASE=AR_PROD_HUB_DIM_VW,CHARSET=UTF8,CLIENT_CHARSET=UTF-8,TCP=SEND1500,TCP=RECEIVE1500 \
--verbose \
--username ld_hadoop \
--password xxxx \
--query "SELECT G.suscripcion_id , G.valor_recurso_primario_cd , G.suscripcion_cd , G.fecha_migra_id FROM ( SELECT DISTINCT a.suscripcion_id as suscripcion_id, a.valor_recurso_primario_cd as valor_recurso_primario_cd , f.suscripcion_cd as suscripcion_cd, a.fecha_fin_orden_id AS fecha_migra_id , row_number() over (partition by a.valor_recurso_primario_CD order by a.Fecha_Fin_Orden_ID DESC) as row_num FROM AR_PROD_HUB_DIM_VW.F_TR_CAMBIO_OFERTA_D A INNER JOIN AR_PROD_HUB_DIM_VW.D_ESTADO_OPERACION B ON A.ESTADO_OPERACION_ID = B.ESTADO_OPERACION_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_ESTADO_ORDEN C ON A.ESTADO_ORDEN_ID = C.ESTADO_ORDEN_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_TIPO_OFERTA D ON A.TIPO_OFERTA_ID = D.TIPO_OFERTA_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_TIPO_OFERTA E ON A.TIPO_OFERTA_ANTERIOR_ID = E.TIPO_OFERTA_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_Suscripcion F ON a.Suscripcion_ID = F.Suscripcion_ID WHERE FECHA_FIN_ORDEN_ID BETWEEN CURRENT_DATE-15 and CURRENT_DATE AND B.ESTADO_OPERACION_CD = 'DO' AND C.ESTADO_ORDEN_CD = 'DO' AND D.TIPO_OFERTA_DE IN ('PortePagado', 'PRE', 'Prepaid') AND E.TIPO_OFERTA_DE NOT IN ('PortePagado', 'PRE', 'Prepaid') ) G WHERE \$CONDITIONS AND G.ROW_NUM = 1" \
--hcatalog-database TRAFICO \
--hcatalog-table CRITERIO_TEM_MIGNEG_TMP \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile tblproperties ('EXTERNAL'='TRUE')" -m 1
And it is giving me the following error:
22/05/11 12:16:15 INFO hcat.SqoopHCatUtilities: Caused by: java.lang.NullPointerException
22/05/11 12:16:15 INFO hcat.SqoopHCatUtilities: at org.apache.hadoop.hive.ql.ddl.DDLSemanticAnalyzerFactory.(DDLSemanticAnalyzerFactory.java:79)
22/05/11 12:16:16 DEBUG manager.SqlManager: Closing a db connection
ERROR tool.ImportTool: Import failed: java.io.IOException: HCat exited with status 1
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.executeExternalHCatProgram(SqoopHCatUtilities.java:1252)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.launchHCatCli(SqoopHCatUtilities.java:1201)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.createHCatTable(SqoopHCatUtilities.java:735)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.configureHCat(SqoopHCatUtilities.java:394)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.configureImportOutputFormat(SqoopHCatUtilities.java:904)
at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:100)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:265)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:732)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:549)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:653)
at org.apache.sqoop.Sqoop.run(Sqoop.java:151)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:187)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:241)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:250)
at org.apache.sqoop.Sqoop.main(Sqoop.java:259)
I run the SQL on Teradata and it works; it returns exactly the records that should be imported into Hive.
In Hive, the table TRAFICO.CRITERIO_TEM_MIGNEG_TMP is dropped before the import.
I run it and I can't solve the error.
Any suggestions?
This is Hive's version:
Hive 3.1.3000.7.1.7.1000-141
This is Hadoop's version
Hadoop 3.1.1.7.1.7.1000-141
Source code repository git#github.infra.cloudera.com:CDH/hadoop.git -r 8225796fc6d7984f835c3f63f1feb1efb1e4784a
Compiled by jenkins on 2022-03-24T17:23Z
Compiled with protoc 2.5.0
From source with checksum b591347dc68a5634183cd9aac1974ddd
This command was run using /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hadoop/hadoop-common-3.1.1.7.1.7.1000-141.jar
I am using Sqoop to import MySQL tables to HDFS. To do that, I use a free-form query import.
--query "SELECT $query_select FROM $table where \$CONDITIONS"
This query is quite slow because of the min(id) and max(id) search. To improve performance, I've decided to use --boundary-query and specify the lower and upper bounds manually (see https://www.safaribooksonline.com/library/view/apache-sqoop-cookbook/9781449364618/ch04.html):
--boundary-query "select 176862848, 172862848"
However, Sqoop ignores the specified values and again tries to find the minimum and maximum "id" by itself:
16/06/13 14:24:44 INFO tool.ImportTool: Lower bound value: 170581647
16/06/13 14:24:44 INFO tool.ImportTool: Upper bound value: 172909234
The complete sqoop command:
sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number\
--connect jdbc:mysql://$server:$port/$database --username $username --password $password\
--target-dir $destination_dir --boundary-query "select 176862848, 172862848"\
--incremental append --check-column $id_column_name --last-value $last_value\
--split-by $id_column_name --query "SELECT $query_select FROM $table where \$CONDITIONS"\
--fields-terminated-by , --escaped-by \\ --enclosed-by '\"'
Has anyone already met/solved this problem? Thanks.
I've managed to solve this problem by deleting the following arguments:
--incremental append --check-column $id_column_name --last-value $last_value
It seems that there is a conflict between the arguments --boundary-query, --check-column, --split-by and --incremental append.
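For reference, deleting them leaves the command looking like this, and the manual boundaries are then honored:
sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number \
--connect jdbc:mysql://$server:$port/$database --username $username --password $password \
--target-dir $destination_dir --boundary-query "select 176862848, 172862848" \
--split-by $id_column_name --query "SELECT $query_select FROM $table where \$CONDITIONS" \
--fields-terminated-by , --escaped-by \\ --enclosed-by '\"'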
You are correct.
We should not use --split-by together with the --boundary-query control argument.
Try it like this:
--boundary-query "select 176862848, 172862848 from tablename limit 1" \
I have a Sqoop job to run. The conditions include:
WHERE cond1='' AND date = '2-12-xxxx' AND date = '3-12-xxxx' AND date = '3-12-xxxx'.
Is there an IN conditional in Sqoop similar to SQL?
You can run a Sqoop import using --query and pass it any query to get the data.
With --where you have to pass conditions like this: --where "cond1='value' and cond2 in (<comma separated values>)".
If you use a where condition on a table import, it is applied as select * from <table> where <condition specified in where clause> to fetch the data, so you can pass any valid conditions in where.
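Putting that together, a minimal sketch of the full command (the connection string, table name, and target dir are placeholders; the date literals are your own):
sqoop import \
--connect jdbc:mysql://dbhost:3306/mydb \
--username user -P \
--table my_table \
--where "cond1='value' and date in ('2-12-xxxx','3-12-xxxx')" \
--target-dir /user/hadoop/my_table_filtered \
-m 1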
Can anyone tell me what --split-by and --boundary-query are used for in Sqoop?
sqoop import --connect jdbc:mysql://localhost/my --username user --password 1234 --query 'select * from table where id=5 AND $CONDITIONS' --split-by table.id --target-dir /dir
--split-by : It specifies the column of the table used to generate the splits for an import, i.e. which column will be used to slice the data while importing it into your cluster. It can be used to enhance import performance by achieving greater parallelism. Sqoop creates splits based on the values in the column the user specifies through --split-by in the import command. If it is not given, the primary key of the input table is used to create the splits.
Reason to use : Sometimes the primary key doesn't have an even distribution of values between its minimum and maximum (which are used to create the splits when --split-by is not given). In such a situation you can specify some other column which has a proper distribution of data, to create splits for efficient imports.
--boundary-query : By default Sqoop will use the query select min(<split-by column>), max(<split-by column>) from <table> to find the boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
Reason to use : If --split-by alone is not giving you optimal performance, you can use this to improve it further.
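For example, here is a sketch with placeholder connection details and names, splitting on a non-primary-key column and replacing the default min/max lookup with a constant boundary query:
sqoop import \
--connect jdbc:mysql://dbhost:3306/mydb \
--username user -P \
--table orders \
--split-by customer_id \
--boundary-query "SELECT 0, 1000000" \
--target-dir /user/hadoop/orders \
--num-mappers 4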
--split-by is used to distribute the values from the table across the mappers uniformly, i.e. say you have 100 unique records (primary key) and there are 4 mappers; --split-by (primary key column) will help distribute your data set evenly among the mappers.
$CONDITIONS is used by the Sqoop process; Sqoop replaces it internally with a unique condition expression per mapper so that each mapper fetches its own slice of the data set.
If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. e.g., one mapper may execute "select bla from foo WHERE (id >=0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)" and so on.
Sqoop allows you to import data in parallel, and --split-by and --boundary-query give you more control. If you're just importing a table, Sqoop will use its PRIMARY KEY; however, if you're doing a more advanced query, you'll need to specify the column to split on.
e.g.,
sqoop import \
--connect 'jdbc:mysql://.../...' \
--direct \
--username uname --password pword \
--hive-import \
--hive-table query_import \
--boundary-query 'SELECT 0, MAX(id) FROM a' \
--query 'SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND $CONDITIONS' \
--num-mappers 3 \
--split-by a.id \
--target-dir /data/import \
--verbose
--boundary-query lets you specify an optimized query to get the min and max; otherwise Sqoop will attempt to run MIN(a.id), MAX(a.id) on your --query statement.
The result (if min=0 and max=30) is 3 queries that get run in parallel:
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 0 AND 10;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 11 AND 20;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 21 AND 30;
Split by :
Why is it used? -> to enhance the speed while fetching the data from the RDBMS into Hadoop.
How does it work? -> By default there are 4 mappers in Sqoop, so the import works in parallel and the entire data set is divided into equal partitions. Sqoop takes the primary key column for splitting the data, finds its maximum and minimum values, and then makes 4 ranges for the 4 mappers to work on.
E.g. with 1000 records in the primary key column, max value = 1000 and min value = 0, Sqoop will create 4 ranges - (0-250), (250-500), (500-750), (750-1000) - and depending on the values of the column the data will be partitioned and given to the 4 mappers to store on HDFS.
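Roughly, the four per-mapper queries Sqoop generates for that example would look like this (table and column names are illustrative):
SELECT * FROM mytable WHERE id >= 0 AND id < 250
SELECT * FROM mytable WHERE id >= 250 AND id < 500
SELECT * FROM mytable WHERE id >= 500 AND id < 750
SELECT * FROM mytable WHERE id >= 750 AND id <= 1000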
So if the primary key column is not evenly distributed, with --split-by you can change the column name to get even partitioning.
In short: it is used for partitioning the data to support parallelism and improve performance.
Also, if we specify the --query value within double quotes (" "), we need to precede $CONDITIONS with a backslash (\):
--query "select * from table where id=5 AND \$CONDITIONS"
or else
--query 'select * from table where id=5 AND $CONDITIONS'