Sqoop import with a SQL query, a where clause and parallel processing - hadoop

I have a table as below in MySQL:
Order_Details:
+---------+------------+-------------------+--------------+
| orderid | order_date | order_customer_id | order_status |
+---------+------------+-------------------+--------------+
| A001    | 10/30/2018 | C003              | Completed    |
| A002    | 10/30/2018 | C005              | Completed    |
| A451    | 11/02/2018 | C376              | Pending      |
| P9209   | 10/30/2018 | C234              | Completed    |
| P92099  | 10/30/2018 | C244              | Pending      |
| P9210   | 10/30/2018 | C035              | Completed    |
| P92398  | 10/30/2018 | C346              | Pending      |
| P9302   | 10/30/2018 | C034              | Completed    |
+---------+------------+-------------------+--------------+
and its description is as below:
mysql> desc Order_Details_Sankha;
+-------------------+-------------+------+-----+---------+-------+
| Field             | Type        | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------+-------+
| orderid           | varchar(20) | NO   | PRI |         |       |
| order_date        | varchar(20) | YES  |     | NULL    |       |
| order_customer_id | varchar(20) | YES  |     | NULL    |       |
| order_status      | varchar(20) | YES  |     | NULL    |       |
+-------------------+-------------+------+-----+---------+-------+
I am using the below Sqoop import with parallel processing:
sqoop import \
--connect jdbc:mysql://ip-10-0-1-10.ec2.internal/06july2018_new \
--username labuser \
--password abc123 \
--driver com.mysql.jdbc.Driver \
--query "select * from Order_Details where order_date = '10/30/2018' AND \$CONDITIONS" \
--target-dir /user/sankha087_gmail_com/outputs/EMP_Sankha_1112201888 \
--split-by "," \
--m 3
and I am getting the below error message
18/12/15 17:15:14 WARN security.UserGroupInformation: PriviledgedActionException as:sankha087_gmail_com (auth:SIMPLE) cause:java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), MAX(,) FROM (select * from Order_Details_Sankha where order_date = '10/30/201' at line 1
18/12/15 17:15:14 ERROR tool.ImportTool: Import failed: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), MAX(,) FROM (select * from Order_Details_Sankha where order_date = '10/30/201' at line 1
at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:207)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
Please advise what needs to be changed in my import statement.

Sqoop parallelizes an import by splitting the data horizontally (rows divided across mappers), not vertically (columns).
--split-by must therefore name a column, ideally one whose values are evenly distributed; passing the literal string "," is what produces the malformed MIN()/MAX() boundary query in the error above.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1765770
Read: 7.2.4. Controlling Parallelism
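For example, a corrected version of the import might look like the sketch below, splitting on the orderid primary key instead of a literal comma. Because orderid is a varchar, newer Sqoop releases refuse to split on a text column unless you also pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true (alternatively, drop the parallelism and use --m 1):
sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://ip-10-0-1-10.ec2.internal/06july2018_new \
--username labuser \
--password abc123 \
--driver com.mysql.jdbc.Driver \
--query "select * from Order_Details where order_date = '10/30/2018' AND \$CONDITIONS" \
--target-dir /user/sankha087_gmail_com/outputs/EMP_Sankha_1112201888 \
--split-by orderid \
--m 3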

Related

DBeaver does not display "Table description" of a table in a Hive database

I created a Hive table with table description metadata using this command:
create table sbx_ppppp.comments (
s string comment 'uma string',
i int comment 'um inteiro'
) comment 'uma tabela com comentários';
But the description isn't displayed when I double-click the table.
The table description also isn't displayed in the table tooltip or in the table list when I double-click the database name.
When I run the describe formatted sbx_ppppp.comments command, the comment is correctly displayed as a table property:
col_name |data_type |comment |
----------------------------+------------------------------------------------+---------------------------------------------------------------------------+
# col_name |data_type |comment |
s |string |uma string |
i |int |um inteiro |
| | |
# Detailed Table Information| | |
Database: |sbx_ppppp | |
OwnerType: |USER | |
Owner: |ppppp | |
CreateTime: |Fri Apr 29 18:31:31 BRT 2022 | |
LastAccessTime: |UNKNOWN | |
Retention: |0 | |
Location: |hdfs://BNDOOP03/corporativo/sbx_ppppp/comments | |
Table Type: |MANAGED_TABLE | |
Table Parameters: | | |
|COLUMN_STATS_ACCURATE |{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"i\":\"true\",\"s\":\"true\"}}|
|bucketing_version |2 |
|comment |uma tabela com comentários |
|numFiles |0 |
|numRows |0 |
|rawDataSize |0 |
|totalSize |0 |
|transactional |true |
|transactional_properties |default |
|transient_lastDdlTime |1651267891 |
| | |
# Storage Information | | |
SerDe Library: |org.apache.hadoop.hive.ql.io.orc.OrcSerde | |
InputFormat: |org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | |
OutputFormat: |org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat| |
Compressed: |No | |
Num Buckets: |-1 | |
Bucket Columns: |[] | |
Sort Columns: |[] | |
Storage Desc Params: | | |
|serialization.format |1 |
In "Table Parameters" you can see the value "uma tabela com comentários" for the "comment" parameter.
I'm using Cloudera ODBC driver version 2.6.11.1011 to connect to Hive. DBeaver is version 22.0.3.202204170718. I don't know if this is a bug in DBeaver or in Cloudera ODBC driver. Maybe I'm not correctly setting the table description.
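As a sanity check independent of DBeaver's UI (a sketch; any SQL client connected to the same Hive instance should do), the stored comment can also be read directly from the table properties:
SHOW TBLPROPERTIES sbx_ppppp.comments('comment');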

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition. Not sure what's wrong.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name                    | data_type             | comment                |
+----------------------------+-----------------------+-----------------------+--+
| name                        | string                |                        |
| cost                        | double                |                        |
| load_date                   | string                |                        |
|                             | NULL                  | NULL                   |
| # Partition Information     | NULL                  | NULL                   |
| # col_name                  | data_type             | comment                |
|                             | NULL                  | NULL                   |
| load_date                   | string                |                        |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Partition listing before and after running it:
+-----------------------+--+
| partition             |
+-----------------------+--+
| load_date=2022-04-07  |
| load_date=2022-04-11  |
| load_date=2022-04-18  |
| load_date=2022-04-25  |
+-----------------------+--+
It runs without any error but won't drop the partition. The table is internal/managed.
I tried different approaches mentioned on Stack Overflow but it is just not working for me.
Help.
You don't need to set a variable, and the one you set doesn't do what you expect: Hive variable substitution is purely textual, so SET hivevar:current_date=current_date(); stores the literal text current_date(), and the DROP PARTITION ends up comparing load_date against the string 'current_date()' rather than today's date. You can drop the partition directly with SQL:
ALTER TABLE prod_db.products
DROP PARTITION (load_date = current_date());
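If you do want to drive it from a variable, one sketch (assuming the hive CLI is on the PATH; beeline accepts --hivevar the same way) is to evaluate the date in the shell and pass the resulting string in:
hive --hivevar dt="$(date +%Y-%m-%d)" -e 'ALTER TABLE prod_db.products DROP PARTITION (load_date="${hivevar:dt}");'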

ERROR: relation "table" does not exist, even though both the database and the table exist

I'm using CockroachDB, which is essentially a superset of Postgres, and I can't understand why the statement:
select * from _a66df261120b6c23.tabDefaultValue;
results in the error:
ERROR: relation "_a66df261120b6c23.tabdefaultvalue" does not exist
show databases gives me:
database_name       | owner | primary_region | regions | survival_goal
--------------------+-------+----------------+---------+----------------
_a66df261120b6c23   | root  | NULL           | {}      | NULL
defaultdb           | root  | NULL           | {}      | NULL
postgres            | root  | NULL           | {}      | NULL
root                | root  | NULL           | {}      | NULL
sammy               | root  | NULL           | {}      | NULL
system              | node  | NULL           | {}      | NULL
test123             | root  | NULL           | {}      | NULL
test3               | root  | NULL           | {}      | NULL
test4               | root  | NULL           | {}      | NULL
test5               | root  | NULL           | {}      | NULL
test9               | root  | NULL           | {}      | NULL
and show tables from _a66df261120b6c23 gives me:
schema_name   | table_name        | type  | owner | estimated_row_count | locality
--------------+-------------------+-------+-------+---------------------+-----------
public        | __Auth            | table | root  | 0                   | NULL
public        | tabDefaultValue   | table | root  | 0                   | NULL
The database and the table both exist, so why does select * from _a66df261120b6c23.tabDefaultValue; fail? The strange thing is that when I run \dt, all I get is:
schema_name   | table_name | type  | owner | estimated_row_count | locality
--------------+------------+-------+-------+---------------------+-----------
public        | sammy      | table | root  | 0                   | NULL
How do I actually get the select statement to work? Thank you
In CockroachDB, as in PostgreSQL, identifiers are case-sensitive, and unquoted identifiers are folded to lowercase unless the case is forced with double quotes.
This means that to refer to your tabDefaultValue table, you need to use the following statement:
select * from _a66df261120b6c23."tabDefaultValue";
Note that the quotes go around the table name only; if you quote dbname.tablename together, it will be treated as a single identifier with a dot in the middle.
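For illustration, a sketch of how the quoting rules play out (not output from the poster's cluster):
-- works: the database name is lowercased, the table name keeps its exact case
select * from _a66df261120b6c23."tabDefaultValue";
-- fails: the whole quoted string is one identifier that happens to contain a dot
select * from "_a66df261120b6c23.tabDefaultValue";
-- fails: unquoted, the table name is folded to tabdefaultvalue, which does not exist
select * from _a66df261120b6c23.tabDefaultValue;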

Does the Sqoop incremental lastmodified option include the date given in --last-value in the result?

My MySQL table is:
mysql> select * from Orders;
+--------+-----------+------------+-------------+-------------+
| ord_no | purch_amt | ord_date   | customer_id | salesman_id |
+--------+-----------+------------+-------------+-------------+
|  70001 |    150.50 | 2012-10-05 |        3005 |        5002 |
|  70009 |    270.65 | 2012-09-10 |        3001 |        5005 |
|  70002 |     65.26 | 2012-10-05 |        3002 |        5001 |
|  70004 |    110.50 | 2012-08-17 |        3009 |        5003 |
|  70007 |    948.50 | 2012-09-10 |        3005 |        5002 |
|  70005 |    999.99 | 2012-07-27 |        3007 |        5001 |
|  70008 |    999.99 | 2012-09-10 |        3002 |        5001 |
|  70010 |    999.99 | 2012-10-10 |        3004 |        5006 |
|  70003 |    999.99 | 2012-10-10 |        3009 |        5003 |
|  70012 |    250.45 | 2012-06-27 |        3008 |        5002 |
|  70011 |     75.29 | 2012-08-17 |        3003 |        5007 |
|  70013 |    999.99 | 2012-04-25 |        3002 |        5001 |
+--------+-----------+------------+-------------+-------------+
I ran a Sqoop import:
sqoop import \
--connect jdbc:mysql://ip-172-31-20-247:3306/sqoopex \
--username sqoopuser --password <hidden> \
--table Orders --target-dir SqoopImp2 \
--split-by ord_no \
--check-column ord_date --incremental lastmodified --last-value '2012-09-10'
As per the Sqoop 1.4.6 manual, quoted below:
An alternate table update strategy supported by Sqoop is called lastmodified
mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported
I was not expecting rows with date '2012-09-10' in the output. However, my output, as shown below,
[manojpurohit17834325#ip-172-31-38-146 ~]$ hadoop fs -cat SqoopImp2/*
70001,150.50,2012-10-05,3005,5002
70002,65.26,2012-10-05,3002,5001
70003,999.99,2012-10-10,3009,5003
70007,948.50,2012-09-10,3005,5002
70009,270.65,2012-09-10,3001,5005
70008,999.99,2012-09-10,3002,5001
70010,999.99,2012-10-10,3004,5006
contains rows with date 2012-09-10. Note: the output directory was not present earlier and was created by this Sqoop execution.
From this execution I see that the date given in --last-value is included in the output, which is contrary to what is mentioned in the manual. Please help me understand this discrepancy and correct me if I am missing something here.
Yes, rows whose check column equals --last-value are also included in the results.
--incremental import has two modes available: (i) append and (ii) lastmodified.
append: checks --last-value and imports only rows strictly after it.
--> it does not re-import previously imported rows even if they were updated.
lastmodified: behaves like append for new rows, but also re-imports previously imported rows if they were updated; the check column is compared with >=, so the boundary value itself is included.
Note: lastmodified works only with date or timestamp columns.
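For comparison, an append-mode run over the same table might look like the sketch below (the SqoopImp2_append target directory is made up for the example); with --last-value 70010 it pulls only rows whose ord_no is strictly greater than 70010:
sqoop import \
--connect jdbc:mysql://ip-172-31-20-247:3306/sqoopex \
--username sqoopuser --password <hidden> \
--table Orders --target-dir SqoopImp2_append \
--split-by ord_no \
--check-column ord_no --incremental append --last-value 70010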

Sqoop import is not moving the entire table into HDFS

I have created a small database with a few tables in MySQL. Now I am transferring a table to HDFS using Sqoop.
Below is the sqoop command:
sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --m 1 --driver com.mysql.jdbc.Driver
I am not getting the last 2 columns, salary and dept.
Output of the above command
1201gopalmanager
1202manishaProof reader
1203khalilphp dev
1204prasanthphp dev
1205kranthiadmin
MySql table is :
+------+----------+--------------+--------+------+
| id   | name     | deg          | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal    | manager      |  50000 | TP   |
| 1202 | manisha  | Proof reader |  50000 | TP   |
| 1203 | khalil   | php dev      |  30000 | AC   |
| 1204 | prasanth | php dev      |  30000 | AC   |
| 1205 | kranthi  | admin        |  20000 | TP   |
+------+----------+--------------+--------+------+
I tried using --fields-terminated-by ',' and --input-fields-terminated-by ',' but that failed as well.
Also, when I use a higher mapper count (--m 3), I still get only a single file in HDFS.
I am using Apache Sqoop on an Ubuntu machine.
Thanks in advance for finding a solution. :)
Your command seems correct. Below are some steps you can try to follow once again and see if it works:
1) Create the table and populate it (MySQL)
mysql> create database sqooptest;
mysql> use sqooptest;
mysql> create table emp (id int, name varchar(100), deg varchar(50), salary int, dept varchar(10));
mysql> insert into emp values(1201, 'gopal','manager',50000,'TP');
mysql> insert into emp values(1202, 'manisha','Proof reader',50000,'TP');
mysql> insert into emp values(1203, 'khalil','php dev',30000,'AC');
mysql> insert into emp values(1204, 'prasanth','php dev',30000,'AC');
mysql> insert into emp values(1205, 'kranthi','admin',20000,'TP');
mysql> select * from emp;
+------+----------+--------------+--------+------+
| id   | name     | deg          | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal    | manager      |  50000 | TP   |
| 1202 | manisha  | Proof reader |  50000 | TP   |
| 1203 | khalil   | php dev      |  30000 | AC   |
| 1204 | prasanth | php dev      |  30000 | AC   |
| 1205 | kranthi  | admin        |  20000 | TP   |
+------+----------+--------------+--------+------+
2) Run the import
$ sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --m 1 --driver com.mysql.jdbc.Driver --target-dir /tmp/sqoopout
3) Check result
$ hadoop fs -cat /tmp/sqoopout/*
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
HDFS has only one file (part-m-00000):
$ hadoop fs -ls /tmp/sqoopout
Found 2 items
/tmp/sqoopout/_SUCCESS
/tmp/sqoopout/part-m-00000
This is because the data size is small, and one mapper is sufficient to process it. You can verify this by looking at the Sqoop logs, which show:
Job Counters
Launched map tasks=1
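If you do want the output spread over several files, a sketch of an import that asks for three mappers and an explicit split column (the /tmp/sqoopout3 directory is made up for the example):
sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --split-by id --m 3 --driver com.mysql.jdbc.Driver --target-dir /tmp/sqoopout3
With an integer split column such as id, Sqoop divides the MIN/MAX range into three ranges, so you would typically see part-m-00000 through part-m-00002 in the target directory.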

Resources