Does Sqoop's incremental lastmodified option include the date given as --last-value in the result? - sqoop

My MySQL table is:
mysql> select * from Orders;
+--------+-----------+------------+-------------+-------------+
| ord_no | purch_amt | ord_date   | customer_id | salesman_id |
+--------+-----------+------------+-------------+-------------+
|  70001 |    150.50 | 2012-10-05 |        3005 |        5002 |
|  70009 |    270.65 | 2012-09-10 |        3001 |        5005 |
|  70002 |     65.26 | 2012-10-05 |        3002 |        5001 |
|  70004 |    110.50 | 2012-08-17 |        3009 |        5003 |
|  70007 |    948.50 | 2012-09-10 |        3005 |        5002 |
|  70005 |    999.99 | 2012-07-27 |        3007 |        5001 |
|  70008 |    999.99 | 2012-09-10 |        3002 |        5001 |
|  70010 |    999.99 | 2012-10-10 |        3004 |        5006 |
|  70003 |    999.99 | 2012-10-10 |        3009 |        5003 |
|  70012 |    250.45 | 2012-06-27 |        3008 |        5002 |
|  70011 |     75.29 | 2012-08-17 |        3003 |        5007 |
|  70013 |    999.99 | 2012-04-25 |        3002 |        5001 |
+--------+-----------+------------+-------------+-------------+
I ran a Sqoop import:
sqoop import --connect jdbc:mysql://ip-172-31-20-247:3306/sqoopex \
  --username sqoopuser --password <hidden> --table Orders \
  --target-dir SqoopImp2 --split-by ord_no \
  --check-column ord_date --incremental lastmodified \
  --last-value '2012-09-10'
As per the Sqoop 1.4.6 manual, quoted below:
An alternate table update strategy supported by Sqoop is called lastmodified
mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
I was not expecting rows with date '2012-09-10' in the output. However, my output, shown below,
[manojpurohit17834325@ip-172-31-38-146 ~]$ hadoop fs -cat SqoopImp2/*
70001,150.50,2012-10-05,3005,5002
70002,65.26,2012-10-05,3002,5001
70003,999.99,2012-10-10,3009,5003
70007,948.50,2012-09-10,3005,5002
70009,270.65,2012-09-10,3001,5005
70008,999.99,2012-09-10,3002,5001
70010,999.99,2012-10-10,3004,5006
contains rows with date 2012-09-10. Note: the output directory was not present earlier; it was created by this Sqoop execution.
From this execution, I see that the date given as --last-value is included in the output, which is contrary to what the manual says. Please help me understand this discrepancy, and correct me if I am missing something here.

Yes, rows whose check column equals --last-value are also included in the results.
--incremental import has two modes available: i) append, ii) lastmodified.
append: in this mode Sqoop only checks --last-value and imports rows from that value onwards.
It does not re-import previously imported rows, even if they were updated.
lastmodified: works like append, but in addition to new rows it also re-imports previously imported rows if they have been updated since the last value.
Note: lastmodified works only with date or timestamp type check columns.
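For example, here is a minimal sketch of a follow-up lastmodified run, based on the command in the question. The --merge-key option is an assumption (it is not in the original command); it is only needed because after the first run the target directory already exists, so re-imported (updated) rows have to be reconciled with the old ones:
# Sketch: second incremental run; rows with ord_date >= the last value are picked up,
# and updated rows are merged into the existing files on ord_no.
sqoop import \
  --connect jdbc:mysql://ip-172-31-20-247:3306/sqoopex \
  --username sqoopuser --password <hidden> \
  --table Orders \
  --target-dir SqoopImp2 \
  --split-by ord_no \
  --check-column ord_date \
  --incremental lastmodified \
  --last-value '2012-09-10' \
  --merge-key ord_no
At the end of the run Sqoop prints the value to pass as the next --last-value; a saved job (sqoop job --create ...) records it automatically.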

Related

DBeaver does not display the "Table description" of a table in a Hive database

I created a Hive table with table description metadata using this command:
create table sbx_ppppp.comments (
s string comment 'uma string',
i int comment 'um inteiro'
) comment 'uma tabela com comentários';
But it isn't correctly displayed when I double-click the table.
The table description also isn't displayed in the table tooltip, or in the table list shown when I double-click the database name.
When I run describe formatted sbx_ppppp.comments, the comment is correctly displayed as a table property:
col_name |data_type |comment |
----------------------------+------------------------------------------------+---------------------------------------------------------------------------+
# col_name |data_type |comment |
s |string |uma string |
i |int |um inteiro |
| | |
# Detailed Table Information| | |
Database: |sbx_ppppp | |
OwnerType: |USER | |
Owner: |ppppp | |
CreateTime: |Fri Apr 29 18:31:31 BRT 2022 | |
LastAccessTime: |UNKNOWN | |
Retention: |0 | |
Location: |hdfs://BNDOOP03/corporativo/sbx_ppppp/comments | |
Table Type: |MANAGED_TABLE | |
Table Parameters: | | |
|COLUMN_STATS_ACCURATE |{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"i\":\"true\",\"s\":\"true\"}}|
|bucketing_version |2 |
|comment |uma tabela com comentários |
|numFiles |0 |
|numRows |0 |
|rawDataSize |0 |
|totalSize |0 |
|transactional |true |
|transactional_properties |default |
|transient_lastDdlTime |1651267891 |
| | |
# Storage Information | | |
SerDe Library: |org.apache.hadoop.hive.ql.io.orc.OrcSerde | |
InputFormat: |org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | |
OutputFormat: |org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat| |
Compressed: |No | |
Num Buckets: |-1 | |
Bucket Columns: |[] | |
Sort Columns: |[] | |
Storage Desc Params: | | |
|serialization.format |1 |
In "Table Parameters" you can see the value "uma tabela com comentários" for the "comment" parameter.
I'm using Cloudera ODBC driver version 2.6.11.1011 to connect to Hive. DBeaver is version 22.0.3.202204170718. I don't know if this is a bug in DBeaver or in the Cloudera ODBC driver. Maybe I'm not setting the table description correctly.
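As a cross-check that the comment really is stored in the metastore, independently of what DBeaver shows, the table property can also be queried directly (a sketch in Hive syntax):
-- Returns just the 'comment' table property written by the CREATE TABLE statement
SHOW TBLPROPERTIES sbx_ppppp.comments("comment");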

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, but for some reason it does not drop the partition. Not sure what's wrong.
Table name: prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Before-and-after view of the partitions:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but won't drop the partition. The table is internal/managed.
I tried different approaches mentioned on Stack Overflow, but it is just not working for me.
Help.
You don't need to set a variable. SET hivevar:current_date=current_date() stores the literal text current_date(), not an evaluated date, because Hive variable substitution is plain text substitution; the quoted partition value therefore never matches an existing partition. You can drop the partition directly with SQL:
Alter table prod_db.products
drop partition (load_date= current_date());
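If your Hive version rejects a function call inside the partition spec, an alternative sketch is to compute the date on the client and pass it to Hive as a plain string. The beeline invocation and the connection URL below are assumptions, not part of the original answer:
# Compute today's date in the shell and pass it to Hive as a hivevar string
beeline -u "jdbc:hive2://<host>:10000/default" \
  --hivevar current_date="$(date +%Y-%m-%d)" \
  -e "ALTER TABLE prod_db.products DROP PARTITION (load_date='\${hivevar:current_date}');"
Because the value is already a plain string such as 2022-04-25 by the time Hive sees it, the quoted partition spec can match the existing partition.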

Timestamp is different for the same table in hive-cli & presto-cli

I am getting different timestamps for the same table in hive-cli & presto-cli.
Table structure:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `mea_req_201`( |
| `mer_id` bigint, |
| `mer_from_dttm` timestamp, |
| `mer_to_dttm` timestamp, |
| `_c0` bigint, |
| `a_number` string, |
| `b_number` string, |
| `time_stamp` timestamp, |
| `duration` bigint) |
| PARTITIONED BY ( |
| `partition_col` bigint) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://hadoop5:8020/warehouse/tablespace/external/hive/mea_req_201' |
| TBLPROPERTIES ( |
| 'TRANSLATED_TO_EXTERNAL'='TRUE', |
| 'bucketing_version'='2', |
| 'external.table.purge'='TRUE', |
| 'spark.sql.create.version'='2.4.0.7.1.4.0-203', |
| 'spark.sql.sources.schema.numPartCols'='1', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'transient_lastDdlTime'='1625496239') |
+----------------------------------------------------+
While running from hive-cli the output is:
While from presto-cli:
In the mer_from_dttm column there's a time difference, but for the other timestamp columns the dates are exactly the same. Note that this time-difference behaviour is the same when going through presto-jdbc as well. I believe this has nothing to do with the time zone, because if it were a time-zone issue the difference should appear across all timestamp columns, not just one. Please provide some resolution.
Some Facts:
Presto server version: 0.180
Presto Jdbc version: 0.180
hive.time-zone=Asia/Calcutta
In Presto jvm.config: -Duser.timezone=Asia/Calcutta
Client TimeZone: Asia/Calcutta
Edit 1:
Sorted the query by mer_id to ensure both queries output the same set of rows; however, the erroneous behavior remains the same.
While Running from hive-cli:
While Running from presto-cli:
Presto 0.180 is really old. It was released in 2017, and many bugs have been fixed along the way.
I would suggest you try a recent version. In particular, recent versions of Trino (formerly known as PrestoSQL) have had a lot of work done around the handling of timestamp data.
Use ORDER BY to see the exact same rows in each of the clients.
SELECT `mer_id`,`mer_from_dttm`, `mer_to_dttm`, `time_stamp` FROM mea_req_201 ORDER BY `mer_id`;
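If you want to rule the session time zone in or out, a small diagnostic sketch is to compare the session zone and the stored values rendered as text in both engines (current_timezone() is Presto/Trino syntax; the casts only render the value as text so the two clients can be compared literally):
-- Presto / Trino
SELECT current_timezone();
SELECT mer_id, CAST(mer_from_dttm AS varchar) AS mer_from_dttm_text
FROM mea_req_201 ORDER BY mer_id;
-- Hive
SELECT mer_id, CAST(mer_from_dttm AS string) AS mer_from_dttm_text
FROM mea_req_201 ORDER BY mer_id;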

Sqoop import having SQL query with where clause and parallel processing

I have a table as below in MySQL:
Order_Details :
+---------+------------+-------------------+--------------+
| orderid | order_date | order_customer_id | order_status |
+---------+------------+-------------------+--------------+
| A001    | 10/30/2018 | C003              | Completed    |
| A002    | 10/30/2018 | C005              | Completed    |
| A451    | 11/02/2018 | C376              | Pending      |
| P9209   | 10/30/2018 | C234              | Completed    |
| P92099  | 10/30/2018 | C244              | Pending      |
| P9210   | 10/30/2018 | C035              | Completed    |
| P92398  | 10/30/2018 | C346              | Pending      |
| P9302   | 10/30/2018 | C034              | Completed    |
+---------+------------+-------------------+--------------+
and the description is as below:
mysql> desc Order_Details_Sankha;
+-------------------+-------------+------+-----+---------+-------+
| Field             | Type        | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------+-------+
| orderid           | varchar(20) | NO   | PRI |         |       |
| order_date        | varchar(20) | YES  |     | NULL    |       |
| order_customer_id | varchar(20) | YES  |     | NULL    |       |
| order_status      | varchar(20) | YES  |     | NULL    |       |
+-------------------+-------------+------+-----+---------+-------+
I am using the below Sqoop import with parallel processing:
sqoop import
--connect jdbc:mysql://ip-10-0-1-10.ec2.internal/06july2018_new
--username labuser
--password abc123
--driver com.mysql.jdbc.Driver
--query "select * from Order_Details where order_date = '10/30/2018' AND \$CONDITIONS"
--target-dir /user/sankha087_gmail_com/outputs/EMP_Sankha_1112201888
--split-by ","
--m 3
and I am getting the below error message:
18/12/15 17:15:14 WARN security.UserGroupInformation: PriviledgedActionException as:sankha087_gmail_com (auth:SIMPLE) cause:java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), MAX(,) FROM (select * from Order_Details_Sankha where order_date = '10/30/201' at line 1
18/12/15 17:15:14 ERROR tool.ImportTool: Import failed: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), MAX(,) FROM (select * from Order_Details_Sankha where order_date = '10/30/201' at line 1
at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:207)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
Please advise what needs to be changed in my import statement.
Sqoop parallel execution doesn't happen with a vertical split; it happens with a horizontal split.
--split-by should be a column name, and the column should be one whose values are evenly distributed.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1765770
Read: 7.2.4. Controlling Parallelism
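A sketch of the corrected command, with --split-by pointing at a real column. Since orderid is a varchar, Sqoop may additionally require the text-splitter property shown below; that property and the choice of orderid as the split column are assumptions on my part:
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://ip-10-0-1-10.ec2.internal/06july2018_new \
  --username labuser \
  --password abc123 \
  --driver com.mysql.jdbc.Driver \
  --query "select * from Order_Details where order_date = '10/30/2018' AND \$CONDITIONS" \
  --target-dir /user/sankha087_gmail_com/outputs/EMP_Sankha_1112201888 \
  --split-by orderid \
  --m 3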

sqoop import is not moving the entire table into HDFS

I created a small database with a few tables in MySQL. Now I am transferring a table to HDFS using Sqoop.
Below is the sqoop command:
sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --m 1 --driver com.mysql.jdbc.Driver
I am not getting the last 2 columns, salary and dept.
Output of the above command:
1201gopalmanager
1202manishaProof reader
1203khalilphp dev
1204prasanthphp dev
1205kranthiadmin
The MySQL table is:
+------+----------+--------------+--------+------+
| id   | name     | deg          | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal    | manager      |  50000 | TP   |
| 1202 | manisha  | Proof reader |  50000 | TP   |
| 1203 | khalil   | php dev      |  30000 | AC   |
| 1204 | prasanth | php dev      |  30000 | AC   |
| 1205 | kranthi  | admin        |  20000 | TP   |
+------+----------+--------------+--------+------+
I tried using "--fields-terminated-by ," and "--input-fields-terminated-by ," but that failed.
Also, when I set a higher mapper count (e.g. --m 3), I still get only a single file in HDFS.
I am using Apache Sqoop on an Ubuntu machine.
Thanks in advance for finding a solution. :)
Your command seems correct. Below are some steps you can follow once again to see if it works:
1) Create the table and populate it (MySQL)
mysql> create database sqooptest;
mysql> use sqooptest;
mysql> create table emp (id int, name varchar(100), deg varchar(50), salary int, dept varchar(10));
mysql> insert into emp values(1201, 'gopal','manager',50000,'TP');
mysql> insert into emp values(1202, 'manisha','Proof reader',50000,'TP');
mysql> insert into emp values(1203, 'khalil','php dev',30000,'AC');
mysql> insert into emp values(1204, 'prasanth','php dev',30000,'AC');
mysql> insert into emp values(1205, 'kranthi','admin',20000,'TP');
mysql> select * from emp;
+------+----------+--------------+--------+------+
| id   | name     | deg          | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal    | manager      |  50000 | TP   |
| 1202 | manisha  | Proof reader |  50000 | TP   |
| 1203 | khalil   | php dev      |  30000 | AC   |
| 1204 | prasanth | php dev      |  30000 | AC   |
| 1205 | kranthi  | admin        |  20000 | TP   |
+------+----------+--------------+--------+------+
2) Run the import
$ sqoop import --connect jdbc:mysql://localhost/sqooptest --username root -P --table emp --m 1 --driver com.mysql.jdbc.Driver --target-dir /tmp/sqoopout
3) Check result
$ hadoop fs -cat /tmp/sqoopout/*
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
HDFS has only one file (part-m-00000):
$ hadoop fs -ls /tmp/sqoopout
Found 2 items
/tmp/sqoopout/_SUCCESS
/tmp/sqoopout/part-m-00000
This is because the data size is small and one mapper is sufficient to process it. You can verify this by looking at the Sqoop logs, which show:
Job Counters
Launched map tasks=1
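If the columns ever appear to run together in your output, you can also set the field delimiter explicitly at import time. A sketch; the ',' delimiter and the /tmp/sqoopout_csv target directory are assumptions:
sqoop import \
  --connect jdbc:mysql://localhost/sqooptest \
  --username root -P \
  --table emp \
  --m 1 \
  --driver com.mysql.jdbc.Driver \
  --fields-terminated-by ',' \
  --target-dir /tmp/sqoopout_csv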
