Error in getting data from Oracle to Hive using Sqoop

I am running the following sqoop query:
sqoop import --connect jdbc:oracle:thin:@ldap://oid:389/ewsop000,cn=OracleContext,dc=****,dc=com \
--table ngprod.ewt_payment_ng --where "d_last_updt_ts >= to_timestamp('11/01/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" \
AND "d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" --username ***** --P \
--columns N_PYMNT_ID,D_last_updt_Ts,c_pymnt_meth,c_rcd_del,d_Create_ts \
--hive-import --hive-table payment_sample_table2
The schema for table payment_sample_table2 is in Hive. It runs fine if I do not include
AND "d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')"
Can someone tell me why, or if there's any other way to get the range of data?

Please specify the exact error. In any case, put the "AND ..." inside the same double quotes, on the same line as the preceding part of the --where clause. As shown above you have a badly formatted command line - the problem has nothing to do with the actual query.
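For example, the whole predicate can go inside a single quoted --where argument. A sketch of the same command with only the quoting changed (connection details and credentials as in the question):
sqoop import --connect jdbc:oracle:thin:@ldap://oid:389/ewsop000,cn=OracleContext,dc=****,dc=com \
--username ***** --P \
--table ngprod.ewt_payment_ng \
--columns N_PYMNT_ID,D_last_updt_Ts,c_pymnt_meth,c_rcd_del,d_Create_ts \
--where "d_last_updt_ts >= to_timestamp('11/01/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM') AND d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" \
--hive-import --hive-table payment_sample_table2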

Related

Create New Rows from Oracle CLOB and Write to HDFS

In an Oracle database, I can read this table containing a CLOB type (note the newlines):
ID   MY_CLOB
001  500,aaa,bbb
     500,ccc,ddd
     480,1,2,bad
     500,eee,fff
002  777,0,0,bad
003  500,yyy,zzz
I need to process this and import it into an HDFS table, with a new row for each MY_CLOB line starting with "500,". In this case, the Hive table should look like:
ID C_1 C_2 C_3
001 500 aaa bbb
001 500 ccc ddd
001 500 eee fff
003 500 yyy zzz
This solution to my previous question succeeds in producing this on Oracle. But writing the result to HDFS with a Python driver is very slow, or never succeeds.
Following this solution, I've tested a similar regex + pyspark solution that might work for my purposes:
import re
import cx_Oracle
from pyspark.sql import Row
#... query = """SELECT ID, MY_CLOB FROM oracle_table"""
#... cx_oracle_results <--- fetchmany results (batches) from query
def clob_to_rows(rec_id, clob_text):
    # One Row per CLOB line that starts with "500,"
    return [Row(ID=rec_id, C_1=c1, C_2=c2, C_3=c3)
            for c1, c2, c3 in re.findall(r"^(500),(.*),(.*)$", clob_text, re.MULTILINE)]
# Process each batch of results and write to Hive as parquet
for batch in cx_oracle_results():
    # batch is like [(1, <cx_Oracle LOB>), (2, <cx_Oracle LOB>), (3, <cx_Oracle LOB>)]
    # LOB.read() converts each CLOB object to text, e.g. "500,a,b\n500,c,d"
    texts = [(rec_id, clob.read()) for rec_id, clob in batch]
    df = sc.parallelize(texts) \
           .flatMap(lambda rec: clob_to_rows(rec[0], rec[1])) \
           .toDF()
    df.write.mode("append").parquet("myschema.pfile")
But reading Oracle cursor results and feeding them into pyspark this way doesn't work well.
I'm trying to run a sqoop job generated by another tool, importing the CLOB as text, and hoping I can process the sqooped table into a new Hive table like the above in reasonable time, perhaps with a pyspark solution similar to the above.
Unfortunately, this sqoop job doesn't work.
sqoop import -Doraoop.timestamp.string=false -Doracle.sessionTimeZone=America/Chicago
-Doraoop.import.hint=" " -Doraoop.oracle.session.initialization.statements="alter session disable parallel query;"
-Dkite.hive.tmp.root=/user/hive/kite_tmp/wassadamo --verbose
--connect jdbc:oracle:thin:@ldap://connection/string/to/oracle
--num-mappers 8 --split-by date_column
--query "SELECT * FROM (
SELECT ID, MY_CLOB
FROM oracle_table
WHERE ROWNUM <= 1000
) WHERE \$CONDITIONS"
--create-hive-table --hive-import --hive-overwrite --hive-database my_db
--hive-table output_table --as-parquetfile --fields-terminated-by \|
--delete-target-dir --target-dir $HIVE_WAREHOUSE --map-column-java=MY_CLOB=String
--username wassadamo --password-file /user/wassadamo/.oracle_password
But I get an error (snippet below):
20/07/13 17:04:08 INFO mapreduce.Job: map 0% reduce 0%
20/07/13 17:05:08 INFO mapreduce.Job: Task Id : attempt_1594629724936_3157_m_000001_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
...
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string
This seems to have been caused by mapping the CLOB column to string. I did this based on this answer.
How can I fix this? I'm open to a different pyspark solution as well.
Partial answer: the Oracle error seems to have been caused by
--split-by date_column
This date_column is an Oracle DATE type, and it turns out it doesn't work when sqooping from Oracle. It would be nice to be able to split on this column, but splitting on ID (a VARCHAR2) seems to be working.
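A rough, untested sketch of the adjusted job (only --split-by is intentionally different; the remaining -D options, --verbose and --fields-terminated-by from the job above are omitted here for brevity):
sqoop import \
  -Doraoop.timestamp.string=false -Doracle.sessionTimeZone=America/Chicago \
  --connect jdbc:oracle:thin:@ldap://connection/string/to/oracle \
  --username wassadamo --password-file /user/wassadamo/.oracle_password \
  --num-mappers 8 --split-by ID \
  --query "SELECT * FROM (SELECT ID, MY_CLOB FROM oracle_table WHERE ROWNUM <= 1000) WHERE \$CONDITIONS" \
  --map-column-java MY_CLOB=String \
  --create-hive-table --hive-import --hive-overwrite \
  --hive-database my_db --hive-table output_table --as-parquetfile \
  --delete-target-dir --target-dir $HIVE_WAREHOUSE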
The problem of efficiently parsing the text MY_CLOB field and creating new rows for each line remains.

Valid MySQL query breaks when used as boundary-query

Note: This is NOT a duplicate of Sqoop - Syntaxt error - Boundary Query - “error in your SQL syntax”
To limit fetching to data from only the last 8 days, I'm using the following boundary-query with Sqoop:
SELECT min(`created_at`),
max(`created_at`)
FROM `billing_db`.`billing_ledger`
WHERE `created_at` >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)
I've broken the query into multiple lines here for readability; I actually pass it to Sqoop on a single line.
Explanation of the different parts of the boundary-query:
IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone)
determines the server timezone
works for both MySQL & TiDB
convert_tz(now(), <server-timezone>, 'Asia/Kolkata')
converts the time from the server timezone to IST
timestamp(date(<ist-timestamp>) + interval -{num_days} DAY)
returns the IST timestamp at 00:00 hours for the date which is {num_days} days before today (current time, tz-specific)
While the query works fine on MySQL
mysql> SELECT min(`created_at`),
-> max(`created_at`)
-> FROM `billing_db`.`billing_ledger`
-> WHERE `created_at` >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY);
+---------------------+---------------------+
| min(`created_at`) | max(`created_at`) |
+---------------------+---------------------+
| 2020-05-08 00:00:00 | 2020-05-10 20:12:32 |
+---------------------+---------------------+
1 row in set (0.02 sec)
It breaks with the following stacktrace on Sqoop:
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT min(), max() FROM . WHERE >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)
[2020-05-10 12:45:34,968] {ssh_utils.py:130} WARNING - 20/05/10 18:15:34 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1589114450995_0001
[2020-05-10 12:45:34,971] {ssh_utils.py:130} WARNING - 20/05/10 18:15:34 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@6ab7a896
[2020-05-10 12:45:34,973] {ssh_utils.py:130} WARNING - 20/05/10 18:15:34 ERROR tool.ImportTool: Import failed: java.io.IOException: java.sql.SQLSyntaxErrorException: (conn=313686) You have an error in your SQL syntax; check the manual that corresponds to your TiDB version for the right syntax to use line 1 column 12 near "), max() FROM . WHERE >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)"
at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:207)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:303)
For the record:
using WHERE $CONDITIONS is required in --query (free-form query import), but for --boundary-query it is NOT mandatory. Without it, Sqoop merely generates this warning:
WARN db.DataDrivenDBInputFormat: Could not find $CONDITIONS token in query: SELECT min(), max() FROM . WHERE >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY); splits may not partition data.
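For illustration, the placement looks roughly like this (connection details, paths and the split column here are placeholders, not the real job):
sqoop import \
  --connect jdbc:mysql://host:3306/billing_db \
  --username user --password-file /user/foo/.mysql_password \
  --query "SELECT * FROM billing_ledger WHERE \$CONDITIONS" \
  --split-by created_at \
  --boundary-query "SELECT min(created_at), max(created_at) FROM billing_db.billing_ledger" \
  --target-dir /tmp/billing_ledger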
I've been using similarly complex boundary-queries elsewhere in my pipeline, but in this particular case it is breaking.
What have I tried
I tried adding aliases in the SELECT clause of the query, like this:
SELECT min(`created_at`) AS min_created_at,...
Backticks (``) were the culprit.
Removing the backticks from the boundary-query resolved the error.
Some comments in discussions point out that backticks can cause weird behaviour with Sqoop,
but the docs make no mention of it, and some discussions even encourage using them.
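For reference, this is the working boundary-query from above with the backticks removed (still passed to Sqoop on a single line):
SELECT min(created_at), max(created_at) FROM billing_db.billing_ledger WHERE created_at >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)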

Sqoop is failing to get data from Teradata with a java.io.IOException

Here is the sqoop import that I'm using to pull data from Teradata:
sqoop import -libjars jars --driver drivers --connect connection_url -m 1 --hive-overwrite --hive-import --hive-database hivedatabase --hive-table hivetable --target-dir '/user/hive/warehouse/database.db/table_name' --as-parquetfile --query "select c1,c2,c3, to_char(SOURCE_ACTIVATION_DT,'YYYY-MM-DD HH24:MI:SS') as SOURCE_ACTIVATION_DT,to_char(SOURCE_DEACTIVATION_DT,'YYYY-MM-DD HH24:MI:SS') as SOURCE_DEACTIVATION_DT,to_char(EFF_DT,'YYYY-MM-DD HH24:MI:SS') as EFF_DT,to_char(EXP_DT,'YYYY-MM-DD HH24:MI:SS') as EXP_DT,to_char(SYS_UPDATE_DTM,'YYYY-MM-DD HH24:MI:SS') as SYS_UPDATE_DTM,to_char(SYS_LOAD_DTM,'YYYY-MM-DD HH24:MI:SS') as SYS_LOAD_DTM from source_schema.table_name WHERE to_char(SYS_UPDATE_DTM,'YYYY-MM-DD HH24:MI:SS')> '2017-03-30 10:00:00' OR to_char(SYS_LOAD_DTM,'YYYY-MM-DD HH24:MI:SS') > '2017-03-30 10:00:00' AND \$CONDITIONS"
Below is the error I'm getting; this had been running fine for two days and recently started failing with the error below.
17/03/29 20:07:53 INFO mapreduce.Job: map 0% reduce 0%
17/03/29 20:56:46 INFO mapreduce.Job: Task Id : attempt_1487033963691_263120_m_000000_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 15.10.00.14] [Error 1005] [SQLState HY000] Unexpected parcel kind received: 9
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:94)
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:69)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveRecordSubState.action(ReceiveRecordSubState.java:195)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:311)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:200)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:137)
at com.teradata.jdbc.jdbc_4.statemachine.PreparedStatementController.run(PreparedStatementController.java:46)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.fetchRows(StatementController.java:360)
at com.teradata.jdbc.jdbc_4.TDResultSet.goToRow(TDResultSet.java:374)
at com.teradata.jdbc.jdbc_4.TDResultSet.next(TDResultSet.java:657)
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:237)
... 12 more
When I googled around, I saw people getting the same error for different reasons. I know this is something related to the time values I'm using in the WHERE clause, but I'm not sure what exactly I have to change.
Thanks in advance...!!
Sqoop uses $CONDITIONS to fetch both metadata and data.
Metadata: it replaces $CONDITIONS with 1=0, so no data is fetched using this condition, only metadata.
Data with 1 mapper: it replaces $CONDITIONS with 1=1, so all the data is fetched.
Data with multiple mappers: it replaces $CONDITIONS with a range condition on the split column.
Try these queries in a JDBC client:
select c1,c2,c3, to_char(SOURCE_ACTIVATION_DT,'YYYY-MM-DD HH24:MI:SS') as SOURCE_ACTIVATION_DT,to_char(SOURCE_DEACTIVATION_DT,'YYYY-MM-DD HH24:MI:SS') as SOURCE_DEACTIVATION_DT,to_char(EFF_DT,'YYYY-MM-DD HH24:MI:SS') as EFF_DT,to_char(EXP_DT,'YYYY-MM-DD HH24:MI:SS') as EXP_DT,to_char(SYS_UPDATE_DTM,'YYYY-MM-DD HH24:MI:SS') as SYS_UPDATE_DTM,to_char(SYS_LOAD_DTM,'YYYY-MM-DD HH24:MI:SS') as SYS_LOAD_DTM from source_schema.table_name WHERE to_char(SYS_UPDATE_DTM,'YYYY-MM-DD HH24:MI:SS')> '2017-03-30 10:00:00' OR to_char(SYS_LOAD_DTM,'YYYY-MM-DD HH24:MI:SS') > '2017-03-30 10:00:00' AND 1=0
select c1,c2,c3, to_char(SOURCE_ACTIVATION_DT,'YYYY-MM-DD HH24:MI:SS') as SOURCE_ACTIVATION_DT,to_char(SOURCE_DEACTIVATION_DT,'YYYY-MM-DD HH24:MI:SS') as SOURCE_DEACTIVATION_DT,to_char(EFF_DT,'YYYY-MM-DD HH24:MI:SS') as EFF_DT,to_char(EXP_DT,'YYYY-MM-DD HH24:MI:SS') as EXP_DT,to_char(SYS_UPDATE_DTM,'YYYY-MM-DD HH24:MI:SS') as SYS_UPDATE_DTM,to_char(SYS_LOAD_DTM,'YYYY-MM-DD HH24:MI:SS') as SYS_LOAD_DTM from source_schema.table_name WHERE to_char(SYS_UPDATE_DTM,'YYYY-MM-DD HH24:MI:SS')> '2017-03-30 10:00:00' OR to_char(SYS_LOAD_DTM,'YYYY-MM-DD HH24:MI:SS') > '2017-03-30 10:00:00' AND 1=1
If these do not work, your sqoop command with this query can never run.

Strange beeline error: "Error: (state=,code=0)"

I'm seeing a very strange error when running my HiveQL through beeline:
Error: (state=,code=0)
Error: (state=,code=0)
Aborting command set because "force" is false and command failed: "create table some_database.some_table..."
My query is quite complex, utilizing UNIONS and transforms, but it runs fine when I submit it using the Hive client. It looks something like this:
create table some_database.some_table
stored as rcfile
as select * from (
from some_other_db.table_1
select transform (*)
using "hdfs:///some/transform/script.py"
as (
some_field_1 string,
some_field_2 double
)
union all
from some_other_db.table_2
select transform (*)
using "hdfs:///some/transform/script.py"
as (
some_field_1 string,
some_field_2 double
)
union all
from some_other_db.table_3
select transform (*)
using "hdfs:///some/transform/script.py"
as (
some_field_1 string,
some_field_2 double
)
) all_unions
;
I'm using:
CDH 4.3.0-1
Hive 0.10.0-cdh4.3.0
Beeline version 0.10.0-cdh4.3.0

Get the sysdate -1 in Hive

Is there any way to get the current date minus 1 in Hive, i.e. always yesterday's date?
And in this format: 20120805?
I can run my query like this to get the data for yesterday's date, as today is Aug 6th:
select * from table1 where dt = '20120805';
But when I tried it this way with the date_sub function to get yesterday's date (the table below is partitioned on the date column dt):
select * from table1 where dt = date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(),
'yyyyMMdd')) , 1) limit 10;
It looks for the data in all the partitions. Why? Am I doing something wrong in my query?
How can I make the evaluation happen in a subquery so that the whole table isn't scanned?
Try something like:
select * from table1
where dt >= from_unixtime(unix_timestamp()-1*60*60*24, 'yyyyMMdd');
This works if you don't mind that Hive scans the entire table. from_unixtime is not deterministic, so the query planner in Hive won't optimize for you. In many cases (for example, log files), not specifying a deterministic partition key can cause a very large Hadoop job to start, since it will scan the whole table, not just the rows with the given partition key.
If this matters to you, you can launch Hive with an additional option:
$ hive -hiveconf date_yesterday=20150331
And in the script or hive terminal use
select * from table1
where dt >= ${hiveconf:date_yesterday};
The name of the variable doesn't matter, nor does the value; you can set them in this case to get the prior date using Unix commands. In the specific case of the OP:
$ hive -hiveconf date_yesterday=$(date --date yesterday "+%Y%m%d")
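Putting the two pieces together, a minimal end-to-end sketch (table1 and its dt partition column are the ones from the question):
$ hive -hiveconf date_yesterday=$(date --date yesterday "+%Y%m%d") \
       -e 'select * from table1 where dt >= ${hiveconf:date_yesterday} limit 10;'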
In MySQL:
select DATE_FORMAT(curdate()-1,'%Y%m%d');
In SQL Server:
SELECT convert(varchar,getDate()-1,112)
In Hive, use this query:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()-1*24*60*60,'yyyyMMdd');
It looks like DATE_SUB assumes a date in the format yyyy-MM-dd, so you might have to do some more format manipulation to get to your format. Try this:
select * from table1
where dt = FROM_UNIXTIME(
UNIX_TIMESTAMP(
DATE_SUB(
FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')
, 1)
)
, 'yyyyMMdd') limit 10;
Use this:
select * from table1 where dt = date_format(concat(year(date_sub(current_timestamp,1)),'-', month(date_sub(current_timestamp,1)), '-', day(date_sub(current_timestamp,1))), 'yyyyMMdd') limit 10;
This will give a deterministic result (a string) of your partition.
I know it's super verbose.
