Create New Rows from Oracle CLOB and Write to HDFS - oracle

In an Oracle database, I can read this table containing a CLOB type (note the newlines):
ID   MY_CLOB
001  500,aaa,bbb
     500,ccc,ddd
     480,1,2,bad
     500,eee,fff
002  777,0,0,bad
003  500,yyy,zzz
I need to process this and import it into an HDFS table, with a new row for each MY_CLOB line starting with "500,". In this case, the Hive table should look like:
ID   C_1  C_2  C_3
001  500  aaa  bbb
001  500  ccc  ddd
001  500  eee  fff
003  500  yyy  zzz
This solution to my previous question succeeds in producing this on Oracle. But writing the result to HDFS with a Python driver is very slow, or never succeeds.
Following this solution, I've tested a similar regex + pyspark solution that might work for my purposes:
import re
import cx_Oracle
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
#... query = """SELECT ID, MY_CLOB FROM oracle_table"""
#... cx_oracle_results <--- fetchmany results (batches) from query
def clob_to_rows(record_id, clob_text):
    # re.findall returns a list of (C_1, C_2, C_3) tuples, one per matching line.
    matches = re.findall(r"^(500),(.*),(.*)", clob_text, re.MULTILINE)
    return [Row(ID=record_id, C_1=c1, C_2=c2, C_3=c3) for c1, c2, c3 in matches]
# Process each batch of results and write to hive as parquet
for batch in cx_oracle_results():
    # batch is like [(1,<cx_oracle object>), (2,<cx_oracle object>), (3,<cx_oracle object>)]
    # After `.read()` it looks like [(1,"500,a,b\n500,c,d"), (2,"500,e,e"), (3,"500,z,y\n480,-1,-1")]
    # LOB.read() converts each cx_Oracle CLOB object to text; this has to happen on the driver.
    rows = [row for rec_id, clob in batch for row in clob_to_rows(rec_id, clob.read())]
    if rows:
        spark.createDataFrame(rows).write.mode("append").parquet("myschema.pfile")
But reading oracle cursor results and feeding them into pyspark this way doesn't work well.
I'm trying to run a sqoop job generated by another tool, importing the CLOB as text, in the hope that I can process the sqooped table into a new Hive table like the one above in reasonable time, perhaps with a pyspark solution similar to the above.
Unfortunately, this sqoop job doesn't work.
sqoop import -Doraoop.timestamp.string=false -Doracle.sessionTimeZone=America/Chicago
-Doraoop.import.hint=" " -Doraoop.oracle.session.initialization.statements="alter session disable parallel query;"
-Dkite.hive.tmp.root=/user/hive/kite_tmp/wassadamo --verbose
--connect jdbc:oracle:thin:@ldap://connection/string/to/oracle
--num-mappers 8 --split-by date_column
--query "SELECT * FROM (
SELECT ID, MY_CLOB
FROM oracle_table
WHERE ROWNUM <= 1000
) WHERE \$CONDITIONS"
--create-hive-table --hive-import --hive-overwrite --hive-database my_db
--hive-table output_table --as-parquetfile --fields-terminated-by \|
--delete-target-dir --target-dir $HIVE_WAREHOUSE --map-column-java=MY_CLOB=String
--username wassadamo --password-file /user/wassadamo/.oracle_password
But I get an error (snippet below):
20/07/13 17:04:08 INFO mapreduce.Job: map 0% reduce 0%
20/07/13 17:05:08 INFO mapreduce.Job: Task Id : attempt_1594629724936_3157_m_000001_0, Status : FAILED
Error: java.io.IOException: SQLException in nextKeyValue
...
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string
This seems to have been caused by mapping the CLOB column to string. I did this based on this answer.
How can I fix this? I'm open to a different pyspark solution as well.

Partial answer: the Oracle error seems to have been due to
--split-by date_column
This date_column is an Oracle DATE type, and it turns out that splitting on it doesn't work when sqooping from Oracle. It would be nice to be able to split on this, but splitting on ID (VARCHAR2) seems to be working.
The issue of efficiently parsing the text MY_CLOB field and creating new rows for each line remains.
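One possible way to approach the remaining parsing step, sketched in plain Python. It assumes each sqooped record arrives as an (ID, MY_CLOB text) pair; the function name explode_clob and the records list are hypothetical, and the sample values are copied from the question. In PySpark, a generator like this could presumably be passed to rdd.flatMap, but its performance on real data has not been measured here.

```python
def explode_clob(record):
    """Yield one (ID, C_1, C_2, C_3) tuple per CLOB line that starts with "500,"."""
    record_id, clob_text = record
    for line in clob_text.splitlines():
        if not line.startswith("500,"):
            continue  # skip lines such as "480,1,2,bad"
        fields = line.split(",")
        if len(fields) == 3:  # keep only well-formed three-field lines
            yield (record_id, *fields)


# Sample data copied from the question's Oracle table.
records = [
    ("001", "500,aaa,bbb\n500,ccc,ddd\n480,1,2,bad\n500,eee,fff"),
    ("002", "777,0,0,bad"),
    ("003", "500,yyy,zzz"),
]

rows = [row for record in records for row in explode_clob(record)]
for row in rows:
    print(row)
```

Splitting line by line avoids building one large regex match list per CLOB, and the generator produces nothing for records such as ID 002 that contain no "500," lines.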

Related

Hive with data that does not have a delimiter

I have some data in HDFS that does not have a delimiter. That is, the individual data fields are identified by their position in the line.
For instance,
CountryXTOWNYCRIMEVALUEZ
So here the country would be positions 0 to 7, the town 8 to 12, and the crime statistic would be 13 to 23.
Is there a way to import data organised like this directly into Hive? I suppose a workable way would be to design a map reduce job that delimits the data, but I was wondering if there is a Hive command that can be used to import the data directly?
RegexSerDe
create external table mytable
(
country string
,town string
,crime_statistic string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
(
'input.regex' = '^(.{8})(.{5})(.*)$'
)
location '/...location of the data...'
;
select * from mytable
;
+----------+-------+-----------------+
| country | town | crime_statistic |
+----------+-------+-----------------+
| CountryX | TOWNY | CRIMEVALUEZ |
+----------+-------+-----------------+
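To check a pattern like this before creating the table, the same regex can be tried locally. The sketch below uses Python's re module rather than Hive, on the assumption that this simple pattern behaves the same in both regex engines; split_fixed_width is a hypothetical helper name.

```python
import re

# Same pattern as the 'input.regex' SerDe property above.
FIXED_WIDTH = re.compile(r"^(.{8})(.{5})(.*)$")


def split_fixed_width(line):
    """Split a line into (country, town, crime_statistic) by character position."""
    match = FIXED_WIDTH.match(line)
    if match is None:
        raise ValueError(f"line too short for the fixed-width layout: {line!r}")
    return match.groups()


fields = split_fixed_width("CountryXTOWNYCRIMEVALUEZ")
print(fields)  # ('CountryX', 'TOWNY', 'CRIMEVALUEZ')
```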

Set variables in HIVE query

I am trying to follow the post here to set a variable in my Hive query. Assume I have the following file in HDFS:
/home/hduser/test/hr.txt
Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790
I then created my schema on top of the data as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';
I want to create 4 tiles for the table but I don't want to hardcode the number of tiles and instead want to pass it in as a variable. My code is below:
SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
salary,
NTILE(${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
However, this throws an error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression
I tried to use quotes when calling the variable, as in '${hiveconf:q1}', but that didn't seem to help. If I hardcode the number of tiles (which I am trying to avoid), the workflow will go something like this:
SELECT lname,
salary,
NTILE(4) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
which yields
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Thoughts?
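A likely reason for the error is that Hive variable substitution is purely textual: SET stores the query text, not its result. The toy model below only mimics that behaviour in Python; substitute_hiveconf is a hypothetical function, not Hive's implementation.

```python
import re


def substitute_hiveconf(query, conf):
    """Replace each ${hiveconf:name} with the stored string; nothing is executed."""
    return re.sub(r"\$\{hiveconf:(\w+)\}", lambda m: conf[m.group(1)], query)


# SET q1=select ... stores the query text itself, not its result.
conf = {"q1": "select ceiling(count(*)/2) from employees"}
expanded = substitute_hiveconf(
    "NTILE(${hiveconf:q1}) OVER (ORDER BY salary DESC)", conf
)
print(expanded)
```

The expanded text passes a SELECT statement to NTILE instead of an integer, which matches the "Number of tiles must be an int expression" message.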
When there isn't a documented way, one can use documented features to provide a clean enough hack :)
Here's my attempt, using dfs commands from hive, shell commands from hive, the source command and whatnot. I guess it might not work out of the box with queries through HiveServer2. I would be glad if there were a prettier way.
Let's go
Basic setup
SET EMPLOYEE_TABLE_LOCATION=/home/hduser/test/;
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${hiveconf:EMPLOYEE_TABLE_LOCATION}';
SET PATH_TO_SETTINGS_FILE=hdfs:/tmp/query_to_setting;
SET FILENAME_ON_LOCAL_FS=query_to_setting.sql;
Generate a file in hdfs
with content "SET q1=<the-query-result>;"
CREATE TABLE query_to_setting_table
LOCATION '${hiveconf:PATH_TO_SETTINGS_FILE}'
AS
SELECT concat('SET q1=', ceiling(count(*)/2),'\;') from employees;
Source in the generated file as any sql-file.
First put the file to local fs since 'source' only operates on local disk...
dfs -get ${hiveconf:PATH_TO_SETTINGS_FILE}/000000_0 ${hiveconf:FILENAME_ON_LOCAL_FS};
source ${hiveconf:FILENAME_ON_LOCAL_FS};
Try the setting
hive> SET q1;
q1=4
Use the setting in a query
hive> SELECT lname,
salary,
NTILE( ${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
OK
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Optional cleanup
!rm ${hiveconf:FILENAME_ON_LOCAL_FS};
DROP TABLE query_to_setting_table;
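An alternative sketch, not part of the answer above: compute the tile count outside Hive and pass it in with --hiveconf when launching the query script. The helper names and the script name ntile_query.hql are hypothetical, and the row count is assumed to come from a separate count query (for example via hive -S -e), which is not run here.

```python
import math
import shlex


def tile_count(row_count):
    """Same arithmetic as ceiling(count(*)/2) in the question."""
    return math.ceil(row_count / 2)


def hive_command(tiles, script_path):
    """Build a hive CLI invocation that sets q1 before running the script."""
    return ["hive", "--hiveconf", f"q1={tiles}", "-f", script_path]


# The 7 rows of hr.txt give ceil(7 / 2) = 4 tiles.
command = hive_command(tile_count(7), "ntile_query.hql")  # hypothetical script name
print(shlex.join(command))
```

The script would then refer to ${hiveconf:q1} exactly as in the query above, without the intermediate table and local file.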

How to import data from a hbase table to hive table?

I've created an HBase table like this:
create 'student','personal'
and I've put some data into it like this.
ROW COLUMN+CELL
1 column=personal:age, timestamp=1456224023454, value=20
1 column=personal:name, timestamp=1456224008188, value=pesronA
2 column=personal:age, timestamp=1456224891317, value=13
2 column=personal:name, timestamp=1456224868967, value=pesronB
3 column=personal:age, timestamp=1456224935178, value=21
3 column=personal:name, timestamp=1456224921246, value=personC
4 column=personal:age, timestamp=1456224951789, value=20
4 column=personal:name, timestamp=1456224961845, value=personD
5 column=personal:age, timestamp=1456224983240, value=20
5 column=personal:name, timestamp=1456224972816, value=personE
I want to import this data into a Hive table. I wrote a Hive query for that, like this:
CREATE TABLE hbaseStudent(key INT,name STRING,age INT) STORED BY'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal:age,personal:name") TBLPROPERTIES("hbase.table.name" = "student")
But when I execute the query, I get this error:
Driver returned: 1. Errors: OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hbase/HBaseConfiguration
What should I do?
I tried this and it worked: replace all the double quotes (") with single quotes ('), and add the terminating ; at the end of the last line.

Error in getting data from Oracle to hive using sqoop

I am running the following sqoop query:
sqoop import --connect jdbc:oracle:thin:@ldap://oid:389/ewsop000,cn=OracleContext,dc=****,dc=com \
--table ngprod.ewt_payment_ng --where "d_last_updt_ts >= to_timestamp('11/01/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" \
AND "d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" --username ***** --P \
--columns N_PYMNT_ID,D_last_updt_Ts,c_pymnt_meth,c_rcd_del,d_Create_ts \
--hive-import --hive-table payment_sample_table2
The schema for table payment_sample_table2 already exists in Hive. The import runs fine if I do not use
AND "d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')"
Can someone tell me why, or if there's any other way to get the range of data?
Please specify the exact error. In any case, put the "AND ..." inside the same double quotes, on the same line as the preceding part of the --where clause. As shown above, the command line is badly formatted; this has nothing to do with the actual query.
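To illustrate the quoting advice, here is a small Python sketch that joins both predicates into one --where argument before the command is assembled. The helper sqoop_where_args is hypothetical; the predicates are copied from the question.

```python
import shlex


def sqoop_where_args(conditions):
    """Join all predicates into a single --where argument."""
    return ["--where", " AND ".join(conditions)]


fmt = "'MM/DD/YYYY HH:MI:SS.FF6 AM'"
conditions = [
    f"d_last_updt_ts >= to_timestamp('11/01/2013 11:59:59.999999 PM', {fmt})",
    f"d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', {fmt})",
]
args = sqoop_where_args(conditions)
print(shlex.join(args))  # one quoted string follows --where
```

Because the whole condition is one list element, the shell (or subprocess) hands sqoop a single value for --where instead of treating AND as a separate argument.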

Oracle Merge, not logging errors

I'm merging several tables in Oracle 10g, into a consolidated table, like this:
table_A (will have all the records)
table_b -part of the data to be merged
table_c -part of the data to be merged
table_d -part of the data to be merged
Now, I run it with error logging like this:
MERGE INTO TABLE_A A USING (SELECT * FROM TABLE_B) B
ON
(
A.NOMBRE=B.NOMBRE AND
A.PRIMER_APELLIDO=B.PRIMER_APELLIDO AND
A.SEGUNDO_APELLIDO=B.SEGUNDO_APELLIDO AND
TO_CHAR(A.FECHA_NACIMIENTO,'DD/MM/YYYY')=TO_CHAR(B.FECHA_NACIMIENTO,'DD/MM/YYYY') AND
A.SEXO=B.SEXO
)
WHEN MATCHED THEN
UPDATE SET DGP2011='1'
WHEN NOT MATCHED THEN
INSERT
(
A.FOLIO_RELACIONADO,
A.CVE_PROGRAMA,
A.FECHA_ALTA,
A.PRIMER_APELLIDO,
A.SEGUNDO_APELLIDO,
A.NOMBRE,
A.FECHA_NACIMIENTO,
A.SEXO,
A.CVE_NACIONALIDAD,
A.CVE_ENTIDAD_NACIMIENTO,
A.CVE_GRADO_ESCOLAR,
A.CVE_GRADO_ESTUDIOS,
A.CURP,
A.CALLE,
A.NUM_EXT,
A.NUM_INT,
A.CODIGO_POSTAL,
A.ENTRE_CALLE,
A.Y_CALLE,
A.OTRA_REFERENCIA,
A.TELEFONO,
A.COLONIA,
A.LOCALIDAD,
A.CVE_MUNICIPIO,
A.CVE_ENTIDAD_FEDERATIVA,
A.CVE_CCT,
A.PRIMER_APELLIDO_C,
A.SEGUNDO_APELLIDO_C,
A.NOMBRE_C,
A.FECHA_NACIMIENTO_C,
A.SEXO_C,
A.CVE_ESTADO_CIVIL_C,
A.CVE_GRADO_ESTUDIOS_C,
A.CVE_PARENTESCO_C,
A.CURP_C,
A.CVE_TIPO_ID_OFCL_C,
A.ID_DOCTO_OFL_C,
A.CVE_NACIONALIDAD_C,
A.CVE_ENTIDAD_NACIMIENTO_C,
A.CALLE_C,
A.NUM_EXT_C,
A.NUM_INT_C,
A.CODIGO_POSTAL_C,
A.ENTRE_CALLE_C,
A.Y_CALLE_C,
A.OTRA_REFERENCIA_C,
A.TELEFONO_C,
A.COLONIA_C,
A.LOCALIDAD_C,
A.CVE_MUNICIPIO_C,
A.CVE_ENTIDAD_FEDERATIVA_C,
A.E_MAIL_C,
A.DGP2011
)
VALUES
(
B.FOLIO_RELACIONADO,
B.CVE_PROGRAMA,
B.FECHA_ALTA,
B.PRIMER_APELLIDO,
B.SEGUNDO_APELLIDO,
B.NOMBRE,
TO_CHAR(B.FECHA_NACIMIENTO,'DD/MM/YYYY'),
B.SEXO,
B.CVE_NACIONALIDAD,
B.CVE_ENTIDAD_NACIMIENTO,
B.CVE_GRADO_ESCOLAR,
B.CVE_GRADO_ESTUDIOS,
B.CURP,
B.CALLE,
B.NUM_EXT,
B.NUM_INT,
B.CODIGO_POSTAL,
B.ENTRE_CALLE,
B.Y_CALLE,
B.OTRA_REFERENCIA,
B.TELEFONO,
B.COLONIA,
B.LOCALIDAD,
B.CVE_MUNICIPIO,
B.CVE_ENTIDAD_FEDERATIVA,
B.CVE_CCT,
B.PRIMER_APELLIDO_C,
B.SEGUNDO_APELLIDO_C,
B.NOMBRE_C,
TO_CHAR(B.FECHA_NACIMIENTO_C,'DD/MM/YYYY'),
B.SEXO_C,
B.CVE_ESTADO_CIVIL_C,
B.CVE_GRADO_ESTUDIOS_C,
B.CVE_PARENTESCO_C,
B.CURP_C,
B.CVE_TIPO_ID_OFCL_C,
B.ID_DOCTO_OFL_C,
B.CVE_NACIONALIDAD_C,
B.CVE_ENTIDAD_NACIMIENTO_C,
B.CALLE_C,
B.NUM_EXT_C,
B.NUM_INT_C,
B.CODIGO_POSTAL_C,
B.ENTRE_CALLE_C,
B.Y_CALLE_C,
B.OTRA_REFERENCIA_C,
B.TELEFONO_C,
B.COLONIA_C,
B.LOCALIDAD_C,
B.CVE_MUNICIPIO_C,
B.CVE_ENTIDAD_FEDERATIVA_C,
B.E_MAIL_C,
'1'
)LOG ERRORS INTO ELOG_SEGURO_ESCOLAR REJECT LIMIT UNLIMITED;
and it just raises the error "ORA-01722: invalid number", and Toad highlights the 'A.' part of the query.
Now about the tables:
table A has all the fields in varchar2 (4000)
table b to d have formatting according to the data they hold (date, number, etc.)
The thing is, even with the error logging clause, it raises the error and doesn't merge anything!
Plus, I have no idea what I should be looking for to find the 'invalid number' field.
Any advice would be deeply appreciated.
Found it!
It was the TO_CHAR(A.FECHA_NACIMIENTO,'DD/MM/YYYY') line. I changed the comparison to
A.FECHA_NACIMIENTO=B.FECHA_NACIMIENTO and it worked. Thanks anyway!

Resources