I'm trying to write a DataFrame back into an Oracle DB (I'm using Oracle Express 12c for development).
I'm writing the DataFrame like this:
val options = Map(
  "url" -> "jdbc:oracle:thin:username/password@host:1521:database",
  "driver" -> "oracle.jdbc.driver.OracleDriver",
  "dbtable" -> "myNewTab"
)
dataframe.write.format("jdbc").options(options).mode("overwrite").save()
My DataFrame has string and integer columns with the following schema:
StructField(CustomerID,IntegerType,true)
StructField(Gender,StringType,true)
StructField(Age,IntegerType,true)
StructField(Annual_Income__k__,IntegerType,true)
StructField(Spending_Score__1_100_,IntegerType,true)
After the Spark job completes successfully, when I inspect the written table in Oracle, I see that the integer columns are all NaN.
So I inspected the JDBC schema that would be generated for the table with JdbcUtils.schemaString(dataframe, url, None), and I see that the integer columns are mapped to Oracle's NUMBER(10) type:
"CustomerID" NUMBER(10) , "Gender" VARCHAR2(255) , "Age" NUMBER(10) , "Annual_Income__k__" NUMBER(10) , "Spending_Score__1_100_" NUMBER(10)
Is there anything I'm missing here when using the Oracle JDBC driver with Spark?
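For what it's worth, one way to check whether the NUMBER(10) columns really hold NULLs, rather than the DB client just rendering them as NaN, is to read the table back through the same JDBC options. This is only a minimal sketch, assuming a SparkSession named spark and the options map defined above:

// Read the freshly written table back over JDBC.
val readBack = spark.read
  .format("jdbc")
  .options(options)
  .load()

readBack.printSchema()
// count(col) counts only non-null values, so a gap versus count(*)
// means the integer values really were written as NULL.
readBack.selectExpr("count(*)", "count(CustomerID)", "count(Age)").show()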
Related
We created a Hive external table using the Elasticsearch storage handler, as shown below:
CREATE EXTERNAL TABLE DEFAULT.ES_TEST (
REG_DATE STRING
, STR1 STRING
, STR2 STRING
, STR3 STRING
, STR4 STRING
, STR5 STRING
)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.resource' = 'log-22-20210120'
, 'es.nodes' = '1.2.3.4'
, 'es.port' = '9201'
, 'es.mapping.date.rich' = 'false'
);
Then we tried to load the ES data into a Hive managed table like this:
insert overwrite table elastic.es_log_tab partition(part_log_date)
select *
, current_timestamp()
, from_unixtime(unix_timestamp(reg_date), 'yyyyMMdd')
from DEFAULT.ES_TEST;
When the ES data for a given date amounts to about 65 GB, the load took roughly 10 hours, i.e. about 1.1M rows per minute (670M rows in total).
In order to get better loading performance in this case, are there any further recommendations or checkpoints? How about increasing the number of mappers? Currently the job runs with 16 mappers; should we expect it to get faster with more mappers?
Please share your thoughts and previous experience with me.
I have created an external table in Qubole (Hive) which reads Parquet (Snappy-compressed) files from S3, but on performing a SELECT * FROM table_name I am getting null values for all columns except the partition column.
I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue.
And on removing the property 'serialization.format' = '1', I get ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.
I checked the parquet files and was able to read the data using parquet-tools:
**file_01.snappy.parquet:**
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}
**External table stmt:**
CREATE EXTERNAL TABLE parquet_test
(
col2 int,
col3 string
)
PARTITIONED BY (col1 date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1 col_2 col_3
5/3/19 NULL NULL
5/4/19 NULL NULL
5/5/19 NULL NULL
5/6/19 NULL NULL
Expected Result:
col_1 col_2 col_3
5/3/19 1234 ABC
5/4/19 124 FHK
5/5/19 12515 UPO
5/6/19 1234 ABC
I'm writing the answer below assuming that the table was created using Hive and read using Spark (since the question is tagged with apache-spark-sql).
How was the data created?
Spark supports case-sensitive schemas.
When we use the DataFrame APIs, it is possible to write data with a case-sensitive schema.
Example:
scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> val df =spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
|-- iD: integer (nullable = true)
|-- NaMe: string (nullable = true)
Notice that in the above example, case sensitivity is preserved.
When we create a Hive table on top of the data written by Spark, Hive is able to read it correctly, since Hive is not case sensitive.
Whereas when the same data is read using Spark, it uses the schema from Hive, which is lowercase by default, and the rows returned are null.
To overcome this, Spark has introduced a config spark.sql.hive.caseSensitiveInferenceMode.
object HiveCaseSensitiveInferenceMode extends Enumeration {
val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
.doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
"table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
"formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
"any table backed by files containing case-sensitive field names or queries may not return " +
"accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
"case-sensitive schema from the underlying data files and write it back to the table " +
"properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
"properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
"instead of inferring).")
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
INFER_AND_SAVE - Spark infers the schema and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this)
If the value of the property is NOT either INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the parquet files.
The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
We could check the following to see if the problem is related to schema case sensitivity (see the sketch after this list):
1. The value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this)
2. Whether the data was created using Spark
3. If 2 is true, check whether the schema is case sensitive (spark.read.parquet(<location>).printSchema)
4. If 3 uses a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, run spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table and try to read the data from Spark.
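A minimal sketch of these checks, assuming a SparkSession named spark and the location/table names from the question:

// 1. Current value of the inference mode
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode").show(false)

// 3. Does the underlying Parquet data carry a case-sensitive schema?
spark.read.parquet("s3://path_to_parquet").printSchema()

// 4. Switch to INFER_AND_SAVE and re-read (drop and recreate the table first
//    if a lowercase schema was already saved in the metastore)
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE")
spark.table("parquet_test").show()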
How do I convert a non-partitioned table (with no primary key) to a partitioned table? Someone says I can use rowid, but I cannot find any sample in the Oracle docs.
My Oracle is 12c Release 1, which does not include the newer feature of using the MODIFY clause of ALTER TABLE to convert a table online to a partitioned table.
Please provide a sample if you can.
"Someone says can use rowed, but I can not find any sample from oracle doc"
I think the option you are looking for is the DBMS_REDEFINITION.START_REDEF_TABLE parameter options_flag.
Like this:
begin
  dbms_redefinition.start_redef_table(
    uname        => 'your_schema',
    orig_table   => 'your_current_table',
    int_table    => 'your_interim_table',
    options_flag => dbms_redefinition.cons_use_rowid
  );
end;
/
You can find out more in the Oracle documentation for DBMS_REDEFINITION.
Below is my program to connect to Oracle using Spark Scala JDBC code:
/usr/hdp/current/spark2-client/bin/spark-shell --driver-memory 5g --executor-memory 5g --jars /usr/hdp/current/sqoop-client/lib/ojdbc7.jar
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//Host:1521/QAM").option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "(select * from MGPH.APPLICATION where APPLICATION_ID in (11,12))")
  .option("user", "XXXXXXXXXX").option("password", "xxxxxxxxx").option("fetchsize", "100").load()
dataframe_mysql.show()
Spark Output :
scala> dataframe_mysql.show()
+---+---+--------------------+--------------------+----+----+-----+
| c1| c2|                  c3|                  c4|  c5|  c6|   c7|
+---+---+--------------------+--------------------+----+----+-----+
| 11|  1|Safire.Accumulato...|Safire Accumulato...|true|3346|false|
+---+---+--------------------+--------------------+----+----+-----+
Oracle table structure:
Name                Null?    Type
------------------- -------- -------------
c1                  NOT NULL NUMBER(3)
c2                  NOT NULL NUMBER(2)
c3                  NOT NULL VARCHAR2(50)
c4                  NOT NULL VARCHAR2(500)
c5                  NOT NULL NUMBER(1)
c6                           NUMBER(10)
c7                           NUMBER(1)
Question:
Column c7 in Oracle is NUMBER(1), but Spark JDBC converts it into a boolean type on import.
Please suggest how to avoid true/false and get 0/1 in the DataFrame instead.
Please refer to this article: http://www.ericlin.me/2017/05/oracle-number10-field-maps-to-boolean-in-spark/ and https://github.com/apache/spark/pull/14377. Spark's BooleanType is mapped to Oracle's NUMBER(1) (since a boolean type is not available in Oracle). So you will have to handle this after reading your table into Spark, by casting or by using the boolean value in your Spark transformation.
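A minimal sketch of the casting approach, using the dataframe_mysql DataFrame from the question (c7 is the obfuscated NUMBER(1) column name shown above):

import org.apache.spark.sql.functions.col

// Cast the Boolean column that Spark inferred from Oracle's NUMBER(1)
// back to an integer after loading; true/false become 1/0.
val fixed = dataframe_mysql.withColumn("c7", col("c7").cast("int"))
fixed.printSchema()
fixed.show()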
I solved this issue by using Oracle's to_char function in the JDBC select query for any boolean-mapped column.
Below is the code used (to_char(HEARTBEATS_ENABLED)). I tried to_number as well, but it produced results like 1.0000, so I used to_char to achieve the desired result.
val result = sqlcontext.read.format("jdbc").option("url", "jdbc:oracle:thin:@//Host:1521/QAM").option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "(select to_char(HEARTBEATS_ENABLED) as HEARTBEATS_ENABLED, APPLICATION_ID, APPLICATION_TYPE_ID, NAME, DESCR, to_char(ACTIVE_STAT) as ACTIVE_STAT, PROGRAM_ID from MGPH.APPLICATION where APPLICATION_ID in (11,12))")
  .option("user", "myuser").option("password", "my password").option("fetchsize", "100").load()
result.show()
result.printSchema
I'm trying to retrieve the foreign keys of a given table with JDBC metadata.
For that, I'm using the getImportedKeys method.
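For reference, a minimal sketch of such a getImportedKeys call (the jConnect URL, credentials and driver class below are placeholders, not taken from the original setup):

import java.sql.DriverManager

// jConnect 7 driver class; use com.sybase.jdbc3.jdbc.SybDriver for jConnect 6.
Class.forName("com.sybase.jdbc4.jdbc.SybDriver")
val conn = DriverManager.getConnection("jdbc:sybase:Tds:host:5000/HAWK", "user", "password")
val rs = conn.getMetaData.getImportedKeys("HAWK", "dbo", "cash_mgt_strategy")
while (rs.next()) {
  println(rs.getString("FKCOLUMN_NAME") + " -> " +
    rs.getString("PKTABLE_NAME") + "." + rs.getString("PKCOLUMN_NAME"))
}
rs.close()
conn.close()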
For my table 'cash_mgt_strategy', it gives the following result set:
PKTABLE_CAT : 'HAWK'
PKTABLE_SCHEM : 'dbo'
PKTABLE_NAME : 'fx_execution_strategy_policy'
PKCOLUMN_NAME : 'fx_execution_strategy_policy_id'
FKTABLE_CAT : 'HAWK'
FKTABLE_SCHEM : 'dbo'
FKTABLE_NAME : 'cash_mgt_strategy'
FKCOLUMN_NAME : 'fx_est_execution_strategy_policy'
KEY_SEQ : '1'
UPDATE_RULE : '1'
DELETE_RULE : '1'
FK_NAME : 'fk_fx_est_execution_strategy_policy'
PK_NAME : 'cash_mgt_s_1283127861'
DEFERRABILITY : '7'
The problem is that FKCOLUMN_NAME = 'fx_est_execution_strategy_policy' is not a real column of my table; it seems to be truncated (missing "_id" at the end).
When using an official Sybase SQL client (Sybase WorkSpace), displaying the DDL of the table gives the following for this constraint / foreign key:
ALTER TABLE dbo.cash_mgt_strategy ADD CONSTRAINT fk_fx_est_execution_strategy_policy FOREIGN KEY (fx_est_execution_strategy_policy_id)
REFERENCES HAWK.dbo.fx_execution_strategy_policy (fx_execution_strategy_policy_id)
So I'm wondering how to retrieve the full FKCOLUMN_NAME?
Note that I'm using jConnect 6.0.
I've also tested with jConnect 7.0 and got the same problem.
Thanks
You haven't provided your ASE version, so I'm going to assume the following:
- the dataserver was running ASE 12.x at some point (descriptor names limited to 30 characters)
- the dataserver was upgraded to ASE 15.x/16.x (descriptor names extended to 255 characters)
- the DBA failed to upgrade/update the sp_jdbc* procs after the upgrade to ASE 15.x/16.x (hence the old ASE 12.x versions of the procs, with descriptors limited to 30 characters, are still in use in the dataserver)
If the above is true then sp_version should show the older versions of the jdbc procs running in the dataserver.
The (obvious) solution would be to have the DBA load the latest version of the jdbc stored procs (typically found under ${SYBASE}/jConnect*/sp).
NOTE: Probably wouldn't hurt to have the DBA review the output from sp_version to see if there are any other upgrade scripts that need to be loaded (eg, installmodel, installsecurity, installcommit, etc).
OK, so I've done some searching on my DB server and I've found the code of the stored proc sp_jdbc_importkey. In this code I can see:
create table #jfkey_res(
PKTABLE_CAT varchar(32) null,
PKTABLE_SCHEM varchar(32) null,
PKTABLE_NAME varchar(257) null,
PKCOLUMN_NAME varchar(257) null,
FKTABLE_CAT varchar(32) null,
FKTABLE_SCHEM varchar(32) null,
FKTABLE_NAME varchar(257) null,
FKCOLUMN_NAME varchar(257) null,
KEY_SEQ smallint,
UPDATE_RULE smallint,
DELETE_RULE smallint,
FK_NAME varchar(257),
PK_NAME varchar(257) null)
create table #jpkeys(seq int, keys varchar(32) null)
create table #jfkeys(seq int, keys varchar(32) null)
The temporary tables #jpkeys and #jfkeys, which are used to store the column names (for PK and FK), are typed as varchar(32) instead of varchar(257)!
Now I need to find out how to patch/update these stored procs.