Spark avro insertInto file extension - hadoop

I have an external Hive table based on Avro:
| CREATE EXTERNAL TABLE `temp_avro`( |
| `string1` string COMMENT '') |
| PARTITIONED BY ( |
| `string2` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://xxx/xxx/temp_avro' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1503938718') |
I'm trying to write to this table using Spark as follows:
SELECT_0_0.toDF().write.mode("append").insertInto("temp_avro")
With this, the Avro files are created in the HDFS location without the .avro extension (with names such as part-00001, part-00002, and so on). Is there a way to give the file names a .avro extension?

Try using coalesce to combine the output into a single part before saving your results:
SELECT_0_0.toDF().coalesce(1).write.mode("append").insertInto("temp_avro")
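
Note that coalesce(1) only reduces the number of output files; it does not by itself change their names. If the extension matters to downstream tools, one option is to rename the part files after the write. The sketch below is a minimal illustration against a local directory standing in for the table location. The helper name add_avro_extension is hypothetical, and on a real cluster the same loop would go through an HDFS client (for example hdfs dfs -mv) rather than pathlib.

```python
from pathlib import Path


def add_avro_extension(directory):
    """Rename extension-less part-* files under `directory` to part-*.avro.

    Returns the new file names in sorted path order. Marker files such as
    _SUCCESS are left untouched because they do not start with "part-".
    """
    renamed = []
    # sorted() materialises the listing first, so renaming is safe while looping
    for path in sorted(Path(directory).rglob("part-*")):
        if path.is_file() and path.suffix == "":
            target = path.with_name(path.name + ".avro")
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Running the helper a second time is a no-op, since files that already carry an extension are skipped.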

Related

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, but for some reason it does not drop the partition. I'm not sure what's wrong.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Before and After picture of partitions:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but does not drop the partition. The table is internal/managed.
I tried different approaches mentioned on Stack Overflow, but none of them works for me.
Help.
You don't need to set a variable. You can drop the partition directly in SQL:
Alter table prod_db.products
drop partition (load_date= current_date());
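
A likely reason the original attempt is a silent no-op: Hive variable substitution is purely textual, so ${current_date} expands to the literal text current_date() and the statement targets a partition named load_date='current_date()', which does not exist. The sketch below mimics that substitution in simplified form and shows one alternative, computing the date on the client and embedding it as a literal. Both helper names are hypothetical, and IF EXISTS is added here as an assumption to make the intent explicit.

```python
import re
from datetime import date


def substitute_hivevars(sql, hivevars):
    """Simplified stand-in for Hive's textual ${name} substitution."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: hivevars.get(m.group(1), m.group(0)),
        sql,
    )


def drop_partition_sql(table, day):
    """Build a DROP PARTITION statement with the date as a string literal."""
    return (
        f"ALTER TABLE {table} "
        f"DROP IF EXISTS PARTITION (load_date='{day.isoformat()}');"
    )


template = "ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');"

# What the question's script effectively runs: the function call is never evaluated
print(substitute_hivevars(template, {"current_date": "current_date()"}))

# Alternative: resolve the date before the statement reaches Hive
print(drop_partition_sql("prod_db.products", date.today()))
```

The generated statement can then be passed to whatever client is already in use (beeline, a JDBC connection, and so on).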

Timestamp is different for the same table in hive-cli & presto-cli

I am getting different timestamps for the same table in hive-cli & presto-cli.
table structure for the table:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `mea_req_201`( |
| `mer_id` bigint, |
| `mer_from_dttm` timestamp, |
| `mer_to_dttm` timestamp, |
| `_c0` bigint, |
| `a_number` string, |
| `b_number` string, |
| `time_stamp` timestamp, |
| `duration` bigint) |
| PARTITIONED BY ( |
| `partition_col` bigint) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://hadoop5:8020/warehouse/tablespace/external/hive/mea_req_201' |
| TBLPROPERTIES ( |
| 'TRANSLATED_TO_EXTERNAL'='TRUE', |
| 'bucketing_version'='2', |
| 'external.table.purge'='TRUE', |
| 'spark.sql.create.version'='2.4.0.7.1.4.0-203', |
| 'spark.sql.sources.schema.numPartCols'='1', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'transient_lastDdlTime'='1625496239') |
+----------------------------------------------------+
While running from hive-cli the output is:
While running from presto-cli the output is:
In the mer_from_dttm column there is a time difference, but for the other timestamp columns the values are exactly the same. This behaviour is the same when querying through presto-jdbc. I believe this has nothing to do with the time zone, because a time zone issue should affect all timestamp columns, not just one. Please suggest a resolution.
Some Facts:
Presto server version: 0.180
Presto Jdbc version: 0.180
hive.time-zone=Asia/Calcutta
In Presto jvm.config: -Duser.timezone=Asia/Calcutta
Client TimeZone: Asia/Calcutta
Edit 1:
I sorted the query by mer_id to ensure both queries output the same set of rows. However, the erroneous behaviour remains the same.
While Running from hive-cli:
While Running from presto-cli:
Presto 0.180 is really old. It was released in 2017, and many bugs have been fixed since then.
I would suggest you try a recent version. In particular, recent versions of Trino (formerly known as PrestoSQL) have had a lot of work done on the handling of timestamp data.
Use ORDER BY to compare exactly the same rows in each client:
SELECT `mer_id`,`mer_from_dttm`, `mer_to_dttm`, `time_stamp` FROM mea_req_201 ORDER BY `mer_id`;

Unable to query/select data those inserted through Spark SQL

I am trying to insert data into a Hive Managed table that has a partition.
Show create table output for reference.
+--------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `part_test08`( |
| `id` string, |
| `name` string, |
| `baseamount` double, |
| `billtoaccid` string, |
| `extendedamount` double, |
| `netamount` decimal(19,5), |
| `netunitamount` decimal(19,5), |
| `pricingdate` timestamp, |
| `quantity` int, |
| `invoiceid` string, |
| `shiptoaccid` string, |
| `soldtoaccid` string, |
| `ingested_on` timestamp, |
| `external_id` string) |
| PARTITIONED BY ( |
| `productid` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'wasb://blobrootpath/hive/warehouse/db_103.db/part_test08' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1549962363') |
+--------------------------------------------------------------------------------------------------+--+
I am trying to execute a SQL statement to insert records into the partitioned table, as below:
sparkSession.sql("INSERT INTO TABLE db_103.part_test08 PARTITION(ProductId) SELECT reflect('java.util.UUID', 'randomUUID'),stg_name,stg_baseamount,stg_billtoaccid,stg_extendedamount,stg_netamount,stg_netunitamount,stg_pricingdate,stg_quantity,stg_invoiceid,stg_shiptoaccid,stg_soldtoaccid,'2019-02-12 09:06:07.566',stg_id,stg_ProductId FROM tmp_table WHERE part_id IS NULL");
Without the INSERT, running the SELECT query alone returns the data below:
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
|reflect(java.util.UUID, randomUUID)|stg_name|stg_baseamount| stg_billtoaccid|stg_extendedamount|stg_netamount|stg_netunitamount| stg_pricingdate|stg_quantity|stg_invoiceid| stg_shiptoaccid| stg_soldtoaccid|2019-02-12 09:06:07.566|stg_id|stg_ProductId|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
| 4e0b4331-b551-42d...| OLI6| 16.0|2DD4E682-6B4F-E81...| 34.567| 1166.74380| 916.78000|2018-10-18 05:06:22| 13| I1|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 6| P3|
| 8b327a8e-dd3c-445...| OLI7| 16.0|2DD4E682-6B4F-E81...| 34.567| 766.74380| 1016.78000|2018-10-18 05:06:22| 13| I6|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 7| P4|
| c0e14b9a-8d1a-426...| OLI5| 14.6555| null| 34.56| 500.87000| 814.65000|2018-10-11 05:06:22| 45| I4|29B73C4E-846B-E71...|29B73C4E-846B-E71...| 2019-02-12 09:06:...| 5| P1|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
Earlier I was getting an error while inserting into the managed table. After restarting the Hive and Thrift services there is no error when the job runs, but I am not able to see the inserted data when running a SELECT query through beeline or from a program. I can see that the partition, with delta files, was created under hive/warehouse.
I also see warnings like the one below; I am not sure whether they are related to the problem.
Cannot get ACID state for db_103.part_test08 from null
One more note: if I use an external table, it works fine and I can view the data as well.
We are using an Azure HDInsight Spark 2.3 (HDI 4.0 Preview) cluster with the service stack below:
HDFS: 3.1.1
Hive: 3.1.0
Spark2: 2.3.1
Have you added the SET commands below when trying to insert the data?
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I faced a similar issue where I was not able to do any read or write operation; after adding the above properties I was able to query the table.
Since you have not faced any issue with the external table, I am not sure whether this will solve your problem.
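
The table in the question is declared with 'transactional'='true', and the asker notes that the same insert works against an external table. That pattern is consistent with Spark 2.x lacking native support for Hive ACID tables, although nothing in this thread confirms that as the cause. As a small guard, a job could inspect the SHOW CREATE TABLE output before writing. The helper below is a hypothetical sketch that only parses the DDL text; it does not talk to Hive.

```python
import re


def is_transactional(create_table_stmt):
    """Return True if the DDL text sets 'transactional'='true'.

    The closing quote in the pattern keeps it from matching
    'transactional_properties'.
    """
    match = re.search(
        r"'transactional'\s*=\s*'(\w+)'", create_table_stmt, re.IGNORECASE
    )
    return bool(match) and match.group(1).lower() == "true"


ddl = """
TBLPROPERTIES (
  'bucketing_version'='2',
  'transactional'='true',
  'transactional_properties'='default')
"""
if is_transactional(ddl):
    print("ACID table: a plain Spark insert may not be visible to Hive readers")
```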

Hive: Error while fetching data

I tried to query Hive using the query below:
select * from some-table where yyyy = 2018 and mm = 01 and dd = 05
The query ran successfully.
After adding one more filter, on a column of string data type, the following error is generated:
java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.serde2.io.DateWritable cannot be cast to
org.apache.hadoop.io.Text
The error is generated by the serializer-deserializer (SerDe).
Root cause: when you created the table, you probably didn't specify the STORED AS clause. Try describing your table with desc formatted <table name>; you may see something like this:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
This is not good practice: the SerDe is LazySimpleSerDe, the default text SerDe, while the input and output formats are ORC. Create the table using STORED AS ORC, then describe it again; the result should look different this time:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
Try this and you may be able to resolve the issue.
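
The check described above can be automated. The following sketch assumes the pipe-delimited describe output shown in this answer has been captured as text; the function names are hypothetical. It pulls out the three storage fields and flags the case where an ORC input format is paired with a non-ORC SerDe.

```python
STORAGE_KEYS = ("SerDe Library", "InputFormat", "OutputFormat")


def storage_info(describe_output):
    """Extract SerDe Library / InputFormat / OutputFormat from describe rows."""
    info = {}
    for line in describe_output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[0].rstrip(":") in STORAGE_KEYS:
            info[cells[0].rstrip(":")] = cells[1]
    return info


def serde_matches_orc(info):
    """True when an ORC input format uses the ORC SerDe, or the table is not ORC."""
    uses_orc = info.get("InputFormat", "").endswith("OrcInputFormat")
    return not uses_orc or info.get("SerDe Library", "").endswith("OrcSerde")


sample = """
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
"""
print(serde_matches_orc(storage_info(sample)))
```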

AvroSerDe: Encountered exception determining schema

I'm moving data from one Hive table to another using Spark after applying some transformations. While doing so, I get an exception which, as I understand it, says the avsc file can't be read, but I can see the avsc file in HDFS.
Please advise what the reason could be.
The source Hive table is defined as:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE TABLE `exchr`( |
| `exchr_sk` bigint, |
| `rec_ctry_cd` string, |
| `cob_dt` date, |
| `ccy_cd` string, |
| `exchr_val` decimal(10,5)) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://xxxx/EXCHR' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hive', |
| 'last_modified_time'='1500408192', |
| 'numFiles'='0', |
| 'numRows'='3', |
| 'rawDataSize'='73', |
| 'totalSize'='0', |
| 'transient_lastDdlTime'='1501780655') |
+----------------------------------------------------+--+
The target Hive table is defined as:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE EXTERNAL TABLE `fx_rate`( |
| `stamp` string COMMENT '', |
| `curr_code` string COMMENT '', |
| `fx_rate` double COMMENT '') |
| PARTITIONED BY ( |
| `cb_dt` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://xxxx/fx_rate' |
| TBLPROPERTIES ( |
| 'avro.schema.url'='hdfs://xxxx/fx_rate.avsc', |
| 'transient_lastDdlTime'='1502313078') |
+----------------------------------------------------+--+
When I try to insert data into this table using Spark with the code below, I get the error "Encountered exception determining schema. Returning signal schema to indicate problem".
Spark code:
val TABLE_2_0 = sqlContext.sql("select * from xxxx.exchr");
val SELECT_1_0 = TABLE_2_0.select(TABLE_2_0.col("*"));
val SELECT_0_0 = SELECT_1_0.select(date_format(SELECT_1_0.col("cob_dt"), "yyyymmdd").as("stamp"), lit(null).as("curr_code"), round(lit(10)/SELECT_1_0.col("exchr_val"),10).as("fx_rate"), date_format(SELECT_1_0.col("cb_dt"), "yyyymmdd").as("cb_dt")).limit(1000);
SELECT_0_0.toDF().write.mode("append").insertInto("xxxx.fx_rate")
Exception trace:
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:381)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:154)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:135)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:172)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.newSerializer(InsertIntoHiveTable.scala:59)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.outputClass$lzycompute(InsertIntoHiveTable.scala:53)
17/08/10 23:46:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 3, myhost.net, executor 2): org.apache.hadoop.hive.serde2.SerDeException: Encountered exception determining schema. Returning signal schema to indicate problem: null
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:523)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:97)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:88)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:81)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:92)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
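
The trace shows the failure happening while AvroSerDe fetches the schema through avro.schema.url (AvroSerdeUtils.getSchemaFromFS). One workaround to try is embedding the schema in the table with the avro.schema.literal property, so that no file system lookup is needed; whether this resolves this particular NullPointerException is not confirmed anywhere in this thread. The contents of fx_rate.avsc are not shown in the question, so the sketch below rebuilds a schema from the table's non-partition columns. The record name, the nullable unions, and the helper name are all assumptions.

```python
import json


def avro_literal_ddl(table, record_name, fields):
    """Build an ALTER TABLE statement that sets avro.schema.literal.

    `fields` is a list of (column_name, avro_type) pairs. Each column is
    declared as a nullable union, which is an assumption, not something
    taken from the original .avsc file.
    """
    schema = {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": name, "type": ["null", avro_type], "default": None}
            for name, avro_type in fields
        ],
    }
    literal = json.dumps(schema, separators=(",", ":"))
    # Single quotes delimit the HiveQL string, so escape any inside the JSON
    literal = literal.replace("'", "\\'")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('avro.schema.literal'='{literal}');"
    )


print(
    avro_literal_ddl(
        "xxxx.fx_rate",
        "fx_rate",
        [("stamp", "string"), ("curr_code", "string"), ("fx_rate", "double")],
    )
)
```

If the real .avsc file is available, loading it with json.load and passing its contents through the same escaping step would be preferable to reconstructing the schema by hand.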
