Unable to query/select data inserted through Spark SQL


I am trying to insert data into a partitioned Hive managed table.
Here is the SHOW CREATE TABLE output for reference:
+--------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `part_test08`( |
| `id` string, |
| `name` string, |
| `baseamount` double, |
| `billtoaccid` string, |
| `extendedamount` double, |
| `netamount` decimal(19,5), |
| `netunitamount` decimal(19,5), |
| `pricingdate` timestamp, |
| `quantity` int, |
| `invoiceid` string, |
| `shiptoaccid` string, |
| `soldtoaccid` string, |
| `ingested_on` timestamp, |
| `external_id` string) |
| PARTITIONED BY ( |
| `productid` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'wasb://blobrootpath/hive/warehouse/db_103.db/part_test08' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1549962363') |
+--------------------------------------------------------------------------------------------------+--+
I am executing the following SQL statement to insert records into the partitioned table:
sparkSession.sql("INSERT INTO TABLE db_103.part_test08 PARTITION(ProductId) SELECT reflect('java.util.UUID', 'randomUUID'),stg_name,stg_baseamount,stg_billtoaccid,stg_extendedamount,stg_netamount,stg_netunitamount,stg_pricingdate,stg_quantity,stg_invoiceid,stg_shiptoaccid,stg_soldtoaccid,'2019-02-12 09:06:07.566',stg_id,stg_ProductId FROM tmp_table WHERE part_id IS NULL");
Running the SELECT on its own, without the INSERT, returns the following data:
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
|reflect(java.util.UUID, randomUUID)|stg_name|stg_baseamount| stg_billtoaccid|stg_extendedamount|stg_netamount|stg_netunitamount| stg_pricingdate|stg_quantity|stg_invoiceid| stg_shiptoaccid| stg_soldtoaccid|2019-02-12 09:06:07.566|stg_id|stg_ProductId|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
| 4e0b4331-b551-42d...| OLI6| 16.0|2DD4E682-6B4F-E81...| 34.567| 1166.74380| 916.78000|2018-10-18 05:06:22| 13| I1|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 6| P3|
| 8b327a8e-dd3c-445...| OLI7| 16.0|2DD4E682-6B4F-E81...| 34.567| 766.74380| 1016.78000|2018-10-18 05:06:22| 13| I6|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 7| P4|
| c0e14b9a-8d1a-426...| OLI5| 14.6555| null| 34.56| 500.87000| 814.65000|2018-10-11 05:06:22| 45| I4|29B73C4E-846B-E71...|29B73C4E-846B-E71...| 2019-02-12 09:06:...| 5| P1|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
Earlier I was getting an error while inserting into the managed table. After restarting the Hive and Thrift services, the job now runs without error, but the inserted data does not appear when I run a SELECT query through beeline or from a program. I can see that the partition, with delta files, was created under hive/warehouse (the screenshot is not reproduced here).
I also see warnings like the one below; I am not sure whether they are related to the problem.
Cannot get ACID state for db_103.part_test08 from null
One more note: if I use an external table, it works fine and I can view the data as well.
We are using an Azure HDInsight Spark 2.3 (HDI 4.0 Preview) cluster with the following service stack:
HDFS: 3.1.1
Hive: 3.1.0
Spark2: 2.3.1

Did you add the following SET commands before inserting the data?
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I faced a similar issue where I was not allowed to perform any read or write operation; after adding the above properties I was able to query the table.
Since you have not had any issue with the external table, I am not sure this will solve your problem.

Related

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, but for some reason it does not drop the partition. I am not sure what's wrong.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
The list of partitions is the same before and after running it:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but does not drop the partition. The table is internal/managed.
I tried different approaches mentioned on Stack Overflow, but none of them worked for me.
Help.

You don't need to set a variable. You can drop the partition directly with SQL:
ALTER TABLE prod_db.products
DROP PARTITION (load_date = current_date());
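
A likely reason the original attempt has no effect is that Hive variable substitution is purely textual: the value stored by SET hivevar:current_date=current_date(); is the text current_date(), not an evaluated date. The Python sketch below mimics that substitution and then builds the statement with a date computed on the client side instead. The helper names substitute_hivevars and drop_partition_statement are hypothetical, and the sketch only produces strings; it does not talk to Hive.

```python
from datetime import date


def substitute_hivevars(statement: str, variables: dict) -> str:
    """Mimic textual ${name} substitution: values are pasted in verbatim, never evaluated."""
    for name, value in variables.items():
        statement = statement.replace("${" + name + "}", value)
    return statement


def drop_partition_statement(table: str, day: date) -> str:
    """Build a DROP PARTITION statement with a literal ISO date (hypothetical helper)."""
    return f"ALTER TABLE {table} DROP PARTITION (load_date='{day.isoformat()}');"


# The statement from the question, with the variable exactly as it was SET.
template = "ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');"
print(substitute_hivevars(template, {"current_date": "current_date()"}))

# The same statement with the date resolved before it reaches Hive.
print(drop_partition_statement("prod_db.products", date(2022, 4, 25)))
```

The first printed statement compares load_date with the string 'current_date()', which matches none of the listed partitions. The second targets load_date='2022-04-25' directly.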

Timestamp is different for the same table in hive-cli & presto-cli

I am getting different timestamps for the same table in hive-cli & presto-cli.
The table structure is:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `mea_req_201`( |
| `mer_id` bigint, |
| `mer_from_dttm` timestamp, |
| `mer_to_dttm` timestamp, |
| `_c0` bigint, |
| `a_number` string, |
| `b_number` string, |
| `time_stamp` timestamp, |
| `duration` bigint) |
| PARTITIONED BY ( |
| `partition_col` bigint) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://hadoop5:8020/warehouse/tablespace/external/hive/mea_req_201' |
| TBLPROPERTIES ( |
| 'TRANSLATED_TO_EXTERNAL'='TRUE', |
| 'bucketing_version'='2', |
| 'external.table.purge'='TRUE', |
| 'spark.sql.create.version'='2.4.0.7.1.4.0-203', |
| 'spark.sql.sources.schema.numPartCols'='1', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'transient_lastDdlTime'='1625496239') |
+----------------------------------------------------+
When run from hive-cli, the output is:
[output not included]
When run from presto-cli:
[output not included]
In the mer_from_dttm column there is a time difference, but for the other timestamp columns the values are exactly the same. This behaviour is the same when querying through presto-jdbc. I believe this has nothing to do with the time zone, because a time zone problem would affect all timestamp columns, not just one. Please suggest a resolution.
Some Facts:
Presto server version: 0.180
Presto Jdbc version: 0.180
hive.time-zone=Asia/Calcutta
In Presto jvm.config: -Duser.timezone=Asia/Calcutta
Client TimeZone: Asia/Calcutta
Edit 1:
I sorted the query by mer_id to ensure both queries return the same set of rows; however, the erroneous behavior remains the same.
When run from hive-cli:
[output not included]
When run from presto-cli:
[output not included]

Presto 0.180 is really old. It was released in 2017, and many bugs have been fixed along the way.
I would suggest you try a recent version. In particular, recent versions of Trino (formerly known as PrestoSQL) have had a lot of work done around the handling of timestamp data.

Use ORDER BY so that you compare exactly the same rows in each client:
SELECT `mer_id`,`mer_from_dttm`, `mer_to_dttm`, `time_stamp` FROM mea_req_201 ORDER BY `mer_id`;

Spark avro insertInto file extension

I've an external Hive table based on Avro.
| CREATE EXTERNAL TABLE `temp_avro`( |
| `string1` string COMMENT '') |
| PARTITIONED BY ( |
| `string2` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://xxx/xxx/temp_avro' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1503938718') |
I'm trying to write to this table using Spark as:
SELECT_0_0.toDF().write.mode("append").insertInto("temp_avro")
With this, the Avro files are created in the HDFS location without the .avro extension (with names part-00001, part-00002, and so on). Is there a way to have the file names end with the .avro extension?

Try using coalesce to combine the output into a single part before saving your results:
SELECT_0_0.toDF().coalesce(1).write.mode("append").insertInto("temp_avro")
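
Note that coalesce(1) changes how many part files are written, not what they are called. If the names still lack the extension, one possible workaround (an assumption, not something confirmed in this thread) is to rename the part files after the write. The Python sketch below only plans the renames from a directory listing; plan_avro_renames is a hypothetical helper, and applying the plan (for example with hdfs dfs -mv) is left out.

```python
def plan_avro_renames(file_names):
    """Return (old, new) pairs for part files that lack the .avro extension."""
    renames = []
    for name in file_names:
        # Only touch Spark/Hive part files; skip markers such as _SUCCESS
        # and files that already carry the extension.
        if name.startswith("part-") and not name.endswith(".avro"):
            renames.append((name, name + ".avro"))
    return renames


# Example listing of a partition directory (made-up names for illustration).
listing = ["part-00001", "part-00002", "part-00003.avro", "_SUCCESS"]
for old, new in plan_avro_renames(listing):
    print(f"{old} -> {new}")
```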

AvroSerDe: Encountered exception determining schema

I'm moving data from one Hive table to another using Spark after applying some transformations. While doing so, I get an exception which, as far as I understand, says that the avsc file can't be read, although I can see the avsc file in HDFS.
Please advise what the reason could be.
The source Hive table is defined as:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE TABLE `exchr`( |
| `exchr_sk` bigint, |
| `rec_ctry_cd` string, |
| `cob_dt` date, |
| `ccy_cd` string, |
| `exchr_val` decimal(10,5)) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://xxxx/EXCHR' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hive', |
| 'last_modified_time'='1500408192', |
| 'numFiles'='0', |
| 'numRows'='3', |
| 'rawDataSize'='73', |
| 'totalSize'='0', |
| 'transient_lastDdlTime'='1501780655') |
+----------------------------------------------------+--+
The target Hive table is defined as:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE EXTERNAL TABLE `fx_rate`( |
| `stamp` string COMMENT '', |
| `curr_code` string COMMENT '', |
| `fx_rate` double COMMENT '') |
| PARTITIONED BY ( |
| `cb_dt` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://xxxx/fx_rate' |
| TBLPROPERTIES ( |
| 'avro.schema.url'='hdfs://xxxx/fx_rate.avsc', |
| 'transient_lastDdlTime'='1502313078') |
+----------------------------------------------------+--+
When I try to insert data into this table using Spark with the code below, I get the error "Encountered exception determining schema. Returning signal schema to indicate problem".
Spark code:
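import org.apache.spark.sql.functions.{date_format, lit, round}
val TABLE_2_0 = sqlContext.sql("select * from xxxx.exchr")
val SELECT_1_0 = TABLE_2_0.select(TABLE_2_0.col("*"))
// "yyyyMMdd": MM is month-of-year; the original "yyyymmdd" used mm (minute-of-hour).
// The source table defines cob_dt (there is no cb_dt column), so both values are derived from cob_dt.
val SELECT_0_0 = SELECT_1_0.select(date_format(SELECT_1_0.col("cob_dt"), "yyyyMMdd").as("stamp"), lit(null).as("curr_code"), round(lit(10)/SELECT_1_0.col("exchr_val"),10).as("fx_rate"), date_format(SELECT_1_0.col("cob_dt"), "yyyyMMdd").as("cb_dt")).limit(1000)
SELECT_0_0.toDF().write.mode("append").insertInto("xxxx.fx_rate")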
Exception trace:
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:381)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:154)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:135)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:172)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.newSerializer(InsertIntoHiveTable.scala:59)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.outputClass$lzycompute(InsertIntoHiveTable.scala:53)
17/08/10 23:46:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 3, myhost.net, executor 2): org.apache.hadoop.hive.serde2.SerDeException: Encountered exception determining schema. Returning signal schema to indicate problem: null
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:523)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:97)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:88)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:81)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:92)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Automatically generating documentation about the structure of the database

There is a database that contains several views and tables.
I need to create a report (documentation of the database) that lists all the fields in these tables with their types and, if possible, the minimum/maximum values and the values from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type | MinValue | MaxValue | FirstRow |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | day | date | ‘2010-09-17’ | ‘2016-12-10’ | ‘2016-12-10’ |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | price | double | 1030.8 | 29485.7 | 6023.8 |
:------------+--------+--------+--------------+--------------+--------------:
| … | | | | | |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | day | date | ‘2014-06-20’ | ‘2016-11-28’ | ‘2016-11-16’ |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | owner | string | NULL | NULL | ‘Joe’ |
'------------'--------'--------'--------------'--------------'--------------'
I think that executing many queries like
SELECT MAX(column_name) as max_value, MIN(column_name) as min_value
FROM table_name
will be inefficient on the huge tables that are stored in Hadoop.
After reading the documentation, I found an article about "Statistics in Hive".
It seems I must use a statement like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command failed with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this statement adds information to the table's metadata rather than displaying a result? Will it work with a view?
Please suggest how to create documentation for a Hive database efficiently and automatically.
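
One way to cut down the number of scans, offered here as an assumption rather than an answer from the thread, is to compute MIN and MAX for every column of a table in a single query instead of one query per column. The Python sketch below only generates the query text; min_max_query is a hypothetical helper, and the column list is assumed to come from elsewhere (for example from DESCRIBE output).

```python
def min_max_query(table, columns):
    """Build one SELECT that computes MIN and MAX of every column in a single scan."""
    parts = []
    for col in columns:
        parts.append(f"MIN(`{col}`) AS `{col}_min`")
        parts.append(f"MAX(`{col}`) AS `{col}_max`")
    return "SELECT " + ", ".join(parts) + f" FROM {table}"


# Table and column names taken from the example report above.
print(min_max_query("Table1", ["day", "price"]))
```

This still reads the whole table once, so it remains expensive on very large tables; it only avoids repeating that scan for each column.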
