Not able to query (from Hive) Parquet file created in Pig - hadoop

I have created a Parquet file in Pig (in the directory outputset):
grunt> STORE extracted INTO './outputset' USING ParquetStorer;
The file has one record, as shown below:
grunt> mydata = LOAD './outputset/part-r-00000.parquet' using ParquetLoader;
grunt> dump mydata;
(val1,val2,val3)
grunt> describe mydata;
mydata: {val_0: chararray,val_1: chararray,val_2: chararray}
After this, I created an external table in Hive to read this file:
CREATE EXTERNAL TABLE parquet_test (
field1 string,
field2 string,
field3 string)
STORED AS PARQUET
LOCATION '/home/.../outputset';
When I query the table I am able to retrieve the one record, but all the fields are NULL, as shown below:
hive> select * from parquet_test;
NULL NULL NULL
What am I missing here?
PS :
Pig version : 0.15.0
Hive version : 1.2.1

You need to match the exact field names in the Pig schema with the column names in Hive. Since describe shows the fields as val_0, val_1, and val_2, your Hive DDL should look like:
CREATE EXTERNAL TABLE parquet_test (
val_0 string,
val_1 string,
val_2 string)
STORED AS PARQUET
LOCATION '/home/.../outputset';
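With the column names matching the Pig schema, the same query should now return the record instead of NULLs, roughly:
hive> select * from parquet_test;
val1    val2    val3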

Related

Creating external hive table from parquet file which contains json string

I have a parquet file which is stored in a partitioned directory. The format of the partition is
/dates=*/hour=*/something.parquet.
The content of parquet file looks like as follows:
{a:1,b:2,c:3}.
This is JSON data and I want to create an external Hive table.
My approach:
CREATE EXTERNAL TABLE test_table (a int, b int, c int) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
After that I run MSCK REPAIR TABLE test_table; but I get the following output:
hive> select * from test_table;
OK
NULL NULL NULL 2021-09-27 09
The other three columns (a, b, c) are NULL. I think I have to define the JSON schema somehow, but I have no idea how to proceed further.
Create table with the same schema as parquet file:
CREATE EXTERNAL TABLE test_table (value string) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
Run repair table to mount partitions:
MSCK REPAIR TABLE test_table;
Parse value in query:
select e.a, e.b, e.c
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a,b,c
Cast values as int if necessary: cast(e.a as int) as a
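Putting those pieces together, a sketch of the full query, assuming the single value column above and casting the parsed fields to int:
select cast(e.a as int) as a,
       cast(e.b as int) as b,
       cast(e.c as int) as c,
       t.dates,
       t.hour
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a, b, c;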
You can also create a table with the JSON fields as columns, using this:
CREATE EXTERNAL TABLE IF NOT EXISTS test_table(
a INT,
b INT,
c INT)
partitioned by (dates string, hour string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET
location '/user/output/';
Then run MSCK REPAIR TABLE test_table;
You will then be able to query the fields directly, without any parsing in the query.
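If that table mounts correctly, a usage sketch (using the partition values shown in the output above) might be:
select a, b, c from test_table where dates = '2021-09-27' and hour = '09';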

How to create external tables from parquet files in s3 using hive 1.2?

I have created an external table in Qubole (Hive) which reads parquet (compressed: snappy) files from S3, but on performing SELECT * FROM table_name I am getting NULL values for all columns except the partition column.
I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue.
And on removing the property 'serialization.format' = '1' I am getting ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.
I checked the parquet files and was able to read the data using parquet-tools:
file_01.snappy.parquet:
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}
External table stmt:
CREATE EXTERNAL TABLE parquet_test
(
col2 int,
col3 string
)
PARTITIONED BY (col1 date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1 col_2 col_3
5/3/19 NULL NULL
5/4/19 NULL NULL
5/5/19 NULL NULL
5/6/19 NULL NULL
Expected Result:
col_1 col_2 col_3
5/3/19 1234 ABC
5/4/19 124 FHK
5/5/19 12515 UPO
5/6/19 1234 ABC
I am writing the answer below assuming that the table was created using Hive and read using Spark (since the question is tagged with apache-spark-sql).
How was the data created?
Spark supports case-sensitive schemas.
When we use the DataFrame APIs, it is possible to write data with a case-sensitive schema.
Example:
scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
|-- iD: integer (nullable = true)
|-- NaMe: string (nullable = true)
Notice that in the above example case sensitivity is preserved.
When we create a Hive table on top of the data created from Spark, Hive is able to read it correctly since it is not case sensitive.
But when the same data is read using Spark, it uses the schema from Hive, which is lowercase by default, and the rows returned are null.
To overcome this, Spark has introduced a config spark.sql.hive.caseSensitiveInferenceMode.
object HiveCaseSensitiveInferenceMode extends Enumeration {
val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
.doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
"table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
"formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
"any table backed by files containing case-sensitive field names or queries may not return " +
"accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
"case-sensitive schema from the underlying data files and write it back to the table " +
"properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
"properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
"instead of inferring).")
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
INFER_AND_SAVE - Spark infers the schema and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this).
If the value of the property is NOT INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the Parquet files.
The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
We could check the following to see if the problem is related to schema case sensitivity:
1. The value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this)
2. Whether the data was created using Spark
3. If 2 is true, check whether the schema is case sensitive (spark.read.parquet(<location>).printSchema)
4. If 3 uses a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, set spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table, and try to read the data from Spark again (see the sketch after this list).
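A minimal spark-shell sketch of those checks, using the table name and S3 path from the question as placeholders:
scala> spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode").show(false)    // check 1: current mode
scala> spark.read.parquet("s3://path_to_parquet").printSchema                    // check 3: do the file's field names differ from the Hive schema?
scala> spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE") // step 4: change the mode
scala> spark.sql("DROP TABLE IF EXISTS parquet_test")                            // then recreate the table with the DDL above
scala> spark.sql("SELECT * FROM parquet_test").show()                            // and re-read from Spark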

HIVE - create external tables where string itself contains commas

I am new to Hive and am creating external tables on a CSV file. One of the issues I am coming across is values that contain multiple commas within the string itself. For example, the CSV file contains the following:
[screenshot: CSV file contents]
When I create an external table in Hive, because there are commas within the "name" column, it shifts the first name to the right, adding another column. This throws all of the data off when you view the table in Hive.
[screenshot: external table result in Hive]
Is there anything I can add to my script to keep the commas but also keep first and last name in the same column when the external table is created? Thank you all in advance - I am very new to Hive.
CREATE EXTERNAL TABLE database.table_name (
ID INT,
Name String,
City String,
State String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
Check this solution - you need to add this line: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://community.cloudera.com/t5/Support-Questions/comma-in-between-data-of-csv-mapped-to-external-table-in/td-p/220193
Complete DDL example:
create table hcc(field1 string,
field2 string,
field3 string,
field4 string,
field5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"");

Indexing data from HDFS to Elasticsearch via Hive

I'm using Elasticsearch for Hadoop plugin in order to read and index documents in Elasticsearch via Hive.
I followed the documentation in this page:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
In order to index documents in Elasticsearch with Hadoop, you need to create a table in Hive that is configured properly.
I encountered a problem with inserting data into that Hive table.
This is the script I used to create the table for writing:
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
Then I tried to insert data:
INSERT OVERWRITE TABLE es_names_w
SELECT firstname,lastname
FROM tmp_names_source;
The error I get from hive is:
"Job submission failed with exception 'org.apache.hadoom.ipc.RemoteExaption(java.lang.RuntimeExeption: org.xml.sax.SAXParseException; systemId: file:////hdfs_data/mapred/jt/jobTracker/job_201506091622_0064.xml; lineNunber: 607; columnNumber:51; Character reference "&#..."
However, this error occurs ONLY when the hive table that I create has more than one column.
For example, this code works:
CREATE EXTERNAL TABLE es_names_w
(
firstname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names', 'es.index.auto.create' = 'true')
INSERT OVERWRITE TABLE es_names_w
SELECT firstname
FROM tmp_names_source;
Everything went well: Hive created a new type in the Elasticsearch index and the data was indexed in Elasticsearch.
I really don't know why my first attempt doesn't work.
I would appreciate some help, thanks.
Can you add this to the table's TBLPROPERTIES: 'es.mapping.id' = 'key'. The key can be your firstname.
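For example, a sketch of the write table from the question with that property added (assuming firstname is used as the document id):
CREATE EXTERNAL TABLE es_names_w
(
firstname string,
lastname string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'hive_test/names',
              'es.index.auto.create' = 'true',
              'es.mapping.id' = 'firstname');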
Try
'es.index.auto.create' = 'false'
Try with a SerDe, it will work out. For example:
CREATE EXTERNAL TABLE elasticsearch_es (
name STRING, id INT, country STRING )
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource'='elasticsearch/demo');
Also, make sure that when you create the index and type in ES, you create exactly the same mapping in ES as the Hive columns.

Error while creating an external table using JSON jars in Hive 0.13

I am doing some testing on Hive 0.13. I am trying to create an external table using JSON jars to read the JSON-formatted data file, but I am getting errors. Below is my create table statement:
'$response = Invoke-Hive -Query #"
add jar wasb://path/json-serde-1.1.9.2.jar;
add jar wasb://path/json-serde-1.1.9.2-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (col1 string, col2 string...coln int)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ()
STORED AS TEXTFILE;
"#'
Below is the error I am getting:
'FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.<init>(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive
/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V'
Any suggestions?
There are some changes needed to that SerDe for Hive 0.13 - you can see a list here: https://github.com/rcongiu/Hive-JSON-Serde/pull/64
