How to use Parquet files created using Apache Drill inside Hive - hadoop

Apache Drill has a nice feature of making Parquet files out of many incoming datasets, but it seems like there is not a lot of information on how to use those Parquet files later on - specifically in Hive.
Is there a way for Hive to make use of those "1_0_0.parquet", etc. files? Maybe create a table and load the data from the Parquet files, or create a table and somehow place those Parquet files inside HDFS so that Hive reads them?

I have faced this problem. If you are using a Cloudera distribution, you can create the tables using Impala (Impala and Hive share the metastore); it allows creating a table directly from a Parquet file, which Hive unfortunately does not support:
CREATE EXTERNAL TABLE table_from_file LIKE PARQUET '/user/etl/destination/datafile1.parquet'
STORED AS PARQUET
LOCATION '/user/test/destination';
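Since Hive cannot infer the schema from the Parquet file itself, the usual workaround is to declare the columns explicitly and point an external table at the directory Drill wrote. A minimal sketch, assuming invented column names and types (replace them with whatever your Drill query actually produced):
-- Columns below are placeholders; they must match the Parquet schema Drill wrote
CREATE EXTERNAL TABLE drill_output (
  id     BIGINT,
  name   STRING,
  amount DOUBLE
)
STORED AS PARQUET
LOCATION '/user/test/destination';
Hive will then read every Parquet file in that directory, including the "1_0_0.parquet" style files Drill produces.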

Related

Main purpose of the MetaStore in Hive?

I am a little confused on the purpose of the MetaStore. When you create a table in hive:
CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH '<HDFS_file_location>' INTO TABLE <table_name>;
As I understand it, these commands take the contents of the file in HDFS and store a metadata description of it in the MetaStore (column names, column types, where the file lives in HDFS, and so on). They don't actually move the data from HDFS into Hive.
But what is the purpose of storing this MetaData?
When I connect to Hive using Spark SQL for example the MetaStore doesn't contain the actual information in HDFS but just MetaData. So is the MetaStore simply used by Hive to do parsing and compiling steps against the HiveQL query and to create the MapReduce jobs?
The metastore stores schema (table definitions including location in HDFS, SerDe, columns, comments, types, partition definitions, views, access permissions, etc.) and statistics. There is no such operation as moving data from HDFS into Hive, because Hive table data is stored in HDFS (or another compatible filesystem such as S3). You can define a new table, or even several tables, on top of some location in HDFS and put files into it. You can also change an existing table's location or a partition's location; all of this information is stored in the metastore so that Hive knows how to access the data. A table is a logical object defined in the metastore, and the data itself is just files in some location in HDFS.
See also this answer about Hive query execution flow (high level): https://stackoverflow.com/a/45587873/2700344
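To make the "a table is just metadata over files" point concrete, here is a minimal sketch; the table name, columns and paths are invented for illustration:
-- Define a table on top of an existing HDFS location
CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
STORED AS PARQUET
LOCATION '/data/events/current';

-- Only the metastore entry changes here; no files are moved or rewritten
ALTER TABLE events SET LOCATION '/data/events/v2';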
Hive performs schema-on-read, which means that for data to be processed in some structured manner (i.e. as a table-like object), the layout of that data needs to be described in a relational structure.
"takes the contents of the file in HDFS and creates a MetaData form of it"
As far as I know, no files are actually read when you create a table.
SparkSQL connects to the metastore directly. Both Spark and HiveServer have their own query parsers; parsing is not part of the metastore. MapReduce/Tez/Spark jobs are also not handled by the metastore. The metastore is just a relational database. If it's MySQL, Postgres, or Oracle, you can easily connect to it and inspect the contents. By default, both Hive and Spark use an embedded Derby database.
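If you do connect to the metastore database, a query along these lines shows what it actually holds. The table and column names follow the standard metastore schema, though they can differ slightly between Hive versions and backing databases:
-- Run against the metastore RDBMS itself (MySQL/Postgres/Oracle), not against Hive
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID;
Nothing in the result is table data, only names, types and HDFS locations.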

Impala 2.7 fails to read any data from a parquet table created from Hive with Tez

I'm populating a partitioned Hive table in Parquet storage format using a query that contains a number of UNION ALL operators. The query is executed using Tez, which with default settings results in multiple concurrent Tez writers creating an HDFS structure where the Parquet files sit in subfolders (named after the Tez writer ID) under the partition folders.
E.g. /apps/hive/warehouse/scratch.db/test_table/part=p1/8/000000_0
Even after invalidating metadata and collecting stats on the table, Impala returns zero rows when the table is queried.
The issue seems to be that Impala does not traverse into the partition subfolders to look for Parquet files.
If I set hive.merge.tezfiles to true (it's false by default), effectively forcing Tez to use an extra processing step to coalesce the multiple files into one, the resulting Parquet files are written directly under the partition folder, and after a refresh Impala can see the data in new or updated partitions.
I wonder if there is a config option for Impala that instructs it to look in partition subfolders, or perhaps a patch for Impala that changes its behavior in that regard.
As of now, recursive reading of files from subdirectories under the table LOCATION is not supported in Impala.
Example: If a table is created with location '/home/data/input/'
and if the directory structure is as follows:
/home/data/input/a.txt
/home/data/input/b.txt
/home/data/input/subdir1/x.txt
/home/data/input/subdir2/y.txt
then Impala can query only the following files:
/home/data/input/a.txt
/home/data/input/b.txt
The following files are not queried:
/home/data/input/subdir1/x.txt
/home/data/input/subdir2/y.txt
As an alternative solution, you can read the data with Hive and insert it into a final Hive table.
Create an Impala view on top of this table for interactive or reporting queries; both routes are sketched after the settings below.
You can enable subdirectory scanning in Hive using the configuration settings below.
Hive supports subdirectory scans with the options
SET mapred.input.dir.recursive=true;
and
SET hive.mapred.supports.subdirectories=true;
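Putting the two routes together, a rough sketch; the database and table come from the question, while the final table and view names are invented placeholders:
-- Route 1: merge Tez output so Parquet files land directly under the partition folders
SET hive.merge.tezfiles=true;

-- Route 2: copy into a flat final table that Impala queries instead.
-- With the two SET options above in effect, the SELECT can read the files
-- sitting in the Tez writer subfolders.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE scratch.test_table_final LIKE scratch.test_table;
INSERT OVERWRITE TABLE scratch.test_table_final PARTITION (part)
SELECT * FROM scratch.test_table;

-- Then, in Impala (after INVALIDATE METADATA), expose it for reporting:
CREATE VIEW scratch.test_table_report AS SELECT * FROM scratch.test_table_final;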

How to encrypt data in HDFS, and then create a Hive or Impala table to query it?

Recently, I came across a situation:
There is a data file in a remote HDFS. We need to encrypt the data file and then create an Impala table to query the data in the local HDFS system. I don't know how Impala can query an encrypted data file, or how to solve this.
It can be done by creating a User Defined Function (UDF) in Hive. You can write a UDF by implementing the Hive UDF interface, build a jar from your UDF class, and put it in the Hive lib directory (or add it at query time).
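Once the jar is built, registering and calling the UDF from HiveQL looks roughly like this; the jar path, class name, function name, and table are hypothetical placeholders:
-- All names below are placeholders for whatever your UDF build actually produces
ADD JAR hdfs:///user/etl/jars/decrypt-udf.jar;
CREATE TEMPORARY FUNCTION decrypt_col AS 'com.example.hive.udf.DecryptUDF';
SELECT decrypt_col(encrypted_column) FROM encrypted_table;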

external tables in Hive

I added a CSV file to HDFS using an R script.
I update this CSV with a new CSV / append data to it.
I created a table over this CSV using Hue in Hive.
I altered it to be an external table.
Now, when the data is changed in the HDFS location, will the data be automatically updated in the Hive table?
That's the thing with external (and also managed) tables in Hive: they're not really tables. You can think of them as a link to an HDFS location. So whenever you query an external table, Hive reads all the data from the location you selected when you created the table (see the sketch after the quote below).
From the Hive docs:
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
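For example, a table like the one described could look roughly like this; the column names, types and path are assumptions for illustration:
-- External table over a directory of CSV files written by the R script
CREATE EXTERNAL TABLE csv_data (id INT, name STRING, value DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/rscripts/csv_data';
Because the table is only a pointer to that directory, any file you append or replace there is picked up by the next query; there is no separate reload step.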

create a Parquet backed Hive table by using a schema file

Cloudera documentation shows a simple way to "create an Avro backed Hive table by using an Avro schema file." This works great. I would like to do the same thing for a Parquet backed Hive table, but the relevant documentation in this case lists out every column type rather than reading them from a schema. Is it possible to read the Parquet columns from a schema file, in the same way as for Avro data?
Currently, the answer appears to be no. There is an open issue tracking this:
https://issues.apache.org/jira/browse/PARQUET-76
The issue has been active recently, so hopefully in the near future Hive will offer the same functionality for Parquet as it does for Avro.
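For comparison, the Avro route the question refers to lets the schema live in a file, while for Parquet the columns currently have to be spelled out in Hive (Impala can infer them from a data file with CREATE TABLE ... LIKE PARQUET, as in the first answer above). A rough sketch; the schema path and columns are placeholders, and on older Hive versions you may need the explicit AvroSerDe ROW FORMAT form instead of STORED AS AVRO:
-- Avro: schema read from a file, no column list needed
CREATE TABLE avro_backed
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/schemas/record.avsc');

-- Parquet: Hive needs the columns spelled out
CREATE TABLE parquet_backed (id BIGINT, name STRING)
STORED AS PARQUET;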
