Using externally created Parquet files in Impala

First off, apologies if this comes across as poorly worded; I've tried to help myself, but I'm not clear on where it's going wrong.
I'm trying to query data in Impala which has been exported from another system.
Up till now it's been exported as a pipe-delimited text file, which I've been able to import fine by creating the table with the right delimiter set up, copying in the file and then running a refresh statement.
We've had some issues where some fields contain line-break characters, which makes it look like we have more rows than we actually do, and those rows no longer fit the metadata I've created.
The suggestion was made that we could use Parquet format instead and this would cope with the internal line-breaks fine.
I've received data and it looks a bit like this (I changed the username):
-rw-r--r--+ 1 UserName Domain Users 20M Jan 17 10:15 part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 156K Jan 17 10:15 .part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 14M Jan 17 10:15 part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 110K Jan 17 10:15 .part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 0 Jan 17 10:15 _SUCCESS
-rw-r--r--+ 1 UserName Domain Users 8 Jan 17 10:15 ._SUCCESS.crc
If I create a table stored as parquet through Impala and then do an hdfs dfs -ls on that I get something like the following:
-rwxrwx--x+ 3 hive hive 2103 2019-01-23 10:00 /filepath/testtable/594eb1cd032d99ad-5c13d29e00000000_1799839777_data.0.parq
drwxrwx--x+ - hive hive 0 2019-01-23 10:00 /filepath/testtable/_impala_insert_staging
Which is obviously a bit different from what I've received...
How do I create the table in Impala so that it can accept what I've received? And do I just need the .parquet files in there, or do I also need to put the .parquet.crc files in?
Or is what I've received not fit for purpose?
I've tried looking at the Impala documentation for this, but I don't think it covers this case.
Is it something that I need to do with a SerDe?
I tried specifying the compression_codec as snappy, but this gave the same results.
Any help would be appreciated.

The names of the files do not matter; as long as they are not special files (like _SUCCESS or .something.crc), they will be read by Impala as Parquet files. You don't need the .crc or _SUCCESS files.
You can use Parquet files from an external source in Impala in two ways:
First approach: create a Parquet table in Impala, then put the external files into the directory that corresponds to the table.
Second approach: create a directory, put the external files into it, and then create a so-called external table in Impala. (You can put more data files there later as well.)
After putting external files into a table's directory, issue REFRESH table_name; (or INVALIDATE METADATA table_name;) to make Impala pick up the new files.
The syntax for creating a regular Parquet table is
CREATE TABLE table_name (col_name data_type, ...)
STORED AS PARQUET;
The syntax for creating an external Parquet table is
CREATE EXTERNAL TABLE table_name (col_name data_type, ...)
STORED AS PARQUET LOCATION '/path/to/directory';
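For the scenario in the question, a minimal sketch of the second approach might look like the following (the table and column names are made up for illustration and must be replaced with the actual schema of the exported data):

-- 1. Put only the .parquet files into an HDFS directory, skipping the .crc and _SUCCESS files,
--    e.g. with: hdfs dfs -put part-0000*.snappy.parquet /filepath/external_data/
-- 2. Create an external table over that directory (hypothetical schema):
CREATE EXTERNAL TABLE my_imported_table (
  id BIGINT,
  some_text STRING,
  some_value DOUBLE
)
STORED AS PARQUET
LOCATION '/filepath/external_data';
-- 3. Whenever more Parquet files are added to the directory later:
REFRESH my_imported_table;

Because Parquet files carry their own schema, the column types you declare need to match what the exporting system wrote; running DESCRIBE my_imported_table; and a quick SELECT with a LIMIT after creation is a cheap sanity check.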
An excerpt from the Overview of Impala Tables section of the docs:
Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:
Internal tables are managed by Impala, and use directories inside the designated Impala work area.
External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.
An excerpt from the CREATE TABLE Statement section of the docs:
By default, Impala creates an "internal" table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an "external" table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. For details about internal and external tables, see Overview of Impala Tables.

Related

Confusion with external tables in Hive

I have created a Hive external table using the command below:
use hive2;
create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by ","
stored as textfile location '/dataDir/';
Now, when I look at HDFS I can see the database, but there is no depTable inside the warehouse.
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
[cloudera#quickstart ~]$
Above you can see that there is no table created in this DB. As far as I know, external tables are not stored in the Hive warehouse. So am I correct? If yes, then where is it stored?
But if I create the external table first (without LOCATION) and then load the data, I am able to see the file inside hive2.db.
hive> create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by "," stored as textfile;
OK
Time taken: 0.056 seconds
hive> load data inpath '/dataDir/department_data.txt' into table depTable;
Loading data to table default.deptable
Table default.deptable stats: [numFiles=1, totalSize=90]
OK
Time taken: 0.28 seconds
hive> select * from deptable;
OK
1001 FINANCE SYDNEY
2001 AUDIT MELBOURNE
3001 MARKETING PERTH
4001 PRODUCTION BRISBANE
Now, if I run the hadoop fs command, I can see this table under the database, as below:
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2019-01-17 09:07 /user/hive/warehouse/hive2.db/deptable
If I drop the table, I am still able to see the table directory in HDFS, as below:
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2019-01-17 09:11 /user/hive/warehouse/hive2.db/deptable
So, what is the exact behavior of external tables? When I create one using the LOCATION keyword, where does the data get stored? And when I create one and use a LOAD statement, why does the data end up in HDFS under the warehouse, and why isn't it deleted when I drop the table?
The main difference between EXTERNAL and MANAGED tables is in the DROP TABLE/PARTITION behavior.
When you drop a MANAGED table/partition, the location with the data files is also removed.
When you drop an EXTERNAL table, the location with the data files remains as is.
UPDATE: setting TBLPROPERTIES ("external.table.purge"="true") on an external table (available in release 4.0.0+, HIVE-19981) makes dropping it delete the data as well.
Both EXTERNAL and MANAGED tables are stored in the location specified in the DDL. You can create a table on top of an existing location that already contains data files, and it will work whether the table is EXTERNAL or MANAGED; it does not matter.
You can even create both EXTERNAL and MANAGED tables on top of the same location; see this answer for more details and tests: https://stackoverflow.com/a/54038932/2700344
If you specified a location, the data will be stored in that location for both types of tables. If you did not specify a location, the data will be in the default location, /user/hive/warehouse/database_name.db/table_name, for both managed and external tables.
Update: there can also be restrictions on the location depending on the platform/vendor (see https://stackoverflow.com/a/67073849/2700344); you may not be allowed to create managed/external tables outside their default allowed root location.
See also official Hive docs on Managed vs External Tables
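If you want to check which kind of table you ended up with and where its data files live, a quick way (using the table from the question) is:

DESCRIBE FORMATTED depTable;
-- Look at the "Table Type" row (MANAGED_TABLE vs EXTERNAL_TABLE)
-- and the "Location" row, which shows the HDFS directory holding the data files.

And, as a sketch of the purge behaviour mentioned above (only on Hive releases that support the property):

ALTER TABLE depTable SET TBLPROPERTIES ('external.table.purge'='true');
-- From now on, DROP TABLE depTable; removes the data files as well.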

Measure the time to load tables with data in Hive (is it possible?)

I created a table in hive from data stored in hdfs with this command:
create external table users
(ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/data/tpch/users';
The users data stored in HDFS is about 10 GB, yet the CREATE TABLE took just one second to create the table and "load" the data. So is this strange, or is it really that fast? My doubt is: can the time to load tables with data in Hive be measured with the command above, using LOCATION? Or does that command just create a reference to the data stored in HDFS?
So what is the correct way to measure the time it takes to load data into Hive tables?
Because one second seems really fast; MySQL or another relational database would probably need 30 minutes or more to load 10 GB of data into a table.
Your create table statement is pointing to external storage for the tables, so Hive is not copying the data over. The documentation explains external tables like this:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
This is not 100% explicit, but the idea is that Hive is pointing to the table contents rather than managing it directly.
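If the goal is to time an actual data load rather than just the registration of a location, one option (a sketch; the target table name is arbitrary) is to force Hive to read and rewrite the data, for example with CREATE TABLE AS SELECT into a managed table:

-- Reads the ~10 GB behind the external 'users' table and writes it into a
-- managed table in the warehouse, so the elapsed time reflects real I/O
-- rather than just metadata creation.
CREATE TABLE users_managed
STORED AS TEXTFILE
AS
SELECT * FROM users;

The time Hive reports for this statement is much closer to the "load 10 GB into MySQL" comparison you have in mind.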

Creating an ORC file and not a Hive table?

From what I found by googling around, there are ways of creating an ORC table using Hive, but I want an ORC file on which I can run my custom MapReduce job.
Also, please let me know whether the file created by Hive under the warehouse directory for my ORC table (e.g. /user/hive/warehouse/tbl_orc/000000_0) is some internal table file, or an actual ORC file I can use.
[Wrap-up of the discussion]
A Hive table is mapped to an HDFS directory (or a list of directories, if the table is partitioned).
All files in that directory use the same SerDe (ORC, Parquet, AVRO, Text, etc.) and have the same column set; together, they contain all the data available for that table.
Each file in that directory is the result of a previous MapReduce job -- either a Hive INSERT, a Pig dataset saved via HCatalog, a Spark dataset saved via HiveContext... or any custom job that happens to drop a file there, hopefully compliant with the table SerDe and schema (retrieved via the MetastoreClient Java API, or via the HCatalog API, whatever).
Note that a single job with 3 reducers will probably create 3 new files (and maybe 1 empty file + 1 small file + 1 big file!); and a job with 24 mappers and no reducer will create 24 files, unless some kind of "merge small files" post-processing step is enabled.
Note also that most file names give absolutely no information about the way the file is encoded internally; they are just sequence numbers (i.e. the 5th job to add 12 files will typically create files 000004_0 to 000004_11).
All in all, processing an ORC fileset with a Java MapReduce program should be very similar to processing a Text fileset. You just have to provide the correct SerDe and the correct field mapping -- I think the compression algorithm is explicit in the files, so the SerDe handles it auto-magically at read time. Just remember that ORC files are not splittable at record level, but at stripe level (a stripe is a bunch of records stored in columnar format with tokenization and optional compression).
Of course, that will not give you access to ORC advanced features such as vectorization or stripe pruning (somewhat similar to "smart scan" in Oracle Exadata).
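As a practical illustration of the points above, one way (a sketch; the table, column and path names are invented) to get plain ORC files at a known HDFS path is to create an external ORC table over a directory you control and let Hive write into it; the files that appear there are ordinary ORC files that any ORC-aware reader, including a custom MapReduce job, can consume:

CREATE EXTERNAL TABLE my_orc_out (
  id INT,
  payload STRING
)
STORED AS ORC
LOCATION '/user/me/orc_out';

-- Any INSERT produces real ORC files under /user/me/orc_out (e.g. 000000_0):
INSERT OVERWRITE TABLE my_orc_out
SELECT id, payload FROM some_source_table;

The same applies to the warehouse path in the question: /user/hive/warehouse/tbl_orc/000000_0 is a genuine ORC file, not a Hive-specific wrapper format.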

How does Hive store the data (loaded from HDFS)?

I am fairly new to Hadoop (HDFS and HBase) and the Hadoop ecosystem (Hive, Pig, Impala, etc.). I have a good understanding of Hadoop components such as the NameNode, DataNode, Job Tracker and Task Tracker, and how they work in tandem to store data in an efficient manner.
While trying to understand the fundamentals of a data access layer such as Hive, I need to understand where exactly a table's data (created in Hive) gets stored. We can create external and internal tables in Hive. As external tables can be in HDFS or any other file system, Hive doesn't store data for such tables in the warehouse. What about internal tables? Such a table is created as a directory in HDFS on the Hadoop cluster. Once we load data into these tables from the local or HDFS file system, are further files created to store the data in the tables created in Hive?
Say for example:
A sample file named test_emp_feedback.csv was brought from the local file system to HDFS.
A table (emp_feedback) was created in Hive with a structure similar to the CSV file's structure. This led to the creation of a directory in the Hadoop cluster, say /users/big_data/hive/emp_feedback.
Now, once I create the table and load data into the emp_feedback table from test_emp_feedback.csv:
Is Hive going to create a copy of the file in the emp_feedback directory? Won't that cause data redundancy?
Creating a managed table creates a directory with the same name as the table under the Hive warehouse directory (usually /user/hive/warehouse/dbname/tablename). The table structure (Hive metadata) is also created in the metastore (RDBMS/HCat).
Before you load data into the table, this directory (with the same name as the table, under the Hive warehouse) is empty.
There are two possible scenarios.
If the table is external, the data is not copied to the warehouse directory at all.
If the table is managed (not external), when you load your data into the table it is moved (not copied) from its current HDFS location to the Hive warehouse directory (/user/hive/warehouse/dbname/tablename). So this does not replicate the data.
Caution: it is always advisable to create an external table unless the data is only used by Hive. Dropping a managed table deletes the data from HDFS (the Hive warehouse).
HadoopGig
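A short way to see the "moved, not copied" behaviour described above (the paths are illustrative):

-- The file starts out at /users/big_data/incoming/test_emp_feedback.csv
LOAD DATA INPATH '/users/big_data/incoming/test_emp_feedback.csv'
INTO TABLE emp_feedback;
-- Afterwards the file is gone from /users/big_data/incoming and sits under the
-- table's warehouse directory, so there is only ever one copy of it on HDFS.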
To answer your question:
For External Tables:
Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables
Hive moves data into its warehouse directory. If the table is dropped, then the table metadata and the data will be deleted.
For your reference
Difference between Internal & External tables:
For External Tables
An external table stores its files on HDFS, but the table is not completely tied to the source files.
If you delete an external table, the files still remain on HDFS.
As an example, if you create an external table called "table_test" in Hive using HiveQL and link the table to a file "file", then deleting "table_test" from Hive will not delete "file" from HDFS.
External table files are accessible to anyone who has access to the HDFS file structure, and therefore security needs to be managed at the HDFS file/folder level.
Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/files.
For Internal Tables
Internal tables are stored in a directory based on the hive.metastore.warehouse.dir setting; by default this is /user/hive/warehouse. You can change it by updating the location in the config file.
Deleting the table deletes the metadata and the data from the master node and HDFS respectively.
Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (depends on the organization).
Hive can have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.
You are not creating a table based on an existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source:
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
It would not cause data redundancy. For managed (non-external) tables, Hive moves the data into its warehouse directory. In your example, the data will be moved from its original location on HDFS to /users/big_data/hive/emp_feedback.
Be careful with removing a managed table: it will remove the data on HDFS as well.
You can load the data in two ways:
A) Use LOAD DATA INPATH 'file_location_of_csv' INTO TABLE emp_feedback;
Note that this command moves the content out of the source directory; this is the approach used with an internal (managed) table.
or
B) Use copyFromLocal or put to copy the local file into HDFS, then create an external table over that location. The data is not moved from the source. You can drop the external table and the source data is still available.
e.g.
create external table emp_feedback (
  emp_id int,
  emp_name string
)
row format delimited fields terminated by ','  -- assuming a comma-delimited CSV
location '/location_in_hdfs_for_csv file';
When you drop an external table, it only drops the metadata of the Hive table. The data still exists at the HDFS file location.
Got it. This is what I was able to understand so far.
It all depends on which type of table is being created and where the file is picked up from. Below are the possible use cases:
(Image summarising the possible use cases is not reproduced here.)

Updating a Hive external table with HDFS changes

Let's say I created a Hive external table "myTable" from the file myFile.csv (located in HDFS).
myFile.csv changes every day, so I'd like to update "myTable" once a day too.
Is there any HiveQL query that tells Hive to update the table every day?
Thank you.
P.S.
I would like to know if it works the same way with directories: let's say I create a Hive partition from an HDFS directory "myDir" when "myDir" contains 10 files. The next day "myDir" contains 20 files (10 files were added). Should I update the Hive partition?
There are basically two types of tables in Hive.
One is the managed table, managed in the Hive warehouse: when you load data into it, the data is copied (moved) into the internal warehouse.
So you may not have the latest data in the query output.
The other is the external table, for which Hive does not copy the data into its internal warehouse.
So whenever you run a query against the table, it reads the data straight from the files.
That means you can always have the latest data in the query output.
That is one of the goals of external tables.
You can even drop the table and the data is not lost.
If you add a LOCATION clause to your CREATE TABLE statement pointing at the HDFS directory that contains myFile.csv, you shouldn't have to update anything in Hive. Queries will always use the latest version of the file.
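A sketch of both cases (the table names and paths are made up): a non-partitioned external table defined over a directory reads whatever files are currently there, while a partitioned table needs to be told about new partition directories:

-- Non-partitioned: queries always read the current contents of /data/myDir.
CREATE EXTERNAL TABLE my_table (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/myDir';

-- Partitioned: after new partition directories land in HDFS, register them with either
ALTER TABLE my_part_table ADD IF NOT EXISTS PARTITION (dt='2019-01-24')
  LOCATION '/data/myDir/dt=2019-01-24';
-- or
MSCK REPAIR TABLE my_part_table;

(In Impala, by contrast, you would run REFRESH table_name; after new files arrive, as in the first question above.)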
