Confusion with external tables in Hive - Hadoop

I have created a Hive external table using the command below:
use hive2;
create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by ","
stored as textfile location '/dataDir/';
Now, when I browse HDFS I can see the database, but there is no depTable inside the warehouse.
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
[cloudera#quickstart ~]$
As you can see above, no table directory was created in this DB. As far as I know, external tables are not stored in the Hive warehouse. Am I correct? If so, where is the table stored?
But if I create the external table first (without a LOCATION clause) and then load the data, I can see the file inside hive2.db.
hive> create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by "," stored as textfile;
OK
Time taken: 0.056 seconds
hive> load data inpath '/dataDir/department_data.txt' into table depTable;
Loading data to table default.deptable
Table default.deptable stats: [numFiles=1, totalSize=90]
OK
Time taken: 0.28 seconds
hive> select * from deptable;
OK
1001 FINANCE SYDNEY
2001 AUDIT MELBOURNE
3001 MARKETING PERTH
4001 PRODUCTION BRISBANE
Now, if I run the hadoop fs command, I can see this table under the database as shown below:
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2019-01-17 09:07 /user/hive/warehouse/hive2.db/deptable
If I drop the table, I can still see the table directory in HDFS, as below:
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2019-01-17 09:11 /user/hive/warehouse/hive2.db/deptable
So, what is the exact behavior of external tables? When I create one using the LOCATION keyword, where is it stored? And when I create one without it and use a LOAD statement, why is it stored under the warehouse in HDFS, and why isn't it deleted after I drop the table?

The main difference between EXTERNAL and MANAGED tables is in the DROP TABLE/PARTITION behavior.
When you drop a MANAGED table/partition, the location with its data files is also removed.
When you drop an EXTERNAL table, the location with its data files remains as is.
UPDATE: In release 4.0.0+ (HIVE-19981), setting TBLPROPERTIES ("external.table.purge"="true") on an external table causes the data to be deleted on drop as well.
Both EXTERNAL and MANAGED tables are stored in the location specified in the DDL. You can create a table on top of an existing location that already contains data files, and it will work for either EXTERNAL or MANAGED; it does not matter.
You can even create both EXTERNAL and MANAGED tables on top of the same location; see this answer for more details and tests: https://stackoverflow.com/a/54038932/2700344
If you specified a location, the data will be stored in that location for both types of tables. If you did not specify a location, the data will be in the default location /user/hive/warehouse/database_name.db/table_name, again for both managed and external tables.
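You can check where a particular table keeps its data, and whether it is managed or external, with DESCRIBE FORMATTED; a minimal sketch using the table from the question:
hive> use hive2;
hive> describe formatted depTable;
-- the output includes a Location: row (the HDFS directory backing the table)
-- and a Table Type: row (MANAGED_TABLE or EXTERNAL_TABLE)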
Update: There can also be restrictions on the location depending on the platform/vendor (see https://stackoverflow.com/a/67073849/2700344); you may not be allowed to create managed/external tables outside their default allowed root location.
See also official Hive docs on Managed vs External Tables
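To make the drop behavior concrete, here is a minimal sketch (the table names are placeholders; the column list and path mirror the question):
-- external: the files under /dataDir survive the drop
create external table depTable_ext (depId int, depName string, location string)
row format delimited fields terminated by ',' stored as textfile
location '/dataDir/';
drop table depTable_ext;       -- /dataDir/ and its files remain
-- managed: the files are removed together with the table
create table depTable_managed (depId int, depName string, location string)
row format delimited fields terminated by ',' stored as textfile;
drop table depTable_managed;   -- the table's warehouse directory is deleted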

Related

Using Externally created Parquet files in Impala

First off, apologies if this comes across as poorly worded; I've tried to help myself, but I'm not clear on where it's going wrong.
I'm trying to query data in Impala which has been exported from another system.
Up till now it's been exported as a pipe-delimited text file, which I've been able to import fine by creating the table with the right delimiter set up, copying in the file and then running a refresh statement.
We've had some issues where some fields contain line-break characters, which makes it look like we have more rows than we actually do, and the data doesn't necessarily fit the metadata I've created.
The suggestion was made that we could use Parquet format instead and this would cope with the internal line-breaks fine.
I've received data and it looks a bit like this (I changed the username):
-rw-r--r--+ 1 UserName Domain Users 20M Jan 17 10:15 part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 156K Jan 17 10:15 .part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 14M Jan 17 10:15 part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 110K Jan 17 10:15 .part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 0 Jan 17 10:15 _SUCCESS
-rw-r--r--+ 1 UserName Domain Users 8 Jan 17 10:15 ._SUCCESS.crc
If I create a table stored as parquet through Impala and then do an hdfs dfs -ls on that I get something like the following:
-rwxrwx--x+ 3 hive hive 2103 2019-01-23 10:00 /filepath/testtable/594eb1cd032d99ad-5c13d29e00000000_1799839777_data.0.parq
drwxrwx--x+ - hive hive 0 2019-01-23 10:00 /filepath/testtable/_impala_insert_staging
Which is obviously a bit different to what I've received...
How do I create the table in Impala so that it can accept what I've received? And do I just need the .parquet files in there, or do I also need to put the .parquet.crc files in?
Or is what I've received not fit for purpose?
I've tried looking at the Impala documentation for this bit but I don't think that's covering it.
Is it something that I need to do with a SerDe?
I tried specifying the compression_codec as snappy, but this gave the same results.
Any help would be appreciated.
The names of the files do not matter; as long as they are not special files (like _SUCCESS or .something.crc), they will be read by Impala as Parquet files. You don't need the .crc or _SUCCESS files.
You can use Parquet files from an external source in Impala in two ways:
Create a Parquet table in Impala first, then put the external files into the directory that corresponds to the table.
Create a directory, put the external files into it and then create a so-called external table in Impala. (You can put more data files there later as well.)
After putting external files into a table's directory, you have to issue INVALIDATE METADATA table_name; to make Impala check for the new files.
The syntax for creating a regular Parquet table is
CREATE TABLE table_name (col_name data_type, ...)
STORED AS PARQUET;
The syntax for creating an external Parquet table is
CREATE EXTERNAL TABLE table_name (col_name data_type, ...)
STORED AS PARQUET LOCATION '/path/to/directory';
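Putting the second approach together, a minimal sketch (the directory, table and column names are placeholders; the part files are the ones from the question):
hdfs dfs -mkdir -p /data/received_parquet
hdfs dfs -put part-0000*.snappy.parquet /data/received_parquet/
Then, in impala-shell:
CREATE EXTERNAL TABLE my_table (id BIGINT, name STRING)
STORED AS PARQUET LOCATION '/data/received_parquet';
-- run this again after adding more files to the directory later:
INVALIDATE METADATA my_table;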
An excerpt from the Overview of Impala Tables section of the docs:
Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:
Internal tables are managed by Impala, and use directories inside the designated Impala work area.
External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.
An excerpt from the CREATE TABLE Statement section of the docs:
By default, Impala creates an "internal" table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an "external" table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. For details about internal and external tables, see Overview of Impala Tables.

Hive load data into HDFS

I have a data set with 100+ columns in each row. The question is: how can I load only selected columns into HDFS using Hive?
For example: col1, col2, col3...col50, col51....col99, col100. I need to load only the selected columns col1, col2, col34 and col99.
Approach 1:
1. Load all the columns.
2. Create a view based on the selected columns.
Cons of approach 1: I need to load all the columns unnecessarily, it will consume more space in HDFS, and I need to write a big query to specify the columns.
Is there any other, better approach?
Hive provides a tabular view on top of HDFS data. If your data is in HDFS, then you can create an external table on it to reference the existing data. You will need to put a schema over the data. This is a one-time effort, and then you can use all the features of Hive to explore and analyze the dataset. Hive also supports views.
Illustration
Sample data file: data.csv
1,col_1a,col1b
2,col_2a,col2b
3,col_3a,col3b
4,col_4a,col4b
5,col_5a,col5b
6,col_6a,col6b
7,col_7a,col7b
Load and verify data in HDFS
hadoop fs -mkdir /hive-data/mydata
hadoop fs -put data.csv /hive-data/mydata
hadoop fs -cat /hive-data/mydata/*
1,col_1a,col1b
2,col_2a,col2b
3,col_3a,col3b
4,col_4a,col4b
5,col_5a,col5b
6,col_6a,col6b
7,col_7a,col7b
Create a Hive table on top of the HDFS data in default database
CREATE EXTERNAL TABLE default.mydata
(
id int,
data_col1 string,
data_col2 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///hive-data/mydata';
Query the Hive table
select * from default.mydata;
mydata.id mydata.data_col1 mydata.data_col2
1 col_1a col1b
2 col_2a col2b
3 col_3a col3b
4 col_4a col4b
5 col_5a col5b
6 col_6a col6b
7 col_7a col7b
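Since the original question is about exposing only a subset of the columns, a view over the external table keeps queries short; a minimal sketch on top of the illustration table above (the view name is just an example):
CREATE VIEW default.mydata_selected AS
SELECT id, data_col1
FROM default.mydata;
select * from default.mydata_selected;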

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly for performance but, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have verified that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so you cannot add or drop a partition through Hive alone; you need to create the folder using HDFS, MR or Spark. An EXTERNAL table can be read by Hive, but its files are not managed by Hive. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. Although it will return 0 rows, you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
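Putting those pieces together, a minimal sketch, assuming the table's LOCATION is /test and the partition directories follow the dt= convention (the data file name is a placeholder):
hadoop fs -mkdir -p /test/dt=20171117
hadoop fs -put messages_20171117.seq /test/dt=20171117/
-- in Hive, either add the partition explicitly:
ALTER TABLE test_messages ADD IF NOT EXISTS PARTITION (dt='20171117') LOCATION '/test/dt=20171117';
-- or let Hive discover all dt=... directories under the table location:
MSCK REPAIR TABLE test_messages;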

Where is HIVE metadata stored by default?

I have created an external table in Hive using the following:
create external table hpd_txt(
WbanNum INT,
YearMonthDay INT ,
Time INT,
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';
Now this table is created at the location */hive/external.
Step-1: I loaded data in this table using:
load data inpath '/input/hpd.txt' into table hpd_txt;
The data is successfully loaded into the specified path (*/external/hpd_txt).
Step-2: I delete the loaded file from the */hive/external path using the following:
hadoop fs -rmr /user/hive/external/hpd_txt
Questions:
Why is the data deleted from the original path? (*/input/hpd.txt is deleted from HDFS, but the table is created in the */external path.)
After I delete the table's file from HDFS as in Step 2 and run show tables; again, it still lists the table hpd_txt in the external path.
So where is this coming from?
Thanks in advance.
Hive doesn't know that you deleted the files. Hive still expects to find the files in the location you specified. You can do whatever you want in HDFS, and this doesn't get communicated to Hive. You have to tell Hive if things change.
hadoop fs -rmr /user/hive/external/hpd_txt
For instance, the above command doesn't delete the table; it just removes the files. The table still exists in the Hive metastore. If you want to delete the table, then use:
drop table if exists tablename;
Since you created the table as an external table, this will drop the table from Hive. The files will remain if you haven't removed them. If you want to delete an external table and the files the table is reading from, you can do one of the following (see the sketch after this list):
Drop the table and then remove the files
Change the table to managed and drop the table
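A minimal sketch of the second option, using Hive's 'EXTERNAL' table property (the table name is taken from the question):
-- convert the external table to a managed table, then drop it together with its data files
ALTER TABLE hpd_txt SET TBLPROPERTIES('EXTERNAL'='FALSE');
DROP TABLE hpd_txt;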
Finally, the default Hive warehouse location (controlled by hive.metastore.warehouse.dir) is /user/hive/warehouse.
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. Otherwise, you will have to load data (conventionally, or by creating a file in the directory the Hive table points to).
When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Source: Hive docs
So, in your Step 2, removing the file /user/hive/external/hpd_txt removes the data source (the data backing the table), but the table still exists and continues to point to hdfs://localhost:9000/user/hive/external as it was created.
@Anoop: Not sure if this answers your question; let me know if you have any further questions.
Do not use the LOAD DATA INPATH command. The load operation MOVES (it does not COPY) the data into the corresponding Hive table location. Use put or copyFromLocal to copy a file from the local file system into HDFS, and then just provide that HDFS location in the CREATE TABLE statement after the put command has run.
Dropping a table does not remove the HDFS files from disk. That is the advantage of an external table: Hive stores only the metadata needed to access the data files, so if you drop the table, the data files are untouched in their HDFS location. In the case of internal tables, Hive manages the data as well, so both metadata and data will be removed if you drop the table.
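A minimal sketch of the copy-then-point approach described above (the target directory is a placeholder; the file name is borrowed from the question):
# copy (not move) the local file into a dedicated HDFS directory
hadoop fs -mkdir -p /data/hpd
hadoop fs -put hpd.txt /data/hpd/
# or: hadoop fs -copyFromLocal hpd.txt /data/hpd/
# then point the CREATE EXTERNAL TABLE ... LOCATION clause at '/data/hpd'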
After going through your helpful comments and other posts, I have found the answer to my question.
If I use the LOAD DATA INPATH command, it "moves" the source file to the location where the external table was created. Although that file won't be affected when the table is dropped, moving it out of its original location is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.
To expose data already located in HDFS through an external table, use the LOCATION clause in the CREATE TABLE query to point to the source data, for example:
create external table hpd(WbanNum string,
YearMonthDay string ,
Time string,
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';
This sample location points to the data already present in HDFS at that path, so there is no need to use the LOAD DATA INPATH command here.
It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created and the data sits in a properly managed directory structure.
Thanks a lot for helping me understand this concept guys! Cheers!

How to Load Data into Hive from HDFS

I am trying to load data into Hive from HDFS, but I observed that the data is moved: after loading the data into the Hive environment, if I look at HDFS, the data I loaded is no longer present. Can you please answer this question with an example?
If you would like to create a table in Hive from data in HDFS without moving the data into /user/hive/warehouse/, you should use the optional EXTERNAL and LOCATION keywords. For example, from this page, we have the following example CREATE TABLE statement:
hive> CREATE EXTERNAL TABLE userline(line STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/home/admin/userdata';
Without those, Hive will take your data from HDFS and load it into /user/hive/warehouse (and if the table is dropped, the data is also deleted).
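For contrast, a minimal sketch of the managed-table path that shows the "moving" behavior from the question (the table and file names are illustrative, not taken from the answer above):
hive> CREATE TABLE userline_managed(line STRING) ROW FORMAT
      DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/home/admin/userdata/users.txt' INTO TABLE userline_managed;
-- the file is moved out of /home/admin/userdata into
-- /user/hive/warehouse/userline_managed/, and dropping the table deletes it there as well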
