Difference between external table and internal table when using Apache HAWQ? - hawq

I am using HAWQ to handle a column-based file. While reading the Pivotal document, they suggest that user should use gpfdist to read and write the readable external table in order to quickly process the data in a parallel way.
I made a table as recommended in the documentation and confirmed my data by SQL as below statement.
CREATE EXTERNAL TABLE ext_data
(col1 text, col2 text,col3 text, col4 text, col5 int, col6 int, col7 int,col8 int)
LOCATION ('gpfdist://hawq2:8085/*.csv')
FORMAT 'CSV'(DELIMITER ',');
SELECT gp_segment_id,count(*) from ext_data GROUP BY gp_segment_id;
The data was evenly distributed on all the slave nodes.
Previous my goal was creating the table, reading the data from the file and identifying the loaded data was distributed well. It was achieved by above procedure using gpfdist.
But the question is the difference between the external table and internal table. What is the reason of being using external or internal table even though two methods were same functionality.
I found some blogs that some users follow below procedures when using HAWQ or Greenplume database.
1. making external table using gpfdist
2. making internal table again
3. reading the data from external data into internal data.
I didn't fully get the idea of this behavior. Above all, I don't know why external and internal table exist and should be used for handling data using Apache Hawq or greenplume database.

An External Table that uses gpfdist
Data is in a posix filesystem, not HDFS
No statistics
Files could be on ETL nodes which aren't part of the cluster
You could have multiple files across many servers too
Ideal solution to load data in parallel to an Internal table
insert into table_name select * from external_table_name;
Internal Table
Data is stored in HDFS
Statistics are gathered and stored in the catalog
Files are HDFS files only
You can take advantage of HDFS features like parquet format and snappy compression
Provide the best performance for queries
External Tables just make it easier to load data into the database and make it do so faster.
Think about this scenario. You get a file from your Accounting system that you need to load. You could do this:
scp the file to an edge node
hdfs put the file into hdfs
Create an external table in HAWQ using PXF to read the file
Insert the data into HAWQ table
That will work and PXF will read the file in HDFS in parallel. However, step 2 is a single process and a bottleneck. Instead, do this:
scp the file to an edge node
Start a gpfdist process
Create an external table in HAWQ using gpfdist to read the file
Insert the data into HAWQ table
Now the "put" into HDFS is done in parallel because HAWQ will start virtual segments on each node to put the data. This are typically 6 virtual segments per data node so in a 10 node cluster, you'll have 60 processes putting data into HDFS rather than a single one.

Related

Unable to partition hive table backed by HDFS

Maybe this is an easy question but, I am having a difficult time resolving the issue. At this time, I have an pseudo-distributed HDFS that contains recordings that are encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive table and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition the date is both for performance but, more so, because the "LOAD DATA ... " statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against hive table without partitioning.
Any thoughts ?
I see that you have created EXTERNAL TABLE. So you cannot add or drop partition using hive. you need to create a folder using hdfs or MR or SPARK. EXTERNAL table can only be read by hive but not managed by HDFS. You can check the hdfs location '/test/dt=20171117' and you will see that folder has not been created.
My suggestion is create the folder(partition) using "hadoop fs -mkdir '/test/20171117'" then try to query the table. although it will give 0 row. but you can add the data to that folder and read from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore

How Hive stores the data (loaded from HDFS)?

I am fairly new to Hadoop (HDFS and Hbase) and Hadoop Eco system (Hive, Pig, Impala etc.). I have got a good understanding of Hadoop components such as NamedNode, DataNode, Job Tracker, Task Tracker and how they work in tandem to store the data in efficient manner.
While trying to understand fundamentals of data access layer such as Hive, I need to understand where exactly a table’s data (created in Hive) gets stored? We can create external and internal table in Hive. As external tables can be in HDFS or any other file system, Hive doesnt store data for such tables in warehouse. What about internal tables? This table will be created as a directory on one of the data nodes on Hadoop Cluster. Once we load data in these tables from local or HDFS file system, are there further files getting created to store data in tables created in Hive?
Say for example:
A sample file named test_emp_feedback.csv was brought from local file system to HDFS.
A table (emp_feedback) was created in Hive with a structure similar to csv file structure. This lead to creation of a directory in Hadoop cluster say /users/big_data/hive/emp_feedback
Now once I create the table and load data in emp_feedback table from test_emp_feedback.csv
Is Hive going to create a copy of file in emp_feedback directory? Wont it cause data redundancy?
Creating a Managed table will create a directory with Same name as table name at Hive warehouse directory(Usually at /user/hive/warehouse/dbname/tablename).Also the table structure(Hive Metadata) is created in the metastore(RDBMS/HCat).
Before you load the data on the table, this directory(with the same name as table name under hive warehouse) is empty.
There could be 2 possible scenarios.
If the table is external the data is not copied to warehouse directory at all.
If the table is managed(not external), when you load your data to the table it is moved(not Copied) from current HDFS location to Hive warehouse directory9/user/hive/warehouse//). So this will not replicate the data.
Caution: It is always advisable to create external table unless the data is only used by hive. Dropping a managed table would delete the data from HDFS(Warehouse of HIVE).
HadoopGig
To answer you Question :
For External Tables:
Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables
Hive moves data into its warehouse directory. If the table is dropped, then the table metadata and the data will be deleted.
For your reference
Difference between Internal & External tables:
For External Tables
External table stores files on the HDFS server but tables are not linked to the source file completely.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
Meta data is maintained on master node, and deleting an external table from HIVE only deletes the metadata not the data/file.
For Internal Tables
Stored in a directory based on settings in hive.metastore.warehouse.dir, by default internal tables are stored in the following directory /user/hive/warehouse you can change it by updating the location in the config file.
Deleting the table deletes the metadata and data from master-node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).
Hive may have internal or external tables, this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source:
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
It would not cause data redundancy. For managed (not external) tables Hive moves the data into its warehouse directory. In your example, the data will be moved from original location on HDFS to '/users/big_data/hive/emp_feedback'.
Be careful with the removal of the managed table, it will lead to removal data on HDFS also.
You can send data in two days
A) use LOAD DATA INPATH 'file_location_of_csv' INTO TABLE emp_feedback;
Note that this command will remove content at source directory and create a internal table
OR)
B) Use copyFromLocal or put command to copy local file into HDFS and then create external table and copy the data into table. Now data won't be moved from source. You can drop external table but still source data is available.
e.g.
create external table emp_feedback (
emp_id int,
emp_name string
)
location '/location_in_hdfs_for_csv file';
When you drop an external table, it only drops the meta data of HIVE table. Data still exists at HDFS file location.
Got it. This is what I was able to understand so far.
It all depends upon which type of table is being created and where from the file is picked up. Below are possible use cases
enter image description here

How Load distributed data in Hive works?

My target is to perform a SELECT query using Hive
When I have a small data on a single machine (namenode), I start by:
1-Creating a table that contains this data: create table table1 (int col1, string col2)
2-Loading the data from a file path: load data local inpath 'path' into table table1;
3-Perform my SELECT query: select * from table1 where col1>0
I have huge data, of 10 millions rows that doesn't fit into a single machine. Lets assume Hadoop divided my data into for example 10 datanodes and each datanode contains 1 million row.
Retrieving the data to a single computer is impossible due to its huge size or would take alot of time in case it is possible.
Will Hive create a table at each datanode and perform the SELECT query
or will Hive move all the data a one location (datanode) and create one table? (which is inefficient)
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map side job. So the data block is processed locally at every node. There is no need to collect data centrally.

Why we need to move external table to managed hive table?

I am new to Hadoop and learning Hive.
In Hadoop definative guide 3rd edition page no. 428 last paragraph
I don't understand below paragraph regarding external table in HIVE.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody explain briefly what above phrase says?
Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (like modifying some columns adding columns, making aggregation etc) and to store it in a specific way (partitions / buckets / sorted etc) so that the queries would benefit from these optimizations.
The key difference between external and managed table in Hive is that data in the external table is not managed by Hive.
When you create external table you define HDFS directory for that table and Hive is simply "looking" in it and can get data from it but Hive can't delete or change data in that folder. When you drop external table Hive only deletes metadata from its metastore and data in HDFS remains unchanged.
Managed table basically is a directory in HDFS and it's created and managed by Hive. Even more - all operations for removing/changing partitions/raw data/table in that table MUST be done by Hive otherwise metadata in Hive metastore may become incorrect (e.g. you manually delete partition from HDFS but Hive metastore contains info that partition exists).
In Hadoop definative guide I think author meant that it is a common practice to write MR-job that produces some raw data and keeps it in some folder. Than you create Hive external table which will look into that folder. And than safelly run queries without the risk to drop table etc.
In other words - you can do MR job that produces some generic data and than use Hive external table as a source of data for insert into managed tables. It helps you to avoid creating boring similar MR jobs and delegate this task to Hive queries - you create query that takes data from external table, aggregates/processes it how you want and puts the result into managed tables.
Another goal of external table is to use as a source data from remote servers, e.g. in csv format.
There is no reason to move table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in future, better use current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed
tables
DROP deletes data for managed tables while it only deletes
metadata for external ones
ACID/Transactional only works for
managed tables
Query Results Caching only works for managed
tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing in common with external/managed table type. If you want to change structure you do not necessarily need to change table managed/external type
It is also mentioned in the book.
when your table is external table.
you can use other technologies like PIG,Cascading or Mapreduce to process it .
You can also use multiple schemas for that dataset.
and You can also create data lazily if it is external table.
when you decide that dataset should be used by only Hive,make it hive managed table.

Difference between Hive internal tables and external tables?

Can anyone tell me the difference between Hive's external table and internal tables.
I know the difference comes when dropping the table. I don't understand what you mean by the data and metadata is deleted in internal and only metadata is deleted in external tables.
Can anyone explain me in terms of nodes please.
Hive has a relational database on the master node it uses to keep track of state.
For instance, when you CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, this table schema is stored in the database.
If you have a partitioned table, the partitions are stored in the database(this allows hive to use lists of partitions without going to the file-system and finding them, etc). These sorts of things are the 'metadata'.
When you drop an internal table, it drops the data, and it also drops the metadata.
When you drop an external table, it only drops the meta data. That means hive is ignorant of that data now. It does not touch the data itself.
Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
You want to use a custom location such as ASV.
Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
To answer you Question :
For External Tables, Hive stores the data in the LOCATION specified during creation of the table(generally not in warehouse directory). If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables, Hive stores data into its warehouse directory. If the table is dropped then both the table metadata and the data will be deleted.
For your reference,
Difference between Internal & External tables :
For External Tables -
External table stores files on the HDFS server but tables are not linked to the source file completely.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS
file/folder level.
Meta data is maintained on master node, and deleting an external table from HIVE only deletes the metadata not the data/file.
For Internal Tables-
Stored in a directory based on settings in hive.metastore.warehouse.dir,
by default internal tables are stored in the following directory “/user/hive/warehouse” you can change it by updating the location in the config file .
Deleting the table deletes the metadata and data from master-node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends
on organization).
Hive may have internal or external tables, this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source :
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
An internal table data is stored in the warehouse folder, whereas an external table data is stored at the location you mentioned in table creation.
So when you delete an internal table, it deletes the schema as well as the data under the warehouse folder, but for an external table it's only the schema that you will loose.
So when you want an external table back you again after deleting it, can create a table with the same schema again and point it to the original data location. Hope it is clear now.
The only difference in behaviour (not the intended usage) based on my limited research and testing so far (using Hive 1.1.0 -cdh5.12.0) seems to be that when a table is dropped
the data of the Internal (Managed) tables gets deleted from the HDFS file system
while the data of the External tables does NOT get deleted from the HDFS file system.
(NOTE: See Section 'Managed and External Tables' in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL which list some other difference which I did not completely understand)
I believe Hive chooses the location where it needs to create the table based on the following precedence from top to bottom
Location defined during the Table Creation
Location defined in the Database/Schema Creation in which the table is created.
Default Hive Warehouse Directory (Property hive.metastore.warehouse.dir in hive.site.xml)
When the "Location" option is not used during the "creation of a hive table", the above precedence rule is used. This is applicable for both Internal and External tables. This means an Internal table does not necessarily have to reside in the Warehouse directory and can reside anywhere else.
Note: I might have missed some scenarios, but based on my limited exploration, the behaviour of both Internal and Extenal table seems to be the same except for the one difference (data deletion) described above. I tried the following scenarios for both Internal and External tables.
Creating table with and without Location option
Creating table with and without Partition Option
Adding new data using the Hive Load and Insert Statements
Adding data files to the Table location outside of Hive (using HDFS commands) and refreshing the table using the "MSCK REPAIR TABLE command
Dropping the tables
In external tables, if you drop it, it deletes only schema of the table, table data exists in physical location. So to deleted the data use hadoop fs - rmr tablename .
Managed table hive will have full control on tables. In external tables users will have control on it.
INTERNAL : Table is created First and Data is loaded later
EXTERNAL : Data is present and Table is created on top of it.
Internal tables are useful if you want Hive to manage the complete lifecycle of your data including the deletion, whereas external tables are useful when the files are being used outside of Hive.
External hive table has advantages that it does not remove files when we drop tables,we can set row formats with different settings , like serde....delimited
Also Keep in mind that Hive is a big data warehouse. When you want to drop a table you dont want to lose Gigabytes or Terabytes of data. Generating, moving and copying data at that scale can be time consuming.
When you drop a 'Managed' table hive will also trash its data.
When you drop a 'External' table only the schema definition from hive meta-store is removed. The data on the hdfs still remains.
Consider this scenario which best suits for External Table:
A MapReduce (MR) job filters a huge log file to spit out n sub log files (e.g. each sub log file contains a specific message type log) and the output i.e n sub log files are stored in hdfs.
These log files are to be loaded into Hive tables for performing further analytic, in this scenario I would recommend an External Table(s), because the actual log files are generated and owned by an external process i.e. a MR job besides you can avoid an additional step of loading each generated log file into respective Hive table as well.
The best use case for an external table in the hive is when you want to create the table from a file either CSV or text
Both Internal and External tables are owned by HIVE. The only difference being the ownership of data. The commands for creating both tables are shown below. Only an additional EXTERNAL keyword comes in case of external table creation. Both tables can be created/deleted/modified using SQL Statements.
In case of Internal Tables, both the table and the data contained in the tables are managed by HIVE. That is, we can add/delete/modify any data using HIVE. When we DROP the table, along with the table, the data will also get deleted.
Eg: CREATE TABLE tweets (text STRING, words INT, length INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
In case of External Tables, only the table is managed by HIVE. The data present in these tables can be from any storage locations like HDFS. We cant add/delete/modify the data in these tables. We can only use the data in these tables using SELECT statements. When we DROP the table, only the table gets deleted and not the data contained in it. This is why its said that only meta-data gets deleted. When we create EXTERNAL tables, we need to mention the location of the data.
Eg: CREATE EXTERNAL TABLE tweets (text STRING, words INT, length INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hive/warehouse/tweets';
hive stores only the meta data in metastore and original data in out side of hive when we use external table we can give location' ' by these our original data wont effect when we drop the table
When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified in the LOCATION properties instead of the default warehouse directory.
When keeping data in the internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If the external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake.
For managed tables, Hive controls the lifecycle of their data. Hive stores the data for managed tables in a sub-directory under the directory defined by hive.metastore.warehouse.dir by default.
When we drop a managed table, Hive deletes the data in the table.But managed tables are less convenient for sharing with other tools. For example, lets say we have data that is created and used primarily by Pig , but we want to run some queries against it, but not give Hive ownership of the data.
At that time, external table is defined that points to that data, but doesn’t take ownership of it.
In Hive We can also create an external table. It tells Hive to refer to the data that is at an existing location outside the warehouse directory.
Dropping External tables will delete metadata but not the data.
I would like to add that
Internal tables are used when the data needs to be updated or some rows need to be deleted because ACID properties can be supported on the Internal tables but ACID properties cannot be supported on the external tables.
Please ensure that there is a backup of the data in the Internal table because if a internal table is dropped then the data will also be lost.
In simple words, there are two things:
Hive can manage things in warehouse i.e. it will not delete data out of warehouse.
When we delete table:
1) For internal tables the data is managed internally in warehouse. So will be deleted.
2) For external tables the data is managed eternal from warehouse. So can't be deleted and clients other then hive can also use it.

Resources