WHY does this simple Hive table declaration work? As if by magic - hadoop

The following HQL works to create a Hive table in HDInsight which I can successfully query. But, I have several questions about WHY it works:
My data rows are, in fact, terminated by carriage return line feed, so why does 'COLLECTION ITEMS TERMINATED BY \002' work? And what is \002 anyway? And no location for the blob is specified so, again, why does this work?
All attempts at creating the same table and specifying "CREATE EXTERNAL TABLE...LOCATION '/user/hive/warehouse/salesorderdetail'" have failed. The table is created but no data is returned. Leave off "external" and don't specify any location and suddenly it works. Wtf?
CREATE TABLE IF NOT EXISTS default.salesorderdetail(
SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
Any insights are greatly appreciated.
UPDATE: Thanks for the help so far. Here's the exact syntax I'm using to attempt external table creation. (I've only changed the storage account name.) I don't see what I'm doing wrong.
drop table default.salesorderdetailx;
CREATE EXTERNAL TABLE default.salesorderdetailx(SalesOrderID int,
ProductID int,
OrderQty int,
LineTotal decimal)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'wasb://mycn-1@my.blob.core.windows.net/mycn-1/hive/warehouse/salesorderdetailx'

When you create your cluster in HDInsight, you have to specify the underlying blob storage, and Hive assumes that you are referencing that blob storage. You don't need to specify a location because your query is creating an internal table (see answer #2 below), which is created at a default location. External tables need to specify a location in Azure blob storage (outside of the cluster) so that the data in the table is not deleted when the cluster is dropped. See the Hive DDL for more information.
By default, tables are created as internal, and you have to specify the "external" to make them external tables.
Use EXTERNAL tables when:
Data is used outside Hive
You need data to be updateable in real time
Data is needed when you drop the cluster or the table
Hive should not own data and control settings, directories, etc.
Use INTERNAL tables when:
You want Hive to manage the data and storage
Short term usage (like a temp table)
Creating table based on existing table (AS SELECT)
Does the container "user/hive/warehouse/salesorderdetail" exist in your blob storage? That might explain why it is failing for your external table query.
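On the `\002` part of the question: `\002` and `\003` are the octal escapes for the control characters Ctrl-B and Ctrl-C, which Hive uses by default as the collection-item and map-key delimiters. They only matter for columns of a complex type (array, map, struct); row boundaries are governed separately by LINES TERMINATED BY, which defaults to newline. The table in the question has only primitive columns, so those two clauses have no effect on it. A minimal sketch, assuming a hypothetical table with complex columns, shows where they would apply:

```sql
-- Sketch only: the table name and the tags/attrs columns are hypothetical.
CREATE TABLE IF NOT EXISTS default.order_tags (
  SalesOrderID int,
  tags  array<string>,         -- items separated by \002 (Ctrl-B)
  attrs map<string,string>     -- entries separated by \002, key and value by \003 (Ctrl-C)
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','             -- separates the top-level columns
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'             -- the row delimiter, stated explicitly here
STORED AS TEXTFILE;
```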

Related

measure the time of load tables with data in hive (its possible?)

I created a table in hive from data stored in hdfs with this command:
create external table users
(ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/data/tpch/users';
The data behind this users table in HDFS is 10 GB, yet the create table statement took just 1 second to create the table and load the data. So either this is strange or it is really fast. My doubt is: can the time to load data into a Hive table be measured with the command above, using LOCATION? Or does that command just create a reference to the data stored in HDFS?
So what is the correct way to measure the time needed to load data into Hive tables?
1 second seems really fast; MySQL or another relational database would probably need 30 minutes or more to load 10 GB of data into a table.
Your create table statement is pointing to external storage for the tables, so Hive is not copying the data over. The documentation explains external tables like this:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so
that Hive does not use a default location for this table. This comes
in handy if you already have data generated. When dropping an EXTERNAL
table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather
than being stored in a folder specified by the configuration property
hive.metastore.warehouse.dir.
This is not 100% explicit, but the idea is that Hive is pointing to the table contents rather than managing it directly.
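To time something that actually reads the 10 GB, one option is to copy the rows into a managed table in another file format, or simply to run a full scan. The following is a rough sketch, assuming a Hive version with ORC support; users_orc is a hypothetical table name. The Hive CLI prints a "Time taken" line after each statement, which is the figure to compare.

```sql
-- Sketch only: users_orc is a hypothetical managed table.
CREATE TABLE users_orc (ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
STORED AS ORC;

-- Unlike CREATE EXTERNAL TABLE, this launches a job that reads every row
-- from /data/tpch/users and rewrites it, so its elapsed time reflects a real load.
INSERT OVERWRITE TABLE users_orc
SELECT ID, NAME, ADRESS, EMAIL FROM users;

-- A full scan of the external table is another way to see the read cost.
SELECT COUNT(*) FROM users;
```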

External hive table as parquet file returns NULL when queried

I created a .parquet file by using map reduce job. Now I want to create an external table on top of this file. Here is the command:
CREATE EXTERNAL TABLE testparquet (
NAME STRING,
AGE INT
)
STORED AS PARQUET
LOCATION 'file location'
The table is created successfully but when I query the table using simple SELECT * , I get data as NULL for all fields. The version of hive is 0.13.
Is there anything that I am missing?
When using external files, you need to explicitly synchronize the metadata store, which knows about the schema of your data, with the actual data itself.
If you are querying through Impala, you'll typically use the INVALIDATE METADATA command to force subsequent queries to re-read the data. You can also use REFRESH <table-name> if just one table has been updated. Note that both are Impala statements; Hive itself does not provide them.

How to add partition using hive by a specific date?

I'm using hive (with external tables) to process data stored on amazon S3.
My data is partitioned as follows:
DIR s3://test.com/2014-03-01/
DIR s3://test.com/2014-03-02/
DIR s3://test.com/2014-03-03/
DIR s3://test.com/2014-03-04/
DIR s3://test.com/2014-03-05/
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_06-26_19-56.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_15-20_12-53.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_22-54_27-19.log
How to create a partition table using hive?
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
) PARTITIONED BY (? string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
Could somebody answer this question ? Thanks!
First start with the right table definition. In your case I'll just use what you wrote:
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
) PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
Hive by default expects partitions to be in subdirectories named via the convention s3://test.com/partitionkey=partitionvalue. For example
s3://test.com/dt=2014-03-05
If you follow this convention you can use MSCK to add all partitions.
If you can't or don't want to use this naming convention, you will need to add all partitions as in:
ALTER TABLE test
ADD PARTITION (dt='2014-03-05')
location 's3://test.com/2014-03-05'
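For the directory layout in the question, several partitions can be registered in a single statement. This is a sketch that reuses the table and bucket names from the question; the final query is only an illustration of how the partition column is then used.

```sql
-- Register several existing date directories in one statement.
-- IF NOT EXISTS makes the statement safe to re-run.
ALTER TABLE test ADD IF NOT EXISTS
  PARTITION (dt='2014-03-01') LOCATION 's3://test.com/2014-03-01'
  PARTITION (dt='2014-03-02') LOCATION 's3://test.com/2014-03-02'
  PARTITION (dt='2014-03-03') LOCATION 's3://test.com/2014-03-03';

-- List the partitions the metastore now knows about.
SHOW PARTITIONS test;

-- Filtering on dt lets Hive read only the matching directory.
SELECT COUNT(*) FROM test WHERE dt = '2014-03-02';
```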
If you have an existing directory structure that doesn't follow the <partition name>=<partition value> convention, you have to add partitions manually. MSCK REPAIR TABLE won't work unless your directories are structured that way.
After you specify location on table creation like:
CREATE EXTERNAL TABLE test (
foo string,
time string,
bar string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';
You can add partition without specifying full path:
ALTER TABLE test ADD PARTITION (dt='2014-03-05') LOCATION '2014-03-05';
Although I've never checked it, I suggest moving your partitions into a folder inside the bucket rather than keeping them directly in the bucket root, e.g. from s3://test.com/ to s3://test.com/data/.
If you are going to partition using date field you need s3 folder structure as mentioned below:
s3://test.com/date=2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log
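In that case you can create the external table with date as the partition column (date is a reserved keyword in recent Hive versions, so it may need backticks or a different name such as dt)
and run MSCK REPAIR TABLE EXTERNAL_TABLE_NAME to update the Hive metastore.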
Please look at the response posted above by Carter Shanklin. You need to make sure your files are stored in a directory structure of the form partitionkey=partitionvalue, because by default Hive expects partitions to be in subdirectories named via that convention.
In your example it should be stored as
s3://test.com/date=20140305/ip-foo-request-2014-03-05_04-20_00-49.log.
Steps to be followed:
i) Make sure data exists in the above structure
ii) Create the external table
iii) Now run the msck repair table.
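The three steps above can be sketched as follows. This assumes the files have already been moved under a hypothetical s3://test.com/data/ prefix into dt=<value> subdirectories; the table name test_by_dt and the partition column name dt are illustrative choices (dt is used instead of date to avoid the reserved keyword).

```sql
-- Step i (done outside Hive): files live under paths such as
--   s3://test.com/data/dt=2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log

-- Step ii: create the external table over the parent directory.
CREATE EXTERNAL TABLE test_by_dt (
  foo string,
  `time` string,
  bar string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/data/';

-- Step iii: scan the location and register every dt=... directory found.
MSCK REPAIR TABLE test_by_dt;

-- Check the result.
SHOW PARTITIONS test_by_dt;
```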
I think the data is present in the S3 location but might not have been updated in the metadata (EMRFS). For this to work, first run emrfs import and emrfs sync,
and then run the msck repair.
It will add all the partitions that are present in S3.

Error creating a Hive table in HDInsight from a different blob container: Path is not legal

CREATE TABLE test1 (Column1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername' OVERWRITE INTO TABLE test1 ;
Loading the data generates the following error:
FAILED: Error in semantic analysis: Line 1:18 Path is not legal
''asv://hivetest@mystorageaccount.blob.core.windows.net/foldername'':
Move from:
asv://hivetest@mystorageaccount.blob.core.windows.net/foldername to:
asv://hdi1@hdinsightstorageaccount.blob.core.windows.net/hive/warehouse/test1
is not valid. Please check that values for params "default.fs.name"
and "hive.metastore.warehouse.dir" do not conflict.
The container hivetest is not my default HDInsight container. It is even located on a different storage account. However, the problem is probably not with the account credentials, as I have edited core-site.xml to include mystorageaccount.
How can I load data from a non-default container?
Apparently it's impossible by design to load data into a Hive table from a non-default container. The workaround suggested by the answer in the link is using an external table.
I was trying to use a non-external table so I can take advantage of partitioning, but apparently it's possible to partition even an external table, as explained here.
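A sketch of that workaround, assuming the storage URI from the question written in the container@account form. The table name test1_ext, the dt partition column, and the date subfolder are hypothetical and only illustrate that an external table can be partitioned too.

```sql
-- Sketch only: test1_ext, dt and the 2014-03-05 subfolder are hypothetical.
-- The data stays in the non-default container; nothing is moved.
CREATE EXTERNAL TABLE test1_ext (Column1 string)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername';

-- Attach one subfolder of that container as a partition.
ALTER TABLE test1_ext ADD PARTITION (dt='2014-03-05')
LOCATION 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername/2014-03-05';
```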

Difference between Hive internal tables and external tables?

Can anyone tell me the difference between Hive's external tables and internal tables?
I know the difference shows up when dropping the table. I don't understand what is meant by saying that the data and metadata are deleted for internal tables while only the metadata is deleted for external tables.
Can anyone explain this to me in terms of nodes, please?
Hive has a relational database on the master node it uses to keep track of state.
For instance, when you CREATE TABLE FOO(foo string) LOCATION 'hdfs:///tmp/';, this table schema is stored in the database.
If you have a partitioned table, the partitions are stored in the database (this allows Hive to use lists of partitions without going to the file system and finding them, etc.). These sorts of things are the 'metadata'.
When you drop an internal table, it drops the data, and it also drops the metadata.
When you drop an external table, it only drops the metadata. That means Hive is ignorant of that data now. It does not touch the data itself.
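The drop behaviour described above can be checked with a small experiment. This is a sketch with hypothetical table names and a hypothetical /tmp/external_demo directory; depending on the cluster's trash settings, data from the managed table may be moved to the trash rather than removed immediately.

```sql
-- Sketch only: table names and the /tmp/external_demo path are hypothetical.
CREATE TABLE managed_demo (foo string);
CREATE EXTERNAL TABLE external_demo (foo string) LOCATION '/tmp/external_demo';

-- The "Table Type" row shows MANAGED_TABLE or EXTERNAL_TABLE,
-- and the "Location" row shows where the files live.
DESCRIBE FORMATTED managed_demo;
DESCRIBE FORMATTED external_demo;

DROP TABLE managed_demo;    -- metadata removed and the data directory deleted
DROP TABLE external_demo;   -- metadata removed; files under /tmp/external_demo remain

-- Re-creating the external table over the same location makes the data queryable again.
CREATE EXTERNAL TABLE external_demo (foo string) LOCATION '/tmp/external_demo';
```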
Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
You want to use a custom location such as ASV.
Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
To answer your question:
For External Tables, Hive stores the data in the LOCATION specified during creation of the table(generally not in warehouse directory). If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables, Hive stores data into its warehouse directory. If the table is dropped then both the table metadata and the data will be deleted.
For your reference,
Difference between Internal & External tables :
For External Tables -
An external table's files are stored on the HDFS server, but the table is only loosely coupled to the underlying files.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to the HDFS file structure, and therefore security needs to be managed at the HDFS file/folder level.
Metadata is maintained on the master node, and deleting an external table from HIVE only deletes the metadata, not the data/file.
For Internal Tables-
Stored in a directory based on the setting hive.metastore.warehouse.dir; by default internal tables are stored in the directory “/user/hive/warehouse”. You can change it by updating the location in the config file.
Deleting the table deletes the metadata and data from master-node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).
Hive may have internal or external tables, this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source :
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
An internal table data is stored in the warehouse folder, whereas an external table data is stored at the location you mentioned in table creation.
So when you delete an internal table, it deletes the schema as well as the data under the warehouse folder, but for an external table it's only the schema that you will lose.
So when you want an external table back again after deleting it, you can create a table with the same schema and point it to the original data location. Hope it is clear now.
The only difference in behaviour (not the intended usage) based on my limited research and testing so far (using Hive 1.1.0 -cdh5.12.0) seems to be that when a table is dropped
the data of the Internal (Managed) tables gets deleted from the HDFS file system
while the data of the External tables does NOT get deleted from the HDFS file system.
(NOTE: See the section 'Managed and External Tables' in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL, which lists some other differences that I did not completely understand.)
I believe Hive chooses the location where it needs to create the table based on the following precedence from top to bottom
Location defined during the Table Creation
Location defined in the Database/Schema Creation in which the table is created.
Default Hive Warehouse Directory (Property hive.metastore.warehouse.dir in hive-site.xml)
When the "Location" option is not used during the "creation of a hive table", the above precedence rule is used. This is applicable for both Internal and External tables. This means an Internal table does not necessarily have to reside in the Warehouse directory and can reside anywhere else.
Note: I might have missed some scenarios, but based on my limited exploration, the behaviour of Internal and External tables seems to be the same except for the one difference (data deletion) described above. I tried the following scenarios for both Internal and External tables.
Creating table with and without Location option
Creating table with and without Partition Option
Adding new data using the Hive Load and Insert Statements
Adding data files to the Table location outside of Hive (using HDFS commands) and refreshing the table using the "MSCK REPAIR TABLE" command
Dropping the tables
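The location precedence described above can be observed with DESCRIBE FORMATTED. This is a sketch under the same assumptions as that answer (behaviour observed on Hive 1.1.0); the database name, table names and paths are hypothetical.

```sql
-- Sketch only: sales_db, the table names and the /data/... paths are hypothetical.
CREATE DATABASE IF NOT EXISTS sales_db LOCATION '/data/sales_db';

-- No LOCATION clause: expected to land under the database location.
CREATE TABLE sales_db.t_default (id int);

-- Explicit LOCATION: takes precedence over the database location,
-- even though this is a managed (internal) table.
CREATE TABLE sales_db.t_custom (id int) LOCATION '/data/elsewhere/t_custom';

-- Compare the "Location" rows of the two outputs.
DESCRIBE FORMATTED sales_db.t_default;
DESCRIBE FORMATTED sales_db.t_custom;
```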
If you drop an external table, only the schema of the table is deleted; the table data still exists in its physical location. To delete the data as well, use hadoop fs -rm -r <table location> (hadoop fs -rmr on older Hadoop releases).
For managed tables, Hive has full control over the tables. For external tables, the users keep control over the data.
INTERNAL : Table is created First and Data is loaded later
EXTERNAL : Data is present and Table is created on top of it.
Internal tables are useful if you want Hive to manage the complete lifecycle of your data including the deletion, whereas external tables are useful when the files are being used outside of Hive.
An external Hive table has the advantage that its files are not removed when we drop the table. As with any table, we can set the row format with different settings, such as a SerDe or DELIMITED.
Also keep in mind that Hive is a big data warehouse. When you want to drop a table you don't want to lose gigabytes or terabytes of data. Generating, moving and copying data at that scale can be time consuming.
When you drop a 'Managed' table, Hive will also trash its data.
When you drop an 'External' table, only the schema definition is removed from the Hive metastore. The data on HDFS still remains.
Consider this scenario which best suits for External Table:
A MapReduce (MR) job filters a huge log file and emits n sub log files (e.g. each sub log file contains logs of a specific message type), and the output, i.e. the n sub log files, is stored in HDFS.
These log files are to be loaded into Hive tables for further analytics. In this scenario I would recommend external tables, because the actual log files are generated and owned by an external process, i.e. an MR job. Besides, you avoid the additional step of loading each generated log file into its respective Hive table.
The best use case for an external table in Hive is when you want to create the table from an existing file, either CSV or text.
Both Internal and External tables are owned by HIVE. The only difference being the ownership of data. The commands for creating both tables are shown below. Only an additional EXTERNAL keyword comes in case of external table creation. Both tables can be created/deleted/modified using SQL Statements.
In case of Internal Tables, both the table and the data contained in the tables are managed by HIVE. That is, we can add/delete/modify any data using HIVE. When we DROP the table, along with the table, the data will also get deleted.
Eg: CREATE TABLE tweets (text STRING, words INT, length INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
In case of External Tables, only the table is managed by HIVE. The data present in these tables can be from any storage location, such as HDFS. Hive does not own this data: we can read it with SELECT statements (and INSERT or LOAD can still add files to the location), but row-level UPDATE and DELETE are not available because they require ACID managed tables. When we DROP the table, only the table gets deleted and not the data contained in it. This is why it's said that only the metadata gets deleted. When we create EXTERNAL tables, we need to mention the location of the data.
Eg: CREATE EXTERNAL TABLE tweets (text STRING, words INT, length INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hive/warehouse/tweets';
Hive stores only the metadata in the metastore and keeps the original data outside of Hive. When we use an external table we can give a LOCATION '...', so our original data is not affected when we drop the table.
When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified in the LOCATION properties instead of the default warehouse directory.
When keeping data in the internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If the external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake.
For managed tables, Hive controls the lifecycle of their data. Hive stores the data for managed tables in a sub-directory under the directory defined by hive.metastore.warehouse.dir by default.
When we drop a managed table, Hive deletes the data in the table. But managed tables are less convenient for sharing with other tools. For example, let's say we have data that is created and used primarily by Pig, and we want to run some queries against it without giving Hive ownership of the data.
In that case, an external table is defined that points to that data but doesn’t take ownership of it.
In Hive we can also create an external table. It tells Hive to refer to data that is at an existing location outside the warehouse directory.
Dropping External tables will delete metadata but not the data.
I would like to add that
Internal tables are used when the data needs to be updated or some rows need to be deleted, because ACID properties can be supported on internal tables but not on external tables.
Please ensure that there is a backup of the data in the internal table, because if an internal table is dropped then the data will also be lost.
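As a rough illustration of the ACID point, here is a sketch of a transactional managed table. It assumes a Hive version with ACID support and a cluster where transactions are enabled (for example hive.support.concurrency set to true and hive.txn.manager set to the DbTxnManager); the table name and data are hypothetical. The CLUSTERED BY clause is included because older ACID-capable releases require bucketing.

```sql
-- Sketch only: tweets_acid and its rows are hypothetical.
-- ACID tables must be managed (not EXTERNAL) and stored as ORC.
CREATE TABLE tweets_acid (id INT, text STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE tweets_acid VALUES (1, 'hello'), (2, 'world');

-- Row-level changes, which are not available on external tables.
UPDATE tweets_acid SET text = 'hello again' WHERE id = 1;
DELETE FROM tweets_acid WHERE id = 2;
```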
In simple words, there are two things:
Hive manages only what is in its warehouse, i.e. it will not delete data outside the warehouse.
When we delete a table:
1) For internal tables the data is managed internally in the warehouse, so it will be deleted.
2) For external tables the data is managed externally to the warehouse, so it is not deleted, and clients other than Hive can also use it.

Resources