I want to perform some temporary operations on files in HDFS using Hive, so I do not want to use an internal table. But my data is huge, for example 1 TB, so I am worried about the performance of external tables.
So my question is:
What is the difference in performance between internal and external tables in Hive?
You may just create Hive external tables and use them. I haven't noticed any major performance difference between internal and external tables.
To improve performance you may create ORC (file format) tables, which are managed by Hive.
Create ORC table:
CREATE TABLE IF NOT EXISTS <orc_table_name> (
    <col_name> <type>)
COMMENT 'comments'
STORED AS ORC;   -- ORC is a binary columnar format, so no ROW FORMAT DELIMITED clause is needed
Then insert data into the ORC table:
INSERT OVERWRITE TABLE <orc_table_name> SELECT * FROM <external_table_name>;
Refer: HDFS to Hive external table and ORC
The difference between external and internal table performance that I have experienced is:
internal tables take more CPU time;
external tables take approximately 40% less CPU time.
Related
I am creating an external table that refers to ORC files in an HDFS location. The ORC files are stored in such a way that the external table is partitioned by date (mapping to date-wise folders on HDFS as partitions).
However, I am wondering whether I can enforce 'bucketing' on these external tables, because the underlying data/files are not 'managed' by Hive. They are written externally, so can bucketing be used with Hive external tables?
Hive is allowing me to use the 'CLUSTERED BY' clause while creating an external table, but I am not able to understand how Hive would redistribute into buckets data that is already written on HDFS as ORC files.
I have seen similar questions on PARTITION AND BUCKETING in External tables here:
Hive: Does hive support partitioning and bucketing while using external tables
and
Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?
but the answers talk only about partition support in external tables or bucket support in MANAGED tables. I am aware of both of those options and am already using them, but I need a specific answer about bucketing support in Hive EXTERNAL tables.
So, in summary: do Hive external tables support bucketing?
If yes, how is the data in the external folder redistributed into buckets by Hive?
Yes, Hive does support bucketing and partitioning for external tables.
Just try it:
SET hive.tez.bucket.pruning=true;                  -- prune buckets at scan time on Tez
SET hive.optimize.sort.dynamic.partition=true;     -- sort rows by partition key before the insert
SET hive.exec.dynamic.partition=true;              -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict;    -- allow all partition columns to be dynamic
SET hive.enforce.bucketing=true;                   -- make inserts honor the declared bucket count
drop table stg.test_v1;
create external table stg.test_v1
(
id bigint
,name string
)
partitioned by (created_date string)
CLUSTERED BY(name) INTO 3 BUCKETS
stored as ORC
;
INSERT OVERWRITE TABLE stg.test_v1 partition(created_date)
SELECT
id, name, created_date
FROM
(select stack(3,
1, 'Valeriy', '2020-01-01',
2, 'Victor', '2020-01-01',
3, 'Ankit', '2020-01-01'
) as(id, name, created_date)
)s;
DESC FORMATTED says:
Table Type: EXTERNAL_TABLE
...
Num Buckets: 3
Bucket Columns: [name]
Load more rows and you will see that Hive creates 3 files per partition.
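To verify, list the partition directory from the Hive CLI; the warehouse path below is hypothetical, check the Location field of DESC FORMATTED for the real one:
dfs -ls /warehouse/stg.db/test_v1/created_date=2020-01-01;
With 3 buckets you should see 3 files, typically named 000000_0, 000001_0 and 000002_0.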
See also this documentation for more details about features supported for Managed and External tables: Managed vs External Tables.
Can anyone please explain why and where we use external tables in Hive?
Please explain a scenario to make it easy to understand.
We use an external table when the underlying dataset pointed to by the Hive table is shared by many consumers, i.e. MapReduce jobs, Pig, etc., and a managed table when the dataset pointed to by the Hive table is used only by Hive.
In a managed table, Hive has full control over the dataset: if you drop a managed table, the dataset is also deleted from the Hive warehouse (/user/hive/warehouse) in HDFS, but if you drop an external table, the dataset is not deleted from HDFS.
For example, suppose you have a 50 GB dataset. Creating multiple copies of it for different purposes simply takes more space, so the better option is an external table: when you drop the table the dataset is not deleted, and it can still be used by any other application, such as Pig.
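A minimal sketch of that scenario (the HDFS path and table name are illustrative):
-- external table over a shared dataset; Hive stores only the metadata
CREATE EXTERNAL TABLE sales_raw (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/shared/sales';
-- dropping it removes only the metadata; the files under /data/shared/sales
-- remain available to Pig, MapReduce jobs and anything else
DROP TABLE sales_raw;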
As a rule of thumb: use an external table if you plan to work with the data not only from Hive but from other frameworks as well. Otherwise make it internal.
The only difference between external and managed tables in Hive is the DROP TABLE / DROP PARTITION behavior: for a managed table the data is dropped as well, while for an external table the data remains untouched in the table/partition location.
Use external tables in most cases. An external table lets you change the table definition easily, and you can also create several tables on top of the same location.
Use a managed table if the table is temporary/intermediate and the data should be deleted to free space.
A managed table can be converted to external and vice versa using:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');   -- managed -> external
alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- external -> managed
We are stuck with a problem wherein we are trying to do a near-real-time sync between an RDBMS (source) and Hive (target). Basically the source pushes the changes (inserts, updates and deletes) into HDFS as Avro files. These are loaded into external tables (with the Avro schema) in Hive. There is also a base table in ORC, which has all the records that came in before the source pushed the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates to existing rows) and remove all deleted records (since there could be deletes from the source).
We are currently de-duping using rank() partitioned by the key columns over the union of the external table and the base table. The result is then pushed into a new table, and the names are swapped. This is taking a lot of time.
We tried using merges and ACID transactions, but rank over partition and then filtering out the rows has given us the best time so far.
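For reference, the de-dupe we run looks roughly like this (table, key and column names are illustrative, not our real schema):
INSERT OVERWRITE TABLE base_new
SELECT id, col1, col2, op_ts
FROM (
    SELECT u.*, rank() OVER (PARTITION BY id ORDER BY op_ts DESC) AS rnk
    FROM (
        SELECT id, col1, col2, op_ts, op_type FROM delta_ext        -- incoming Avro changes
        UNION ALL
        SELECT id, col1, col2, op_ts, 'I' AS op_type FROM base_orc  -- current ORC base
    ) u
) r
WHERE rnk = 1          -- keep only the latest version of each key
  AND op_type <> 'D';  -- drop keys whose latest change is a delete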
Is there a better way of doing this? Any suggestions for improving the process altogether? We have quite a few tables, so we do not have any partitions or buckets at this moment.
You can try storing all the transactional data in an HBase table.
Storing data in an HBase table using the primary key of the RDBMS table as the row key:
Once you pull the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output of those processors will be in Avro format.
You can use the ConvertAvroToJSON processor and then the SplitJson processor to split each record out of the array of JSON records.
Store all the records in the HBase table, with the row key being the primary key of the RDBMS table.
When we get an incremental load based on a last-modified-date field, we receive both the updated records and the newly added records from the RDBMS table.
If we get an update for an existing row key, HBase overwrites the existing data for that record; newly added records are added as new rows.
Then, using the Hive-HBase integration, you can expose the HBase table data through Hive:
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
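A minimal sketch of exposing such an HBase table in Hive (table, column family and column names are illustrative):
CREATE EXTERNAL TABLE hbase_txn (
    id     BIGINT,   -- mapped to the HBase row key
    name   STRING,
    amount DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:amount")
TBLPROPERTIES ("hbase.table.name" = "txn");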
With this method the HBase table takes care of all the upsert operations, but you cannot expect the same performance from a Hive-HBase table as from a native Hive table: HBase tables are not meant for SQL-style queries, and they are most efficient when you access data by row key.
If we are going to have millions of records, we will need to do some tuning of the Hive queries:
Tuning Hive Queries That Uses Underlying HBase Table
I created a table in Hive from data stored in HDFS with this command:
create external table users (
    ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/tpch/users';
The users data stored in HDFS is 10 GB, yet the CREATE TABLE took just one second to create the table and load the data. So either this is strange or it is really fast. My doubt is: does the command above with LOCATION actually let me measure the time to load data into Hive, or does it just create a reference to the data stored in HDFS?
So what is the correct way to measure the time to load data into Hive tables?
Because one second seems really fast; MySQL or another relational database would probably need 30 minutes or more to load 10 GB into a table.
Your create table statement is pointing to external storage for the tables, so Hive is not copying the data over. The documentation explains external tables like this:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so
that Hive does not use a default location for this table. This comes
in handy if you already have data generated. When dropping an EXTERNAL
table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather
than being stored in a folder specified by the configuration property
hive.metastore.warehouse.dir.
This is not 100% explicit, but the idea is that Hive is pointing at the table contents rather than managing them directly.
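If you want to time an actual data load, copy the data into a managed table and time that statement; a minimal sketch, with orc_users as an illustrative name:
-- CTAS forces Hive to read all 10 GB and rewrite it, so its runtime
-- reflects a real load, unlike the metadata-only CREATE EXTERNAL TABLE
CREATE TABLE orc_users STORED AS ORC AS
SELECT * FROM users;
By contrast, your CREATE EXTERNAL TABLE ... LOCATION statement only writes metastore entries, which is why it returns in about a second.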
I'm a beginner with Hadoop.
Internal table: the table is stored in the Hive warehouse, and if it is dropped, both the metadata and the data are deleted.
External table: the table is stored in HDFS, and if it is dropped, only the metadata is deleted.
Now, which table gives better performance while querying? Please give a reason.
Also, it would be much appreciated if you could give some more real-world differences between these tables.
Thanks in advance.
There is no performance difference at all between internal and external tables. The only difference is just what you mentioned. One thing to note: the Hive warehouse is also in HDFS (just at a different path).
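You can confirm this yourself from the metadata; my_table is an illustrative name:
DESCRIBE FORMATTED my_table;
-- check the Table Type field (MANAGED_TABLE vs EXTERNAL_TABLE) and the
-- Location field: a managed table lives under /user/hive/warehouse,
-- an external one under whatever LOCATION you gave it
Either way the data sits on HDFS, so querying it performs the same.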