I have many small tables (less than 5 KB when exported as CSV) that are just "from to" mappings (e.g. code to name) and must be used in a JOIN, only to translate internal codes or IDs... How should I use CREATE TABLE for them in Hive?
Example:
CREATE TABLE mydb.fromto1(id1 bigint, name1 string);
CREATE TABLE mydb.fromto2(
id2 bigint,
name2 varchar(10)
)
PARTITIONED BY (ingestion_day date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://TLVBRPRDK/apps/hive/warehouse/mydb.db/fromto2'
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'orc.row.index.stride'='50000',
'orc.stripe.size'='67108864',
'transient_lastDdlTime'='1577456923'
);
-- INSERT INTO mydb.fromto1 10 lines
-- INSERT INTO mydb.fromto2 10 lines
CREATE VIEW mydb.vw_test1 AS -- need for BEST PERFORMANCE HERE!
SELECT big.*, tiny.name1
FROM mydb.big_fact_table big INNER JOIN mydb.fromto1 tiny ON big.id1=tiny.id1
-- and/or INNER JOIN mydb.fromto2 tiny2 ON big.id2=tiny2.id2
;
How do I set the correct parameters (partitioned or not, compressed or not, managed or external, row format, etc.) for best performance in a SQL JOIN with big data (fact) tables?
Is there a "good Fast Guide" or Wizard for it?
NOTES:
This question/answer is not the same. Perhaps there are clues in optimizations for "Hive star-schema JOINs", but not here.
There are some clues here on cwiki.Apache/Hive/LanguageManual+JoinOptimization, but it is not about CREATE TABLE.
You definitely do not need partitioning for such small tables. It is better if each table sits in a single file: not partitioned, not bucketed.
Use these settings for join optimization (increase the values if necessary). Check the EXPLAIN plan; it should show a MapJoin operator (see the example after these settings), so the small tables can be joined on the same mapper.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=157286400; --if the file size is smaller than this threshold, map join will be used
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 157286400; --combined small tables size
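To verify, run EXPLAIN on the join from the question and look for a Map Join Operator in the plan:
EXPLAIN
SELECT big.*, tiny.name1
FROM mydb.big_fact_table big
INNER JOIN mydb.fromto1 tiny ON big.id1 = tiny.id1;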
Using TEXTFILE for small tables may be better than ORC, because a plain TEXTFILE can be smaller in size for such small tables. The same rule applies to compression: use it only if it significantly reduces the file size; small files cannot always be compressed efficiently (a compressed small file can even be bigger than the uncompressed one). Use ORC for bigger dimensions. Check the file size and decide.
Bear in mind that the fastest SerDe is LazySimpleSerDe, so a default tab-delimited TEXTFILE is good for small files. For bigger files use ORC and compression.
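For illustration, a minimal DDL along these lines for the first lookup table from the question (a sketch, not the only valid choice):
CREATE TABLE mydb.fromto1 (
  id1   bigint,
  name1 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- one small tab-delimited file, no partitions, no buckets, no compression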
External or managed - does not matter in this context.
Related
Can I directly consider the Hive partition columns similar to the partition columns present in my source (Teradata) tables? Or do I have to consider any other parameters to decide the Hive partitioning columns? Please help.
This is not best practice. If you create data in this manner, then a person who is trying to access the HDFS data directly will not find the partition columns in each partition. For example, say the Teradata table is partitioned by a date column; if the Hive table is also partitioned by date, then an HDFS partition, say 2016-08-06, will not have the date field in its files. So to make it easy for the end user, partition by a dummy column, say date_d, which has exactly the same values as the date column.
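A minimal sketch of that approach (table and column names are hypothetical):
-- keep the real date in a regular column and partition by a copy of it,
-- so the data files on HDFS still contain the date values
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales (
  txn_id   bigint,
  txn_date string
)
PARTITIONED BY (date_d string);

INSERT OVERWRITE TABLE sales PARTITION (date_d)
SELECT txn_id, txn_date, txn_date AS date_d
FROM sales_staging;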
Abstractly, partitioning in Teradata and Hive is similar. To begin with, you can probably use the same columns as in your source to partition the tables.
If your data size is huge in each single partition, then consider partitioning it further to improve performance. The multilevel partitioning would mostly depend on the number of filters you apply in your queries.
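For example, a hypothetical multilevel layout for when single date partitions grow too large:
CREATE TABLE events (
  event_id bigint,
  payload  string
)
PARTITIONED BY (event_date string, region string);
-- queries filtering on both event_date and region prune down to a single leaf partition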
I have two scripts which parse data from raw logs and write it into ORC tables in Hive. One script creates more columns and the other fewer. Both tables are partitioned by a date field.
As a result I have ORC tables with different file sizes.
The table with the larger number of columns consists of many small files (~4 MB per file inside each partition) and the table with fewer columns consists of a few large files (~250 MB per file inside each partition).
I suppose this happens because of the stripe.size setting in ORC. But I don't know how to check the stripe size for an existing table. Commands like "show create" and "describe" don't reveal any custom settings, which would mean the stripe size for these tables equals the 256 MB default.
I'm looking for any advice on how to check stripe.size for an existing ORC table, or an explanation of how the file sizes inside ORC tables depend on the data in those tables.
P.S. It matters later when I'm reading from those tables with MapReduce, and there is a small number of reducers for the tables with big files.
Try the Hive ORC File Dump Utility.
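For example (the file path is hypothetical; point it at one of your ORC files):
hive --orcfiledump /apps/hive/warehouse/mydb.db/mytable/dt=2016-08-06/000000_0
The dump lists each stripe with its offset and data length, so you can see the actual stripe sizes in an existing table's files.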
I'm currently implementing ETL (Talend) of monitoring data to HDFS and a Hive table.
I am now facing concerns about duplicates. In more detail: if we need to run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
The solution to that in an RDBMS would have been to store the input file name and to "DELETE WHERE file_name = ..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like to have your advice on how to handle this. I envisage two solutions:
Actually, the ETL is putting CSV files into HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that, with this operation, I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of this created ORC file? If yes, I would be able to search the data by its file name and delete it before launching the ETL.
I'm not familiar with Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID in Hive? Would I be able to "DELETE WHERE" with it?
Feel free to propose any other solution you may have.
Bests,
Orlando
If the data volume in target table is not too large, I would advise
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
  ( SELECT 1
    FROM trg x
    WHERE x.key = src.key
      AND <<additional filter on target to reduce data volume>>
  )
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys in the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap can fit in the RAM available for the mappers' heap (check your default conf files, and increase it with a set command in the Hive script if necessary), the performance will be sub-optimal, but you can be pretty sure that you will not have any duplicates.
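For instance, the mapper memory can be raised from within the Hive script itself (illustrative values, assuming MRv2 property names):
set mapreduce.map.memory.mb=4096;       -- container size for mappers
set mapreduce.map.java.opts=-Xmx3686m;  -- JVM heap, somewhat below the container size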
And in your actual use case you don't have to check each key but only a "batch ID", more precisely the original file name; the way I've done it in my previous job was
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
  ( SELECT DISTINCT 1
    FROM trg x
    WHERE x.original_file_name = src.INPUT__FILE__NAME
      AND <<additional filter on target to reduce data volume>>
  )
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters -- so the overhead would stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would do this automatically at execution time, but Hive does not (yet), so you have to force it. Note also that the "1" is just a dummy value required by "SELECT" semantics; again, a mature DBMS would allow a dummy "null", but some versions of Hive would crash (e.g. with Tez in v0.14), so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering myself. I found a solution:
I partitioned my table by (date, input_file_name) (note: I can get the input_file_name with SELECT INPUT__FILE__NAME in Hive).
Once I did this, before running the ETL, I can send Hive an ALTER TABLE ... DROP IF EXISTS PARTITION (input_file_name=...) so that the folder containing the input data is deleted if this input file has already been loaded into the ORC table.
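A minimal sketch of that flow (table, column, and file names are hypothetical):
-- drop any previous load of this input file before re-running the ETL
ALTER TABLE trg DROP IF EXISTS PARTITION (dt='2016-08-06', input_file_name='file_2016_08_06.csv');

-- reload; INPUT__FILE__NAME returns the full HDFS path, so trim it to the
-- base file name before using it as a partition value
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE trg PARTITION (dt, input_file_name)
SELECT col1,
       col2,
       to_date(event_ts) AS dt,
       regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0) AS input_file_name
FROM src;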
Thank you everyone for your help.
Cheers,
Orlando
Using Hive 0.12.0, I am looking to populate a table that is partitioned and uses buckets with data stored on HDFS. I would also like to create an index of this table on a foreign key which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data in a "flat" intermediate table (no partition, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you
Whenever I had a similar requirement, I used almost the same approach you are using, as I couldn't find an efficiently working alternative.
However, to make the dynamic partitioning process a bit faster, I tried setting a few configuration parameters, like:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions = 2000;
set hive.exec.max.dynamic.partitions.pernode = 10000;
I am sure you must be using the first two, and the last two you can set depending on your data size.
You can check out the Configuration Properties page and decide for yourself which parameters might help make your process faster, e.g. increasing the number of reducers used.
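For example, reducer parallelism can be tuned with (illustrative values):
set hive.exec.reducers.bytes.per.reducer=134217728; -- target ~128 MB of input per reducer
set hive.exec.reducers.max=500;                     -- cap the total number of reducers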
I cannot guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.
I need to compress a table. I used ALTER TABLE tablename COMPRESS to compress the table, but afterwards the table size remained the same.
How should I be compressing the table?
To compress the old blocks of the table use:
alter table table_name move compress;
This will reinsert the records into other blocks, compressed, and discard the old blocks, so you'll gain space. It also invalidates the indexes, so you will need to rebuild them.
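For example, with a hypothetical index name:
ALTER INDEX my_table_idx REBUILD;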
Compress does not affect already stored rows. Please check the official documentation:
"You specify table compression with the COMPRESS clause of the CREATE TABLE statement. You can enable compression for an existing table by using this clause in an ALTER TABLE statement. In this case, the only data that is compressed is the data inserted or updated after compression is enabled..."
ALTER TABLE t MOVE COMPRESS is a valid answer. But if you use different non-default options, especially with big data volumes, run regression tests before using ALTER TABLE ... MOVE.
Historically there have been more problems (performance degradations and bugs) with it. If you have access, check the Oracle bug database to see if there are known problems for the features and version you use.
You are on the safer side if you:
create a new table
insert the data from the original (old) table
drop the old table
rename the new table to the old table name
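A sketch of that safer path (names hypothetical):
CREATE TABLE my_table_new COMPRESS AS SELECT * FROM my_table;
DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;
-- note: indexes, grants, and constraints do not carry over to the CTAS copy
-- and must be recreated on the renamed table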