Does a performance issue occur when transferring data from an OrcSerde table to a LazySimpleSerDe table? - hadoop

I have a Hive query (TEZ is enabled).
It selects the data from a ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' table: tableA
Then it inserts into another ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' table: tableB
INSERT OVERWRITE TABLE tableB
PARTITION (dt=%(datetime)s)
SELECT
fieldX,
fieldY
FROM
tableA
WHERE
SPLIT(tableA.ac_ad, "_")[0] = 1
AND SPLIT(tableA.ac_ad, "_")[1] IN (1, 2, 3, 4, 5)
We are facing a performance issue with this, and I think the cause is the data transformation between OrcSerde and LazySimpleSerDe. The string-manipulating SPLIT functions in the WHERE clause look suspicious as well.
Here is some info from ChatGPT:
The transformation from an OrcSerde table to a LazySimpleSerDe table
can impact the performance of a Hive query.
When you perform a SELECT operation on an OrcSerde table, the data is
read in a columnar format and processed efficiently. When you then
INSERT the data into a LazySimpleSerDe table, the data must be
transformed into a row-based format and written to disk. This
transformation process can be time-consuming and may result in slower
performance compared to using an OrcSerde table.
LazySimpleSerDe is designed for simple data structures with relatively
low performance requirements. If you need to perform complex
operations on your data, it's recommended to stick with OrcSerde to
take advantage of its optimizations. If you don't need to perform
complex operations, LazySimpleSerDe may be sufficient.
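If the SPLIT calls turn out to be part of the problem, one thing to try is computing the split only once per row and comparing against string literals, since SPLIT returns an array<string> and comparing its elements to integers forces implicit casts. This is only a sketch reusing the names from the query above, not a verified fix:
-- Sketch: split ac_ad once in a subquery and filter on the resulting array.
INSERT OVERWRITE TABLE tableB
PARTITION (dt=%(datetime)s)
SELECT
    t.fieldX,
    t.fieldY
FROM (
    SELECT fieldX, fieldY, SPLIT(ac_ad, '_') AS ac_ad_parts
    FROM tableA
) t
WHERE
    t.ac_ad_parts[0] = '1'
    AND t.ac_ad_parts[1] IN ('1', '2', '3', '4', '5');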

Related

Best config for tiny tables used for INNER JOINs

I have many small tables (less than 5 KB when exported as CSV) that are only "from-to" mappings (e.g. code to name) and must be used in a JOIN, just to translate internal codes or IDs... How should I use CREATE TABLE for them in Hive?
Example:
CREATE TABLE mydb.fromto1(id1 bigint, name1 string);
CREATE TABLE mydb.fromto2(
id2 bigint,
name2 varchar(10)
)
PARTITIONED BY (ingestion_day date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://TLVBRPRDK/apps/hive/warehouse/mydb.db/fromto2'
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'orc.row.index.stride'='50000',
'orc.stripe.size'='67108864',
'transient_lastDdlTime'='1577456923'
);
-- INSERT INTO mydb.fromto1 10 lines
-- INSERT INTO mydb.fromto2 10 lines
CREATE VIEW mydb.vw_test1 AS -- need for BEST PERFORMANCE HERE!
SELECT big.*, tiny.name1
FROM mydb.big_fact_table big INNER JOIN mydb.fromto1 tiny ON big.id1=tiny.id1
-- and/or INNER JOIN mydb.fromto2 tiny2 ON big.id2=tiny2.id2
;
How to set correct parameters (partitioned or not, compressed or not, managed or external, row format, etc.) for best performance in a SQL JOIN with Big Data (fact) tables?
Is there a "good Fast Guide" or Wizard for it?
NOTES:
This question/answer is not the same. Perhaps there are clues in optimizations for "Hive Star-schema JOINs", but not here.
There are some clues here on cwiki.Apache/Hive/LanguageManual+JoinOptimization, but it is not about CREATE TABLE.
You definitely do not need partitioning for such small tables. It is better if each table is in a single file: not partitioned, not bucketed.
Use these settings for join optimization (increase the figures if necessary). Check the EXPLAIN plan; it should show a map join operator, and small tables can be joined in the same mapper.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=157286400; --if the file size is smaller than this threshold, map join will be used
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 157286400; --combined small tables size
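To confirm the map join actually kicks in, you can run EXPLAIN on the join and look for a Map Join Operator rather than a common (shuffle) join. A minimal sketch using the tables from the question:
EXPLAIN
SELECT big.*, tiny.name1
FROM mydb.big_fact_table big
INNER JOIN mydb.fromto1 tiny ON big.id1 = tiny.id1;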
Using TEXTFILE for small tables may be better than ORC because plain TEXTFILE can be smaller for such small tables. The same rule applies to compression: use it only if it significantly reduces the file size; small files cannot always be compressed efficiently (a compressed small file can even be bigger than the uncompressed one). Use ORC for bigger dimensions. Check the file size and decide.
Bear in mind that the fastest SerDe is LazySimpleSerDe, so the default tab-delimited TEXTFILE is good for small files. For bigger files use ORC and compression.
External or managed - does not matter in this context.
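Putting this together, a tiny "from-to" dimension could be kept as a plain, unpartitioned, unbucketed tab-delimited text table. This is only a sketch reusing fromto1 from the question:
CREATE TABLE mydb.fromto1 (
    id1   bigint,
    name1 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;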

De-duplication from two hive tables

We are stuck with a problem wherein we are trying to do a near-real-time sync between an RDBMS (source) and Hive (target). Basically, the source pushes the changes (inserts, updates and deletes) into HDFS as Avro files. These are loaded into external Hive tables (with the Avro schema). There is also a base table in ORC, which has all the records that came in before the source pushed the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).
We are currently de-duplicating by using rank() over the partitioning keys on the union of the external table and the base table. The result is then pushed into a new table and the names are swapped. This is taking a lot of time.
We tried using merges, acid transactions, but rank over partition and then filtering out all the rows has given us the best possible time at this moment.
Is there a better way of doing this? Any suggestions on improving the process altogether? We have quite a few tables, so we do not have any partitions or buckets at this moment.
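For reference, the rank()-based de-duplication described above would look roughly like this; the table and column names (base_table, delta_table, pk, last_modified, op_type) are made up for illustration and would need adjusting to the real schema:
CREATE TABLE base_table_new STORED AS ORC AS
SELECT pk, col1, col2, last_modified
FROM (
    SELECT u.*,
           RANK() OVER (PARTITION BY pk ORDER BY last_modified DESC) AS rnk
    FROM (
        SELECT pk, col1, col2, last_modified, op_type FROM delta_table
        UNION ALL
        SELECT pk, col1, col2, last_modified, 'U' AS op_type FROM base_table
    ) u
) ranked
WHERE rnk = 1 AND op_type <> 'D';
-- then swap: rename base_table to base_table_old, and base_table_new to base_table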
You can try storing all the transactional data in an HBase table.
Storing data in the HBase table, using the primary key of the RDBMS table as the row key:
Once you pull all the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output from the processors will be in Avro format.
You can use the ConvertAvroToJSON processor and then the SplitJson processor to split each record out of the array of JSON records.
Store all the records in the HBase table with the row key set to the primary key from the RDBMS table.
When we get an incremental load based on the last-modified-date field, we will have updated records and newly added records from the RDBMS table.
If we get an update for an existing row key, HBase overwrites the existing data for that record; newly added records are added as new rows in the table.
Then, by using Hive-HBase integration, you can expose the HBase table data through Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
With this method the HBase table takes care of all the upsert operations, but we cannot expect the same performance from a Hive-HBase table as from a native Hive table; the native table will be faster, as HBase tables are not meant for SQL-style queries, and an HBase table is most efficient when you access data by row key.
If we are going to have millions of records, then we will need to do some tuning of the Hive queries:
Tuning Hive Queries That Uses Underlying HBase Table
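On the Hive side, the Hive-HBase integration table is declared with the HBaseStorageHandler. A minimal sketch with illustrative names (the actual columns depend on your RDBMS schema):
CREATE EXTERNAL TABLE customers_hbase (
    rowkey        string,
    name          string,
    last_modified string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:last_modified')
TBLPROPERTIES ('hbase.table.name' = 'customers');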

Can I directly consider the Hive partition columns similar to the partition columns present in source (Teradata) tables?

Can I directly use the same partition columns in Hive as the partition columns present in my source (Teradata) tables? Or do I have to consider any other parameters to decide the Hive partitioning columns? Please help.
This is not best practice. If you create data in this manner, then a person who is trying to access the HDFS data directly will not find the partition columns inside each partition's files. For example, say the Teradata table is partitioned by a date column; if the Hive table is also partitioned by date, then the files under an HDFS partition, say 2016-08-06, will not contain the date field. So, to make it easy for the end user, partition by a dummy column, say date_d, which holds exactly the same values as the date column.
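A sketch of that dummy-column idea with illustrative names (sale_date stays inside the data files, while the duplicated date_d drives the partitioning):
CREATE TABLE mydb.sales (
    id        bigint,
    amount    double,
    sale_date date
)
PARTITIONED BY (date_d date)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- date_d repeats sale_date, so files under date_d=2016-08-06 still carry the date.
INSERT OVERWRITE TABLE mydb.sales PARTITION (date_d)
SELECT id, amount, sale_date, sale_date AS date_d
FROM mydb.staging_sales;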
Abstractly, partitioning in Teradata and Hive is similar. To begin with, you can probably use the same columns as in your source to partition the tables.
If your data size is huge in each single partition, then consider partitioning it further to improve performance. The multilevel partitioning would mostly depend on the filters you apply in your queries.
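As an illustration of multilevel partitioning (made-up names; the second level should match a filter you commonly use):
CREATE TABLE mydb.events (
    event_id bigint,
    payload  string
)
PARTITIONED BY (event_date date, region string)
STORED AS ORC;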

How Hive partitioning works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it actually works and stores the data in the exact partition.
Let's say I have a table and I have created a dynamic partition on year, and I have ingested data from 2013 onward. How does Hive create the partitions and store the exact data in the exact partition?
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned (e.g. by year), the data is stored separately in different directories, and each directory corresponds to one year.
For a non-partitioned table, when you want to fetch the data for year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just goes to the year=2010 directory. Much faster and more IO-efficient.
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partition columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria.
Using partitions, it is easy to query a portion of the data.
Tables or partitions can be further sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
Suppose you need to retrieve the details of all employees who joined in 2012. A query would search the whole table for the required information. However, if you partition the employee data by year and store it in separate files, that reduces the query processing time.
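A small sketch to illustrate the layout described above (illustrative names; yr stands in for the year partition column):
CREATE TABLE mydb.orders (
    order_id bigint,
    amount   double
)
PARTITIONED BY (yr int)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mydb.orders PARTITION (yr)
SELECT order_id, amount, year(order_ts) AS yr
FROM mydb.orders_staging;

-- On HDFS this produces one directory per year, e.g.:
--   .../warehouse/mydb.db/orders/yr=2013/
--   .../warehouse/mydb.db/orders/yr=2014/
-- A query with WHERE yr = 2013 reads only the yr=2013 directory.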

Hive (0.12.0) - Load data into table with partition, buckets and attached index

Using Hive 0.12.0, I am looking to populate a table that is partitioned and uses buckets, with data stored on HDFS. I would also like to create an index on this table on a foreign key column which I will use a lot when joining tables.
I have a working solution but something tells me it is very inefficient.
Here is what I do:
I load my data in a "flat" intermediate table (no partition, no buckets):
LOAD DATA LOCAL INPATH 'myFile' OVERWRITE INTO TABLE my_flat_table;
Then I select the data I need from this flat table and insert it into the final partitioned and bucketed table:
FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT
col1, col2, col3, to_date(my_date) AS date;
The bucketing was defined earlier when I created my final table:
CREATE TABLE final_table
(col1 TYPE1, col2 TYPE2, col3 TYPE3)
PARTITIONED BY (date DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;
And finally, I create the index on the same column I use for bucketing (is that even useful?):
CREATE INDEX final_table_index ON TABLE final_table (col2) AS 'COMPACT';
All of this is obviously really slow, so how would I go about optimizing the loading process?
Thank you
Whenever I had a similar requirement, I used almost the same approach as you, as I couldn't find a more efficient alternative.
However, to make the dynamic partitioning process a bit faster, I tried setting a few configuration parameters:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions = 2000;
set hive.exec.max.dynamic.partitions.pernode = 10000;
I am sure you must be using the first two, and the last two you can set depending on your data size.
You can check out this Configuration Properties page and decide for yourself which parameters might help make your process faster, e.g. increasing the number of reducers used.
I cannot guarantee that this approach will save you time, but you will definitely make the most of your cluster setup.
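One extra setting may be worth checking for Hive 0.12 (it is not in the list above): if the 64 buckets declared on final_table should actually be populated on write, bucketing typically has to be enforced as well. A sketch reusing the question's statements; verify against your version:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

FROM my_flat_table
INSERT OVERWRITE TABLE final_table
PARTITION(date)
SELECT col1, col2, col3, to_date(my_date) AS date;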
