Transfer big table from one Hive database to another - performance

I would like to transfer one big (over 150 mln records and 700 columns) table from one Hive database to another, that includes a few transformations like using one cast on a date column, substr on a string column and one simple case statement.
So, something like this:
-- initial settings
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.intermediate=true;
SET hive.exec.parallel=true;
SET parquet.compression=SNAPPY;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.merge.size.per.task=1000000000;
SET hive.merge.smallfiles.avgsize=1000000000;
INSERT INTO databaseA.tableName PARTITION(parition_col)
CASE WHEN a='Something' THEN 'SOMETHING'
WHEN a is null THEN 'Missing'
ELSE a END AS a,
column1,
column2,
...
cast(to_date(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) AS string) AS
run_date,
substr(some_string, 1, 3)
FROM databaseB.tableName;
The problem is that this query is going to take a lot of time (1 mln rows per hour). Maybe anybody knows how to speed it up?
I'm using map reduce engine for this task.
Thanks!

As all the data in the Hive tables are files on HDFS why don't you move/copy the files directly into the new table's HDFS location.
Example:
Assuming the table you want to move is already present in db1 as table_to_cpy;
create database db2;
create table db2.table_to_cpy like db1.table_to_cpy;
desc formatted db1.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db1.db/table_to_cpy
desc formatted db2.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db2.db/table_to_cpy
hadoop fs -cp /user/hive/warehouse/db1.db/table_to_cpy/* /user/hive/warehouse/db2.db/table_to_cpy/.

Few suggestions on how to speed-up your query:
Avoid unix_timestamp() if possible. This function is non-deterministic and prevents proper optimization of queries, it will be executed in each mapper or reducer and may return different values. Use instead
current_date() AS run_date
See also this answer for more details: https://stackoverflow.com/a/41140298/2700344
Tune mappers and reducers parallelism. If your process ending up with one big file (20 GB) instead of a few smaller then obviously there is not enough parallelism.
For mappers, play with these settings:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB
Decrease hive.exec.reducers.bytes.per.reducer to increase the number of reducers.
Play with these settings. Success criteria is more mappers/reducers and your map and reduce stages are running faster.
See this answer for details: https://stackoverflow.com/a/42842117/2700344
Try to add distribute by parition_col It will distribute data between reducers according to partition keys and as a result each reducer will create less partitions and consume less memory. Also it helps to avoid too many small output files. This setting should be used with hive.exec.reducers.bytes.per.reducer to avoid problem with uneven distribution between reducers and to avoid too big output files.

Related

hive select query poor performance

I have a hive table which is getting inserted few 1000s of record every hour. But when I execute select * from <table>, it is taking so much time to execute. What is the reason behind this?
Hive is not fast to begin with... Not sure what you're expecting, but it will not be on the order of milliseconds.
If you want performance improvements, use Tez or Spark rather than MapReduce execution, also use Hive 2 w/ LLAP, and land the data in ORC or Parquet format.
If you aren't able to do the above, at least place data into hourly partitions. Then actually query against the partition rather than scanning all the rows/columns because Hive does partition pruning.
Also, HDFS doesn't like files smaller than the hdfs block size (128 MB). Anything smaller means wasted time in map tasks
I agree with #cricket_007 of using execution engine tez/spark.There are some customization you can do from your end to achieve performance in hive:
Use of vectorization which executes in batches of 1024 rows at once
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Use of CBO
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
It is best practice to partition your data to speed up the queries. Partitioning will make hive run the query on the subset of the data instead of the entire dataset. Creating partitions may be done as follows:
The folder structure should look something like this:
path/to/directory/partition=partition_name
Then on the table itself (assuming it's on an external table) you're create table statement should be something like:
CREATE EXTERNAL TABLE table_name (
...
fields
...
)
PARTITIONED BY (partition)
LOCATION '/path/to/directory'
You can then query the table and treat the partition as another column.
If you look at the Hive design and architecture you will see that a typical query will have some overhead. A query will be translated into code for distributed execution, send over to the cluster backend, executed there and then results are stored and collected for displaying. This will add latency to every of your queries even if the input data and the final result set are small.

How to reduce number of mappers, when I am running hive query?

I am using hive ,
I have 24 json files with total size of 300MB (in one folder), so I have created one external table(i.e table1) and I loaded the data(i.e 24 files ) Into external table.
When I am running select query on top of that external table(i.e table1), I observed 3 mappers and 1 reducer is running.
After that I have created one more external table(i.e table2).
I have compressed the my input files (folder which contains 24 files ).
Example : BZIP2
So it compress the data but 24 files created with extension “.BZiP2”
(i.e..file1.bzp2,…..file24.bzp2).
After that , I have load the my compressed files into my external table .
Now, when I am running select query , it is taking 24 mappers and 1 reducer. And observed CPU time is taking more time when compared to uncompressed data(i.e files) .
How can I reduce number of mappers, if data is in compressed format(i.e table2 select query )?
How can I reduce CPU time , if data is in compressed format(i.e table2 select query )? How CPU time will affect performance?
The number of mappers can be less than the number of files only if files are on the same data node. If files are located on different datanodes, the number of mappers will never be less than the number of files. Concatenate all /some files and put them into your table location. use cat command for concatenating non-compressed files. You got 24 mappers because you have 24 files.Parameters mapreduce.input.fileinputformat.split.minsize / maxsize are for splitting bigger files.
If file size of 200000 bytes, setting the value of
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;
will trigger 200000/100000 = 2 mappers for the map reduce job
setting the value of
set mapreduce.input.fileinputformat.split.maxsize=50000;
set mapreduce.input.fileinputformat.split.minsize=50000;
will trigger 200000/50000 = 4 mappers for the the same job.
Read:
splittable-gzip
set-mappers-in-pig-hive-and-mapreduce
how-to-control-the-number-of-mappers-required-for-a-hive-query
In order to manually set the number of mappers in a Hive query when TEZ is the execution engine the configuration tez.grouping.split-count can be used by either:
Setting it when logged into the HIVE CLI : set tez.grouping.split-count=4 will create 4 mappers
An entry in the hive-site.xml can be added via Ambari. If set via hive-site.xml HIVE will need to be restarted.

hive merge properties not working for small files

I am trying to insert data into dynamic partitioned table which is creating lots of small files, i have set hive properties as below but i still see small files in partitioned folder, the size per task nor the avgfile size seems to be working for me as the files in partitioned folder are above the size per task i gave.
Any help will be greatly appreciated
hive.merge.mapfiles=true;
hive merge mapredfiles = true
hive.merge.size.per.task=10000;
hive.merge.smallfiles.avgsize=100;
Your example shows you setting the average size to 100 bytes which would create a lot of small files and is most likely being ignored because the files are already larger than that. Try increasing this value to an average of 128MB(134217728) which should on average increase the size of the files being merged after your job is complete.
set hive.merge.smallfiles.avgsize = 134217728;
This can happen when you execute multiple inserts into a single Hive table. 1 single insert can result in one or more files under the HDFS location.
I have managed this situation by executing below command - this will compact the table and will merge all files in one (or bigger ones)
There's one restriction though, you can't have indexes in your hive tables to execute the merge command.
I have also tested from Spark SQL over ORC files - (1.5.2) and it works fine.
ALTER TABLE schema.table PARTITION (month = '01') CONCATENATE
Hope it helps
Working with Small files in hive is a common problem and it can also be resolved by using CombineHiveInputFormat for input format. Also use ORC files by deafault:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
This will help to run hive job faster for given small files in hive.

Set ORC file name

I'm currently implementing ETL (Talend) of monitoring data to HDFS, and Hive table.
I am now facing concerns about duplicates. More in details, if we need to run one ETL Job 2 times with the same input, we will end up with duplicates in our Hive table.
The solution to that in RDMS would have been to store the input file name and to "DELETE WHERE file name=..." before sending the data. But Hive is not a RDBMS, and does not support deletes.
I would like to have your advice on how to handle this. I envisage two solutions :
Actually, the ETL is putting CSV files to the HDFS, which are used to feed an ORC table with a "INSERT INTO TABLE ... SELECT ..." The problem is that, with this operation, I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of this created ORC file ? If yes, I would be able to search the data by it's file name and delete it before launching the ETL.
I'm not used to Hive's ACID capability (feature on Hive 0.14+). Would you recommend to enable ACID with Hive ? Will I be able to "DELETE WHERE" with it ?
Feel free to propose should you have any other solution to that.
Bests,
Orlando
If the data volume in target table is not too large, I would advise
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
(SELECT 1
FROM trg x
WHERE x.key =src.key
AND <<additional filter on target to reduce data volume>>
)
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys in target table into a Java HashMap, and filtering source rows on-the-fly. As long as the HashMap can fit in the RAM available for Mappers heap size (check your default conf files, increase with a set command in Hive script if necessary) the performance will be sub-optimal, but you can be pretty sure that you will not have any duplicate.
And in your actual use case you don't have to check each key but only a "batch ID", more precisely the original file name; the way I've done it in my previous job was
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
(SELECT DISTINCT 1
FROM trg x
WHERE x.INPUT__FILE__NAME =src.original_file_name
AND <<additional filter on target to reduce data volume>>
)
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matter -- so that the overhead would stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would automatically do it at execution time, but Hive does not (not yet) so you have to force it. Note also the "1" is just a dummy value required because of "SELECT" semantics; again, a mature DBMS would allow a dummy "null" but some versions of Hive would crash (e.g. with Tez in V0.14) so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering myself. I found a solution :
I partitionned my table with (date,input_file_name) (note, I can get the input_file_name with SELECT INPUT__FILE__NAME in Hive.
Once I did this, before running the ETL, I can send to Hive an ALTER TABLE DROP IF EXISTS PARTITION (file_name=...) so that the folder containing the input data is deleted if this INPUT_FILE has already been sent to the ORC table.
Thank you everyone for your help.
Cheers,
Orlando

How Load distributed data in Hive works?

My target is to perform a SELECT query using Hive
When I have a small data on a single machine (namenode), I start by:
1-Creating a table that contains this data: create table table1 (int col1, string col2)
2-Loading the data from a file path: load data local inpath 'path' into table table1;
3-Perform my SELECT query: select * from table1 where col1>0
I have huge data, of 10 millions rows that doesn't fit into a single machine. Lets assume Hadoop divided my data into for example 10 datanodes and each datanode contains 1 million row.
Retrieving the data to a single computer is impossible due to its huge size or would take alot of time in case it is possible.
Will Hive create a table at each datanode and perform the SELECT query
or will Hive move all the data a one location (datanode) and create one table? (which is inefficient)
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map side job. So the data block is processed locally at every node. There is no need to collect data centrally.

Resources