How to reduce the number of mappers when running a Hive query? - hadoop

I am using Hive.
I have 24 JSON files with a total size of 300 MB (in one folder), so I created one external table (table1) and loaded the data (the 24 files) into it.
When I run a select query on that external table (table1), I observe 3 mappers and 1 reducer running.
After that I created another external table (table2).
I compressed my input files (the folder containing the 24 files), for example with BZIP2.
The data is compressed, but there are still 24 files, now with the bzip2 extension (file1.bz2, ..., file24.bz2).
After that, I loaded the compressed files into the second external table.
Now, when I run a select query, it uses 24 mappers and 1 reducer, and the CPU time is higher than with the uncompressed files.
How can I reduce the number of mappers when the data is compressed (the select query on table2)?
How can I reduce the CPU time when the data is compressed (the select query on table2)? How does CPU time affect performance?

The number of mappers can be less than the number of files only if the files are on the same data node; if the files are located on different datanodes, the number of mappers will never be less than the number of files. Concatenate all or some of the files and put them into your table location (use the cat command for concatenating non-compressed files). You got 24 mappers because you have 24 files. The parameters mapreduce.input.fileinputformat.split.minsize / maxsize are for splitting bigger files.

If a file is 200000 bytes, setting
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;
will trigger 200000/100000 = 2 mappers for the MapReduce job, and setting
set mapreduce.input.fileinputformat.split.maxsize=50000;
set mapreduce.input.fileinputformat.split.minsize=50000;
will trigger 200000/50000 = 4 mappers for the same job.
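As an alternative to concatenating the files outside Hive, here is a minimal sketch of compacting them from inside Hive by rewriting the table into itself (table2 is the compressed table from the question; the merge settings and sizes are assumptions to tune, not fixed recommendations):
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=268435456;   -- aim for roughly 256 MB per merged output file
-- rewriting the table into itself lets the post-job merge step compact the 24 small files into fewer, larger ones
INSERT OVERWRITE TABLE table2 SELECT * FROM table2;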
Read:
splittable-gzip
set-mappers-in-pig-hive-and-mapreduce
how-to-control-the-number-of-mappers-required-for-a-hive-query

In order to manually set the number of mappers for a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used by either:
Setting it when logged into the Hive CLI: set tez.grouping.split-count=4 will create 4 mappers.
Adding an entry to hive-site.xml via Ambari. If set via hive-site.xml, Hive will need to be restarted.
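A minimal sketch of the CLI route (table2 is the table from the question above; 4 is just an example split count):
set hive.execution.engine=tez;
set tez.grouping.split-count=4;   -- ask Tez to group the input into 4 splits, i.e. 4 mappers
select count(*) from table2;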

Related

Hive job taking too long to read data and insert in partitioned sorted bucketed table

We have a job that reads from a Hive table with around 3 billion rows and inserts into a sorted, bucketed table.
Files in both the source and destination tables are in Parquet format.
This job is taking too long to finish; we had to stop it after 3 days.
We recently migrated to a new cluster. The old cluster was Cloudera 5.12 and the new one is Cloudera 6.3.1.
This job used to run fine and finish within 6 hours on the 5.12 cluster, but it is taking far too long on the new cluster.
We have tried the following things to solve this, without any results:
Removed the cap on reducers. Removed set hive.exec.reducers.max=200;
set mapreduce.job.running.reduce.limit=100;
Merged files at the source to make sure we are not reading small files. File size in the source table was increased to 1G each.
Reduce the no. of rows in the source table to reduce the data mappers are reading.
Reduce the max split size to 64MB to increase the no. of mappers.
Insert in a new table.
Insert in a new table that is not sorted or bucketed.
The query we are trying to run:
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=900000;
set mapreduce.input.fileinputformat.split.maxsize=64000000;
set mapreduce.job.running.reduce.limit=100;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
INSERT OVERWRITE TABLE dbname.features_archive_new PARTITION (feature, ingestmonth)
Select mpn,mfr,partnum,source,ingestdate,max(value) as value,feature,ingestmonth
from dbname.features_archive_tmp
where feature = 'price'
and ingestmonth like '20%'
group by mpn,mfr,partnum,source,ingestdate,feature,ingestmonth;
We found out that Hive 2.x in Cloudera 6.3 uses vectorization, while Hive 1.x in the old Cloudera 5.12 does not.
So setting the property below fixed the issue for us. I have no explanation for this; vectorization should speed the query up, not slow it down.
set hive.vectorized.execution.enabled=false;
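A minimal sketch of scoping that workaround to this one statement rather than the whole session (the INSERT is the one shown above; whether your CDH build also supports EXPLAIN VECTORIZATION, added in Hive 2.3, is an assumption to verify):
set hive.vectorized.execution.enabled=false;  -- work around the regression for this statement only
-- ... run the INSERT OVERWRITE from above ...
set hive.vectorized.execution.enabled=true;   -- restore the default afterwards
-- optionally, EXPLAIN VECTORIZATION SUMMARY <query> shows whether the plan is actually vectorized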

Transfer big table from one Hive database to another

I would like to transfer one big table (over 150 million records and 700 columns) from one Hive database to another, with a few transformations along the way: a cast on a date column, a substr on a string column, and one simple case statement.
So, something like this:
-- initial settings
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.intermediate=true;
SET hive.exec.parallel=true;
SET parquet.compression=SNAPPY;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.merge.size.per.task=1000000000;
SET hive.merge.smallfiles.avgsize=1000000000;
INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT CASE WHEN a='Something' THEN 'SOMETHING'
            WHEN a is null THEN 'Missing'
            ELSE a END AS a,
column1,
column2,
...
cast(to_date(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) AS string) AS
run_date,
substr(some_string, 1, 3)
FROM databaseB.tableName;
The problem is that this query is going to take a lot of time (about 1 million rows per hour). Does anybody know how to speed it up?
I'm using the MapReduce engine for this task.
Thanks!
Since all the data in Hive tables is just files on HDFS, why don't you move/copy the files directly into the new table's HDFS location?
Example:
Assuming the table you want to move is already present in db1 as table_to_cpy;
create database db2;
create table db2.table_to_cpy like db1.table_to_cpy;
desc formatted db1.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db1.db/table_to_cpy
desc formatted db2.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db2.db/table_to_cpy
hadoop fs -cp /user/hive/warehouse/db1.db/table_to_cpy/* /user/hive/warehouse/db2.db/table_to_cpy/.
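If the table were partitioned (it is not in this example), the copied partition directories would also need to be registered in the metastore, for example:
MSCK REPAIR TABLE db2.table_to_cpy;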
A few suggestions on how to speed up your query:
Avoid unix_timestamp() if possible. This function is non-deterministic and prevents proper optimization of queries; it will be executed in each mapper or reducer and may return different values. Use instead
current_date() AS run_date
See also this answer for more details: https://stackoverflow.com/a/41140298/2700344
Tune mapper and reducer parallelism. If your process ends up with one big file (20 GB) instead of a few smaller ones, then obviously there is not enough parallelism.
For mappers, play with these settings:
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
Decrease hive.exec.reducers.bytes.per.reducer to increase the number of reducers.
Play with these settings. The success criterion is more mappers/reducers, with your map and reduce stages running faster.
See this answer for details: https://stackoverflow.com/a/42842117/2700344
Try adding distribute by partition_col. It will distribute data between reducers according to the partition keys, so each reducer will create fewer partitions and consume less memory. It also helps avoid too many small output files. This should be used together with hive.exec.reducers.bytes.per.reducer to avoid problems with uneven distribution between reducers and to avoid output files that are too big.
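Putting these suggestions together, a hedged sketch of how the statement from the question might be rewritten (the split and per-reducer sizes are illustrative assumptions to tune, not recommendations):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET mapreduce.input.fileinputformat.split.minsize=16777216;    -- 16 MB
SET mapreduce.input.fileinputformat.split.maxsize=1073741824;  -- 1 GB
SET hive.exec.reducers.bytes.per.reducer=67108864;             -- 64 MB per reducer => more reducers
INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT CASE WHEN a = 'Something' THEN 'SOMETHING'
            WHEN a IS NULL THEN 'Missing'
            ELSE a END AS a,
       column1,
       column2,
       -- ... remaining columns as in the original query ...
       cast(current_date() AS string) AS run_date,  -- deterministic replacement for the unix_timestamp() expression
       substr(some_string, 1, 3)
FROM databaseB.tableName
DISTRIBUTE BY partition_col;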

How does Hive read data even after it has been dropped from HDFS?

I have an external table in Hive pointing to an HDFS location. By mistake I ran the job that loads the data into HDFS twice.
Even after deleting the duplicate file from HDFS, Hive shows the data count doubled (i.e. including the count from the deleted duplicate data file).
select count(*) from tbl_name -- returns the doubled count
But,
select count(col_name) from tbl_name -- returns the actual count.
For the same table, when I tried from Impala after
INVALIDATE METADATA
I could see only the data count that is actually available in HDFS (no duplicates).
How can Hive report a doubled count even after the file was deleted from the physical location (HDFS)? Does it read from statistics?
Hive is using statistics for computing count(*). You deleted the files manually (not through Hive), which is why the stats are wrong.
The solution is either
to switch off statistics usage in such cases:
set hive.compute.query.using.stats=false;
or to analyze the table, as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;
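To see which statistics Hive is currently holding for the table (a quick check, using the table name from the question):
DESCRIBE FORMATTED tbl_name;
-- look at numRows and totalSize under Table Parameters; if they no longer match what is on HDFS, re-run ANALYZE TABLE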

hive merge properties not working for small files

I am trying to insert data into a dynamically partitioned table, which is creating lots of small files. I have set the Hive properties below, but I still see small files in the partition folders; neither the size per task nor the average file size seems to be working for me, as the files in the partition folders are above the size per task I set.
Any help will be greatly appreciated
hive.merge.mapfiles=true;
hive.merge.mapredfiles=true;
hive.merge.size.per.task=10000;
hive.merge.smallfiles.avgsize=100;
Your example sets the average size to 100 bytes, which would create a lot of small files and is most likely being ignored because the files are already larger than that. Try increasing this value to an average of 128 MB (134217728), which should on average increase the size of the files being merged after your job completes.
set hive.merge.smallfiles.avgsize = 134217728;
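For reference, a fuller set of the merge-related properties, with illustrative values (the sizes are assumptions in bytes; hive.merge.tezfiles only matters if the job runs on Tez):
set hive.merge.mapfiles=true;                 -- merge the output files of map-only jobs
set hive.merge.mapredfiles=true;              -- merge the output files of map-reduce jobs
set hive.merge.tezfiles=true;                 -- Tez equivalent of the line above
set hive.merge.size.per.task=268435456;       -- target size of each merged file (~256 MB)
set hive.merge.smallfiles.avgsize=134217728;  -- merge when the average output file is smaller than ~128 MB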
This can happen when you execute multiple inserts into a single Hive table; a single insert can result in one or more files under the HDFS location.
I have managed this situation by executing the command below. It will compact the table and merge all the files into one (or into bigger ones).
There is one restriction, though: you can't have indexes on your Hive tables when executing the merge command.
I have also tested this from Spark SQL over ORC files (1.5.2) and it works fine.
ALTER TABLE schema.table PARTITION (month = '01') CONCATENATE;
Hope it helps.
Working with small files in Hive is a common problem, and it can also be resolved by using CombineHiveInputFormat as the input format. Also use ORC files by default:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
This will help Hive jobs run faster on lots of small files.
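A sketch of pairing that input format with a split-size ceiling, so that many small files get grouped into fewer, larger splits (the 256 MB value is an assumption to tune for your cluster):
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=268435456;  -- combine small files into splits of up to ~256 MB each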

How does loading distributed data in Hive work?

My goal is to perform a SELECT query using Hive.
When I have a small amount of data on a single machine (the namenode), I start by:
1- Creating a table that contains this data: create table table1 (col1 int, col2 string)
2- Loading the data from a file path: load data local inpath 'path' into table table1;
3- Performing my SELECT query: select * from table1 where col1 > 0
Now I have huge data, 10 million rows, that doesn't fit on a single machine. Let's assume Hadoop divided my data across, say, 10 datanodes, with each datanode holding 1 million rows.
Retrieving the data to a single computer is impossible because of its size, or would take a lot of time if it were possible.
Will Hive create a table at each datanode and perform the SELECT query there,
or will Hive move all the data to one location (one datanode) and create one table there? (which would be inefficient)
OK, so I will walk through what happens when you load data into Hive.
The 10-million-line file will be cut into 64 MB/128 MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes in the cluster.
These blocks will be replicated several times; the default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks, there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster, Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
That said, 10 million rows isn't that large. Unless the table has 100 columns you should be fine in any case. However, if you do a select * in your query, just remember that all that data needs to be sent to the machine that ran the query. That could take a long time, depending on the file size.
I hope I covered your question. If not, please let me know and I'll try to help further.
The query
select * from table1 where col1 > 0
is just a map-side job, so each data block is processed locally on its node; there is no need to collect the data centrally.
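To see this, you can inspect the plan (a sketch; depending on hive.fetch.task.conversion, Hive may not even launch a MapReduce job for such a simple filter and will use a fetch task instead):
EXPLAIN select * from table1 where col1 > 0;
-- the plan shows only map-side Filter/Select operators (or a plain fetch task); there is no reduce stage or data shuffle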
