Distribute By Clause in Hive - Hadoop

I have a table with a huge amount of data, around 100 TB.
When I query the table I use the DISTRIBUTE BY clause on a particular column (say X).
The table contains 200 distinct values of X.
So when I query the table with DISTRIBUTE BY on X, the maximum number of reducers should be 200. But I see it is utilizing the maximum number of reducers, i.e. 999.
Let me explain with an example.
Suppose the description of emp_table is as follows, with 3 columns:
1. emp_name
2. emp_ID
3. Group_ID
and Group_ID has **200 distinct** values
Now I want to query the table:
select * from emp_table distribute by Group_ID;
This query should use 200 reducers as per the DISTRIBUTE BY clause, but I see 999 reducers getting utilized.
I am doing this as part of an optimization effort. So how can I make sure it utilizes only 200 reducers?

The number of reducers in Hive is decided by either of two properties.
hive.exec.reducers.bytes.per.reducer - The default value is 1 GB; this makes Hive create one reducer for each 1 GB of the input table's size.
mapred.reduce.tasks - Takes an integer value, and that many reducers will be prepared for the job.
The DISTRIBUTE BY clause has no role in deciding the number of reducers; its only job is to distribute/partition the key values from the mappers to the prepared reducers based on the column given in the clause.
Consider setting mapred.reduce.tasks to 200, and DISTRIBUTE BY will take care of partitioning the key values evenly across the 200 reducers.
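A minimal sketch of that approach, using the table and column names from the question (the reducer count of 200 is simply the number of distinct Group_ID values mentioned above):

-- cap the number of reduce tasks at the number of distinct Group_ID values
set mapred.reduce.tasks=200;

-- DISTRIBUTE BY only decides which reducer each row goes to;
-- the reducer count itself comes from the property set above
select * from emp_table distribute by Group_ID;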

The number of reducers in Hive depends on the size of your input file. But if the output of the mappers contains only 200 groups, then I guess most of the reduce tasks will receive nothing.
If you really want to control the number of reducers, setting mapred.reduce.tasks will help.

Related

What will happen if the Hive number of reducers is different from the number of keys?

In Hive I often do queries like:
select columnA, sum(columnB) from ... group by ...
I read some MapReduce examples, and it seems one reducer can only produce one key. It seems the number of reducers completely depends on the number of keys in columnA.
Therefore, why can Hive set the number of reducers manually?
If there are 10 different values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?
If there are 10 different values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?
Normally you should not set the exact number of reducers manually. Use bytes.per.reducer instead:
--The number of reduce tasks determined at compile time
--Default size is 1G, so if the input size estimated is 10G then 10 reducers will be used
set hive.exec.reducers.bytes.per.reducer=67108864;
If you want to limit cluster usage by job reducers, you can set this property: hive.exec.reducers.max
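For example (the value 100 below is only an illustration, not a recommendation; pick a cap that fits your cluster):

-- upper bound on the number of reducers Hive will allocate for a single job
set hive.exec.reducers.max=100;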
If you are running on Tez, at execution time Hive can dynamically set the number of reducers if this property is set:
set hive.tez.auto.reducer.parallelism = true;
In this case, the number of reducers initially started may be bigger because it was estimated based on data size; at runtime, the extra reducers can be removed.
One reducer can process many keys; it depends on the data size and on the bytes.per.reducer and reducer-limit configuration settings. In a query like yours, the same keys will go to the same reducer, because each reducer container runs in isolation and all rows having a particular key need to be passed to a single reducer to be able to calculate the count for that key.
Extra reducers can be forced (mapreduce.job.reduces=N) or started automatically based on a wrong estimation (because of stale stats); if not removed at runtime, they will do nothing and finish quickly because there is nothing to process. But such reducers will still be scheduled and containers allocated, so it is better not to force extra reducers and to keep stats fresh for better estimation.
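A small sketch of keeping statistics fresh so the reducer estimate stays reasonable (my_table is a placeholder name, not from the question):

-- refresh basic table/partition statistics
analyze table my_table compute statistics;
-- refresh column-level statistics used by the optimizer
analyze table my_table compute statistics for columns;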

Hive number of reducers in group by and count(distinct)

I was told that count(distinct) may result in data skew because only one reducer is used.
I made a test using a table with 5 billion rows, with 2 queries:
Query A:
select count(distinct columnA) from tableA
Query B:
select count(columnA) from
(select columnA from tableA group by columnA) a
Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seems as expected.
However, I realized that both queries use 370 mappers and 1 reducer, and they have almost the same cumulative CPU seconds. This means there is no genuine difference between them, and the time difference may be caused by cluster load.
I am confused why they both use just 1 reducer; I even tried mapreduce.job.reduces but it does not work. Btw, if they both use 1 reducer, why do people suggest not to use count(distinct), since it seems data skew is not avoidable either way?
Both queries are using the same number of mappers, which is expected, and a single final reducer, which is also expected because you need a single scalar count result. Multiple reducers on the same vertex run independently and in isolation, and each would produce its own output; this is why the last stage has a single reducer. The difference is in the plan.
In the first query, the single reducer reads each mapper's output and does the distinct-count calculation on all the data, so it processes too much data.
The second query uses intermediate aggregation, and the final reducer receives partially aggregated data (distinct values already aggregated in the previous step). The final reducer needs to aggregate the partial results again to get the final result, which can be much less data than in the first case.
As of Hive 1.2.0 there is an optimization for count(distinct) and you do not need to rewrite the query. Set this property: hive.optimize.distinct.rewrite=true
There is also mapper aggregation (a mapper can pre-aggregate data and produce distinct values within the scope of its own portion of the data - its split). Set this property to allow map-side aggregation: hive.map.aggr=true
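For illustration, a hedged sketch of enabling both of the properties mentioned above and then running the original query (availability of the rewrite depends on your Hive version; columnA and tableA are the names from the question):

-- rewrite count(distinct ...) into a group-by based plan (Hive 1.2.0+)
set hive.optimize.distinct.rewrite=true;
-- let mappers pre-aggregate and emit distinct values per split
set hive.map.aggr=true;

select count(distinct columnA) from tableA;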
Use the EXPLAIN command to check the difference in the execution plans.
See also this answer: https://stackoverflow.com/a/51492032/2700344

The difference between `hive.exec.max.dynamic.partitions` and `hive.exec.max.dynamic.partitions.pernode`

I am looking for some documentation to understand the difference between hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
When do we need to set these parameters and what is the use of these?
hive.exec.max.dynamic.partitions=500
hive.exec.max.dynamic.partitions.pernode=500
Hive's dynamic partitioning allows users to create partitions without having to provide the partition values. The partition values are determined from the corresponding input column values during query execution.
The number of partitions created will be proportional to the number of distinct column values. This in turn puts a burden on the HDFS NameNode and the Hive Metastore.
These properties
hive.exec.max.dynamic.partitions
hive.exec.max.dynamic.partitions.pernode
are meant to keep this under control by limiting the number of partitions that can be created by dynamic partitioning.
hive.exec.max.dynamic.partitions: The maximum number of partitions allowed to be created in total by one dynamic partition insert.
Dynamic partitions are created only through INSERT. The INSERT query may trigger either a map-only job or a MapReduce job, depending on the DML.
hive.exec.max.dynamic.partitions.pernode: The maximum number of partitions allowed to be created by each mapper or reducer node taking part in the insert job.
Setting 500 as the value for both properties, as mentioned in the question, will effectively let only one mapper/reducer run to the limit, failing the others.
As a best practice,
hive.exec.max.dynamic.partitions ~= n * hive.exec.max.dynamic.partitions.pernode
where n is the number of mapper(s) and (or) reducer(s) required for the job.
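For illustration, a hedged sketch of a dynamic-partition insert with both limits set (the table, column, and partition names are made up; the values follow the n * pernode rule of thumb above, assuming roughly 4 mappers/reducers):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- total limit across the whole insert
set hive.exec.max.dynamic.partitions=2000;
-- per mapper/reducer limit; 2000 ~= 4 * 500
set hive.exec.max.dynamic.partitions.pernode=500;

-- the dynamic partition column (sale_date) must come last in the select list
insert overwrite table sales partition (sale_date)
select order_id, amount, sale_date from staging_sales;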

Relationship between a Hive query and the number of mappers/reducers provided by Hadoop?

I am executing a query in HIVE shell as
SELECT tradeId, bookid, foid from trades where bookid='"ABCDEFG"'
The table "trades" has index on bookid. When the query runs, it shows the details of Mappers and Reducers as follows :-
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for Stage-1: number of mappers: 48; number of reducers: 0
Time taken: **606.183 seconds**, Fetched: **18 row(s)**
As you can see, it took an enormous amount of time to fetch just 18 rows. My question is: what am I doing wrong here? Should the reducer count be non-zero? Will it help if I set it using
set mapred.reduce.tasks = some_number
Shouldn't the indexes help retrieve the data faster?
When you are doing a simple SELECT, all the filtering and column selection are done by the mappers themselves. There is no purpose for a reducer task here, hence the number of reducers is zero - which is fine. You probably have around 48 * block size amount of data in your table, so it spawned 48 mappers. How many map slots per DataNode do you have, and how many of them were free when you fired your query? Chances are all 48 of them are not running in parallel. Though it returned only 18 rows, it read the full table. Is your table bucketed and clustered on the bookid column? In that case you may use the TABLESAMPLE clause to make it read only the buckets that contain your ABCDEFG value.
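If the table were bucketed on bookid, a hedged sketch of that TABLESAMPLE approach might look like this (the bucketed DDL, the bucket count of 64, and the bucket number 7 are assumptions for illustration; the bucket number would have to be the one that 'ABCDEFG' actually hashes into):

-- hypothetical bucketed version of the table, clustered on the lookup column
create table trades_bucketed (tradeId string, bookid string, foid string)
clustered by (bookid) into 64 buckets
stored as orc;

-- read a single bucket instead of scanning the whole table;
-- bucket 7 stands in for hash('ABCDEFG') % 64 + 1
select tradeId, bookid, foid
from trades_bucketed tablesample (bucket 7 out of 64 on bookid)
where bookid = 'ABCDEFG';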

What is a ballpark figure for executing the following Hive query: SELECT COUNT(*) FROM TABLE; for a table with 8bn rows/40 columns/400Gb?

What is a ballpark figure for executing the following Hive query: SELECT COUNT(*) FROM TABLE; for the following table:
number of rows: ~8bn
number of columns: 40, various sizes of int, doubles and strings
size on HDFS: ~400Gb
I want to check any ballpark figures against the real figure to see if the system is configured correctly.
Apologies if I've missed something crucial, I'm very new to Hive and Hadoop.
Also, will the execution time scale linearly with the number of rows, provided the number of machines is scaled up as well?
It would be impossible to provide a ballpark figure.
However we can list out the influencing factors:
Number of map tasks configured in the cluster
Block size (determines the number of mappers that will be used)
Execution time will again depend on these factors.
E.g. if I have 100 mappers available and my block size is 128 MB, I would need 3200 mappers (400*1024/128). So assuming all mappers are assigned to your job, it would take 32 waves of 100 mappers at a time (again assuming all mappers start and end at the same time, which is a stupid assumption :)). So the time taken would be 32 * time per mapper.
I would have left this as a comment but I am not allowed to do so.

Resources