FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date)
USING 'python mapper.py'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt, map_output.uid)
USING 'python reducer.py'
AS date, count;
How does MapReduce work in this query, and what is the significance of "CLUSTER BY" in this query?
Each mapper will read its file splits, do something with them (for example, pre-aggregation such as distinct) and produce dt, uid grouped and sorted by dt, so different dt values will be put in different files that will be consumed by reducers in the next step.
Reducers will read the files prepared by the mappers, so records with the same dt will be read by the same reducer, because the records were distributed by dt and sorted on the mapper.
Each reducer will merge the partial results (files from mappers) and do the count aggregation. If several dt values ended up in the same file, the records are sorted, which reduces the amount of work to be done on the reducer.
cluster by dt = distribute by dt sort by dt
Without CLUSTER BY, two reducers may receive the same dt. That would make it impossible to compute the count correctly, because reducers do not know about each other and do not share data between them: the same dt would be counted partially on different reducers, and the final result would contain multiple records with the same dt.
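For illustration only, here is the same query with CLUSTER BY replaced by its explicit equivalent (the script names are the same placeholders as in the question):
FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date)
USING 'python mapper.py'
AS dt, uid
DISTRIBUTE BY dt
SORT BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt, map_output.uid)
USING 'python reducer.py'
AS date, count;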
Related
I am running a simple query like the one shown below (similar form):
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
There is nothing wrong with the query syntax-wise.
TABLE2 is empty and the total size of TABLE1 is 2 GB in HDFS (stored as Parquet with Snappy compression).
When I run the query in Hive, I see that 17 map tasks and 0 reducer tasks are launched.
What I notice is that most of the map tasks complete in a minute.
But one of the map tasks takes a long time. It's as if all the data in the table is going to that map task.
The whole query eventually fails with a container physical memory limit error.
Any reasons why this is happening or might happen?
It may happen because one partition is much bigger than the others.
Try to trigger a reducer task by adding DISTRIBUTE BY:
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
DISTRIBUTE BY COLUMN
Additionally you can add some other evenly distributed column with low cardinality to the DISTRIBUTE BY to increase parallelism:
DISTRIBUTE BY COLUMN, COLUMN2
If COLUMN2 has high cardinality, it will produce too many files in each partition; if its values are not evenly distributed (skewed), it will result in skew on the reducers. So it is important to use a low-cardinality, evenly distributed column, or a deterministic function with the same properties, such as substr(), etc., as sketched below.
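For example, a possible sketch (COLUMN2 here stands for some additional high-cardinality column in TABLE1, and the substr() length is an arbitrary illustration, not something from the question):
INSERT OVERWRITE table TABLE2
PARTITION(COLUMN)
SELECT *
FROM TABLE1
DISTRIBUTE BY COLUMN, substr(COLUMN2, 1, 2)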
Alternatively also try to increase mapper parallelism and check if it helps: https://stackoverflow.com/a/48487306/2700344
In a mapreduce job consisting of select count(*) from products where id = 2, where does the count(*) operation take place, is it in the mapper or the reducer?
It can be aggregation on both the mapper and the reducer, or reducer-only aggregation.
With map-side aggregation enabled:
hive.map.aggr=true;
data will be pre-aggregated (within the scope of the split being processed) on each mapper using an in-memory hash table. The reducer will then do the final aggregation of the partial results received from the mappers.
The mappers will output the pairs (#{token}, #{token_count}). The Hadoop framework again sorts these pairs and the reducers sum the values to produce the total counts for each token. In this case, the mappers will each output one row for each token every time the map is flushed instead of one row for each occurrence of each token. The tradeoff is that they need to keep a map of all tokens in memory.
If map-side aggregation is switched off (hive.map.aggr=false), the mappers will only filter rows and send them to the reducer, and the reducer will do all the aggregation; this can cause high network IO.
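A minimal sketch of how to compare the two modes using the query from the question (the map stage of the plan should contain a Group By Operator only when map-side aggregation is enabled):
set hive.map.aggr=true;
explain select count(*) from products where id = 2;
set hive.map.aggr=false;
explain select count(*) from products where id = 2;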
Read more details about Map-side Aggregation in Hive.
See also related https://stackoverflow.com/a/61772631/2700344
I was told that count(distinct ) may result in data skew because only one reducer is used.
I made a test using a table with 5 billion rows and 2 queries:
Query A:
select count(distinct columnA) from tableA
Query B:
select count(columnA) from
(select columnA from tableA group by columnA) a
Actually, query A takes about 1000-1500 seconds while query B takes 500-900 seconds. The result seems expected.
However, I realized that both queries use 370 mappers and 1 reducer, and they have almost the same cumulative CPU seconds. This suggests they have no genuine difference and the time difference may be caused by cluster load.
I am confused about why they both use only 1 reducer; I even tried mapreduce.job.reduces but it does not work. Btw, if they both use 1 reducer, why do people suggest not to use count(distinct), and does that mean data skew is not avoidable?
Both queries use the same number of mappers, which is expected, and a single final reducer, which is also expected because you need a single scalar count result. Multiple reducers on the same vertex run independently and isolated, and each would produce its own output; this is why the last stage has a single reducer. The difference is in the plan.
In the first query, the single reducer reads each mapper's output and does the distinct count calculation over all the data, so it processes too much data.
The second query uses an intermediate aggregation, and the final reducer receives partially aggregated data (distinct values already aggregated in the previous step). The final reducer needs to aggregate the partial results again to get the final result, which can be much less data than in the first case.
As of Hive 1.2.0 there is an optimization for count(distinct) and you do not need to rewrite the query. Set this property: hive.optimize.distinct.rewrite=true
There is also mapper aggregation (mappers can pre-aggregate data too and produce distinct values within the scope of their portion of the data, i.e. splits). Set this property to allow map-side aggregation: hive.map.aggr=true
Use the EXPLAIN command to check the difference in the execution plans, for example:
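A sketch using the property names mentioned above and tableA/columnA from the question:
set hive.optimize.distinct.rewrite=true;
set hive.map.aggr=true;
explain select count(distinct columnA) from tableA;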
See also this answer: https://stackoverflow.com/a/51492032/2700344
The workspace I am using is set up with Hive 1.1.0 and CDH 5.5.4. I run a query that produces a result with 22 partitions. The file saved in each partition directory is always a single one, and it can vary from 20 MB to 700 MB.
From what I understood, this is related to the number of reducers used while processing the query. Let's assume I want to have 5 files for each partition instead of 1; I use this command:
set mapreduce.job.reduces=5;
This makes the system use 5 reduce tasks in stage 1, but it automatically switches to 1 reducer at stage 2 (determined automatically at compile time). From what I read, this is because the compiler takes precedence over the configuration when choosing the number of reducers. It seems that some tasks cannot be parallelized and can only be done by one process or reducer task, so the system determines this automatically.
Code :
insert into table core.pae_ind1 partition (project,ut,year,month)
select ts,date_time, if(
-- m1
code_ac_dcu_m1_d1=0
and (min(case when code_ac_dcu_m1_d1=1 then ts end ) over (partition by ut
order by ts rows between 1 following and 1000 following)-ts) <= 15,
min(case when code_ac_dcu_m1_d1=1 then ts end ) over (partition by ut order
by ts rows between 1 following and 1000 following)-ts,NULL) as
t_open_dcu_m1_d1,
if( code_ac_dcu_m1_d1=2
and (min(case when code_ac_dcu_m1_d1=3 then ts end ) over (partition by ut
order by ts rows between 1 following and 1000 following)-ts) <= 15,
min(case when code_ac_dcu_m1_d1=3 then ts end ) over (partition by ut order
by ts rows between 1 following and 1000 following)-ts, NULL) as
t_close_dcu_m1_d1,
project,ut,year,month
from core.pae_open_close
where ut='902'
order by ut,ts
This leads to huge files at the end. I would like to know if there is a way of splitting these result files into smaller ones (preferably limiting them by size).
As @DuduMarkovitz pointed out, your code contains an instruction to order the dataset globally. This will run on a single reducer. You are better off ordering when you later select from your table: even if your files are in order after such an insert and they are splittable, they will be read by many mappers, and the result will not be in order due to parallelism, so you will need to order again anyway. Just get rid of the order by ut,ts in the insert and use these configuration settings to control the number of reducers:
set hive.exec.reducers.bytes.per.reducer=67108864;
set hive.exec.reducers.max = 2000; --default 1009
The number of reducers is determined according to:
mapred.reduce.tasks - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what the number of reducers should be.
hive.exec.reducers.bytes.per.reducer - The default in Hive 0.14.0 and earlier is 1 GB.
Also hive.exec.reducers.max - Maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
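As a rough illustration (the 10 GB figure is only an assumed example, not a number from the question): with hive.exec.reducers.bytes.per.reducer=67108864 (64 MB) as set above, roughly 10 GB of data entering the reduce stage would be planned as about 10 GB / 64 MB ≈ 160 reducers, still well below the hive.exec.reducers.max cap of 2000.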
Each reducer will create one file per partition it writes to (not bigger than hive.exec.reducers.bytes.per.reducer). It's possible that one reducer will receive data for many partitions and, as a result, will create many small files in each partition; this is because in the shuffle phase each partition's data is distributed between many reducers.
If you do not want each reducer to write into every (or too many) partitions, then distribute by the partition key (instead of ordering), as in the sketch below. In this case the number of files in a partition will be more like partition_size/hive.exec.reducers.bytes.per.reducer.
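A sketch of how the insert from the question might look with the global ordering removed and the partition-key distribution added (the select list is abbreviated here; it stays the same as in the query above):
insert into table core.pae_ind1 partition (project,ut,year,month)
select ts, date_time,
-- ... same t_open_dcu_m1_d1 and t_close_dcu_m1_d1 expressions as in the original query ...
project, ut, year, month
from core.pae_open_close
where ut='902'
distribute by project, ut, year, month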
I am running a cross product operation and storing the results in a table. The number of rows in table1 and table2 is ~300K and ~15K respectively. The query is like
create table table3
as
select a.var1*b.var1 + ...... + a.var_n*b.var_n as score
from
table1 a, table2 b
I observed that the process runs fastest with 2000 to 3000 mappers, compared to a much higher number of mappers allocated (5000).
My questions are:
Does increasing the number of mappers really speed up the process?
Is there any way to figure out the optimal number of mappers for a process?