Hive count distinct UDAF2 - hadoop

I've read a question on SO:
I ran into a Hive query calculating a count distinct without grouping,
which runs very slowly. So I was wondering how this functionality is
implemented in Hive; is there a UDAFCountDistinct for this?
And the answer:
To achieve count distinct, Hive relies on the GenericUDAFCount. There
is no UDAF specifically implemented for count distinct. Those
'distinct by' keys will be part of the partitioning key of the
MapReduce shuffle phase, and this way they are 'distincted' quite
naturally.
As for your case, it runs slowly because there is only one reducer
having to process massive amounts of detailed data. You can use a group by
before counting to get more parallelism:
select count(1) from (select id from tbl group by id) tmp;
However, I don't understand a few things:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
Why will there be only one reducer in this case?
Why does the weird inner query cause more partitions?

I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF GenericUDAFCount handles both count and count distinct. How does it achieve count distinct?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
Distinct-by keys are the expressions (columns) inside count(distinct ...); in this case, brand.
Partition-by keys are the fields used to compute a hash code for a record at the map phase. This hash value then decides which partition a record goes to. Usually the partition-by keys come from the group by part of the SQL query; in this case, it's category.
The actual output key of a mapper is the composition of the partition-by key and the distinct-by key. For the above case, a mapper's output key may look like (drink, Pepsi).
This design makes all rows with the same group-by key fall into the same reducer.
The value part of mappers' output doesn’t matter here.
Later, during the shuffle phase, records are sorted according to the sort-by key, which is the same as the mapper output key.
Then at the reduce phase, within each individual reducer, all records are sorted first by category and then by brand. This makes it easy to get the result of the count(distinct) aggregation: each distinct (category, brand) pair is guaranteed to be processed only once, so the aggregation turns into a count(*) within each group. The input key of a call to the reduce method is one of these distinct pairs. The reducer keeps track of the composite key; whenever the category part changes, we know a new group has arrived and we start counting that group from 1.
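To make the data flow concrete, here is a small Python simulation (not Hive's actual code; the rows and reducer count are made up) of how the composite output key, partitioning on the group-by key only, and sorting on the full key turn count(distinct brand) into a simple run count at the reducers.

from collections import defaultdict

# Simulated map/shuffle/reduce for:
#   select category, count(distinct brand) from market group by category;
rows = [("drink", "Pepsi"), ("drink", "Coke"), ("drink", "Pepsi"),
        ("snack", "Lays"), ("snack", "Lays"), ("drink", "Fanta")]
NUM_REDUCERS = 2

# Map phase: output key = (partition-by key, distinct-by key); the value is irrelevant.
map_output = [((category, brand), None) for category, brand in rows]

# Shuffle: partition on the group-by key (category) only, so all rows of a
# category meet in one reducer; then sort each partition by the full composite key.
partitions = defaultdict(list)
for (category, brand), _ in map_output:
    partitions[hash(category) % NUM_REDUCERS].append((category, brand))
for keys in partitions.values():
    keys.sort()

# Reduce phase: keys arrive sorted by (category, brand), so every distinct pair
# forms one contiguous run; counting distinct brands becomes counting those runs.
for reducer_id, keys in partitions.items():
    counts, prev = defaultdict(int), None
    for key in keys:
        if key != prev:          # a new distinct (category, brand) pair
            counts[key[0]] += 1
            prev = key
    print(f"reducer {reducer_id}: {dict(counts)}")

Because the framework partitions only on category, every brand of a category ends up in the same sorted stream, which is exactly why the distinct-by key can simply ride along in the output key.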
Part 2:
Why will there be only one reducer in this case?
When calculating count distinct without group by like this:
select count(distinct brand) from market
There will be just one reducer taking all the work. Why? Because there is no partition-by key; or, put differently, every record gets the same hash code, so all records fall into the same reducer.
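A tiny Python sketch of this degenerate case (the brand list and reducer count are made up): with no group by there is nothing to partition on, so every record hashes to the same partition and therefore the same reducer.

# With no group by, the partition-by key is effectively a constant for every
# record, so hash(constant) % NUM_REDUCERS picks the same partition every time.
NUM_REDUCERS = 4
brands = ["Pepsi", "Coke", "Lays", "Fanta"]

partition_key = ""   # no group by -> nothing to partition on
targets = {hash(partition_key) % NUM_REDUCERS for _ in brands}
print(targets)       # a single partition id, e.g. {2}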
Part 3:
Why does the weird inner query cause more partitions?
The inner query's partition-by key is the group by key, id. If the id values are reasonably evenly distributed, the records are processed by many different reducers in parallel. After the inner query, it is safe to conclude that all the id values are distinct, so a simple count(1) is all that's needed.
But note that the outer query still runs with only one reducer. Why doesn't it suffer? Because count(1) needs no detailed values, map-side aggregation hugely cuts down the amount of data that single reducer has to process.
One more thing: this rewrite is not guaranteed to perform better, since it introduces an extra MapReduce stage.
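For intuition on why that final single reducer isn't a problem, here is a hedged Python sketch of map-side (partial) aggregation for count(1); the splits are invented for illustration. Each mapper pre-aggregates its own split down to a single number, so the lone reducer only adds a handful of partial counts instead of scanning every distinct id.

# Map-side aggregation sketch for: select count(1) from (select id from tbl group by id) tmp
# The inner group by has already produced distinct ids, spread across mapper splits.
splits = [range(0, 1000), range(1000, 2500), range(2500, 3000)]   # made-up mapper splits

partial_counts = [sum(1 for _ in split) for split in splits]   # one partial count per mapper
total = sum(partial_counts)                                    # all the single reducer does
print(partial_counts, total)   # [1000, 1500, 500] 3000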

Related

(Spark) Is there any possible way to optimize a join of two large RDDs when both of them are too large for memory (meaning we cannot use broadcast)?

As the title says.
Is there any possible way to optimize a join of two large RDDs when both of them are too large for memory? In this case I suppose we cannot use broadcast for a map-side join.
If I have to join these two RDDs, and both of them are too large to fit in memory:
country_rdd:
(id, country)
income_rdd:
(id, (income, month, year))
joined_rdd = income_rdd.join(country_rdd)
Is there any possible way to reduce the shuffling here? Or anything I can do to tune the join performance?
Besides, the joined_rdd will be further aggregated and reduced only by country and time; it is no longer keyed by id. E.g., my final result is the income per country per year. What's the best practice for doing that?
I considered doing some pre-partitioning, but it seems that if I only need to do the join once, that won't help much?
In the general case (no a priori knowledge of the key properties) it is not possible. The shuffle is an essential part of the join and cannot be avoided.
In specific cases you can reduce shuffling in two ways:
Design your own Partitioner that takes advantage of pre-existing data distribution. For example, if you know that the data is sorted by key, you can use that knowledge to limit the shuffle.
If you apply an inner join, and only a fraction of the keys occurs in both RDDs, you can:
Create a Bloom filter on each dataset. Let's call these leftFilter and rightFilter.
Filter each RDD with the opposite filter (leftRDD with rightFilter, rightRDD with leftFilter).
Join the filtered RDDs (see the sketch below).
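Here is a hedged PySpark sketch of that Bloom-filter approach, reusing the RDD shapes from the question. For brevity a plain Python set stands in for a real Bloom filter; in practice you would build an actual Bloom filter over the keys so that only a small bitset is collected and broadcast, not the full key set.

from pyspark import SparkContext

sc = SparkContext(appName="bloom-filtered-join-sketch")

# Tiny stand-ins for the (huge) RDDs from the question.
country_rdd = sc.parallelize([(1, "US"), (2, "DE"), (5, "FR")])
income_rdd = sc.parallelize([(1, (100, 1, 2020)), (3, (80, 2, 2020)), (5, (120, 3, 2021))])

# leftFilter / rightFilter stand-ins: a real Bloom filter would be built with an
# aggregate over the keys and would be far smaller than the full key set.
left_filter = sc.broadcast(set(country_rdd.keys().collect()))
right_filter = sc.broadcast(set(income_rdd.keys().collect()))

# Filter each RDD with the opposite filter, then join only the surviving keys.
filtered_country = country_rdd.filter(lambda kv: kv[0] in right_filter.value)
filtered_income = income_rdd.filter(lambda kv: kv[0] in left_filter.value)

joined_rdd = filtered_income.join(filtered_country)
print(joined_rdd.collect())   # only ids 1 and 5 are shuffled and joined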

When we should not use bucketing in hive?

When we should not use bucketing in hive? What is the bottleneck of this technique?
I guess you don't have to use bucketing when you can't benefit from it. As far as I know, the main benefits of bucketing are more efficient sampling and map-side joins (see below). So if your table is small, or you don't need fast sampling and map-side joins, just don't use it, because you will have to remember to bucket your data before insertion, either manually or by using set hive.enforce.bucketing = true;. There is no bottleneck; it's just one of the possible data layouts, which gives you an advantage in some situations.
Hive map-side join example (see more here):
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for
each mapper of A, only the required buckets are fetched. For the query
above, the mapper processing bucket 1 for A will only fetch bucket 1
of B. It is not the default behavior, and is governed by the following
parameter
set hive.optimize.bucketmapjoin = true
Update: considering data skew when bucketing.
The bucket number is calculated as hash_function(bucketing_column) mod num_buckets. If your bucketing column is of int type, then hash_int(i) == i (see more here). So if you have skewed values in that column, where for example one value appears much more often than the others, then many more rows will be placed in the corresponding bucket. You will end up with disproportionate buckets, and this harms query speed. Hive has built-in tools to overcome data skew (see Skewed Tables), but I don't think you should use a column with skewed data for bucketing in the first place.
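A quick Python illustration of that skew, with made-up values: since hash_int(i) == i for an int column, the bucket is just i mod num_buckets, so a heavily repeated value piles all of its rows into one bucket.

from collections import Counter

NUM_BUCKETS = 4
# Made-up bucketing column values: the id 7 is heavily skewed.
values = [7] * 1000 + list(range(100))

# For an int column, hash_int(i) == i, so the bucket is simply i mod num_buckets.
bucket_sizes = Counter(v % NUM_BUCKETS for v in values)
print(bucket_sizes)   # bucket 3 holds ~1025 rows, the other three ~25 each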
Bucketing is a method by which we distribute the data into files that would otherwise be unevenly distributed.
When to use bucketing: when we know that queries will use a column such as "customer_id", which is sequential or evenly distributed.
When not to use bucketing: when we know that most use cases of the table involve reading only a subset of the data.
For example: although we keep historical data, we only process the last two weeks of data to determine something. In this scenario we would partition by week number.
You should not prefer bucketing when the cardinality of the field is not very high; in that case partitioning is more beneficial.
Also, bucketing is typically done on a single field, whereas partitioning can be done on multiple fields with a hierarchy like (country, city, state).

ordering of list of values for each keys of reducer output

I am new to Hadoop and a little confused.
In a MapReduce job the reducer gets a list of values for each key. I want to know what the default ordering of the values for each key is. Is it the same order in which they were written out by the mapper? Can you change the ordering (e.g. ascending or descending) of the values for each key?
Is it the same order in which they were written out by the mapper? - Yes.
That is true for a single mapper. But if your job has more than one mapper, you may not see the same order across two runs with the same input, since different mappers may finish at different times.
Can you change the ordering (e.g. ascending or descending) of the values for each key? - Yes.
It is done using a technique called 'secondary sort' (you may Google for more reading on this).
In MapReduce, a few pluggable components control how map output is partitioned and ordered on its way to the reducers; using them to control the order of values is what is referred to as the secondary sort. Namely, two factors affect this:
Partitioner, which divides the map output among the reducers. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job.
Comparator, which determines the order in which keys are sorted; with a composite key that embeds the value, it effectively controls the order of values within each natural key.
The default partitioner is the org.apache.hadoop.mapred.lib.HashPartitioner class, which hashes a record’s key to determine which partition the record belongs in.
Comparators differ by data type. If you want to control the sort order, override compare(WritableComparable, WritableComparable) of the WritableComparator class. See the documentation here.
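The pieces fit together roughly like this, shown as a short Python simulation rather than real Hadoop code (the keys, values and reducer count are invented): the value is embedded in a composite key, the partitioner looks only at the natural key, the comparator sorts on the full composite key, and grouping on the natural key then hands each reducer its values in a controlled order.

from collections import defaultdict
from itertools import groupby

# Simulated secondary sort with composite keys of (natural_key, value).
map_output = [("a", 5), ("b", 2), ("a", 1), ("b", 9), ("a", 3)]
NUM_REDUCERS = 2

# Partitioner: hash only the natural key, so all values of a key meet in one reducer.
partitions = defaultdict(list)
for key, value in map_output:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Comparator: sort on the full composite key; flip to reverse=True for descending values.
for reducer_id, records in partitions.items():
    records.sort()
    # Grouping: reduce() is invoked once per natural key, with its values now ordered.
    for key, group in groupby(records, key=lambda kv: kv[0]):
        print(reducer_id, key, [v for _, v in group])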

Hive distribute by vs without distribute by

This may sound basic, but the question has haunted me for a while.
Let's say I have the following query:
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
SORT BY s.symbol ASC;
In this case, if the data has a good spread on the symbol column, it makes sense to distribute by that column so that all reducers get a fair share of the data. Changing the query to the following should give better performance:
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;
What is the effect if I don't specify the distribute by clause? What is the default map output key chosen in the first query, i.e. which column is the data distributed on?
I found the answer myself. With sort by, the output key from the mapper is not the column on which sort by is applied. The key could be the file offset of the record.
The output from the reducers is sorted per reducer, but the same sort by column value can appear in the output of more than one reducer. This means there is an overlap among the outputs of the reducers. Distribute by ensures that the data is split among the reducers based on the distribute by column, so the same column value goes to the same reducer and therefore the same output file.
Details are available at the link below; I think this is the answer you are looking for.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
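To see the difference concretely, here is a small Python simulation with made-up stock rows: with sort by alone, rows reach the reducers by something unrelated to the symbol (simulated here by their position), so each reducer's sorted output may contain the same symbol; adding distribute by routes every row of a symbol to exactly one reducer and hence one output file.

from collections import defaultdict

rows = [("AAPL", 1), ("GOOG", 2), ("AAPL", 3), ("MSFT", 4), ("GOOG", 5), ("AAPL", 6)]
NUM_REDUCERS = 2

# SORT BY only: rows are spread over reducers by something unrelated to symbol
# (here, their position), then each reducer sorts just its own slice.
sort_by_only = defaultdict(list)
for i, row in enumerate(rows):
    sort_by_only[i % NUM_REDUCERS].append(row)
print({r: sorted(v) for r, v in sort_by_only.items()})   # AAPL shows up in both outputs

# DISTRIBUTE BY symbol + SORT BY: rows are routed by hash(symbol), so each
# symbol appears in exactly one reducer's (sorted) output.
distributed = defaultdict(list)
for symbol, price in rows:
    distributed[hash(symbol) % NUM_REDUCERS].append((symbol, price))
print({r: sorted(v) for r, v in distributed.items()})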

How to implement self-join/cross-product with hadoop?

It is a common task to evaluate pairs of items.
Examples: de-duplication, collaborative filtering, similar items, etc.
This is basically a self-join or cross-product on the same source of data.
To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.
So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:
don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2
The mapper would emit the key->value pairs:
(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)
In the reducer, we'll get this:
(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
From here, you can do the inner join logic if you so choose. To do this, you'd just match up every item with every other item: load the data into an array list, then do an N x N loop over the items to compare each to each other (see the sketch below).
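Here is a short Python sketch of that reducer-side pairing, using the sample city data above; the surrounding MapReduce plumbing is simulated, and only the buffering and pairing logic matters.

from collections import defaultdict
from itertools import combinations

lines = ["don,baltimore,12", "jerry,boston,19", "bob,baltimore,99",
         "cameron,baltimore,13", "james,seattle,1", "peter,seattle,2"]

# Map: emit (city -> (name, value)); the shuffle groups records by city.
groups = defaultdict(list)
for line in lines:
    name, city, value = line.split(",")
    groups[city].append((name, int(value)))

# Reduce: buffer each city's records, then pair every item with every other item.
# combinations() yields each unordered pair once; use a full N x N loop if you
# need ordered pairs or self-pairs.
for city, records in groups.items():
    for (name_a, val_a), (name_b, val_b) in combinations(records, 2):
        print(city, name_a, name_b, val_a + val_b)   # replace with your own pair logic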
Realize that reduce-side joins are expensive. They send pretty much all of the data to the reducers if you don't filter anything out. Also, be careful of loading the data up into memory in the reducers-- you may blow your heap on a hot join key by loading all of the data in an array list.
The above is a bit different than the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both.
Cross product in the case where you don't have any constraints is a nightmare. I.e.,
select * from tablea, tableb;
There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.
If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.
One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.
