Hive distribute by vs without distribute by - sorting

This may sound basic, but the question has haunted me for a while.
Let's say I have the following query:
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
SORT BY s.symbol ASC;
In this case, if the data has a good spread on the symbol column, it makes sense to distribute on that column so that all reducers get a fair share of the data. Changing the query to the following should give better performance:
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;
What is the effect if I don't specify the DISTRIBUTE BY clause? What is the default map output key column chosen in the first query, i.e., which column is the data distributed on?

I found the answer myself. With SORT BY alone, the map output key is not the column on which SORT BY is applied; the key could be the file offset of the record.
The output of each reducer is sorted, but the same SORT BY column value can appear in the output of more than one reducer, which means the reducers' outputs overlap. DISTRIBUTE BY splits the data among the reducers on the distribute-by column, ensuring that rows with the same column value go to the same reducer, and therefore to the same output file.
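To make the overlap concrete, here is a hypothetical two-reducer run over a handful of symbols (illustrative only, not actual Hive output):
With SORT BY s.symbol alone, rows are spread arbitrarily:
reducer 0 file: AAPL, GE, IBM
reducer 1 file: AAPL, IBM
Both files contain AAPL and IBM, so producing a total order would require a merge.
With DISTRIBUTE BY s.symbol added, each symbol lands in exactly one file:
reducer 0 file: AAPL, AAPL
reducer 1 file: GE, GE, IBM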

Details are available in the Hive language manual; I think this is the answer you are looking for:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

Related

Hive count distinct UDAF2

I've read a question on SO:
I ran into a Hive query calculating a count distinct without grouping,
which runs very slow. So I was wondering how is this functionality
implemented in Hive, is there a UDAFCountDistinct for this?
And the answer:
To achieve count distinct, Hive relies on the GenericUDAFCount. There
is no UDAF specifically implemented for count distinct. Those
'distinct by' keys will be a part of the partitioning key of the
MapReduce Shuffle phase; this way they are 'distincted' quite
naturally.
In your case, it runs slowly because there is only one reducer
processing the massive detailed data. You can use a group by before
counting to get more parallelism:
select count(1) from (select id from tbl group by id) tmp;
However, I don't understand a few things:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
Why will there be only one reducer in this case?
Why does the weird inner query cause more partitions?
I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF GenericUDAFCount is capable of both count and count distinct. How does it achieve count distinct?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
The distinct-by keys are the expressions (columns) inside count(distinct ...); in this case, brand.
The partition-by keys are the fields used to calculate a hash code for a record at the map phase. This hash value is then used to decide which partition a record should go to. Usually, the partition-by keys come from the GROUP BY part of the query. In this case, it's category.
The actual map output key is the composition of the partition-by key and the distinct-by key. For the above case, a mapper's output key might look like (drink, Pepsi).
This design makes all rows with the same group-by key fall into the same reducer.
The value part of mappers' output doesn’t matter here.
Later, at the shuffle phase, records are sorted according to the sort-by keys, which are the same as the output keys.
Then at the reduce phase, at each individual reducer, all records are sorted first by category and then by brand. This makes it easy to get the result of the count(distinct) aggregation: each distinct (category, brand) pair is guaranteed to be processed only once, so the aggregation turns into a count(*) within each group. The input key of a call to the reduce method is one of these distinct pairs. The reducer keeps track of the composite key; whenever the category part changes, we know a new group has started and we begin counting that group from 1.
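A small worked example (made-up data): suppose the distinct composite keys arriving at one reducer, already sorted, are
(drink, Coke), (drink, Pepsi), (food, Bread), (food, Rice)
Each reduce call sees one distinct pair, so the reducer simply counts pairs while watching the category part: when it changes from drink to food, it emits (drink, 2) and resets the counter, finally emitting (food, 2).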
Part 2:
Why will there be only one reducer in this case?
When calculating count distinct without group by like this:
select count(distinct brand) from market;
There will be just one reducer taking all the work. Why? Because the partition-by key doesn't exist; in other words, all records have the same hash code, so they all fall into the same reducer.
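As a sketch of why this collapses to one reducer, here is the hash-partitioning logic in Java, mirroring Hadoop's HashPartitioner (a simplified illustration, not Hive's actual code):

// Simplified hash partitioning: the partition index is derived from
// the partition-by key's hash code.
static int getPartition(String partitionByKey, int numReduceTasks) {
    return (partitionByKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
// With no group-by columns the partition-by key is effectively a constant
// (say, the empty string), so getPartition returns the same index for every
// record, and a single reducer receives all of them.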
Part 3:
Why does the weird inner query cause more partitions?
The inner query's partition-by key is the GROUP BY key, id. There's a good chance that id values are fairly evenly distributed, so the records are processed by many different reducers. Then, after the inner query, it is safe to conclude that all the id values are distinct from each other, so a simple count(1) is all that's needed.
But note that the outer query still launches only one reducer. Why doesn't it suffer? Because no detailed values are needed for count(1); map-side aggregation hugely cuts down the amount of data the reducer processes.
One more thing: this rewriting is not guaranteed to perform better, since it introduces an extra MR stage.

Knowing in advance which bucket a Hive row will go to

When we have bucketing defined on a certain column, Hive calculates a hash value for each unique value of the column and sends rows to buckets accordingly. This can lead to skewed buckets, when the hash values of multiple unique values send them to the same bucket. Now let's say the bucketed column is country, and the values are:
country = {'USA','Brazil','Findland','India','England'}
Now there is no guarantee that hashing will send all 5 countries to different buckets (assume the number of buckets is 5). Is there any way to know in advance which bucket a row with a specific value for country will be sent to? I am looking for something like this:
select know_which_bucket(country,5) from table;
It need not be a Hive function; I am just trying to explain what I am looking for logically, I just need the info. Basically, I will specify a set of values and a number of buckets, and I want to know which bucket each of the values will go to. If someone can provide a Java pointer for this, that will help too. Also, I am not necessarily looking for a programmatic way; even an online calculator will do.
I figured it out; this is how you do it:
For example, to know which bucket (file) 'USA' will go to, provided that the number of buckets is 5:
select pmod(hash('USA'),5);
It returns 3, so rows with 'USA' will be stored in the file 000003_0. This way we can know the bucket for any input value and any number of buckets.
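For a Java pointer, here is a sketch that reproduces pmod(hash(...)) for plain ASCII strings. It assumes Hive's hash() of a string follows the familiar r = r*31 + c recipe over the characters (the same recipe as Java's String.hashCode()), which holds for ASCII input:

// Compute the bucket number for a string value, the way the query above does.
static int bucketFor(String value, int numBuckets) {
    int h = 0;
    for (int i = 0; i < value.length(); i++) {
        h = 31 * h + value.charAt(i); // same recipe as String.hashCode()
    }
    return ((h % numBuckets) + numBuckets) % numBuckets; // pmod keeps it non-negative
}

bucketFor("USA", 5) returns 3, matching select pmod(hash('USA'),5); above.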

HBase design: concatenated key-value pairs vs many columns

Please help me understand the best way to store information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and am a bit confused. People have suggested using fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns, like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs encoded into one big string, like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance. I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over a user profiling output, like percentiles, engagement, and stat summaries for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similarity of the names column family and column. These are different concepts in HBase. A column family consists of several columns, and this design improves the speed of access when you only need to read certain types of columns. E.g., if you have raw data and processed data stored in separate column families, reading the processed data will not involve the raw data. You can have practically any number of columns per row key, but a row must fit in one region, i.e., no more than 10 GB. The design depends on what you want (see the sketch after the list below):
The first variant has no alternative when you need to store so much data per row key that it cannot fit in one region (more than 10 GB).
The second is good when you need to read only a few metrics per single read of a row key.
The last variant is suitable when you always read all metrics in a single read of a row key.
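To make the three variants concrete, here is a hedged sketch using the HBase Java client (the row key, qualifiers, and values are hypothetical):

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

class LayoutSketch {
    static List<Put> threeLayouts() {
        byte[] cf = Bytes.toBytes("cf1");
        String rowkey = "hasheduid-20180101-sess42"; // hypothetical hashed_uid+date+session_id

        // Variant 1: one metric per row; the metric name is folded into the row key.
        Put v1 = new Put(Bytes.toBytes(rowkey + "_duration"));
        v1.addColumn(cf, Bytes.toBytes("v"), Bytes.toBytes("127"));

        // Variant 2: one column per metric under the same row key.
        Put v2 = new Put(Bytes.toBytes(rowkey));
        v2.addColumn(cf, Bytes.toBytes("duration"), Bytes.toBytes("127"));
        v2.addColumn(cf, Bytes.toBytes("depth"), Bytes.toBytes("4"));

        // Variant 3: all metrics packed into one serialized value.
        Put v3 = new Put(Bytes.toBytes(rowkey));
        v3.addColumn(cf, Bytes.toBytes("m"), Bytes.toBytes("duration:127,depth:4"));

        return Arrays.asList(v1, v2, v3);
    }
}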

Ordering of the list of values for each key in reducer output

I am new to Hadoop and a little confused about it.
In a MapReduce job, the reducer gets a list of values for each key. I want to know: what is the default ordering of the values for each key? Is it the same order in which they were written out by the mapper? Can you change the ordering (e.g., ascending or descending) of the values for each key?
Is it the same order in which they were written out by the mapper? - Yes.
This is true for a single mapper. But if your job has more than one mapper, you may not see the same order across two runs with the same input, since different mappers may finish at different times.
Can you change the ordering (e.g., ascending or descending) of the values for each key? - Yes.
It is done using a technique called 'secondary sort' (you can search for more reading on this).
In MapReduce, a few properties affect how map output is emitted and ordered; controlling them is the essence of the secondary sort. Namely, two factors are involved:
Partitioner, which divides the map output among the reducers. Each partition is processed by a reduce task, so the number of partitions equals the number of reduce tasks for the job.
Comparator, which controls the sort order of keys; with a composite key it effectively orders the values within each key.
The default partitioner is the org.apache.hadoop.mapred.lib.HashPartitioner class, which hashes a record’s key to determine which partition the record belongs in.
Comparators differ by data type. If you want to control the sort order, override compare(WritableComparable, WritableComparable) of the WritableComparator class; see the Hadoop API documentation.
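As a hedged sketch of the secondary-sort pattern: assume composite Text keys of the form "naturalKey|value" (the class name and separator are illustrative, not part of Hadoop):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts composite keys "naturalKey|value" so that values arrive at the
// reducer in descending order within each natural key.
public class DescendingValueComparator extends WritableComparator {
    protected DescendingValueComparator() {
        super(Text.class, true); // true: instantiate Text keys for compare()
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] left = a.toString().split("\\|", 2);
        String[] right = b.toString().split("\\|", 2);
        int byKey = left[0].compareTo(right[0]); // natural keys ascending
        if (byKey != 0) {
            return byKey;
        }
        return right[1].compareTo(left[1]); // values descending
    }
}

It would be wired up with job.setSortComparatorClass(DescendingValueComparator.class), together with a partitioner and grouping comparator that look only at the natural-key part, so all values for a key still meet in one reduce call.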

Determining Bucketing Configuration on Hive Table

I was curious if someone could provide a little more clarification on how to configure the bucketing property on a Hive table. I see that it helps with joins, and I believe I read that it's good to put it on a column you will join on. That could be wrong. I am also curious how to determine the number of buckets to choose.
If anyone could give a brief explanation and some documentation on how to determine all of these things that would be great.
Thanks in advance for the assistance.
Craig
If you want to implement bucketing in your table, first you should set the property
set hive.enforce.bucketing=true;
which enforces bucketing when data is inserted.
Cardinality: the number of possible values for a column.
If you're implementing bucketing using the CLUSTERED BY clause, your bucketing column should have high cardinality; then you will get better performance.
If you're implementing partitioning using the PARTITIONED BY clause, your partition column should have low cardinality; then you will get better performance.
Depending on the use case, you can choose the number of buckets. It's good to keep each bucket no smaller than your HDFS block size, and the number of buckets should be a power of 2.
Bucketing always creates files, not directories.
The following are a few suggestions to consider while designing buckets.
Buckets are generally created on the most critical columns, either a single column or a set of columns, so these columns will typically be the primary columns in various join conditions; the concept of bucketing is to hash this set of columns and store the data so that it is easily and quickly retrieved from HDFS. It's advised not to bucket on all the join columns, only on the critical ones that we think will improve performance.
The number of buckets should be a power of 2. The number of buckets determines the number of reducers to be run, and that in turn determines the final number of files in which the data is stored. So the number of buckets has to be designed keeping in mind the size of the data we are handling, thereby avoiding a large number of small files in HDFS in favor of fewer big files, thus improving Hive query retrieval speed.
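For example (illustrative numbers only): with roughly 500 GB in the table and a target bucket file of about 2 GB, 256 buckets (2^8) keeps each file comfortably above the HDFS block size while avoiding thousands of tiny files.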
