I am trying to use Pig's RANK operator to assign an integer to each string value. It works when I set the PARALLEL clause to 1, but not with a higher value (like 200). I need multiple reducers to speed up processing, since by default Pig uses only one reducer, which takes a long time.
My query is as follows:
rank = rank tupl1 by col1 ASC parallel 200;
According to the Pig documentation (https://pig.apache.org/docs/r0.11.1/perf.html#parallel):
You can include the PARALLEL clause with any operator that starts a
reduce phase: COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN
(outer), and ORDER BY.
I think that is why you get an error: it is not possible to set the PARALLEL clause for RANK.
I've read a question on SO:
I ran into a Hive query calculating a count distinct without grouping,
which runs very slow. So I was wondering how is this functionality
implemented in Hive, is there a UDAFCountDistinct for this?
And the answer:
To achieve count distinct, Hive relies on the GenericUDAFCount. There
is no UDAF specifically implemented for count distinct. Those
'distinct by' keys will be a part of the partitioning key of the
MapReduce Shuffle phase, this way they are 'distincted' quite
naturally.
As per your case, it runs slowly because there will be only one
reducer to process massive detailed data. You can use a group by
before counting to get more parallelism:
select count(1) from (select id from tbl group by id) tmp;
However I don't understand a few things:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
Why there will be only one reducer in this case?
Why the weird inner query will cause more partitions?
I'll try to explain.
Part 1:
What did the answerer mean by "Those 'distinct by' keys will be a part of the partitioning key of the MapReduce Shuffle phase"? Could you explain more about it?
The UDAF GenericUDAFCount is capable of both count and count distinct. How does it work to achieve count distinct?
Let's take the following query as an example:
select category, count(distinct brand) from market group by category;
One MapReduce Job will be launched for this query.
distinct-by keys are the expressions (columns) within count(distinct ...); in this case, brand.
partition-by keys are the fields used to calculate a hash code for a record at the map phase. This hash value is then used to decide which partition a record should go to. Usually, partition-by keys lie in the group by part of a SQL query. In this case, it's category.
The actual output-key of mappers will be the composition of partition-by key and a distinct-by key. For the above case, a mapper's output key may be like (drink, Pepsi).
This design makes all rows with the same group-by key fall into the same reducer.
The value part of mappers' output doesn’t matter here.
Later, at the shuffle phase, records are sorted according to the sort-by keys, which are the same as the output key.
Then, at the reduce phase, each individual reducer receives its records sorted first by category and then by brand. This makes it easy to get the result of the count(distinct ) aggregation. Each distinct (category, brand) pair is guaranteed to be processed only once, so the aggregation has been turned into a count(*) for each group. The input key of a call to the reduce method will be one of these distinct pairs. Reducer processes keep track of the composite key. Whenever the category part changes, we know a new group has started and we start counting this group from 1.
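To make Part 1 concrete, here is a minimal Python simulation of the idea, not Hive's actual code. The function names, the toy byte-sum partitioner and the sample rows are all made up for illustration. It routes records by category only, sorts each reducer's input by the full (category, brand) key, and counts each distinct pair once per category.

```python
from itertools import groupby

def partition_of(category, num_reducers):
    # Hypothetical stand-in for the partitioner: only the group-by key is hashed.
    return sum(category.encode("utf-8")) % num_reducers

def count_distinct_by_group(rows, num_reducers=2):
    # Map phase: emit the composite key (category, brand), routed by category only.
    partitions = [[] for _ in range(num_reducers)]
    for category, brand in rows:
        partitions[partition_of(category, num_reducers)].append((category, brand))

    result = {}
    for records in partitions:
        # Shuffle: each reducer sees its records sorted by the whole composite key.
        records.sort()
        # Reduce: identical composite keys are adjacent, so each distinct pair
        # is visited once and simply adds 1 to its category's count.
        for (category, _brand), _duplicates in groupby(records):
            result[category] = result.get(category, 0) + 1
    return result

# Made-up sample data for the market table.
market = [
    ("drink", "Pepsi"), ("drink", "Coke"), ("drink", "Pepsi"),
    ("snack", "Lays"), ("snack", "Lays"), ("snack", "Oreo"), ("snack", "Pringles"),
]
counts = count_distinct_by_group(market)
print(counts)
```

Because the partition is chosen from category alone, every brand of a given category lands on the same simulated reducer, which is what makes the per-group count correct.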
Part 2:
Why there will be only one reducer in this case?
When calculating count distinct without group by like this:
select count(distinct brand) from market
There will be just one reducer taking all the work. Why? Because the partition-by key doesn’t exist, or we can say that all records have the same hash code. So they will all fall into the same reducer.
Part 3:
Why the weird inner query will cause more partitions?
The inner query's partition-by key is the group by key, id. There’s a chance that id values are quite evenly distributed, so records are processed by many different reducers. Then after the inner query, it's safe to conclude that all the ids are different from each other. So now a simple count(1) is all that's needed.
But do note that the outer query will launch only one reducer. Why doesn’t it suffer? Because no detailed values are needed for count(1), and map-side aggregation hugely cuts down the amount of data processed by that reducer.
One more thing: this rewriting is not guaranteed to perform better, since it introduces an extra MR stage.
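The two-stage rewrite can be sketched in Python as a rough simulation. Again, this is only an illustration under assumptions: the partitioner, the function names and the sample ids are hypothetical, and real Hive plans will differ in detail.

```python
def partition_of(key, num_reducers):
    # Hypothetical deterministic partitioner for the simulation.
    return sum(str(key).encode("utf-8")) % num_reducers

def count_distinct_two_stage(ids, num_reducers=4):
    # Stage 1 (inner query, group by id): ids are spread over several reducers,
    # and each reducer keeps one row per distinct id it receives.
    reducers = [set() for _ in range(num_reducers)]
    for value in ids:
        reducers[partition_of(value, num_reducers)].add(value)

    # Stage 2 (outer query, count(1)): each stage-1 output is pre-aggregated
    # into a partial count (map-side aggregation), so the single final
    # reducer only has to add up a handful of numbers.
    partial_counts = [len(group) for group in reducers]
    return sum(partial_counts), partial_counts

# Made-up sample data.
ids = [3, 7, 3, 11, 7, 42, 42, 42, 5]
total, partials = count_distinct_two_stage(ids)
print(total, partials)
```

The final reducer receives one number per upstream task instead of every detailed row, which is why a single reducer in the outer query is not a bottleneck here.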
For a given cluster (512GB RAM, 100 vCores), I'm trying to minimize the execution time of a workflow with multiple "instances" of the same PIG script.
Increasing the PARALLEL clause value for COGROUP operations gives better results. However, is there a formula for picking a good value for this clause? The PIG documentation is very evasive about that!
Unfortunately, there is no firm rule for choosing the number of reducers. It is best determined empirically, by measuring the execution time of the COGROUP phase while trying different values for PARALLEL (from my experience, I suggest starting with 100).
Nevertheless, the upper bound is usually given as numReduces << heapSize/(2*io.buffer.size). You can find more here.
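As a quick arithmetic illustration of that bound, here is a small Python sketch. The heap size and buffer size below are assumed example values, not settings taken from the cluster in the question.

```python
def reducer_ceiling(heap_bytes, io_buffer_bytes):
    # Right-hand side of: numReduces << heapSize / (2 * io.buffer.size)
    return heap_bytes // (2 * io_buffer_bytes)

heap_bytes = 1024 ** 3      # assumed example: 1 GiB of heap
io_buffer_bytes = 4096      # assumed example: 4 KiB I/O buffer

ceiling = reducer_ceiling(heap_bytes, io_buffer_bytes)
print(ceiling)
```

Since the relation is "much less than", this number is only a ceiling to stay well below, not a recommended PARALLEL value. The empirical approach above is still how the actual value gets picked.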
In the following code:
dsBodyStartStopSort =
order dsBodyStartStop
by terminal_id, point_date_stamp, point_time_stamp, event_code
;
dsBodyStartStopRank =
rank dsBodyStartStopSort
;
store dsBodyStartStopSort
into 'xStartStopSort.csv'
using PigStorage(';')
;
I know that if I don't do that RANK in the middle, the sort order will make it to the STORE command. And that is guaranteed by Pig.
And it appears from the testing I've done that doing RANK does not mess up the sort order--but is that guaranteed? I don't want to just be running on luck.
More generally, what is Pig's rule for preserving sort once it's done? Is it until some reduce event occurs? Will it work across FILTER? Certainly not GROUP? Just wondering if there is a well defined set of rules on when and how Pig guarantees or does not guarantee order.
To summarize: 1) Is order preserved across RANK? 2) How is order preserved generally?
The best piece of documentation I found on the topic:
Bags are disordered unless you explicitly apply a nested ORDER BY
operation as demonstrated below. A nested FOREACH will preserve
ordering, letting you order by one combination of fields then project
out just the values you'd like to concatenate.
From looking at unofficial examples and comments, I conclude the following:
If you do an order right before a rank, it should preserve the order. Personally I prefer to just use RANK xxx BY a,b,c; and only use ORDER afterwards if it is really needed.
If you do an order right before a LIMIT, it should feed LIMIT with the top lines. However the output would be sorted rather than in the original order.
I would like to obtain the first quartile of values from a column (speed) of data in table totalSpeeds.
To do this, I tried creating a variable (threshold), then selected values that were less than or equal to it.
SET threshold = (SELECT 0.25*MAX(speed) FROM totalSpeeds);
SELECT speed FROM totalSpeeds WHERE speed <= ${hiveconf:threshold};
This failed and returned a parse error. Is there a more efficient way of obtaining the upper-bound of the first quartile of speeds? Or is there a way of tweaking the above commands to return the first-quartile speeds?
Thanks in advance,
Anita
There is a built-in UDF in Hive for calculating percentiles. Use:
select percentile(speed, .25) from totalSpeeds;
Explanation of the UDF:
Returns the exact pth percentile of a column in the group. p must be between 0 and 1.
Similarly, multiple percentiles can be extracted at once by using percentile(speed, array(p1, p2)).
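To show what an exact percentile computes and how it answers the original question, here is a small Python sketch. It sorts the values and interpolates linearly between the two nearest ranks. That interpolation rule is an assumption made for this sketch and may not match Hive's behaviour in every detail, and the sample speeds are made up.

```python
import math

def percentile(values, p):
    # Exact p-th percentile with linear interpolation between neighbouring ranks
    # (the interpolation rule is an assumption of this sketch).
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Made-up sample data for the speed column.
speeds = [50, 10, 40, 20, 30]
threshold = percentile(speeds, 0.25)

# Rows in the first quartile: everything at or below the 25th percentile.
first_quartile = [s for s in speeds if s <= threshold]

# Several percentiles at once, analogous to passing array(p1, p2).
quartiles = [percentile(speeds, p) for p in (0.25, 0.5)]
print(threshold, first_quartile, quartiles)
```

Note that this differs from the 0.25*MAX(speed) threshold in the question, which scales the maximum rather than finding the value below which a quarter of the rows fall.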
I have a question about configuring a map-side inner join for multiple mappers in Hadoop.
Suppose I have two very large data sets A and B, and I use the same partitioning and sorting algorithm to split them into smaller parts. For A, assume I have a(1) to a(10), and for B I have b(1) to b(10). It is assured that a(1) and b(1) contain the same keys, a(2) and b(2) have the same keys, and so on. I would like to set up 10 mappers, specifically mapper(1) to mapper(10). To my understanding, a map-side join is a pre-processing task prior to the mapper; therefore, I would like to join a(1) and b(1) for mapper(1), a(2) and b(2) for mapper(2), and so on.
After reading some reference materials, it is still not clear to me how to configure these ten mappers. I understand that using CompositeInputFormat I would be able to join two files, but it seems to configure only one mapper and join the 20 files pair after pair (in 10 sequential tasks). How can I configure all ten mappers and join the ten pairs at the same time in genuine Map/Reduce fashion (10 tasks in parallel)? To my understanding, ten mappers would require ten CompositeInputFormat settings because the files to join are all different. I strongly believe this is practical and doable, but I couldn't figure out the exact commands I should use.
Any hint and suggestion is highly welcome and appreciated.
Shi
Thanks a lot for the replies, David and Thomas!
I appreciate your emphasis on the prerequisites for a map-side join. Yes, I am aware of the sorting, the API, etc. After reading your comments, I think my actual problem is finding the correct expression for joining multiple splits of two files in CompositeInputFormat. For example, I have dataA and dataB sorted and reduced into 2 files each:
/A/dataA-r-00000
/A/dataA-r-00001
/B/dataB-r-00000
/B/dataB-r-00001
The expression command I am using now is:
inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/A/dataA-r-00000"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/B/dataB-r-00000"))
It works but, as you mentioned, it only starts two mappers (because the inner join prevents splitting) and could be very inefficient if the files are big. If I want to use more mappers (say, another 2 mappers to join dataA-r-00001 and dataB-r-00001), how should I construct the expression? Is it something like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00000'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00000'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00001'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00001'))" ;
But I think that could be a mistake, because the command above actually performs an inner join of four files (which will produce nothing in my case, because the *r-00000 and *r-00001 files have non-overlapping keys).
Or I could just use the two dirs as inputs, like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/'))" ;
Will the inner join match the pairs automatically according to the file endings, say "00000" to "00000" and "00001" to "00001"? I am stuck at this point because I need to construct the expression and pass it to
conf.set("mapred.join.expr", joinexpression);
In short, how should I build the proper expression if I want to use more mappers to join multiple pairs of files simultaneously?
There are map-side and reduce-side joins.
You proposed to use a map side join, which is executed inside a mapper and not before it.
Both sides must have the same key and value types. So you can't join a LongWritable and a Text, although they might have the same value.
There are a few more subtle things to note:
Input files have to be sorted, so they are likely to be reducer output.
You can control the number of mappers in your join map phase by setting the number of reducers in the jobs that sort the datasets.
The whole procedure basically works like this: You have dataset A and dataset B, both share the same key, let's say LongWritable.
Run two jobs that sort the two datasets by their keys. Both jobs HAVE TO set the number of reducers to the same number, say 2.
This will result in 2 sorted files for each dataset.
Now set up the job that joins the datasets. This job will spawn 2 mappers. It could be more if you set the number of reducers higher in the previous jobs.
Do whatever you like in the reduce step.
If the number of the files to be joined is not equal it will result in an exception during job setup.
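The procedure above can be sketched as a small Python simulation. This is not Hadoop code: the function names are hypothetical, the modulo partitioner stands in for the sort jobs' partitioner, and keys are assumed to be integers that are unique within each dataset.

```python
def partition_and_sort(records, num_parts):
    # Stand-in for a sort job with num_parts reducers: same partitioner, sorted output.
    parts = [[] for _ in range(num_parts)]
    for key, value in records:
        parts[key % num_parts].append((key, value))
    return [sorted(part) for part in parts]

def merge_join(left, right):
    # Inner join of two key-sorted lists (keys assumed unique within each side).
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            joined.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
    return joined

def map_side_join(a_parts, b_parts):
    # One simulated mapper per pair of files; unequal counts fail at setup.
    if len(a_parts) != len(b_parts):
        raise ValueError("both datasets must have the same number of sorted files")
    return [merge_join(a, b) for a, b in zip(a_parts, b_parts)]

# Made-up sample datasets sharing integer keys.
data_a = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")]
data_b = [(2, "b2"), (3, "b3"), (5, "b5")]
joined_parts = map_side_join(partition_and_sort(data_a, 2), partition_and_sort(data_b, 2))
print(joined_parts)
```

Each pair of files is joined independently of the others, so the pairs can run in parallel, one mapper per pair.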
Setting up a join is kind of painful, mainly because you have to use the old API for the mapper and reducer if your version is older than 0.21.x.
This document describes very well how it works. Scroll all the way to the bottom. Sadly, this documentation is somehow missing from the latest Hadoop docs.
Another good reference is "Hadoop: The Definitive Guide", which explains all of this in more detail and with examples.
I think you're missing the point. You don't control the number of mappers. It's the number of reducers that you have control over. Simply emit the correct keys from your mapper. Then run 10 reducers.