How should I select top 10% of the table? - hadoop

I need to select the top x% of rows of a table in Pig. Could someone tell me how to do it without writing a UDF?
Thanks!

As mentioned before, first you need to count the number of rows in your table and then obviously you can do:
A = load 'X' as (row);
B = group A all;
C = foreach B generate COUNT(A) as count;
D = LIMIT A C.count/10; --you might need a cast to integer here
The catch is that dynamic argument support for LIMIT was only introduced in Pig 0.10. If you're working with an earlier version, one suggested workaround here is to use the TOP function instead.

I'm not sure how you would go about pulling a percentage, but if you know your table has 100 rows, you can use the LIMIT operator to get the top 10%, for example:
A = load 'myfile' as (t, u, v);
B = order A by t;
C = limit B 10;
(Above example adapted from http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+LIMIT+Operator)
As for dynamically limiting to 10%, I'm not sure you can do this without knowing how 'big' the table is, and I'm pretty sure you couldn't do it inside a UDF: you'd need to run one job to count the number of rows, then another job to do the LIMIT query.

I won't write the Pig code, as it would take a while to write and test, but here is how I would do it if you need an exact solution (if not, there are simpler methods):
Get a sample from your input. Say a few thousand data points or so.
Sort this and find the n quantiles, where n should be somewhere in the order of the number of reducers you have or somewhat larger.
Count the data points for each quantile.
At this point the minimum of the top 10% will fall into one of these intervals. Find that interval (this is easy, as the counts tell you exactly where it is), and then, using the sum of the counts of the higher quantiles together with the count of the relevant interval, find the 10% cutoff point within it.
Go over your data again and filter out everything but the points larger than the one you just found.
Portions of this might require UDFs.
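The following is a rough, local Python sketch of that sampling-and-quantiles idea, just to make the steps concrete; the sample size, number of quantiles, and helper names are placeholders, and in Pig each pass would be its own job (with UDFs where noted above).
import random

def top_fraction_threshold(data, fraction=0.10, sample_size=5000, n_quantiles=20):
    # Steps 1-2: sample, sort, and derive quantile boundaries from the sample.
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    boundaries = [sample[len(sample) * i // n_quantiles] for i in range(1, n_quantiles)]

    # Step 3: count the full data per quantile interval (one pass).
    counts = [0] * n_quantiles
    for x in data:
        counts[sum(1 for b in boundaries if x >= b)] += 1

    # Step 4: walk down from the top interval until we cover `fraction` of the rows,
    # then refine the cutoff inside that interval.
    target = int(len(data) * fraction)
    covered = 0
    for idx in range(n_quantiles - 1, -1, -1):
        covered += counts[idx]
        if covered >= target:
            break
    in_interval = sorted(
        (x for x in data
         if (idx == 0 or x >= boundaries[idx - 1])
         and (idx == n_quantiles - 1 or x < boundaries[idx])),
        reverse=True)
    rows_needed_here = target - (covered - counts[idx])
    return in_interval[rows_needed_here - 1]

# Step 5: filter everything below the threshold.
# top_rows = [x for x in data if x >= top_fraction_threshold(data)]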

Related

Is there any option to do FOR loop in excel?

I have an Excel sheet in which I calculate the completed average of my Scrum tasks. The sheet also has a Story Point (SP) column. My calculation is:
Result = SP * percentage of completion --> This calculation is done for each row, and afterwards I sum up all the results to get the total.
But sometimes I add a new task, and for each new task I have to add its calculation to the average result as well.
Is there any way to use a for loop in Excel?
for(int i=0;i<50;i++){ if(SP!=null && task!=null)(B+i)*(L+i)}
My calculation is like below:
AVERAGE((B4*L4+B5*L5+B6*L6+B7*L7+B8*L8+B9*L9+B10*L10)/SUM(B4:B10))
First of all, AVERAGE is not doing anything in your formula, since the argument you pass to it is just one single value. You already do an average calculation by dividing by the sum. That average is in fact a weighted average, and so you could not even achieve that with a plain AVERAGE function.
I see several ways to make this formula more generic, so it keeps working when you add rows:
1. Use SUMPRODUCT
=SUMPRODUCT(B4:B100,L4:L100)/SUM(B4:B100)
The row number 100 is chosen arbitrarily, but it should evidently encompass all data rows. If no other data occurs below your table, it is safe to add a large margin. You'll want to avoid the situation where you think you are adding a line to the table but actually end up outside the range of the formula. Using proper Excel tables can help avoid this situation.
2. Use an array formula
This would be a second resort for when the formula becomes more complicated and cannot be executed with a "simple" SUMPRODUCT. But the above would translate to this array formula:
=SUM(B4:B100*L4:L100)/SUM(B4:B100)
Once you have typed this in the formula bar, make sure to press Ctrl+Shift+Enter to enter it. Only then will it act as an array formula.
Again, the same remark about row number 100.
3. Use an extra column
Things get easy when you use an extra column for storing the product of B & L values for each row. So you would put in cell N4 the following formula:
=B4*L4
...and then copy that relative formula to the other rows. You can hide that column if you want.
Then the overall formula can be:
=SUM(N4:N100)/SUM(B4:B100)
With this solution you must take care to always copy a row when inserting a new row, as you need the N column to have the intermediate product formula also for any new row.

Random sampling in pyspark with replacement

I have a dataframe df with 9000 unique ids.
like
| id |
| 1  |
| 2  |
I want to generate a random sample, with replacement, of these 9000 ids, 100000 times (i.e. a sample of 100000 rows).
How do I do this in pyspark?
I tried
df.sample(True,0.5,100)
But I do not know how to get to exactly 100000 rows.
Okay, so first things first: you will probably not be able to get exactly 100,000 rows in your (over)sample. The reason is that, in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically, it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included; Spark does not check whether the results add up exactly to the number you want, but they tend to be pretty close for large datasets.
The code would look like this: df.sample(True, 11.11111, 100). This will take a sample of the dataset equal to 11.11111 times the size of the original dataset. Since 11.11111*9,000 ~= 100,000, you will get approximately 100,000 rows.
If you want an exact sample size, you have to use takeSample on the underlying RDD, i.e. df.rdd.takeSample(True, 100000). However, the result is not a distributed dataset: this call returns an Array (a very large one) to the driver. If it can fit in main memory, then do that. Because you require exactly the right number of IDs, though, I don't know of a way to do that in a distributed fashion.
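If staying approximate first and trimming afterwards is acceptable, one workaround (a sketch under my own assumptions, not part of the answer above) is to oversample slightly with sample() and then cut the result down to the exact size with limit(), which keeps everything as a DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 9001)                  # stand-in for the 9000 unique ids

target = 100000
fraction = target / df.count()             # ~11.11 when sampling with replacement

# Oversample by a small margin so we almost surely exceed the target...
oversampled = df.sample(withReplacement=True, fraction=fraction * 1.05, seed=100)
# ...then trim to the exact row count. Note that limit() just takes the first
# `target` rows of the sample, and the 5% margin is a heuristic, not a guarantee.
exact = oversampled.limit(target)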

Set variable to maximum value in HiveQL

I would like to obtain the first quartile of values from a column (speed) of data in table totalSpeeds.
To do this, I tried creating a variable (threshold), then selected values that were less than or equal to it.
SET threshold = (SELECT 0.25*MAX(speed) FROM totalSpeeds);
SELECT speed FROM totalSpeeds WHERE speed <= ${hiveconf:threshold};
This failed and returned a parse error. Is there a more efficient way of obtaining the upper-bound of the first quartile of speeds? Or is there a way of tweaking the above commands to return the first-quartile speeds?
Thanks in advance,
Anita
There is a built-in UDF in Hive for calculating percentiles. Use:
select percentile(speed, .25) from totalSpeeds;
Explanation of the UDF:
Returns the exact pth percentile of a column in the group. p must be between 0 and 1.
Similarly, you can extract multiple percentiles at once by using percentile(speed, array(p1, p2)).

Cross product in MapReduce

I'd like to perform the expensive operation of cross product across two data sets in Hadoop using Java MapReduce.
For example, I have records from data set A and data set B, and I'd like each record in data set A to be matched up to each record in data set B in the output. I realize that the output size of this would be |A| * |B|, but want to do it anyways.
I see that Pig has CROSS but am unaware of how it is implemented at a high-level. Perhaps I will go take a look at the source code.
Not looking for any code, just want to know at a high-level how I should approach this problem.
I have done something similar when looking at document similarity (comparing each document to every other document) and ended up with a custom input format that splits up the two datasets and then ensures there is a 'split' for each pairing of subsets.
So your splits would look like (each merging two sets of 10 records, outputting 100 records)
A(1-10) x B(1-10)
A(11-20) x B(1-10)
A(21-30) x B(1-10)
A(1-10) x B(11-20)
A(11-20) x B(11-20)
A(21-30) x B(11-20)
A(1-10) x B(21-30)
A(11-20) x B(21-30)
A(21-30) x B(21-30)
I don't remember how performant it was, but the document set was on the order of thousands of documents to compare against one another (on an 8-node dev cluster), with millions of cross products calculated.
I could also make improvements to the algorithm, since some documents would never score well against others (for example, if too much time had passed between them), and generate better splits as a result.
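A minimal local Python sketch of that block-pairing idea (plain Python, not Hadoop code; the block size and record lists are placeholders): each pairing of an A block with a B block corresponds to one input split, and the map task for that split emits the cross product of the two blocks.
from itertools import product

def blocks(records, block_size=10):
    # Yield consecutive blocks, e.g. A(1-10), A(11-20), ...
    for i in range(0, len(records), block_size):
        yield records[i:i + block_size]

def cross_product(a_records, b_records, block_size=10):
    # Each (a_block, b_block) pairing plays the role of one input split.
    for a_block, b_block in product(list(blocks(a_records, block_size)),
                                    list(blocks(b_records, block_size))):
        # Within a "split", emit every A record against every B record.
        for pair in product(a_block, b_block):
            yield pair

# 30 x 30 records -> 9 block pairings, 900 output pairs.
pairs = list(cross_product(list(range(30)), list(range(30))))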

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now I am not asking how to use random number generators; this is a statistical question. I know how to generate random numbers, but my question is how to ensure that the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, since I do not want the control rows to be identifiable simply by their position in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they end up close to each other if you do it truly at random :)
But what I would do is:
You have N rows of real data and x rows of control data.
To get the index at which to insert the i-th control row, I'd use N/(x+1) * i + r, where r is a random number, different for each control row and small compared to N/x. Choose any way of determining r; it can follow a Gaussian or even a flat (uniform) distribution. Here i is the index of the control row, so 1 <= i <= x.
This way you can be sure to avoid your control rows condensing in a single place, and also that they won't sit at perfectly regular distances from each other.
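A small Python sketch of that indexing scheme (the jitter size and the uniform distribution for r are my own assumptions):
import random

def control_indices(n_real, n_control, jitter_fraction=0.3):
    # Evenly spaced target positions, each nudged by a small random offset r.
    spacing = n_real / (n_control + 1)            # N / (x + 1)
    jitter = spacing * jitter_fraction            # keep r small compared to N/x
    indices = []
    for i in range(1, n_control + 1):
        r = random.uniform(-jitter, jitter)       # flat distribution for r
        indices.append(int(spacing * i + r))
    return sorted(indices)

print(control_indices(3000, 20))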
Here's my thought: why don't you just loop through the existing rows and "flip a coin" for each row to decide whether to insert random data there?
for (int i = 0; i < numberOfExistingRows; i++)
{
    double r = random();   // assuming random() returns a value in [0, 1)
    if (r > 0.5)
    {
        InsertRandomData();
    }
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with examples than with English):
If you were to spread the 20 control rows as evenly as possible among the 3000 real data rows, you'd insert one at every 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
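A rough Python sketch of steps a) through d), where the row contents and the +1 adjustment for already-inserted rows are my own additions:
import random

def insert_controls(real_rows, control_rows):
    spacing = len(real_rows) // len(control_rows)     # e.g. 3000 / 20 = 150
    out = list(real_rows)
    insertion_index = spacing                         # first target position
    for control in control_rows:
        offset = random.randint(0, spacing)           # a) random number in [0, 150]
        out.insert(insertion_index - offset, control) # b) insert the control row there
        insertion_index += spacing + 1                # c) advance (+1 for the row just added)
    return out                                        # d) loop repeats for each control row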
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a Gaussian deviate with its mean set to the real data size divided by the control data size (the former can be estimated if necessary, rather than measured or assumed known). Set the standard deviation of this Gaussian based on how much "spread" you're willing to tolerate: a smaller standard deviation means tighter adherence to uniform spacing, while a larger one means looser adherence.
Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well (simply keep generating deviates until you reach an "index" at least half the Gaussian mean beyond the end of the real data; if the index just before this was past the end, insert the control data at the end).
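A minimal Python sketch of the interarrival approach (the standard-deviation fraction, the clamp on tiny or negative gaps, and putting leftovers at the end are my own simplifications):
import random

def insert_by_interarrival(real_rows, control_rows, stddev_fraction=0.25):
    mean = len(real_rows) / len(control_rows)      # average gap between control rows
    stddev = mean * stddev_fraction                # "spread" you're willing to tolerate
    remaining = list(control_rows)
    out = []
    # Start the running index at minus half the mean so the start of the file
    # gets the same treatment as the rest.
    next_insert = -mean / 2 + max(1.0, random.gauss(mean, stddev))
    for position, row in enumerate(real_rows):
        while remaining and position >= next_insert:
            out.append(remaining.pop(0))
            next_insert += max(1.0, random.gauss(mean, stddev))
        out.append(row)
    out.extend(remaining)                          # symmetric trick: leftovers at the end
    return out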
You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject whose title is "Simulation".
Also: in some circumstances non-Gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outlined above still applies, using half the mean of whatever distribution seems right.
