I have a column called Project_ID which lists the names of many different projects, say Project A, Project B and so on. A second column lists the sales for each project.
A third column shows time series information. For example:
Project_ID Sales Time Series Information
A 10 1
A 25 2
A 31 3
A 59 4
B 22 1
B 38 2
B 76 3
C 82 1
C 23 2
C 83 3
C 12 4
C 90 5
D 14 1
D 62 2
From this dataset, I need to keep only those projects which have at least 4 time series points, and thus create a new dataset. How do I get this using R code? The result should be:
Project_ID Sales Time Series Information
A 10 1
A 25 2
A 31 3
A 59 4
C 82 1
C 23 2
C 83 3
C 12 4
C 90 5
I tried to do some filtering with R but had little success. Could someone please help? Thanks a lot!
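In case it's useful, here is a minimal sketch in base R, assuming the data frame is called df with columns Project_ID, Sales and Time (assumed names; a dplyr group_by plus filter would work just as well):

# rebuild the example data (hypothetical column names)
df <- data.frame(
  Project_ID = c("A","A","A","A","B","B","B","C","C","C","C","C","D","D"),
  Sales      = c(10,25,31,59,22,38,76,82,23,83,12,90,14,62),
  Time       = c(1,2,3,4,1,2,3,1,2,3,4,5,1,2)
)

# keep only the projects that occur at least 4 times
counts <- table(df$Project_ID)
new_df <- df[df$Project_ID %in% names(counts)[counts >= 4], ]
new_df

A one-liner with ave should also do it: df[ave(df$Sales, df$Project_ID, FUN = length) >= 4, ].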
Suppose we have two one-dimensional arrays of values, a and b, which both have length N. I want to create a new array c such that c(n) = dot(a(n:N), b(1:N-n+1)). I can of course do this using a simple loop:
c = zeros(N,1);   % preallocate the result
for n = 1:N
    c(n) = dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation, which resembles a convolution, I was wondering if there isn't a more efficient way to do it in MATLAB.
A solution using 1D convolution with conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
conv slides a reversed copy of its second argument across its first, so convolving a with the pre-flipped b yields exactly the sliding dot products we want; we then trim the unneeded leading part.
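For instance, a quick sanity check of the conv approach against the original loop, on some made-up random test vectors:

% hypothetical test data
N = 1000;
a = rand(N,1);
b = rand(N,1);

% loop version
c_loop = zeros(N,1);
for n = 1:N
    c_loop(n) = dot(a(n:N), b(1:N-n+1));
end

% convolution version
out = conv(a, flip(b));
c_conv = out(ceil(numel(out)/2):end);

% difference should be at floating-point noise level
max(abs(c_loop - c_conv))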
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
  9  27  54  90  90
 10  30  60 100 100
  2   6  12  20  20
 10  30  60 100 100
  7  21  42  70  70
We notice that c is made up of the sums along the lower diagonals of the result of conv2. To show this more clearly, we transpose the arguments to get the diagonals in the same order as the values in c:
>> triu(conv2(a.', b))
ans =
  9  10   2  10   7
  0  30   6  30  21
  0   0  12  60  42
  0   0   0 100  70
  0   0   0   0  70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solutions, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.
I'm looking to take an array of integers and perform a partial bucket sort on that array: every element in a given bucket must be greater than every element in the buckets before it. For example, if I have 10 buckets for the values 0-100, then 0-9 would go in the first bucket, 10-19 in the second, and so on.
For one example, I can take 1 12 23 44 48 and put them into 4 buckets out of 10. But if I have 1, 2, 7, 4, 9, 1, then all values go into a single bucket. I'm looking for a way to evenly distribute the values to all the buckets while maintaining an ordering between buckets. The elements in each bucket don't have to be sorted. For example, I'm looking for something similar to this:
2 1 9 2 3 8 7 4 2 8 11 4 => [[2, 1], [2, 2], [3], [4], [4], [7], [8, 8], [9], [11]]
I'm trying to use this as a quick way to partition a list in a map-reduce.
Thanks for the help.
Edit: maybe this clears things up:
I want to create a hashing function where all the elements in bucket1 < all the elements in bucket2 < all the elements in bucket3 ..., with each bucket left unsorted.
If I understand it correctly, you have around 100 TB of data, or 13,743,895,347,200 unsigned 64-bit integers, that you want to distribute over a number of buckets.
A first step could be to iterate over the input, looking at e.g. the highest 24 bits of each integer and counting how many integers fall into each of the resulting ranges. That will give you a list of 16,777,216 ranges, each with an average count of 819,200, so it may be possible to store the counts in 32-bit unsigned integers, which will take up 64 MB.
You can then use this to create a lookup table that tells you which bucket each of those 16,777,216 ranges goes into. You calculate how many integers are supposed to go into each bucket (the input size divided by the number of buckets), then go over the count array keeping a running total, assigning each range to bucket 1 until the running total is too much for bucket 1; then you assign the ranges to bucket 2, and so on...
There will of course always be a range that has to be split between bucket n and bucket n+1. To keep track of this, you create a second table that stores how many integers in each of these split ranges are supposed to go into bucket n+1.
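A sketch of how those two tables could be built, in Python with hypothetical names (counts is the per-range histogram from the first pass, total the number of input integers):

def build_lookup(counts, num_buckets, total):
    # bucket_of[r]: the bucket that range r is assigned to
    # to_next[r]:   for a split range r, how many of its highest
    #               integers belong to bucket_of[r] + 1 instead
    per_bucket = total // num_buckets
    bucket_of = [0] * len(counts)
    to_next = [0] * len(counts)
    bucket, running = 0, 0
    for r, c in enumerate(counts):
        bucket_of[r] = bucket
        running += c
        # once the running total overshoots the bucket capacity, range r is
        # split; the overshoot spills into the next bucket (this assumes no
        # single range is larger than a whole bucket -- see the caveats below)
        if running >= per_bucket and bucket < num_buckets - 1:
            to_next[r] = running - per_bucket
            running -= per_bucket   # the spilled part counts toward the next bucket
            bucket += 1
    return bucket_of, to_next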
So you now have e.g.:
HIGH 24 BITS   RANGE                 BUCKET   TO BUCKET+1
 0             0 ~ 2^40-1            1        0
 1             2^40 ~ 2*2^40-1       1        0
 2             2*2^40 ~ 3*2^40-1     1        0
 3             3*2^40 ~ 4*2^40-1     1        0
...
16             16*2^40 ~ 17*2^40-1   1        0
17             17*2^40 ~ 18*2^40-1   1        284,724   <- the highest 284,724 go into bucket 2
18             18*2^40 ~ 19*2^40-1   2        0
...
You can now iterate over the input again, and for each integer look at the highest 24 bits and use the lookup table to see which bucket it is supposed to go into. If its range isn't split, you can immediately move the integer into the right bucket. For each split range, you create an ordered list or priority queue that can hold as many integers as need to go into the next bucket; you store only the highest values seen so far in this list or queue: any smaller integer goes straight to the bucket, and when an integer is added to the full list or queue, the smallest value in it is moved to the bucket. At the end, the contents of the list or queue are added to the next bucket.
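Continuing the hypothetical Python sketch, this second pass might look like the following, with one min-heap per split range holding the values that currently qualify as its "highest" (high_bits would be something like lambda v: v >> 40 when using the highest 24 bits of 64-bit integers):

import heapq

def distribute(data, bucket_of, to_next, high_bits, num_ranges):
    buckets = [[] for _ in range(max(bucket_of) + 1)]
    spill = {r: [] for r in range(num_ranges) if to_next[r] > 0}
    for v in data:
        r = high_bits(v)
        if r not in spill:
            buckets[bucket_of[r]].append(v)      # unsplit range: straight in
        elif len(spill[r]) < to_next[r]:
            heapq.heappush(spill[r], v)          # queue not yet full
        elif v > spill[r][0]:
            # v displaces the smallest of the current highest candidates,
            # which then belongs in this range's own bucket
            buckets[bucket_of[r]].append(heapq.heappushpop(spill[r], v))
        else:
            buckets[bucket_of[r]].append(v)
    # whatever remains in each queue are that range's highest values:
    # they go into the next bucket
    for r, heap in spill.items():
        buckets[bucket_of[r] + 1].extend(heap)
    return buckets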
The number of ranges should be as high as the available memory allows, because that minimises the number of integers in split ranges. With the huge input you have, you may need to save the split ranges to disk, and then afterwards look at each of them separately, find the highest x values, and move them to the buckets accordingly.
The complexity of this is N for the first counting pass, then R to iterate over the ranges, then N again for the distribution pass, and for the split ranges something like M log M to sort and M to distribute, so a total of 2N + R + M log M + M. Using a high number of ranges, to keep the number of integers in split ranges low, is probably the best strategy to speed the process up.
Actually, the number of integers M that end up in split ranges depends on the number of buckets B and ranges R, with M = N × B/R, so that e.g. with a thousand buckets and a million ranges, 0.1% of the input would be in split ranges and have to be sorted. (These are averages, depending on the actual distribution.) That makes the total complexity 2N + R + (N×B/R) × log(N×B/R) + N×B/R.
Another example:
Input: N = 13,743,895,347,200 unsigned 64-bit integers
Ranges: 2^32 (using the highest 32 bits of each integer)
Integers per range: 3200 (average)
Count list: 2^32 16-bit integers = 8 GB
Lookup table: 2^32 16-bit integers = 8 GB
Split range table: B 16-bit integers = 2×B bytes
With 1024 buckets, that would mean that B/R = 1/2^22, and there are 1023 split ranges with around 3200 integers each, or around 3,276,800 integers in total; these will then have to be sorted and distributed over the buckets.
With 1,048,576 buckets, that would mean that B/R = 1/2^12, and there are 1,048,575 split ranges with around 3200 integers each, or around 3,355,443,200 integers in total. (More than 65,536 buckets would of course require a lookup table with 32-bit integers.)
(If you find that the total of the counts per range doesn't equal the total size of the input, there has been overflow in the count list, and you should switch to a larger integer type for the counts.)
Let's run through a tiny example: 50 integers in the range 1-100 have to be distributed over 5 buckets. We choose a number of ranges, say 20, so that the ranges are 1-5, 6-10, ... 96-100. The input is:

2 9 14 17 21 30 33 36 44 50 51 57 69 75 80 81 87 94 99
1 9 15 16 21 32 40 42 48 55 57 66 74 76 88 96
5 6 20 24 34 50 52 58 70 78 99
7 51 69
55

Iterating over the input and counting the number of integers in each range gives:

3 4 2 3 3 1 3 2 2 3 5 3 0 4 2 3 1 2 1 3
Then, knowing that each bucket should hold 10 integers, we iterate over the list of counts per range, and assign each range to a bucket:
3 4 2 3 3 1 3 2 2 3 5 3 0 4 2 3 1 2 1 3 <- count/range
1 1 1 1 2 2 2 2 3 3 3 4 4 4 4 5 5 5 5 5 <- to bucket
      2       1     1                   <- to next
When a range has to be split between two buckets, we store the number of integers that should go to the next bucket in a separate table.
We can then iterate over the input again and move all the integers in non-split ranges into their buckets; the integers in split ranges are temporarily moved into separate buckets:
bucket 1: 9 14 2 9 1 15 6 5 7
temp 1/2: 17 16 20
bucket 2: 21 33 30 32 21 24 34
temp 2/3: 36 40
bucket 3: 44 50 48 42 50
temp 3/4: 51 55 52 51 55
bucket 4: 57 75 69 66 74 57 58 70 69
bucket 5: 81 94 87 80 99 88 96 76 78 99
Then we look at the temp buckets one by one, find the x highest integers as indicated in the second table, move them to the next bucket, and move what is left over to the previous bucket:
temp 1/2: 17 16 20 (to next: 2) bucket 1: 16 bucket 2: 17 20
temp 2/3: 36 40 (to next: 1) bucket 2: 36 bucket 3: 40
temp 3/4: 51 55 52 51 55 (to next: 1) bucket 3: 51 51 52 55 bucket 4: 55
And the end result is:
bucket 1: 9 14 2 9 1 15 6 5 7 16
bucket 2: 21 33 30 32 21 24 34 17 20 36
bucket 3: 44 50 48 42 50 40 51 51 52 55
bucket 4: 57 75 69 66 74 57 58 70 69 55
bucket 5: 81 94 87 80 99 88 96 76 78 99
So, out of 50 integers, we've only had to sort groups of 3, 2 and 5 integers.
Actually, you don't need to create a table with the number of integers in the split ranges that should go to the next bucket. You know how many integers are supposed to go into each bucket, so after the initial distribution you can look at how many integers are already in each bucket, and then add the necessary number of (lowest-value) integers from the split range. In the example above, which expects 10 integers per bucket, that would be:
3 4 2 3 3 1 3 2 2 3 5 3 0 4 2 3 1 2 1 3 <- count/range
1 1 1 / 2 2 2 / 3 3 / 4 4 4 4 5 5 5 5 5 <- to bucket
bucket 1: 9 14 2 9 1 15 6 5 7 <- add 1
temp 1/2: 17 16 20 <- 3-1 = 2 go to next bucket
bucket 2: 21 33 30 32 21 24 34 <- add 3-2 = 1
temp 2/3: 36 40 <- 2-1 = 1 goes to next bucket
bucket 3: 44 50 48 42 50 <- add 5-1 = 4
temp 3/4: 51 55 52 51 55 <- 5-4 = 1 goes to next bucket
bucket 4: 57 75 69 66 74 57 58 70 69 <- add 1-1 = 0
bucket 5: 81 94 87 80 99 88 96 76 78 99 <- add 0
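To tie the earlier sketches together, here is this toy example run through the hypothetical build_lookup and distribute functions from above:

data = [2, 9, 14, 17, 21, 30, 33, 36, 44, 50, 51, 57, 69, 75, 80, 81, 87, 94, 99,
        1, 9, 15, 16, 21, 32, 40, 42, 48, 55, 57, 66, 74, 76, 88, 96,
        5, 6, 20, 24, 34, 50, 52, 58, 70, 78, 99,
        7, 51, 69,
        55]
high_bits = lambda v: (v - 1) // 5       # 20 ranges: 1-5, 6-10, ..., 96-100

counts = [0] * 20
for v in data:                           # first pass: count per range
    counts[high_bits(v)] += 1

bucket_of, to_next = build_lookup(counts, 5, len(data))
buckets = distribute(data, bucket_of, to_next, high_bits, 20)
for i, b in enumerate(buckets, 1):
    print(i, sorted(b))                  # 10 integers per bucket, in range order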
The calculation of how much of the input will be in split ranges and need to be sorted, given above as M = N × B/R, is an average for input that is roughly evenly distributed. A slight bias, with more values in a certain part of the input space, will not have much effect, but it would indeed be possible to craft worst-case input to thwart the algorithm.
Let's look again at this example:
Input: N = 13,743,895,347,200 unsigned 64-bit integers
Ranges: 2^32 (using the highest 32 bits of each integer)
Integers per range: 3200 (average)
Buckets: 1,048,576
Integers per bucket: 13,107,200
For a start, if there are ranges that contain more than 2^32 integers, you'd have to use 64-bit integers for the count table, so it would be 32 GB in size, which could force you to use fewer ranges, depending on the available memory.
Also, every range that holds more integers than the target size per bucket is automatically a split range. So if the integers are distributed with a lot of local clusters, you may find that most of the input is in split ranges that need to be sorted.
If you have enough memory to run the first step using 2^32 ranges, then each range spans only 2^32 different values, and you could distribute the split ranges over the buckets using a counting sort (which has linear complexity).
If you don't have the memory to use 2^32 ranges and you end up with problematically large split ranges, you could run the complete algorithm again on the split ranges. Let's say you used 2^28 ranges, expecting each range to hold around 51,200 integers, and you end up with an unexpectedly large split range with 5,120,000,000 integers that need to be distributed over 391 buckets. If you ran the algorithm again for this limited range, you'd have 2^28 ranges (each holding on average 19 integers spanning at most 256 different values) for just 391 buckets, and only a tiny risk of ending up with large split ranges again.
Note: the ranges that have to be split over two or more buckets don't necessarily have to be sorted. You can e.g. use a recursive version of Dijkstra's Dutch national flag algorithm to partition the range into a part with the x smallest values and a part with the largest values. The average complexity of partitioning is linear (when using a random pivot), against the O(N log N) complexity of sorting.
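A sketch of that partitioning idea in Python (a quickselect-style three-way partition with a random pivot; an illustration, not the author's code):

import random

def split_smallest(a, x):
    # rearrange list a in place so that a[:x] holds the x smallest values
    # (unsorted) and a[x:] the rest; average running time is linear
    lo, hi = 0, len(a)
    while lo < x < hi:
        pivot = a[random.randrange(lo, hi)]
        # Dutch national flag: a[lo:lt] < pivot, a[lt:i] == pivot, a[gt:hi] > pivot
        lt, i, gt = lo, lo, hi
        while i < gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                gt -= 1
                a[gt], a[i] = a[i], a[gt]
            else:
                i += 1
        if x < lt:        # boundary lies inside the "< pivot" part
            hi = lt
        elif x > gt:      # boundary lies inside the "> pivot" part
            lo = gt
        else:             # boundary falls within the "== pivot" block: done
            break

values = [51, 55, 52, 51, 55]
split_smallest(values, 4)
print(values[:4], values[4:])   # e.g. [51, 52, 51, 55] [55]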
I want to understand the "median of medians" algorithm on the following example:
We have 45 distinct numbers divided into 9 groups of 5 elements each (one group per column):
48 43 38 33 28 23 18 13 8
49 44 39 34 29 24 19 14 9
50 45 40 35 30 25 20 15 10
51 46 41 36 31 26 21 16 53
52 47 42 37 32 27 22 17 54
The first step is to sort every group (in this case they are already sorted).
The second step is to recursively find the "true" median of the medians (50 45 40 35 30 25 20 15 10); i.e. the set will be divided into 2 groups:
50 25
45 20
40 15
35 10
30
sorting these 2 groups
30 10
35 15
40 20
45 25
50
The medians are 40 and 15 (when the number of elements is even, we take the left median),
so the returned value is 15. However, the "true" median of the medians (50 45 40 35 30 25 20 15 10) is 30. Moreover, there are only 5 elements less than 15, which is far less than the 30% of 45 that is mentioned in Wikipedia, and so T(n) <= T(n/5) + T(7n/10) + O(n) fails.
By the way, in the Wikipedia example I get 36 as the result of the recursion, whereas the true median is 47.
So I think that in some cases this recursion may not return the true median of medians. I want to understand where my mistake is.
The problem is in the step where you say to find the true median of the medians. In your example, you had these medians:
50 45 40 35 30 25 20 15 10
The true median of this data set is 30, not 15. You don't find this median by splitting the groups into blocks of five and taking the median of those medians, but instead by recursively calling the selection algorithm on this smaller group. The error in your logic is assuming that the median of this group is found by splitting the above sequence into the two blocks
50 45 40 35 30
and
25 20 15 10
then finding the median of each block. Instead, the median-of-medians algorithm recursively calls itself on the complete data set 50 45 40 35 30 25 20 15 10. Internally, this will split the group into blocks of five and sort them, etc., but it does so only to determine the partition point for its own partitioning step; and it's in this partitioning step that the recursive call finds the true median of the medians, which in this case will be 30. If you then use 30 as the pivot in the partitioning step of the original algorithm, you do indeed get the very good split that is required.
Hope this helps!
Here is the pseudocode for the median-of-medians algorithm (slightly modified to suit your example). The pseudocode on Wikipedia fails to portray the inner workings of the selectIdx function call.
I've added comments to the code for explanation.
// L is the array on which the median of medians needs to be found.
// k is the position of the element to select. E.g. the first select call
// might look like select(array, ceil(N/2)), where 'array' has length N.
select(L, k)
{
    if (L has 5 or fewer elements) {
        sort L
        return the element in the kth position
    }

    partition L into subsets S[i] of five elements each
    (there will be n/5 subsets total)

    for (i = 1 to n/5) do
        x[i] = select(S[i], 3)    // the median of each group of five

    // n/10 is the middle position among the n/5 group medians;
    // read it as ceil(n/10), so that the left median is taken
    // when the number of medians is even
    M = select({x[i]}, n/10)

    // The code to follow ensures that even if M turns out to be the
    // smallest/largest value in the array, we'll get the kth smallest
    // element in the array.

    // Partition L into three groups based on how their values
    // compare to the median M.
    partition L into L1 < M, L2 = M, L3 > M

    // Compare the expected position k with the length of L1;
    // recurse into L1 if k lies within it.
    if (k <= length(L1))
        return select(L1, k)
    // Check if k falls in L3; recurse accordingly.
    else if (k > length(L1) + length(L2))
        return select(L3, k - length(L1) - length(L2))
    // Otherwise k falls in L2, so simply return M.
    else
        return M
}
Taking your example:
The median-of-medians function will be called over the entire array of 45 elements, with k = 23 (the median position, ceil(45/2)):
median = select({48 49 50 51 52 43 44 45 46 47 38 39 40 41 42 33 34 35 36 37 28 29 30 31 32 23 24 25 26 27 18 19 20 21 22 13 14 15 16 17 8 9 10 53 54}, 23)
The first time M = select({x[i]}, n/10) is called, the array {x[i]} will contain the nine group medians: 50 45 40 35 30 25 20 15 10.
In this call n = 45, so the select call will be M = select({50 45 40 35 30 25 20 15 10}, 5), with 5 = ceil(45/10) the middle position among the 9 medians.
The second time M = select({x[i]}, n/10) is called, the array {x[i]} will contain the numbers 40 20 (select(S[i], 3) applied to the blocks 50 45 40 35 30 and 25 20 15 10 gives 40 and 20).
In this call n = 9, so the call will be M = select({40 20}, 1).
This select call will return and assign the value M = 20.
Now, coming to the point where you had a doubt: we partition the array L around M = 20 with k = 5.
Remember, array L here is: 50 45 40 35 30 25 20 15 10.
The array will be partitioned into L1, L2 and L3 according to the rules L1 < M, L2 = M and L3 > M. Hence:
L1: 10 15
L2: 20
L3: 25 30 35 40 45 50
Since k = 5, it is greater than length(L1) + length(L2) = 3. Hence, the search will be continued with the following recursive call:
return select(L3,k-length(L1)-length(L2))
which translates to:
return select({25 30 35 40 45 50}, 2)
which will return 30 as the result: one further (trivial) level of recursion over these 6 elements yields the element in the 2nd position of the sorted array, which is 30.
Now, M = 30 will be received in the first select function call over the entire array of 45 elements, and the same partitioning logic, this time separating the full array around M = 30, will apply to finally find the element at position k, the median.
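For completeness, here is a small Python transcription of the pseudocode above (a sketch: positions are 1-based, and the group medians are picked with select(S[i], 3) as in the pseudocode):

def select(L, k):
    # return the k-th smallest element of L (1-based), median-of-medians pivot
    if len(L) <= 5:
        return sorted(L)[k - 1]
    groups = [L[i:i + 5] for i in range(0, len(L), 5)]
    # x[i] = select(S[i], 3): the median of each group of five
    # (for a short last group this just picks its 3rd-or-last element)
    medians = [select(g, min(3, len(g))) for g in groups]
    # the middle position among the n/5 medians, i.e. ceil(n/10)
    M = select(medians, (len(medians) + 1) // 2)
    L1 = [x for x in L if x < M]
    L2 = [x for x in L if x == M]
    L3 = [x for x in L if x > M]
    if k <= len(L1):
        return select(L1, k)
    if k > len(L1) + len(L2):
        return select(L3, k - len(L1) - len(L2))
    return M

Running it over the full 45-element array, select(L, 23) returns 32, which is indeed the 23rd smallest of the 45 numbers.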
Phew! I hope I was verbose and clear enough to explain median of medians algorithm.
Imagine I've defined the following name in J:
m =: >: i. 2 4 5
This looks like the following:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20

21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
I want to create a monadic verb of rank 1 that applies to each list in this list of lists. It will alternately double (+:) and increment (>:) the items of a list. If we were to apply this verb to the first row, we'd get 2 3 6 5 10.
It's fairly easy to get a list of booleans which alternate with each item, e.g. 0 1 $~ {:$ m gives us 0 1 0 1 0. I thought, aha! I'll use something like +:`>:@. followed by some expression, but I could never quite get it to work.
Any suggestions?
UPDATE
The following appears to work, but perhaps it can be refactored into something more elegant by a J pro.
poop =: monad define
(($ y) $ 0 1 $~ {:$ y) ((]+:)`(]>:)@.[)"0 y
)
I would use the oblique verb /. with rank 1 (/."1), so that it applies to the successive items of each list in turn.
You can pass a gerund into /. and it applies them in order, extending cyclically.
+:`>: /."1 m
 2
 3
 6
 5
10

12
 8
16
10
20

22
13
26
15
30

32
18
36
20
40


42
23
46
25
50

52
28
56
30
60

62
33
66
35
70

72
38
76
40
80
I spent a long time looking at it, and I believe that I know why ,@ works to recover the shape of the argument.
The shape of the argument to the parenthesized phrase is the shape of the argument passed to it on the right, even though the rank is altered by the " conjunction (well, that is what trace called it; I thought it was an adverb). If , were applied monadically, it would be a ravel, and the result would be a vector, or at least of a lower rank than the input. That is what happens if you take the conjunction out: you get a vector.
So what I believe is happening is that the conjunction is making , act like the dyadic ,, which is called append. Append conforms the shape of what it is appending to the shape of the thing it is appending to. Here it is appending to nothing, but that nothing still has a shape, and so it ends up coercing the intermediate vector back to the shape of the input.
Now I'm probably wrong. But $ ,"0@(+:`>:/.)"1 >: i. 2 4 5 gives 2 4 5 1 1, which I thought sort of proved my case.
(,@(+:`>:/.)"1 a) works, but note that ((* 2 1 $~ $)@(+ 0 1 $~ $)"1 a) would also have worked (and is about 20 times faster on large arrays, in my brief tests).