How can I do joins given a threshold in Hadoop using PIG - hadoop

Let's say I have a dataset with following schema:
ItemName (String) , Length (long)
I need to find items that are duplicates based on their length. That's pretty easy to do in PIG:
raw_data = LOAD...dataset
grouped = GROUP raw_data by length
items = FOREACH grouped GENERATE COUNT(raw_data) as count, raw_data.name;
dups = FILTER items BY count > 1;
STORE dups....
The above finds exact duplicates. Given the set below:
a, 100
b, 105
c, 100
It will output 2, (a,c)
Now I need to find duplicates using a threshold. For example, a threshold of 5 would mean that two items match if their lengths differ by at most 5. So the output should look like:
3, (a,b,c)
Any ideas how I can go about doing this?
It is almost like I want PIG to use a UDF as its comparator when it is comparing records during its join...

I think the only way to do what you want is to load the data into two tables and do a cartesian join of the data set onto itself, so that each value can be compared to each other value.
Pseudo-code:
r1 = load dataset
r2 = load dataset
rcross = cross r1, r2
rcross is a cartesian product that will allow you to check the difference in length between each pair.
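To make the cross-join idea concrete, here is a minimal Python sketch (not Pig) of the comparison that would follow the CROSS. The function name near_pairs and the sample data layout are assumptions for illustration; it pairs every item with every other item and keeps the pairs whose lengths are within the threshold.

```python
from itertools import combinations

def near_pairs(items, threshold):
    """Return all unordered pairs of item names whose lengths differ by at most threshold.

    items is a list of (name, length) tuples. Every pair is compared,
    which is what filtering a cross join of the data set with itself amounts to.
    """
    return [
        (n1, n2)
        for (n1, l1), (n2, l2) in combinations(items, 2)
        if abs(l1 - l2) <= threshold
    ]

data = [("a", 100), ("b", 105), ("c", 100)]
print(near_pairs(data, 5))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Like the cross join itself, this does quadratic work in the number of items, so it is only practical for small data sets.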

I once solved a similar problem and ended up with a crazy, dirty solution.
It is based on the following lemma:
If |a - b| < r, then there exists an integer x with 0 <= x < r such that
floor((a+x)/r) = floor((b+x)/r)
(From here on I mean integer division only and omit the floor() function, i.e. 5/2=2.)
The lemma is easy to see (pick x so that min(a, b) + x is a multiple of r), so I'm not going to prove it in full here.
Based on this lemma you can do the following join:
RESULT = JOIN A by A.len / r, B By B.len / r
This returns some, but not all, of the pairs with |A.len - B.len| < r.
But if you do this r times, once for each shift:
RESULT0 = JOIN A by A.len / r, B By (B.len / r)
RESULT1 = JOIN A by (A.len+1) / r, B By (B.len+1) / r
...
RESULT{R-1} = JOIN A by (A.len+r-1) / r, B By (B.len+r-1) / r
you will get all the pairs you need. Of course you will also get more rows than you need, since the same pair can show up in several of the joins, but as I already said, it's a dirty solution (i.e. it's not optimal, but it works).
The other big disadvantage of this solution is that the JOIN statements have to be generated dynamically, and there will be many of them when r is large.
Still, it works if you know r and it is fairly small (like r=6 in your case).
Hope it helps
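The shifted-bucket trick can be checked outside Pig. Below is a minimal Python sketch, assuming integer lengths; the name bucket_join is made up for illustration. It runs one equi-join per shift x on the key (length + x) // r and collects the pairs into a set, which also removes the duplicates produced by the repeated joins.

```python
from collections import defaultdict

def bucket_join(items, r):
    """Find all pairs with |len1 - len2| < r using r shifted equi-joins.

    items is a list of (name, length) tuples with integer lengths.
    Two items land in the same bucket for a shift x only if their lengths
    differ by less than r, so no extra filtering is needed; the set only
    removes pairs that were found under more than one shift.
    """
    pairs = set()
    for x in range(r):
        buckets = defaultdict(list)
        for name, length in items:
            buckets[(length + x) // r].append(name)
        for names in buckets.values():
            for i, n1 in enumerate(names):
                for n2 in names[i + 1:]:
                    pairs.add(tuple(sorted((n1, n2))))
    return pairs

data = [("a", 100), ("b", 105), ("c", 100), ("d", 111)]
print(sorted(bucket_join(data, 6)))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Here r=6 corresponds to a threshold of 5, and item d is left out because it is exactly 6 away from b.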

Related

Vlookup?? query??? Suggestions?

I have the following Google Sheet.
[1]: [https://docs.google.com/spreadsheets/d/1q9I7XhyEGKeAk93mDyiXpP95Kc9qmiPfvnhEOcTwbIU/edit#gid=0]
I can edit the fields in green (A16, B16), with A16 being a data validation drop-down list and B16 being a typed-in number.
Using vlookup, I can get the Base Value without any issues.
=VLOOKUP(A16,A2:B12,2, false)
What I cannot figure out is how to get the factor value. For example, I have selected H, which is row #9. As the size value is 1.62, which is greater than the Mid Size but less than the Max Size for that row, I want to return the Factor 2 value of 1.5.
I have tried multiple VLOOKUP / QUERY formulas, but none of them work.
The Sum is just Base * Factor to give a final value. Ideally, I would select the lookup, enter a size and it only shows the Sum value.
This definitely is NOT the most efficient way, but hopefully it works:
=IF(A16=A2,IF(B16>B2,"1.62",C2),IF(A16=A3,IF(B16>B3,"1.62",C3),IF(A16=A4,IF(B16>B4,"1.62",C4),IF(A16=A5,IF(B16>B5,"1.62",C5),IF(A16=A6,IF(B16>B6,"1.62",C6),IF(A16=A7,IF(B16>B7,"1.62",C7),IF(A16=A8,IF(B16>B8,"1.62",C8),IF(A16=A9,IF(B16>B9,"1.62",C9),IF(A16=A10,IF(B16>B10,"1.62",C10),IF(A16=A11,IF(B16>B11,"1.62",C11),IF(A16=A12,IF(B16>B12,"1.62",C12),"ERROR")))))))))))
Probably not the best way to handle this, but it does work, partly because my base factor value is 1, so multiplying the value by 1 does not change anything:
=(IFNA(QUERY(A2:H12,"select D where (A = '"&A16&"' AND (C <= "&B16&" AND E > "&B16&")) ORDER BY A"),1)*IFNA(QUERY(A2:H12,"select F where (A = '"&A16&"' AND (E < "&B16&" AND G >= "&B16&")) ORDER BY A"),1)*IFNA(QUERY(A2:H12,"select H where (A = '"&A16&"' AND (G <= "&B16&")) ORDER BY A"),1))
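The tiered lookup that these formulas approximate can be sketched outside the spreadsheet. This is a minimal Python sketch under assumptions: the row is reduced to two size boundaries (Mid Size, Max Size) and three factors, and every number except the size 1.62 and the Factor 2 value 1.5 is hypothetical, since the sheet's actual values are not shown here. The name pick_factor is also made up.

```python
from bisect import bisect_right

def pick_factor(size, breakpoints, factors):
    """Return the factor for the tier that size falls into.

    breakpoints: ascending tier boundaries, e.g. (mid_size, max_size)
    factors: one more entry than breakpoints, e.g. (factor1, factor2, factor3)
    A size equal to a boundary is assigned to the higher tier.
    """
    if len(factors) != len(breakpoints) + 1:
        raise ValueError("need exactly one more factor than breakpoints")
    return factors[bisect_right(breakpoints, size)]

# Hypothetical row: only the size 1.62 and the Factor 2 value 1.5 come from the question.
base = 10.0
breakpoints = (1.0, 2.0)    # assumed Mid Size and Max Size
factors = (1.0, 1.5, 2.0)   # assumed Factor 1, Factor 2, Factor 3
print(base * pick_factor(1.62, breakpoints, factors))  # 15.0
```

Unlike the QUERY version, which mixes strict and non-strict comparisons, this sketch assigns every size to exactly one tier.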

Creating a list of all possible combinations from a set of items for n combination sizes

Apologies in advance if the wording of my question is confusing. I've been having lots of trouble trying to explain it.
Basically I'm trying to write an algorithm that will take in a set of items, for example, the letters in the alphabet and a combination size limit (1,2,3,4...) and will produce all the possible combinations for each size limit.
So, for example, let's say our set of items was the chars A,B,C,D,E and my combination limit was 3; the result I would get would be:
A,
AB, AC, AD, AE,
ABC, ABD, ABE, ACD, ACE, ADE,
B,
BC, BD, BE,
BCD, BCE, BDE,
C,
CD, CE,
CDE,
D,
DE,
E
Hopefully that makes sense.
For the context, I want to use this for my game to generate army compositions with limits to how many different types of units they will be composed of. I don't want to have to do it manually!
Could I please get some advice?
A recursion can do the job. The idea is to choose a letter, print it as a possibility and combine it with all letters after it:
#include <iostream>
#include <string>
using namespace std;

string letters[] = {"A", "B", "C", "D", "E"};
int alphabetSize = 5;
int combSizeLim = 3;

// Print the current combination, then extend it with each later letter.
void gen(int index = 0, int combSize = 0, string comb = "") {
    if (combSize > combSizeLim) return;
    if (!comb.empty()) cout << comb << endl;  // skip the empty starting combination
    for (int i = index; i < alphabetSize; i++) {
        gen(i + 1, combSize + 1, comb + letters[i]);
    }
}

int main() {
    gen();
    return 0;
}
OUTPUT:
A
AB
ABC
ABD
ABE
AC
ACD
ACE
AD
ADE
AE
B
BC
BCD
BCE
BD
BDE
BE
C
CD
CDE
CE
D
DE
E
Here's a simple recursive solution. (The recursion depth is limited to the length of the set, and that cannot be too big or there will be too many combinations. But if you think it will be a problem, it's not that hard to convert it to an iterative solution by using your own stack, again of the same size as the set.)
I'm using a subset of Python as pseudo-code here. In real Python, I would have written a generator instead of passing collection through the recursion.
def gen_helper(collection, elements, curr_element, max_elements, prefix):
    if curr_element == len(elements) or max_elements == 0:
        collection.append(prefix)
    else:
        gen_helper(collection, elements, curr_element + 1,
                   max_elements - 1, prefix + [elements[curr_element]])
        gen_helper(collection, elements, curr_element + 1,
                   max_elements, prefix)

def generate(elements, max_elements):
    collection = []
    gen_helper(collection, elements, 0, max_elements, [])
    return collection
The working of the recursive function (gen_helper) is really simple. It is given a prefix of elements already selected, the index of an element to consider, and the number of elements still allowed to be selected.
If it can't select any more elements, it must choose to just add the current prefix to the accumulated result. That will happen if:
The scan has reached the end of the list of elements, or
The number of elements allowed to be added has reached 0.
Otherwise, it has precisely two options: either it selects the current element or it doesn't. If it chooses to select, it must continue the scan with a reduced allowable count (since it has used up one possible element). If it chooses not to select, it must continue the scan with the same count.
Since we want all possible combinations (as opposed to, say, a random selection of valid combinations), we need to make both choices, one after the other.
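For comparison, here is a minimal sketch of what a generator version might look like in real Python. The name gen_combinations and its parameters are illustrative, not taken from the answers above; it yields each non-empty combination as soon as it is built, in the same order as the printed OUTPUT listing.

```python
def gen_combinations(elements, max_elements, start=0, prefix=()):
    """Yield every non-empty combination of at most max_elements items.

    Each combination is a tuple that keeps the original element order.
    start is the index of the first element that may still be added.
    """
    if prefix:
        yield prefix
    if len(prefix) == max_elements:
        return  # size limit reached, do not extend further
    for i in range(start, len(elements)):
        yield from gen_combinations(elements, max_elements, i + 1,
                                    prefix + (elements[i],))

result = ["".join(c) for c in gen_combinations("ABCDE", 3)]
print(result[:5])   # ['A', 'AB', 'ABC', 'ABD', 'ABE']
print(len(result))  # 25
```

Because it is a generator, the caller can stop early or stream the combinations without holding them all in memory.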

Check if a vector lies in the span of a subset of columns of a matrix in Sage

I'm new to programming with Sage. I have a rectangular R*C matrix M (R rows and C columns), and the rank of M is (possibly) smaller than both R and C. I want to check whether a target vector T is in the span of a subset of the columns of M. I have written the following code in Sage (I haven't included the whole code because the way I get M and T is rather cumbersome). I just want to check if the code does what I want.
Briefly, this is what my code is trying to do: M is my given matrix. I first check that T is indeed in the span of the columns of M (the first if condition). If it is, I trim M (which has C columns) down to a matrix M1 which has exactly rank(M) columns (this is what the first while loop does). After that, I keep removing columns one by one to check whether the remaining columns still contain T in their span (this is the second while loop). In the second while loop, I first remove a column from M2 (which is essentially a copy of M1) and call the resulting matrix M3. To M3 I augment the vector T and check whether the rank decreases. Since T was already in the span of M2, rank([M2 T]) should be the same as rank(M2). If removing column c and augmenting with T doesn't decrease the rank, then I know that c is not necessary to generate T. This way I keep only those columns that are necessary to generate T.
It does return correct answers for the examples I tried, but I am going to run this code on a matrix whose entries vary a lot in magnitude (say the maximum is as large as 20^20 and the minimum is 1), and typically the matrix dimensions could go up to 300. I'm planning to run it over a set of a few hundred test cases over the weekend. It would be really helpful if you could tell me whether something looks fishy or wrong, e.g. will I run into precision errors? How should I modify my code so that it works for all values/ranges mentioned above? Also, if there is any way to speed up my code (or write the same thing in a shorter/nicer way), I'd like to know.
R = 155
C = 167
T = vector(QQ, R)
M1 = matrix(ZZ, R, C)
M1 = M
C1 = C
i2 = 0
if rank(M.augment(T)) == rank(M):
    print("The rank of M is")
    print(rank(M))
    while i2 < C1:
        if rank(M1.delete_columns([i2])) == rank(M1):
            M1 = M1.delete_columns([i2])
            C1 = C1 - 1
        else:
            i2 = i2 + 1
    C2 = M1.ncols()
    print("The number of columns in the trimmed down matrix M1 is")
    print(C2)
    i3 = 0
    M2 = M1
    print("The rank of M1 which is now also the rank of M2 is")
    print(rank(M2))
    while i3 < C2:
        M3 = M2.delete_columns([i3])
        if rank(M3.augment(T)) < rank(M2):
            M2 = M3
            C2 = C2 - 1
        else:
            i3 = i3 + 1
print("Rank of matrix M is")
print(M.rank())
If I wanted to use Sage to decide whether a vector T is in the image of a matrix M1 constructed from some subset of the columns of another matrix M, I would do this:
M1 = M.matrix_from_columns([list of indices of the columns to use])
T in M1.column_space()
or use a while loop to modify M1 each time, as you do. (But I think T in M1.column_space() should work better than testing equality of ranks.)
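On the precision concern, the membership test can be illustrated with exact arithmetic outside Sage. The following is a minimal pure-Python sketch, not Sage code: it uses fractions.Fraction so that entries as large as 20^20 are handled without rounding, and it compares the rank of the selected columns with the rank after appending T. The names exact_rank and in_column_span are made up for this example.

```python
from fractions import Fraction

def exact_rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rk = 0
    for col in range(n_cols):
        # Find a row at or below position rk with a non-zero entry in this column.
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rk][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk

def in_column_span(matrix, column_indices, target):
    """True if target is a linear combination of the chosen columns of matrix."""
    sub = [[row[j] for j in column_indices] for row in matrix]
    augmented = [row + [t] for row, t in zip(sub, target)]
    return exact_rank(sub) == exact_rank(augmented)

M = [[20**20, 1], [1, 1]]
T = [20**20 + 1, 2]
print(in_column_span(M, [0, 1], T))  # True
print(in_column_span(M, [0], T))     # False
```

This plain elimination is far slower than Sage's built-in exact linear algebra over ZZ and QQ, so it is meant only to show the idea, not to replace M1.column_space().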

Solving State Space Response with Variable A matrix

I am trying to verify my RK4 code and have a state space model to solve the same system. I have a 14 state system with initial conditions, but the conditions change with time (each iteration). I am trying to formulate A,B,C,D matrices and use sys and lsim in order to compile the results for all of my states for the entire time span. I am trying to do it similar to this:
for t=1:1:5401
y1b=whatever
.
.
y14b = whatever
y_0 = vector of ICs
A = (will change with time)
B = (1,14) with mostly zeros and 3 ones
C = ones(14,1)
D = 0
Q = eye(14)
R = eye(1)
k = lqr(A,B,C,D)
A_bar = A - B*k
sys = ss(A_bar,B,C,D)
u = zeros(14,1)
sto(t,14) = lsim(sys,u,t,y_0)
then solve for new y1b-y14b from outside function
end
In other words I am trying to use sto(t,14) to store each iteration of lsim and end up with a matrix of all of my states for each time step from 1 to 5401. I keep getting this error message:
Error using DynamicSystem/lsim (line 85)
In time response commands, the time vector must be real, finite, and must contain
monotonically increasing and evenly spaced time samples.
and
Error using DynamicSystem/lsim (line 85)
When simulating the response to a specific input signal, the input data U must be a
matrix with as many rows as samples in the time vector T, and as many columns as
input channels.
Any helpful input is greatly appreciated. Thank you
For lsim to work, t has to contain at least 2 points.
Also, the sizes of B and C are flipped. You have 14 states, 1 input and 1 output, so B should be 14-by-1 and C should be 1-by-14, and u in lsim should be length(t)-by-1.
Lastly, it looks like you are trying to pass all the initial conditions to lsim at once with y_0, when you only want the part relevant to the current iteration.
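s = [t-1 t];              % two time points, so lsim has a valid time vector
u = [0; 0];               % one input channel, one row per time sample
if t==1
    y0 = y_0;
else
    y0 = sto(t-1,1:14);   % start from the states stored in the previous iteration
end
[~, ~, x] = lsim(sys, u, s, y0);   % third output of lsim is the state trajectory
sto(t,1:14) = x(end,:);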
I'm not sure I understood your question correctly, but I hope this helps.
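The step-by-step pattern above (simulate one short interval, then use the final state as the next initial condition) can be shown on a toy problem. This is a minimal Python sketch, assuming a scalar system x' = a_k * x whose coefficient is held constant within each step; it stands in for the 14-state model, and the name simulate_piecewise is made up.

```python
import math

def simulate_piecewise(a_of_step, x0, n_steps, dt=1.0):
    """Simulate x' = a_k * x step by step, carrying the state across steps.

    a_of_step(k) returns the coefficient used during step k (assumed constant
    within the step). The final state of each step is the initial condition
    of the next one. Returns the list of states, starting with x0.
    """
    history = [x0]
    x = x0
    for k in range(n_steps):
        a = a_of_step(k)
        x = x * math.exp(a * dt)  # exact response over one step of length dt
        history.append(x)
    return history

# Coefficient that changes every iteration, like the time-varying A matrix.
states = simulate_piecewise(lambda k: -0.1 * k, x0=2.0, n_steps=3)
print(states[-1])  # close to 2.0 * exp(-0.3)
```

A hand-written RK4 integrator can be checked against this kind of exact per-step solution when the dynamics are held constant within each step.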

Quickest way to get elements given matrix of indices in MATLAB

I have an N by 2 matrix A of indices of elements I want to get from a 2D matrix B, each row of A being the row and column index of an element of B that I want to get. I would like to get all of those elements stacked up as an N by 1 vector.
B is a square matrix, so I am currently using
N = size(B,1);
indices = arrayfun(@(i) A(i,1) + N*(A(i,2)-1), 1:size(A,1));
result = B(indices);
but, while it works, this is proving to be a huge bottleneck and I need to speed up the code in order for it to be useful.
What is the fastest way I can achieve the same result?
How about
indices = [1 N] * (A'-1) + 1;
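Note that B(A(:,1), A(:,2)) does not do what you want here: it returns the N-by-N submatrix of every row/column combination rather than one element per row of A. If you would rather not compute the linear indices by hand, try sub2ind, e.g. sub2ind(size(B), A(:,1), A(:,2)).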
Also, you can look at how you generated A in the first place. If A came from the output of find, for example, it is faster to use logical indexing instead. That is,
B( B == 2 )
is faster than finding the row and column indices that satisfy the condition and then indexing into B.
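The index arithmetic behind the one-line answer can be checked with a small example. This is a minimal Python sketch, assuming 1-based (row, column) pairs and column-major storage as in MATLAB; the names linear_indices and gather are made up for illustration.

```python
def linear_indices(subscripts, n_rows):
    """1-based column-major linear index for each 1-based (row, col) pair."""
    return [r + n_rows * (c - 1) for r, c in subscripts]

def gather(B, subscripts):
    """Pick one element of B per (row, col) pair via column-major linear indices."""
    n_rows = len(B)
    n_cols = len(B[0])
    # Flatten B column by column, the way MATLAB stores a matrix.
    flat = [B[r][c] for c in range(n_cols) for r in range(n_rows)]
    return [flat[k - 1] for k in linear_indices(subscripts, n_rows)]

B = [[11, 12, 13],
     [21, 22, 23],
     [31, 32, 33]]
A = [(1, 1), (2, 3), (3, 2)]
print(linear_indices(A, 3))  # [1, 8, 6]
print(gather(B, A))          # [11, 23, 32]
```

The expression r + n_rows * (c - 1) is the same quantity that [1 N] * (A'-1) + 1 computes for all rows of A at once.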
