I have a large data frame (27 million rows and 18 columns). It contains many duplicated rows, which I can drop by using, for example, the distinct function from dplyr. However, this keeps only the first occurrence of each unique combination, whereas I want to keep the first row of every consecutive run of duplicates (see the desired output further down). Here is a reproducible example:
library(data.table)
library(dplyr)
df<-setNames(data.frame(matrix(ncol = 4, nrow = 10)), c("code", "var1", "var2", "var3"))
df$code<-c("101", "102", "103", "104", "105", "106", "107", "108", "109", "110")
df$var1<-c(1, 1, 1, 2, 2, 1, 1, 2, 3,3)
df$var2<-c(1, 1,1, 2, 2, 1,1,2, 3,3 )
df$var3<-c(1, 1,1, 2, 2, 1,1,2, 3,3 )
df<-as.data.table(df)
df<-df %>% distinct(var1, var2, var3, .keep_all=T)
## which gives:
   code var1 var2 var3
1:  101    1    1    1
2:  104    2    2    2
3:  109    3    3    3
## however, I want:
  code var1 var2 var3
1  101    1    1    1
2  104    2    2    2
3  106    1    1    1
4  108    2    2    2
5  109    3    3    3
A data.table solution would be great due to the size of the original dataframe.
I have a solution using data.table, but it may be less than optimal for your dataset given its size because it creates some extra columns.
Although in your example data var1, var2, and var3 were always the same, I'm going to assume here that they do not have to be.
library(data.table)
library(zoo) # for rollapplyr function
df<-setNames(data.frame(matrix(ncol = 4, nrow = 10)), c("code", "var1", "var2", "var3"))
df$code<-c("101", "102", "103", "104", "105", "106", "107", "108", "109", "110")
df$var1<-c(1, 1, 1, 2, 2, 1, 1, 2, 3,3)
df$var2<-c(1, 1,1, 2, 2, 1,1,2, 3,3 )
df$var3<-c(1, 1,1, 2, 2, 1,1,2, 3,3 )
df <- as.data.table(df)
The first step is to create, for each of the three variables, a column that signals whether the value in row i differs from the value in row i-1. We do this with the diff function, which computes the difference between consecutive elements of a vector, applied to each column through rollapplyr with a window of 2. The first row has no predecessor, so fill = 1 marks it as changed.
df[,dif1 := abs(rollapplyr(var1, 2, function(x){diff(x,lag = 1)}, fill = 1)),]
df[,dif2 := abs(rollapplyr(var2, 2, function(x){diff(x,lag = 1)}, fill = 1)),]
df[,dif3 := abs(rollapplyr(var3, 2, function(x){diff(x,lag = 1)}, fill = 1)),]
If any of those variables has changed from row i-1 then it is not a duplicate. Because we took the absolute value of the change spit out by the diff function, this means any value above zero signifies a change. We can take the sum of the change across three variables, and then filter to those rows whose change is above zero.
df[,drop := sum(dif1, dif2, dif3), by = code]
df[drop>0, .(code, var1, var2, var3),]
   code var1 var2 var3
1:  101    1    1    1
2:  104    2    2    2
3:  106    1    1    1
4:  108    2    2    2
5:  109    3    3    3
Again, not sure how fast it will be. I ran this on 1e4 rows and it took 0.829 sec and again on 1e6 rows and it took 51.118 sec, so it seems to scale okay.
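For anyone who wants to check the logic independently of data.table, here is a minimal Python sketch of the same idea: a row is kept whenever any of var1, var2, var3 differs from the previous row. The function name drop_consecutive_duplicates and the tuple layout are my own assumptions for illustration, and this is not meant as a replacement for the R code above.

```python
def drop_consecutive_duplicates(rows, key):
    """Keep the first row of each consecutive run of rows with equal key(row)."""
    kept = []
    previous = object()  # sentinel that never compares equal to a real key
    for row in rows:
        current = key(row)
        if current != previous:  # something changed compared with the previous row
            kept.append(row)
        previous = current
    return kept


# Same data as the reproducible example: (code, var1, var2, var3)
codes = ["101", "102", "103", "104", "105", "106", "107", "108", "109", "110"]
var = [1, 1, 1, 2, 2, 1, 1, 2, 3, 3]
rows = [(c, v, v, v) for c, v in zip(codes, var)]

# Compare only var1..var3 (everything after the code column)
result = drop_consecutive_duplicates(rows, key=lambda r: r[1:])
for row in result:
    print(*row)
```

The kept codes are 101, 104, 106, 108 and 109, which matches the desired output in the question.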
I'm hoping there is SPSS syntax that I can use to randomly select a number from among a couple of variables. For example: the data lists the ages of respondent's children in four variables - Age1 Age2 Age3 Age4
Resp 1: 3 6 8
Resp 2: 2 10
Resp 3: 4
I want to create a variable that stores a randomly selected age for each respondent - something like:
Resp 1: 6
Resp 2: 2
Resp 3: 4
The code I'm using at the moment:
COUNT kids=age1 to age4 (1 thru 16).
COMPUTE rand=RND(RV.UNIFORM(1, kids),1).
DO REPEAT
x1=age1 to age4
/x2=1 to 4.
IF (rand=x2) random_age=x1.
END REPEAT.
Here is my suggested code for the task.
First creating some sample data to demonstrate on:
data list list/id age1 to age4 (5f2).
begin data
1, 4, 5, 6, 7
2, 4, 5, 6,
3, 6, 7,,
4, 8,,,
5, 5, 6, 7,
6, 10,,,
end data.
Now to randomly select one of the ages:
compute numages=4-nmiss(age1 to age4).
compute SelectThis = rnd(uniform(numages)+.5).
do repeat ag=age1 to age4 /ind=1 to 4.
if SelectThis=ind SelectedRandAge=ag.
end repeat.
exe.
Well, here's my attempt for the time being:
data list list /age1 to age4.
begin data.
10 9 5 8
3
13 15
1 4 5
4 7 8 2
end data.
count valid=age1 to age4 (lo thru hi).
compute s=trunc(1+uniform(valid)).
vector age=age1 to age4.
compute myvar=age(s).
list age1 to age4 myvar.
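Both answers follow the same recipe: count the non-missing ages, draw a random position between 1 and that count, and read the age at that position. As a cross-check outside SPSS, here is a minimal Python sketch of that recipe. The name pick_random_age, the use of None for a missing age, and the fixed seed are assumptions made only for this illustration.

```python
import random


def pick_random_age(ages, rng):
    """Return one non-missing age chosen uniformly at random, or None if all are missing."""
    valid = [a for a in ages if a is not None]
    return rng.choice(valid) if valid else None


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable

# The three respondents from the question; None marks a missing age
respondents = [
    [3, 6, 8, None],
    [2, 10, None, None],
    [4, None, None, None],
]
picks = [pick_random_age(ages, rng) for ages in respondents]
print(picks)
```

A respondent with a single child always gets that child's age, and a respondent with no valid ages gets a missing value, which is also what the SPSS versions produce.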
I came across this question in a recent interview:
Given an array A of length N, we are supposed to answer Q queries. Each query has the following form:
Given x and k, build another array B of the same length such that B[i] = A[i] ^ x, where ^ is the XOR operator. Sort B in descending order and return B[k] (k is 1-based, as the example below shows).
Input format :
First line contains the integer N
Second line contains N integers denoting array A
Third line contains Q, i.e. the number of queries
Each of the next Q lines contains the space-separated integers x and k
Output format :
Print respective B[k] value each on new line for Q queries.
e.g.
for input :
5
1 2 3 4 5
2
2 3
0 1
output will be :
3
5
For first query,
A = [1, 2, 3, 4, 5]
For query x = 2 and k = 3, B = [1^2, 2^2, 3^2, 4^2, 5^2] = [3, 0, 1, 6, 7]. Sorting in descending order B = [7, 6, 3, 1, 0]. So, B[3] = 3.
For second query,
B will be the same as A because x = 0. Sorted in descending order, B = [5, 4, 3, 2, 1], so B[1] = 5.
I have no idea how to solve such problems. Thanks in advance.
This is solvable in O(N + Q). For simplicity I assume you are dealing with positive or unsigned values only, but you can probably adjust this algorithm also for negative numbers.
First you build a binary tree. The left edge stands for a bit that is 0, the right edge for a bit that is 1. In each node you store how many numbers are in this bucket. This can be done in O(N), because the number of bits is constant.
Because this is a little hard to explain, here is what the tree looks like for the 3-bit numbers [0, 1, 4, 5, 7], i.e. [000, 001, 100, 101, 111]:
              *
           /     \
        2           3         2 numbers have first bit 0 and 3 numbers have first bit 1
      /   \       /   \
     2     0     2     1      of the 2 numbers with first bit 0, 2 have 2nd bit 0, ...
    / \         / \   / \
   1   1       1   1 0   1    of the 2 numbers with 1st and 2nd bit 0, 1 has 3rd bit 0, ...
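To answer a single query you walk down the tree using the bits of x, from the most significant bit to the least significant. Here k is a 0-based index into B sorted in ascending order; for the question's convention (1-based k, descending order) use N - k instead. At each node there are 4 possibilities, looking at the current bit b of x and building the answer a, which is initially 0: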
b = 0 and k < the value stored in the left child of the current node (the 0-bit branch): current node becomes left child, a = 2 * a (shifting left by 1)
b = 0 and k >= the value stored in the left child: current node becomes right child, k = k - value of left child, a = 2 * a + 1
b = 1 and k < the value stored in the right child (the 1-bit branch, because of the xor operation everything is flipped): current node becomes right child, a = 2 * a
b = 1 and k >= the value stored in the right child: current node becomes left child, k = k - value of right child, a = 2 * a + 1
This is O(1), again because the number of bits is constant. Therefore the overall complexity is O(N + Q).
Example: [0, 1, 4, 5, 7] i.e. [000, 001, 100, 101, 111], k = 3, x = 3 i.e. 011
First bit is 0 and k >= 2, therefore we go right, k = k - 2 = 3 - 2 = 1 and a = 2 * a + 1 = 2 * 0 + 1 = 1.
Second bit is 1 and k >= 1, therefore we go left (inverted because the bit is 1), k = k - 1 = 0, a = 2 * a + 1 = 3
Third bit is 1 and k < 1, so the solution is a = 2 * a + 0 = 6
Control: [000, 001, 100, 101, 111] xor 011 = [011, 010, 111, 110, 100] i.e. [3, 2, 7, 6, 4] and in order [2, 3, 4, 6, 7], so indeed the number at index 3 is 6 and the solution (always talking about 0-based indexing here).
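The walk described above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: the 31-bit width, the list-based node storage and the names build_trie, kth_smallest_xor and query_descending are all assumptions of mine, and the values are assumed to be non-negative. The last helper converts the question's convention (1-based k, descending order) into the 0-based ascending index used by the walk, namely N - k.

```python
BITS = 31  # assumed width: values are non-negative and below 2**31


def build_trie(values, bits=BITS):
    """Binary trie as parallel lists: child[node][bit] is a node id, count[node] a subtree size."""
    child = [[-1, -1]]
    count = [0]
    for v in values:
        node = 0
        count[0] += 1
        for shift in range(bits - 1, -1, -1):  # most significant bit first
            bit = (v >> shift) & 1
            if child[node][bit] == -1:
                child[node][bit] = len(child)
                child.append([-1, -1])
                count.append(0)
            node = child[node][bit]
            count[node] += 1
    return child, count


def kth_smallest_xor(trie, x, k, bits=BITS):
    """0-based k-th smallest value of (a ^ x) over all stored a; requires 0 <= k < N."""
    child, count = trie
    node, answer = 0, 0
    for shift in range(bits - 1, -1, -1):
        b = (x >> shift) & 1
        same = child[node][b]  # branch where the result bit (a ^ x) is 0
        same_count = count[same] if same != -1 else 0
        if k < same_count:
            node = same
            answer = answer << 1
        else:
            k -= same_count
            node = child[node][b ^ 1]  # branch where the result bit is 1
            answer = (answer << 1) | 1
    return answer


def query_descending(trie, n, x, k):
    """Question's convention: 1-based k into B sorted in descending order."""
    return kth_smallest_xor(trie, x, n - k)


A = [1, 2, 3, 4, 5]
trie = build_trie(A)
print(query_descending(trie, len(A), 2, 3))  # first query of the question
print(query_descending(trie, len(A), 0, 1))  # second query of the question

small = build_trie([0, 1, 4, 5, 7], bits=3)
print(kth_smallest_xor(small, 3, 3, bits=3))  # the worked example above
```

The two queries from the question print 3 and 5, and the worked example prints 6. Each insertion and each query touches one node per bit, which is where the O(N + Q) bound for a fixed bit width comes from.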
I would like to sum all the values in my 2nd column that share the same value in the 1st column.
So my matrix might look like this:
column 1: [1 1 1 2 2 3 3 3 3 4 5 5]
column 2: [3 5 8 2 6 4 0 6 1 0 2 6]
Now, for the value 1 in the 1st column, I would like the sum of 3, 5 and 8 from the 2nd column, and likewise for 2, 3 and so on in the 1st column.
Like this for example:
[1 2 3 4 5],
[16 8 11 0 8]
I'm thankful for any suggestions!
Sum all values when values are equal :
Just to init :
a = [1 1 1 2 2 3 3 3 3 4 5 5 ; 3 5 8 2 6 4 0 6 1 0 2 6];
a = a.';
Let's go :
n = 0;
for i = 1:size(a,1)
    if a(i,1) == a(i,2)
        n = n + a(i,1);
    end
end
n
For the second question :
mat = zeros(max(a(:,1)), 2);   % one row per key: [key, sum]
for j = 1:max(a(:,1))
    n = 0;
    for i = 1:size(a,1)
        if j == a(i,1)
            n = n + a(i,2);
        end
    end
    mat(j,1) = j;
    mat(j,2) = n;
end
mat
Result :
mat =
1 16
2 8
3 11
4 0
5 8
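The nested loops above scan the whole matrix once per key. The same grouped sum can also be accumulated in a single pass, sketched here in Python as a language-neutral illustration. The variable names are my own, and the dictionary stands in for the two-column result matrix.

```python
keys = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]    # 1st column
values = [3, 5, 8, 2, 6, 4, 0, 6, 1, 0, 2, 6]  # 2nd column

sums = {}
for key, value in zip(keys, values):
    # add the value to the running total of its key (starting from 0)
    sums[key] = sums.get(key, 0) + value

for key in sorted(sums):
    print(key, sums[key])
```

This prints the same key/sum pairs as the mat result above: 16, 8, 11, 0 and 8 for the keys 1 to 5.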
Question :
Given a computer on which the following memory accesses were made (from left to right):
5, 10, 2, 34, 18, 4, 22, 21, 11, 2
* Decide whether each access is a HIT or a MISS for a 4-way set-associative cache
whose total size is 32 blocks of 2 bytes each.
* When you're done, write the final map of the cache.
My answer :
Each set holds 4 blocks (one per way), hence:
(number of blocks)/(number of ways) = 32/4 = 8
So the cache has eight sets ("cells"), numbered 0 to 7 (please correct me if I'm wrong!).
Now : 5:(4,5)→5/2=2→2 % 8=2→cell 2→miss
10:(10,11)→10/2=5→5 % 8=5→cell 5→miss
2:(2,3)→2/2=1→1 %8=1→cell 1→miss
34:(34,35)→34/2=17→17 % 8=1→cell 1→miss
18:(18,19)→18/2=9→9 % 8=1→cell 1→miss
4:HIT in cell 2
22:(22,23)→22/2=11→11 % 8=3→cell 3→miss
21:(20,21)→21/2=10→10 % 8=2→cell 2→miss
11: HIT in cell 5
2:HIT in cell 1
Now , the final map of the cache is :
0: empty
1: (2,3) (34,35) (18,19)
2: (4,5) (20,21)
3: (22,23)
4: empty
5: (10,11)
6: empty
7: empty
Is my answer correct ?
Am I wrong with the map of the cache ?
I'd appreciate your help .... my exam is soon :)
Thanks ,
Ron
A simple Python program (ignoring replacements, since none occur here) says you are correct:
from collections import defaultdict

d = defaultdict(list)  # set index -> blocks currently stored in that set
for item in (5, 10, 2, 34, 18, 4, 22, 21, 11, 2):
    value = item // 2 * 2, item // 2 * 2 + 1  # the two byte addresses of this block
    cell = item // 2 % 8  # block number modulo the number of sets
    if value in d[cell]:
        print("HIT", cell)
    else:
        d[cell].append(value)
        print("MISS", cell)
for i in range(8):
    print(i, d[i])
--
MISS 2
MISS 5
MISS 1
MISS 1
MISS 1
HIT 2
MISS 3
MISS 2
HIT 5
HIT 1
0 []
1 [(2, 3), (34, 35), (18, 19)]
2 [(4, 5), (20, 21)]
3 [(22, 23)]
4 []
5 [(10, 11)]
6 []
7 []
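No set ever receives more than 4 blocks in this trace, so the replacement policy never comes into play. For longer traces it would, and the question does not say which policy the cache uses. The sketch below assumes LRU replacement purely for illustration; the function name simulate and its parameters are also my own.

```python
from collections import OrderedDict


def simulate(addresses, total_blocks=32, block_size=2, ways=4):
    """Set-associative cache simulation with LRU replacement (the policy is an assumption)."""
    num_sets = total_blocks // ways
    # One OrderedDict per set, ordered from least to most recently used
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for addr in addresses:
        block = addr // block_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)  # mark as most recently used
            results.append("HIT")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used block
            first = block * block_size
            lines[block] = (first, first + block_size - 1)
            results.append("MISS")
    return results, sets


results, sets = simulate([5, 10, 2, 34, 18, 4, 22, 21, 11, 2])
print(results)
for index, lines in enumerate(sets):
    print(index, sorted(lines.values()))
```

For the trace in the question this gives the same hits, misses and final contents as above, because no eviction is ever triggered.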
I have this matrix:
S.No.  A         B
1      5268020   1756
2      15106230  5241
3      24298744  9591
4      23197375  9129
I want to get a matrix with two columns [X,Y]. X takes its values from S.No. and Y is either 1 or 0. For example, for the row 1 5268020 1756 there should be a total of 5268020 (1,0) pairs, i.e. (X,Y) pairs, and 1756 (1,1) pairs.
How can I get this matrix in Octave?
If I understand your question correctly, you want to fill a matrix with repeated entries (x,0) and (x,1), where x = 1...4 and the number of repetitions is given by the values in columns A and B. Given the values you supplied, that is going to be a huge matrix (67,896,086 rows). So you could try something like this (replace m below, which has fewer elements for illustrative purposes):
m = [1, 2, 1;
2, 3, 2;
3, 2, 1;
4, 2, 2];
res = [];
for k = 1:4
res = [res ; [k*ones(m(k, 2), 1), zeros(m(k, 2), 1);
k*ones(m(k, 3), 1), ones(m(k, 3), 1)]];
endfor
which yields
res =
1 0
1 0
1 1
2 0
2 0
2 0
2 1
2 1
3 0
3 0
3 1
4 0
4 0
4 1
4 1
Out of curiosity, is there any reason not to consider a matrix like
1 0 n
1 1 m
2 0 p
2 1 q
...
where n, m, p, q are the values found in columns A and B? This would probably be easier to handle, no?
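To make the trade-off concrete, here is a small Python sketch that builds both forms from the illustrative m used above: the fully expanded list of (X,Y) pairs and the compact (X, Y, count) table. The variable names are my own assumptions, and the sketch is only meant to show that the compact form carries the same information in far fewer rows.

```python
# Rows of m: (S.No., count of (x,0) pairs, count of (x,1) pairs)
m = [(1, 2, 1), (2, 3, 2), (3, 2, 1), (4, 2, 2)]

# Expanded form: one row per pair, as in res above
expanded = []
for x, zeros_count, ones_count in m:
    expanded.extend([(x, 0)] * zeros_count)
    expanded.extend([(x, 1)] * ones_count)

# Compact form: one row per (x, y) with its repetition count
compact = [(x, y, n) for x, a, b in m for y, n in ((0, a), (1, b))]

print(len(expanded), len(compact))
```

With the real data the expanded form has 67,896,086 rows, while the compact form always has two rows per S.No.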