Is there a way to aggregate multiple subgroups within a larger group/column with data.table?

I am a new R user trying to count multiple subgroups within a larger group, for example male and female adults by census tract. Currently I am writing it as:
DEmale <- DE_2016small[Gender_2016 == "M", .N, by = Residence_Addresses_CensusTract_2016] %>% rename(Males = N)
and a second command as:
DEfem <- DE_2016small[Gender_2016 == "F", .N, by = Residence_Addresses_CensusTract_2016] %>% rename(Females = N)
Is there any way to combine the code to count M and F at the same time, rather than as two separate commands?
The table is huge, and I need a more efficient way to create multiple groups than running one command at a time.

Another way based on data.table:
dcast(data = DE_2016small,
      formula = Residence_Addresses_CensusTract_2016 ~ factor(Gender_2016, c("M", "F"), c("Male", "Female")),
      fun.aggregate = length)
#     Residence_Addresses_CensusTract_2016 Male Female
#  1:                                    A   19     19
#  2:                                    B   16     15
#  3:                                    C   18     25
#  4:                                    D   15     13
#  5:                                    E   22     24
#  6:                                    F   14     25
#  7:                                    G   21     22
#  8:                                    H   18     20
#  9:                                    I   20     19
# 10:                                    J   14     22
# 11:                                    K   18     22
# 12:                                    L   22     23
# 13:                                    M   20     16
# 14:                                    N   28     21
# 15:                                    O   16     17
# 16:                                    P   18     15
# 17:                                    Q   26     22
# 18:                                    R   20     22
# 19:                                    S   23     26
# 20:                                    T   18     12
# 21:                                    U   20     11
# 22:                                    V   18     18
# 23:                                    W   20     22
# 24:                                    X   21     14
# 25:                                    Y   24     12
# 26:                                    Z   17     17
Data:
library(data.table)
set.seed(123)
DE_2016small = data.table(Gender_2016 = sample(c('M','F'), 1000, replace = TRUE),
                          Residence_Addresses_CensusTract_2016 = sample(LETTERS, 1000, replace = TRUE))
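For comparison, the same one-pass idea can be sketched in Python: group once by tract and count every gender value inside each group, instead of filtering once per gender. This is only an illustration. The function name `count_by_group`, the keys `tract` and `gender`, and the sample rows are assumptions, not part of the original data.

```python
from collections import Counter, defaultdict

def count_by_group(records, group_key, value_key):
    """Count occurrences of each value_key level within each group_key level."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_key]][rec[value_key]] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {group: dict(c) for group, c in counts.items()}

# Hypothetical sample rows standing in for DE_2016small.
rows = [
    {"tract": "A", "gender": "M"},
    {"tract": "A", "gender": "F"},
    {"tract": "A", "gender": "M"},
    {"tract": "B", "gender": "F"},
]
result = count_by_group(rows, "tract", "gender")
print(result)
```

A single pass over the rows produces all subgroup counts at once, which is the point of the dcast call above.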

Related

How can this formula to swizzle rows be simplified?

The problem is quite simple to understand but solving it was not as easy as it sounded at first.
Assume an image that is 8*4. Normal order is easy, since you just return the pixel index:
// 00 01 02 03 04 05 06 07
// 08 09 10 11 12 13 14 15
// 16 17 18 19 20 21 22 23
// 24 25 26 27 28 29 30 31
Now suppose you want to swizzle rows like so:
// 00 01 02 03 04 05 06 07
// 16 17 18 19 20 21 22 23
// 08 09 10 11 12 13 14 15
// 24 25 26 27 28 29 30 31
I solved it, not without trouble to be honest, with the following formula:
index / 8 % 2 * 16 + index / 16 * 8 + index % 8
Isn't there a simpler formula to get the same result?
Assuming / and % return the quotient and remainder in the Euclidean division:
The classic ordering can be obtained as:
row = n / 8
col = n % 8
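And the swizzled ordering (for the 4-row case) can be obtained as:
col = n % 8
old_row = n / 8
new_row = 2 * (old_row % 2) + old_row / 2
swizzled = new_row * 8 + col
Explanation:
2 * (old_row % 2) sends the odd rows (1 and 3) to the lower half of the image;
old_row / 2 gives the position within that half, so rows 1 and 2 trade places while rows 0 and 3 stay where they are.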
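A candidate formula can be checked against the layout in the question with a short Python sketch. It assumes the 8*4 image from the question and compares the question's formula with a row/column version in which the two bits of the row number are exchanged. The function names are illustrative.

```python
WIDTH = 8   # pixels per row, as in the question
ROWS = 4    # number of rows, as in the question

def swizzle_question(index):
    """Formula from the question, using integer division."""
    return index // 8 % 2 * 16 + index // 16 * 8 + index % 8

def swizzle_rows(index):
    """Same mapping written as a row/column decomposition (4-row case only)."""
    row, col = divmod(index, WIDTH)
    new_row = 2 * (row % 2) + row // 2   # exchange the two bits of the row number
    return new_row * WIDTH + col

layout = [swizzle_rows(i) for i in range(WIDTH * ROWS)]
# Both formulations must agree on every pixel index.
assert layout == [swizzle_question(i) for i in range(WIDTH * ROWS)]
for r in range(ROWS):
    print(" ".join(f"{v:02d}" for v in layout[r * WIDTH:(r + 1) * WIDTH]))
```

For comparison, a pairwise swap such as 2 * (row // 2) + (1 - row % 2) would give the row order 1, 0, 3, 2, which is a different layout from the one asked for.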

Mutate new column from random value in existing columns

I'm looking to mutate my data and create a new column which randomly selects a value from the existing data. My data looks something like:
individual  age_2010  age_2011  age_2012  age_2013
a                 20        21        NA        21
b                 33        34        35        36
c                 76        NA        78        79
d                 46        46        48        49
And I want it to look like:
individual  age_2010  age_2011  age_2012  age_2013  Random Sample
a                 20        21        22        NA             21
b                 33        34        35        36             36
c                 76        NA        78        79             78
d                 46        46        48        49             48
Is there any way to add a new column which includes a random figure from any of the previous age columns, and preferably keeping the data in wide form?
I think this is an easier approach:
d[, RandomSample:=sample(na.omit(t(.SD)),1),individual]
If dealing with the edge cases discussed above is desired, and one wanted to follow this approach, we could do this:
f <- function(df) {
  s = na.omit(t(df))
  ifelse(length(s)>0, sample(s,1),NA_real_)
}
d[, RandomSample:=f(.SD),individual]
Or, we could just wrap the original approach in tryCatch:
d[, RandomSample:=tryCatch(sample(na.omit(t(.SD)),1),error=\(e) NA),individual]
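The edge-case handling can be sketched in Python as a minimal illustration: drop the missing values, return a missing marker when nothing is left, otherwise pick one value at random. `None` stands in for NA, and the helper name `random_non_missing` and the all-missing row `e` are assumptions added for the example. One related caveat on the R side: `sample(x, 1)` treats a single number `x` as `1:x`, so a row with exactly one non-NA age may need care as well.

```python
import random

def random_non_missing(values, rng=random):
    """Return one randomly chosen non-None entry, or None if all are missing."""
    present = [v for v in values if v is not None]
    return rng.choice(present) if present else None

# Rows mirror the sample table; None stands in for NA.
table = {
    "a": [20, 21, None, 21],
    "b": [33, 34, 35, 36],
    "c": [76, None, 78, 79],
    "d": [46, 46, 48, 49],
    "e": [None, None, None, None],  # hypothetical all-missing row
}
rng = random.Random(123)  # fixed seed so the run is repeatable
sampled = {name: random_non_missing(ages, rng) for name, ages in table.items()}
print(sampled)
```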
You can reshape longer, then do grouped sampling:
library(data.table)
# Sample data
d <- structure(list(individual = c("a", "b", "c", "d"), age_2010 = c(20, 33, 76, 46), age_2011 = c(21, 34, NA, 46), age_2012 = c(NA, 35, 78, 48), age_2013 = c(21, 36, 79, 49)), row.names = c(NA, -4L), spec = structure(list(cols = list(individual = structure(list(), class = c("collector_character", "collector")), age_2010 = structure(list(), class = c("collector_double", "collector")), age_2011 = structure(list(), class = c("collector_double", "collector")), age_2012 = structure(list(), class = c("collector_double", "collector")), age_2013 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 2L), class = "col_spec"), class = c("data.table", "data.frame"))
d
#>    individual age_2010 age_2011 age_2012 age_2013
#> 1:          a       20       21       NA       21
#> 2:          b       33       34       35       36
#> 3:          c       76       NA       78       79
#> 4:          d       46       46       48       49
# Solution
d[, "Random Sample"] <- d |>
  melt("individual") |>              # go long
  (`[`)(!is.na(value),               # drop NAs
        .(x = sample(value, 1)),     # sampling
        keyby = .(individual)) |>    # grouping variable
  (`[[`)(2)                          # extract vector from frame
d
#>    individual age_2010 age_2011 age_2012 age_2013 Random Sample
#> 1:          a       20       21       NA       21            21
#> 2:          b       33       34       35       36            33
#> 3:          c       76       NA       78       79            76
#> 4:          d       46       46       48       49            49
Alternatively, you can also use apply(), which is less verbose but much slower:
d[, "Random Sample"] <- apply(d[, -1], 1, \(x) x |> na.omit() |> sample(1))
See the benchmark below for a speed comparison. On just 40k observations, apply() takes about 59 times longer (median time) and allocates about 8 times the memory.
# Make large sample data set
d_large <- d |>
  list() |>
  rep(1e4) |>
  rbindlist()
bench::mark(
  base = apply(d_large[, -1], 1, \(x) x |> na.omit() |> sample(1)),
  dt = d_large |>
    melt("individual") |>
    (`[`)(!is.na(value),
          .(x = sample(value, 1)),
          keyby = .(individual)) |>
    (`[[`)(2),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base       617.86ms  617.9ms      1.62   103.3MB     12.9
#> 2 dt           6.96ms   10.5ms     80.9     13.1MB     47.3
Created on 2022-07-27 by the reprex package (v2.0.1)
Edit:
Here are versions that handle the edge case where all years are NA. The first version joins the result back to the original table, which is a bit more expensive than the second version.
# Solution with Data Table
d <- d |>
  melt("individual") |>                         # go long
  (`[`)(!is.na(value),                          # drop NAs
        .(`Random Sample` = sample(value, 1)),  # sampling
        keyby = .(individual)) |>               # grouping variable
  (`[`)(d)                                      # right join with original frame
Here I simply used purrr::possibly() to return NA when sampling a zero-length vector.
# Solution with apply
d[, "Random Sample"] <- apply(d[, -1], 1,
                              \(x) x |> na.omit() |> purrr::possibly(sample, NA)(1))

Nested for/while loop python triangle

Code
num = int(input("Enter the number of lines: "))
for i in range(10):
    for j in range(1,i):
        print(num, end=' ')
        num = num+1
    print()
I am writing a program whose output should look like this:
Enter the number of lines: 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37
38 39 40 41 42 43 44
45 46 47 48 49 50
51 52 53 54 55
56 57 58 59
60 61 62
63 64
65
I don't have any example from the lecturer, so I am just following the steps from a website, but the output of my code looks like the following. I am confused about where I made the mistake, and I have no clue whether to use for or while. Please help me, thank you.
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
Try this:
input_data = input('Enter number of lines: ')
num = int(input_data)
# how many items to print in the first line?
items_to_print = num
# what's the starting number?
print_number = 11
for i in range(0, num):
    # don't decrease num
    # decrease items_to_print
    # each line will reduce 1 item to print
    for j in range(0, items_to_print):
        print(print_number, end = ' ')
        print_number += 1
    print()
    items_to_print -= 1
Result:
Enter number of lines: 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37
38 39 40 41 42 43 44
45 46 47 48 49 50
51 52 53 54 55
56 57 58 59
60 61 62
63 64
65
Explanation
Start small and make your way up.
First just do this:
input_data = input('Enter number of lines: ')
num = int(input_data)
print(num)
That'll print 10 if you entered 10. Great.
Second, add the first for loop and test whether it will print 10 rows.
input_data = input('Enter number of lines: ')
num = int(input_data)
for i in range(0, num):
    print(f'Printing line {i}')
Third, try to print a block of 10 x 10. So, you add another variable called items_to_print. Set it to num. If you enter 10 as input, you will get 10 rows and 10 columns.
input_data = input('Enter number of lines: ')
num = int(input_data)
print_number = 0
items_to_print = num
for i in range(0, num):
    print(f'Printing line {i}')
    for j in range(0, items_to_print):
        print(print_number, end = ' ')
Fourth step is to reduce the number of zeros printed before restarting the i loop. So, you decrement items_to_print.
input_data = input('Enter number of lines: ')
num = int(input_data)
print_number = 0
items_to_print = num
for i in range(0, num):
    print(f'Printing line {i}')
    for j in range(0, items_to_print):
        print(print_number, end = ' ')
    items_to_print -= 1
Now that your printing is working, set print_number to start at 11 and increment it each time a print happens in the j loop. You will then have the same code I published at the top of this answer.
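The steps above can also be collected into a small function that builds the rows before printing them, which makes the logic easy to test. This is a minimal sketch. The name `triangle_rows` and the `start` parameter are illustrative, and the start value of 11 is taken from the expected output in the question.

```python
def triangle_rows(num_lines, start=11):
    """Build the rows of the triangle: the first row has num_lines items,
    each later row has one fewer, and numbering continues across rows."""
    rows = []
    current = start
    for items in range(num_lines, 0, -1):
        rows.append(list(range(current, current + items)))
        current += items
    return rows

for row in triangle_rows(10):
    print(" ".join(str(n) for n in row))
```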
Well, you have three little problems, so let's address them one at a time.
First: range starts at 0 by default, so when your j starts at 1 you are missing one iteration of the loop. That explains one missing column and row, but not two, so let's keep going.
Second: range excludes its end value, meaning your i goes from 0 to 9, so the inner loop goes from 1 to a maximum of 8. There's your missing second iteration.
Third: you are looping from 1 to an increasing value. What you want is the opposite, so you need a decreasing range.
This is how your code should look:
num = 11
for i in range(10, 0, -1):
    for j in range(i):
        print(num, end = " ")
        num += 1
    print()
Good luck and happy coding

The Traveling Salesman algorithm bug

I have tried to make an algorithm solving the traveling salesman problem as follows:
%main function:
[siz, ~] = size(table);
done(1:siz) = false;
done(1) = true;
[dist, path] = bruteForce(table, done, 1);
function bruteForce:
function [distance, path] = bruteForce(table, done, index)
size = length(done);
dmin = inf;
distance = 0;
path = [];
%finding minimum distance
for i = 1:size
    if ~done(i)
        done(i) = true;
        %iterating through all nodes using recursion
        [d, p] = bruteForce(table, done, i);
        if (d < dmin)
            dmin = d;
            path = [i p];
            distance = dmin + table(i, index);
        end
        %freeing the node again
        done(i) = false;
    end
end
if distance == 0
    distance = table(1, index);
    path = 1;
end
Unfortunately, for the following matrix:
B = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
Instead of getting the expected result:
1-8-5-4-10-6-3-7-2-11-9-1 = 253km
I get:
1-8-11-3-4-6-10-5-9-2-7-1 = 271km
Could you help me find the bug?
If brute force is a must and speed is no issue, then just use the perms function for the number of cities. This allows for an easy implementation:
table = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
[siz, ~] = size(table);
[bp, b] = bruteForce(table, siz)
function [bestpath, best] = bruteForce(table, siz)
    p = perms(1:siz);
    [r, c] = size(p);
    best = inf;
    for i = 1:r
        path = p(i, :);
        dist = distCalculatorReturn(table, path);
        if dist < best
            best = dist;
            bestpath = path;
        end
    end
    bestpath = [bestpath, bestpath(1)];
end
function [totaldist] = distCalculatorReturn(distMatrix, proposedPath)
    dist = 0;
    i = 1;
    while i ~= length(proposedPath)
        dist = dist + distMatrix(proposedPath(i),proposedPath(i+1));
        i = i+1;
    end
    dist = dist + distMatrix(proposedPath(1), proposedPath(end));
    totaldist = dist;
end
This yields the answer you are looking for. However, if you are only solving problems of that size, why not apply standard simulated annealing? It is a heuristic, so it does not guarantee the optimum, but it gives much faster solution times and should solve problems of this size consistently:
table = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
[path, dist] = tsp(table, length(table))
function [path, dist] = tsp(D, n)
    L = 40*n;
    epsi = 1e-9;
    x = randperm(n);
    fx = distCalculatorReturn(D, x);
    T = 1000000;
    while T > epsi
        for i=1:L
            num1 = 1 + floor(rand*n);
            num2 = 1 + floor(rand*n);
            while num1 == num2
                num1 = 1 + floor(rand*n);
            end
            y = x;
            swap1 = y(num1);
            y(num1) = y(num2);
            y(num2) = swap1;
            fy = distCalculatorReturn(D,y);
            if fy < fx
                x = y;
                fx = fy;
            elseif rand < exp(-(fy - fx)/T)
                x = y;
                fx = fy;
            end
        end
        T = 0.9*T;
    end
    path = [x, x(1)];
    dist = fx;
end
Your code does not compute the distance for each possible path (as bruteForce suggests). Instead it always starts at node 1 and from there goes always to the node that is closest to the current node. As your example shows, that does not necessarily lead to the overall shortest path. You will need to go through all possible paths to be sure you find the optimum.
Here is my go at your problem:
% distance matrix
B = [0 29 20 21 16 31 100 12 4 31 18;
29 0 15 29 28 40 72 21 29 41 12;
20 15 0 15 14 25 81 9 23 27 13;
21 29 15 0 4 12 92 12 25 13 25;
16 28 14 4 0 16 94 9 20 16 22;
31 40 25 12 16 0 95 24 36 3 37;
100 72 81 92 94 95 0 90 101 99 84;
12 21 9 12 9 24 90 0 15 25 13;
4 29 23 25 20 36 101 15 0 35 18;
31 41 27 13 16 3 99 25 35 0 38;
18 12 13 25 22 37 84 13 18 38 0];
% compute all possible paths assuming we always start at node 1
nNodes = size(B,1);
paths = perms(2:nNodes);
nPaths = size(paths,1);
paths = [ones(nPaths,1) paths ones(nPaths,1)]; % start and finish tour at node 1
% with a random start point:
% paths = perms(1:nNodes);
% paths = [perms(1:nNodes) paths(:,1)];
% compute overall distance for each path
distance = inf;
for idx=1:nPaths
    from = paths(idx,1:end-1);
    to = paths(idx,2:end);
    d = sum(diag(B(from,to)));
    if d<distance
        distance = d;
        optPath = paths(idx,:);
    end
end
This leads to the following result:
optPath = [1 9 11 2 7 3 6 10 4 5 8 1]
distance = 253
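The recursive approach from the question can also be kept. Here is a minimal Python sketch, assuming the issue is that the comparison `d < dmin` looks only at the cost of the remaining sub-tour and leaves out the edge `table(i, index)`. The sketch compares the full cost instead. Nodes are 0-based, the name `brute_force` and the 5-city matrix `D` (the upper-left block of B) are illustrative choices, and the result is cross-checked against plain enumeration with `itertools.permutations`.

```python
import math
from itertools import permutations

def brute_force(table, done, index):
    """Best cost and path from node `index` through all unvisited nodes,
    ending back at node 0 (nodes are 0-based here)."""
    best_cost, best_path = math.inf, []
    for i in range(len(done)):
        if not done[i]:
            done[i] = True
            sub_cost, sub_path = brute_force(table, done, i)
            # Compare the full cost, including the edge index -> i.
            cost = table[index][i] + sub_cost
            if cost < best_cost:
                best_cost, best_path = cost, [i] + sub_path
            done[i] = False  # free the node again
    if not best_path:  # nothing left to visit: close the tour
        return table[index][0], [0]
    return best_cost, best_path

# Small hypothetical 5-city matrix (upper-left block of B from the question).
D = [
    [0, 29, 20, 21, 16],
    [29, 0, 15, 29, 28],
    [20, 15, 0, 15, 14],
    [21, 29, 15, 0, 4],
    [16, 28, 14, 4, 0],
]
done = [True] + [False] * (len(D) - 1)
cost, path = brute_force(D, done, 0)
tour = [0] + path

# Cross-check against plain enumeration of all orderings.
check = min(
    sum(D[a][b] for a, b in zip((0,) + p, p + (0,)))
    for p in permutations(range(1, len(D)))
)
assert cost == check
print(cost, tour)
```

The small matrix keeps the run fast. For the full 11-city matrix the recursion is still exhaustive and therefore slow in Python.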

Change a column_vector to a matrix in MATLAB

I have a column vector that needs to be changed into a matrix. The size of matrix is specified and can change. Please suggest a vectorized solution.
rows = 3 ; cols = 4 ; %matrix elements for this case = 12
colvector = [ 2;4;5;8;10;14;16;18;20;21;28;30] ;
desired_mat = [ ...
2 4 5 8
10 14 16 18
20 21 28 30 ] ;
Thanks!
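The reshape function does that. Note that reshape fills the result column by column, so reshape(colvector, 3, 4) on its own would give a first row of 2 8 16 21. To get the row-wise layout from the question, reshape to cols-by-rows and transpose:
>> colvector = [ 2;4;5;8;10;14;16;18;20;21;28;30] ;
>> A = reshape(colvector, 4, 3).'
A =
     2     4     5     8
    10    14    16    18
    20    21    28    30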
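The difference between filling column by column and filling row by row can be sketched in plain Python. The two helper names are illustrative. The first mimics the column-major order that MATLAB's reshape uses, and the second produces the row-wise layout asked for in the question.

```python
def reshape_column_major(values, rows, cols):
    """Fill column by column, as MATLAB's reshape does."""
    return [[values[c * rows + r] for c in range(cols)] for r in range(rows)]

def reshape_row_major(values, rows, cols):
    """Fill row by row, which is the layout asked for in the question."""
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

colvector = [2, 4, 5, 8, 10, 14, 16, 18, 20, 21, 28, 30]
print(reshape_column_major(colvector, 3, 4))
print(reshape_row_major(colvector, 3, 4))
```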
