data.table create table from rows

I would like to analyze a table that reports job codes used by people over the course of several pay periods: I want to know how many times each person has used each job code.
The table lists people in the first column, and pay periods in subsequent columns -- I cannot transpose without creating new problems with names.
The table looks like this:
people  pp1  pp2  pp3  pp4
Bob     A    A    A    C
Ted     B    B    B    B
Alice   B    A    C    C
My desired output looks like this:
people  A  B  C
Bob     3  0  1
Ted     0  4  0
Alice   1  1  2
My code is as follows:
myDT <- data.table(
  people = c('Bob','Ted','Alice'),
  pp1 = c('A','B','B'),
  pp2 = c('A','B','A'),
  pp3 = c('A','B','C'),
  pp4 = c('C','B','C')
)
id.col = paste('pp', 1:3)
myDT[ , table(as.matrix(.SD)), .SDcols = id.col, by = 1:nrow(myDT)]
but it's nowhere close to working.

melt(myDT, "people") |>
  dcast(people ~ value, fun.aggregate = length)
#    people     A     B     C
#    <char> <int> <int> <int>
# 1:  Alice     1     1     2
# 2:    Bob     3     0     1
# 3:    Ted     0     4     0
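For readers coming from pandas, a rough equivalent of the melt/dcast answer (a sketch on the same toy data; the variable names are mine, not from the answer):

import pandas as pd

df = pd.DataFrame({'people': ['Bob', 'Ted', 'Alice'],
                   'pp1': ['A', 'B', 'B'], 'pp2': ['A', 'B', 'A'],
                   'pp3': ['A', 'B', 'C'], 'pp4': ['C', 'B', 'C']})

# Melt to long form, then count (people, value) pairs -- the same
# reshape-then-aggregate idea as the melt() |> dcast() call above.
long = df.melt(id_vars='people')
print(pd.crosstab(long['people'], long['value']))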

Related

Algorithm for dividing x people into n rooms of different sizes

For a project I have to design an algorithm that will fit a group of people into hotel rooms given their preferences. I have created a dictionary in Python that has a person as key and, as its value, a list of all the people they would like to share a room with.
There are different types of rooms that can hold between 2 and 10 people. How many rooms there are of each type is specified by the user of the program.
I have tried to brute-force this problem by trying all room combinations, giving each room a score based on the preferences of its residents, and looking for the maximum total score. This works fine for small group sizes, but a group of 200 gives on the order of 200! combinations, which my poor computer will not be able to compute within my lifetime.
I was wondering if there is an existing algorithm for this problem that I have simply not been able to find.
Thanks in advance!
Thijs
What you can do is think of your dictionary as a graph. Then you can create an adjacency matrix.
For example, let's say you have a group of 4 people: A, B, C and D.
A: wants to be with B and C
B: wants to be with A
C: wants to be with D
D: wants to be with A and C
Your matrix would look like this:
// A B C D
// A 0 1 1 0
// B 1 0 0 0
// C 0 0 0 1
// D 1 0 1 0
Let's call this matrix M. You can then calculate the transpose (let's call it MT) and add M to MT. You will get something like this.
// A B C D
// A 0 2 1 1
// B 2 0 0 0
// C 1 0 0 2
// D 1 0 2 0
Then order the rows (or the columns; it doesn't matter, since the matrix is symmetric) by the sum of their values.
// A B C D
// A 0 2 1 1
// C 1 0 0 2
// D 1 0 2 0
// B 2 0 0 0
Do the same with the columns
// A C D B
// A 0 1 1 2
// C 1 0 2 0
// D 1 2 0 0
// B 2 0 0 0
Start filling your rooms from the first row, picking the greatest value in that row, and reduce the matrix by removing the people who have been assigned a room. You should select the biggest room first.
For example, if we have a room that can hold 2 people, you'd assign persons A and B to it, since the biggest value in the first row is 2 and it corresponds to person B.
The reduced matrix would then be:
// C D
// C 0 2
// D 2 0
And you loop until everyone is assigned a room.
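Here is a minimal Python sketch of that greedy procedure (an illustration only, under the assumptions above: rooms are filled biggest-first and the score matrix is M plus its transpose; the variable names are mine, not from the question):

import numpy as np

names = ['A', 'B', 'C', 'D']
M = np.array([[0, 1, 1, 0],    # A wants B and C
              [1, 0, 0, 0],    # B wants A
              [0, 0, 0, 1],    # C wants D
              [1, 0, 1, 0]])   # D wants A and C
S = M + M.T                    # symmetric mutual-preference scores

room_sizes = [2, 2]
remaining = list(range(len(names)))
rooms = []
for size in sorted(room_sizes, reverse=True):   # biggest room first
    # seed the room with the remaining person whose total score is largest
    seed = max(remaining, key=lambda i: S[i, remaining].sum())
    room = [seed]
    remaining.remove(seed)
    while len(room) < size and remaining:
        # greedily add the person with the strongest ties to the room so far
        best = max(remaining, key=lambda j: S[room, j].sum())
        room.append(best)
        remaining.remove(best)
    rooms.append([names[i] for i in room])

print(rooms)   # [['A', 'B'], ['C', 'D']] for this example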
You already had a greedy solution described. So instead I'll suggest a simulated annealing solution.
For this you first assign everyone to rooms randomly. And now you start considering swapping people at random. You always accept swaps that improve your score, but have a chance of accepting a bad swap. The chance of accepting a bad swap goes down if the swap is really bad, and also goes down with time. After you've experimented enough, whatever you have is probably pretty good.
It is called "simulated annealing" because it is a simulation of the process by which a slowly cooling substance forms a well-organized crystal structure. So the parameter that you usually use is called T for temperature. And a standard function is:
import math
import random

def maybe_swap(assignment, x, y, T):
    score_now = score(assignment)        # score() is your own objective function
    swapped = swap(assignment, x, y)     # swap() exchanges two people's rooms
    score_swapped = score(swapped)
    # An improving swap is always accepted, since exp(positive / T) > 1;
    # a worsening swap is accepted with probability exp(delta / T),
    # which shrinks as the swap gets worse and as T cools.
    if random.random() < math.exp((score_swapped - score_now) / T):
        return swapped
    else:
        return assignment
And then you just have to play around with how much work to do. Something like this:
for count_down in range(400, 0, -1):   # stop above 0 so T never reaches 0
    for i in range(n * n):             # n is the number of people
        x = random.randrange(n)
        y = random.randrange(n)
        if x != y:
            assignment = maybe_swap(assignment, x, y, count_down / 100.0)
(You should play around with the parameters.)
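The score and swap helpers are left undefined in the answer; here is one possible minimal sketch, assuming assignment is a list mapping each person's index to a room index and S is the symmetric preference matrix (M plus its transpose) from the greedy answer above:

import numpy as np

# Assumed setup: S[i, j] is the mutual-preference weight between persons i and j.
M = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
S = M + M.T

def score(assignment):
    # Total preference weight between people who share a room.
    a = np.asarray(assignment)
    same_room = a[:, None] == a[None, :]   # True where two people share a room
    return (S * same_room).sum() / 2       # halve to undo double counting

def swap(assignment, x, y):
    # Exchange the rooms of persons x and y; room sizes stay fixed.
    out = list(assignment)
    out[x], out[y] = out[y], out[x]
    return out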

diagonal value in co-occurrence matrix

I am a newbie, so thank you so much in advance for any advice.
I want to make a co-occurrence matrix, and followed the link below:
How to use R to create a word co-occurrence matrix
but I cannot understand why the value of A-A is 10 in the matrix below.
Shouldn't it be 4, since there are four As?
library(qdapTools)  # for mtabulate()
dat <- read.table(text='film tag1 tag2 tag3
1 A A A
2 A C F
3 B D C ', header=T)
crossprod(as.matrix(mtabulate(as.data.frame(t(dat[, -1])))))
   A  C  F  B  D
A 10  1  1  0  0
C  1  2  1  1  1
F  1  1  1  0  0
B  0  1  0  1  1
D  0  1  0  1  1
The solution you use presumes each tag appears only once per film, which jibes with the usual definition of a co-occurrence matrix as far as I can tell. Because film 1 lists A three times, each of those As gets counted as co-occurring with itself and with the other two As (3 x 3 = 9 pairings), and the single A on the second line adds one more, for a total of 3*3 + 1*1 = 10 co-occurrences.
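To see where the 10 comes from, here is a quick numpy check (a sketch; the count matrix below is just the mtabulate() result written out by hand):

import numpy as np

# Rows are films, columns are tags in the order A, C, F, B, D;
# each entry counts how many times a tag appears on a film.
counts = np.array([[3, 0, 0, 0, 0],   # film 1: A A A
                   [1, 1, 1, 0, 0],   # film 2: A C F
                   [0, 1, 0, 1, 1]])  # film 3: B D C
cooc = counts.T @ counts              # same as R's crossprod()
print(cooc[0, 0])                     # 10 = 3*3 + 1*1 + 0*0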

Pandas MultiIndex sort

In Pandas 0.19 I have a large dataframe with a MultiIndex of the following form:
          C0  C1  C2
A   B
bar one    4   2   4
    two    1   3   2
foo one    9   7   1
    two    2   1   3
I want to sort bar and foo (and many more pairs of rows like them) according to "two", to get the following:
          C0  C1  C2
A   B
bar one    4   4   2
    two    1   2   3
foo one    7   9   1
    two    1   2   3
I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks
Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them, then broadcasts that column order to every row of the original dataframe. It then unravels this order (after adding a per-row offset) along with the original dataframe values, reorders all the original values by this unraveled, offset, argsorted array, and finally creates a new dataframe with the intended sort order.
rows, cols = df.shape
# argsort each group's 'two' row to get that group's column order
df_a = np.argsort(df.xs('two', level=1))
# repeat that order for every row of the group
order = df_a.reindex(df.index.droplevel(-1)).values
# offset each row's column indices into the flat (ravelled) value array
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols),
             index=df.index, columns=df.columns)
Output
          C0  C1  C2
A   B
bar one    4   4   2
    two    1   2   3
foo one    7   9   1
    two    1   2   3
Some speed tests:
# create a much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])

# scott boston's groupby/apply solution (sortit, defined below)
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop

# Ted's numpy solution (above)
1000 loops, best of 3: 5 ms per loop
Here is a solution, albeit kludgy:
Input dataframe:
          C0  C1  C2
A   B
bar one    4   2   4
    two    1   3   2
foo one    9   7   1
    two    2   1   3
Custom sorting function:
def sortit(x):
    # remember the original column labels; sort_values will reorder them
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    # order the columns by the values in this group's 'two' row
    x.sort_values(by='two', axis=1, inplace=True)
    # restore the original labels so every group displays C0 C1 C2
    x.columns = xcolumns
    return x

df.groupby(level=0).apply(sortit)
Output:
          C0  C1  C2
A   B
bar one    4   4   2
    two    1   2   3
foo one    7   9   1
    two    1   2   3

Assign random number by group

I am trying to assign the same random number to each observation within a group. Thus in the dataset below, the value of the variable "random" would be equal for each observation where gp=B, would take another value for each observation where gp=A, and so on.
data test ;
  input gp $ a b c ;
  datalines;
B 2 2 3
B 2 2 3
A 1 2 3
A 1 2 3
C 3 3 4
C 3 3 4
;
Unwisely, I tried to create a different seed for each group, based upon values common to all observations within the group:
data test2 ;
  set test ;
  seed = a*b*c ;
  random = ranuni(seed) ;
run ;
This creates a common seed per group, but the resulting random number obviously still changes for each observation.
How can I obtain a random number that is the same for every observation in a group? Due to the very large size of the real dataset, I would like to avoid any sorting or other time-consuming processes.
The required dataset would thus look something like:
data want ;
  input gp $ a b c random ;
  datalines;
B 2 2 3 0.123
B 2 2 3 0.123
A 1 2 3 0.456
A 1 2 3 0.456
C 3 3 4 0.789
C 3 3 4 0.789
;
This should do the trick; ask me if you have any questions:
proc sort data=test;
  by gp;
run;

data test2;
  drop seed;
  set test;
  by gp;
  retain random;
  if first.gp then do;
    seed = a*b*c;
    random = ranuni(seed);
  end;
run;
Basically, each time you call ranuni you get a new random number, so you only want to call it when the group (gp) changes; retain then carries that value across the rest of the group's observations.
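For comparison, here is the same draw-once-per-group idea sketched in pandas (an illustration, not part of the original answer; the column names just mirror the SAS example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'gp': ['B', 'B', 'A', 'A', 'C', 'C'],
                   'a': [2, 2, 1, 1, 3, 3],
                   'b': [2, 2, 2, 2, 3, 3],
                   'c': [3, 3, 3, 3, 4, 4]})

rng = np.random.default_rng()
# Draw one uniform number per distinct group, then broadcast it back to
# every row of that group -- the analogue of calling ranuni() only on first.gp.
draws = pd.Series(rng.random(df['gp'].nunique()), index=df['gp'].unique())
df['random'] = df['gp'].map(draws)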

Improve the performance of the code with fewer operations

There are two vectors:
a = 1:5;
b = 1:2;
In order to find all combinations of these two vectors, I am using the following piece of code:
[A,B] = meshgrid(a,b);
C = cat(2,A',B');
D = reshape(C,[],2);
The result includes all the combinations:
D =
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
4 2
5 2
Now the questions:
1. I want to decrease the number of operations to improve the performance for bigger vectors. Is there any single function in MATLAB that does this?
2. When the number of vectors is more than 2, the meshgrid function cannot be used and would have to be replaced with for loops. What is a better solution?
For greater than 2 dimensions, use ndgrid:
>> a = 1:2; b = 1:3; c = 1:2;
>> [A,B,C] = ndgrid(a,b,c);
>> D = [A(:) B(:) C(:)]
D =
1 1 1
2 1 1
1 2 1
2 2 1
1 3 1
2 3 1
1 1 2
2 1 2
1 2 2
2 2 2
1 3 2
2 3 2
Note that ndgrid expects (rows,cols,...) rather than (x,y).
This can be generalized to N dimensions (see here and here):
params = {a,b,c};
vecs = cell(numel(params),1);
[vecs{:}] = ndgrid(params{:});
D = reshape(cat(numel(vecs)+1,vecs{:}),[],numel(vecs));
Also, as described in Robert P.'s answer and here too, kron can also be useful for replicating values (indexes) in this way.
If you have the neural network toolbox, also have a look at combvec, as demonstrated here.
One way would be to combine repmat and the Kronecker tensor product like this:
[repmat(a,size(b)); kron(b,ones(size(a)))]'
ans =
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
4 2
5 2
This can be scaled to more dimensions this way:
a = 1:3;
b = 1:3;
c = 1:3;
x = [repmat(a,1,numel(b)*numel(c)); ...
repmat(kron(b,ones(1,numel(a))),1,numel(c)); ...
kron(c,ones(1,numel(a)*numel(b)))]'
There is a logic to it. First: simply repeat the first vector. Second: take the Kronecker product with the length of the first vector and repeat the result. Third: take the Kronecker product with the length of (first x second); since there is no fourth vector in this case, that last term needs no repeat.
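For comparison, the same Cartesian-product construction can be sketched in Python/NumPy (np.meshgrid with indexing='ij' plays the role of MATLAB's ndgrid, and a Fortran-order ravel reproduces MATLAB's column-major ordering):

import numpy as np

a = np.arange(1, 6)   # 1:5
b = np.arange(1, 3)   # 1:2

# indexing='ij' mirrors ndgrid; order='F' makes the first vector vary
# fastest, matching the MATLAB output D above.
A, B = np.meshgrid(a, b, indexing='ij')
D = np.column_stack([A.ravel(order='F'), B.ravel(order='F')])
print(D)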
