From edge or arc list to clusters in Stata - social-networking

I have a Stata dataset that represents connections between users that looks like this:
src_user linked_user
1 2
2 3
3 5
1 4
6 7
I would like to get something like this:
user cluster
1 1
2 1
3 1
4 1
5 1
6 2
7 2
where isid user evaluates to TRUE and I have grouped all users into disjoint clusters. I have tried thinking of this as a reshape problem, but without much success. None of the user-written SNA commands seem to accomplish this as far as I can tell.
What is the most efficient way of doing this in Stata, other than looping, which I am eager to avoid?

If you reshape the data to long form, you can use group_id (from SSC) to get what you want.
clear
input user1 user2
1 2
2 3
3 5
1 4
6 7
end
gen id = _n
reshape long user, i(id) j(n)
clonevar cluster = id
list, sepby(cluster)
group_id cluster, match(user)
bysort cluster user (id): keep if _n == 1
list, sepby(cluster)
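For readers who want to see what is being computed, independent of Stata: the clusters are the connected components of the graph defined by the edge list. Below is a minimal union-find sketch in Python. The function name cluster_users is made up for illustration, and this is not a claim about how group_id works internally.

```python
def cluster_users(edges):
    """Label each user with the connected component it belongs to."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as representative so labels are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    roots = sorted({find(u) for u in parent})
    label = {root: k for k, root in enumerate(roots, start=1)}
    return {u: label[find(u)] for u in sorted(parent)}


edges = [(1, 2), (2, 3), (3, 5), (1, 4), (6, 7)]
print(cluster_users(edges))
# {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2}
```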

Related

Ranking over each matrix column's sort in julia

I have a matrix (m) of scores for 4 students on 3 different exams.
4 3 1
3 2 5
8 4 6
1 5 2
I want to know, for each student, the exams they did best to worse on. Desired output:
1 2 3
2 3 1
1 3 2
3 1 2
Now, I'm new to the language (and coding in general), so I read GeeksforGeeks' page on sorting in Julia and tried
mapslices(sortperm, -m; dims = 2)
However, this gives something subtly different: a matrix in which each row holds the indices that would sort that row (best exam first), rather than the rank of each exam.
1 2 3
3 1 2
1 3 2
2 3 1
Perhaps it was obvious, but I now realize this is not actually what I want, but I cannot find a built-in function/fast way to complete this operation. Any ideas? Preferably something which doesn't iterate through items in the matrix/row, as in reality my matrix is very, very large. Thanks!
Such functionality is provided by StatsBase.jl. Here is an example:
julia> using StatsBase
julia> m = [4 3 1
3 2 5
8 4 6
1 5 2]
4×3 Array{Int64,2}:
4 3 1
3 2 5
8 4 6
1 5 2
julia> mapslices(x -> ordinalrank(x, rev=true), m, dims = 2)
4×3 Array{Int64,2}:
1 2 3
2 3 1
1 3 2
3 1 2
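You might want to use another ranking function (StatsBase also provides competerank, denserank and tiedrank), depending on how you want to break ties; see the StatsBase documentation on rankings for details.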
Figured out something which works!
Run m_index_rank = mapslices(sortperm, -m; dims = 2) on the matrix and get a ranking for each row through index. Then, realizing this is, in each row, an inverse permutation away from the desired output, run mapslices(invperm, m_index_rank; dims = 2) for the desired result.
In one line, this is mapslices(r -> invperm(sortperm(r, rev=true)), m; dims=2) over the desired matrix m. dims = 2 is to carry out the operation row-wise.
I'm marking this resolved for now, but please let me know if there are cleaner/faster ways to do this.
Edit: Replaced my syntactically clunky mapslices(invperm, mapslices(sortperm, -m; dims = 2); dims = 2) with a more natural one, thanks to @phipsgabler
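To make the "rank is the inverse permutation of the sort order" idea concrete, here is a small plain-Python sketch of the same two steps for a single row. The name rank_desc is hypothetical, and ties are broken by original position, which may differ from what you want.

```python
def rank_desc(row):
    """Rank the entries of row, with 1 for the largest value."""
    # Step 1: the sort permutation (indices ordered from best to worst score).
    order = sorted(range(len(row)), key=lambda i: -row[i])
    # Step 2: invert it, so that position idx receives its rank.
    ranks = [0] * len(row)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


m = [[4, 3, 1],
     [3, 2, 5],
     [8, 4, 6],
     [1, 5, 2]]
print([rank_desc(row) for row in m])
# [[1, 2, 3], [2, 3, 1], [1, 3, 2], [3, 1, 2]]
```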

Subsetting multiple variables by rank within data-frame

Hopefully the below makes sense.
I have a data set with a large number of variables (rows). Within each variable are sections that are scored 1-15. I need to subset the dataframe based on the three highest-scoring sections for each variable. Each section has additional data associated with it that needs to be kept, but that data is not used for the selection.
Having trouble with this. Any help is appreciated.
Dummy layout below
Variable Aux_score
1 1
1 6
1 3
1 8
1 10
2 3
2 2
2 12
2 10
2 11
3 7
3 2
3 9
3 8
3 12
You can do it like this with base R:
do.call(rbind, lapply(split(df, df$Variable), function(df) df[ tail(order(df$Aux_score), 3), ]))
Or like this with tidyverse:
df %>% group_by(Variable) %>% top_n(3, Aux_score) %>% ungroup()
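The same split, pick the top three, recombine logic can be sketched outside R as well. The Python version below is only an illustration: top_n_per_group is a made-up name and rows are assumed to be dictionaries with the column names from the dummy layout. Unlike top_n, heapq.nlargest returns exactly n rows per group even when scores are tied.

```python
from collections import defaultdict
import heapq


def top_n_per_group(rows, n=3):
    """Keep the n highest-scoring rows for each value of Variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Variable"]].append(row)
    result = []
    for key in sorted(groups):
        # nlargest keeps whole rows, so any extra columns come along.
        result.extend(heapq.nlargest(n, groups[key], key=lambda r: r["Aux_score"]))
    return result


scores = {1: [1, 6, 3, 8, 10], 2: [3, 2, 12, 10, 11], 3: [7, 2, 9, 8, 12]}
rows = [{"Variable": v, "Aux_score": s} for v, vals in scores.items() for s in vals]
for row in top_n_per_group(rows):
    print(row["Variable"], row["Aux_score"])
```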

Put elements of similar type together

I read this interview question somewhere and was trying to solve it:
Given a fruit stall (at max 8 different types of fruits). Put fruits of similar types together.
Restrictions: a) Fruit Stall is your entire world (i.e. don't use extra space), b) Taking a fruit and knowing its type (getType()) is a costly operation but swapping is a very cheap operation.
Note: You need to write a code to handle all cases keeping in mind the max types of fruit can be 8.
So, the idea that comes to mind is that we need to call getType() for all the fruits (array elements) and then sort them by type. I don't see how swapping can be done without knowing the type of the fruit. What would be the best solution to this problem?
Since this is an interview question, I'm going to assume that your fruit stall is an array. Divide the array into eight regions, so that each region contains only fruit of a given type, using seven pointers, one to the start of each region except the first. Use an eighth pointer to point at the start of the unsorted area.
Initialize the pointers to point at the start of the array. Getting the definition of the pointers right is tricky, because you have to cover cases where there are no fruits of a given type. One possible definition is that pointer i contains the number of fruits sorted so far of types up to and including i, for i = 1..8. Then at the beginning all the pointers are set to zero, and the state 1 1 1 2 2 3 4 4 | corresponds to p1=3, p2=5, p3=6, p4=8, p5=p6=p7=p8=8.
Repeatedly look at the fruit at the start of the unsorted region to find out its type. If it should not go in the final region, swap it with the element at the start of the final region and advance the pointer to the start of the final region. If it should not go in the second-to-last region either, swap it with the element at the start of the second-to-last region and advance the pointer to the start of that region... and so on, until the new fruit is in its correct place. Now advance the pointer to the first unsorted fruit and repeat.
This looks at each fruit once, and I don't think you can sort with fewer calls to getType(). You don't care about the number of swaps, so I think this is optimal.
I will put in lines showing the swaps starting with c1,c2,c1,c3,c2,c1,c4,c4. I won't bother to write in the cs and I will use a | to divide the region on the left where everything is known to be in order from the region on the right where the types are unknown
| 1 2 1 3 2 1 4 4
1 | 2 1 3 2 1 4 4
1 2 | 1 3 2 1 4 4
1 1 | 2 3 2 1 4 4
1 1 2 | 3 2 1 4 4
1 1 2 3 | 2 1 4 4
1 1 2 2 | 3 1 4 4
1 1 2 2 3 | 1 4 4
1 1 2 2 1 | 3 4 4
1 1 1 2 2 | 3 4 4
1 1 1 2 2 3 | 4 4
1 1 1 2 2 3 4 | 4
1 1 1 2 2 3 4 4 |
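The pointer scheme and the trace above can be sketched in Python as follows. This is only an illustration under a few assumptions: the stall is a list, types are numbered 0..7 rather than 1..8, and group_by_type and its parameters are names invented here. The only extra storage is the fixed array of eight counters, and get_type is called once per fruit.

```python
def group_by_type(stall, get_type, num_types=8):
    """Group fruits of the same type together in place."""
    # ends[t] = number of already-placed fruits whose type is <= t,
    # i.e. the index just past region t.
    ends = [0] * num_types
    for i in range(len(stall)):
        t = get_type(stall[i])  # the single costly call for this fruit
        pos = i
        # Hop the new fruit leftwards over every region of a higher type.
        for u in range(num_types - 1, t, -1):
            start_u = ends[u - 1]  # first slot of region u
            stall[pos], stall[start_u] = stall[start_u], stall[pos]
            pos = start_u
            ends[u] += 1  # region u has shifted one slot to the right
        ends[t] += 1  # the fruit now sits at the end of its own region
    return stall


fruits = [1, 2, 1, 3, 2, 1, 4, 4]
print(group_by_type(fruits, lambda f: f - 1))
# [1, 1, 1, 2, 2, 3, 4, 4]
```

Each fruit costs at most seven swaps, so the number of swaps is bounded by 7n, which fits the premise that swapping is cheap.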
This can most likely be done as an in-place merge sort. As you mentioned, check the type of each fruit immediately. This won't use any extra memory (there are many guides on how to do an in-place merge sort), will only call getType() once per fruit, and will result in O(n log n) run time with n memory usage.
Is there any info we know right off the bat? It seems like the question is worded in such a way that they would normally give us an alternative way to avoid having to make the getType() call n times. If this is an in-person interview question, don't be surprised if the goal of this exercise is supposed to evolve as the interviewer starts going into it. This would explain why they specifically mention getType() as being expensive.

Filter on google sheets minimum across columns

Hi, I am trying to get some data displayed using the FILTER function in Google Sheets.
What I want is the minimum value across 3 columns on 1 row.
Is this possible?
For example:
A 1 6 10
B 3 5 9
C 4 4 8
D 5 3 7
A 2 1 6
Filter on A should give:
A 1
A 1
Filter on B should give:
B 3
I would really like to use filter function but =filter({A:A,min(B:D)},A:A="A") doesn't work.
Maybe, if your three (labelled) columns are A, B and C:
=filter(A2:C2,A2:C2=min(A2:C2))
but in that case filter would be overkill.
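To pin down the result the question asks for (one row-wise minimum per matching row), here is the logic spelled out in Python rather than as a Sheets formula. The name min_for_label and the tuple layout are assumptions made for this illustration only.

```python
def min_for_label(rows, label):
    """Return (label, row minimum) for every row whose first cell matches."""
    return [(name, min(values)) for name, *values in rows if name == label]


rows = [
    ("A", 1, 6, 10),
    ("B", 3, 5, 9),
    ("C", 4, 4, 8),
    ("D", 5, 3, 7),
    ("A", 2, 1, 6),
]
print(min_for_label(rows, "A"))  # [('A', 1), ('A', 1)]
print(min_for_label(rows, "B"))  # [('B', 3)]
```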

Dropping people in Stata from a panel based on their situation in multiple years

I have an unbalanced panel of 7 years with every person interviewed 4 times and I want to drop all the people that reported that they were unemployed/inactive in all 4 periods. However, I do not want to drop the observations of the people that may have been out of the labour market for 1, 2 or 3 out of the 4 periods they were interviewed. How do I tell Stata to drop people based on their situation in multiple years (t to t-3)? When I do drop if ecostatus>3, for example, Stata drops observations that I need, i.e. the people that were inactive for less than the full period of the survey.
// create some example data
clear
input id t unemp
1 1 1
1 2 1
1 3 1
1 4 1
2 1 1
2 2 0
2 3 1
2 4 1
end
// create the total number of unemployment spells
bys id : egen totunemp = total(unemp)
// display the data
sort id t
list, sepby(id)
// keep those observations with at least one
// employment spell
keep if totunemp < 4
// display the data
list
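The same group-then-filter idea, written as a small Python sketch for comparison. The function name and the (id, t, unemp) tuple layout are assumptions for illustration. Instead of hard-coding 4 periods, it drops a person only when every one of their own observations is an unemployment spell, which also covers people observed fewer than 4 times.

```python
from collections import defaultdict


def drop_always_unemployed(rows):
    """Drop every person whose unemp flag is 1 in all of their periods."""
    flags_by_id = defaultdict(list)
    for pid, t, unemp in rows:
        flags_by_id[pid].append(unemp)
    # Keep a person if at least one of their periods is an employment spell.
    keep = {pid for pid, flags in flags_by_id.items() if not all(flags)}
    return [row for row in rows if row[0] in keep]


panel = [
    (1, 1, 1), (1, 2, 1), (1, 3, 1), (1, 4, 1),
    (2, 1, 1), (2, 2, 0), (2, 3, 1), (2, 4, 1),
]
print(drop_always_unemployed(panel))
# [(2, 1, 1), (2, 2, 0), (2, 3, 1), (2, 4, 1)]
```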
