Dropping people in Stata from a panel based on their situation in multiple years

I have an unbalanced panel covering 7 years in which every person was interviewed 4 times. I want to drop everyone who reported being unemployed or inactive in all 4 periods, but keep the observations of people who were out of the labour market for only 1, 2 or 3 of the 4 periods in which they were interviewed. How do I tell Stata to drop people based on their situation in multiple years (t to t-3)? When I run drop if ecostatus>3, for example, Stata also drops observations that I need, i.e. those of people who were inactive for less than the full period of the survey.

// create some example data
clear
input id t unemp
1 1 1
1 2 1
1 3 1
1 4 1
2 1 1
2 2 0
2 3 1
2 4 1
end
// count the periods in which each person reports unemployment
bys id : egen totunemp = total(unemp)
// display the data
sort id t
list, sepby(id)
// keep people with at least one period of
// employment (each person is interviewed 4 times)
keep if totunemp < 4
// display the data
list


Algorithm to match people with available appointments

I need some help making a program that finds the best solution for everyone (more on that later).
6 7
0 0 0 0 0 0 0
1 0 0 1 1 0 0
2 2 2 1 2 2 2
2 1 1 1 2 1 2
0 1 2 2 1 0 0
1 2 1 2 0 1 1
The example above is a problem instance that the algorithm is supposed to solve:
the first number of the first row is the number of people (6)
the second number of the first row is the number of appointments (7)
0 = the person has no problem with the date
1 = the person could choose this date if no other is available
2 = the person can't choose this appointment
Row = Person
Column = Available Appointment
The program needs to find the best possible solution for everyone by choosing the column that best suits each person, arranging people's appointments based on their choices.
Example:
In the 3rd row the person can only attend the appointment in the 4th column, since they can't attend the others (2), which also makes column 4 taken and unavailable to the other people.
I need help because I have no idea how to approach this. This may be a simple example, but since it's an algorithm it is supposed to work with dozens of people and appointments.
The exercise is somewhat ambiguous, probably on purpose. My wild guess would be to sort the meetings by:
1. the highest number of possible participants, i.e., the lowest number of 2s in a matrix column.
2. the lowest “badness”, i.e., the lowest number of 1s in a matrix column.
Why badness does not count the 2s: because at this sorting stage we don’t care about those who cannot participate.
Why badness does not count the 0s: because we want to minimize the number of people inconvenienced by the meeting time, not (necessarily) maximize the number of people pleased with the meeting time.
#!/usr/bin/env python
import sys

n_people, n_appointments = (int(i)
                            for i in sys.stdin.readline().split())
people_appointments = tuple(tuple(int(i)
                                  for i in line.split())
                            for line in sys.stdin)
assert len(people_appointments) == n_people
for appointments in people_appointments:
    assert len(appointments) == n_appointments
appointment_metric = {}
for appointment in range(n_appointments):
    n_missing = sum(people_appointments[i][appointment] == 2
                    for i in range(n_people))
    badness = sum(people_appointments[i][appointment] == 1
                  for i in range(n_people))
    appointment_metric.setdefault(
        (n_missing, badness), []).append(str(appointment + 1))
for metric in sorted(appointment_metric):
    print(f'Appointment Nr. {" / ".join(appointment_metric[metric])} '
          f'(absence {metric[0]}, badness {metric[1]})')
Possible output (best appointment (by the metric described above) to worst appointment):
Appointment Nr. 6 (absence 1, badness 2)
Appointment Nr. 7 (absence 2, badness 1)
Appointment Nr. 1 / 2 / 3 / 5 (absence 2, badness 2)
Appointment Nr. 4 (absence 2, badness 3)
There are (of course) many other ways to evaluate meetings. Picking and defining a metric is quite likely an implicit part of the exercise.

Put elements of similar type together

I read this interview question somewhere and was trying to solve it:
Given a fruit stall (with at most 8 different types of fruit), put fruits of the same type together.
Restrictions: a) The fruit stall is your entire world (i.e. don't use extra space), b) Taking a fruit and finding out its type (getType()) is a costly operation, but swapping is a very cheap operation.
Note: You need to write code that handles all cases, keeping in mind that there can be at most 8 types of fruit.
So the idea that comes to mind is to call getType() for all the fruits (array elements) and then sort them by type. I don't see how swapping can be done here without knowing the type of the fruit. What would be the best solution to this problem?
Since this is an interview question, I'm going to assume that your fruit stall is an array. Divide the array into eight regions, so that each region contains only fruit of a given type, using seven pointers, one to the start of each region except the first. Use an eighth pointer to point at the start of the unsorted area.
Initialize the pointers to point at the start of the array. Getting the definition of the pointers is tricky because you have to cover cases where there are no fruits of a given type. One possible definition is that Pointer i contains the number of fruits sorted so far of types up to and including i, for i = 1..8. Then at the beginning all the pointers are set equal to zero and 1 1 1 2 2 3 4 4 | corresponds to p1=3 p2=5 p3=6 p4=8 p5=p6=p7=p8=8
Repeatedly look at the fruit at the start of the unsorted region to find out its type. If it should not go in the final region, swap it with the element at the start of the final region and advance the pointer to the start of the final region. If it should not go in the second-last region either, swap it with the element at the start of the second-last region and advance the pointer to the start of the second-last region, and so on until the new fruit is in its correct place. Now advance the pointer to the first unsorted fruit and repeat.
This looks at each fruit once, and I don't think you can sort with fewer calls to getType(). You don't care about the number of swaps, so I think this is optimal.
I will put in lines showing the swaps starting with c1,c2,c1,c3,c2,c1,c4,c4. I won't bother to write in the cs and I will use a | to divide the region on the left where everything is known to be in order from the region on the right where the types are unknown
| 1 2 1 3 2 1 4 4
1 | 2 1 3 2 1 4 4
1 2 | 1 3 2 1 4 4
1 1 | 2 3 2 1 4 4
1 1 2 | 3 2 1 4 4
1 1 2 3 | 2 1 4 4
1 1 2 2 | 3 1 4 4
1 1 2 2 3 | 1 4 4
1 1 2 2 1 | 3 4 4
1 1 1 2 2 | 3 4 4
1 1 1 2 2 3 | 4 4
1 1 1 2 2 3 4 | 4
1 1 1 2 2 3 4 4 |
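The pointer scheme traced above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the answerer's code: the names group_fruits, get_type and ends are made up for illustration, types are assumed to be numbered 0 to n_types - 1, and ends[t] plays the role of the pointers described above (the number of fruits sorted so far whose type is at most t).

```python
def group_fruits(stall, get_type, n_types=8):
    """Group equal types together in place, calling get_type once per fruit.

    Hypothetical helper: `get_type` is assumed to return an int in
    range(n_types). ends[t] counts the sorted fruits of type <= t, so
    region t occupies stall[ends[t - 1]:ends[t]].
    """
    ends = [0] * n_types
    for pos in range(len(stall)):
        t = get_type(stall[pos])  # the only costly call for this fruit
        cur = pos                 # current position of the new fruit
        # Walk the fruit leftwards past every region of a higher type:
        # swap it with that region's first element, which shifts the
        # whole region one slot to the right.
        for region in range(n_types - 1, t, -1):
            start = ends[region - 1]
            stall[cur], stall[start] = stall[start], stall[cur]
            cur = start
            ends[region] += 1
        ends[t] += 1
    return stall


# Same input as the trace above, with type c1 mapped to 0, c2 to 1, ...
fruits = [1, 2, 1, 3, 2, 1, 4, 4]
print(group_fruits(fruits, lambda f: f - 1))
```

Each fruit costs one get_type call and at most n_types - 1 swaps, which fits the premise that swaps are cheap and type lookups are expensive.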
This can most likely be done as an in-place merge sort. As you mentioned, check the type of each fruit immediately. This won't use up any extra memory (there are many guides on how to do an in-place merge sort), will only call getType() once, and will result in n log(n) run time with n memory usage.
Is there any info we know right off the bat? The question seems to be worded in such a way that they would normally give us an alternative way to avoid making the getType() call n times. If this is an in-person interview question, don't be surprised if the goal of the exercise evolves as the interviewer goes into it. This would explain why they specifically mention getType() as being expensive.

From edge or arc list to clusters in Stata

I have a Stata dataset that represents connections between users that looks like this:
src_user linked_user
1 2
2 3
3 5
1 4
6 7
I would like to get something like this:
user cluster
1 1
2 1
3 1
4 1
5 1
6 2
7 2
where isid user evaluates to TRUE and I have grouped all users into disjoint clusters. I have tried thinking of this as a reshape problem, but without much success. None of the user-written SNA commands seem to accomplish this as far as I can tell.
What is the most efficient way of doing this in Stata, other than looping, which I am eager to avoid?
If you reshape the data to long form, you can use group_id (from SSC) to get what you want.
clear
input user1 user2
1 2
2 3
3 5
1 4
6 7
end
gen id = _n
reshape long user, i(id) j(n)
clonevar cluster = id
list, sepby(cluster)
group_id cluster, match(user)
bysort cluster user (id): keep if _n == 1
list, sepby(cluster)
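To cross-check the clusters outside Stata, the same grouping can be computed as connected components of the edge list. The following Python union-find is a hedged sketch for verification only; it is not what group_id does internally, and the function name cluster_ids is made up for illustration.

```python
def cluster_ids(edges):
    """Map each user to a cluster number (1, 2, ...) via union-find."""
    parent = {}

    def find(u):
        # Register unseen users as their own root, then walk up to the
        # root, halving the path along the way.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller user id as the root of the merged cluster.
            parent[max(ra, rb)] = min(ra, rb)

    labels = {}   # root -> consecutive cluster number
    result = {}
    for u in sorted(parent):
        root = find(u)
        result[u] = labels.setdefault(root, len(labels) + 1)
    return result


edges = [(1, 2), (2, 3), (3, 5), (1, 4), (6, 7)]
for user, cluster in cluster_ids(edges).items():
    print(user, cluster)
```

On the example edge list this puts users 1 to 5 in cluster 1 and users 6 and 7 in cluster 2, matching the desired output in the question.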

Array problem using if and do loop

This is my code:
data INDAT8; set INDAT6;
Array myarray{24,27};
goodgroups=0;
do i=2 to 24 by 2;
do j=2 to 27;
if myarray[i,j] gt 1 then myarray[i+1,j] = 'bad';
else if myarray[i,j] eq 1 and myarray[i+1,j] = 1 then myarray[i+1,j]= 'good';
end;
end;
run;
proc print data=INDAT8;
run;
Problem:
I have data in this format (this is just an example, for n=2):
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
The data above is sorted (7 rows in total). I need to form groups of 2, 3 or 4 rows separately and generate a graph. In the data above, I formed groups of 2 rows. The third row is left alone, as there is no other row with the same X value to form a group with. A group can be formed only among rows with the same X value, NOT with other rows.
Next, I check whether both rows have “good” in the info column. If both rows have “good”, the group formed is also good; otherwise it is bad. In the example above, the third (last) group is a “good” group. The rest are all bad groups. Once I’m done with all the rows, I calculate the total number of good groups formed / total number of groups.
In the example above, the output will be: total number of good groups / total number of groups => 1/3.
This is the case of n=2 (the size of the group).
For n=3 we form groups of 3 rows, and for n=4 we form groups of 4 rows, and find the good/bad groups in the same way. If all the rows in a group are “good”, the resulting group is good; otherwise it is bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the case above, I left out the 4th row and the last 2 rows, as I can’t form a group of 3 rows with them. The first group is “bad” and the last group is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I form groups of 4 and find the result. The 5th, 6th, 7th and 8th rows are left behind or ignored. I formed 2 groups of 4 rows and both are “good” groups.
Output: 2/2
So, after getting the 3 output values from n=2, n=3 and n=4, I will plot a graph of these values.
If you can help in any language using arrays, if and do loops, it would be great.
I can change my code accordingly.
Update:
The answer for this doesn't have to be in SAS. Since it is more algorithm-related than anything, I will accept suggestions in any language as long as they show how to accomplish this using arrays and do loops.
I am having trouble understanding your problem statement, but from what I can gather here is what I can suggest:
Place the data into bins and then process the summary data.
Implementation 1
Assumption: You don't know what the range of the first column will be, or the distribution will be sparse.
Create a hash table. The key will be the item you are grouping on. The value will be the count seen so far.
Process each record. If the key already exists, increment the count (the value for that key in the hash table). Otherwise add the key and set the value to 1.
Continue until you have processed all records.
Count the number of keys in the hash table and the number of values that are greater than your threshold.
Implementation 2
Assumption: You know the range of the first column and the distribution is reasonably dense.
Create an array of integers with enough elements that the index can match the column value. Initialize all elements to zero. This array will hold your count for each item you are grouping on.
Process each record. Examine the value of the first column. Increment the element at the corresponding index in the array. (So if you have "2 1 good", do groupCount[2]++.)
Continue until you have processed all records.
Walk each element in the array. Count how many items are nonzero (meaning they appeared at least once) and how many items meet your threshold.
You can use the same approach for gathering the good and bad counts.
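The hash-table variant (Implementation 1) might look like the Python sketch below. It rests on an assumption read off the question's examples rather than stated in it: a group is formed only for X values that have exactly n rows, and a group is good when all n of its rows are "good". The function name good_group_ratio is made up for illustration.

```python
from collections import defaultdict


def good_group_ratio(rows, n):
    """Return (good groups, total groups) for group size n.

    Assumption: an X value forms a group only if it has exactly n rows.
    """
    counts = defaultdict(int)  # X value -> number of rows seen
    good = defaultdict(int)    # X value -> number of "good" rows seen
    for x, _y, info in rows:
        counts[x] += 1
        if info == "good":
            good[x] += 1
    groups = [x for x, c in counts.items() if c == n]
    n_good = sum(1 for x in groups if good[x] == n)
    return n_good, len(groups)


# The n=2 example from the question.
rows = [
    (2, 1, "good"), (2, 4, "bad"),
    (3, 2, "good"),
    (4, 1, "bad"), (4, 4, "good"),
    (6, 2, "good"), (6, 3, "good"),
]
print(good_group_ratio(rows, 2))  # (1, 3)
```

If groups should instead be formed for X values with at least n rows, only the `c == n` test and the goodness check need to change.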

Algorithmic issue

I am trying to find an O(n) algorithm for this problem but have been unable to do so even after spending 3 to 4 hours. The brute-force method, which is O(n^2), times out. I am confused as to how to do it. Does the solution require dynamic programming?
http://acm.timus.ru/problem.aspx?space=1&num=1794
In short the problem is this:
Some students are sitting in a circle, and each of them has a preference as to when he wants to be asked a question by the teacher. The teacher asks the questions in clockwise order only. For example:
5
3 3 1 5 5
This means that there are 5 students and:
1st student wants to go third
2nd student wants to go third
3rd student wants to go first
4th student wants to go fifth
5th student wants to go fifth.
The question is where the teacher should start asking questions so that the maximum number of students get the turn they want. For this particular example, the answer is 5 because
3 3 1 5 5
2 3 4 5 1
You can see that by starting with the fifth student, 2 students (the 2nd and the 4th, who wanted positions 3 and 5) get the turn they wanted. For the following example the answer is the 12th student:
12
5 1 2 3 6 3 8 4 10 3 12 7
because
5 1 2 3 6 3 8 4 10 3 12 7
2 3 4 5 6 7 8 9 10 11 12 1
four students get their choices fulfilled.
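It's actually a rather simple problem. If student k wants to be the jth to present, then she will be satisfied iff student number (k - j + 1) (modulo n) is the first to present. So each student votes for exactly one starting student; tallying the votes in an array of size n and picking the starting student with the most votes gives a simple O(n) algorithm.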
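A minimal Python sketch of this counting idea follows. The function name best_start is made up for illustration, students are numbered from 1 as in the problem statement, and ties are broken in favour of the lowest-numbered student, which is an assumption about what the judge accepts.

```python
def best_start(wishes):
    """Return (starting student, number of satisfied students).

    wishes[k - 1] is the position j that student k wants (1-based).
    Student k is satisfied iff the start is student (k - j + 1) mod n,
    so each student casts one vote for that starting student.
    """
    n = len(wishes)
    votes = [0] * n
    for k, j in enumerate(wishes, start=1):
        votes[(k - j) % n] += 1  # 0-based index of student k - j + 1
    best = max(range(n), key=votes.__getitem__)
    return best + 1, votes[best]


print(best_start([3, 3, 1, 5, 5]))                           # (5, 2)
print(best_start([5, 1, 2, 3, 6, 3, 8, 4, 10, 3, 12, 7]))    # (12, 4)
```

Both results agree with the examples in the question: start with student 5 (2 satisfied) and with student 12 (4 satisfied).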
