Dropping observations in Stata based on age and panel round

I have a panel dataset and I want to drop those respondents that were aged 40 years and over in their first round of the survey.
I tried doing drop if age>40 and also drop if age>40 & t==1, where t identifies the survey wave the person is in. However, when I do the second, I am still left with people over the age of 40.
Here is an example of what my data look like:
pid age wave year of survey
1 20 1 2005
1 21 2 2006
1 22 3 2007
1 23 4 2008
2 37 1 2006
2 38 2 2007
2 39 3 2008
2 40 4 2009
3 40 1 2008
3 41 2 2009
3 42 3 2010
3 43 4 2011
My aim is not to lose the 3rd respondent, given that he/she was within my target age group when first surveyed but not in the following survey years (rather than just being left with his/her first wave of data and dropping the other 3, which is what happens if I simply do drop if age>40).
Is there another way to be left with only people up to the age of 40 while keeping those who were 40 in their first wave, even if they turn 41, 42 etc. in subsequent waves? I basically want to constrain my panel to the up-to-40 age group while keeping those who were 40 in their first wave but might be over 40 in subsequent waves (I only have 4 waves).

Stata gives you exactly what you're asking for. With drop if age > 40 you simply lose any observation for which age > 40. With drop if age > 40 & wave == 1 you add an additional condition: drop it if it simultaneously has wave == 1. I think that's clear.
I find your explanation somewhat contradictory. You don't want to lose any observation from respondent 3 because in her first wave she's not over 40, although she is in her following waves. But then you say you want to be left with only people up to the age of 40.
The following just drops all observations for any person who in her first wave is over 40. Let us know if this is not what you seek.
clear all
set more off
input ///
pid age wave survyear
1 20 1 2005
1 21 2 2006
1 22 3 2007
1 23 4 2008
2 37 1 2006
2 38 2 2007
2 39 3 2008
2 40 4 2009
3 40 1 2008
3 41 2 2009
3 42 3 2010
3 43 4 2011
4 42 1 2009
4 43 2 2010
4 44 3 2011
4 45 4 2012
end
list, sepby(pid)
*-----
bysort pid (age): drop if age[1] > 40
list, sepby(pid)
You probably want to read Speaking Stata: How to move step by: step, by Nick Cox. See also help subscripting.
Edit
With no knowledge of the database structure, sorting by wave should be a more general approach. That involves bysort pid (wave): ... in the previous code. Imagine a case where a person has the same age for two consecutive waves. If so, sorting by age would not give consistent results. The wave variable is likely to be the one that uniquely identifies cases, for each person. Read help sort and help isid carefully, including the manual entries.
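For readers who want to check the logic outside Stata, here is a minimal Python sketch of the same idea: find each person's age in their earliest wave, then drop every row of anyone whose first-wave age is over 40. The function name drop_if_first_wave_over and the tuple layout are made up for illustration; the data are the example rows from above.

```python
# Rows are (pid, age, wave, survyear), copied from the example above.
rows = [
    (1, 20, 1, 2005), (1, 21, 2, 2006), (1, 22, 3, 2007), (1, 23, 4, 2008),
    (2, 37, 1, 2006), (2, 38, 2, 2007), (2, 39, 3, 2008), (2, 40, 4, 2009),
    (3, 40, 1, 2008), (3, 41, 2, 2009), (3, 42, 3, 2010), (3, 43, 4, 2011),
    (4, 42, 1, 2009), (4, 43, 2, 2010), (4, 44, 3, 2011), (4, 45, 4, 2012),
]


def drop_if_first_wave_over(rows, max_age=40):
    """Keep all rows of a person unless their age in the earliest wave exceeds max_age."""
    first_age = {}
    # Sort by (pid, wave) so the first row seen per pid is the earliest wave.
    for pid, age, wave, _ in sorted(rows, key=lambda r: (r[0], r[2])):
        first_age.setdefault(pid, age)
    return [r for r in rows if first_age[r[0]] <= max_age]


kept = drop_if_first_wave_over(rows)
print(sorted({r[0] for r in kept}))  # pids that survive the filter
```

Sorting on wave rather than age mirrors the point made in the edit: wave is the variable that identifies the order of observations within a person.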

Does Cognos Framework Manager have the "Last" function like Dynamic Cubes in Cognos?

I was wondering if Cognos Framework Manager has the built-in function "Last" like in Dynamic Cubes?
Or does someone know how to model the following case:
We have two dimensions: a time dimension with year, half-year, quarter and month, and another dimension that categorises people depending on how long they have been attending a project (1-30 days, 31-60 days, 60-180, 180-365, 1-2 years, +2 years). However, the chosen level of the time dimension (year, half-year etc.) influences the categorization in the other dimension.
An example:
A person attends a project starting on 15.11.2018 and ending on 30.06.2020. The Cognos user uses the year level of the time dimension, so 2018, 2019 & 2020 will be displayed.
For 2018 the person will be in the category 31-60 days, since 46 days have passed by 31.12.2018. For 2019 the person will be listed in category 1-2 years, as 46 + 365 days will have passed by 31.12.2019. For 2020 the person will also be in that category, as 46 + 365 + 180 days have gone by.
The categories will change if the user selects another time dimension level e.g. half-years:
2nd HY 2018: 31-60 (46 days passed)
1st HY 2019: 180-365 days (46 + 180, up to the end of 1st HY 2019)
2nd HY 2019: 1-2 years (46 + 180 + 180)
1st HY 2020: 1-2 years (46 + 180 + 180 + 180)
Does someone know how to model dynamic dimension categories based on selection of another dimension (here time dimension)?
The fact table contains monthly data, and for the person mentioned above there will be 20 separate records (one for each month between November 2018 and June 2020).
For any period, a person may or may not be working on a project.
Without knowing exactly what your data and metadata are, it is somewhat difficult to prescribe an exact solution, but the approach would probably be similar to a degenerate dimension scenario.
You would want to model the project dimension as a fact as well as a dimension. You would have relationships between it and time and whatever other dimensions you need.
Depending on the data and the metadata you might need to do some gymnastics to get there.
If the data were in a form similar to this, it would not be too difficult. This is an example to give you an idea of some ways of approaching the problem. Here commitment_status would be the measure.
Date_Key Person_Key Project_Key commitment_status
20200101 1 1 1
20200101 1 2 0
20200101 1 3 0
20200102 1 1 1
20200102 1 2 0
20200102 1 3 0
20200103 1 1 0
20200103 1 2 1
20200103 1 3 0
In the above, person 1 was working on project 1 for 2 days and then put onto project 2 for a day. By aggregating the commitment status, which is done by setting the aggregate rule property, you would be able to determine the number of days a person has been working on a project no matter what time period you have set in your query.
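To make the aggregation concrete, here is a small Python sketch (not Cognos syntax) of what summing commitment_status over a chosen period amounts to. The function name days_committed and the tuple layout are assumptions for illustration; the rows are the sample data above.

```python
from collections import Counter

# Rows are (Date_Key, Person_Key, Project_Key, commitment_status).
facts = [
    (20200101, 1, 1, 1), (20200101, 1, 2, 0), (20200101, 1, 3, 0),
    (20200102, 1, 1, 1), (20200102, 1, 2, 0), (20200102, 1, 3, 0),
    (20200103, 1, 1, 0), (20200103, 1, 2, 1), (20200103, 1, 3, 0),
]


def days_committed(facts, start_key, end_key):
    """Sum commitment_status per (person, project) for dates in [start_key, end_key]."""
    totals = Counter()
    for date_key, person, project, status in facts:
        if start_key <= date_key <= end_key:
            totals[(person, project)] += status
    return totals


totals = days_committed(facts, 20200101, 20200103)
print(totals[(1, 1)], totals[(1, 2)], totals[(1, 3)])  # 2 1 0
```

Changing start_key and end_key plays the role of picking a different time level in the query; the duration category would then be derived from the resulting total.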

Cumulative sum by bins in obiee

I want to generate a report in OBIEE by grouping countries and showing their cumulative sum.
I tried creating bins: for China I created a bin which contains Singapore, Taiwan and China, and another bin for Japan containing some other countries. Using a pivot table I can show the sum of customers in a region by date for these two bins, but when I need a cumulative sum for every bin it gives weird values.
Number of employees by region and date, where china and japan are bins:
china japan
01-Nov-18 1 3
02-Nov-18 2 4
03-Nov-18 1 1
04-Nov-18 2 5
05-Nov-18 4 7
06-Nov-18 5 7
whereas the result I want is this (how can I achieve it?):
China Japan
01-Nov-18 1 3
02-Nov-18 3 7
03-Nov-18 4 8
04-Nov-18 6 13
05-Nov-18 10 20
06-Nov-18 15 27
The measure has to be a running sum in that case: RSUM(YourMeasure)
Make a duplicate layer of the measure column. Click on the duplicated column and select the option 'Display as Running Sum'.
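As a sanity check on what a running sum per bin should produce, here is a short Python sketch (not OBIEE syntax) applied to the numbers in the question. The dictionary layout is just an illustration.

```python
from itertools import accumulate

# Daily counts per bin, 01-Nov-18 through 06-Nov-18, taken from the question.
daily = {
    "china": [1, 2, 1, 2, 4, 5],
    "japan": [3, 4, 1, 5, 7, 7],
}

# A running sum is accumulated separately within each bin, in date order.
running = {name: list(accumulate(values)) for name, values in daily.items()}

print(running["china"])  # [1, 3, 4, 6, 10, 15]
print(running["japan"])  # [3, 7, 8, 13, 20, 27]
```

These match the desired table, so the key point is that the accumulation must restart for each bin and follow the date order.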

Exporting related tables from Access to one Excel worksheet

I am trying to set up an Access database and will be importing the data from Excel. We do our analysis in R and the current Excel worksheet we use is formatted and arranged to work well for exporting to R and doing analysis there.
The format is as follows:
The first 12 columns of data describe date, location and other information which then applies to the following 12 columns. The trouble is that for a single set of observations the information in the first 12 columns doesn't change from row to row, but the values in the second 12 columns do change from row to row.
year mm dd loc start end obs sess test object success
2013 5 15 park 1600 1700 MTM MTM1 1 ball y
2013 5 15 park 1600 1700 MTM MTM1 2 stick y
2013 5 15 park 1600 1700 MTM MTM1 3 rock n
2013 5 15 park 1600 1700 MTM MTM1 4 rock n
2013 5 15 park 1600 1700 MTM MTM1 5 stick y
2013 5 15 park 1600 1700 MTM MTM1 6 stick y
2013 6 24 yard 1500 1530 LFR LFR1 1 ball n
2013 6 24 yard 1500 1530 LFR LFR1 2 stick n
2013 6 24 yard 1500 1530 LFR LFR1 3 stick n
2013 6 24 yard 1500 1530 LFR LFR1 4 stick n
2013 6 24 yard 1500 1530 LFR LFR1 5 stick y
2013 6 24 yard 1500 1530 LFR LFR1 6 rock y
2013 6 24 yard 1500 1530 LFR LFR1 7 ball y
Above is an imaginary dataset which matches the format of the real one (the real one is too wide to fit here).
Notice that the entries for year, mm (month), dd (day), loc (location), start, end, obs (observer), and sess (session) all stay the same but test, object, and success change from row to row for a given set of observations.
In Access I would like to use a unique_ID (primary key) to relate tables so that the information for the first 8 columns need only be entered once and have it relate to each entry for the last 3 columns. In this example then, I have one Excel worksheet that will become two related Access tables (objects).
Before converting to Access though I would like to know that I will be able to export the data back to Excel (and/or directly to a text file) so that it will look just like this again. That is, I do NOT want to export multiple tables to separate Excel worksheets. I want all Access tables within my database to be exported to just one worksheet and in the format shown above. The reason for this is that we run analysis in R based on both the session and the instance levels (called different things in the real data, but that is the idea) so it is important for the location data and result data to be associated with every row in the output file (.xls or .csv).
Is this possible?
I am mostly looking for an outline of how this might be done. Specific code is not necessary, though your personal assessment of the complexity of the potential code (given my complete and absolute ignorance of VBA) that will be required would be appreciated.
The answer appears to be Yes. I just need to create a Query which will return the data to the original form and then export that query as a table.
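As a rough illustration of that query, here is a sketch using Python's built-in sqlite3 module as a stand-in for Access. The table names sessions and tests, the key session_id, and the renamed columns start_time and end_time are all assumptions made for the example; only a few rows of the imaginary dataset are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per session: the columns that repeat in the flat sheet.
    CREATE TABLE sessions (
        session_id INTEGER PRIMARY KEY,
        year, mm, dd, loc, start_time, end_time, obs, sess
    );
    -- One row per test, pointing back at its session.
    CREATE TABLE tests (
        session_id INTEGER REFERENCES sessions(session_id),
        test, object, success
    );
""")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [(1, 2013, 5, 15, "park", 1600, 1700, "MTM", "MTM1"),
     (2, 2013, 6, 24, "yard", 1500, 1530, "LFR", "LFR1")],
)
conn.executemany(
    "INSERT INTO tests VALUES (?, ?, ?, ?)",
    [(1, 1, "ball", "y"), (1, 2, "stick", "y"), (1, 3, "rock", "n"),
     (2, 1, "ball", "n"), (2, 2, "stick", "n")],
)

# The join repeats the session columns on every test row, like the original sheet.
flat = conn.execute("""
    SELECT s.year, s.mm, s.dd, s.loc, s.start_time, s.end_time, s.obs, s.sess,
           t.test, t.object, t.success
    FROM sessions AS s
    JOIN tests AS t ON t.session_id = s.session_id
    ORDER BY s.session_id, t.test
""").fetchall()

for row in flat:
    print(row)
```

In Access the equivalent would be a saved select query joining the two tables on the key, which can then be exported like a table.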

algorithm to calculate number pair sequence

I need to calculate a sequence of numbers (similar to Sudoku) to match teams to play each other.
I need to create a matrix for 8 and 9 teams and can't figure out the formula. I have to believe this is really simple, but I have no idea what to search for to find it.
Here is a working version for 7 teams:
team   | 1 2 3 4 5 6 7
======================
week 1 |   7 6 5 4 3 2
week 2 | 7   5 6 3 4 1
week 3 | 6 5   7 2 1 4
week 4 | 5 6 7   1 2 3
week 5 | 4 3 2 1   7 6
week 6 | 3 4 1 2 7   5
week 7 | 2 1 4 3 6 5
So for the first week, team 1 doesn't play (no available partner), team 2 plays team 7, team 3 plays team 6, etc.
For week 2, team 1 plays team 7, etc.
No team may play the other team. The event continues for as many weeks as we have teams, so 8 teams would play for 8 weeks.
Each team should play every other team once and only once. They can't play themselves (hence the blank entry in each row).
Note that the upper right triangle is a mirror of the bottom left triangle, but that still didn't help me determine the formula.
My guess is that if I spent enough hours, I could figure out the formula. But since this has to have been done a few million times by people over the ages, I am guessing that it's a well known algorithm and I just need to find someone who knows the name (so I can look it up) or can tell me what it is so I can create this for a friend who needs it.
Thanks!
The best answer so far is from Dennis Meng (I can't comment, so I have to use an answer). That link pointed me to a question where the answer worked, sort of. I don't have an algorithm yet, but the methodology worked adequately. I have my rows and columns. It doesn't provide me with a "mirror" image the way the example does. But it does give me a unique team for each week. I am hoping that will be enough.
I just used excel to lay it out as that was faster than trying to figure out the logic, write the code, and get a nice formatted result - especially since I only seem to need to do it once.
But if it turns out I need to do it again, I will write a simple application and post it here.
Of course, it would be great if I could get the routine that generated the above matrix....
Of course, that also leads me to another issue. How can I mark Dennis' comment as the answer???? He deserves the credit (unless someone chimes in with the mirror solution....)
Oh well, thanks Dennis!
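For anyone landing here later: the term to search for is round-robin tournament scheduling, and one standard construction is the circle method. Below is a minimal Python sketch of that method. It is not the routine that generated the 7-team matrix above and may not be what the linked answer describes; the function name round_robin is made up. With an odd number of teams a placeholder None is added, and whoever is paired with None sits out that week.

```python
def round_robin(n_teams):
    """Circle method: return a list of rounds, each a list of (team, team) pairs.

    Teams are numbered 1..n_teams. A pair containing None means that team has a bye.
    """
    teams = list(range(1, n_teams + 1))
    if len(teams) % 2:
        teams.append(None)  # placeholder opponent, i.e. a bye
    n = len(teams)
    rounds = []
    for _ in range(n - 1):
        # Pair first with last, second with second-to-last, and so on.
        rounds.append([(teams[i], teams[n - 1 - i]) for i in range(n // 2)])
        # Keep the first team fixed and rotate the others by one position.
        teams = [teams[0], teams[-1]] + teams[1:-1]
    return rounds


for week, pairs in enumerate(round_robin(8), start=1):
    print(week, pairs)
```

Note that a full single round robin needs n - 1 weeks for an even number of teams (7 weeks for 8 teams) and n weeks for an odd number (9 weeks for 9 teams, with one bye per week).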

Enumerate all partial orders

How to efficiently enumerate all partial orders on a finite set?
I want to check whether a partial order with specified properties exists. To check this I am going with brute force to enumerate all possible partial orders on small finite sets.
They will have to be really small finite sets for your project to be practical.
The number of labelled posets with n labelled elements is Sloane sequence A001035, whose values are known up to n=18:
0 1
1 1
2 3
3 19
4 219
5 4231
6 130023
7 6129859
8 431723379
9 44511042511
10 6611065248783
11 1396281677105899
12 414864951055853499
13 171850728381587059351
14 98484324257128207032183
15 77567171020440688353049939
16 83480529785490157813844256579
17 122152541250295322862941281269151
18 241939392597201176602897820148085023
Sequence A000112 is the number of unlabelled posets; unsurprisingly, the numbers are smaller but still rapidly grow out of reach. They seem to be known only up to n=16; p16 is 4483130665195087.
There is an algorithm in a paper by Gunnar Brinkmann and Brendan McKay, listed in the references on the OEIS A000112 page, linked above. The work was done in 2002, using about 200 workstations, and counting the 4483130665195087 unlabelled posets of size 16 took about 30 machine-years (the reference machine is a 1 GHz Pentium III). Today, it could be done faster, but then the value of p17 is presumably about two decimal orders of magnitude bigger.
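For very small sets, plain brute force is enough to get started. The sketch below is a naive Python enumeration, not the Brinkmann and McKay algorithm: it tries every set of ordered pairs of distinct elements and keeps those that form a strict partial order (asymmetric and transitive). Strict partial orders correspond one-to-one with partial orders, so the counts should match the labelled numbers above. The function name strict_partial_orders is made up for illustration.

```python
from itertools import product


def strict_partial_orders(n):
    """Yield every strict partial order on {0, ..., n-1} as a set of pairs (a, b), meaning a < b."""
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    for bits in product((False, True), repeat=len(pairs)):
        rel = {p for p, keep in zip(pairs, bits) if keep}
        # Asymmetry: a < b and b < a cannot both hold.
        if any((b, a) in rel for (a, b) in rel):
            continue
        # Transitivity: a < b and b < d must imply a < d.
        if all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c):
            yield rel


counts = [sum(1 for _ in strict_partial_orders(n)) for n in range(5)]
print(counts)  # [1, 1, 3, 19, 219]
```

This examines 2^(n(n-1)) candidate relations, so it is only usable up to about n = 5. A property check can be added as one more filter inside the loop.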
