Replace missing values by a reference value for each id in panel data

I have panel data:
Id | Wave| Localisation| Baseline
1 | 1 | AA | 1
1 | 2 | . | 0
1 | 3 | . | 0
2 | 2 | AB | 1
2 | 3 | . | 0
3 | 1 | AB | 1
3 | 3 | . | 0
4 | 2 | AC | 1
4 | 3 | . | 0
Some variable values for a panel (hhsize, localisation, whatever) serve as a reference (these values are recorded in the baseline interview only).
As a consequence, we do not have all the information for each id. For id==1, for instance, the column Localisation has missing values in the non-baseline interviews (Baseline==0).
I would like to spread the baseline values across each panel. That is, I want to replace the missing values . in column Localisation with the Localisation given in the baseline interview for each id.
In fact, Localisation remains the same across waves, so it is useful to know the localisation in each wave for each person.

If the data pattern you present holds for the entire dataset (the baseline interview is always the earliest wave recorded for each Id), then the following should work:
clear
input Id Wave str2 Localisation Baseline
1 1 AA 1
1 2 . 0
1 3 . 0
2 2 AB 1
2 3 . 0
3 1 AB 1
3 3 . 0
4 2 AC 1
4 3 . 0
end
bysort Id (Wave): replace Localisation = Localisation[1] if Localisation == "."
list, sepby(Id) abbreviate(15)
     +-------------------------------------+
     | Id   Wave   Localisation   Baseline |
     |-------------------------------------|
  1. |  1      1             AA          1 |
  2. |  1      2             AA          0 |
  3. |  1      3             AA          0 |
     |-------------------------------------|
  4. |  2      2             AB          1 |
  5. |  2      3             AB          0 |
     |-------------------------------------|
  6. |  3      1             AB          1 |
  7. |  3      3             AB          0 |
     |-------------------------------------|
  8. |  4      2             AC          1 |
  9. |  4      3             AC          0 |
     +-------------------------------------+
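The Stata one-liner relies on the baseline row sorting first within each Id. As a language-neutral illustration of the same idea, here is a minimal Python sketch that instead looks up the row flagged Baseline == 1 explicitly. The function name fill_from_baseline and the use of None for a missing value are assumptions made for this example only; they are not part of the original answer.

```python
# Rows mirror the example data; None stands in for the missing value ".".
rows = [
    {"Id": 1, "Wave": 1, "Localisation": "AA", "Baseline": 1},
    {"Id": 1, "Wave": 2, "Localisation": None, "Baseline": 0},
    {"Id": 1, "Wave": 3, "Localisation": None, "Baseline": 0},
    {"Id": 2, "Wave": 2, "Localisation": "AB", "Baseline": 1},
    {"Id": 2, "Wave": 3, "Localisation": None, "Baseline": 0},
    {"Id": 3, "Wave": 1, "Localisation": "AB", "Baseline": 1},
    {"Id": 3, "Wave": 3, "Localisation": None, "Baseline": 0},
    {"Id": 4, "Wave": 2, "Localisation": "AC", "Baseline": 1},
    {"Id": 4, "Wave": 3, "Localisation": None, "Baseline": 0},
]


def fill_from_baseline(rows, key="Localisation"):
    """Copy each Id's baseline value of `key` into its rows where it is missing."""
    # First pass: remember the value recorded at the baseline interview.
    baseline_value = {r["Id"]: r[key] for r in rows if r["Baseline"] == 1}
    # Second pass: fill only the gaps, leaving recorded values untouched.
    return [
        {**r, key: baseline_value.get(r["Id"])} if r[key] is None else dict(r)
        for r in rows
    ]


filled = fill_from_baseline(rows)
for r in filled:
    print(r["Id"], r["Wave"], r["Localisation"], r["Baseline"])
```

Keying on the Baseline flag rather than on sort order means the sketch does not depend on the baseline interview being the first wave.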


Multiply Matrices in DAX

Suppose I have two matrices MatrixA and MatrixB given as follows (where i is the row number and j is the column number):
MatrixA | MatrixB
i | j | val | i | j | val
---|---|---- | ---|---|----
1 | 1 | 3 | 1 | 1 | 2
1 | 2 | 5 | 1 | 2 | 3
1 | 3 | 9 | 2 | 1 | 7
2 | 1 | 2 | 2 | 2 | -1
2 | 2 | 1 | 3 | 1 | 0
2 | 3 | 3 | 3 | 2 | -4
3 | 1 | -1 |
3 | 2 | 2 |
3 | 3 | 0 |
4 | 1 | 7 |
4 | 2 | 0 |
4 | 3 | 6 |
In a more familiar form, they look like this:
MatrixA =  3  5  9        MatrixB =  2  3
           2  1  3                   7 -1
          -1  2  0                   0 -4
           7  0  6
I'd like to calculate their product (which is demonstrated in this YouTube video):
Product = 41 -32
          11  -7
          12  -5
          14  -3
In the unpivoted column form I used earlier, this is
i | j | val
---|---|----
1 | 1 | 41
1 | 2 | -32
2 | 1 | 11
2 | 2 | -7
3 | 1 | 12
3 | 2 | -5
4 | 1 | 14
4 | 2 | -3
I'm looking for a general calculation that multiplies any compatible k x n and n x m matrices together as a calculated table.
I think I've got it figured out. If MatrixA is k x n and MatrixB is n x m dimensional:
Product =
ADDCOLUMNS(
    CROSSJOIN(VALUES(MatrixA[i]), VALUES(MatrixB[j])),
    "val",
    SUMX(
        ADDCOLUMNS(
            SELECTCOLUMNS(GENERATESERIES(1, DISTINCTCOUNT(MatrixA[j])), "Index", [Value]),
            "A", LOOKUPVALUE(MatrixA[val], MatrixA[i], [i], MatrixA[j], [Index]),
            "B", LOOKUPVALUE(MatrixB[val], MatrixB[i], [Index], MatrixB[j], [j])),
        [A] * [B]))
The CROSSJOIN creates a new table with columns [i] and [j] that has k x m rows. For each (i, j) pair in this cross-join table, the value of that cell is computed as the sum product of row i of MatrixA with column j of MatrixB. The GENERATESERIES part just creates an Index list whose length is the matching dimension n.
For example, when i = 3 and j = 2, the middle section for the given example is
ADDCOLUMNS(
    SELECTCOLUMNS(GENERATESERIES(1, DISTINCTCOUNT(MatrixA[j])), "Index", [Value]),
    "A", LOOKUPVALUE(MatrixA[val], MatrixA[i], 3, MatrixA[j], [Index]),
    "B", LOOKUPVALUE(MatrixB[val], MatrixB[i], [Index], MatrixB[j], 2))
which generates the table
Index | A | B
------|-----|----
1 | -1 | 3
2 | 2 | -1
3 | 0 | -4
where the [A] column is the 3rd row of MatrixA and the [B] column is the 2nd column of MatrixB.
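To check the arithmetic outside DAX, the same sum-product over the matching index can be sketched in Python on the unpivoted (i, j, val) form. This is an illustration only: the dictionaries use the values from the "familiar form" of the two matrices, and the names matrix_a, matrix_b and multiply are invented for this example.

```python
from collections import defaultdict

# Unpivoted form: (i, j) -> val, taken from the matrices shown in familiar form.
matrix_a = {
    (1, 1): 3, (1, 2): 5, (1, 3): 9,
    (2, 1): 2, (2, 2): 1, (2, 3): 3,
    (3, 1): -1, (3, 2): 2, (3, 3): 0,
    (4, 1): 7, (4, 2): 0, (4, 3): 6,
}
matrix_b = {
    (1, 1): 2, (1, 2): 3,
    (2, 1): 7, (2, 2): -1,
    (3, 1): 0, (3, 2): -4,
}


def multiply(a, b):
    """Multiply two matrices stored as {(row, col): value} dictionaries."""
    product = defaultdict(int)
    for (i, k), a_val in a.items():
        for (k2, j), b_val in b.items():
            # Only pair a column index of A with the same row index of B.
            if k == k2:
                product[(i, j)] += a_val * b_val
    return dict(product)


product = multiply(matrix_a, matrix_b)
for (i, j) in sorted(product):
    print(i, j, product[(i, j)])
```

The inner condition k == k2 plays the role of the shared [Index] column in the DAX expression.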

Expand rectangles as much as possible to cover another rectangle, minimizing overlap

Given a tiled, x- and y-aligned rectangle and (potentially) a starting set of other rectangles which may overlap, I'd like to find a set of rectangles so that:
if no starting rectangle exists, one might be created; otherwise do not create additional rectangles
each of the rectangles in the starting set is expanded as much as possible
the overlap is minimal
the whole tiled rectangle's area is covered.
This smells a lot like a set cover problem, but it still is... different.
The key is that each starting rectangle's area has to be maximized while still minimizing overall overlap. A good solution strikes a balance between necessary overlaps and large rectangle sizes.
I'd propose a rating function such as that:
Higher is better.
Examples (assumes a rectangle tiled into a 4x4 grid; numbers in tiles denote starting rectangle "ID"):
easiest case: no starting rectangles provided, can just create one and expand it fully:
.---------------.        .---------------.
|   |   |   |   |        | 1 | 1 | 1 | 1 |
|---|---|---|---|        |---|---|---|---|
|   |   |   |   |        | 1 | 1 | 1 | 1 |
|---|---|---|---|   =>   |---|---|---|---|
|   |   |   |   |        | 1 | 1 | 1 | 1 |
|---|---|---|---|        |---|---|---|---|
|   |   |   |   |        | 1 | 1 | 1 | 1 |
·---------------·        ·---------------·
rating: 16 * 1 - 0 = 16
more sophisticated:
.---------------.        .---------------.        .---------------.
| 1 | 1 |   |   |        | 1 | 1 | 1 | 1 |        | 1 | 1 | 2 | 2 |
|---|---|---|---|        |---|---|---|---|        |---|---|---|---|
| 1 | 1 |   |   |        | 1 | 1 | 1 | 1 |        | 1 | 1 | 2 | 2 |
|---|---|---|---|   =>   |---|---|---|---|   or   |---|---|---|---|
|   |   | 2 | 2 |        | 2 | 2 | 2 | 2 |        | 1 | 1 | 2 | 2 |
|---|---|---|---|        |---|---|---|---|        |---|---|---|---|
|   |   | 2 | 2 |        | 2 | 2 | 2 | 2 |        | 1 | 1 | 2 | 2 |
·---------------·        ·---------------·        ·---------------·
ratings:                 (4 + 4) * 2 - 0 = 16     (4 + 4) * 2 - 0 = 16
pretty bad situation, with initial overlap:
.-----------------.        .-----------------------.
|  1  |   |   |   |        |  1  |  1  |  1  |  1  |
|-----|---|---|---|        |-----|-----|-----|-----|
| 1,2 | 2 |   |   |        | 1,2 | 1,2 | 1,2 | 1,2 |
|-----|---|---|---|   =>   |-----|-----|-----|-----|
|     |   |   |   |        |  2  |  2  |  2  |  2  |
|-----|---|---|---|        |-----|-----|-----|-----|
|     |   |   |   |        |  2  |  2  |  2  |  2  |
·-----------------·        ·-----------------------·
rating: (8 + 12) * 2 - (2 + 2 + 2 + 2) = 40 - 8 = 32
covering with 1 only:
        .-----------------------.
        |  1  |  1  |  1  |  1  |
        |-----|-----|-----|-----|
        | 1,2 | 1,2 |  1  |  1  |
   =>   |-----|-----|-----|-----|
        |  1  |  1  |  1  |  1  |
        |-----|-----|-----|-----|
        |  1  |  1  |  1  |  1  |
        ·-----------------------·
rating: (16 + 2) * 1 - (2 + 2) = 18 - 4 = 14
more starting rectangles, also overlap:
.-----------------.        .---------------------.
| 1 | 1,2 | 2 |   |        | 1 | 1,2 | 1,2 | 1,2 |
|---|-----|---|---|        |---|-----|-----|-----|
| 1 |  1  |   |   |        | 1 |  1  |  1  |  1  |
|---|-----|---|---|   =>   |---|-----|-----|-----|
| 3 |     |   |   |        | 3 |  3  |  3  |  3  |
|---|-----|---|---|        |---|-----|-----|-----|
|   |     |   |   |        | 3 |  3  |  3  |  3  |
·-----------------·        ·---------------------·
rating: (8 + 3 + 8) * 3 - (2 + 2 + 2) = 57 - 6 = 51
The starting rectangles may be located anywhere in the tiled rectangle and have any size (minimum bound 1 tile).
The starting grid might be as big as 33x33 currently, though potentially bigger in the future.
I haven't been able to reduce this problem instance to a well-known problem, but that may just be my own inability.
My current approach to solve this in an efficient way would go like this:
if list of starting rects empty:
    create starting rect in tile (0,0)
for each starting rect:
    calculate the distances in x and y direction to the next object (or wall)
sort distances in ascending order
while free space:
    pick rect with lowest distance
    expand it in lowest distance direction
I'm unsure if this gives the optimal solution or really is the most efficient one... and naturally if there are edge cases this approach would fail on.
Proposed attack. Your mileage may vary. Shipping costs higher outside the EU.
Make a list of open tiles
Make a list of rectangles (dimension & corners)
We're going to try making +1 growth steps: expand some rectangle one unit in a chosen direction. In each iteration, find the +1 with the highest score. Iterate until the entire room (large rectangle) is covered.
Scoring suggestions:
Count the squares added by the extension: open squares are +1; occupied squares are -1 for each other rectangle overlapped.
For instance, in this starting position:
- - 3 3
1 1 12 -
- - 2 -
...if we try to extend rectangle 3 down one row, we get +1 for the empty square on the right, but -2 for overlapping both 1 and 2.
Divide this score by the current rectangle area. In the example above, we would have (+1 - 2) / (1*2), or -1/2 as the score for that move ... not a good idea, probably.
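As a rough illustration of this scoring rule, here is a minimal Python sketch. The rectangle encoding (top, left, bottom, right, inclusive) and the name score_move are assumptions made for this example, not part of the proposal, and the caller is assumed to have already checked that the move stays inside the room.

```python
# Each rectangle is (top, left, bottom, right) in inclusive tile coordinates.
# Layout from the example above: rows 0..2, columns 0..3.
rects = {
    1: (1, 0, 1, 2),  # three tiles of the middle row
    2: (1, 2, 2, 2),  # two tiles of the third column
    3: (0, 2, 0, 3),  # two tiles of the top row
}


def covers(rect, tile):
    """Return True if the tile (row, col) lies inside the rectangle."""
    top, left, bottom, right = rect
    return top <= tile[0] <= bottom and left <= tile[1] <= right


def score_move(rects, rect_id, direction):
    """Score a one-unit extension of rect_id in direction U, D, L or R."""
    top, left, bottom, right = rects[rect_id]
    # Tiles that the extension would add to the rectangle.
    if direction == "U":
        new_tiles = [(top - 1, c) for c in range(left, right + 1)]
    elif direction == "D":
        new_tiles = [(bottom + 1, c) for c in range(left, right + 1)]
    elif direction == "L":
        new_tiles = [(r, left - 1) for r in range(top, bottom + 1)]
    else:  # "R"
        new_tiles = [(r, right + 1) for r in range(top, bottom + 1)]

    raw = 0
    for tile in new_tiles:
        # Count the other rectangles already sitting on this tile.
        others = sum(
            covers(rect, tile) for rid, rect in rects.items() if rid != rect_id
        )
        # Open tile: +1. Occupied tile: -1 per overlapped rectangle.
        raw += 1 if others == 0 else -others

    area = (bottom - top + 1) * (right - left + 1)
    return raw / area


print(score_move(rects, 3, "D"))  # (+1 - 2) / 2
print(score_move(rects, 3, "L"))  # (+1 - 0) / 2
```

Dividing by the current area makes the same gain worth less to a rectangle that is already large, which is what lets small rectangles grow first.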
The entire first iteration would consider the moves below; directions are Up-Down-Left-Right
rect dir score
1 U 0.33 = (2-1)/3
1 D 0.33 = (2-1)/3
1 R 0.33 = (1-0)/3
2 U -0.50 = (0-1)/2
2 L 0.00 = (1-1)/2
2 R 0.50 = (2-1)/2
3 D -0.50 = (1-2)/2
3 L 0.50 = (1-0)/2
We have a tie for best score: 2 R and 3 L. I'll add a minor criterion of taking the greater expansion, 2 tiles over 1. This gives:
- - 3 3
1 1 12 2
- - 2 2
For the second iteration:
rect dir score
1 U 0.33 = (2-1)/3
1 D 0.33 = (2-1)/3
1 R -0.33 = (0-1)/3
2 U -0.50 = (0-2)/4
2 L 0.00 = (1-1)/4
3 D -1.00 = (0-2)/2
3 L 0.50 = (1-0)/2
Naturally, the tie from last time is now the sole top choice, since the two did not conflict:
- 3 3 3
1 1 12 2
- - 2 2
Possible optimization: If a +1 has no overlap, extend it as far as you can (avoiding overlap) before computing scores.
In the final two iterations, we will similarly get 3 L and 1 D as our choices, finishing with
3 3 3 3
1 1 12 2
1 1 2 2
Note that this algorithm will not get the same answer for your "pretty bad example": this one will cover the entire room with 2, reducing to only 2 overlap squares. If you'd rather have 1 expand in that case, we'll need a factor for the proportion of another rectangle that you're covering, instead of my constant value of 1.
Does that look like a tractable starting point for you?

Access element of matrix by value

Let's say I define a matrix:
matrix a = (2,3 \ 4,7 \ 6,13)
I can access "13" like this:
display a[3,2]
Is it also possible to access "13" while referring to "6" to specify the row? In other words, we would somehow signify that the row is the row (there could be more than one) that contains a 6 in the first column and then we want the second column of this row.
In R, we might do it like this:
a1 <- data.frame(c(2,4,6), c(3,7,13))
a1[a1[,1]==6, 2]
Is there anything analogous in Stata?
You could do this with Stata's matrix language, with some programming, but I would turn to Mata whose defined functions allow direct solutions similar in spirit to R. Consider this dialogue.
. mata
------------------------------------------------- mata (type end to exit) --------------
: a = (2,3 \ 4,7 \ 6,13)
: a :== 1
1 2
+---------+
1 | 0 0 |
2 | 0 0 |
3 | 0 0 |
+---------+
: a :== 6
1 2
+---------+
1 | 0 0 |
2 | 0 0 |
3 | 1 0 |
+---------+
: rowsum(a :== 6)
1
+-----+
1 | 0 |
2 | 0 |
3 | 1 |
+-----+
: select(a, rowsum(a :== 6))
1 2
+-----------+
1 | 6 13 |
+-----------+
: a2 = select(a, rowsum(a :== 6))
: a2[, 2]
13
: b = (6,6 \ 6,6 \ 6,6)
: select(b, rowsum(b :== 6))
1 2
+---------+
1 | 6 6 |
2 | 6 6 |
3 | 6 6 |
+---------+
: b2 = select(b, rowsum(b :== 6))
: b2[, 2]
1
+-----+
1 | 6 |
2 | 6 |
3 | 6 |
+-----+
"row contains a 6" is determined by summing "element is equal to 6" across each row; select() keeps the rows for which that sum is nonzero. Note that the code works if (a) there is more than one 6 on a row and/or (b) there is more than one row with a 6. In the last case, what is selected contains more than one element.
The notation should seem self-explanatory, except possibly that : as a prefix signals "elementwise" operations. To copy a Stata matrix into Mata, use st_matrix().
Note: Working out what the code should be to select in the first column only is set as an exercise for the zealous.
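For comparison only, the two selection rules (a 6 anywhere in the row versus a 6 in the first column) can be sketched in plain Python. The function names below are invented for this illustration and are not Stata or Mata functions.

```python
a = [[2, 3], [4, 7], [6, 13]]
b = [[6, 6], [6, 6], [6, 6]]


def rows_containing(matrix, value):
    """Keep rows in which any element equals value (like rowsum(a :== 6) being nonzero)."""
    return [row for row in matrix if value in row]


def rows_with_first_column(matrix, value):
    """Keep rows whose first column equals value."""
    return [row for row in matrix if row[0] == value]


# Second column of the selected rows, as in a2[, 2] and b2[, 2].
print([row[1] for row in rows_containing(a, 6)])
print([row[1] for row in rows_containing(b, 6)])
print([row[1] for row in rows_with_first_column(a, 6)])
```

The two rules agree on this data; they would differ for a row such as (13, 6), which contains a 6 but not in its first column.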

Stata: need help creating a binary variable from panel data

I have a dataset in which a household id (hhid) and a member id (mid) identify a unique person. I have results from two separate surveys taken a year apart (surveyYear). I also have data on whether or not the individual was enrolled in school at the time.
I want a binary variable which signifies if the individual in question dropped out of school between the surveys (i.e. 1 if dropped and 0 if still in school)
I have a decent understanding of Stata but this coding challenge seems a little beyond me because I am not sure how to compare the in-school status of the later id with the earlier id and then propagate that result into a binary column.
Here is an example of what I need
Previously:
+----------------------------------+
| hhid mid survey~r inschool |
|----------------------------------|
1. | 1 2 3 1 |
2. | 1 2 4 1 |
3. | 1 3 3 1 |
4. | 1 3 4 1 |
5. | 2 1 3 1 |
6. | 2 1 4 0 |
7. | 2 2 3 0 |
8. | 2 2 4 0 |
+----------------------------------+
After:
+--------------------------------------------+
| hhid mid survey~r inschool dropped |
|--------------------------------------------|
1. | 1 2 3 1 0 |
2. | 1 2 4 1 0 |
3. | 1 3 3 1 0 |
4. | 1 3 4 1 0 |
5. | 2 1 3 1 1 |
6. | 2 1 4 0 1 |
7. | 2 2 3 0 0 |
8. | 2 2 4 0 0 |
+--------------------------------------------+
bysort hhid mid (surveyYear) : gen dropped = inschool[1] == 1 & inschool[2] == 0
The commentary is longer than the code:
Within blocks of observations with the same hhid and mid, sort by surveyYear.
You want students who were inschool in year 3 but not in year 4. So, inschool is 1 in the first observation and 0 in the second.
Here subscripting [1] and [2] refers to order within blocks of observations defined by the by: statement.
If further detail is needed see e.g. this article. Note that contrary to one tag, no loop is needed (or, if you wish, that the loop over possibilities is built in to the by: framework).
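The by-group logic can also be spelled out step by step. The following Python sketch mirrors it on the example data; the tuple layout (hhid, mid, surveyYear, inschool) and the name flag_dropped are assumptions made for this illustration.

```python
from itertools import groupby

# (hhid, mid, surveyYear, inschool), as in the "Previously" listing.
records = [
    (1, 2, 3, 1), (1, 2, 4, 1),
    (1, 3, 3, 1), (1, 3, 4, 1),
    (2, 1, 3, 1), (2, 1, 4, 0),
    (2, 2, 3, 0), (2, 2, 4, 0),
]


def flag_dropped(records):
    """Append dropped = 1 to every row of a person who was in school first, then not."""
    out = []
    # Sort by person, then by survey year within person.
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        group = list(group)
        # Compare the first and second observation of the person.
        dropped = int(len(group) >= 2 and group[0][3] == 1 and group[1][3] == 0)
        # The same flag is written to every row of the group.
        out.extend(r + (dropped,) for r in group)
    return out


for row in flag_dropped(records):
    print(row)
```

A person observed only once gets 0 here, which matches the Stata expression: inschool[2] is then missing, so the comparison with 0 is false.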

How to understand this style of K-map

I have seen a different style of Karnaugh Map for logic design. This is the style they used:
Does anyone know how this K-map is constructed? How should I read this kind of map, and how was that equation derived from it? The map is quite different from the common layout, like this:
The maps relate to each other as shown below; the only difference is the indexes of the cells (terms) corresponding to the variables, i.e. the order of the variables.
The exclamation mark is only an alternative to the negation of a variable. !A is the same as ¬A, also sometimes noted A'.
       !A   A    A    !A         ↓CD\AB →    00   01   11   10
     +----+----+----+----+                 +----+----+----+----+
  !B | 1  | 0  | 1  | 0  | !D           00 | 1  | 1  | 1  | 0  |
     +----+----+----+----+                 +----+----+----+----+
   B | 1  | 1  | 1  | 1  | !D     ~     01 | 1  | x  | x  | 1  |
     +----+----+----+----+                 +----+----+----+----+
   B | x  | x  | x  | x  | D            11 | x  | x  | x  | x  |
     +----+----+----+----+                 +----+----+----+----+
  !B | 1  | 1  | x  | x  | D            10 | 0  | 1  | 1  | 1  |
     +----+----+----+----+                 +----+----+----+----+
       !C   !C   C    C
If you are unsure of the indexes in a given K-map, you can always check them by writing out the corresponding truth table.
For example, the output value of the first cell in the "strange" K-map equals 1 when !A·!B·!C·!D holds (all variables negated), which corresponds to the first line of the truth table, so its index is 0. And so on.
index | A B C D | y
=======+=========+===
0 | 0 0 0 0 | 1
1 | 0 0 0 1 | 1
2 | 0 0 1 0 | 0
3 | 0 0 1 1 | x ~ 'do not care' state/output
-------+---------+---
4 | 0 1 0 0 | 1
5 | 0 1 0 1 | x
6 | 0 1 1 0 | 1
7 | 0 1 1 1 | x
-------+---------+---
8 | 1 0 0 0 | 0
9 | 1 0 0 1 | 1
10 | 1 0 1 0 | 1
11 | 1 0 1 1 | x
-------+---------+---
12 | 1 1 0 0 | 1
13 | 1 1 0 1 | x
14 | 1 1 1 0 | 1
15 | 1 1 1 1 | x
You can use the map the same way you would use the "normal" K-map to find the implicants (groups), because the indexing of any K-map has to follow Gray code.
You can see the simplified boolean expression is the same in both styles of these K-maps:
f(A,B,C,D) = !A·!C + A·C + B + D = ¬A·¬C + A·C + B + D
!A·!C is marked red,
A·C blue,
B orange
and D green.
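The simplified expression can be checked against the truth table mechanically. The sketch below is an illustration only: it encodes the table with None for the "do not care" rows, and the names truth, f and bits are invented for this example.

```python
# index -> y from the truth table; None marks a "do not care" output.
truth = {
    0: 1, 1: 1, 2: 0, 3: None,
    4: 1, 5: None, 6: 1, 7: None,
    8: 0, 9: 1, 10: 1, 11: None,
    12: 1, 13: None, 14: 1, 15: None,
}


def f(a, b, c, d):
    """f(A,B,C,D) = !A·!C + A·C + B + D"""
    return int(bool((not a and not c) or (a and c) or b or d))


def bits(index):
    """Split a row index into (A, B, C, D), with A as the most significant bit."""
    return (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1


# Rows where the expression disagrees with a specified (non-x) output.
mismatches = [i for i, y in truth.items() if y is not None and f(*bits(i)) != y]
print(mismatches)
```

Only the rows with a specified output are compared, since the expression is free to return either value on the "do not care" rows.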
The K-maps were generated using LaTeX's \karnaughmap command and the TikZ library.
It's the same in principle; just the rows and columns (or the variables) are in a different order.
The red labels are for when the variable is true, the blue for when it's false.
It's actually the same map, but instead of A they have C, instead of B they have A, instead of C they have D, and instead of D they have B.
