Reorganise panel data in Stata - panel

I am working on my own data in Stata and I am having trouble restructuring it into a panel format.
More specifically, I have a dataset that has a first column of years, a second column of individuals, and the remaining columns hold the variables.
The first lines of the dataset contain the observations of the first year (y1) for all the individuals in my sample. The following lines contain the observations of the second year (y2) for all the individuals, and the lines after that contain the observations of the third year (y3) for all the individuals.
I want a dataset where the first lines contain the observations of the first individual over all years, the following lines contain the observations of the second individual over all years, the lines after that contain the observations of the third individual over all years, and so on.
Here is an example: I need to change the format of this dataset:
year id var1 var2 var3
y1 1 .. .. ..
y1 2 .. .. ..
y1 3 .. .. ..
y2 1 .. .. ..
y2 2 .. .. ..
y2 3 .. .. ..
y3 1 .. .. ..
y3 2 .. .. ..
y3 3 .. .. ..
into this format:
year id var1 var2 var3
y1 1 .. .. ..
y2 1 .. .. ..
y3 1 .. .. ..
y1 2 .. .. ..
y2 2 .. .. ..
y3 2 .. .. ..
y1 3 .. .. ..
y2 3 .. .. ..
y3 3 .. .. ..

To close this out with an answer: This question seems to be about sorting the data, so
sort id year
yields the desired result.
You may also want to consult some guides -- e.g., help gs.
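For readers who want to see the same reordering outside Stata, here is a minimal Python sketch. The rows are made-up sample data that only mirror the layout in the question (they are not the asker's dataset), and the function name is hypothetical. It orders by id first and by year second, which is the ordering `sort id year` asks for.

```python
# Hypothetical rows mirroring the question's layout: (year, id).
records = [
    ("y1", 1), ("y1", 2), ("y1", 3),
    ("y2", 1), ("y2", 2), ("y2", 3),
    ("y3", 1), ("y3", 2), ("y3", 3),
]


def sort_by_id_then_year(rows):
    """Order rows by id, breaking ties by year (the ordering of `sort id year`)."""
    return sorted(rows, key=lambda row: (row[1], row[0]))


panel = sort_by_id_then_year(records)
for year, ident in panel:
    print(year, ident)
```

Note that the years here are strings, so they are compared alphabetically; with numeric years the same key works unchanged.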


Truth table of f(x1,x2,x3,x4) function from given two (4-1) multiplexers

Given two 4-to-1 multiplexers, how can I get the truth table of the function f(x1,x2,x3,x4)?
A 4-1 multiplexer has the following general truth-table:
A1 A0 Y
0 0 I0
0 1 I1
1 0 I2
1 1 I3
The two control inputs A0 and A1 select which of the four inputs is switched through to the output.
To get your question solved, start with the left-hand multiplexer and write a truth-table for it.
In a second step write the overall truth-table by plugging in the intermediate signal values in the general table shown above.
The resulting truth-table has four input columns X1, X2, X3, X4.
There is one output column Y. Rather than using intermediate truth-tables you could figure out the output value for each of the 16 input combinations.
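The enumeration of all 16 input combinations can be sketched in Python. The circuit from the question is not reproduced here, so the wiring inside `f` below is purely hypothetical and only illustrates the method; substitute the real connections from your diagram. Only `mux4` follows the general truth-table given above.

```python
from itertools import product


def mux4(a1, a0, i0, i1, i2, i3):
    """4-to-1 multiplexer: A1 A0 = 00 -> I0, 01 -> I1, 10 -> I2, 11 -> I3."""
    return (i0, i1, i2, i3)[2 * a1 + a0]


def f(x1, x2, x3, x4):
    # HYPOTHETICAL wiring, for illustration only (the real circuit is not shown):
    # the left multiplexer is controlled by x1, x2 and fed x3, x4, 0, 1;
    # its output g feeds input I0 of the right multiplexer, controlled by x3, x4.
    g = mux4(x1, x2, x3, x4, 0, 1)
    return mux4(x3, x4, g, 0, 1, x1)


# One row per input combination: (x1, x2, x3, x4, y).
truth_table = [
    (x1, x2, x3, x4, f(x1, x2, x3, x4))
    for x1, x2, x3, x4 in product((0, 1), repeat=4)
]

print("X1 X2 X3 X4 | Y")
for x1, x2, x3, x4, y in truth_table:
    print(f" {x1}  {x2}  {x3}  {x4} | {y}")
```

Writing each multiplexer as a function keeps the intermediate signal explicit, which corresponds to the two-step approach, while the loop over `product` corresponds to evaluating the 16 combinations directly.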

Find top documents which matched the query of words

In this problem we have 1,000,000 documents.
Each document has:
-Text (contains a lot of words)
-Date
-DocId
.. and so on
and we have a query which has some words (max 1000):
The problem is that we first have to find the intersection between each document and the query, and then return the top K documents that have the largest number of matched words.
For Example:
D1 - w1, w2, w3, w4, ... wn
D2 - w2, w4, w5, x2
D3 - a1, a2, w1, x1, x2
Q(w1,a1,w4,w5,x1,w5,w6)
so now doing the intersection of queries and docs
D1 - w1,w4,w5,w6 - 4 match
D2 - w4,w5 - 2 match
D3 - a1,x1,w1 - 3 match
So top 2 Docs are D1 and D3
I tried to put the word-to-document mapping in a 2D matrix:
    D1 D2 D3
w1   1  0  1
w2   1  1  0
w3   1  0  0
.
.
.
a1   0  0  1
a2   0  0  1
x1   0  0  1
x2   0  1  1
From this matrix, I tried to compute the match counts, but the interviewer was not happy.
Please help!
If you have to program it yourself, you'd probably build a hash table with the 1000 words, then go through the documents and check all words for matches. Keep a list of the k best matches around and update it after each document.
In real life, I would stuff the documents into a PostgreSQL database, create a full text search index on the text and run an SQL query containing the search words. Why reinvent the wheel?
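The hash-table approach described above might look roughly like this in Python. This is a sketch, not a tuned implementation: the function name and the dictionary layout are assumptions, and D1's "... wn" from the question is assumed to run up to w6. A set gives constant-time membership checks for the query words, and a min-heap of size k holds the best matches seen so far.

```python
import heapq


def top_k_documents(documents, query_words, k):
    """Return up to k (doc_id, match_count) pairs with the most distinct query words."""
    if k <= 0:
        return []
    query = set(query_words)  # hash-based lookup; duplicate query words collapse
    heap = []  # min-heap of (match_count, doc_id), never larger than k
    for doc_id, words in documents.items():
        matches = len(query.intersection(words))
        if len(heap) < k:
            heapq.heappush(heap, (matches, doc_id))
        elif (matches, doc_id) > heap[0]:
            # Better than the weakest kept document: swap it in.
            heapq.heapreplace(heap, (matches, doc_id))
    return [(doc_id, count) for count, doc_id in sorted(heap, reverse=True)]


# Sample data from the question; D1 is assumed to contain w1 through w6.
documents = {
    "D1": ["w1", "w2", "w3", "w4", "w5", "w6"],
    "D2": ["w2", "w4", "w5", "x2"],
    "D3": ["a1", "a2", "w1", "x1", "x2"],
}
query = ["w1", "a1", "w4", "w5", "x1", "w5", "w6"]
top_two = top_k_documents(documents, query, 2)
print(top_two)
```

Each document is scanned once and the heap never grows beyond k entries, so memory stays small even with a million documents. Ties on the match count are broken by document id here, which is an arbitrary choice.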

Pandas multiindex sort

In Pandas 0.19 I have a large dataframe with a Multiindex of the following form
         C0  C1  C2
A   B
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
I want to sort the columns within each pair of rows, such as bar and foo (there are many more pairs like them), by the values in the "two" row, to get the following:
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks
Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them. It then sets this order for each row of the original dataframe. It then unravels this order (after adding a constant to offset each row) and the original dataframe values. It then reorders all the original values based on this unraveled, offset and argsorted array before creating a new dataframe with the intended sort order.
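import numpy as np
import pandas as pd

# df is the MultiIndex dataframe from the question
rows, cols = df.shape
# column order that sorts each 'two' row
df_a = np.argsort(df.xs('two', level=1))
# apply that order to every row of the same group
order = df_a.reindex(df.index.droplevel(-1)).values
# offset each row so the order indexes into the flattened values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)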
Output
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
Some Speed tests
# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])
#scott boston
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop
#Ted
1000 loops, best of 3: 5 ms per loop
Here is a solution, albeit a kludgy one:
Input dataframe:
         C0  C1  C2
A   B
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
Custom sorting function:
def sortit(x):
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    x.sort_values(by='two', axis=1, inplace=True)
    x.columns = xcolumns
    return x

df.groupby(level=0).apply(sortit)
Output:
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3

SPSS Return Highest Variable Name

1) I need SPSS to return the name of the variable with the highest score in a series of subscales. I have ten subscales with mean scores ranging from 0 to 5. I don't want to know the highest score for each case, but rather which subscale is highest.
When I use this syntax, I just get the score, which doesn't tell me which category it belongs to.
COMPUTE Motivation_Highest2 = MAX(Stress_Mgmt, Revitalisation, Enjoyment, Challenge, SocialRecog, Affiliation, Competition, HealthPress, IllHealth, PosHealth,
WtMgmt, Appearance, StrengthEnd, Nimbleness, MotivationHighest).
VARIABLE LABELS Motivation_Highest2 'Motivation Intensity: Highest Score on any Motivation Subscale'.
EXECUTE.
Can I ask SPSS to return the variable name instead of the score?
2) There may be two scores that are both equally high. In this case, I would like SPSS to give me both variable names.
Thanks!
This is an ok job for a macro to do.
DEFINE !MaxVars (OutN = !TOKENS(1)
/OutV = !CHAREND("/")
/Var = !CMDEND)
NUMERIC !OutN.
!DO !I !IN (!Var)
COMPUTE !OutN = MAX(!OutN,!I).
!DOEND
STRING !OutV (!CONCAT("A",!LENGTH(!Var))).
!DO !I !IN (!Var)
IF !I = !OutN !OutV = LTRIM(CONCAT(RTRIM(!OutV)," ",!QUOTE(!I))).
!DOEND
!ENDDEFINE.
And here is an example of using it on a set of data.
DATA LIST FREE / X1 X2 X3.
BEGIN DATA
1 2 3
3 2 1
4 4 0
0 4 4
1 1 1
END DATA.
!MaxVars OutN = Max OutV = Vars /Var = X1 X2 X3.
If you then run LIST Max Vars. it will return in the output:
Max Vars
3 X3
3 X1
4 X1 X2
4 X2 X3
1 X1 X2 X3
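For comparison, the logic of the macro (find the maximum per case, then collect every variable that reaches it) can be sketched in Python. This is only an illustration of the idea and not SPSS syntax; the function name is hypothetical, the cases are the five rows from the DATA LIST example, and missing values are not handled.

```python
def max_vars(case):
    """Return (highest score, names of every variable that reaches it)."""
    highest = max(case.values())
    names = [name for name, score in case.items() if score == highest]
    return highest, names


# The same five cases as in the DATA LIST example.
cases = [
    {"X1": 1, "X2": 2, "X3": 3},
    {"X1": 3, "X2": 2, "X3": 1},
    {"X1": 4, "X2": 4, "X3": 0},
    {"X1": 0, "X2": 4, "X3": 4},
    {"X1": 1, "X2": 1, "X3": 1},
]

results = [max_vars(case) for case in cases]
for highest, names in results:
    print(highest, " ".join(names))
```

Ties are reported by listing all tied variable names, which covers the second part of the question.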

Converting a single col into multiple rows (GNUPLOT file)

I'm trying to make a gnuplot plot from a file, but I have a problem with the layout of my 'data.txt' file. The current layout is:
1 4 x1
1 16 x2
4 4 x3
4 16 x4
8 4 x5
8 16 x6
The first column refers to the number of lines that I want, and the other columns refer to the x and y axes.
I have tried two approaches without success:
1. Use some gnuplot function to draw the plot using the current layout of my file. I have not found such a command.
2. Write a bash script to convert the file into another one with the correct layout:
4 x1 x3 x5
16 x2 x4 x6
Addressing #2
awk '{a[$2] = a[$2] $3 " "} END {for (i in a) print i, a[i]}' file
4 x1 x3 x5
16 x2 x4 x6
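If awk is not available, the same grouping can be sketched in Python. The function name is hypothetical and the sample lines are the ones from the question. Unlike awk's `for (i in a)`, whose iteration order is unspecified, a Python dict keeps keys in first-seen order, so the rows come out in the order the x values first appear.

```python
def group_by_second_column(lines):
    """Collect the third field of each line under the value of its second field."""
    groups = {}  # dicts keep insertion order, so keys appear in first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or short lines
        groups.setdefault(fields[1], []).append(fields[2])
    return [" ".join([key] + values) for key, values in groups.items()]


# Sample lines taken from the question's data.txt.
sample = ["1 4 x1", "1 16 x2", "4 4 x3", "4 16 x4", "8 4 x5", "8 16 x6"]
rows = group_by_second_column(sample)
print("\n".join(rows))
```

To use it on a real file, pass the open file object instead of `sample`, for example `group_by_second_column(open("data.txt"))`.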
