Stata counting observations across multiple columns into one table - datatable

I need a table of 3 columns:
Column 1 counts how often each of the letters a, b, c, d appears in variables 1 to 3, across all rows
Column 2 counts how often each of the letters a, b, c, d appears in variables 4 to 6, across all rows
Column 3 subtracts Column 2 from Column 1
The data looks like this:
observation   var1   var2   var3   var4   var5   var6
          1      a      b      d      c      a      b
          2      b      c      d      b      a      d
          3      b      d      a      c      d      a
The table should look something like this:
    Column 1 (var1-3)   Column 2 (var4-6)   Column 3
a                   2                   3         -1
b                   3                   2          1
c                   1                   2         -1
d                   3                   2          1
I am using Stata and I have no idea where to start. I have tried tabulate, tab1, and table, but none of them seems to suit my needs.

Here is one way, using reshape and contract; there are likely to be many other ways to do this.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte observation str1(var1 var2 var3 var4 var5 var6)
1 "a" "b" "d" "c" "a" "b"
2 "b" "c" "d" "b" "a" "d"
3 "b" "d" "a" "c" "d" "a"
end

* Rename var4-var6 to war1-war3 so the two groups can be reshaped side by side
rename (var4-var6) (war#), addnumber
reshape long var war, i(obs) j(which)
rename (var war) (value=)
reshape long value, i(obs which) j(group) string

* Count each (group, letter) combination, then put the two groups side by side
contract group value
reshape wide _freq, i(value) j(group) string

* Label the frequency variables and compute the difference
char _freqvar[varname] "var1-var3"
char _freqwar[varname] "var4-var6"
gen difference = _freqvar - _freqwar
list, subvarname abbrev(10) noobs
+--------------------------------------------+
| value var1-var3 var4-var6 difference |
|--------------------------------------------|
| a 2 3 -1 |
| b 3 2 1 |
| c 1 2 -1 |
| d 3 2 1 |
+--------------------------------------------+
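For comparison, here is a minimal pandas sketch of the same tabulation (not part of the original answer); the DataFrame literal just reproduces the dataex example:

import pandas as pd

# Reproduce the -dataex- example as a DataFrame
df = pd.DataFrame({
    "var1": ["a", "b", "b"], "var2": ["b", "c", "d"], "var3": ["d", "d", "a"],
    "var4": ["c", "b", "c"], "var5": ["a", "a", "d"], "var6": ["b", "d", "a"],
})

# Count each letter within the two groups of columns
col1 = df[["var1", "var2", "var3"]].stack().value_counts().sort_index()
col2 = df[["var4", "var5", "var6"]].stack().value_counts().sort_index()

out = pd.DataFrame({"var1-var3": col1, "var4-var6": col2}).fillna(0).astype(int)
out["difference"] = out["var1-var3"] - out["var4-var6"]
print(out)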

Related

divide a table into sub_table by query using Laravel

I have a table (table1) with 3 columns (ID, V, R). I want to divide it into several tables based on ID. For example, suppose this is table1:
ID V R
1 1 T
1 2 F
2 1 T
2 3 T
3 4 F
Then, because I have 3 distinct values of ID (there could be more or fewer), I want 3 different tables like this:
table1_1:
ID V R
1 1 T
1 2 F
table1_2:
ID V R
2 1 T
2 3 T
table1_3:
ID V R
3 4 F
Is there any solution to do this in Laravel?
I apologize in advance if my question is too simple.
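No Laravel answer is shown here, but the core of the task is a group-by on ID, and Laravel collections have a groupBy method for exactly this kind of split. As a language-neutral sketch of the underlying logic (Python, illustrative only):

from itertools import groupby

rows = [(1, 1, "T"), (1, 2, "F"), (2, 1, "T"), (2, 3, "T"), (3, 4, "F")]

# Rows are already sorted by ID, so consecutive grouping is enough
tables = {key: list(group) for key, group in groupby(rows, key=lambda r: r[0])}
for key, sub in tables.items():
    print(f"table1_{key}:", sub)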

Pandas multiindex sort

In Pandas 0.19 I have a large dataframe with a Multiindex of the following form
         C0  C1  C2
A   B
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
I want to reorder the columns of bar and foo (and many more pairs of rows like them) according to their "two" row, to get the following:
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks
Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them, then broadcasts this order to each row of the original dataframe. It then unravels both this order (after adding a per-row offset) and the original dataframe values, reorders all the original values with the unraveled, offset argsort array, and finally creates a new dataframe with the intended sort order.
import numpy as np
import pandas as pd

rows, cols = df.shape
# Argsort only the 'two' rows: one row of column positions per group
df_a = np.argsort(df.xs('two', level=1))
# Broadcast that order to every row, then offset it into the raveled values
order = df_a.reindex(df.index.droplevel(-1)).values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols),
             index=df.index, columns=df.columns)
Output
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
Some speed tests
# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])
#scott boston
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop
#Ted
1000 loops, best of 3: 5 ms per loop
Here is a solution, albeit kludgy:
Input dataframe:
         C0  C1  C2
A   B
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
Custom sorting function:
def sortit(x):
    # Reorder this group's columns by its 'two' row, then restore the labels
    xcolumns = x.columns.values
    x.index = x.index.droplevel()
    x.sort_values(by='two', axis=1, inplace=True)
    x.columns = xcolumns
    return x
df.groupby(level=0).apply(sortit)
Output:
         C0  C1  C2
A   B
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
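A more compact sketch of the same idea (not from either answer), assuming np.argsort on the 'two' cross-section returns one row of column positions per group, as in the numpy answer above:

import numpy as np
import pandas as pd

# One row of column positions per group, taken from each group's 'two' row
order = np.argsort(df.xs('two', level=1))

parts = [pd.DataFrame(grp.values[:, order.loc[key].values],
                      index=grp.index, columns=df.columns)
         for key, grp in df.groupby(level=0, sort=False)]
result = pd.concat(parts)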

gsub many columns simultaneously based on different gsub conditions?

I have a file with the following data-
Input-
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If any cell in the rows from row 2 onward has the same letter as the corresponding cell in row 1, it should be changed to 1. Basically, I'm trying to find out how similar each of the rows is to the first row.
Desired Output-
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.
I have written the following code which does this transformation-
for seq in {1..1} ;     # Iterate over the rows (in this case just row 1)
do
    for position in {1..6} ;     # Iterate over the columns
    do
        # Define the letter in the first row with which I'm comparing the rest of the rows
        aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f)
        # If it matches, gsub it to 1
        awk -v var=$aa -v pos=$position '{gsub (var, "1", $pos)} 1' f > temp
        # Save this intermediate file and now act on this
        mv temp f
    done
done
As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.
I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.
You can use this simpler awk command to do the job. It will be much faster because it avoids the nested loops in the shell and the repeated awk invocations inside them:
awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
EDIT:
As per the comments below, here is what you can do to also get, for each row, the sum of its cells after the substitution (i.e., the number of matches):
awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
print $0, sum}' file
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
Input
$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
Desired output
$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
Explanation
FNR==1{ .. }
When awk reads first record of current file, do things inside braces
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array.
split($0,a)
split the current record or row ($0) into pieces by fieldsep (default space, as
we have not supplied the 3rd argument) and store the pieces in array a
So array a contains the data from the first row
a[1] = A
a[2] = B
a[3] = C
a[4] = D
a[5] = E
a[6] = F
for(i=1;i<=NF;i++)
Loop through all the fields of each record of the file, until the end of the file.
if (a[i]==$i) $i=1
If the first row's value at the current index (i) is equal to the current row's
value in that column, set the current column value to 1 (meaning, modify the
current column value).
With the column values modified, the next step is just to print the modified row.
}1
1 always evaluates to true, so it performs the default action {print $0}
For the update requested in a comment:
Same question here, I have a second part of the program that adds up
the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this
output. Can your program be tweaked to get these values out at this
step itself?
$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
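For reference, a minimal Python sketch of the same match-and-sum logic (not part of the awk answers), assuming the matrix sits whitespace-separated in a file named f:

with open("f") as fh:
    rows = [line.split() for line in fh]

first = rows[0]
for row in rows:
    # Replace each cell matching the first row's cell in that column with "1"
    out = ["1" if cell == ref else cell for cell, ref in zip(row, first)]
    print(" ".join(out), sum(cell == ref for cell, ref in zip(row, first)))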

How to separate lines depending on the value in column 1

I have a text file that contains the following (a, b, c, d, etc. stand for some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to separate lines with some characters depending on the content of the first string, knowing that those numbers will always be increasing but may vary? The separation would occur whenever the first string increments, going from 1 to 2, then 2 to 6, etc.
The output would be like this (here I would like to use ---------- as a separation):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.
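The same logic as a minimal Python sketch (assuming, hypothetically, the data is in a file named input.txt):

old = None
with open("input.txt") as fh:
    for line in fh:
        key = line.split()[0]
        # Print a separator whenever the first column changes
        if old is not None and key != old:
            print("-" * 10)
        print(line, end="")
        old = key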

How to select from table where condition (and) is for some rows

I'm using Oracle 11.
I have some table
+---------+-----------+--------+--------+
| attr_id | record_id | value1 | value2 |
+---------+-----------+--------+--------+
| 1 | 1 | 2 | null |
| 2 | 1 | null | 6 |
| 3 | 1 | 4 | null |
| 1 | 2 | null | 4 |
+---------+-----------+--------+--------+
And I want to select like this:
select record_id from table
where ((attr_id = 1 and value1 = 2) and (attr_id = 3 and value1 = 4))
I expect output record_id = 1.
How can I do it?
If I understand it well, you want to select the record_id having both:
(attr_id = 1 and value1 = 2) in one row and
(attr_id = 3 and value1 = 4) in another row.
You can't use a simple WHERE clause here, as it checks both conditions against the same row, which can never be satisfied (attr_id can't be equal to 1 and 3 in the same row). So your query can't work.
But, there is one solution. You will need a self-JOIN, producing all combinations of pairs of (attr_id,value) for a given record_id.
As I suppose you are relatively new to SQL, I will build my answer in several steps:
self-join
As I said before, we first need to join your table with itself (note that Oracle does not accept AS before a table alias, so the aliases below are written without it):
select t1.attr_id as attr_id_1,
       t1.value1  as value1_1,
       t2.attr_id as attr_id_2,
       t2.value1  as value1_2,
       record_id
from   t t1
join   t t2 using (record_id)
producing:
ATTR_ID_1  VALUE1_1  ATTR_ID_2  VALUE1_2  RECORD_ID
        1         2          1         2          1
        2    (null)          1         2          1
        3         4          1         2          1
        1         2          2    (null)          1
        2    (null)          2    (null)          1
        3         4          2    (null)          1
        1         2          3         4          1   <---------
        2    (null)          3         4          1
        3         4          3         4          1
        1    (null)          1    (null)          2
(live example: http://sqlfiddle.com/#!2/73a490/6)
As you can see, all combinations of rows having the same record_id are in that result set, including combinations of a row with itself. Please notice I have marked the row that you are looking for.
Get the "right" record
Now, it is quite easy to get the record_id having both (attr_id = 1 and value1 = 2) and (attr_id = 3 and value1 = 4). You only have to take care of prefixing your columns with the correct table alias:
select t1.attr_id as attr_id_1,
       t1.value1  as value1_1,
       t2.attr_id as attr_id_2,
       t2.value1  as value1_2,
       record_id
from   t t1
join   t t2 using (record_id)
where  t1.attr_id = 1 and t1.value1 = 2
  and  t2.attr_id = 3 and t2.value1 = 4
Producing:
ATTR_ID_1  VALUE1_1  ATTR_ID_2  VALUE1_2  RECORD_ID
        1         2          3         4          1
(see http://sqlfiddle.com/#!2/73a490/8)
The final answer
Finally, it is quite easy to polish the query a little so it returns only the desired value:
select record_id
from   t t1
join   t t2 using (record_id)
where  t1.attr_id = 1 and t1.value1 = 2
  and  t2.attr_id = 3 and t2.value1 = 4
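If it helps to see the pairing logic outside SQL, here is the same self-join sketched in pandas (illustrative only; the frame reproduces the sample table above):

import pandas as pd

t = pd.DataFrame({
    "attr_id":   [1, 2, 3, 1],
    "record_id": [1, 1, 1, 2],
    "value1":    [2, None, 4, None],
})

# Self-join on record_id, then keep pairs satisfying both conditions
pairs = t.merge(t, on="record_id", suffixes=("_1", "_2"))
hit = pairs[(pairs.attr_id_1 == 1) & (pairs.value1_1 == 2) &
            (pairs.attr_id_2 == 3) & (pairs.value1_2 == 4)]
print(hit["record_id"].unique())  # -> [1]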
