How can I split one column of data into multiple columns based on the column's values using awk?
An example file and the desired output are below. My bash version is 3.2.52(1).
$ cat examplefile
A
1
B
2
B
3
C
10
C
11
C
13
A
4
B
5
B
6
B
7
C
14
Desired output:
$ cat outputfile
A B C
1 2 10
null B C
null 3 11
null null C
null null 13
A B C
4 5 14
null B null
null 6 null
null B null
null 7 null
Or, forgetting about null values: how can I obtain two columns as in outputfile2?
cat examplefile2
A
1
B
2
B
3
cat outputfile2
A B
1 2
B
3
You can get it with:
awk 'BEGIN { l = 1; ll = "" }
{
    if (l) { ll = $0; l = 0 }
    else {
        if (length(a[ll]) > 0) a[ll] = a[ll] "," ll "," $0
        else                   a[ll] = ll "," $0
        l = 1
    }
}
END { for (k in a) print a[k] }' examplefile
It works for any number of classes (A,B,C...).
The output is:
A,1,A,4
B,2,B,3,B,5,B,6,B,7
C,10,C,11,C,13,C,14
If you want it as columns, have a quick look at the following post:
An efficient way to transpose a file in Bash
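As a hedged sketch of that transpose idea, here is one way the grouped output above could be turned into columns with awk, printing null where a row runs short. The file name grouped.txt is an assumption for illustration, and later rows come out in a slightly different order than the question's outputfile, but the first two lines match (A B C, then 1 2 10):

```shell
# grouped.txt stands in for the one-liner's output above (hypothetical name)
cat > grouped.txt <<'EOF'
A,1,A,4
B,2,B,3,B,5,B,6,B,7
C,10,C,11,C,13,C,14
EOF

awk -F, '
{
    for (i = 1; i <= NF; i++) cell[NR, i] = $i   # remember every field
    if (NF > maxnf) maxnf = NF                   # track the widest row
    rows = NR
}
END {
    for (i = 1; i <= maxnf; i++) {               # former columns become rows
        line = ""
        for (r = 1; r <= rows; r++) {
            v = ((r, i) in cell) ? cell[r, i] : "null"
            line = line ((r > 1) ? " " : "") v
        }
        print line
    }
}' grouped.txt > transposed.txt

cat transposed.txt
```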
I need a table of 3 columns:
Column 1 counts how often the letters a, b, c, d appear in var1 to var3 across all rows
Column 2 counts how often the letters a, b, c, d appear in var4 to var6 across all rows
Column 3 subtracts Column 2 from Column 1
The data looks like this:

observation  var1  var2  var3  var4  var5  var6
1            a     b     d     c     a     b
2            b     c     d     b     a     d
3            b     d     a     c     d     a
The table should look something like this:
value  Column 1 (var1-3)  Column 2 (var4-6)  Column 3 (difference)
a      2                  3                  -1
b      3                  2                  1
c      1                  2                  -1
d      3                  2                  1
I am using Stata and I have no idea where to start. I have tried tabulate, tab1, and table, but none of them seems to suit my needs.
There are likely to be many other ways to do this.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte observation str1(var1 var2 var3 var4 var5 var6)
1 "a" "b" "d" "c" "a" "b"
2 "b" "c" "d" "b" "a" "d"
3 "b" "d" "a" "c" "d" "a"
end
rename (var4-var6) (war#), addnumber
reshape long var war, i(obs) j(which)
rename (var war) (value=)
reshape long value, i(obs which) j(group) string
contract group value
reshape wide _freq, i(value) j(group) string
char _freqvar[varname] "var1-var3"
char _freqwar[varname] "var4-var6"
gen difference = _freqvar - _freqwar
list, subvarname abbrev(10) noobs
+--------------------------------------------+
| value var1-var3 var4-var6 difference |
|--------------------------------------------|
| a 2 3 -1 |
| b 3 2 1 |
| c 1 2 -1 |
| d 3 2 1 |
+--------------------------------------------+
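For readers without Stata, the same counts can be cross-checked in awk, the tool used elsewhere in this thread. This is only an illustrative sketch: the file name letters.txt and the whitespace-separated layout (observation number followed by the six values) are assumptions.

```shell
cat > letters.txt <<'EOF'
1 a b d c a b
2 b c d b a d
3 b d a c d a
EOF

awk '{
    for (i = 2; i <= 4; i++) first[$i]++    # var1-var3 live in fields 2-4
    for (i = 5; i <= 7; i++) second[$i]++   # var4-var6 live in fields 5-7
}
END {
    n = split("a b c d", keys, " ")
    for (k = 1; k <= n; k++) {
        v = keys[k]
        # value, var1-var3 count, var4-var6 count, difference
        print v, first[v] + 0, second[v] + 0, first[v] - second[v]
    }
}' letters.txt > counts.txt

cat counts.txt
```

The four printed rows (a 2 3 -1, b 3 2 1, c 1 2 -1, d 3 2 1) agree with the Stata listing above.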
I have a scenario as below.
In a file say file1.txt, I have
A 1
A 2
A 3
B 5
B 2
C 9
C 10
I would like to sort and keep, for each key, only the line with the largest value, like below.
A 3
B 5
C 10
I tried
sort fike1.txt -k1,1 -kn2
but it didn't work.
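Assuming the goal is to keep the line with the largest second-field value for each key, one hedged sketch: sort by key, then numerically descending on the value, and let awk keep only the first line it sees for each key.

```shell
cat > file1.txt <<'EOF'
A 1
A 2
A 3
B 5
B 2
C 9
C 10
EOF

# Sort by key (-k1,1), then by value descending (-k2,2nr);
# awk's !seen[$1]++ is true only the first time a key appears,
# so only the largest-valued line per key survives.
sort -k1,1 -k2,2nr file1.txt | awk '!seen[$1]++' > top.txt

cat top.txt
```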
I have a file with the following data-
Input-
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If any of the other rows starting from row 2 have the same letter as row 1, they should be changed to 1. Basically, I'm trying to find out how similar any of the rows are to the first row.
Desired Output-
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.
I have written the following code which does this transformation-
for seq in {1..1} ; #Iterate over the rows (in this case just row 1)
do
for position in {1..6} ; #Iterate over the columns
do
#Define the letter in the first row with which I'm comparing the rest of the rows
aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f)
#If it matches, gsub it to 1
awk -v var=$aa -v pos=$position '{gsub (var, "1", $pos)} 1' f > temp
#Save this intermediate file and now act on this
mv temp f
done
done
As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.
I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.
You can use this simpler awk command to do the job. It completes faster because it avoids nested loops in the shell and does not invoke awk repeatedly:
awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
EDIT:
As per the comments below, here is what you can do to get the sum of each row:
awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
print $0, sum}' file
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
Input
$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
Command and output
$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
Explanation
FNR==1{ .. }
When awk reads first record of current file, do things inside braces
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array.
split($0,a)
split the current record or row ($0) into pieces by fieldsep (default space, as
we have not supplied the 3rd argument) and store the pieces in array a
So array a contains data from first row
a[1] = A
a[2] = B
a[3] = C
a[4] = D
a[5] = E
a[6] = F
for(i=1;i<=NF;i++)
Loop through all the fields of each record, for every record until end of file.
if (a[i]==$i) $i=1
if the first row's value at the current index (i) equals the current
row's value in that column, set the current column value to 1 (i.e. modify the current field).
Now that the row is modified, just print it:
}1
1 always evaluates to true, so it performs the default action, {print $0}.
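The trailing 1 idiom can be seen in isolation:

```shell
# 1 is a pattern with no action block, so awk applies the
# default action, print $0, to every input line.
printf 'x\ny\n' | awk '1'
```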
For update request on comment
Same question here, I have a second part of the program that adds up
the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this
output. Can your program be tweaked to get these values out at this
step itself?
$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
I have a text file that contains the following (a b c d etc... contains some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to insert a separator between lines depending on the content of the first field? Those numbers will always be increasing, but may vary. The separation should happen each time the first field changes, e.g. from 1 to 2, then 2 to 6, etc.
The output would be like this (here I would like to use ---------- as a separation):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.
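To see it run on the sample input (the file name data.txt is just for illustration):

```shell
cat > data.txt <<'EOF'
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
EOF

# Print the separator whenever the first field changes (except
# before the very first line), then print the line itself.
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }' data.txt > separated.txt

cat separated.txt
```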
So I've been trying the same problem for the last few days, and I'm at a formatting road block.
I have a program that will only run if it is working on an equal number of columns. I know the total column count and how many columns need to be added with a filler value of 0, but I am not sure how to do this. Is there some kind of range option in awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The alphabet columns are always present (with different values), but this "fill in the blank" function eludes me. I can't use R for this due to the data file size.
One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad up to.
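If you'd rather not hard-code n, a hedged variant reads the file twice: the first pass finds the widest row, the second pads the short ones (this assumes the input is a regular file that can be read twice, not a pipe).

```shell
cat > file <<'EOF'
A B C D E
A B C D E 1 1 1 1
EOF

# First pass (NR==FNR): record the maximum field count in max.
# Second pass: append 0 fields until every row is max columns wide.
awk 'NR==FNR { if (NF > max) max = NF; next }
     { for (i = NF + 1; i <= max; i++) $i = 0 } 1' file file > padded.txt

cat padded.txt
```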