How to parse through a CSV file with awk? - bash

I have a CSV file with, say, 20 headers, and the corresponding values for those headers in the next row for a particular record.
Example : Source file
Age,Name,Salary
25,Anand,32000
I want my output file to be in this format.
Example : Output file
Age
25
Name
Anand
Salary
32000
Which awk/grep/sed command should be used to do this?

I'd say
awk -F, 'NR == 1 { split($0, headers); next } { for(i = 1; i <= NF; ++i) { print headers[i]; print $i } }' filename
That is:
NR == 1 {                      # in the first line
    split($0, headers)         # remember the headers
    next                       # do nothing else
}
{                              # after that:
    for(i = 1; i <= NF; ++i) { # for all fields:
        print headers[i]       # print the corresponding header
        print $i               # followed by the field
    }
}
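Run against the sample file, it should produce exactly the requested output:
$ awk -F, 'NR == 1 { split($0, headers); next } { for(i = 1; i <= NF; ++i) { print headers[i]; print $i } }' filename
Age
25
Name
Anand
Salary
32000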
Addendum: Obligatory, crazy sed solution (not recommended for production use; written for fun, not profit):
sed 's/$/,/; 1 { h; d; }; G; :a s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/; ta; s/^\n\n//' filename
That works as follows:
s/$/,/       # Add a comma to all lines for more convenient processing
1 { h; d; }  # first line: just put it in the hold buffer
G            # all other lines: append the hold buffer (the header fields)
             # to the pattern space
:a           # jump label for looping
             # isolate the first fields from the data and header lines,
             # move them to the end of the pattern space
s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/
ta           # loop until we got them all
s/^\n\n//    # then remove the two newlines that are left as an artifact
             # of the algorithm
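For the sample file it should print the same six lines as the awk version:
$ sed 's/$/,/; 1 { h; d; }; G; :a s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/; ta; s/^\n\n//' filename
Age
25
Name
Anand
Salary
32000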

Here is one awk solution:
awk -F, 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} {for (i=1;i<=NF;i++) print a[i] RS $i}' file
Age
25
Name
Anand
Salary
32000
The first for loop stores the headers in array a; the second prints each header from a followed by its corresponding data field.

Using GNU awk 4.* for 2D arrays:
$ awk -F, '{a[NR][1];split($0,a[NR])} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print a[j][i]}' file
Age
25
Name
Anand
Salary
32000
In general to transpose rows and columns:
$ cat file
11 12 13
21 22 23
31 32 33
41 42 43
with GNU awk:
$ awk '{a[NR][1];split($0,a[NR])} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) printf "%s%s", a[j][i], (j<NR?OFS:ORS)}' file
11 21 31 41
12 22 32 42
13 23 33 43
or with any awk:
$ awk '{for (i=1;i<=NF;i++) a[NR][i]=$i} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) printf "%s%s", a[j][i], (j<NR?OFS:ORS)}' file
11 21 31 41
12 22 32 42
13 23 33 43

Related

Splitting a large, complex one column file into several columns with awk

I have a text file produced by some commercial software, looking like below. It consists of bracket-delimited sections, each of which contains several million elements, but the exact count changes from one case to another.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to achieve an output like:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way, which is:
perform sed operations to get
1
2
3
...
#
11
22
33
...
#
111
222
333
...
use awk as follows to split my file into several sub-files
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
remove whitespace-only lines from my subfiles, again with sed
sed -i '/^[[:space:]]*$/d' splitted*.txt
join everything together:
paste splitted*.txt > out.txt
add a field separator (defined in my bash script)
awk -v sep=$my_sep 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
I feel this is crappy, as I loop over millions of lines several times.
Even though the runtime is quite OK (~80 s), I'd like to find a full awk solution but can't get to it.
Something like:
awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
I found some related questions, especially this one: row to column conversion with awk, but it assumes a constant number of lines between brackets, which I can't rely on.
Any help would be appreciated.
With GNU awk for multi-char RS and true multi-dimensional arrays:
$ cat tst.awk
BEGIN {
    RS = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {
    cell[NR][1]
    split($0,cell[NR])
}
END {
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
     1;    11;   111
     2;    22;   222
     3;    33;   333
   ...;   ...;   ...
If you know you have 3 columns, you can do it in a very ugly way, as follows:
pr -3ts <file>
All that needs to be done then is to remove your brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
You can also do it in a single awk line, but it just complicates things. The above is quick and easy.
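For the curious, here is a rough sketch of what such a single-awk version could look like (an untested illustration, not the answer's code; it assumes each section opens with ( and is closed by a lone ), and it supports columns of different lengths):
awk -v OFS='; ' '
/\(/ { col++; row = 0; sub(/\(/, "") }   # an opening "(" starts the next column
/\)/ { next }                            # a closing ")" ends the section
NF   { a[++row, col] = $1; maxRow = (row > maxRow ? row : maxRow) }
END  {
    for (r = 1; r <= maxRow; r++) {
        line = a[r, 1]
        for (c = 2; c <= col; c++)
            line = line OFS a[r, c]
        print line
    }
}' file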
This awk program does the full generic version:
awk 'BEGIN{r=c=0}
     /\)/{r=0; c++; next}
     {gsub(/[( ]/,"")}
     (NF){a[r++,c]=$1; rm=rm>r?rm:r}
     END{ for(i=0;i<rm;++i) {
              printf "%s", a[i,0]
              for(j=1;j<c;++j) printf "; %s", a[i,j]
              print ""
          }
     }' <file>
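Run on the sample, it should print the same layout as the pr-based approach:
1; 11; 111
2; 22; 222
3; 33; 333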
Could you please try the following, assuming your actual Input_file is the same as the samples shown.
awk -v RS="" '
{
gsub(/\n|, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
}
}
for(j=1;j<=num;j++){
print val[j]
}
delete val
delete array
value=""
}' OFS="; "
OR (the above script assumes the number of values inside (...) is constant; here is a script that will work even when the field counts inside (...) are not equal):
awk -v RS="" '
{
gsub(/\n/,",")
gsub(/, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
max=num>max?num:max
}
}
for(j=1;j<=max;j++){
print val[j]
}
delete val
delete array
}' OFS="; "
Output will be as follows.
1; 11; 111
2; 22; 222
3; 33; 333
Explanation: here is an annotated version of the code above.
awk -v RS="" ' ##Setting RS(record separator) as NULL here.
{ ##Starting BLOCK here.
gsub(/\n/,",") ##using gsub to substitute new line OR comma with space with comma here.
gsub(/, /,",")
}
1' Input_file | ##Mentioning 1 will be printing edited/non-edited line of Input_file. Using | means sending this output as Input to next awk program.
awk ' ##Starting another awk program here.
{
while(match($0,/\([^\)]*/)){ ##Using while loop which will run till a match is FOUND for (...) in lines.
value=substr($0,RSTART+1,RLENGTH-2) ##storing substring from RSTART+1 to till RLENGTH-1 value to variable value here.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line with substring valeu from RSTART+RLENGTH till last of line.
num=split(value,array,",") ##Splitting value variable into array named array whose delimiter is comma here.
for(i=1;i<=num;i++){ ##Using for loop which runs from i=1 to till value of num(length of array).
val[i]=val[i]?val[i] OFS array[i]:array[i] ##Creating array val whose index is value of variable i and concatinating its own values.
}
}
for(j=1;j<=num;j++){ ##Starting a for loop from j=1 to till value of num here.
print val[j] ##Printing value of val whose index is j here.
}
delete val ##Deleting val here.
delete array ##Deleting array here.
value="" ##Nullifying variable value here.
}' OFS="; " ##Making OFS value as ; with space here.
NOTE: This should work for more than 3 values inside (...) brackets also.
awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
     NF > 0 {
         maxCol++
         if (NF > maxRow)
             maxRow = NF
         for (row = 1; row <= NF; row++)
             a[row,maxCol] = $row
     }
     END {
         for (row = 1; row <= maxRow; row++) {
             for (col = 1; col <= maxCol; col++)
                 printf "%s", a[row,col] ";"
             print ""
         }
     }' yourFile
Output:
1;11;111;
2;22;222;
3;33;333;
...;...;...;
Change FS = "\\s*" to FS = "\n*" when you also want to allow spaces inside your fields.
This script supports columns of different lengths.
When benchmarking also consider replacing [i,j] with [i][j] for GNU awk. I'm unsure which one is faster and did not benchmark the script myself.
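As a sketch of that variant (GNU awk 4+ only, unbenchmarked as noted), only the array accesses change:
NF > 0 {
    maxCol++
    if (NF > maxRow)
        maxRow = NF
    for (row = 1; row <= NF; row++)
        a[row][maxCol] = $row    # true multi-dimensional array instead of the SUBSEP form
}
END {
    for (row = 1; row <= maxRow; row++) {
        for (col = 1; col <= maxCol; col++)
            printf "%s", a[row][col] ";"
        print ""
    }
}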
Here is a Perl one-liner solution:
$ cat edouard2.txt
(1
2
3
a
)
(11
22
33
b
)
(111
222
333
c
)
$ perl -lne ' $x=0 if s/[)(]// ; if(/(\S+)/) { @t=@{$val[$x]};push(@t,$1);$val[$x++]=[@t] } END { print join(";",@{$val[$_]}) for(0..$#val) }' edouard2.txt
1;11;111
2;22;222
3;33;333
a;b;c
I would convert each section to a row and then transpose after, e.g. assuming you are using GNU awk:
<infile awk '{ gsub("[( )]", ""); $1=$1 } 1' RS='\\)\n\\(' OFS=';' |
datamash -t';' transpose
Output:
1;11;111
2;22;222
3;33;333
...;...;...

Extract desired column with values

Please help me with this small script I am making. I am trying to grep some columns with values from a big tab-separated file (mainFileWithValues.txt), which has this format:
A B C ......... (total 700 columns)
80 2.08 23
14 1.88 30
12 1.81 40
Column names are in columnnam.nam:
cat columnnam.nam
A
B
.
.
.
up to 20 names
I am first taking the column number from the big file using:
sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c
Then I am extracting the values using cut.
I have made a for loop:
#!/bin/bash
for i in `cat columnnam.nam`
do
    cut -f`sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c` mainFileWithValues.txt > test.txt
done
cat test.txt
A
80
14
12
B
2.08
1.88
1.81
My problem is that I want the output test.txt to be in columns, like the main file, i.e.:
A B
80 2.08
How can I fix this in this script?
Here is a one-liner:
awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
Explanation:
[note: the header match is case-insensitive; remove tolower() if you want a strict match]
awk '
FNR==NR{                           # Here we read the columns.nam file
    h[NR]=$1;                      # h -> array, NR -> array key, $1 -> array value
    next                           # go to next line
}
{                                  # Here we read the second file
    for(i=1; i in h; i++)          # iterate over array h
    {
        if(FNR==1)                 # if we are reading the 1st row of the second file, parse the header
        {
            for(j=1; j<=NF; j++)   # iterate over the fields of the 1st row
            {
                # if it is the field we are looking for
                if(tolower(h[i])==tolower($j))
                {
                    # then
                    # d -> array, i -> array key, which is the column order number
                    # j -> array value, which is the column number
                    d[i]=j;
                    break
                }
            }
        }
        # For all records:
        # if the field we searched for was found, print it;
        # d[i] gives us the column number
        printf("%s%s",i>1 ? OFS:"", i in d ? $(d[i]): "");
    }
    # print newline char
    print ""
}
' columns.nam mainfile
Test Results:
$ cat mainfile
A B C
80 2.08 23
14 1.88 30
12 1.81 40
$ cat columns.nam
A
C
$ awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
A C
80 23
14 30
12 40
You can also put it in a script and run it:
akshay@db-3325:/tmp$ cat col_parser.awk
FNR == NR {
    h[NR] = $1;
    next
}
{
    for (i = 1; i in h; i++) {
        if (FNR == 1) {
            for (j = 1; j <= NF; j++) {
                if (tolower(h[i]) == tolower($j)) {
                    d[i] = j;
                    break
                }
            }
        }
        printf("%s%s", i > 1 ? OFS : "", i in d ? $(d[i]) : "");
    }
    print ""
}
akshay@db-3325:/tmp$ awk -v OFS="\t" -f col_parser.awk columns.nam mainfile
A C
80 23
14 30
12 40
Similar Answer
AWK to display a column based on Column name and remove header and last delimiter
Another awk approach:
awk 'NR == FNR {
hdr[$1]
next
}
FNR == 1 {
for (i=1; i<=NF; i++)
if ($i in hdr)
h[i]
}
{
s=""
for (i in h)
s = s (s == "" ? "" : OFS) $i
print s
}' column.nam mainFileWithValues.txt
A B
80 2.08
14 1.88
12 1.81
To get aligned output, pipe the above command to column -t.
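For example, assuming the program above is saved as extract.awk (a file name chosen here just for illustration):
$ awk -f extract.awk column.nam mainFileWithValues.txt | column -t
A   B
80  2.08
14  1.88
12  1.81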

Find positions of all occurrences of a pattern in a string when every line has a different pattern defined in another column (UNIX)

I have this tabulated file as shown:
1 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK V
2 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK M
.
.
And so on...
The first column is the number, the second column is the protein sequence, and the third column is the last character: the pattern to find in the corresponding sequence for each case.
Thus, the desired output will be something like that:
1:positions:4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions:1 18 22 110 134
I have tried awk with the index function:
nawk -F'\t' -v p=$3 'index($2,p) {printf "%s:positions:", NR; s=$2; m=0; while((n=index(s, p))>0) {m+=n; printf "%s ", m; s=substr(s, n+1)} print ""}' "file.tsv"
However, it works only when the -v variable is set to a literal character or string, not to $3. How can I get this to work in a Unix environment? Thanks in advance.
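(Note: -v p=$3 expands the shell's third positional parameter, which is empty here, rather than awk's third field. Referencing $3 inside the awk program makes the original approach work; a lightly adapted, untested sketch:)
awk -F'\t' '{
    printf "%s:positions:", $1
    s = $2; m = 0
    while ((n = index(s, $3)) > 0) {   # find the next occurrence of the pattern
        m += n                         # convert the relative hit to an absolute position
        printf " %s", m
        s = substr(s, n + 1)           # continue searching after the match
    }
    print ""
}' file.tsv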
You can do:
awk -F'\t' '{ len=split($2,arr,""); printf "%s:positions:",$1 ; for(i=1;i<=len;i++) { if(arr[i] == $3 ) { printf "%s ",i } }; print "" }' file.tsv
First split the subject $2 entirely into an array, then loop over it, check for occurrences of $3, and print the array index when one is found. (split produces a 1-indexed array, so the loop runs from 1 to len.)
Perl to the rescue:
perl -wane '
print "$F[0]:positions:";
$i = 0;
print " ", $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
print "\n";
' -- file
If the space after the : is a problem, you can complicate it to:
$i = $f = 0;
$f = print " " x $f, $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
gawk solution:
awk -v FPAT="[[:digit:]]+|[[:alpha:]]" '{
r=$1":positions:"; for(i=2;i<NF;i++) { if($i==$NF) r=r" "i-1 } print r
}' file.tsv
FPAT="[[:digit:]]+|[[:alpha:]]" - regex pattern defining a field value (a run of digits or a single letter)
for(i=2;i<NF;i++) - iterating through the fields (the letters of the 2nd column)
The output:
1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134
awk '{
    str=$1":positions:";
    n=0; split($2,a,$3);                # adopt $3 as the delimiter to split $2, saving the pieces in a
    for(i=1;i<length(a);i++){
        n+=length(a[i])+1; str=str" "n  # locate each occurrence of $3 by computing n+length(a[i])+1
    }
    print str
}' file.tsv
$ awk '{out=$1 ":positions:"; for (i=1;i<=length($2);i++) { c=substr($2,i,1); if (c == $3) out = out " " i}; print out}' file
1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134
A simple Perl solution:
use strict;
use warnings;

while( <DATA> ) {
    chomp;
    next if /^\s*$/;         # just in case you have empty lines
    my @data = split "\t";   # record is tab-separated
    my %result;              # hash to store the result
    my $c = 0;               # position in the string
    map { $c++; push @{$result{$data[0]}}, $c if $_ eq $data[2] } split '', $data[1];
    print "$data[0]:position:"
        . join(' ', @{$result{$data[0]}})  # assemble the result into the desired form
        . "\n";
}
__DATA__
1 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK V
2 MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK M
I would use a small script which goes through every line of your file, gets the last field as search_string, and then uses grep to get the positions of search_string. All you have to do then is shift the result, since grep -b reports 0-based byte offsets counted from the start of the echoed line. The sed command joins grep's output lines into a single line.
while read p; do
    search_string=`echo $p |awk '{print $NF}'`
    echo $p |grep -aob $search_string | sed ':a;N;$!ba;s/\n/ /g'
done < file.tsv

awk compare fields from two different files

Here I have tried an awk script to compare fields from two different files.
awk 'NR == FNR {if (NF >= 4) a[$1] b[$4]; next} {for (i in a) for (j in b) if (i >= $2 && i <=$3 && j>=$2 && j<=$3 ) {print $1, $2, $3, i, j; next}}' file1 file2
Input files:
File1:
24926 17 206 25189 5.23674 5.71882 4.04165 14.99721 c
50760 17 48 50874 3.49903 4.25043 7.66602 15.41548 c
104318 15 269 104643 2.94218 5.18301 5.97225 14.09744 c
126088 17 70 126224 3.12993 5.32649 6.14936 14.60578 c
174113 16 136 174305 4.32339 2.36452 8.60971 15.29762 c
196474 14 89 196626 2.24367 5.16966 7.33723 14.75056 c
......
......
File2:
GT_004279 1 280
GT_003663 19891 20217
GT_003416 22299 23004
GT_003151 24916 25391
GT_001715 39470 39714
GT_001585 40896 41380
....
....
The output which I got is:
GT_004279 1 280 2465483 2639576
GT_003663 19891 20217 2005645 2005798
GT_003416 22299 23004 2291204 2269898
GT_003151 24916 25391 2501183 25189
GT_001715 39470 39714 3964440 3950417
......
......
The desired output is the lines where the 1st and 4th field values from file1 lie between the 2nd and 3rd field values from file2. For example, with the lines given above as input files, the output must be:
GT_003151 24916 25391 24926 25189
If I guess correctly, the problem is within the if condition. Could someone help rectify this problem?
Thanks
You need to make composite keys and iterate through them. When you create such composite keys, the parts are joined by the SUBSEP variable, so you just split on that and do the check.
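A quick illustration of the mechanism (SUBSEP defaults to the non-printing character \034):
$ awk 'BEGIN { x["a","b"]; for (k in x) { split(k, p, SUBSEP); print p[1], p[2] } }'
a b
Applying that to your files: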
awk '
NR==FNR{ flds[$1,$4]; next }
{
    for (key in flds) {
        split (key, fld, SUBSEP)
        if ($2<=fld[1] && $3>=fld[2])
            print $0, fld[1], fld[2]
    }
}' file1 file2
GT_003151 24916 25391 24926 25189

awk with duplicate values

File:
22 Hello
22 Hi
1 What
34 Where
21 is
44 How
44 are
44 you
Desired Output:
22 HelloHi
1 What
34 Where
21 is
44 Howareyou
If there are duplicate values in the first field ($1), the second-field text should be concatenated, as shown.
How can this be achieved using awk?
Thanks
$ awk '
    !seen[$1]++ { keys[++numKeys] = $1 }
    { str[$1] = str[$1] $2 }
    END{
        for (keyNr=1; keyNr<=numKeys; keyNr++) {
            key = keys[keyNr]
            print key, str[key]
        }
    }
' file
22 HelloHi
1 What
34 Where
21 is
44 Howareyou
Using awk:
awk '!($1 in a){a[$1]=$2;next} $1 in a{a[$1]=a[$1] $2} END{for (i in a) print i, a[i]}' file
22 HelloHi
44 Howareyou
34 Where
21 is
1 What
EDIT: To preserve the order:
awk '!($1 in a){b[++n]=$1; a[$1]=$2; next} $1 in a{a[$1] = a[$1] $2}
     END{for (i=1; i<=n; i++) print b[i], a[b[i]]}' file
22 HelloHi
1 What
34 Where
21 is
44 Howareyou
To maintain the order, you need to keep track of it:
awk '
    ! seen[$1]++ {order[++n] = $1}
    {value[$1] = value[$1] $2}
    END {for (i=1; i<=n; i++) print order[i], value[order[i]]}
' <<END
22 Hello
22 Hi
1 What
34 Where
21 is
44 How
44 are
44 you
END
22 HelloHi
1 What
34 Where
21 is
44 Howareyou
If you know the values in the 1st column are contiguous, as in your sample text, then:
awk '
    prev != $1 {printf "%s%s ", sep, $1; sep=RS}
    {printf "%s", $2; prev = $1}
    END {print ""}
'
A couple of other approaches:
perl -lane '
    push @keys, $F[0] unless grep {$_ eq $F[0]} @keys;
    $val{$F[0]} .= $F[1]
} END {
    print "$_ $val{$_}" for @keys
' file
and, reaching way into the niche zone:
#!/usr/bin/env tclsh
while {[gets stdin line] != -1} {dict append val {*}$line}
dict for {k v} $val {puts "$k $v"}
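Assuming the script is saved as append.tcl (a name made up here for illustration), it could be run as:
$ tclsh append.tcl < file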
Here is an alternate solution in Python, as requested by @shellter:
from collections import defaultdict

with open("file") as infile:
    d = defaultdict(str)
    # Build dictionary of values
    for line in infile:
        line = line.strip()
        k, _, v = line.partition(" ")
        d[k] += v
    # Print everything
    for k, v in d.items():
        print(k, v)
Note that the ordering is not preserved in this solution. Here is an alternate solution that provides exactly the desired output:
from collections import defaultdict

with open("file") as infile:
    d = defaultdict(str)
    orig_order = []
    # Build dictionary of values
    for line in infile:
        line = line.strip()
        k, _, v = line.partition(" ")
        d[k] += v
        # Add to original order if not seen yet
        if k not in orig_order:
            orig_order.append(k)
    # Print everything
    for k in orig_order:
        print(k, d[k])
Note that these are quickly crafted solutions; I am sure it is possible, without too much effort, to make them shorter or more flexible.
If the order is not important, this will work:
awk '{a[$1]=a[$1]$2}; END {for (i in a) {print i, a[i]}}' file
... and if the order is important:
awk '{if (!a[$1]) b[++i]=$1; a[$1]=a[$1]$2}; END {for (j=1;j<=i;j++) {print b[j], a[b[j]]}}' file
