Improving a bash script for csv files - bash

I have a bunch of CSV files in a folder. All of them have the same structure, with more than 2k columns. The first column is the ID.
I need to do the following for each file:
For each odd column n (except the first column), do the following:
If column n is 0 in all of the rows, then delete column n and also column n-1.
If column n is 100 in all of the rows, then delete column n.
Print the indexes of the removed columns.
I have the following code:
for f in *.csv; do
awk 'BEGIN { FS=OFS="," }
NR==1 {
    for (i=3; i<=NF; i+=2)
        a[i]
}
FNR==NR {
    for (i=1; i<=NF; i++)
        sums[i] += $i;
    ++r;
    next
} {
    for (i=1; i<=NF; i++)
        if (sums[i] > 0 && sums[i+1]>0 && sums[i] != 100*r)
            printf "%s%s", (i>1)?OFS:"", $i;
        else print "removed index: " i > "removed.index"
    print ""
}' "$f" "$f" > "new_$f"
done
For some reason the ID column (the first column) is being removed.
Input:
23232,0,0,5,0,1,100,3,0,33,100
21232,0,0,5,0,1,100,3,0,33,100
23132,0,0,5,0,1,100,3,0,33,100
23212,0,0,5,0,1,100,3,0,33,100
24232,0,0,5,0,1,100,3,0,33,100
27232,0,0,5,0,1,100,3,0,33,100
Current output (bad):
,1,33
,1,33
,1,33
,1,33
,1,33
,1,33
Expected output:
23232,1,33
21232,1,33
23132,1,33
23212,1,33
24232,1,33
27232,1,33
Can anyone see what the issue is?

You need to exclude the first column from the logic that checks for 0 in the previous column. In your version the loop starts at i=1, and for the ID column the test sums[i+1]>0 fails (column 2 sums to 0 in your sample), so the ID is dropped along with it:
awk 'BEGIN { FS=OFS=","; out=ARGV[1] ".removed.index" }
FNR==NR {
    for (i=1; i<=NF; i++)
        sums[i] += $i;
    ++r;
    next
}
FNR==1 {
    for (i=3; i<=NF; i++) {
        if (sums[i] == 0) {
            if (i-1 in sums) {
                delete sums[i-1];
                print "removed index: " (i-1) > out
            }
            delete sums[i];
            print "removed index: " i > out
        } else if (sums[i] == 100*r) {
            delete sums[i];
            print "removed index: " i > out
        }
    }
}
{
    printf "%s", $1
    for (i=2; i<=NF; i++)
        if (i in sums)
            printf "%s%s", OFS, $i;
    printf "%s", ORS
}
END { close(out) }' file file
Output:
23232,1,33
21232,1,33
23132,1,33
23212,1,33
24232,1,33
27232,1,33
And the removed indices:
$ cat file.removed.index
removed index: 2
removed index: 3
removed index: 4
removed index: 5
removed index: 7
removed index: 8
removed index: 9
removed index: 11
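To run this over every CSV in the folder, as in your original loop, one option (just a sketch; prune.awk is a hypothetical name for the awk program above saved to a file) is:
# prune.awk holds the awk program above (hypothetical file name)
for f in *.csv; do
    awk -f prune.awk "$f" "$f" > "new_$f"
done
Because out is built from ARGV[1], each input file then gets its own .removed.index file alongside new_$f.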

Related

bash - select columns based on values

I am new to bash and have the below requirement:
I have a file as below:
col1,col2,col3....col25
s1,s2,s2..........s1
col1,col2,col3....col25
s3,s2,s2..........s2
As you can see, the values of these columns can be only of 3 types: s1, s2, s3.
I can extract the last 2 rows from the given file, which gives me:
col1,col2,col3....col25
s3,s1,s2..........s2
I want to further parse the above lines so that I get only the columns with a given value, say s2.
Desired output:
Say col3 and col25 are the only columns with value s2; then a comma-separated list is fine, e.g.:
col3,col25
Can someone please help?
P.S. I found many examples where a file is parsed based on the value of, say, the 2nd (fixed) column, but how do we do it when the column number is not fixed?
Checked URLs:
awk one liner select only rows based on value of a column
Assumptions:
there are 2 input lines
each input line has the same number of comma-separated items
We can use a couple of arrays to collect the input data, making sure to use the same array indexes. Once the data is loaded into the arrays, we loop through them looking for our value match.
$ cat col.awk
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
sep=""
for (i=1; i<=n; i++)
{ if (arr_s[i]==smatch)
{ printf "%s%s" ,sep,arr_c[i]
sep=", "
}
}
}
/col1/ : for the line that contains col1, store the fields in array arr_c
n=NF : grab our max array index value (NF=number of fields)
! /col1/ : for the line that does not contain col1, store the fields in array arr_s
END ... : executed once the arrays have been loaded
sep="" : set our initial output separator to a null string
for (...) : loop through our array indexes (1 to n)
if (arr_s[i]==smatch) : if the s array value matches our input parameter (smatch - see below example), then ...
printf "%s%s",sep,arr_c[i] : printf our sep and the matching c array item, then ...
sep=", " : set our separator for the next match in the loop
We use printf because without specifying '\n' (a new line), all output goes to one line.
Example:
$ cat col.out
col1,col2,col3,col4,col5
s3,s1,s2,s1,s3
$ awk -F, -f col.awk smatch=s1 col.out
col2, col4
-F, : define the input field separator as a comma
here we pass in our search pattern s1 via the awk variable named smatch, which is referenced in the awk code (see col.awk - above)
If you want to do the whole thing at the command line:
$ awk -F, '
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
sep=""
for (i=1; i<=n; i++)
{ if (arr_s[i]==smatch)
{ printf "%s%s" ,sep,arr_c[i]
sep=", "
}
}
}
' smatch=s1 col.out
col2, col4
Or collapsing the END block to a single line:
awk -F, '
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END { sep="" ; for (i=1; i<=n; i++) { if (arr_s[i]==smatch) { printf "%s%s" ,sep,arr_c[i] ; sep=", " } } }
' smatch=s1 col.out
col2, col4
I'm not so good with awk, but here is something that seems to work, outputting only the column names whose corresponding values are s1:
#<yourTwoLines> |
tac |
awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'
It works in the following way (a sample invocation is sketched after this list):
reverse the line order with tac, so the values (the criteria) are handled before the headers (which we will print based on the criteria).
when handling the first line (now the values) with awk, store in an array which ones are s1.
when handling the second line (now the headers) with awk, print those which correspond to an s1 value, thanks to the previously filled array.
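For completeness, one possible way to feed it the two lines (a sketch, assuming GNU tac is available and the two lines of interest are the last two lines of a file named file):
tail -n 2 file | tac |
awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
            NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'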
A solution in awk that prints a resulting row after parsing each set of 2 rows:
$ cat tst.awk
BEGIN { FS=","; p=0 }
/s1|s2|s3/ {
    for (i=1; i<=NF; i++) {
        if ($i=="s2") str = sprintf("%s%s", str?str ", ":str, c[i])
    };
    p=1
}
!p { for (i=1; i<=NF; i++) { c[i] = $i } }
p { print str; p=0; str="" }
Rationale: build up your result string str while looping through the value row.
whenever the input contains s1, s2 or s3, loop through the fields and, if the value is s2, add the column with index i to the result string str; set the print flag p to 1.
if p = 0, build up the column array.
if p = 1, print the result string str.
With input:
$ cat input.txt
col1,col2,col3,col4,col5
s1,s2,s2,s3,s1
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
col1,col2,col3,col4,col5
s1,s1,s1,s3,s3
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
The result is:
$ awk -f tst.awk input.txt
col2, col3
col3

col3
Notice the empty 3rd line: no s2's for that one.
Let's say you have this:
cat file
col1,col2,col3,..,col25
s3,s1,s2,........,s2
Then you can use this awk:
awk -F, -v val='s2' '{
    s="";
    for (i=1; i<=NF; i++)
        if (NR==1)
            hdr[i]=$i
        else if ($i==val)
            s=s hdr[i] FS;
    if (s) {
        sub(/,$/, "", s);
        print s
    }
}' file
col3,col25
If the order of the returned columns is not a concern:
awk -F"," 'NR==1{for(i=1;i<=NF;i++){a[i]=$i};next}{for(i=1;i<=NF;i++){if($i=="s2")b[i]=$i}}END{for( i in b) m=m a[i]","; gsub(/,$/,"", m); print m }'

Print columns in for loop in awk wrt variable value

I am trying to count the number of occurrences of the word "is" using awk with the sample program below:
awk '
BEGIN { count = 0; word="is"; out=$ }
/word/ {
for (i=1; i<=NR; i++) {
if ($(i) == word) count++;
}
}
END {print "Found word " word count " no of times"}
' data.txt
But the problem here is that $(i) is not being interpreted as a column number.
Can you please suggest what should be written in place of $(i) to reference the column (dynamically) according to the value of i in that line?
i is the field or column number (1, 2, 3, ...) and $i is the value in that field:
$ echo This is it|awk '{for(i=1; i<=NF; i++) print i" "$i}'
1 This
2 is
3 it
So your program:
$ cat test.awk
BEGIN {
    count = 0
    word = "is"
}
{
    for (i=1; i<=NF; i++)
        if ($(i) == word)
            count++
}
END {
    print "Found word " word " " count " no of times"
}
$ echo This is it|awk -f test.awk
Found word is 1 no of times
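One more note on the original program: /word/ is a regular expression that matches the literal text "word", not the value of the variable word. If you want to filter lines by a variable, you can use $0 ~ word instead; a small illustrative sketch (not part of the answer above):
$ echo This is it | awk -v word="is" '$0 ~ word { for (i=1; i<=NF; i++) if ($i == word) count++ } END { print count+0 }'
1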

Awk script within shell script

I wrote some awk script to be executed while looping over {a..z}.txt files. I've been staring at this code for 30 minutes, but I just can't find what's wrong. The terminal complains that there is some syntax error around >, but I don't think that's where the bug is.
Basically, what I'm trying to do is this:
Each line contains a string and a following set of numbers. I want to re-print the numbers so that the first number is the smallest one of them.
input:  a 1125159 2554 290 47364290 47392510 48629708 68 60771
output: a 290 1125159 2554 47364290 47392510 48629708 68 60771
Could anyone help me find what is wrong with the below code?
for alphabet in {a..z}
do
awk -F$'\t' "NF>2{maxId=\$2;maxIndex=2;
for(i=2; i<=NF; i++){
if(maxId>\$i){maxId=\$i; maxIndex=i}
};
printf \"%s \t %s \t\",\$1, maxId;
for(i=2; i<=NF; i++){
if(i!=maxIndex)
printf \"%d \t\", \$i};
printf \"\n\";
}" $alphabet.merged > $alphabet.out
done
Here's how your script should really be written:
awk 'BEGIN { FS=OFS="\t" }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i
        }
    }
    print ""
}' file
a 68 2554 290 47364290 47392510 48629708 1125159 60771
Don't shy away from white space and brackets, as they help readability. I don't understand the purpose of the surrounding shell loop in your question, though; I suspect all you really need is:
awk 'BEGIN { FS=OFS="\t" }
FNR==1 { close(out); out=FILENAME; sub(/merged/,"out",out) }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex > out
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i > out
        }
    }
    print "" > out
}' *.merged
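If you do want to keep the shell loop, a cleaner pattern is to keep the awk program in single quotes or in its own file and pass the file names as arguments, instead of escaping every $. A sketch, with reorder.awk being a hypothetical file holding the first program above (the one that prints to stdout):
for alphabet in {a..z}; do
    awk -f reorder.awk "$alphabet.merged" > "$alphabet.out"
done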

Finding the minimum and maximum length of columns in a CSV file using shell script

I have several CSV files with multiple columns, and I want to get the max length and min length of the individual columns, and the diff (max - min) for each column, in the same CSV file. Example:
File:
abc 1234 4
bcd 23644 534
c 3232 6
Expected output:
abc 1234 4
bcd 23644 534
c 3232 6
Max Length 3 5 3
Min Length 1 4 1
Diff 2 1 2
The following script for computing the MAX column length is producing the expected output:
awk -F, '
{ for (i=1;i<=NF;i++)l[i]=((x=length($i))>l[i]?x:l[i])}
END {for(i=1;i<=NF;i++) print "Column"i":",l[i]} '
but there is a problem with the MIN length script:
awk -F"," 'BEGIN {
for (i=1;i<=NF;i++) {
cur = length($i)
if ( (min == 0) || (cur < min) ) {
minlength = i
min = cur
}
} ;
for (i=1;i<=NF;i++) print $minlength}'
Any help would be greatly appreciated.
You just need to set the starting values for the min and max arrays based on the first line of the file:
awk '
NR==1 { for (i=1; i<=NF; i++) maxlen[i] = minlen[i] = length($i) }
{
    for (i=1; i<=NF; i++) {
        len = length($i)
        if (len > maxlen[i]) maxlen[i] = len
        if (len < minlen[i]) minlen[i] = len
    }
}
END {
    printf "Max Length"
    for (i=1; i<=NF; i++) printf " %d", maxlen[i]
    print ""
    printf "Min Length"
    for (i=1; i<=NF; i++) printf " %d", minlen[i]
    print ""
    printf "Diff"
    for (i=1; i<=NF; i++) printf " %d", maxlen[i]-minlen[i]
    print ""
}
' file
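Note that this relies on awk's default whitespace splitting, which matches the sample shown; if your real files are comma-separated, set the field separator explicitly. A sketch, with minmax.awk being a hypothetical file holding the program above:
awk -F, -f minmax.awk file.csv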

Rearranging a csv file

I have a file with contents similar to the below:
Boy,Football
Boy,Football
Boy,Football
Boy,Squash
Boy,Tennis
Boy,Football
Girl,Tennis
Girl,Squash
Girl,Tennis
Girl,Tennis
Boy,Football
How can I use 'awk' or similar to rearrange this to the below:
Football Tennis Squash
Boy 5 1 1
Girl 0 3 1
I'm not even sure if this is possible, but any help would be great.
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{
    genders[$1]
    sports[$2]
    count[$1,$2]++
}
END {
    printf ""
    for (sport in sports) {
        printf "%s%s", OFS, sport
    }
    print ""
    for (gender in genders) {
        printf "%s", gender
        for (sport in sports) {
            printf "%s%s", OFS, count[gender,sport]+0
        }
        print ""
    }
}
$ awk -f tst.awk file
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
In general when you know the end point of the loop you put the OFS or ORS after each field:
for (i=1; i<=n; i++) {
printf "%s%s", $i, (i<n?OFS:ORS)
}
but if you don't then you put the OFS before the second and subsequent fields and print the ORS after the loop:
for (x in array) {
printf "%s%s", (++i>1?OFS:""), array[x]
}
print ""
I do like the:
n = length(array)
for (x in array) {
printf "%s%s", array[x], (++i<n?OFS:ORS)
}
idea to get the end of the loop too, but length(array) is gawk-specific.
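A portable alternative (a small sketch, not taken from the answers above) is to count the elements with a first pass over the array, then use that count to place ORS on the last item:
n = 0
for (x in array) n++
i = 0
for (x in array) {
    printf "%s%s", array[x], (++i<n?OFS:ORS)
}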
Another approach to consider:
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        if (!seen[i,$i]++) {
            map[i,++num[i]] = $i
        }
    }
    count[$1,$2]++
}
END {
    for (i=0; i<=num[2]; i++) {
        printf "%s%s", map[2,i], (i<num[2]?OFS:ORS)
    }
    for (i=1; i<=num[1]; i++) {
        printf "%s%s", map[1,i], OFS
        for (j=1; j<=num[2]; j++) {
            printf "%s%s", count[map[1,i],map[2,j]]+0, (j<num[2]?OFS:ORS)
        }
    }
}
$ awk -f tst.awk file
Football Squash Tennis
Boy 5 1 1
Girl 0 1 3
That last will print the rows and columns in the order they were read. Not quite as obvious how it works though :-).
I would just loop normally:
awk -F, -v OFS="\t" '
{ names[$1]; sport[$2]; count[$1,$2]++ }
END {
    printf "%s", OFS
    for (i in sport)
        printf "%s%s", i, OFS
    print ""
    for (n in names) {
        printf "%s%s", n, OFS
        for (s in sport)
            printf "%s%s", count[n,s]?count[n,s]:0, OFS
        print ""
    }
}' file
This keeps track of three arrays: names[] for the first column, sport[] for the second column and count[name,sport] to count the occurrences of every combination.
Then it is a matter of looping through the results and printing them in a fancy way, making sure 0 is printed if count[a,b] does not exist.
Test
$ awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) printf "%s%s", i, OFS; print ""; for (n in names) {printf "%s%s", n, OFS; for (s in sport) printf "%s%s", count[n,s]?count[n,s]:0, OFS; print ""}}' a
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
The format is a bit ugly; there are some trailing OFS separators.
To get rid of trailing OFS:
awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) {cn++; printf "%s%s", i, (cn<length(sport)?OFS:ORS)} for (n in names) {cs=0; printf "%s%s", n, OFS; for (s in sport) {cs++; printf "%s%s", count[n,s]?count[n,s]:0, (cs<length(sport)?OFS:ORS)}}}' a
You can always pipe to column -t for a nice output.
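For example, with the tst.awk from above (whose output is tab-separated), column -t aligns the columns on whitespace:
$ awk -f tst.awk file | column -t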
