Case/if-else statement to create new column in new csv - bash

I'm trying to do a case/if-else statement on a CSV file (e.g., myfile.csv) that analyzes a column, then creates a new column in a new csv (e.g., myfile_new.csv).
The source data (myfile.csv) looks like this:
unique_id,variable1,variable2
1,,C
2,1,
3,,A
4,,B
5,1,
I'm trying to do two transformations:
For the second field, if the input file has any data in the field, have it be 1, otherwise 0.
The third field is flattened into three fields. If the input file has an A in the third field, the third output field has 1, and 0 otherwise; the same for B and C and the fourth/fifth field in the output file.
I want the result (myfile_new.csv) to look like this:
unique_id,variable1,variable2_A,variable2_B,variable2_C
1,0,0,0,1
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,1,0,0,0
I'm trying to do the equivalent of this in SQL
select unique_id,
case when len(variable1)>0 then 1 else 0 end as variable1,
case when variable2 = 'A' then 1 else 0 end as variable2_A,
case when variable2 = 'B' then 1 else 0 end as variable2_B,
case when variable2 = 'C' then 1 else 0 end as variable2_C, ...
I'm open to whatever, but CSV files will be 500GB - 1TB in size so it needs to work with that size file.

Here is an awk solution that would do it:
awk 'BEGIN {
FS = ","
OFS = ","
}
NR == 1 {
$3 = "variable2_A"
$4 = "variable2_B"
$5 = "variable2_C"
print
next
}
{
$2 = ($2 == "") ? 0 : 1
$3 = ($3 == "A" ? 1 : 0) "," ($3 == "B" ? 1 : 0) "," ($3 == "C" ? 1 : 0)
print
}' myfile.csv > myfile_new.csv
In the BEGIN block, we set the input and output field separators to a comma.
The NR == 1 block creates the header for the output file; its next statement then skips the remaining block and moves on to the next record.
The third block checks whether the second field is empty and stores 0 or 1 in it; the $3 assignment concatenates the results of three ?: ternary expressions, comma separated, so the single input field becomes three output fields.
The output is
unique_id,variable1,variable2_A,variable2_B,variable2_C
1,0,0,0,1
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,1,0,0,0
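Since awk processes the input one line at a time and keeps no per-row state here, memory use stays flat no matter how large the file is, so the 500GB-1TB requirement is fine. If throughput becomes an issue, one option (untested at that scale) is to run the same script under mawk, which is often faster than gawk for simple field rewriting, and to force the C locale; assuming the program above is saved as transform.awk (a hypothetical filename):
LC_ALL=C mawk -f transform.awk myfile.csv > myfile_new.csv
Any POSIX awk should produce identical output.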

Quick and dirty solution using a while loop.
#!/bin/bash
#Variables:
line=""
result=""
declare -a linearray
echo "unique_id,variable1,variable2_A,variable2_B,variable2_C" > myfile_new.csv #Write the new header once
{
read -r line #Consume and discard the old header line
while read -r line; do
unset linearray #Clean the variables from the previous loop
unset result
IFS=',' read -r -a linearray <<< "$line" #Splits the line into an array, using the comma as the field separator
result="${linearray[0]}," #Column 1, at index 0, is the same in both files
if [ -z "${linearray[1]}" ]; then #If column 2, at index 1, is empty, then...
result="${result}0," #Pad empty strings with zero
else #Otherwise...
result="${result}1," #Flag the non-empty column 2 with a 1
fi
#The following read index 2, for column 3, and append the appropriate text. At most one can be true.
if [ "${linearray[2]}" == "A" ]; then result="${result}1,0,0"; fi
if [ "${linearray[2]}" == "B" ]; then result="${result}0,1,0"; fi
if [ "${linearray[2]}" == "C" ]; then result="${result}0,0,1"; fi
if [ "${linearray[2]}" == "" ]; then result="${result}0,0,0"; fi
echo "$result" >> myfile_new.csv #Append the resulting line to the new file
done
} < myfile.csv
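As a side note, here is a minimal illustration of the IFS=',' read -r -a splitting the loop relies on; consecutive commas yield an empty array element, which is exactly what the -z test detects (output from a recent bash; the declare -p formatting may vary slightly between versions):
$ IFS=',' read -r -a f <<< '3,,A'; declare -p f
declare -a f=([0]="3" [1]="" [2]="A")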

Related

Count occurrences in a csv with Bash

I have to create a script that, given a country and a sport, gets the number of medalists and medals won after reading a CSV file.
The CSV is called "athletes.csv" and has this header:
id|name|nationality|sex|date_of_birth|height|weight|sport|gold|silver|bronze|info
When you call the script you have to add the nationality and sport as parameters.
The script I have created is this one:
#!/bin/bash
participants=0
medals=0
while IFS=, read -ra array
do
if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
then
participants=$participants++
medals=$(($medals+${array[8]}+${array[9]}+${array[10]))
fi
done < athletes.csv
echo $participants
echo $medals
where array[3] is the nationality, array[8] is the sport and array[9] to [11] are the number of medals won.
When I run the script with the correct parameters I get 0 participants and 0 medals.
Could you help me understand what I'm doing wrong?
Note: I cannot use awk nor grep.
Thanks in advance
Try this:
#! /bin/bash -p
nation_arg=$1
sport_arg=$2
declare -i participants=0
declare -i medals=0
declare -i line_num=0
while IFS=, read -r _ _ nation _ _ _ _ sport ngold nsilver nbronze _; do
(( ++line_num == 1 )) && continue # Skip the header
[[ $nation == "$nation_arg" && $sport == "$sport_arg" ]] || continue
participants+=1
medals+=ngold+nsilver+nbronze
done <athletes.csv
declare -p participants
declare -p medals
The code uses named variables instead of numbered positional parameters and array indexes to try to improve readability and maintainability.
Using declare -i means that strings assigned to the declared variables are treated as arithmetic expressions. That reduces clutter by avoiding the need for $(( ... )).
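A minimal illustration of that declare -i behavior:
declare -i n=0
n+=2+3      # the string 2+3 is evaluated arithmetically; n is now 5
echo "$n"   # prints 5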
The code assumes that the field separator in the CSV file is a comma (,), not a pipe (|) as in the header shown. If the separator really is |, replace IFS=, with IFS='|'.
I'm assuming that the field delimiter of your CSV file is a comma but you can set it to whatever character you need.
Here's a fixed version of your code:
#!/bin/bash
participants=0
medals=0
{
# skip the header
read
# process the records
while IFS=',' read -ra array
do
if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
then
(( participants++ ))
medals=$(( medals + array[8] + array[9] + array[10] ))
fi
done
} < athletes.csv
echo "$participants" "$medals"
Remark: as $1 and $2 are left unquoted on the right side of [[ ... == ... ]], they are subject to glob (pattern) matching. For example, you'll be able to show the total number of medals won by the US with:
./script.sh 'US' '*'
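A quick demonstration of that pattern matching, in case the distinction is unclear:
[[ swimming == s* ]] && echo match        # unquoted right side: glob pattern
[[ swimming == 's*' ]] || echo no match   # quoted right side: literal string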
But I have to say, doing text processing in pure shell isn't considered good practice; dedicated tools exist for that. Here's an example with awk:
awk -v FS=',' -v country="$1" -v sport="$2" '
BEGIN {
participants = medals = 0
}
NR == 1 { next }
$3 == country && $8 == sport {
participants++
medals += $9 + $10 + $11
}
END { print participants, medals }
' athletes.csv
There's also a potential problem remaining: the CSV format might need a real CSV parser to be read accurately. There exist a few awk libraries for that, but IMHO it's simpler to use a CSV-aware tool that provides the functionality you need.
Here's an example with Miller:
mlr --icsv --ifs=',' filter -s country="$1" -s sport="$2" '
begin {
@participants = 0;
@medals = 0;
}
$nationality == @country && $sport == @sport {
@participants += 1;
@medals += $gold + $silver + $bronze;
}
false;
end { print @participants, @medals; }
' athletes.csv
The bare false; makes filter drop every record, so the only output is the two totals printed by the end block.

Retrieve entire column to a new file if it matches a list in another file

I have a huge file, and I need to retrieve specific columns from file1 (~200000 rows and ~1000 columns) if they match the list in file2. (I prefer Bash over R.)
for example my dummy data files are as follows,
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
Likewise, I have 3 different file2 lists, and I have to pick different samples from the same file1 into a new file each time.
I would be very grateful if you guys can provide me with your valuable suggestions.
P.S.: I am a biologist; I have very little coding experience.
Regards
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
' $*
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, the awk script has the form:
<pattern> <command>
There are three such pairs above. Each needs a little explanation:
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
NR == FNR is an awk-ism. NR is the record number and FNR is the record number in the current file. NR is always increasing, but FNR resets to 1 when awk starts parsing the next file. NR == FNR is an idiom that is only true while parsing the first file.
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
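A tiny standalone demonstration of the NR == FNR idiom, using two throwaway files:
$ printf 'a\nb\n' > one; printf 'x\n' > two
$ awk 'NR == FNR { print "first file:", $0; next } { print "second file:", $0 }' one two
first file: a
first file: b
second file: x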
The second "stanza":
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes the first parameter, $0, the current line, and splits it according to whitespace. We know the current line is the first line and has the column headers in it. The split command writes to an array named in the second parameter, headers. Now headers[1] = "gene", headers[2] = "s1", headers[3] = "s2", and so on.
We're going to need to map the column names to the column numbers. The next bit of code takes each header value and creates an aheaders entry. aheaders is an associative array that maps column header names to the column number.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
The third stanza has no explicit pattern. Awk processes a pattern-less stanza as always true, so this last stanza is executed for every line of the second file.
At this point, we want to print the columns that are specified in the columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
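Dynamic field access with $<expression> is the trick that makes this remapping work; a standalone demonstration:
$ echo 'a b c' | awk '{ n = 2; print $n, $(n+1) }'
b c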
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
On the next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] was never assigned, so it is empty.
In this case, the length( aheaders[ columns[x] ] ) == 0 is true so we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exits and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line of file1 and hits the third stanza again (and only that one). Thus awk continues until the second file is completely read.
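One portability caveat: length() on an array, as used in the loop conditions above, is a gawk extension; POSIX only defines length() for strings. A sketch of a portable variant that counts the requested columns itself (same logic, only verified against the sample data):
awk '
NR == FNR { columns[++ncols] = $0; printf "%s\t", $0; next }
FNR == 1 { print ""; for (x = 1; x <= NF; x++) aheaders[$x] = x; next }
{
    for (x = 1; x <= ncols; x++)
        printf "%s\t", (columns[x] in aheaders ? $aheaders[columns[x]] : "N/A")
    print ""
}' file2 file1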

Awk script to provide sum of certain columns in another column based on criteria

Need your help:
I have one file where the data is as shown below.
Data for both of the scenarios below is present in a single file, and I want the expected output in that same file, if possible.
Scenario 1:
If the value in the first column DocumentNo appears once and
the second column Line has the value 10, then I would like to sum columns 3, 4, 5 and 6 (Taxablevalue, IGSTAmount, CGSTAm and SGSTAmo) and place this sum in the eighth column InvoiceValue:
example data:
DocumentNo|Line|Taxablevalue|IGSTAmount|CGSTAm|SGSTAmo|OthTa|InvoiceValue
262881894|10|10000|0|900|900||
Scenario 2:
If we have multiple rows with identical values in the first column DocumentNo and unique values in the second column LineN, then I would like to sum all the values of columns 3, 4, 5 and 6 (Taxablevalue, IGSTAmount, CGSTAm and SGSTAmo) and place this sum in the eighth column InvoValue of each line.
example data:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||
262881894|20|15000|0|1350|1350||
262881894|30|20000|0|1800|1800||
Expected output Scenario 1:
DocumentNo|Line|Taxablevalue|IGSTAmount|CGSTAm|SGSTAmo|OthTa|InvoiceValue
262881894|10|10000|0|900|900||11800
Expected output Scenario 2:
Invoice Value = 10000+15000+20000+0+0+0+900+1350+1800+900+1350+1800 =
53100
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||53100
262881894|20|15000|0|1350|1350||53100
262881894|30|20000|0|1800|1800||53100
Below is the code I tried, but I was not able to figure out how to put the summed values in the last column (InvoValue):
awk '{a[$1]+=$3;b[$1]+=$4;c[$1]+=$5;d[$1]+=$6;}
END {for(i in a) { print " " a[i] " " b[i] " " c[i] " " d[i];}}' File
Below is the output of the code that I'm getting. Sadly, it does not match my expected output:
0 0 0 0
I would do it in two passes. (Your version prints zeros because it never sets the field separator to |, so $3 through $6 are empty on every line.)
On the first pass I would build an array s that holds the sum of columns 3, 4, 5 and 6 for each document number.
On the second pass I would replace the value in the InvoValue column.
Here's an example input data.txt:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||
262881894|20|15000|0|1350|1350||
262881894|30|20000|0|1800|1800||
262881895|10|10000|0|900|900||
Here is the command:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1;' data.txt data.txt
Here is the output:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||53100
262881894|20|15000|0|1350|1350||53100
262881894|30|20000|0|1800|1800||53100
262881895|10|10000|0|900|900||11800
Note that I ignored column 2 altogether. You might need to modify my answer if you want to account for the LineN.
To ensure that all pairs (DocumentNo, LineN) are unique and occur only once, you could add this error detection:
if (met[$1 FS $2]) print "ERROR: " $1 " " $2;
met[$1 FS $2] = 1;
So the updated command with error detection would be:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { if (met[$1 FS $2]) print "ERROR: " $1 " " $2; met[$1 FS $2] = 1; s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1;' data.txt data.txt
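Note that because data.txt is named twice, this two-pass approach needs a real file it can read twice; it won't work with input from a pipe. A single-pass sketch that buffers the rows in memory instead (fine whenever the file fits in RAM):
gawk 'BEGIN { OFS=FS="|" }
NR == 1 { print; next }
{ rows[NR] = $0; s[$1] += $3+$4+$5+$6 }
END { for (i = 2; i <= NR; i++) { $0 = rows[i]; $8 = s[$1]; print } }' data.txt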

unix shell: replace by dictionary

I have file which contains some data, like this
2011-01-02 100100 1
2011-01-02 100200 0
2011-01-02 100199 3
2011-01-02 100235 4
and have some "dictionary" in separate file
100100 Event1
100200 Event2
100199 Event3
100235 Event4
and I know that
0 - warning
1 - error
2 - critical
etc...
I need some script with sed/awk/grep or something else which helps me receive data like this
100100 Event1 Error
100200 Event2 Warning
100199 Event3 Critical
etc
I will be grateful for ideas on how to do this the best way, or for a working example.
update
sometimes I have data like this
2011-01-02 100100 1
2011-01-02 sometext 100200 0
2011-01-02 100199 3
2011-01-02 sometext 100235 4
where sometext = any 6 characters (maybe this is helpful info)
in this case I need whole data:
2011-01-02 sometext EventNameFromDictionary Error
or without "sometext"
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
print $2, evt[$2], lvl[$3]
}' dictionary infile
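With the sample dictionary and data, this prints something like the following; levels 3 and 4 have no entry in lvl, so the last column comes out empty on those lines, and the level names are lowercase (adjust the lvl strings if you want Error/Warning/Critical):
100100 Event1 error
100200 Event2 warning
100199 Event3
100235 Event4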
Adding a new answer for the new requirement and because of the limited formatting options inside a comment:
awk 'BEGIN {
lvl[0] = "warning"
lvl[1] = "error"
lvl[2] = "critical"
}
NR == FNR {
evt[$1] = $2; next
}
{
if (NF > 3) {
idx = 3; $1 = $1 OFS $2
}
else idx = 2
print $1, $idx in evt ? \
evt[$idx] : $idx, $++idx in lvl ? \
lvl[$idx] : $idx
}' dictionary infile
You won't need to escape the newlines inside the ternary operator if you're using GNU awk.
Some awk implementations may have problems with this part:
$++idx in lvl ? lvl[$idx] : $idx
If you're using one of those,
change it to:
$(idx + 1) in lvl ? lvl[$(idx + 1)] : $(idx + 1)
OK, comments added. One thing to watch: in awk a comment cannot follow a line-continuation backslash, so the comments for the print statement sit above it:
awk 'BEGIN {
  lvl[0] = "warning"   # map the error levels
  lvl[1] = "error"
  lvl[2] = "critical"
}
NR == FNR {            # while reading the first input file (the dictionary)
  evt[$1] = $2         # build the associative array evt, keyed by the
                       # value of the first column; the second column
                       # holds the values
  next                 # skip the rest of the program
}
{                      # now reading the rest of the input
  if (NF > 3) {        # if the number of columns is greater than 3
    idx = 3            # set idx to 3 (the key in evt)
    $1 = $1 OFS $2     # and merge $1 and $2
  }
  else idx = 2         # else set idx to 2
  # print the first column, then the value of the second (or third,
  # depending on idx) column translated through evt if it is an
  # existing key there (the actual column value otherwise), and then
  # the same for the next column and the lvl array, incrementing idx
  # first because we are searching lvl now
  print $1, ($idx in evt ? evt[$idx] : $idx), \
            ($++idx in lvl ? lvl[$idx] : $idx)
}' dictionary infile
I hope perl is ok too:
#!/usr/bin/perl
use strict;
use warnings;
open(DICT, 'dict.txt') or die;
my %dict = %{{ map { my ($id, $name) = split; $id => $name } (<DICT>) }};
close(DICT);
my %level = ( 0 => "warning",
1 => "error",
2 => "critical" );
open(EVTS, 'events.txt') or die;
while (<EVTS>)
{
my ($d, $i, $l) = split;
$i = $dict{$i} || $i; # lookup
$l = $level{$l} || $l; # lookup
print "$d\t$i\t$l\n";
}
Output:
$ ./script.pl
2011-01-02 Event1 error
2011-01-02 Event2 warning
2011-01-02 Event3 3
2011-01-02 Event4 4

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
NR==2 {
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
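A quick illustration of that point about the in operator checking indices rather than values:
$ awk 'BEGIN { a["x"] = 1; print ("x" in a), (1 in a) }'
1 0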
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three (the trailing comma produces an empty fourth field), so I added an if to keep it from printing the null value, which would otherwise result in an extra comma being printed.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}' knownGeneFromUCSC.txt
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", @F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
Line 4 of the awk script (the _[t[i]]++ || ... statement) preserves the original order of the values in the second field while filtering out duplicates. The -3 after the closing brace is a non-zero numeric constant acting as an always-true pattern, so the rebuilt line is printed.
Sorry, I know you asked about awk... but Perl makes this much more simple:
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part    # parts of a line
declare -a part2   # parts of the 2nd column
declare -A check   # used to remember items in part2
while read -r line ; do
part=( $line )     # split line using whitespace
IFS=','            # separator is comma
part2=( ${part[1]} )   # split 2nd column using comma
if [ ${#part2[@]} -gt 1 ] ; then   # more than 1 field in 2nd column?
check=()           # empty check array
new2=''            # empty new 2nd column
for item in ${part2[@]} ; do
(( check[$item]++ ))   # remember items in 2nd column
if [ ${check[$item]} -eq 1 ] ; then   # not yet seen?
new2=$new2,$item   # add to new 2nd column
fi
done
part[1]=${new2#,}  # remove leading comma
fi
IFS=$'\t'          # separator for the output
echo "${part[*]}"  # rebuild line
done < "$infile"   # set infile beforehand, e.g. infile=knownGeneFromUCSC.txt
