Bash, csv file: get number of rows by column name

What I have:
file.csv
car, speed, gas, color
2cv, 120, 8 , green
vw, 80, , yellow
Jaguar, 250, 15 , red
Benz, , , silver
What I have found:
This script returns exactly what I need by column number:
#!/bin/bash
awk -F', *' -v col=3 '
    FNR>1 {
        if ($col)
            maxc=FNR
    }
    END {
        print maxc
    }
' file.csv
read -p "For End press Enter or Ctrl + C"
I get exactly the output which I need (the number of the last line of column):
* for "col=1" ("car" column), the answer: 5
* for "col=2" ("speed" column), the answer: 4
* for "col=3" ("gas" column), the answer: 4
* for "col=4" ("color" column), the answer: 5
What I am looking for:
I am looking for a way to get the same result not by column number (e.g. col=3) but by the column headline value (e.g. col=gas).
It may need something additional like:
col_name=gas # selected column headline
col=get column number from $col_name # not working part

Here's one way to do exactly what you do, but which finds the column number from its name when FNR==1:
#!/bin/bash
columns=(car speed gas color)
for col in "${columns[#]}"
do
LINE_CNT=$(awk '-F[\t ]*,[\t ]*' -vcol=${col} '
FNR==1 {
for(i=1; i<=NF; ++i) {
if($i == col) {
col = i;
break;
}
}
if(i>NF) {
exit 1;
}
}
FNR>1 {
if($col) maxc=FNR;
}
END{
print maxc;
}' file.csv)
echo "$col $LINE_CNT"
done
Output:
car 5
speed 4
gas 4
color 5
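If you only ever need a single column, the same header lookup can be done in one awk call without the surrounding bash loop. Here is a minimal sketch along those lines (the variable name col_name and the error message are my own, not part of the answer above):
#!/bin/bash
col_name=gas   # selected column headline
awk -F', *' -v name="$col_name" '
    FNR==1 {                              # map the headline to a column number
        for (i=1; i<=NF; ++i)
            if ($i == name) col = i
        next
    }
    col && $col != "" { maxc = FNR }      # remember the last row with a value in that column
    END { if (col) print maxc; else print "column not found: " name }
' file.csv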

Related

Awk script to provide the sum of certain columns in another column based on criteria

I need your help; I have one file where the data is as shown below.
Data for both of the scenarios below is present in a single file, and I want the expected output in the same file if possible.
Scenario 1:
If the value in the first column DocumentNo appears once and the second column Line has the value 10, then I would like to sum columns 3, 4, 5 and 6 (Taxablevalue, IGSTAmount, CGSTAm and SGSTAmo) and place/replace this summed value in the eighth column, InvoiceValue:
example data:
DocumentNo|Line|Taxablevalue|IGSTAmount|CGSTAm|SGSTAmo|OthTa|InvoiceValue
262881894|10|10000|0|900|900||
Scenario 2:
If we have multiple rows with identical values in the first column DocumentNo and unique values in the second column LineN, then I would like to sum all the values of columns 3, 4, 5 and 6 (Taxablevalue, IGSTAmount, CGSTAm and SGSTAmo) and place/replace this summed value in the eighth column InvoValue of each line.
example data:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||
262881894|20|15000|0|1350|1350||
262881894|30|20000|0|1800|1800||
Expected output Scenario 1:
DocumentNo|Line|Taxablevalue|IGSTAmount|CGSTAm|SGSTAmo|OthTa|InvoiceValue
262881894|10|10000|0|900|900||11800
Expected output Scenario 2:
Invoice Value = 10000+15000+20000+0+0+0+900+1350+1800+900+1350+1800 =
53100
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||53100
262881894|20|15000|0|1350|1350||53100
262881894|30|20000|0|1800|1800||53100
Below is the code I tried, but I am not able to figure out how to put the summed values in the last column (InvoValue):
awk '{a[$1]+=$3;b[$1]+=$4;c[$1]+=$5;d[$1]+=$6;}
END {for(i in a) { print " " a[i] " " b[i] " " c[i] " " d[i];}}' File
Below is the output of the code that I'm getting; sadly it does not match my expected output:
0 0 0 0
I would do it in two passes.
On the first pass I would create a dictionary s that holds the sum of columns 3, 4, 5 and 6 for each document number.
On the second pass I would replace the value in InvoValue column.
Here's an example input data.txt:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||
262881894|20|15000|0|1350|1350||
262881894|30|20000|0|1800|1800||
262881895|10|10000|0|900|900||
Here is the command:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1;' data.txt data.txt
Here is the output:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||53100
262881894|20|15000|0|1350|1350||53100
262881894|30|20000|0|1800|1800||53100
262881895|10|10000|0|900|900||11800
Note that I ignored column 2 altogether. You might need to modify my answer if you want to account for the LineN.
To ensure that all pairs (DocumentNo, LineN) are unique and occur only once, you could add this error detection:
if (met[$1 FS $2]) print "ERROR: " $1 " " $2;
met[$1 FS $2] = 1;
So the updated command with error detection would be:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { if (met[$1 FS $2]) print "ERROR: " $1 " " $2; met[$1 FS $2] = 1; s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1;' data.txt data.txt
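The question also asks for the result to end up in the same file. Since the command reads data.txt twice, it cannot safely redirect output onto its own input; a simple workaround (my own suggestion, not part of the answer above) is to write to a temporary file and move it into place:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1' data.txt data.txt > data.txt.tmp &&
    mv data.txt.tmp data.txt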

Values based comparison in Unix

I have two variables like the ones below.
a=rw,bg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock
b=bg,rg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock
My if condition is failing because it looks for the rw value from variable a at position 1 in variable b, but the fields of b are in a different order.
How can I compare the two lines even though the order of the fields is not the same?
This script seems to work:
a="rw,bg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock"
b="bg,rg,hard,timeo=600,wsize=1048576,rsize=1048576,nfsvers=3,tcp,actimeo=0,noacl,lock"
{ echo "$a"; echo "$b"; } |
awk -F, \
   'NR == 1 { for (i = 1; i <= NF; i++) a[$i] = 1 }
    NR == 2 {
        for (i = 1; i <= NF; i++)
        {
            if ($i in a)
                delete a[$i]
            else
            {
                mismatch++
                print "Unmatched item (row 2):", $i
            }
        }
    }
    END {
        for (i in a)
        {
            print "Unmatched item (row 1):", i
            mismatch++
        }
        if (mismatch > 0)
            print mismatch, "Mismatches"
        else
            print "Rows are the same"
    }'
Example runs:
$ bash pairing.sh
Unmatched item (row 2): rg
Unmatched item (row 1): rw
2 Mismatches
$ sed -i.bak 's/,rg,/,rw,/' pairing.sh
$ bash pairing.sh
Rows are the same
$
There are undoubtedly ways to make the script more compact, but the code is fairly straightforward. If a field appears twice in the second row and only once in the first row, the second occurrence will be reported as an unmatched item. The code doesn't check for duplicates while processing the first row; it's an easy exercise to make it do so. The code doesn't print the input data for verification; it probably should.
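If you only need a yes/no answer rather than the per-item report, a shorter order-insensitive comparison is to split both strings into one field per line, sort them, and compare the results. A minimal sketch (my own, assuming bash for the <<< here-strings):
if [ "$(tr ',' '\n' <<< "$a" | sort)" = "$(tr ',' '\n' <<< "$b" | sort)" ]; then
    echo "Rows are the same"
else
    echo "Rows differ"
fi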

I have a protein sequence file; I want to count trimers in it using sed or grep

I have a protein sequence file in the following format
uniprotID\space\sequence
sequence is a string of any length but with only 20 allowed letters i.e.
ARNDCQEGHILKMFPSTWYV
Example of 1 record
Q5768D AKCCACAKCCAC
I want to create a csv file in the following format
Q5768D
12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2
This is what I'm currently trying:
#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
    echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g' > TEST_FILE
    declare -a SSA=(`cat TEST_FILE`)
    SQL=$(echo ${#SSA[@]})
    for (( X=0; X <= "$SQL"; X++ ))
    do
        Y=$(expr $X + 1)
        Z=$(expr $X + 2)
        echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
    done | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
    rm TEST_FILE # removing temporary sequence file
    sort TEMPTRIMER | uniq -c > $ID.$SQL
done < $1
In this code I am storing each individual record in a different file, which is not good. Also, the program is very slow: in 12 hours only 12,000 records were processed, out of 0.5 million.
If this is what you want:
$ cat file
Q5768D AKCCACAKCCAC
OTHER FOOBARFOOBAR
$
$ awk -f tst.awk file
Q5768D OTHER
12 12
AKC 2 FOO 2
KCC 2 OOB 2
CCA 2 OBA 2
CAC 2 BAR 2
ACA 1 ARF 1
CAK 1 RFO 1
This will do it:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    colNr = NR
    rowNr = 0
    name[colNr] = $1
    lgth[colNr] = length($2)
    delete name2nr
    for (i=1;i<=(length($2)-2);i++) {
        trimer = substr($2,i,3)
        if ( !(trimer in name2nr) ) {
            name2nr[trimer] = ++rowNr
            nr2name[colNr,rowNr] = trimer
        }
        cnt[colNr,name2nr[trimer]]++
    }
    numCols = colNr
    numRows = (rowNr > numRows ? rowNr : numRows)
}
END {
    for (colNr=1;colNr<=numCols;colNr++) {
        printf "%s%s", name[colNr], (colNr<numCols?OFS:ORS)
    }
    for (colNr=1;colNr<=numCols;colNr++) {
        printf "%s%s", lgth[colNr], (colNr<numCols?OFS:ORS)
    }
    for (rowNr=1;rowNr<=numRows;rowNr++) {
        for (colNr=1;colNr<=numCols;colNr++) {
            printf "%s %s%s", nr2name[colNr,rowNr], cnt[colNr,rowNr], (colNr<numCols?OFS:ORS)
        }
    }
}
If instead you want output like in @rogerovo's perl answer below, that'd be much simpler than the above, more efficient, and would use far less memory:
$ cat tst2.awk
{
    delete cnt
    for (i=1;i<=(length($2)-2);i++) {
        cnt[substr($2,i,3)]++
    }
    printf "%s;%s", $1, length($2)
    for (trimer in cnt) {
        printf ";%s=%s", trimer, cnt[trimer]
    }
    print ""
}
$ awk -f tst2.awk file
Q5768D;12;ACA=1;KCC=2;CAK=1;CAC=2;CCA=2;AKC=2
OTHER;12;RFO=1;FOO=2;OBA=2;OOB=2;ARF=1;BAR=2
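Neither script above prints the exact layout the question asked for (the ID on one line, the sequence length on the next, then one trimer and its count per line). If that layout is required, a minimal adaptation of tst2.awk (my own sketch, with a hypothetical file name) could look like this:
$ cat trimers_per_record.awk
{
    delete cnt
    for (i=1; i<=(length($2)-2); i++)
        cnt[substr($2,i,3)]++
    print $1
    print length($2)
    for (trimer in cnt)        # "in" order is unspecified; pipe the output through sort if needed
        print trimer, cnt[trimer]
}
$ awk -f trimers_per_record.awk file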
This perl script processes circa 550,000 trimers/sec (random valid test sequences 0-8000 chars long; 100k records (~400MB) produce a 2GB output csv).
output:
Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
code:
#!/usr/bin/perl
use strict;
$|=1;
my $c;
# process each line on input
while (readline STDIN) {
    $c++; chomp;
    # is it a valid line? has the format and a sequence to process
    if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
        print join ";",($1,length($2));
        my %trimdb;
        my $seq=$2;
        # split the sequence into chars
        my @a=split //,$seq;
        my @trimmer;
        # while there are unprocessed chars in the sequence...
        while (scalar @a) {
            # fill up the buffer with a char from the top of the sequence
            push @trimmer, shift @a;
            # if the buffer is full (has 3 chars), increase the trimer frequency
            if (scalar @trimmer == 3 ) {
                $trimdb{(join "",@trimmer)}++;
                # drop the first letter from buffer, for next loop
                shift @trimmer;
            }
        }
        # we're done with the sequence - print the sorted list of trimers
        foreach (sort keys %trimdb) {
            # print in a csv (;) line
            print ";$_=$trimdb{$_}";
        }
        print "\n";
    }
    else {
        # the input line was not valid.
        print STDERR "input error: $_\n";
    }
    # just a progress counter
    printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";
If you have perl installed (most Linux systems do; check the path /usr/bin/perl or replace it with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv

{awk} How to read a line and compare a field with its next/previous line?

The command below is used to read an input file containing 7682 lines:
I use --field-separator, then convert some fields into what I need, and the grep gets rid of the first 2 lines I do not need.
awk --field-separator=";" '($1<15) {print int(a=(($1-1)/480)+1) " " ($1-((int(a)-1)*480)) " " (20*log($6)/log(10))}' 218_DW.txt | grep -v "0 480 -inf"
I used ($1<15) so that I only print 14 lines, which is better for testing. The output I get is exactly what I want, but there is more I need to do with it:
1 1 48.2872
1 2 48.3021
1 3 48.1691
1 4 48.1502
1 5 48.1564
1 6 48.1237
1 7 48.1048
1 8 48.015
1 9 48.0646
1 10 47.9472
1 11 47.8469
1 12 47.8212
1 13 47.8616
1 14 47.8047
From the above, $1 will increment from 1-16 and $2 from 1-480; the numbering is always continuous, so when $2 reaches 480, $1 increments and $2 restarts at 1, until the last line, which is 16 480 10.2156.
So I get 16*480 = 7680 lines.
What I want to do is simple, but I don't get it :)
I want to compare the current line with the next one, but not all fields, only $3; it's a value in dB that decreases when $2 increases.
In example:
The current line is 1 1 48.2872=a
Next line is 1 2 48.3021=b
If [ (a - b) > 6 ] then print $1 $2 $3
Of course (a - b) has to be taken as an absolute value, always positive.
The best would be to be able to compare the current line (its $3 only) with the $3 of both the next and the previous line.
Something like this:
1 3 48.1691=a
1 4 48.1502=b
1 5 48.1564=c
If [ ABS(b - a) > 6 ] OR If [ ABS(b - c) > 6 ] then print $1 $2 $3
But of course the first line can only be compared with the next one, and the last line with the previous one. Is this possible?
Try this:
#!/usr/bin/awk -f
function abs(x) {
    if (x >= 0)
        return x;
    else
        return -1 * x;
}
function compare(a,b) {
    return abs(a - b) > 6;
}
function update() {
    before_value = current_value;
    current_line = $0;
    current_value = $3;
}
BEGIN {
    line_n = 1;
}
# Edit: added to skip blank lines and differently formatted lines in
# general. You could add some error message and/or exit function
# here to detect badly formatted data.
NF != 3 {
    next;
}
line_n == 1 {
    update();
    line_n += 1;
    next;
}
line_n == 2 {
    if (compare(current_value, $3))
        print current_line;
    update();
    line_n += 1;
    next;
}
{
    if (compare(current_value, before_value) && compare(current_value, $3))
        print current_line;
    update();
}
END {
    if (compare(current_value, before_value)) {
        print current_line;
    }
}
The funny thing is that I had this code lying around from an old project where I had to do basically the same thing. I adapted it a little for you. I think it solves your problem (as I understood it, at least). If it doesn't, it should point you in the right direction.
Instructions to run the awk script:
Supposing you saved the code with the name "awkscript", the data file is named "datafile" and they are both in the current folder, you should first mark the script as executable with chmod +x awkscript and then execute it passing the data file as parameter with ./awkscript datafile or use it as part of a sequence of pipes as in cat datafile | ./awkscript.
Comparing the current line to the previous one is trivial, so I think the problem you're having is that you can't figure out how to compare the current line to the next one. Just keep 2 previous lines instead of 1 and always operate on the line before the one that's actually being read as $0, i.e. the line stored in the array p1 in this example (p2 is the line before it and $0 is the line after it):
function abs(val) { return (val > 0 ? val : -val) }
NR==2 {
    if ( abs(p1[3] - $3) > 6 ) {
        print p1[1], p1[2], p1[3]
    }
}
NR>2 {
    if ( ( abs(p1[3] - p2[3]) > 6 ) || ( abs(p1[3] - $3) > 6 ) ) {
        print p1[1], p1[2], p1[3]
    }
}
{ prev2=prev1; prev1=$0; split(prev2,p2); split(prev1,p1) }
END {
    if ( ( abs(p1[3] - p2[3]) > 6 ) ) {
        print p1[1], p1[2], p1[3]
    }
}
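To tie this back to the command in the question, you could save the snippet above as, say, neighbors.awk (a file name of my own choosing) and feed it the transformed output; a sketch, assuming the original pipeline from the question:
awk --field-separator=";" '($1<15) {print int(a=(($1-1)/480)+1) " " ($1-((int(a)-1)*480)) " " (20*log($6)/log(10))}' 218_DW.txt |
    grep -v "0 480 -inf" |
    awk -f neighbors.awk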

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
    split($2, valueArray, ",");
    j=0;
    for (i in valueArray)
    {
        if (!( valueArray[i] in duplicateArray))
        {
            duplicateArray[j] = valueArray[i];
            j++;
        }
    };
    printf $1 "\t";
    for (j in duplicateArray)
    {
        if (duplicateArray[j]) {
            printf duplicateArray[j] ",";
        }
    }
    printf "\t";
    print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script (as originally posted) acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
    split($2, valueArray, ",");
    j=0;
    for (i in valueArray)
    {
        if (!(valueArray[i] in duplicateArray))
        {
            duplicateArray[valueArray[i]] = 1
        }
    };
    printf $1 "\t";
    for (j in duplicateArray)
    {
        if (j)    # prevents printing an extra comma
        {
            printf j ",";
        }
    }
    printf "\t";
    print $3
    delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
    $F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
    print join "\t", @F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
Line 4 in the awk script is what preserves the original order of the values in the second field while filtering out the duplicate values.
Sorry, I know you asked about awk... but Perl makes this much simpler:
$ perl -n -e ' @t = split(/\t/);
    %t2 = map { $_ => 1 } split(/,/,$t[1]);
    $t[1] = join(",",keys %t2);
    print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part     # parts of a line
declare -a part2    # parts of the 2nd column
declare -A check    # used to remember items in part2
while read line ; do
    part=( $line )                        # split line using whitespace
    IFS=','                               # separator is comma
    part2=( ${part[1]} )                  # split 2nd column using comma
    if [ ${#part2[@]} -gt 1 ] ; then      # more than 1 field in 2nd column?
        check=()                          # empty check array
        new2=''                           # empty new 2nd column
        for item in ${part2[@]} ; do
            (( check[$item]++ ))          # remember items in 2nd column
            if [ ${check[$item]} -eq 1 ] ; then   # not yet seen?
                new2=$new2,$item          # add to new 2nd column
            fi
        done
        part[1]=${new2#,}                 # remove leading comma
    fi
    IFS=$'\t'                             # separator for the output
    echo "${part[*]}"                     # rebuild line
done < "$infile"
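Note that the loop reads its input from "$infile", which is never set in the snippet; to try it on the question's data you would first set it, for example:
infile=knownGeneFromUCSC.txt   # set before the while loop; the name is the sample file from the question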
