shell script: count lines and sum numbers from multiple files - bash

I want to count the number of lines in a .CSV file and sum the last numeric column. I need to run this from a crontab, so it has to be a script.
This code counts the number of lines:
egrep -i String file_name_201611* | \
egrep -i "cdr,20161115" | \
awk -F"," '{print $4}' | sort | uniq | wc -l
This code sums the last column:
egrep -i String file_name_201611* | \
egrep -i ".cdr,20161115"| \
awk -F"," '{print $8}' | paste -s -d"+" | bc
Lines look like:
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400594.dat,processed_cdr_20161117100941_00627727.cdr,20161117095940,20161117,18,46521
The expected output:
CGSCO05,sum_#_lines, Sum_$8
CGSCO05, 225, 1500

This should work...
#!/usr/bin/awk -f
BEGIN {
    k = 0
    FS = ","
}
{
    if ($2 in counter) {
        counter[$2] = counter[$2] + 1
        sum_8[$2] = sum_8[$2] + $8
    } else {
        k = k + 1
        counter[$2] = 1
        sum_8[$2] = $8
        name[k] = $2
    }
}
END {
    for (i = 1; i <= k; i++)
        printf "%s, %i, %i\n", name[i], counter[name[i]], sum_8[name[i]]
}
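Since the goal is to run this from cron, one possible way to wire it up (a sketch; the script name, paths, and schedule below are assumptions, not part of the question):
# save the script above as, say, /usr/local/bin/sum_by_group.awk and make it executable
chmod +x /usr/local/bin/sum_by_group.awk
# example crontab entry: run daily at 06:00 against that month's files
# (remember that % must be escaped as \% inside a crontab line)
0 6 * * * /usr/local/bin/sum_by_group.awk /path/to/file_name_$(date +\%Y\%m)* >> /path/to/report.csv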

Sort the CSV file by field #2, keeping only lines with unique entries, then print the number of unique lines and the sum of column #8 from those unique lines:
sort -t, -k2 -u foo.csv | datamash -t, count 2 sum 8
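GNU datamash can also group by field 2 directly, which is closer to the per-group line in the expected output; a sketch, assuming datamash's countunique operation is available and reusing the filters from the question:
egrep -i String file_name_201611* | egrep -i "cdr,20161115" |
datamash -s -t, -g 2 countunique 4 sum 8
Here -s sorts the input, -g 2 groups on the second field, countunique 4 counts the distinct values of field 4 per group (like the sort | uniq | wc -l step), and sum 8 totals the last column.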

Related

awk: division by zero input record number 1, file source line number 1

I'm trying to get the signed log10-transformed t-test P-value by using the sign of the log2FoldChange multiplied by the inverse of the pvalue:
cat test.xlx | sort -k7g \
| cut -d '_' -f2- \
| awk '!arr[$1]++' \
| awk '{OFS="\t"}
{ if ($6>0) printf "%s\t%4.3e\n", $1, 1/$7; else printf "%s\t%4.3e\n", $1, -1/$7 }' \
| sort -k2gr > result.txt
test.xlx =
ID baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000037692-Ahdc1 2277.002091 1.742481553 0.170388822 10.22650154 1.51e-24 2.13e-20
ENSMUSG00000035561-Aldh1b1 768.4504879 -2.325533089 0.248837002 -9.345608047 9.14e-21 6.45e-17
ENSMUSG00000038932-Tcfl5 556.1693605 -3.742422892 0.402475728 -9.298505809 1.42e-20 6.71e-17
ENSMUSG00000057182-Scn3a 1363.915962 1.621456045 0.175281852 9.250564289 2.23e-20 7.89e-17
ENSMUSG00000038552-Fndc4 378.821132 2.544026087 0.288831276 8.808000721 1.27e-18 3.6e-15
but I'm getting this error:
awk: division by zero
 input record number 1, file
 source line number 1
As @jas points out in a comment, you need to skip your header line, but your script could stand some more cleanup than that. Try this:
sort -k7g test.xlx |
awk '
BEGIN { OFS="\t" }
{ sub(/^[^_]+_/,"") }
($6~/[0-9]/) && (!seen[$1]++) { printf "%s\t%4.3e\n", $1, ($7?($6>0?1:-1)/$7:0) }
' |
sort -k2gr
ENSMUSG00000035561-Aldh1b1 1.550e+16
ENSMUSG00000037692-Ahdc1 4.695e+19
ENSMUSG00000038552-Fndc4 2.778e+14
ENSMUSG00000038932-Tcfl5 1.490e+16
ENSMUSG00000057182-Scn3a 1.267e+16
The above will print a result of zero instead of failing when $7 is zero.
What's the point of the cut -d '_' -f2- in your original script, though (implemented above with sub())? You don't have any _s in your input file.

Sorting dates by groups

Here is a sample of my data with 4 columns and comma delimiter.
1,A,2009-01-01,2009-07-15
1,A,2009-07-10,2009-07-12
2,B,2009-01-01,2009-07-15
2,B,2009-07-10,2010-12-15
3,C,2009-01-01,2009-07-15
3,C,2009-07-15,2010-12-15
3,C,2010-12-15,2014-07-07
4,D,2009-06-01,2009-07-15
4,D,2009-07-21,2012-12-15
5,E,2011-04-23,2012-10-19
The first 2 columns are grouped. I want the minimum date from the third column, and the maximum date from the fourth column, for each group.
Then I will pick the first line for each first 2 column combination.
Desired output
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
I have tried the following code, but it's not working. I get close, but not the max date.
cat exam |sort -t, -nk1 -k2,3 -k4,4r |sort -t, -uk1,2
Would prefer an easy one-liner like above.
sort datafile |
awk -F, -v OFS=, '
{key = $1 FS $2}
key != prev {prev = key; min[key] = $3}
{max[key] = ($4 > max[key]) ? $4 : max[key]}
END {for (key in min) print key, min[key], max[key]}
' |
sort
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
When you pre-sort, you are guaranteed that the minimum col3 date will occur on the first line of a new group. Then you just need to find the maximum col4 date.
The final sort is required because iterating over the keys of an awk hash is unordered. You can do this sorting in (g)awk with:
END {
n = asorti(min, sortedkeys)
for (i=1; i<=n; i++)
print sortedkeys[i], min[sortedkeys[i]], max[sortedkeys[i]]
}
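Alternatively, GNU awk can be told to traverse the array in key order, so the plain for-in loop in the END block comes out sorted (gawk-only; a sketch):
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: iterate keys in ascending string order
    for (key in min)
        print key, min[key], max[key]
}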
#!/usr/bin/awk -f
BEGIN { FS = OFS = "," }
{
    sub(/[[:blank:]]*<br>$/, "")
    key = $1 FS $2
    if (!(key in a)) {
        a[key] = $3
        b[key] = $4
        keys[++k] = key
    } else {
        if ($3 < a[key])
            a[key] = $3
        if ($4 > b[key])
            b[key] = $4
    }
}
END {
    for (i = 1; i <= k; ++i) {
        key = keys[i]
        print key, a[key], b[key] " <br>"
    }
}
Usage:
awk -f script.awk file
Output:
1,A,2009-01-01,2009-07-15 <br>
2,B,2009-01-01,2010-12-15 <br>
3,C,2009-01-01,2014-07-07 <br>
4,D,2009-06-01,2012-12-15 <br>
5,E,2011-04-23,2012-10-19 <br>
Of course you can add print statements before and after the loop to print the other two <br>'s:
END {
    print "<br>"
    for (i = 1; i <= k; ++i) {
        key = keys[i]
        print key, a[key], b[key] " <br>"
    }
    print "<br>"
}
You want a "one-liner"?
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
uniq -w4
The key idea is to sort the data once by field 3 ascending, and independently by field 4 descending, then merge the corresponding lines (cut and paste). Finally, uniq keeps only the first row for each pair of identical first two columns. That uniq call is the weak point here, as -w4 assumes at most 4 characters for the comparison; you either have to adjust that to your needs, or normalize those two columns to a fixed width when using your actual data.
EDIT: A probably better option is to replace uniq by a simple awk filter:
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
awk -F , '$1","$2 != last { print; last=$1","$2 }'
On my system (GNU Linux Debian Wheezy), both produce the same result:
1,A,2009-01-01,2009-07-15<br>
2,B,2009-01-01,2010-12-15<br>
3,C,2009-01-01,2014-07-07<br>
4,D,2009-06-01,2012-12-15 <br>
5,E,2011-04-23,2012-10-19<br>

Switching the format of this output?

I have this script written to print the distribution of words in one or more files:
cat "$#" | tr -cs '[:alpha:]' '\n' |
tr '[:upper:]' '[:lower:]' | sort |
uniq -c | sort -n
Which gives me an output such as:
1 the
4 orange
17 cat
However, I would like to change it so that the word is listed first, not the number (I'm assuming sort would be involved so it's alphabetical), like so:
cat 17
orange 4
the 1
Is there just a simple option I would need to switch this? Or is it something more complicated?
Pipe the output to
awk '{print $2, $1}'
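In context, that just means appending one stage to the existing pipeline; a sketch that also flips to count-descending order, as in the desired output:
cat "$@" | tr -cs '[:alpha:]' '\n' |
tr '[:upper:]' '[:lower:]' | sort |
uniq -c | sort -rn |
awk '{print $2, $1}'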
or you can use awk for the complete task:
{
    $0 = tolower($0)   # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
usage:
awk -f wordfreq.awk input
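Note that the for-in loop in the END block prints words in no particular order; to get a sorted listing, pipe the result through sort, for example:
awk -f wordfreq.awk input | sort            # alphabetical by word
awk -f wordfreq.awk input | sort -k2,2nr    # highest count first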

Cut | Sort | Uniq -d -c | but?

The given file is in the below format.
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
I need to take out duplicates and count them (each duplicate categorized by fields 1, 2, 5, and 14). Then I insert into the database the entire fields of the first duplicate occurrence, tagging the count of dups in another column. For this I need to cut the 4 mentioned fields, sort, and find the dups using uniq -d (with -c for the counts). Now, coming back after sorting out the dups and their counts, I need the output to be in the form below.
3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
Where 3 is the number of repeated dups for fields 1, 2, 5, and 14, and the rest of the fields can come from any of the dup rows.
This way the dups are removed from the original file and shown in the above format, while the remaining lines in the original file are unique ones and go in as they are...
What I have done is..
awk '{printf("%5d,%s\n", NR,$0)}' renewstatus_2012-04-19.txt > n_renewstatus_2012-04-19.txt
cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c
but this needs to point back to the original file again to get the lines for the dup occurrences...
Let me not confuse things: this needs a different point of view, and my brain is clinging to my own approach. Need a cigar.
Any thoughts?
sort has an option -k
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
uniq has an option -f
-f, --skip-fields=N
avoid comparing the first N fields
so use sort and uniq with the field numbers (work out NUM and test this command yourself, please):
awk -F"," '{print $0,$1,$2,...}' file.txt | sort -k NUM,NUM2 | uniq -f NUM3 -c
Using awk's associative arrays is a handy way to find unique/duplicate rows:
awk '
BEGIN { FS = OFS = "," }
{
    key = $1 FS $2 FS $5 FS $14
    if (key in count)
        count[key]++
    else {
        count[key] = 1
        line[key] = $0
    }
}
END { for (key in count) print count[key], line[key] }
' filename
SYNTAX:
awk -F, '
!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq) { uniq[$1,$2,$5,$14] = $0 }
{ count[$1,$2,$5,$14]++ }
END {
    for (i in count) {
        if (count[i] > 1) file = "dupes"; else file = "uniq"
        print uniq[i], "," count[i] > file
    }
}' renewstatus_2012-04-19.txt
Calculation:
sym#localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> SUM OF UNIQ (1) ENTRIES
sym#localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym#localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
10614 ------> SUM OF DUPLICATE ENTRIES MULTIPLIED BY THEIR COUNTS
sym#localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNT OF THE ORIGINAL FILE, WHICH MATCHES (124275 + 10614) = 134889 EXACTLY

How to sort the rows of a CSV file by the ratio of two columns?

I have a CSV file like this:
bear,1,2
fish,3,4
cats,1,5
mice,3,3
I want to sort it, from highest to lowest, by the ratio of columns 2 and 3. E.g.:
bear,1,2 # 1/2 = 0.5
fish,3,4 # 3/4 = 0.75
cats,1,5 # 1/5 = 0.2
mice,3,3 # 3/3 = 1
This would be sorted like this:
mice,3,3
fish,3,4
bear,1,2
cats,1,5
How can I sort the lines from highest to lowest by the ratio of the two numbers in columns 2 and 3?
awk 'BEGIN { FS = OFS = ","} {$4 = $2/$3; print}' inputfile | sort -k4,4nr -t, | sed 's/,[^,]*$//'
or using GNU AWK (gawk):
awk -F, '{a[$3/$2] = $3/$2; b[$3/$2] = $0} END {c = asort(a); for (i = 1; i <= c; i++) print b[a[i]]}' inputfile
The methods above are better than the following, but this is more efficient than another answer which uses Bash and various utilities:
while IFS=, read animal dividend divisor
do
    quotient=$(echo "scale=4; $dividend/$divisor" | bc)
    echo "$animal,$dividend,$divisor,$quotient"
done < inputfile | sort -k4,4nr -t, | sed 's/,[^,]*$//'
As a one-liner:
while IFS=, read animal dividend divisor; do quotient=$(echo "scale=4; $dividend/$divisor" | bc); echo "$animal,$dividend,$divisor,$quotient"; done < inputfile | sort -k4,4nr -t, | sed 's/,[^,]*$//'
Why not just create another column that holds the ratio of the second and third columns and then sort on that column?
bash is not meant for stuff like that - pick your own favorite programming language, and do it there.
If you insist... here is an example:
a=( `cut -d "," -f 2 mat.csv` ); b=( `cut -d "," -f 3 mat.csv` );for i in {0..3};do (echo -n `head -n $((i+1)) mat.csv|tail -1`" "; echo "scale=4;${a[i]}/${b[i]}"|bc) ;done|sort -k 2 -r
Modify filename and length.
