Using columns 4 and 2, I create a report like the output shown below. My code works fine, but I believe it can be made shorter :).
I have a doubt about the split part:
CNTLM = split ("20,30,40,60", LMT)
It works, but it would be better to use exactly the values "10,20,30,40" that appear in column 4.
4052538693,2910,04-May-2018-22,10
4052538705,2910,04-May-2018-22,10
4052538717,2910,04-May-2018-22,10
4052538729,2911,04-May-2018-22,20
4052538741,2911,04-May-2018-22,20
4052538753,2912,04-May-2018-22,20
4052538765,2912,04-May-2018-22,20
4052538777,2914,04-May-2018-22,10
4052538789,2914,04-May-2018-22,10
4052538801,2914,04-May-2018-22,30
4052539029,2914,04-May-2018-22,20
4052539041,2914,04-May-2018-22,20
4052539509,2915,04-May-2018-22,30
4052539521,2915,04-May-2018-22,30
4052539665,2915,04-May-2018-22,30
4052539677,2915,04-May-2018-22,10
4052539689,2915,04-May-2018-22,10
4052539701,2916,04-May-2018-22,40
4052539713,2916,04-May-2018-22,40
4052539725,2916,04-May-2018-22,40
4052539737,2916,04-May-2018-22,40
4052539749,2916,04-May-2018-22,40
4052539761,2917,04-May-2018-22,10
4052539773,2917,04-May-2018-22,10
Here is the code I use to get the desired output.
printf " Code 10 20 30 40 Total\n" > header
dd=`cat header | wc -L`
awk -F"," '
BEGIN {CNTLM = split ("20,30,40,60", LMT)
cmdsort = "sort -nr"
DASHES = sprintf ("%0*d", '$dd', _)
gsub (/0/, "-", DASHES)
}
{for (IX=1; IX<=CNTLM; IX++) if ($4 <= LMT[IX]) break
CNT[$2,IX]++
COLTOT[IX]++
LNC[$2]++
TOT++
}
END {
print DASHES
for (l in LNC)
{printf "%5d", l | cmdsort
for (IX=1; IX<=CNTLM; IX++) {printf "%9d", CNT[l,IX]+0 | cmdsort
}
printf " = %6d" RS, LNC[l] | cmdsort
}
close (cmdsort)
print DASHES
printf "Total"
for (IX=1; IX<=CNTLM; IX++) printf "%9d", COLTOT[IX]+0
printf " = %6d" RS, TOT
print DASHES
printf "PCT "
for (IX=1; IX<=CNTLM; IX++) printf "%9.1f", COLTOT[IX]/TOT*100
printf RS
print DASHES
}
' file
The output I get:
 Code       10       20       30       40    Total
--------------------------------------------------
 2917        2        0        0        0 =      2
 2916        0        0        0        5 =      5
 2915        2        0        3        0 =      5
 2914        2        2        1        0 =      5
 2912        0        2        0        0 =      2
 2911        0        2        0        0 =      2
 2910        3        0        0        0 =      3
--------------------------------------------------
Total        9        6        4        5 =     24
--------------------------------------------------
PCT       37.5     25.0     16.7     20.8
--------------------------------------------------
I would appreciate any improvements to the code.
Without the header and cosmetics...
$ awk -F, '{a[$2,$4]++; k1[$2]; k2[$4]}
     END{for(r in k1)
            {printf "%5s", r
             for(c in k2) {k1[r]+=a[r,c]; k2[c]+=a[r,c]; printf "%10d", a[r,c]+0}
             printf " =%7d\n", k1[r]}
         printf "%5s", "Total"
         for(c in k2) {sum+=k2[c]; printf "%10d", k2[c]}
         printf " =%7d\n", sum}' file | sort -nr
 2917         2         0         0         0 =      2
 2916         0         0         0         5 =      5
 2915         2         0         3         0 =      5
 2914         2         2         1         0 =      5
 2912         0         2         0         0 =      2
 2911         0         2         0         0 =      2
 2910         3         0         0         0 =      3
Total         9         6         4         5 =     24
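This also settles the split doubt from the question: indexing the counts as a[$2,$4] uses the exact column-4 values (10, 20, 30, 40) as keys, so no limit table is needed. One caveat: for (c in k2) visits the columns in an unspecified order in POSIX awk. If a fixed column order matters, here is a minimal sketch, assuming the four values are known up front:
awk -F, '
BEGIN { n = split("10,20,30,40", col, ",") }      # fixed, known column values (assumed)
{ cnt[$2, $4]++; rows[$2] }                       # count per (code, value) pair
END {
    for (r in rows) {
        total = 0
        printf "%5s", r
        for (i = 1; i <= n; i++) {                # columns always in list order
            total += cnt[r, col[i]]
            printf "%10d", cnt[r, col[i]] + 0
        }
        printf " =%7d\n", total
    }
}' file | sort -nr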
I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which will accept a column name as an argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0
NOTE: Some notes
......
Skipped ......
The expected usage: script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" "${logfile}" | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' "${logfile}" | \
cut -f"${field}" | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
- assign the command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively
- count the words from NAME through ${attribute} in the header line with grep -o and wc -w; that word count is the field index, assigned to field
- with sed:
  - suppress automatic printing (-n) and enable extended regular expressions (-r)
  - select the lines between the NAME and NOTE lines, inclusive
  - delete the NAME and NOTE lines themselves
  - translate each contiguous run of whitespace to a single tab and print the result
- cut out the column at the field index
- paste all the numbers into one infix summation
- evaluate the summation with bc
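For example, with the sample log and Attr3, the grep -o match is the header text from NAME through Attr3 (NAME Attr1 Attr2 Attr3), so wc -w counts 4 words and Attr3 is the 4th whitespace-separated column:
field=$(grep -o "NAME.\+Attr3" mylog.log | wc -w)   # -> 4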
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $(CountCol) } END { print S + 0 }' YourFile
With a column name:
awk -v ColName='Attr1' '
/^[[:blank:]]/ && NF == 6 { for (i=1; i<=NF; i++) if ($i == ColName) CountCol = i }
/^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) }
END { print S + 0 }
' YourFile
You should add a header/trailer filter to skip the noisy lines (a flag would suit this perfectly), but since there is no information about the surrounding structure to set such a flag, I use a simple field count instead (assuming text fields evaluate to 0, so they do not change the sum when included).
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182
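The if (!NF) test relies on a blank line after the table to trigger the print and exit. If the log might not contain one (only a fragment is shown above, so this is an assumption), a hedged variant sums only numeric cells and prints in END instead:
awk -v col='Attr3' '
/NAME/ { for (i=1; i<=NF; i++) f[$i] = i; next }            # map header names to field numbers
(col in f) && $(f[col]) ~ /^[0-9]+$/ { sum += $(f[col]) }   # count only numeric cells
END { print sum + 0 }
' file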
I have output in bash that I would like to format. Right now my output looks like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
abac.com 1
abbotsleigh.nsw.edu.au 1
abc.mre.gov.br 1
ableland.freeserve.co.uk 1
academicplanet.com 1
access-k12.org 1
acconnect.com 1
acconnect.com 1
accountingbureau.co.uk 1
acm.org 1
acsalaska.net 1
adam.com.au 1
ada.state.oh.us 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
aecom.yu.edu 1
aecon.com 1
aetna.com 1
agedwards.com 1
ahml.info 1
The problem with this is that none of the numbers on the right line up. I would like them to look like this:
1scom.net               1
1stservicemortgage.com  1
263.net                 1
263.sina.com            1
2sahm.org               1
Would there be any way to make them look like this without knowing exactly how long the longest domain is? Any help would be greatly appreciated!
The code that produced this is:
grep -E -o -r "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" $ARCHIVE | sed 's/.*@//' | uniq -ci | sort | sed 's/^ *//g' | awk '{ t = $1; $1 = $2; $2 = t; print }' > temp2
ALIGNMENT:
Just use cat with the column command and that's it:
cat /path/to/your/file | column -t
For more details on the column command, refer to http://manpages.ubuntu.com/manpages/natty/man1/column.1.html
EDITED:
View file in terminal:
column -t < /path/to/your/file
(as noted by anishsane)
Export to a file:
column -t < /path/to/your/file > /output/file
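Applied to the pipeline in the question, the final awk stage can simply feed column -t before the redirect (the grep and sed stages are unchanged and elided here):
grep ... | sed ... | awk '{ t = $1; $1 = $2; $2 = t; print }' | column -t > temp2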
I would like to add two additional conditions to my current code: print '+' if, in File2, field 5 is greater than 35 and field 7 is greater than 90.
Code:
while read -r line
do
grep -q "$line" File2.txt && echo "$line +" || echo "$line -"
done < File1.txt
Input file 1:
HAPS_0001
HAPS_0002
HAPS_0005
HAPS_0006
HAPS_0007
HAPS_0008
HAPS_0009
HAPS_0010
Input file 2 (tab-delimited):
Query DEG_ID E-value Score %Identity %Positive %Matching_Len
HAPS_0001 protein:plasmid:149679 3.00E-67 645 45 59 91
HAPS_0002 protein:plasmid:139928 4.00E-99 924 34 50 85
HAPS_0005 protein:plasmid:134646 3.00E-98 915 38 55 91
HAPS_0006 protein:plasmid:111988 1.00E-32 345 33 54 86
HAPS_0007 - - 0 0 0 0
HAPS_0008 - - 0 0 0 0
HAPS_0009 - - 0 0 0 0
HAPS_0010 - - 0 0 0 0
Desired output (tab-delimited):
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
Thanks!
This should work: while reading f2 (NR==FNR), it remembers each $1 whose field 5 exceeds 35 and field 7 exceeds 90, then prints every line of f1 with "+" or "-" depending on membership in that set:
$ awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {if($5>35 && $7>90) a[$1]++; next}
{print (($1 in a) ? $0 FS "+" : $0 FS "-")}' f2 f1
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
join file1.txt <( tail -n +2 file2.txt) | awk '
$2 = ($5 > 35 && $7 > 90)?"+":"-" { print $1, $2 }'
You don't care about the second field of the joined output, so overwrite it with the appropriate sign; the assignment also acts as an always-true pattern (the sign is a non-empty string), so the action prints $1 and the new $2.
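Since the desired output is tab-delimited, a variation of the same idea with an explicit output separator (not part of the original answer) would be:
join file1.txt <( tail -n +2 file2.txt) | awk -v OFS='\t' '
{ print $1, ($5 > 35 && $7 > 90) ? "+" : "-" }'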
I have a file like this:
1814 1
2076 2
2076 1
3958 1
2076 2
2498 3
2858 2
2858 1
1818 2
1814 1
2423 1
3588 12
2026 2
2076 1
1814 1
3576 1
2005 2
1814 1
2107 1
2810 1
I would like to generate a report like this:
1814 3
2076 6
3958 1
2858 3
Basically, calculate the total for each unique value in column 1.
Using awk:
awk '{s[$1] += $2} END{ for (x in s) print x, s[x] }' input
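Note that for (x in s) visits the keys in an unspecified order; if sorted output is wanted (as in the Bash answer below), pipe through sort:
awk '{s[$1] += $2} END{ for (x in s) print x, s[x] }' input | sort -n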
Pure Bash:
declare -a sum                      # indexed array: key -> running total
while read -r key val ; do
    ((sum[key]+=val))
done < "$infile"                    # $infile holds the path to the input file
for key in "${!sum[@]}"; do         # indexed arrays iterate in ascending numeric order
    printf "%4d %4d\n" "$key" "${sum[$key]}"
done
The output is sorted:
1814    4
1818    2
2005    2
2026    2
2076    6
2107    1
2423    1
2498    3
2810    1
2858    3
3576    1
3588   12
3958    1
Perl solution (the }{ trick places the print outside the implicit -n loop, so it runs once at the end, like awk's END):
perl -lane '$s{$F[0]} += $F[1] }{ print "$_ $s{$_}" for keys %s' INPUT
Note that the output is different from the one you gave.
Sum totals for each primary key (integers only):
for key in $(cut -d' ' -f1 test.txt | sort -u)
do
    echo $key $(echo $(grep "^$key " test.txt | cut -d' ' -f2 | tr \\n +)0 | bc)
done
Simply sum a column of integers:
echo $(cut -d' ' -f2 test.txt | tr \\n +)0 | bc
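The equivalent single-pass awk, for comparison:
awk '{ s += $2 } END { print s + 0 }' test.txt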