Calculate numbers in a file with a bash script

I have a file with the following content: e.g. 2 images, each followed by FRAG lines that carry a size. I want to calculate the total fragment size per image in a bash script.
IMAGE admindb1 8 admindb1_1514997916 bus 4 Default-Application-Backup 2 3 1 1517676316 0 0
FRAG 1 1 10784 0 2 6 2 HSBRQ2 fuj 65536 329579 1514995208 60 0 *NULL* 1517676316 0 3 1 *NULL*
IMAGE admindb1 8 admindb1_1514995211 bus 4 Default-Application-Backup 2 3 1 1517673611 0 0
FRAG 1 1 13168256 0 2 6 12 HSBQ8I fuj 65536 173783 1514316708 52 0 *NULL* 1517673611 0 3 1 *NULL*
FRAG 1 2 24288384 0 2 6 1 HSBRJ7 fuj 65536 2 1514995211 65 0 *NULL* 0 0 3 1 *NULL*
FRAG 1 3 24288384 0 2 6 1 HSBRON fuj 65536 2 1514995211 71 0 *NULL* 0 0 3 1 *NULL*
FRAG 1 4 13806752 0 2 6 1 HSBRRK fuj 65536 2 1514995211 49 0 *NULL* 0 0 3 1 *NULL*
Output should be like this:
For Image admindb1_1514997916 total size is 10784
For Image admindb1_1514995211 total size is 75551776
The 4th column of every line beginning with FRAG should be summed per image.
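(For the second image that is 13168256 + 24288384 + 24288384 + 13806752 = 75551776.)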
My script is not working:
#!/bin/bash
file1=/home/turgun/Desktop/IBteck/script-last/frags
imagelist=/home/turgun/Desktop/IBteck/script-last/imagelist
counter=1
for counter in `cat $imagelist`
do
    n=`awk '/'$counter'/{ print NR; exit }' $file1`
    for n in `cat $file1`
    do
        if [[ ! $n = 'IMAGE' ]]; then
            echo "For Image $counter total size is " \
                `grep FRAG frags | awk '{print total+=$4}'`
        fi
    done
done

awk 'function summary() {print "For Image",image,"total size is",sum}
$1=="IMAGE" {image=$4}
$1=="FRAG" {sum+=$4}
$1=="" {summary(); sum=0}
END{summary()}' file
Output:
For Image admindb1_1514997916 total size is 10784
For Image admindb1_1514995211 total size is 75551776
I assume that the last line is not empty.

cat -s file.txt | sed -n '/^IMAGE /{s:^[^ ]* *[^ ]* *[^ ]* *\([^ ]*\).*$:echo -n "For image \1 total size is "; echo `echo ":;s:\*::g;p};/^$/{s:^:0"`|bc:;p};/^FRAG /{s:^[^ ]* *[^ ]* *[^ ]* *\([^ ]*\).*$:\1\+:;s:\*::g;p};' | bash
Output:
For image admindb1_1514997916 total size is 10784
For image admindb1_1514995211 total size is 75551776

Awk solution:
awk '/^IMAGE/{
if (t) { printf "For image %s total size is %d\n",img,t; t=0 } img=$4
}
/^FRAG/{ t+=$4 }
END{ if (t) printf "For image %s total size is %d\n",img,t }' file
The output:
For image admindb1_1514997916 total size is 10784
For image admindb1_1514995211 total size is 75551776

Gnarly combo of cut, GNU sed (with a careless use of evaluate), and datamash:
cut -d' ' -f4 file | datamash --no-strict transpose |
sed 's#\tN/A\t#\n#g;y#\t# #' |
sed 's/ \(.*\)/ $((\1))/;y/ /+/;s/+/ /;
s/\(.*\) \(.*\)/echo For image \1 total size is \2/e'
Output:
For image admindb1_1514997916 total size is 10784
For image admindb1_1514995211 total size is 75551776
Cyrus's answer is just better. This answer shows some limits of sed. Also if the data file is huge, (say millions of numbers to sum), the evaluate used to farm out the addition to the shell would probably exceed its command line length limit.

Related

Multiply all values in txt file in bash

I have a file in which I need to multiply each number by -1. I have tried some commands, but every time the result is only the first column multiplied by -1. Please help!
The file is as follows:
-1 2 3 -4 5 -6
7 -8 9 10 12 0
The expected output would be
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
Commands I have tried are:
awk '{print $0*-1}' file
sed 's/$/ -1*p /' file | bc (syntax error)
sed 's/$/ * -1 /' file | bc (syntax error)
numfmt --from-unit=-1 < file (error: numfmt: invalid unit size: ‘-1’)
With bash and an array:
while read -r -a arr; do
    # append "*-1" to every element; declare -i then evaluates the arithmetic
    declare -ia 'arr_multiplied=( "${arr[@]/%/*-1}" )'
    echo "${arr_multiplied[*]}"
done < file
Output:
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
I got this idea from this Stack Overflow answer by j4x.
One awk approach:
$ awk '{for (i=1;i<=NF;i++) $i=$i*-1} 1' file
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
Using the <var_or_field><op>=<value> construct:
$ awk '{for (i=1;i<=NF;i++) $i*=-1} 1' file
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
Using perl and its autosplit mode:
perl -lane 'print join(" ", map { $_ * -1 } @F)' file
To multiply every number in the file by -1, you can use the following awk command:
awk '{ for (i=1; i<=NF; i++) $i=$i*-1; print }' file
This command reads each line of the file, and for each field (number) in the line, it multiplies it by -1. It then prints the modified line.
The output will be as follows:
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
Alternatively, you can use the following sed command (GNU or BSD sed with -E for extended regular expressions):
sed -E 's/-([0-9])/~\1/g; s/(^| )([0-9])/\1-\2/g; s/~//g; s/(^| )-0( |$)/\10\2/g' file
It first marks the existing minus signs with a placeholder (~), then puts a minus in front of every unmarked number, removes the placeholders, and finally cleans up any -0. The output is the same as above.
For completeness an approach with ruby.
-l Line-ending processing
-a Auto-splitting, provides $F (field, set with -F)
-p Auto-prints $_ (line)
-e Execute code
ruby -lape '$_ = $F.map {|x| x.to_i * -1}.join " "' file
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
Just switching the signs ought to do.
$: cat file
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
$: sed 's/^ */ /; s/  */ /g; s/ -/ +/g; s/ \([0-9]\)/ -\1/g; s/+//g; s/-0/0/g; s/^ *//' file
-1 2 3 -4 5 -6
7 -8 9 10 12 0
If you don't care about leading spaces or signs on your zeros, you can drop some of that. The logic is flexible, too...
$: sed 's/ *-/+/g; s/ / -/g; s/+/ /g;' x
1 2 3 -4 5 -6
7 -8 9 10 12 -0
There are multiple ways we can do this.
I can think of the following 2 ways
cat file | awk '{for (i=1;i<=NF;i++){ $i*=-1} print}'
This will give the output:
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
In this method we overwrite the $i value and print $0
Another way
cat file | awk '{for (i=1;i<=NF;i++){printf("%d ",$i*-1)} printf("\n") }'
Gives the output
1 -2 -3 4 -5 6
-7 8 -9 -10 -12 0
In this method we print the value $i*-1, so we need to use the printf() function.
don't "do math" and actually multiply by -1 -
just use regex to flip the signs, and process thousands or even millions of numbers with 3 calls to gsub()
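That answer only sketches the idea in words; here is a minimal illustration of how the gsub() trick could look (my own sketch, not the original author's code; it assumes the numbers are single-space separated as in the sample and that the input contains no tabs, since a tab is used as a temporary marker):
awk '{
    $0 = " " $0         # make sure every number is preceded by a space
    gsub(/ -/, "\t")    # gsub 1: mark negative numbers with a tab
    gsub(/ /, " -")     # gsub 2: put a minus in front of everything else
    gsub(/\t/, " ")     # gsub 3: marked numbers come back as plain (positive) numbers
    sub(/^ /, ""); sub(/ -0$/, " 0")    # cosmetic cleanup for this sample
    print
}' file
For the question's file this prints the expected output shown above.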

Count overlapping occurrences of a substring *in a very large file* using Bash

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.
Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?
For example,
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666
Find 6 overlapping substrings aaa in the string
line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
You don't want to see the strings, you want to count them.
When you try
# wrong
grep -o -F "aaa" <<< "${line}" | wc -l
you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how do we handle ${line}?
Start with
grep -Eo "a{3,}" <<< "${line}"
Result
aaa
aaaa
aaaaa
How many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa.
Compare the total count of characters with the number of lines (wc):
match   lines   chars   add_to_total
aaa     1       4       1
aaaa    1       5       2
aaaaa   1       6       3
For each line, subtract 3 from the total count of characters for that line.
When the result has 3 lines and 15 characters, calculate
15 characters - (3 lines * 3 characters) = 15 - 9 = 6
In code:
read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
Or for a file
read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
aaa was "easy", how about other searchstrings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are
abcdabc
abcdabcdabc
abcdabcdabcdabc
Use testline
line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"
you need "abc(dabc)+" and have
match            lines   chars   add_to_total
abcdabcdabcdabc  1       16      3
abcdabcdabcdabc  1       16      3
abcdabc          1       8       1
abcdabcdabc      1       12      2
For each line, subtract 4 from the total count of characters for that line and divide the answer by 4. Or, over all lines at once: (characters / 4) - nr_lines. When the result has 4 lines and 52 characters, calculate
(52 characters / the fixed 4) - 4 lines = 13 - 4 = 9
In code:
read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"
When you have a large file, you might want to split it first.
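For that splitting idea, a rough sketch (my own addition, not part of the original answer; it relies on split(1) cutting on line boundaries, so no match is ever broken in half, and the chunk size and chunk_ prefix are arbitrary):
split -l 1000000 "${file}" chunk_     # 1M-line pieces: chunk_aa, chunk_ab, ...
total=0
for f in chunk_*; do
    read -r lines chars < <(grep -Eo "a{3,}" "$f" | wc -lc)
    total=$(( total + chars - 3 * lines ))
done
echo "Substring count: ${total}"
rm -f chunk_*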
I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):
Use the summation method :
# WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array
${input……} | WHINY_USERS=1 {m,g}awk '
BEGIN {
1 FS = "[^a]+(aa?[^a]+)*"
1 OFS = "|"
1 PROCINFO["sorted_in"] = "#ind_str_asc"
} {
2 _ = ""
2 OFS = "|"
2 gsub("^[|]*|[|]*$",_, $!(NF=NF))
2 split(_,__)
split($-_,___,"[|]+")
12 for (_ in ___) {
12 __[___[_]]++
}
2 _____=____=_<_
2 OFS = "\t"
2 print " -- line # "(NR)
7 for (_ in __) {
7 print sprintf(" %20s",_), __[_], \
______=__[_] * (length(_)-2),\
"| "(____+=__[_]), _____+=______
}
print "" }'
|
-- line # 1
aaa 3 3 | 3 3
aaaa 2 4 | 5 7
aaaaa 3 9 | 8 16
aaaaaaaaaaaaaaa 1 13 | 9 29
-- line # 2
aaa 1 1 | 1 1
aaaa 1 2 | 2 3
aaaaa 1 3 | 3 6
Print out all the copies of that substring :
{m,g}awk' {
2 printf("%s%.*s",____=$(_=_<_),_, NF=NF)
9 do { _+=gsub(__,_____)
} while(index($+__,__))
2 if(_) {
2 ____=substr(____,-_<_,_)
2 gsub(".", (":")__, ____)
2 print "}-[(# " (_) ")]--;\f\b" substr(____, 2)
} else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'
|
aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
aaa:aaa:aaa:aaa:aaa:aaa

Sorting tab-delimited numbers by column with a pure bash script

I'm stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by column. The shell script must be pure bash, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line
do
    echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
    ....
    # I perform the stats calculations
    # for each row by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you want to analyze by columns, you'll need the number of columns (cols) first. head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 test.txt | awk '{print NF}');
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in `seq 1 $cols`; do cut -f$i -d$'\t' input.txt; done | sort -n > output.txt
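If cut is also considered off-limits for a "pure bash" assignment, one column can be collected with read -a instead; a sketch (my own illustration, c and the file names are placeholders; sort is still external, but the question's own row code already uses it):
c=3                                   # 1-based index of the wanted column
col=()
while read -r -a fields; do
    col+=( "${fields[c-1]}" )         # bash evaluates the arithmetic in the index
done < input.txt
printf '%d\n' "${col[@]}" | sort -n > output.txt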
For rows, you can use the shell built-in printf with the format modifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814

How to find sum of elements in column inside of a text file (Bash)

I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which will accept a column name as argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0
NOTE: Some notes
......
Skipped ......
The expected usage is script.sh Attr1
Expected output:
888
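(That is the Attr1 column summed: 885 + 1 + 2 = 888.)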
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively.
with wc -w, count the number of words in the line that contains both NAME and ${attribute}; that count is the field index, assigned to field
with sed
suppress automatic printing (-n) and enable extended regular expressions (-r)
find lines between the NAME and NOTE lines, inclusive
delete the lines that match NAME and NOTE
translate each contiguous run of whitespace to a single tab and print the result
cut using the field index
paste all numbers as an infix summation
evaluate the infix summation via bc
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $( CountCol) } END{ print S + 0 }' YourFile
with column name
awk -v ColName='Attr1' '/^[[:blank:]]/ && NF == 6 { for(i=1;i<=NF;i++){ if ($i == ColName) CountCol = i } }
  /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) }
  END{ print S + 0 }' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag would suit this perfectly), but for lack of info about the structure to set such a flag, I use a simple field count (assuming text fields evaluate to 0, so they don't change the sum when taken into account).
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182

sum of column in text file using shell script

I have a file like this:
1814 1
2076 2
2076 1
3958 1
2076 2
2498 3
2858 2
2858 1
1818 2
1814 1
2423 1
3588 12
2026 2
2076 1
1814 1
3576 1
2005 2
1814 1
2107 1
2810 1
I would like to generate a report like this:
1814 3
2076 6
3958 1
2858 3
Basically calculate the total for each unique value in column 1
Using awk:
awk '{s[$1] += $2} END{ for (x in s) print x, s[x] }' input
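Note that awk's for (x in s) visits the keys in no particular order; if the report should be ordered by the first column, pipe the same command through sort -n:
awk '{s[$1] += $2} END{ for (x in s) print x, s[x] }' input | sort -n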
Pure Bash:
declare -a sum
while read key val ; do
    ((sum[key]+=val))
done < "$infile"
for key in ${!sum[@]}; do
    printf "%4d %4d\n" $key ${sum[$key]}
done
The output is sorted:
1814 4
1818 2
2005 2
2026 2
2076 6
2107 1
2423 1
2498 3
2810 1
2858 3
3576 1
3588 12
3958 1
Perl solution:
perl -lane '$s{$F[0]} += $F[1] }{ print "$_ $s{$_}" for keys %s' INPUT
Note that the output is different from the one you gave.
sum totals for each primary key (integers only)
for key in $(cut -d' ' -f1 test.txt | sort -u)
do
    echo $key $(echo $(grep $key test.txt | cut -d' ' -f2 | tr \\n +)0 | bc)
done
simply sum a column of integers
echo $(cut -d' ' -f2 test.txt | tr \\n +)0 | bc
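The same sum can also be written with paste, as in the log-file answer further up:
cut -d' ' -f2 test.txt | paste -sd+ - | bc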
