sum of column in text file using shell script - bash

I have file like this
1814 1
2076 2
2076 1
3958 1
2076 2
2498 3
2858 2
2858 1
1818 2
1814 1
2423 1
3588 12
2026 2
2076 1
1814 1
3576 1
2005 2
1814 1
2107 1
2810 1
I would like to generate report like this
1814 3
2076 6
3958 1
2858 3
Basically calculate the total for each unique value in column 1

Using awk:
awk '{s[$1] += $2} END{ for (x in s) print x, s[x] }' input

Pure Bash:
declare -a sum
while read key val ; do
((sum[key]+=val))
done < "$infile"
for key in ${!sum[#]}; do
printf "%4d %4d\n" $key ${sum[$key]}
done
The output is sorted:
1814 4
1818 2
2005 2
2026 2
2076 6
2107 1
2423 1
2498 3
2810 1
2858 3
3576 1
3588 12
3958 1

Perl solution:
perl -lane '$s{$F[0]} += $F[1] }{ print "$_ $s{$_}" for keys %s' INPUT
Note that the output is different from the one you gave.

sum totals for each primary key (integers only)
for key in $(cut -d\ -f1 test.txt | sort -u)
do
echo $key $(echo $(grep $key test.txt | cut -d\ -f2 | tr \\n +)0 | bc)
done
simply sum a column of integers
echo $(cut -d\ -f2 test.txt | tr \\n +)0 | bc

Related

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk length count one more letter for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, despite the fact it is not supposed to.
Is there any solution? Or something I'm doing wrong?
I'm gonna venture a guess. Isn't your awk expecting "U*X" style newlines (LF), but your dico.txt has Windows style (CR+LF). That easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but dico.txt with windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8

Sorting tab delimited numbers by column with pure bash script.

Im stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by value. The shell script must be pure bash script so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
....
#I perform the stats calculations
# for row line by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you wanted to analyze by columns you'll need the cols value first (number of columns). head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 test.txt | awk '{print NF}');
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in `seq 2 $((cols+1))`; do cut -f$i -d$'\t' input.txt; done | sort -n > output.txt
For rows, you can use the shell built-in printf with the format modifier %dfor integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ $total += $2 } END { print $total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814

merge two files having the same value in bash

I am trying to merge 2 files in one single.
FILE1
2015-09-30T13:30:57+01:00 6 1
2015-09-30T13:30:58+01:00 6 1
2015-09-30T13:30:59+01:00 6 1
2015-09-30T13:31:00+01:00 6 1
2015-09-30T13:31:01+01:00 6 1
2015-09-30T13:31:02+01:00 6 1
2015-09-30T13:31:04+01:00 6 1
FILE2
2015-09-30T13:16:19+01:00 4
2015-09-30T13:16:20+01:00 7
2015-09-30T13:16:21+01:00 7
2015-09-30T13:16:22+01:00 8
2015-09-30T13:16:23+01:00 8
2015-09-30T13:16:24+01:00 7
2015-09-30T13:16:25+01:00 2
2015-09-30T13:16:26+01:00 4
2015-09-30T13:16:27+01:00 1
2015-09-30T13:30:58+01:00 1
The result that I am trying to get is to add the column 2 from FILE2 being added to FILE1 as fourth columns as the time match:
2015-09-30T13:30:57+01:00 6 1 4
2015-09-30T13:16:23+01:00 8 3 1
Thank you for your help,
Al.
Use cut to find the first column and nested while loop to compare the firsts columns:
#!/usr/bin/bash
printf "" > FILE3
while read line1; do
file1_first_col=$(printf "${line1}" | cut -f1 -d' ')
printf "${line1}" >> FILE3
while read line2; do
file2_first_col=$(printf "${line2}"| cut -f1 -d' ')
if [[ "${file1_first_col}" == "${file2_first_col}" ]]; then
file2_second_col=$(printf "${line2}" | cut -f2 -d' ')
printf " ${file2_second_col}" >> FILE3
fi
done < FILE2
printf "\n" >> FILE3
done < FILE1
Then print the result to a file called FILE3.
NOTE that for large files this may be very slow.

Sum of Columns for multiple variables

Using Shell Script (Bash), I am trying to sum the columns for all the different variables of a list. Suppose I have the following input of a Test.tsv file
Win Lost
Anna 1 1
Charlotte 3 1
Lauren 5 5
Lauren 6 3
Charlotte 3 2
Charlotte 4 5
Charlotte 2 5
Anna 6 4
Charlotte 2 3
Lauren 3 6
Anna 1 2
Anna 6 2
Lauren 2 1
Lauren 5 5
Lauren 6 6
Charlotte 1 3
Anna 1 4
And I want to sum up how much each of the participants have won and lost. So I want to get this as a result:
Sum Win Sum Lost
Anna 57 58
Charlotte 56 57
Lauren 53 56
What I would usually do is take the sum per person and per column and repeat that process over and over. See below how I would do it for the example mentioned:
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f2 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f3 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
However I would need to repeat this line for every participant. This becomes a pain when you have to many variables you want to sum it up for.
What would be the way to write this script?
Thanks!
This is pretty straightforward with awk. Using GNU awk:
awk -F '\t' 'BEGIN { OFS = FS } NR > 1 { won[$1] += $2; lost[$1] += $3 } END { PROCINFO["sorted_in"] = "#ind_str_asc"; print "", "Sum Win", "Sum Lost"; for(p in won) print p, won[p], lost[p] }' filename
-F '\t' makes awk split lines at tabs, then:
BEGIN { OFS = FS } # the output should be separated the same way as the input
NR > 1 { # From the second line forward (skip header)
won[$1] += $2 # tally up totals
lost[$1] += $3
}
END { # When done, print the lot.
# GNU-specific: Sorted traversal or player names
PROCINFO["sorted_in"] = "#ind_str_asc"
print "", "Sum Win", "Sum Lost"
for(p in won) print p, won[p], lost[p]
}

How to grep two column from a single file

cat Error00
4 0 375
4 2001 21
4 2002 20
cat Error01
4 0 465
4 2001 12
4 2002 40
4 2016 1
I want output as below
4 0 375 465
4 2001 21 12
4 2002 20 20
4 2016 - 1
i am using the below query. here problem is i m not able to handle grep for two field because space is coming.
please suggest how can to get rid of this.
keylist=$(awk '{print $1,$2'} Error0[0-1] | sort | uniq)
for key in ${keylist} ; do
echo ${key}
val_a=$(grep "^${key}" Error00 | awk '{print $3}') ;val_a=${val_a:---}
val_b=$(grep "^${key}" Error01 | awk '{print $1,$2}') ; val_b=${val_b:--- --}
echo $key ${val_a} >>testreport
done
i m geting the oputput as below
4 375 465
0
4 21 12
2001
4 20 20
2002
4 - 1
2016
A single awk one liner can handle this easily:
awk 'FNR==NR{a[$1,$2]=$3;next}{print $1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3}' err0 err1
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
For formatted output you can use printf instead of print. Like Jonathan Leffler suggest:
printf "%s %-6s %-6s %s\n",$1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
However a general solution is to use column -t for a nice table output:
awk '{....}' err0 err1 | column -t
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
grep is not really the right tool for this job. You can either play with awk or Perl (or Python, or …), or you can use join. However, join only joins on a single column at a time, and you appear to need to join on two columns. So, we're going to have to massage the data so that it will work with join. I'm about to assume you're using bash and so have process substitution available. You can do the job without, but it is fiddlier and involves temporary files (and traps to clean them up, etc).
The key to the join will be to replace the blank between the first two columns with a colon (or any other convenient character — control-A would work fine too), then join the files on column 1 with a replacement character. The inputs must be sorted; the output must have the colon replaced with a blank.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/ */:/' Error00 | sort) \
> <(sed 's/ */:/' Error01 | sort) |
> sed 's/:/ /'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$
The 's/ */:/' operation replaces the first sequence of one or more blanks with a colon; the input data has two blanks between the 4 and the 0 in the first line of Error00. The input to join must be in sorted order of the joining field, here the first field. The output is the join field, the second column of Error00 and the second column of Error01 (remembering that means the second column after the first two have been fused by the colon). If there's an unmatched line in the first file, generate an output line (-a 1); ditto for the second file; and for the missing fields, insert a dash (-e '-'). The final sed removes the colon that was added.
If you want the data formatted, pipe it through awk.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/ */:/' Error00 | sort) \
> <(sed 's/ */:/' Error01 | sort) |
> sed 's/:/ /' |
> awk '{printf("%s %-6s %-6s %s\n", $1, $2, $3, $4)}'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$

Resources