I'm not a native speaker, so the best way to explain is to give an example of what I have to do.
name1: 15
name2: 20
name1: 8
name3: 30
This is a short example of the output I get when grepping from a file.
Now I'm not sure how to sum those numbers so that the final result is
name1: 23
name2: 20
name3: 30
There are several ways to solve this; the only one I currently see involves arrays, which I was told are not the best thing to reach for in Bash.
Thank you for your help and sorry if the question has been asked before.
awk 'NF{a[$1]+=$NF} END{for(i in a)print i, a[i]}' File
This would work for all non-empty lines.
Example:
$ cat File
name1: 15
name2: 20
name1: 8
name3: 30
Output:
$ awk 'NF{a[$1]+=$NF} END{for(i in a)print i, a[i]}' File
name1: 23
name2: 20
name3: 30
Apologies for the previous answer... I didn't read your question properly (not enough coffee yet).
This might do what you want.
declare -A group_totals
while read -r group value ; do
    group_totals[$group]=$(( ${group_totals[$group]:-0} + value ))
done < <(grep command_here input_file)

for group in "${!group_totals[@]}" ; do
    echo "$group: ${group_totals[$group]}"
done
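As a quick usage sketch against the sample lines from the question (my own illustration, not part of the answer above: the keys in that sample carry a trailing colon, so it is stripped before being used as the array key, and the unspecified grep is stood in for by a printf of the sample data):

declare -A group_totals
while read -r group value ; do
    group=${group%:}        # drop the trailing colon from keys like "name1:"
    group_totals[$group]=$(( ${group_totals[$group]:-0} + value ))
done < <(printf '%s\n' 'name1: 15' 'name2: 20' 'name1: 8' 'name3: 30')

for group in "${!group_totals[@]}" ; do
    echo "$group: ${group_totals[$group]}"
done

This prints name1: 23, name2: 20 and name3: 30, although the iteration order of an associative array is not guaranteed.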
I have hundreds of thousands of files with several hundreds of thousands of lines in each of them.
2022-09-19/SALES_1.csv:CUST1,US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20/SALES_2.csv:CUST2,NA,2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
The lines may have different numbers of fields. No matter how many fields are present, I want to extract just the last 7 fields.
I'm trying with cut and awk, but I have only been able to print a range of fields, not the last n fields.
Could I please request guidance?
$ rev file | cut -d, -f1-7 | rev
This will give the last 7 fields regardless of the varying number of fields in each record.
Using any POSIX awk:
$ awk -F',' 'NF>7{sub("([^,]*,){"NF-7"}","")} 1' file
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
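For readability, here is the same idea spelled out with comments (my own expansion of the one-liner above, not a different technique):

awk -F',' '
    NF > 7 {
        # build a regex matching the first NF-7 comma-terminated fields
        # and delete them from the front of the record
        sub("([^,]*,){" NF-7 "}", "")
    }
    1   # print every record, trimmed or not
' file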
{m,g}awk 'BEGIN { _+=(_+=_^= FS = OFS = ",")+_
        ___= "^[^"(__= "\5") ("]*")__

} NF<=_ || ($(NF-_) = __$(NF-_))^(sub(___,"")*!_)'
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
In pure Bash, without any external processes or pipes:
(IFS=,; while read -ra line; do printf '%s\n' "${line[*]: -7}"; done;) < file
Prints the last 7 fields:
sed -E 's/.*,((.*,){6}.*)/\1/' file
Here is the code; the CN part is working, but the awk part is not...
I run the function outside and it seems really clear. I have only just met bash :(
windowsearch()
{
    starting_line_number=$1
    ending_line_number=$2
    position=$3
    file_name=$4
    CN=$(head -40 "$4" | sed -n "$starting_line_number","$ending_line_number p")
    awk -v CN="$CN" -F "\t" '{ print CN }' "$file_name" | sort -n -k"$position"
}
windowsearch 10 20 2 $imdbdir/tsv2/title.principals.tsv
The desired output should look like:
tt0000009 nm0085156,nm0063086,nm1309758,nm0183823
tt0000014 nm0166380,nm0525910,nm0244989
tt0000010 nm0525910
tt0000016 nm0525910
tt0000012 nm0525910,nm0525908
tt0000015 nm0721526
tt0000018 nm0804434,nm3692071
tt0000019 nm0932055
tt0000013 nm1715062,nm0525910,nm0525908
tt0000017 nm3691272,nm0804434,nm1587194,nm3692829
tt0000011 nm3692297,nm0804434
But my output gives me all the data in the file, so I think my filter doesn't work.
Edit: sorry for the misunderstanding, this is my first question.
Your question lacks a description of the task and, ideally, examples of input data and desired output. It is hard to guess someone’s intentions from a completely broken script snippet. A possible wild guess might be:
windowsearch() {
    awk "NR > ${2} {exit}
         NR >= ${1}" < "$4" | sort -k "$3"
}
The awk code exits after it exceeds the upper limit on line numbers and prints entire lines after it reaches the lower limit. (NR is the current line number.) The output from awk (which is the interval of lines between the lower and upper limit) gets sorted (which awk itself can do as well, but using sort was shorter in this case).
Example (sort /etc/fstab lines 9 through 13 by mount point (field 2)):
windowsearch 9 13 2 /etc/fstab
My interpretation of your intention: you want to sort a range of lines from a file based on a given column.
$ awk -v start=10 -v end=20 'start<=NR && NR<=end' file | sort -n -k2
Just parameterize the input values in your script.
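As a hedged sketch of that parametrization, reusing the OP's function name and argument order (start line, end line, sort column, file):

windowsearch() {
    awk -v start="$1" -v end="$2" 'start <= NR && NR <= end' "$4" | sort -n -k"$3"
}
windowsearch 10 20 2 "$imdbdir/tsv2/title.principals.tsv"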
I have written some simple code that takes data from a text file (which has space-separated columns and 1.5 million rows) and produces an output file with the specified column. But this code takes more than an hour to execute. Can anyone help me optimize the runtime?
a=0
cat 1c_input.txt/$1 | while read p
do
    IFS=" "
    for i in $p
    do
        a=`expr $a + 1`
        if [ $a -eq $2 ]
        then
            echo "$i"
        fi
    done
    a=0
done >> ./1.c.$2.column.freq
some lines of sample input:
1 ib Jim 34
1 cr JoHn 24
1 ut MaRY 46
2 ti Jim 41
2 ye john 6
2 wf JoHn 22
3 ye jOE 42
3 hx jiM 21
some lines of sample output if the second argument entered is 3:
Jim
JoHn
MaRY
Jim
john
JoHn
jOE
jiM
I guess you are trying to print just one column; then do something like:
#! /bin/bash
awk -v c="$2" '{print $c}' 1c_input.txt/$1 >> ./1.c.$2.column.freq
If you just want something faster, use a utility like cut. So to
extract the third field from a single space delimited file bigfile
do:
cut -d ' ' -f 3 bigfile
To optimize the shell code in the question, using only builtin shell
commands, do something like:
while read -r a b c d; do echo "$c"; done < bigfile
...if the field to be printed is a command line parameter, there are
several shell command methods, but they're all based on that line.
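For example, if the field number arrives as the script's second argument (as in the question), a hedged pure-bash sketch along those lines, using an array instead of fixed variable names (bigfile is an illustrative name), would be:

col=$2                           # the column to print, as in the question's second argument
while read -r -a fields; do
    echo "${fields[col-1]}"      # bash arrays are 0-indexed, fields are 1-indexed
done < bigfile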
I'm attempting to use an awk one-liner to print lines of a file in which a substring is less than a defined variable. The line must also start with the letter E. The E condition is working, but the simple 'less than' test is not giving the result I'm looking for. What am I doing wrong here? It is incorporated into a larger bash script. Thanks in advance.
#!/bin/bash
minimum_dpt=50
awk -v depth="$minimum_dpt" '{if (/^E/ && int(substr($0,65,6)<depth)) print "Shot: ",substr($0,21,5)," has depth below minimum. Value: ",substr($0,65,6)}'
Input:
E1985020687 1 1 2942984632.99S 88 354.60E 596044.16185585.10000.9 44 826 9
E1985020687 1 1 2943264732.95S 88 359.24E 595917.26185461.80000.5 44 82727
E1985020687 1 1 2944264741.97S 88 450.86E 594520.36185751.92445.3 44 82846
E1985020687 1 1 2945264741.97S 88 450.86E 594520.36185751.90045.3 44 82846
Output:
Shot: 2942 has depth below minimum. Value: 0000.9
Shot: 2943 has depth below minimum. Value: 0000.5
Shot: 2945 has depth below minimum. Value: 0045.3
You probably intended:
int(substr($0,65,6))<depth
or even just:
(substr($0,65,6)+0)<depth
instead of what you have:
int(substr($0,65,6)<depth)
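A small self-contained illustration of the difference (my own sketch, using one of the depth substrings from the sample data): in the broken form the comparison runs first, so int() only ever sees its 0-or-1 truth value.

awk 'BEGIN {
    s = "2445.3"; depth = 50
    print (int(s) < depth)   # 0: 2445 is not below 50, the intended numeric test
    print int(s < depth)     # 1: the comparison happens first, then int() of its truth value
}'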
There's probably a better way to do this, but without seeing your input and output I don't know...
A possible solution for a task like that:
$ cat input
102030405060
102030405060
203050601070
904050308090
104030607040
406080903040
$ awk -v dpt=50 '/^1/ && (int(substr($0, 9, 2)) > int(dpt))' <input
104030607040
(edited according to Ed's comment, thanks ;)
I'm trying to write a Bash script that reads files with several columns of data and multiplies each value in the second column by the corresponding value in the third column, adding the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84), and so on.
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
    set -- $(awk '/genome/ {print $2,$3}' $file.hist)
    var1=$1
    var2=$2
    var3=$((var1*var2))
    total=$((total+var3))
    echo var1 \= $var1
    echo var2 \= $var2
    echo var3 \= $var3
    echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!
That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
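If you did want to stay with the set -- approach, one hedged sketch (my own, not from the original script) is to walk the positional parameters two at a time instead of stopping at $1 and $2:

set -- $(awk '/genome/ {print $2,$3}' "$file.hist")
total=0
while (( $# >= 2 )); do
    (( total += $1 * $2 ))   # multiply each column-2/column-3 pair
    shift 2                  # move on to the next pair
done
echo "total = $total"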
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special pattern (a GNU awk extension) that means "run this next block at the end of each file, not at each line."
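If you need this to run on awk implementations without ENDFILE, a hedged POSIX-awk sketch (assuming every input file has at least one line) is to flush the running total whenever a new file begins and once more at the very end:

awk 'FNR == 1 && NR > 1 { print total; total = 0 }   # a new input file just started
     { total += $2 * $3 }
     END { print total }' fileone filetwo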
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.
I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
    total=0
    while read -r _ x y _; do
        ((total += x * y))
    done < "$f"
    echo "$total"
done