Average of first ten numbers of text file using bash - bash

I have a file of two columns. The first column is dates and the second contains a corresponding number. The two commas are separated by a column. I want to take the average of the first three numbers and print it to a new file. Then do the same for the 2nd-4th number. Then 3rd-5th and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?

Input
akshay#db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, '{
x = $2;
i = NR % n;
ma += (x - q[i]) / n;
q[i] = x;
if(NR>=n)print ma;
}' file.txt
2
2
4
OR below one useful for plotting and keeping reference axis (in your case date) at center of average point
Script
akshay#db-3325:/tmp$ cat avg.awk
BEGIN {
m=int((n+1)/2)
}
{L[NR]=$2; sum+=$2}
NR>=m {d[++i]=$1}
NR>n {sum-=L[NR-n]}
NR>=n{
a[++k]=sum/n
}
END {
for (j=1; j<=k; j++)
print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4

$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
Add a little bit math tricks here, set $2 to a[NR%3] for each record. So the value in each element would be updated cyclically. And the sum of a[0], a[1], a[2] would be the sum of past 3 numbers.

updated based on the changes made due to the helpful feedback from Ed Morton
here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility in it but you can easily figure out how to extend it.
To run save it into a file and execute it as an awk script either with a shebang line or by calling awk -f
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
Line 1: Match all lines, "/" is the match operator and in this case we have an empty match which means "do this thing on every line". Line 3: Use the Record Number (NR) as the key and store the value from column 2 Line 4: If we have 3 or more values read from the file Line 5: Do the maths and print as an integer BEGIN block: Change the Field Separator to a comma ",".

Related

awk to get first column if the a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15, like here I will have 2nd and 3rd line first column only is expected.
345
456
I tried like cat file.txt | awk -F [,TPF=]' '{print $1}' but its print whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples please try following GNU awk code. Using match function of GNU awk where I am using regex (^[0-9]+).*ETA=([0-9]+):[0-9]+ which creates 2 capturing groups and saves its values into array arr. Then checking condition if 2nd element of arr is greater than 15 then print 1st value of arr array as per requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task following way, let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use String functions, index to find where is ETA= then substr to get 2 characters after ETA=, 4 is used as ETA= is 4 characters long and index gives start position, I use +0 to convert to integer then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[]) below and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. Probably separately split the line to obtain the first column, space-separated.
awk -F 'ETA=' '$2 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one) but using it in a numeric comparison simply ignores any non-numeric text after the number at the beginning of the field. So for example, on the first line, we are actually literally checking if 12:00, team=xyz,user1=tom,dom=dby.com is larger than 15 but it effectively checks if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4) > 15)+0 print $1
}' file

Loop to create a a DF from values in bash

Im creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all values beetwen two values:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
And create files from 6 to 10, from 11 to 15... from 95 to 100. I was thinking in creating a loop for this with something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can i convert this as a loop and create files with this specific values.
While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, ' NR > 1 {
i=int((($6 - 1) / 5))
fname=(i*5) "_" (i+1)*5 ".txt"
print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" by dividing the value in column six by five. The filename is then constructed by multiplying that index (and its increment) by five. The whole line is then printed to that filename.
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for((i=0; i <= 19; i++))
do
floor=$((i * 5))
ceiling=$(( (i+1) * 5))
awk -F, -v floor="$floor" -v ceiling="$ceiling" \
'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
> "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file.

Sort & Uniq the values on specific column

I am having a data separated by : delimeted
AA:w_c;w_c;r_c:1;3
BB:sync;sync:4
CC:t_wak;t_wak:6;7;8
I need to print only one value in column 2 that to unique value. If there are more than one unique value then it need to print in another file.
I tried this:
#!/bin/bash
sort -u -t : -k2,2 file >> txt
awk -F: '{gsub(";"," ",$3)}1' txt
Output:
BB:sync;sync:4
CC t_wak;t_wak 6 7 8
AA w_c;w_c;r_c 1 3
Actually I am trying to to do sort and uniq the values in column 2 and copying that output to another file called "txt". Then I am using AWk to replace the ; with space in column 3 seems above code is not working.
Desired Output 1:
BB:sync:4
CC:t_wak:6 7 8
The above two values are the actual output we need to get to print because in column 2 it contains only one value.
The below one needs to print in another file because in column 2 it contains more than one value.
Desired output 2:
AA:w_c;r_c:1;3
w_c
r_c
In column 2 it should have only one value, if there are more than one then need to print in another file by stating them as shown above.
This quick solution should work for the example:
awk 'BEGIN{FS=OFS=":"}
{
split($2, a, ";")
v=""; delete u
for(i=1;i<=length(a);i++){
if( ++u[a[i]]<2)
v=v (i==1?"":";") a[i]
}
$2=v
if(length(u)>1){
print > "output2.txt"
next
}
}7' input
Let's do a test:
kent$ awk 'BEGIN{FS=OFS=":"}
{
split($2, a, ";")
v=""; delete u
for(i=1;i<=length(a);i++){
if( ++u[a[i]]<2)
v=v (i==1?"":";") a[i]
}
$2=v
if(length(u)>1){
print > "output2.txt"
next
}
}7' f
BB:sync:4
CC:t_wak:6;7;8
kent$ cat output2.txt
AA:w_c;r_c:1;3
If you want to have each value in col2 in the output2.txt:
awk 'BEGIN{FS=OFS=":";out2="output2.txt"}
{
split($2, a, ";")
v=""; delete u
for(i=1;i<=length(a);i++){
if( ++u[a[i]]<2)
v=v (i==1?"":";") a[i]
}
$2=v
if(length(u)>1){
print > out2
for(x in u)
print x > out2
next
}
}7' input
Then you'll get:
kent$ cat output2.txt
AA:w_c;r_c:1;3
w_c
r_c

bash cycle - output according to string from file

How to call the output file as the string in 4th column of output (or according to 4th column of ith row of the input)?
I tried:
for i in {1..321}; do
awk '(FNR==i) {outfile = $4 print $0 >> outfile}' RV1_phase;
done
or
for i in {1..321}; do
awk '(FNR==i) {outfile = $4; print $0}' RV1_phase > "$outfile";
done
input file:
1 2 2 a
4 5 6 f
4 4 5 f
....
....
desired input i=1
name: a
1 2 2 a
The aim: I have data that I plotted in gnuplot and I would like to plot set of figures named after string to know which point come from which file. The point will be coloured. I need to get files for plotting in gnuplot so I would like to create them using the cycle from my question.
Simply
for i in {1..321}; do
awk '(FNR==i) {print $0 >> $4}' RV1_phase;
done
The problem with your first attempt was that you didn't use a ; to separate the assignment to outfile from the print command. The separate variable isn't necessary, though.
You don't need a bash loop, either:
awk '1 <= FNR && FNR <= 321 {print $0 >> $4}' RV1_phase;

Shell command to sum up numbers across similar lines of text in a file

I have a file with thousands of lines, each containing a number followed by a line of text. I'd like to add up the numbers for the lines whose text is similar. I'd like unique lines to be output as well.
For example:
25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee
The output would be:
37 cup of coffee
75 sign on the dotted
30 take a test
Any suggestions how this could be achieved in unix shell?
I looked at Shell command to sum integers, one per line? but this is about summing up a column of numbers across all lines in a file, not across similar text lines only.
There is no need for multiple processes and pipes. awk alone is more than capable of handling the entire job (and will be orders of magnitude faster on large files). With awk simply append each of the fields 2-NF as a string and use that as an index to sum the numbers in field 1 in an array. Then in the END section, simply output the contents of the array, e.g. presuming your data is stored in file, you could do:
awk '{
for (i=2; i<=NF; i++)
str = str " " $i
a[str] += $1
str=""
}
END {
for (i in a) print a[i], i
}' file
Above, the first for loop simply appends all fields from 2-NF in str, a[str] += $1 sums the values in field 1 into array a using str as an index. That ensures the values for similar lines are summed. In the END section, you simply loop over each element of the array outputting the element value (the sum) and then the index (original str for fields 2-NF).
Example Use/Output
Just take what is above, select it, and then middle-mouse paste it into a command line in the directory where your file is located (change the name of file to your data file name)
$ awk '{
> for (i=2; i<=NF; i++)
> str = str " " $i
> a[str] += $1
> str=""
> }
> END {
> for (i in a) print a[i], i
> }' file
30 take a test
37 cup of coffee
75 sign on the dotted
If you want the lines sorted in a different order, just add | sort [options] after the filename to pipe the output to sort. For example for output in the order you show, you would use | sort -k 2 and the output would be:
37 cup of coffee
75 sign on the dotted
30 take a test
Preserving Original Order Of Strings
Pursuant to your comment regarding how to preserve the original order of the lines of text seen in your input file, you can keep a second array where the strings are stored in the order they are seen using a sequential index to keep them in order. For example the o array (order array) is used below to store the unique string (fields 2-NF) and the variable n is used as a counter. A loop over the array is used to check whether the string is already contained, and if so, next is used to avoid storing the string and jump to the next record of input. In END the loop then uses a for (i = 0; i < n; i++) form to output the information from both arrays in the order the string were seen in the original file, e.g.
awk -v n=0 '{
for (i=2; i<=NF; i++)
str = str " " $i
a[str] += $1
for (i = 0; i < n; i++)
if (o[i] == str) {
str=""
next;
}
o[n++] = str;
str=""
}
END {
for (i = 0; i < n; i++) print a[o[i]], o[i]
}' file
Output
37 cup of coffee
75 sign on the dotted
30 take a test
Here is a simple awk script that do the task:
script.awk
{ # for each input line
inpText = substr($0, length($1)+2); # read the input text after 1st field
inpArr[inpText] = inpArr[inpText] + 0 + $1; # accumulate the 1st field in array
}
END { # post processing
for (i in inpArr) { # for each element in inpArr
print inpArr[i], i; # print the sum and the key
}
}
input.txt
25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee
running:
awk -f script.awk input.txt
output:
75 sign on the dotted
37 cup of coffee
30 take a test
Using datamash is relatively succinct. First use sed to change the first space to a tab, (for this job datamash must have one and only one tab separator), then use -s -g2 to sort groups by the 2nd field, (i.e. "cup" etc.), then use sum 1 to add up the first column numbers by group, and it's done. No, not quite -- the number column migrated to the 2nd field for some reason, so reverse migrates it back to the 1st field:
sed 's/ /\t/' file | datamash -s -g2 sum 1 | datamash reverse
Output:
37 cup of coffee
75 sign on the dotted
30 take a test
You can do the following (assume the name of the file is file.txt):
for key in $(sort -k2 -u file.txt | cut -d ' ' -f2)
do
cat file.txt|grep $key | awk '{s+=$1} END {print $2 "\t" s}'
done
Explanation:
1. get all unique keys (cup of coffee, sign on the dotted, take a test):
sort -k2 -u file.txt | cut -d ' ' -f2
2. grep all lines with unique key from the file:
cat file.txt | grep $key
3. Sum the lines using awk where $1=number column and $2 = key
awk '{s+=$1} END {print $2 "\t" s}'
Put everything in for loop and iterate over the unique keys
Note: If a key can be a sub-string of another key, for example "coffee" and "cup of coffee" you will need to change step 2 to grep with regex
you mean something like this?
#!/bin/bash
# define a dictionary
declare -A dict
# loop over all lines
while read -r line; do
# read first word as value and the rest as text
IFS=' ' read value text <<< "$line"
# use 'text' as key, get value for 'text', default 0
[ ${dict[$text]+exists} ] && dictvalue="${dict[$text]}" || dictvalue=0
# sum value
value=$(( $dictvalue + value ))
# save new value in dictionary
dict[$text]="$value"
done < data.txt
# loop over dictionary, print sum and text
for key in "${!dict[#]}"; do
printf "%s %s\n" "${dict[$key]}" "$key"
done
output
37 cup of coffee
75 sign on the dotted
30 take a test
Another version based on the same logic as mentioned here #David.
Changes: It omits loops to speed up the process.
awk '
{
text=substr($0, index($0,$2))
if(!(text in text_sums)){ texts[i++]=text }
text_sums[text]+=$1
}
END {
for (i in texts) print text_sums[texts[i]],texts[i]
}' input.txt
Explanation:
substr returns the string starting with field 2. i.e. text part
array texts stores text on integer index, if its not present in text_sums array.
text_sums keep adding field 1 for a corresponding text.
Reason behind a separate array to store text as value backed by consecutive integer as index, is to assures the order of value (text) while accessing in same consecutive order.
See Array Intro
Footnotes says:
The ordering will vary among awk implementations, which typically use hash tables to store array elements and values.

Resources