Sum values of specific column in multiple files, considering ranges defined in another file - bash

I have a file (let's say file B) like this:
File B:
A1 3 5
A1 7 9
A2 2 5
A3 1 3
The first column defines a filename and the other two define a range in that specific file. In the same directory, I also have three more files (files A1, A2 and A3). Here are 10 sample lines from each file:
File A1:
1 0.6
2 0.04
3 0.4
4 0.5
5 0.009
6 0.2
7 0.3
8 0.2
9 0.15
10 0.1
File A2:
1 0.2
2 0.1
3 0.2
4 0.4
5 0.2
6 0.3
7 0.8
8 0.1
9 0.9
10 0.4
File A3:
1 0.1
2 0.2
3 0.5
4 0.3
5 0.7
6 0.3
7 0.3
8 0.2
9 0.8
10 0.1
I need to add a new column to file B that, in each line, gives the sum of the column-two values in the specified file and range. For example, row 1 of file B means: compute the sum of the values in column two, lines 3 to 5, of file A1.
The desired output is something like this:
File B:
A1 3 5 0.909
A1 7 9 0.65
A2 2 5 0.9
A3 1 3 0.8
All files are in tabular text format. How can I perform this task? I have access to bash (Ubuntu 14.04) and R, but I am not an expert bash or R programmer.
Any help would be greatly appreciated.
Thanks in advance

Given the first file fileB and 3 input files A1, A2 and A3, each with two columns, this produces the output that you want:
#!/bin/bash
while read -r file start end; do
    sum=$(awk -v s="$start" -v e="$end" 'NR==s,NR==e{sum+=$2} END{print sum}' "$file")
    echo "$file $start $end $sum"
done < fileB
This uses awk to sum the column-two values for the lines in the range given by the variables s and e. It's not particularly efficient, since it reads $file once per line of fileB, but depending on the size of your inputs, that may not be a problem.
Output:
A1 3 5 0.909
A1 7 9 0.65
A2 2 5 0.9
A3 1 3 0.8
To redirect the output to a file, just add > output_file to the end of the loop. To overwrite the original file, you will need to first write to a temporary file, then overwrite the original file (e.g. > tmp && mv tmp fileB).
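If fileB grows large, re-reading a data file once per range starts to add up. A sketch of a single-pass alternative: let awk read fileB first, remember every requested range, and then scan each data file exactly once. The sample files are recreated inline here so the snippet is self-contained:

```shell
#!/bin/bash
# Recreate the sample data from the question.
printf '%s\n' 'A1 3 5' 'A1 7 9' 'A2 2 5' 'A3 1 3' > fileB
printf '%s\n' '1 0.6' '2 0.04' '3 0.4' '4 0.5' '5 0.009' \
              '6 0.2' '7 0.3' '8 0.2' '9 0.15' '10 0.1' > A1
printf '%s\n' '1 0.2' '2 0.1' '3 0.2' '4 0.4' '5 0.2' \
              '6 0.3' '7 0.8' '8 0.1' '9 0.9' '10 0.4' > A2
printf '%s\n' '1 0.1' '2 0.2' '3 0.5' '4 0.3' '5 0.7' \
              '6 0.3' '7 0.3' '8 0.2' '9 0.8' '10 0.1' > A3

# NR==FNR is true only while reading the first argument (fileB):
# store each range there, then scan A1..A3 once, accumulating matches.
awk '
NR == FNR { file[++n] = $1; start[n] = $2; end[n] = $3; next }
{
  for (i = 1; i <= n; i++)
    if (file[i] == FILENAME && FNR >= start[i] && FNR <= end[i])
      sum[i] += $2
}
END {
  for (i = 1; i <= n; i++)
    print file[i], start[i], end[i], sum[i] + 0
}' fileB A1 A2 A3 | tee sums.txt
```

Each data file is read once regardless of how many ranges refer to it, at the cost of checking every range per line; for a handful of ranges that trade-off is favorable.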

Related

Divide an output into multiple variables using shell script

So I have a C program that outputs many numbers. I have to check them all. The problem is, each time I run my program, I need to change seeds. In order to do that, I've been doing it manually and was trying to make a shell script to work around this.
I've tried using sed but couldn't manage to do it.
I'm trying to get the output like this:
a=$(./algorithm < input.txt)
b=$(./algorithm2 < input.txt)
c=$(./algorithm3 < input.txt)
The output of each algorithm program is something like this:
12 13 315
1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
So the variable a has all this output, and what I need is
variable a to contain this whole string
and variable a1 to contain only the third number, in this case, 315.
Another example:
2 3 712
1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21
echo $b should give this output:
2 3 712
1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21
and echo $b1 should give this output:
712
Thanks!
Not exactly what you are asking, but one way to do this would be to store the results of your algorithm in arrays, and then dereference the item of interest. You'd write something like:
a=( $(./algorithm < input.txt) )
b=( $(./algorithm2 < input.txt) )
c=( $(./algorithm3 < input.txt) )
Notice the extra () that encloses the statements. Now, a, b and c are arrays, and you can access the item of interest like ${a[0]} or ${a[1]}.
For your particular case, since you want the 3rd element, that would have index = 2, hence:
a1=${a[2]}
b1=${b[2]}
c1=${c[2]}
Since you are using the Bash shell (see your tags), you can use Bash arrays to easily access the individual fields in your output strings. For example like so:
#!/bin/bash
# Your lines to gather the output:
# a=$(./algorithm < input.txt)
# b=$(./algorithm2 < input.txt)
# c=$(./algorithm3 < input.txt)
# Just to use your example output strings:
a="$(printf "12 13 315 \n 1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5")"
b="$(printf "2 3 712 \n 1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21")"
# Put the output in arrays.
a_array=($a)
b_array=($b)
# You can access the array elements individually.
# The array index starts from 0.
# (The names a1 and b1 for the third elements were your choice.)
a1="${a_array[2]}"
b1="${b_array[2]}"
# Print output strings.
# (The newlines in $a and $b are gobbled by echo, since they are not quoted.)
echo "Output a:" $a
echo "Output b:" $b
# Print third elements.
echo "3rd from a: $a1"
echo "3rd from b: $b1"
This script outputs
Output a: 12 13 315 1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
Output b: 2 3 712 1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21
3rd from a: 315
3rd from b: 712
Explanation:
The trick here is that array constants (literals) in Bash have the form
(<space_separated_list_of_elements>)
for example
(1 2 3 4 a b c nearly_any_string 99)
Any variable that gets such an array assigned, automatically becomes an array variable. In the script above, this is what happens in a_array=($a): Bash expands the $a to the <space_separated_list_of_elements> and reads the whole expression again interpreting it as an array constant.
Individual elements in such arrays can be referenced like variables by using expressions of the form
<array_name>[<idx>]
like a variable name. Therein, <array_name> is the name of the array and <idx> is an integer that references the individual element. For arrays that are represented by array constants, the index counts elements continuously starting from zero. Therefore, in the script, ${a_array[2]} expands to the third element in the array a_array. If the array had fewer elements, a_array[2] would be considered unset.
You can output all elements in the array a_array, the corresponding index array, and the number of elements in the array respectively by
echo "${a_array[@]}"
echo "${!a_array[@]}"
echo "${#a_array[@]}"
These commands can be used to track down the fate of the newline: Given the script above, it is still in $a, as can be seen by (watch the quotes)
echo "$a"
which yields
12 13 315
1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
But the newline did not make it into the array a_array. This is because Bash considers it as part of the whitespace that separates the third and the fourth element in the array assignment. The same applies if there are no extra spaces around the newline, like here:
12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
I actually assume that the output of your C program comes in this form.
This will store the full string in a[0] and the individual fields in a[1-N]:
$ tmp=$(printf '12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5\n')
$ a=( $(printf '_ %s\n' "$tmp") )
$ a[0]="$tmp"
$ echo "${a[0]}"
12 13 315
1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
$ echo "${a[3]}"
315
Obviously replace $(printf '12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5\n') with $(./algorithm < input.txt) in your real code.
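One caveat with the unquoted expansions used in these answers (a_array=($a) and the like): besides word splitting, they also perform glob expansion, so a field consisting of * would be replaced by filenames. If that could matter for your data, read -ra splits on whitespace without globbing; a small sketch, where the printf stands in for your program's output:

```shell
#!/bin/bash
# Stand-in for the program output (two lines, third field is 315).
out=$(printf '12 13 315\n1 2 3 4 5')

# read -d '' consumes the whole string (it looks for a NUL delimiter
# that never comes, so it returns non-zero: hence the || true), and
# -ra splits it on $IFS (space/tab/newline) without glob expansion.
read -d '' -ra fields <<< "$out" || true

echo "${fields[2]}"   # → 315
```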

Wildcard symbol with grep -F

I have the following file
0 0
0 0.001
0 0.032
0 0.1241
0 0.2241
0 0.42
0.0142 0
0.0234 0
0.01429 0.01282
0.001 0.224
0.098 0.367
0.129 0
0.123 0.01282
0.149 0.16
0.1345 0.216
0.293 0
0.2439 0.01316
0.2549 0.1316
0.2354 0.5
0.3345 0
0.3456 0.0116
0.3462 0.316
0.3632 0.416
0.429 0
0.42439 0.016
0.4234 0.3
0.5 0
0.5 0.33
0.5 0.5
Notice that the two columns are sorted ascending, first by the first column and then by the second one. The minimum value is 0 and the maximum is 0.5.
I would like to count the number of lines that are:
0 0
and store that number in a file called "0_0". In this case, this file should contain "1".
Then, the same for those that are:
0 0.0*
For example,
0 0.032
And call it "0_0.0" (it should contain "2"), and this for all combinations only considering the first decimal digit (0 0.1*, 0 0.2* ... 0.0* 0, 0.0* 0.0* ... 0.5 0.5).
I am using this loop:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i" "$j"" file | wc -l > "$i"_"$j"
done
done
rm 0_0 #this 0_0 output is badly done, the good way is with the next command, which accepts \n
pcregrep -M "0 0\n" file | wc -l > 0_0
The problem is that for example, line
0.0142 0
will not be recognized by the iteration "0.0 0", since there are digits after the "0.0". Removing the -F option in grep in order to consider all numbers that start by "0.0" will not work, since the point will be considered a wildcard symbol and therefore for example in the iteration "0.1 0" the line
0.0142 0
will be counted, because 0.0142 would match 0"anything"1.
I hope I am making myself clear!
Is there any way to include a wildcard symbol with grep -F, like in:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i"* "$j"*" file | wc -l > "$i"_"$j"
done
done
(Please notice the asterisks after the variables in the grep command).
Thank you!
Don't use shell loops just to manipulate text; that's what the guys who invented shell also invented awk to do. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
It sounds like all you need is:
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{ for (pair in cnt) {print cnt[pair] > pair; close(pair)} }' file
That will be vastly more efficient than your nested shell loops approach.
Here's what it'll be outputting to the files it creates:
$ awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{for (pair in cnt) print pair "\t" cnt[pair]}' file
0.0_0.3 1
0_0.4 1
0.5_0 1
0.2_0.5 1
0.4_0.3 1
0.0_0 2
0.1_0.0 1
0.3_0 1
0.1_0.1 1
0.1_0.2 1
0.3_0.0 1
0_0 1
0.1_0 1
0.5_0.3 1
0.4_0 1
0.3_0.3 1
0.2_0.0 1
0_0.0 2
0.5_0.5 1
0.3_0.4 1
0.2_0.1 1
0.0_0.0 1
0_0.1 1
0_0.2 1
0.4_0.0 1
0.2_0 1
0.0_0.2 1
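To convince yourself what lands in each file, the same command can be run on a reduced sample; a self-contained sketch (the file name sample is arbitrary):

```shell
#!/bin/bash
# Four sample lines: two fall in the 0_0.0 bucket, one each in 0_0 and 0.0_0.
printf '%s\n' '0 0' '0 0.001' '0 0.032' '0.0142 0' > sample

# Truncate each column to its first three characters, count the pairs,
# and write each count to a file named after the pair.
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++}
     END{for (p in cnt) {print cnt[p] > p; close(p)}}' sample

cat 0_0.0   # both "0 0.001" and "0 0.032" truncate to 0_0.0, so this prints 2
```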

Matching numbers in two different files using awk

I have two files (f1 and f2), both made of three columns, of different lengths. I would like to create a new file of four columns in the following way:
f1 f2
1 2 0.2 1 4 0.3
1 3 0.5 1 5 0.2
1 4 0.2 2 3 0.6
2 2 0.5
2 3 0.9
If the numbers in the first two columns are present in both files, then we print the first two numbers and the third number from each file (e.g. both contain 1 4, so f3 should contain 1 4 0.2 0.3); otherwise, if the first two numbers are missing from f2, just print a zero in the fourth column.
The complete results of these example should be
f3
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
The script that I wrote is the following:
awk '{str1=$1; str2=$2; str3=$3;
getline < "f2";
if($1==str1 && $2==str2)
print str1,str2,str3,$3 > "f3";
else
print str1,str2,str3,0 > "f3";
}' f1
but it just checks whether the same two numbers are in the same row (it does not go through the whole f2 file), giving as results
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0
2 2 0.5 0
2 3 0.9 0
This awk should work:
awk 'FNR==NR{a[$1,$2]=$3;next} {print $0, (a[$1,$2])? a[$1,$2]:0}' f2 f1
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
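To see why this works: FNR==NR is true only while awk reads its first argument (f2), so that block just fills the lookup array and skips to the next line; for f1, the second block prints the line plus the stored value, or 0 when the pair is absent. A self-contained run with the sample data (this sketch uses an in test instead of the truthiness test, which also copes with a stored value of 0):

```shell
#!/bin/bash
# Recreate the sample files from the question.
printf '%s\n' '1 2 0.2' '1 3 0.5' '1 4 0.2' '2 2 0.5' '2 3 0.9' > f1
printf '%s\n' '1 4 0.3' '1 5 0.2' '2 3 0.6' > f2

# First pass (FNR==NR): index f2 by its first two columns.
# Second pass: append the stored value, or 0 when the pair is absent.
awk 'FNR==NR{a[$1,$2]=$3; next}
     {print $0, (($1,$2) in a) ? a[$1,$2] : 0}' f2 f1 > f3
cat f3
```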

reading values from a table and editing a text file with those values using bash

Hi, I am a beginner learning bash scripting.
I have two files: a data table, data.dat, and a text file, input.in.
data.dat looks like this
a b c d
1 2 3 4
5 6 7 8
1 3 5 7
2 4 6 8
and input.in looks like
rc duct.gz
fi as df 500
def bc pff p 1 n 2 n 3 n 4 n n n
def bc po p 1 n 2 y n n n
Now I want to replace the values in the text file, such as 1 2 3 4, with 5 6 7 8 from the table, and save the text file under another name, input2.in.
The next time, 1 2 3 4 should be replaced with 1 3 5 7 and saved as input3.in,
and so on until the table is exhausted.
Hint:
a=4
line="this is line with parameter a=$a"
eval echo ${line}
EDIT: extended answer.
You need two nested loops. In the external one, you read values from data.dat into the a, b, c and d variables. The internal one reads the file input.in, and for each line read you display it with eval echo. Just make sure you put $a, $b, $c and $d in the proper places, like:
def bc pff p $a n $b n $c n $d n n n
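Putting the hint together, here is one possible sketch of the two nested loops. It assumes input.in has been edited so the values to substitute are written as the placeholders $a, $b, $c and $d (that template is an assumption, not the original file), and that the first data row 1 2 3 4 is the original, so the generated files start at input2.in:

```shell
#!/bin/bash
# Sample table and template. The placeholder template is hypothetical:
# the real input.in would be edited to use $a..$d where 1 2 3 4 appear.
printf '%s\n' 'a b c d' '1 2 3 4' '5 6 7 8' '1 3 5 7' '2 4 6 8' > data.dat
printf '%s\n' 'rc duct.gz' 'fi as df 500' \
    'def bc pff p $a n $b n $c n $d n n n' \
    'def bc po p $a n $b y n n n' > input.in

n=2
# Outer loop: one table row per generated file
# (skip the header and the original row 1 2 3 4).
tail -n +3 data.dat | while read -r a b c d; do
    # Inner loop: expand the placeholders in each template line.
    while IFS= read -r line; do
        eval echo "\"$line\""
    done < input.in > "input${n}.in"
    n=$((n+1))
done
```

This produces input2.in (with 5 6 7 8), input3.in (with 1 3 5 7) and input4.in (with 2 4 6 8). Note that eval executes whatever the template contains, so only use it on template files you trust.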

how to display a specific value a(col#,row#) on a gnuplot chart?

I realize gnuplot 4.6 does not have a way to address a specific data point, so I would have to use a script to extract a given value and store it as a variable (for example, to extract the value in the 7th column of the 4th row from the end, I could simply use 'tail -4 data.out | head -1 | awk '{print $7}''). How could I assign that value to a gnuplot variable and then display it on a chart with the set label 1 sprintf("a = %3.4f",a) at x,y command?
Gnuplot understands backticks the same as your shell. So, to get at the particular value in your datafile:
a=`tail -4 data.dat | head -1 | awk '{print $7}'`
set label 1 sprintf("a=%3.4f",a) at x,y
Reading something like "can't be done with gnuplot" hurts a little and encourages me to find a gnuplot-only solution, which consequently is platform-independent. (See the above comments about Linux and Windows (in)compatibility issues.)
Although it sometimes gets cumbersome and less efficient, often it is not much longer than the solution with external tools.
Basically, you can use stats and every for this (check help stats and help every), however, if you need the mth row from the last, you first need to know how many lines you have in total. That's why you have to run stats twice. Check the following example:
Data: SO11560130.dat
1 21 3 4 5 6 78
2 25 3 4 5 6 72
3 23 3 4 5 6 73
4 29 3 4 5 6 74
5 27 3 4 5 6 77
6 28 3 4 5 6 75
7 22 3 4 5 6 73
8 24 3 4 5 6 78
9 26 3 4 5 6 78
Script: (works with gnuplot 4.6.0, March 2012)
### extract specific value from given row/column
reset
FILE = "SO11560130.dat"
M = 4 # row from last
COL = 7 # column no.
stats FILE u 0 nooutput # get total number of lines in variable STATS_records
n = STATS_records - M # index 0-based
stats FILE u (x0=$1,y0=$2,a=column(COL)) every ::n::n nooutput # get the value and coordinates
set label 1 sprintf("a=%3.4f",a) at x0,y0 offset 0,1
plot FILE u 1:2 w lp pt 7 lc rgb "red" notitle
### end of script
Result: (the plot, with the label placed at the selected data point, is not included here)
