Cell-wise summation of tables in a linux shell script - bash

I have a set of tables in the following format:
1000 3 0 15 14
2000 3 0 7 13
3000 2 3 14 12
4000 3 1 11 14
5000 1 1 9 14
6000 3 1 13 11
7000 3 0 10 15
They are in simple text files.
I want to merge these files into a new table in the same format, where each cell (X,Y) is the sum of all cells (X,Y) from the original set of tables. One slightly complicating factor is that the numbers from the first column should not be summed, since these are labels.
I suspect this can be done with AWK, but I'm not particularly versed in this language and can't find a solution on the web. If someone suggests another tool, that's also fine.
I want to do this from a bash shell script.

Give this a try:
#!/usr/bin/awk -f
{
    # Accumulate each data cell, keyed by the label in column 1.
    for (i = 2; i <= NF; i++)
        a[$1, i] += $i
    b[$1] = $1
    if (NF > maxNF) maxNF = NF
}
END {
    # asort() is a gawk extension: sort the labels, then print each summed row.
    n = asort(b, c)
    for (i = 1; i <= n; i++) {
        printf "%s ", b[c[i]]
        for (j = 2; j <= maxNF; j++) {
            printf "%d ", a[c[i], j]
        }
        print ""
    }
}
Make the script executable (chmod +x sumcell.awk) and run it like this:
./sumcell.awk table1 table2 table3
or
./sumcell.awk table*
The output using your example input twice would look like this:
$ ./sumcell.awk table1 table1
1000 6 0 30 28
2000 6 0 14 26
3000 4 6 28 24
4000 6 2 22 28
5000 2 2 18 28
6000 6 2 26 22
7000 6 0 20 30
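Note that asort() is a gawk extension, so the script above needs gawk behind /usr/bin/awk. If that is not available, a minimal sketch of a portable variant (hypothetical name sumcell_portable.awk) that remembers labels in first-seen order and leaves the ordering to an external sort might look like this:
#!/usr/bin/awk -f
# Portable sketch: no asort(); keep labels in first-seen order and
# sort the final output numerically outside awk.
{
    if (!($1 in seen)) { seen[$1] = 1; labels[++n] = $1 }
    for (i = 2; i <= NF; i++) sum[$1, i] += $i
    if (NF > maxNF) maxNF = NF
}
END {
    for (k = 1; k <= n; k++) {
        line = labels[k]
        for (j = 2; j <= maxNF; j++) line = line " " sum[labels[k], j]
        print line
    }
}
Run it as ./sumcell_portable.awk table* | sort -n to get the rows back in label order.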

Sum each line, presuming at least one numeric column on each line.
while read -r line ; do
    label=($line)
    printf '%s ' "${label[0]}"
    expr $(
        printf '%s' "${label[1]}"
        for c in "${label[@]:2}" ; do
            printf ' + %s' "$c"
        done
    )
done < table
EDIT: Of course I didn't see the comment about combining based on the label, so this is incomplete.
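For completeness, a pure-bash sketch that does combine rows by label (assuming bash 4+ for associative arrays; the input files are whatever you pass as arguments) might look like this:
#!/bin/bash
# Sketch: sum cells by label across all files given as arguments (bash 4+).
declare -A sums            # "label,column" -> running total
declare -A ncols           # label -> number of data columns seen
labels=()                  # labels in first-seen order
for file in "$@"; do
    while read -r label rest; do
        [[ -n ${ncols[$label]+x} ]] || labels+=("$label")
        set -- $rest
        ncols[$label]=$#
        i=1
        for v in "$@"; do
            sums[$label,$i]=$(( ${sums[$label,$i]:-0} + v ))
            (( i++ ))
        done
    done < "$file"
done
for label in "${labels[@]}"; do
    row=$label
    for (( i=1; i<=${ncols[$label]}; i++ )); do
        row+=" ${sums[$label,$i]}"
    done
    echo "$row"
done
Labels are printed in first-seen order; pipe the output through sort -n if you need them sorted numerically.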

perl -anE'$h{$F[0]}[$_]+=$F[$_]for 1..4}{say$_,"@{$h{$_}}"for sort{$a<=>$b}keys%h' file_1 file_2
(Note that the hard-coded 1..4 assumes exactly four data columns after the label.)

Related

How do I create a bash script, using loops, to create a Multiplication Table with 5 column/10 row format

Here is what I have:
#!/bin/bash
#create a multiplication table 5 columns 10 rows
echo " Multiplication Table "
echo "-----+-------------------------"
for x in {0..5}
do
    for y in {0..10}
    do
        echo -n "$(( $x * $y )) "
    done
    echo
    echo "-----+--------------------------"
done
This is my Output:
Multiplication Table
-----+-------------------------
0 0 0 0 0 0 0 0 0 0 0
-----+--------------------------
0 1 2 3 4 5 6 7 8 9 10
-----+--------------------------
0 2 4 6 8 10 12 14 16 18 20
-----+--------------------------
0 3 6 9 12 15 18 21 24 27 30
-----+--------------------------
0 4 8 12 16 20 24 28 32 36 40
-----+--------------------------
0 5 10 15 20 25 30 35 40 45 50
-----+--------------------------
This is the Needed Output:
Multiplication Table
----+-------------------------------------
| 0 1 2 3 4
----+-------------------------------------
0 | 0 0 0 0 0
1 | 0 1 2 3 4
2 | 0 2 4 6 8
3 | 0 3 6 9 12
4 | 0 4 8 12 16
5 | 0 5 10 15 20
6 | 0 6 12 18 24
7 | 0 7 14 21 28
8 | 0 8 16 24 32
9 | 0 9 18 27 36
----+-------------------------------------
I've tried to write this many different ways, but I'm struggling to find a way to format it correctly. The first attempt is pretty close, but I need it to have the sequential numbers being multiplied along the top and left side. I'm not sure how to use, or whether I can use, the seq command to achieve this, or if there is a better way. I also need straight columns and rows, with the defining lines setting the table layout, but looking up the column command hasn't produced the right output.
Here was my final output and code:
#!/bin/bash
#create a multiplication table 5 columns 10 rows
#Create top of the table
echo " Multiplication Table"
echo "----+------------------------------"
#Print the nums at top of table and format dashes
echo -n " |"; printf '\t%d' {0..5}; echo
echo "----+------------------------------"
#for loops to create table nums
for y in {0..9}
do
    #Print the side nums and |
    echo -n "$y |"
    #for loop to create x
    for x in {0..5}
    do
        #Multiply vars, tab for spacing
        echo -en "\t$((x*y))"
    done
    #Print
    echo
done
#Print bottom dashes for format
echo "----+------------------------------"
I changed a bit of Armali's code just to make it more appealing to the eye, and the echo was moved to the bottom (out of the loop) so it didn't print as many lines. But again, thank you Armali, as I would've spent a lot more time figuring out exactly how to write that printf code to get the format correct.
I'm not sure how to use, or if I can use, the seq command to achieve this …
seq offers no advantage here over bash's sequence expression combined with printf.
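To illustrate, both of these print the same tab-separated header row; the second just spawns an extra process:
printf '\t%d' {0..4}; echo         # bash sequence expression
printf '\t%d' $(seq 0 4); echo     # same output via seq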
This variant of your script produces (with the usual 8-column tabs) the needed output:
#!/bin/bash
#create a multiplication table 5 columns 10 rows
echo " Multiplication Table"
echo "----+-------------------------------------"
echo -n " |"; printf '\t%d' {0..4}; echo
echo "----+-------------------------------------"
for y in {0..9}
do echo -n "$y |"
for x in {0..4}
do echo -en "\t$((x*y))"
done
echo
echo "----+-------------------------------------"
done
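If you would rather not rely on the terminal's 8-column tab stops, a sketch of the same table using fixed printf field widths (the widths chosen here are arbitrary) could look like this:
#!/bin/bash
# Same multiplication table, aligned with printf field widths instead of tabs.
printf '     Multiplication Table\n'
printf -- '----+------------------------------\n'
printf '    |'; printf '%5d' {0..4}; printf '\n'
printf -- '----+------------------------------\n'
for y in {0..9}
do
    printf '%-4s|' "$y"
    for x in {0..4}
    do
        printf '%5d' $((x*y))
    done
    printf '\n'
done
printf -- '----+------------------------------\n'
The -- after printf keeps the leading dashes of the divider lines from being parsed as options.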

Shell/awk script to read a column of files and combining columns to make a TSV file

I have over 600 files, and I need to extract a single column from each of them and write those columns to an output file. My current code does this, but it writes the columns one after another in the output file. However, I need two things in my output file:
Instead of appending the columns one after another, each column taken from an input file should become a new column in the output file (preferably a TSV file).
The column name should be replaced by the file name.
My example code:
for f in *; do cat "$f" | tr "\t" "~" | cut -d"~" -f2; done >out.txt
Example input:
file01.txt
col1 col2 col3
1 2 3
4 5 6
7 8 9
10 11 12
file02.txt
col4 col5 col6
11 12 13
14 15 16
17 18 19
110 111 112
My current output:
col2
2
5
8
11
col5
12
15
18
111
Expected output:
file01.txt file02.txt
2 12
5 15
8 18
11 111
You can use awk like this:
awk -v OFS='\t' 'BEGIN {
    # Print the header row: each input file name, tab-separated.
    for (i=1; i<ARGC-1; i++)
        printf ARGV[i] OFS;
    print ARGV[i];
}
FNR==1 { next }    # skip each file's own header line
{
    # Append column 2 of the current file to the row for this line number.
    a[FNR]=(a[FNR]==""?"":a[FNR] OFS) $2
}
END {
    for(i=2; i<=FNR; i++)
        print a[i];
}' file*.txt
file01.txt file02.txt
2 12
5 15
8 18
11 111
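An alternative sketch with plain coreutils, assuming the inputs really are tab-separated and that the temporary file names used here (tmp.*) don't clash with anything in the directory:
for f in file*.txt; do
    { echo "$f"; tail -n +2 "$f" | cut -f2; } > "tmp.$f"   # filename as header, then column 2
done
paste tmp.file*.txt > out.tsv
rm tmp.file*.txt
paste then joins the per-file columns with tabs, giving the same layout as the awk version.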

Write the number of elements per line of a file and its repetitions with awk

I have a file of integers, all different, in which each line may have a different length, like this:
1 2 3 4 5
16 7 8
9 10 101 102 13 14
15 6 17
24 28 31 30 18
I would like to print the number of elements that a line has and the number of lines that have that same number of elements; the output for this example should be:
3 2
5 2
6 1
The first column is the number of elements per line, the second the number of lines that have that same number of elements. The first line in the file has 5 elements, and so does the 5th one, and so on.
Print the count for the number of fields:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file
5 2
6 1
3 2
Pipe to sort for ordered output (use sort -n for numeric order once field counts reach two digits):
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file | sort
3 2
5 2
6 1
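Roughly the same result can be had without the awk array, by printing NF for every line and letting sort and uniq do the counting (a small sketch):
awk '{print NF}' file | sort -n | uniq -c | awk '{print $2, $1}'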

Add columns to a field in shell script

How can I add 6 more single-space-separated columns to a file?
The input file looks like this:
-11.160574
...
-11.549076
-12.020907
...
-12.126601
...
-11.93235
...
-8.297653
Where ... represents 50 more lines of numbers.
The output I want is this:
-11.160574 1 1 1 1 1 14
...
-11.549076 51 51 1 1 1 14
-12.020907 1 1 2 2 1 14
...
-12.126601 51 51 2 2 1 14
...
-11.93235 1 1 51 51 1 14
...
-8.297653 51 51 51 51 1 14
The 2nd and 3rd columns count from 1 to 51 in a loop.
The 4th and 5th columns also count from 1 to 51, but one level up: they advance only when the inner counter wraps around.
The last two are constant columns of 1 and 14.
Use a loop to read the file line-by-line and maintain counters to keep track of the field numbers as shown below:
#!/bin/bash
field1=1
field2=1
while read line
do
    echo "$line $field1 $field1 $field2 $field2 1 14"
    (( field1++ ))
    if (( $field1 == 52 )); then
        field1=1
        (( field2++ ))
    fi
done < file
Here you go, an awk script:
{
    mod = 51
    a = (NR - 1) % mod + 1         # inner counter: 1..51, wraps every 51 lines
    b = int((NR - 1) / mod) + 1    # outer counter: advances each time the inner one wraps
    c = 1
    d = 14
    print $0, a, a, b, b, c, d
}
Run it with something like awk -f the-script.awk in-file.txt. Or make it executable and add #!/usr/bin/awk -f at the top, and you can run it directly without typing awk -f.
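To see how the modular arithmetic maps line numbers onto the two counters, a quick check with synthetic line numbers (no data file needed) is:
seq 104 | awk '{ mod = 51; a = (NR-1) % mod + 1; b = int((NR-1)/mod) + 1; print NR": "a, b }' | sed -n '1p;51p;52p;102p;103p'
This prints lines 1, 51, 52, 102 and 103, showing the inner counter wrapping back to 1 at line 52 while the outer counter steps from 1 to 2.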

Converting a series of matrix files into an index of coordinates in awk

I have a time series of files 0000.vx.dat, 0000.vy.dat, 0000.vz.dat; ...; 0077.vx.dat, 0077.vy.dat, 0077.vz.dat... Each file is a space-separated 2D matrix. I would like to take each triplet of files and combine them all into a coordinate-based data format, i.e.:
[timestep + 1] [i] [j] [vx(i,j)] [vy(i,j)] [vz(i,j)]
Each file number corresponds to a particular time step. Given the amount of data I have in this time series (~ 4 GB), bash wasn't cutting it, so it seemed to be time to head over to awk... specifically mawk. It was pretty stupid to try this in bash, but here is my ill-fated attempt:
for x in $(seq 1 78)
do
    tfx=${tf[$x]} # an array of padded zeros
    for y in $(seq 1 1568)
    do
        for z in $(seq 1 1344)
        do
            echo $x $y $z $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vx.dat) $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vy.dat) $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vz.dat) >> $file
        done
    done
done
edit: Thank you, ruakh, for pointing out that I had kept j in shell variable format with a $ in front! This is just a snippet of the original script, but I guess it would be considered the guts of it!
Suffice it to say this would have taken about six months because of all the memory overhead in bash associated with O(MxN) algorithms, subshells and pipes and whatnot. I was looking for more along the lines of a day at most. Each file is around 18 MB, so it should not be that much of a problem.
I would be happy with doing this one timestep at a time in awk, provided that I get one output file per timestep. I could just cat them all together without much issue afterwards, I think. It is important, though, that the time step number be the first item on the coordinate list. I could achieve this with an awk -v argument (see above) in a bash routine.
I do not know how to look up specific elements of matrices in three separate files and put them all together into one output. This is the main hurdle I would like to overcome. I was hoping mawk could provide a nice balance between effort and computational speed. If this seems to be too much for an awk script, I could go to something lower level, and would appreciate any of those answering letting me know whether I should just go to C instead.
Thank you in advance! I really like awk, but am afraid I am a novice.
The three files, 0000.vx.dat, 0000.vy.dat, and 0000.vz.dat would read as follows (except huge and of the correct dimensions):
0000.vx.dat:
1 2 3
4 5 6
7 8 9
0000.vy.dat:
10 11 12
13 14 15
16 17 18
0000.vz.dat:
19 20 21
22 23 24
25 26 27
I would like to be able to input:
awk -v t=1 -f stackoverflow.awk 0000.vx.dat 0000.vy.dat 0000.vz.dat
and get the following output:
1 1 1 1 10 19
1 1 2 2 11 20
1 1 3 3 12 21
1 2 1 4 13 22
1 2 2 5 14 23
1 2 3 6 15 24
1 3 1 7 16 25
1 3 2 8 17 26
1 3 3 9 18 27
edit: Thank you, shellter, for suggesting I put the desired input and output more clearly!
Personally, I use gawk to process most of my text files. However, since you have requested a mawk-compatible solution, here's one way to solve your problem. Run, in your present working directory:
for i in *.vx.dat; do nawk -f script.awk "$i" "${i%%.*}.vy.dat" "${i%%.*}.vz.dat"; done
Contents of script.awk:
FNR==1 {
    # At the start of each input file, turn the file name into its numeric
    # timestep: "0000.vx.dat" evaluates to 0, so FILENAME++ yields timestep + 1.
    FILENAME++
    c=0
}
{
    # For each cell: on the first file, seed the line with "timestep row column",
    # then append that file's value; later files just append their values.
    for (i=1;i<=NF;i++) {
        c++
        a[c] = (a[c] ? a[c] : FILENAME FS NR FS i) FS $i
    }
}
END {
    for (j=1;j<=c;j++) {
        print a[j] > sprintf("%04d.dat", FILENAME)
    }
}
When you run the above, the result should be a single file for each set of three files, containing your coordinates. These output files will have filenames of the form timestep + 1 followed by ".dat". I decided to pad these filenames with four 0's for your convenience, but you can change this to whatever format you like. Here are the results I get from the sample data you've posted. Contents of 0001.dat:
1 1 1 1 10 19
1 1 2 2 11 20
1 1 3 3 12 21
1 2 1 4 13 22
1 2 2 5 14 23
1 2 3 6 15 24
1 3 1 7 16 25
1 3 2 8 17 26
1 3 3 9 18 27
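Since the question mentions concatenating the per-timestep results afterwards, the zero-padded names make that a one-liner (all_timesteps.dat is just a hypothetical name for the combined file):
cat [0-9][0-9][0-9][0-9].dat > all_timesteps.dat
The glob matches only the four-digit output files (e.g. 0001.dat), not the *.vx.dat inputs, and the zero-padding keeps them in timestep order.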
